Unix & Linux
bash parallelism gnu-parallel
Updated Tue, 19 Jul 2022 17:47:45 GMT

BASH: parallel run


I have a bash script that takes as input three arrays of equal length: METHODS, INFILES and OUTFILES.

The script lets METHODS[i] solve problem INFILES[i] and save the result to OUTFILES[i], for every index i (0 <= i <= length-1).

Each element in METHODS is a string of the form:

$HOME/program/solver -a <method>

where solver is a program that can be called as follows:

$HOME/program/solver -a <method> -m <input file> -o <output file> --timeout <timeout in seconds>
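
To make the indirection in the script below concrete, here is a minimal sketch of what the sourced METHODS file might contain. This assumes each array element is the name of a variable holding such a command string, which is what the ${!alg} indirect expansion in the script resolves; all names here are hypothetical:

# Hypothetical contents of the METHODS file (sourced by the script):
# each variable holds one complete solver command line...
alg0="$HOME/program/solver -a method0"
alg1="$HOME/program/solver -a method1"
# ...and the array lists the variable NAMES, which the script turns
# back into command strings via indirect expansion: ${!alg}.
METHODS=(alg0 alg1)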

The script solves all the problems in parallel and sets the runtime limit for each instance to 1 hour (though some methods solve some problems much faster), as follows:

#!/bin/bash
source METHODS
source INFILES
source OUTFILES
start=$(date +%s)
## Solve in PARALLEL: one background subshell per problem instance
for index in "${!OUTFILES[@]}"; do
    (alg=${METHODS[$index]}
    infile=${INFILES[$index]}
    outfile=${OUTFILES[$index]}
    ${!alg} -m "$infile" -o "$outfile" --timeout 3600) &
done
wait    # block until every background job has exited
end=$(date +%s)
runtime=$((end-start))
echo "Total runtime = $runtime (s)"
echo "Total number of processes = ${#OUTFILES[@]}"

In the above I have length = 619. I submitted this script to a cluster with 70 available processors, so everything should finish in at most 9 hours (ceil(619/70) = 9 batches of at most 1 hour each). In reality it takes much longer, however. When using the top command to investigate, I found that only two or three processes were running (state = R) while all the others were sleeping (state = D).

What am I doing wrong, please?

Furthermore, I have learnt that GNU parallel would be much better for running parallel jobs. How can I use it for the above task?

Thank you very much for your help!

Update: My first try with GNU parallel:

The idea is to write all the commands to a file and then use GNU parallel to execute them:

#!/bin/bash
source METHODS
source INFILES
source OUTFILES
start=$(date +%s)
## Write one command per line to commands.txt
> commands.txt    # truncate any leftover file from a previous run
for index in "${!OUTFILES[@]}"; do
    alg=${METHODS[$index]}
    infile=${INFILES[$index]}
    outfile=${OUTFILES[$index]}
    echo "${!alg} -m $infile -o $outfile --timeout 3600" >> commands.txt
done
## Solve in PARALLEL: with no command given, parallel executes
## each line of commands.txt as a job
time parallel :::: commands.txt
end=$(date +%s)
runtime=$((end-start))
echo "Total runtime = $runtime (s)"
echo "Total number of processes = ${#OUTFILES[@]}"

What do you think?
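
For reference, a sketch of an equivalent one-pass variant that avoids the temporary file by piping the generated commands straight into GNU parallel. By default parallel runs one job per CPU core; -j sets the job count explicitly (the 70 below matches the processors mentioned above):

## Generate the commands and pipe them directly into GNU parallel
## instead of going through commands.txt.
for index in "${!OUTFILES[@]}"; do
    alg=${METHODS[$index]}
    echo "${!alg} -m ${INFILES[$index]} -o ${OUTFILES[$index]} --timeout 3600"
done | parallel -j 70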

Update 2: I'm using GNU parallel and having the same problem. Here's the output of top:

top - 02:05:25 up 178 days,  8:16,  2 users,  load average: 62.59, 59.90, 53.29
Tasks: 596 total,   7 running, 589 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.9%us,  0.9%sy,  0.0%ni, 63.3%id, 22.9%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  264139632k total, 260564864k used,  3574768k free,     4564k buffers
Swap: 268420092k total, 80593460k used, 187826632k free,    53392k cached
  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+   COMMAND
28542 khue     20   0 7012m 5.6g 1816 R  100  2.2  12:50.22 opengm_min_sum
28553 khue     20   0 11.6g  11g 1668 R  100  4.4  17:37.37 opengm_min_sum
28544 khue     20   0 13.6g 8.6g 2004 R  100  3.4  12:41.67 opengm_min_sum
28549 khue     20   0 13.6g 8.7g 2000 R  100  3.5   2:54.36 opengm_min_sum
28551 khue     20   0 11.6g  11g 1668 R  100  4.4  19:48.36 opengm_min_sum
28528 khue     20   0 6934m 4.9g 1732 R   29  1.9   1:01.13 opengm_min_sum
28563 khue     20   0 7722m 6.7g 1680 D    2  2.7   0:56.74 opengm_min_sum
28566 khue     20   0 8764m 7.9g 1680 D    2  3.1   1:00.13 opengm_min_sum
28530 khue     20   0 5686m 4.8g 1732 D    1  1.9   0:56.23 opengm_min_sum
28534 khue     20   0 5776m 4.6g 1744 D    1  1.8   0:53.46 opengm_min_sum
28539 khue     20   0 6742m 5.0g 1732 D    1  2.0   0:58.95 opengm_min_sum
28548 khue     20   0 5776m 4.7g 1744 D    1  1.9   0:55.67 opengm_min_sum
28559 khue     20   0 8258m 7.1g 1680 D    1  2.8   0:57.90 opengm_min_sum
28564 khue     20   0 10.6g  10g 1680 D    1  4.0   1:08.75 opengm_min_sum
28529 khue     20   0 5686m 4.4g 1732 D    1  1.7   1:05.55 opengm_min_sum
28531 khue     20   0 4338m 3.6g 1724 D    1  1.4   0:57.72 opengm_min_sum
28533 khue     20   0 6064m 5.2g 1744 D    1  2.1   1:05.19 opengm_min_sum

(opengm_min_sum is the solver above)

I guess that some processes consume so many resources that the others have nothing left and enter the D state?
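
A quick way to see what the D-state processes are blocked on is to list their kernel wait channels; wchan is a standard ps output column:

# List processes in uninterruptible sleep (state D) together with the
# kernel function they are currently waiting in.
ps -eo pid,state,wchan,comm | awk '$2 == "D"'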




Solution

Summary of the comments: the machine is fast but does not have enough memory to run everything in parallel (note the roughly 77 GB of swap in use in the top output above). In addition, the jobs need to read a lot of data and the disk bandwidth is not sufficient, so the CPUs sit idle most of the time waiting for data. That is exactly what state D means: uninterruptible sleep, almost always waiting on disk I/O.
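
A sketch of how to confirm that picture on the node with standard tools (iostat ships with the sysstat package): vmstat's si/so columns show swapping and its wa column shows I/O wait, while iostat's %util column shows how saturated each disk is.

# Memory pressure: non-zero si/so means the machine is swapping;
# a high wa column means CPUs are stalled waiting for I/O.
vmstat 1 5

# Disk saturation: %util near 100 means the disk is the bottleneck.
iostat -x 1 5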

Rearranging the tasks so that fewer memory-hungry jobs run at the same time helps.
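
With GNU parallel this can be approximated by capping the number of concurrent jobs below the core count and letting parallel hold new jobs back until enough RAM is free. Both -j and --memfree are standard GNU parallel options; the values below are only illustrative:

# Run at most 20 jobs at once, and only start a new job while at
# least 8 GB of RAM is free; if free memory later drops below half
# of that, parallel kills the youngest job and requeues it.
parallel -j 20 --memfree 8G :::: commands.txt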

Compressing the data to see whether it improves the effective disk I/O bandwidth has not yet been investigated.
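
If the solver can read its input sequentially from a pipe (an assumption; it is given a file path via -m, and process substitution hands it a non-seekable /dev/fd/* pipe), the inputs could be kept gzip-compressed on disk and decompressed on the fly, so only a fraction of the bytes come off the disk:

# One-time compression of an input file (-k, gzip 1.6+, keeps the original):
gzip -k "$infile"

# Decompress on the fly; works only if the solver reads -m sequentially.
${!alg} -m <(zcat "${infile}.gz") -o "$outfile" --timeout 3600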





Comments (2)

  • +0 – It appears that the main issue comes from the software, but your comments and suggestions were very helpful. Could you be more specific about "compressing the data"? (Any insight on how to do that?) Thanks. — Feb 09, 2017 at 21:57
  • +0 – @Khue It depends on how you use the data. If you essentially just read it as one long stream, then compressing it will probably help. If you can compress your file to a third of the original size (quite reasonable for a text file with gzip), then you only need to read a third of the number of bytes from disk. You do need to process the compressed stream, but as we have seen you are not lacking in CPU. Compressing the stream would in effect make your disks three times faster. There is no one-size-fits-all approach here. — Feb 10, 2017 at 00:17