I have a bash script that takes as input three arrays of equal length: METHODS, INFILES and OUTFILES. For every index i (0 <= i <= length-1), the script lets METHODS[i] solve the problem in INFILES[i] and save the result to OUTFILES[i]. Each element of METHODS is a string of the form:

$HOME/program/solver -a <method>

where solver is a program that can be called as follows:

$HOME/program/solver -a <method> -m <input file> -o <output file> --timeout <timeout in seconds>

The script solves all the problems in parallel and sets the runtime limit of each instance to 1 hour (although some methods solve some problems much faster):
#!/bin/bash
source METHODS
source INFILES
source OUTFILES

start=$(date +%s)

## Solve in PARALLEL
for index in ${!OUTFILES[*]}; do
    (
        alg=${METHODS[$index]}
        infile=${INFILES[$index]}
        outfile=${OUTFILES[$index]}
        $alg -m $infile -o $outfile --timeout 3600
    ) &
done
wait

end=$(date +%s)
runtime=$((end-start))
echo "Total runtime = $runtime (s)"
echo "Total number of processes = ${#OUTFILES[@]}"
In the above, length = 619. I submitted this script to a cluster node with 70 available processors, so it should take at most 9 hours to finish all the tasks. In reality, however, it takes much longer. When I used the top command to investigate, I found that only two or three processes were actually running (state R) while all the others were sleeping (state D, i.e. uninterruptible sleep, typically waiting for I/O).

What am I doing wrong?

Furthermore, I have learnt that GNU parallel is much better suited to running parallel jobs. How can I use it for the above task? Thank you very much for your help!
Update: my first try with GNU parallel. The idea is to write all the commands to a file and then let GNU parallel execute them:
#!/bin/bash
source METHODS
source INFILES
source OUTFILES

start=$(date +%s)

## Write the commands to a file, one per line
> commands.txt
for index in ${!OUTFILES[*]}; do
    alg=${METHODS[$index]}
    infile=${INFILES[$index]}
    outfile=${OUTFILES[$index]}
    echo "$alg -m $infile -o $outfile --timeout 3600" >> commands.txt
done

## Solve in PARALLEL
time parallel :::: commands.txt

end=$(date +%s)
runtime=$((end-start))
echo "Total runtime = $runtime (s)"
echo "Total number of processes = ${#OUTFILES[@]}"
What do you think?
Update 2: I am using GNU parallel and having the same problem. Here is the output of top:
top - 02:05:25 up 178 days, 8:16, 2 users, load average: 62.59, 59.90, 53.29
Tasks: 596 total, 7 running, 589 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.9%us, 0.9%sy, 0.0%ni, 63.3%id, 22.9%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 264139632k total, 260564864k used, 3574768k free, 4564k buffers
Swap: 268420092k total, 80593460k used, 187826632k free, 53392k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28542 khue 20 0 7012m 5.6g 1816 R 100 2.2 12:50.22 opengm_min_sum
28553 khue 20 0 11.6g 11g 1668 R 100 4.4 17:37.37 opengm_min_sum
28544 khue 20 0 13.6g 8.6g 2004 R 100 3.4 12:41.67 opengm_min_sum
28549 khue 20 0 13.6g 8.7g 2000 R 100 3.5 2:54.36 opengm_min_sum
28551 khue 20 0 11.6g 11g 1668 R 100 4.4 19:48.36 opengm_min_sum
28528 khue 20 0 6934m 4.9g 1732 R 29 1.9 1:01.13 opengm_min_sum
28563 khue 20 0 7722m 6.7g 1680 D 2 2.7 0:56.74 opengm_min_sum
28566 khue 20 0 8764m 7.9g 1680 D 2 3.1 1:00.13 opengm_min_sum
28530 khue 20 0 5686m 4.8g 1732 D 1 1.9 0:56.23 opengm_min_sum
28534 khue 20 0 5776m 4.6g 1744 D 1 1.8 0:53.46 opengm_min_sum
28539 khue 20 0 6742m 5.0g 1732 D 1 2.0 0:58.95 opengm_min_sum
28548 khue 20 0 5776m 4.7g 1744 D 1 1.9 0:55.67 opengm_min_sum
28559 khue 20 0 8258m 7.1g 1680 D 1 2.8 0:57.90 opengm_min_sum
28564 khue 20 0 10.6g 10g 1680 D 1 4.0 1:08.75 opengm_min_sum
28529 khue 20 0 5686m 4.4g 1732 D 1 1.7 1:05.55 opengm_min_sum
28531 khue 20 0 4338m 3.6g 1724 D 1 1.4 0:57.72 opengm_min_sum
28533 khue 20 0 6064m 5.2g 1744 D 1 2.1 1:05.19 opengm_min_sum
(opengm_min_sum is the solver above.)
I guess that some processes consume so many resources that the others have nothing left and fall into the D state?
Summary of the comments: the machine is fast but does not have enough memory to run all the tasks at once. In addition, the problems require reading a lot of data and the disk bandwidth is not sufficient, so the CPUs spend most of their time idle, waiting for data.
Rearranging the tasks helps.
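One mitigation along those lines is to cap the number of simultaneous jobs instead of starting all 619 at once. A sketch (the -j value is only a guess derived from the 5-12 GB per-process footprints seen in top versus 264 GB of RAM; the toy commands.txt stands in for the real one produced by the loop above):

```shell
#!/bin/bash
# Toy commands.txt so the sketch is self-contained; in the real run this
# is the file produced by the command-writing loop.
printf '%s\n' 'echo job1' 'echo job2' 'echo job3' 'echo job4' > commands.txt

# -j caps how many jobs run at once. With ~5-12 GB per solver process and
# 264 GB of RAM, something like -j 20 should keep the node out of swap;
# -j 2 here only demonstrates the mechanism on toy jobs. Recent GNU
# parallel also accepts e.g. --memfree 12G to hold back new jobs until at
# least that much memory is free.
parallel -j 2 < commands.txt
```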
Compressing the data to see whether it can improve the effective disk I/O bandwidth has not been investigated yet.
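If the input files compress well, one experiment (assuming the solver can read its input through a pipe or process substitution, which is not guaranteed) is to keep them gzipped on disk and decompress on the fly, trading spare CPU for disk bandwidth:

```shell
#!/bin/bash
# Create a hypothetical, highly compressible input file and gzip it.
head -c 1000000 /dev/zero > sample.uai
gzip -f sample.uai                     # leaves sample.uai.gz on disk

# Decompress on the fly; `wc -c` stands in for the real solver, which
# would read the stream via process substitution, e.g.
#   solver -m <(gzip -dc sample.uai.gz) -o out --timeout 3600
gzip -dc sample.uai.gz | wc -c
```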