Unix & Linux
shell parallelism
Updated Sat, 23 Jul 2022 22:18:27 GMT

Executing piped commands in parallel


Consider the following scenario. I have two programs, A and B. Program A writes lines of text to stdout, while program B processes lines from stdin. The way to use these two programs is of course:

foo@bar:~$ A | B

Now I've noticed that this eats up only one core; hence I am wondering:

Are programs A and B sharing the same computational resources? If so, is there a way to run A and B concurrently?

Another thing that I've noticed is that A runs much faster than B, so I am wondering if I could somehow run more instances of B and let them process the lines that A outputs in parallel.

That is, A would output its lines, and there would be N instances of program B that would read these lines (whoever reads them first), process them, and output them on stdout.

So my final question is:

Is there a way to distribute the output of A among several B processes without having to take care of race conditions and other inconsistencies that could potentially arise?




Solution

A problem with split --filter is that the output can be mixed up, so you get half a line from process 1 followed by half a line from process 2.

GNU Parallel guarantees there will be no mixup.

So assume you want to do:

 A | B | C

But B is terribly slow, so you want to parallelize it. Then you can do:

A | parallel --pipe B | C
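
As a concrete illustration, here is a minimal sketch with stand-in commands (all of them assumptions, not taken from the question): seq plays A, tr plays B, and wc -l plays C. The helper par_b is hypothetical; its check for GNU Parallel and the serial fallback are only there so the snippet also runs on a machine where GNU Parallel is not installed.

```shell
#!/bin/sh
# Stand-ins (assumptions, not from the question):
#   A = seq 1 1000, B = tr '0-9' 'a-j', C = wc -l

# Hypothetical helper: run "$@" as B. Uses GNU Parallel when it is
# available, otherwise falls back to a single serial instance of B.
par_b() {
    if parallel --version </dev/null 2>/dev/null | grep -q 'GNU parallel'; then
        # --pipe cuts stdin into blocks and hands each block to its own B
        parallel --pipe "$@"
    else
        "$@"
    fi
}

# A | B | C
lines=$(seq 1 1000 | par_b tr '0-9' 'a-j' | wc -l | tr -d ' ')
echo "$lines"
```

With real programs, the pipeline is simply the one shown above; the wrapper adds nothing beyond the fallback.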

By default, GNU Parallel splits the input at \n into blocks of about 1 MB. Both can be adjusted, with --recend and --block respectively.
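
To see the effect of --block, a small sketch under stated assumptions: seq stands in for A, wc -l for B, and an awk one-liner that sums the per-block counts for C. The helper name b_per_block and the 10k block size are chosen here for illustration only. Each instance of B reports how many lines its block contained, and the sum should equal the number of lines A produced.

```shell
#!/bin/sh
# Stand-ins (assumptions): A = seq 1 20000, B = wc -l, C = awk summing counts

# Hypothetical helper: run B once per block via GNU Parallel if it is
# installed, otherwise run one serial B over the whole stream.
b_per_block() {
    if parallel --version </dev/null 2>/dev/null | grep -q 'GNU parallel'; then
        # --block 10k: aim for roughly 10 kB per block instead of the default
        parallel --pipe --block 10k wc -l
    else
        wc -l
    fi
}

# Every block yields one count; adding them up gives the total line count.
total=$(seq 1 20000 | b_per_block | awk '{ s += $1 } END { print s }')
echo "$total"
```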

You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/

You can install GNU Parallel in just 10 seconds with:

$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
   fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 883c667e01eed62f975ad28b6d50e22a
12345678 883c667e 01eed62f 975ad28b 6d50e22a
$ md5sum install.sh | grep cc21b4c943fd03e93ae1ae49e28573c0
cc21b4c9 43fd03e9 3ae1ae49 e28573c0
$ sha512sum install.sh | grep da012ec113b49a54e705f86d51e784ebced224fdf
79945d9d 250b42a4 2067bb00 99da012e c113b49a 54e705f8 6d51e784 ebced224
fdff3f52 ca588d64 e75f6033 61bd543f d631f592 2f87ceb2 ab034149 6df84a35
$ bash install.sh

Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1





Comments (5)

  • +1 – While I strongly disagree on the installation method :-), +1 because your solution solves most of the problems with mine. — Jun 15, 2013 at 12:46  
  • +0 – This one is nice indeed. Do you also have any suggestions for the parameters to be used? I know program A will output more than 1 TB of data, at approximately 5 GB per minute. Program B processes data 5 times slower than A outputs it, and I have 5 cores at my disposal for this task. — Jun 15, 2013 at 12:50  
  • +0 – GNU Parallel can currently handle at most around 100 MB/s, so you are going to hit that limit. The optimal --block-size will depend on the amount of RAM and how fast you can start a new B. In your situation I would use --block 100M and see how that performs. — Jun 15, 2013 at 13:11  
  • +0 – @lserni Can you come up with an installation method that is better, which works on most UNIX machines and requires similar amount of work from the user? — Jun 15, 2013 at 13:15  
  • +4 – Sorry, I did not make myself clear. The installation method - the script passed to sh - is great. The problem lies in passing it to sh: downloading and running executable code from a site. Mind you, maybe I'm just being too paranoid, since one could object that a custom-made RPM or DEB is basically the same thing, and even posting the code on a page to be copied and pasted would result in people doing so blindly anyway. — Jun 15, 2013 at 13:43  
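
The tuning suggested in the comments above (--block 100M) could be tried along these lines. This is only a sketch: the -j 5 option (at most five jobs, one per available core) is an assumption added here and was not part of the original advice, tr is a stand-in for B, and the helper name slow_b is hypothetical. The fallback branch exists only so the snippet runs without GNU Parallel.

```shell
#!/bin/sh
# Hypothetical wrapper: the suggested --block 100M plus an assumed -j 5.
slow_b() {
    if parallel --version </dev/null 2>/dev/null | grep -q 'GNU parallel'; then
        # --block 100M: larger blocks, so fewer B start-ups
        # -j 5: run at most 5 instances of B at once (assumption: one per core)
        parallel --pipe --block 100M -j 5 tr 'a-z' 'A-Z'
    else
        tr 'a-z' 'A-Z'
    fi
}

out=$(printf 'alpha\nbeta\n' | slow_b)
echo "$out"
```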

