Now, let's imagine we have one large file (e.g. 30 GB) that needs to be converted line by line, and say we have a script, convert.sh, that performs this <task>. We can pipe the contents of the file to stdin, and parallel will split it into chunks and hand each chunk to <task>:

<stdin> | parallel --pipe --block <block size> -k <task> > output.txt

where <stdin> can come from anything, such as cat <file>.
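For instance (purely as a sketch, since convert.sh is not shown here), convert.sh only needs to read lines from stdin and write the converted lines to stdout:

#!/usr/bin/env bash
# Hypothetical convert.sh: read each line from stdin, transform it,
# and write the result to stdout. The transformation here (uppercasing)
# is just a stand-in for the real per-line conversion.
while IFS= read -r line; do
    printf '%s\n' "${line^^}"
done

It could then be run over the whole file with something like

cat bigfile.txt | parallel --pipe --block 10M -k ./convert.sh > output.txt

where bigfile.txt stands in for the 30 GB file.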
As a reproducible example, our task will be nl -n rz. Take any file (mine is data.bz2) and pass it to <stdin>:

bzcat data.bz2 | nl | parallel --pipe --block 10M -k nl -n rz | gzip > output.gz
The example above takes <stdin> from bzcat data.bz2 | nl; I included nl just as a proof of concept that the final output, output.gz, is saved in the order it was received. parallel then divides <stdin> into chunks of 10 MB and passes each chunk through nl -n rz, which prepends line numbers that are right-justified and zero-padded (see nl --help for further details). The option --pipe tells parallel to split <stdin> into multiple jobs, --block specifies the size of each block, and -k specifies that the output ordering must be maintained.
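In other words, each job receives one 10 MB block on its stdin and simply runs nl -n rz on it, so conceptually a single job does no more than this (a sketch on a hypothetical file chunk.txt holding one block):

nl -n rz chunk.txt

With -k, parallel buffers and emits the jobs' outputs in the order the blocks were read, rather than in the order the jobs happen to finish.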
Your final output should look something like
000001 1 <data>
000002 2 <data>
000003 3 <data>
000004 4 <data>
000005 5 <data>
...
000587 552409 <data>
000588 552410 <data>
000589 552411 <data>
000590 552412 <data>
000591 552413 <data>
My original file had 552,413 lines. The first column is the numbering produced by nl -n rz within each parallel job (it restarts at 000001 for every 10 MB chunk), and the second column is the original line numbering that was passed to parallel in chunks. Notice that the order in the second column (and in the rest of the file) is maintained.
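To convince yourself that nothing was lost or reordered (assuming the output.gz produced above, and that the input has no blank lines, which nl skips numbering by default), you can compare line counts and check that the second column is still sorted:

bzcat data.bz2 | wc -l                                   # lines in the original file
zcat output.gz | wc -l                                   # should print the same number
zcat output.gz | awk '{print $2}' | sort -nc && echo ok  # prints "ok" only if the order is intact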