Using GNU Parallel With Split

Topo · Feb 28, 2013 · Viewed 7.2k times

I'm loading a pretty gigantic file into a PostgreSQL database. To do this I first use split on the file to get smaller files (30 GB each), and then I load each smaller file into the database using GNU Parallel and psql's copy.

The problem is that it takes about 7 hours to split the file, and only then does it start loading one file per core. What I need is a way to tell split to print each file name to standard output as soon as it finishes writing that file, so I can pipe the names to Parallel and it can start loading each file as soon as split is done writing it. Something like this:

split -l 50000000 2011.psv carga/2011_ | parallel ./carga_postgres.sh {}
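
(carga_postgres.sh is not shown in the question; a minimal sketch, assuming it takes one file name and loads it into a hypothetical pipe-delimited table carga in a hypothetical database carga_db, might look like this:)

#!/bin/bash
# Hypothetical loader: parallel calls it as ./carga_postgres.sh <file>.
# Table (carga) and database (carga_db) names are placeholders.
psql -d carga_db -c "\copy carga FROM '$1' WITH (DELIMITER '|')"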

I have read the split man pages and I can't find anything. Is there a way to do this with split or any other tool?

Answer

Thor · Feb 28, 2013

You could let parallel do the splitting:

<2011.psv parallel --pipe -N 50000000 ./carga_postgres.sh

Note that the man page recommends using --block over -N; this still splits the input at record separators (\n by default), e.g.:

<2011.psv parallel --pipe --block 250M ./carga_postgres.sh
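
Note that with --pipe the script no longer receives a file name; each job gets its block of records on standard input, so a carga_postgres.sh for this variant would copy from stdin instead. A minimal sketch, using the same hypothetical table and database names as above:

#!/bin/bash
# Hypothetical stdin-based loader for the --pipe variant: GNU Parallel
# feeds each block of records to this script's standard input.
psql -d carga_db -c "\copy carga FROM STDIN WITH (DELIMITER '|')"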

Testing --pipe and -N

Here's a test that splits a sequence of 100 numbers into 5 files:

seq 100 | parallel --pipe -N23 'cat > /tmp/parallel_test_{#}'

Check the result:

wc -l /tmp/parallel_test_[1-5]

Output:

 23 /tmp/parallel_test_1
 23 /tmp/parallel_test_2
 23 /tmp/parallel_test_3
 23 /tmp/parallel_test_4
  8 /tmp/parallel_test_5
100 total