How can I cat multiple files together into one without intermediary file?

Wing · Nov 1, 2010 · Viewed 71.1k times

Here is the problem I'm facing:

  • I am doing string processing on a text file that is ~100 GB in size.
  • I'm trying to improve the runtime by splitting the file into many hundreds of smaller files and processing them in parallel.
  • In the end, I cat the resulting files back together in order (a rough sketch of the whole workflow is below).

The file read/write time itself takes hours, so I would like to find a way to improve the following:

cat file1 file2 file3 ... fileN >> newBigFile

  1. This requires double the disk space: file1 ... fileN take up 100 GB, then newBigFile takes another 100 GB, and only then do file1 ... fileN get removed.

  2. The data is already in file1 ... fileN; doing the cat >> incurs extra read and write time, when all I really need is for the hundreds of files to reappear as one file...
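For reference, the overall workflow looks roughly like this (the chunk size and the command my_filter are just placeholders for my actual processing step):

$ split -l 1000000 bigFile chunk.                        # cut the big file into pieces
$ for f in chunk.*; do my_filter "$f" > "$f.out" & done; wait   # process pieces in parallel
$ cat chunk.*.out >> newBigFile                          # stitch results back together in order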

Answer

Jay Hacker · Jun 27, 2011

If you don't need random access into the final big file (i.e., you just read it through once from start to finish), you can make your hundreds of intermediate files appear as one. Where you would normally do

$ consume big-file.txt

instead do

$ consume <(cat file1 file2 ... fileN)

This uses process substitution, a feature of bash and other modern shells, sometimes also called "anonymous named pipes."
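You can see what the shell actually hands to the program by echoing the substitution; bash replaces <(...) with a file-descriptor path that the consumer reads like an ordinary file (the exact path varies by system):

$ echo <(cat file1 file2 ... fileN)
/dev/fd/63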

You may also be able to save time and space by splitting your input and doing the processing at the same time; GNU Parallel has a --pipe switch that does precisely this. It can also reassemble the outputs into one big file, potentially using less scratch space, since it only needs to keep number-of-cores pieces on disk at once. If you are literally running hundreds of processes at the same time, Parallel will greatly improve your efficiency by letting you tune the amount of parallelism to your machine. I highly recommend it.
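As a rough sketch of that approach (again, my_filter stands in for your actual processing command, and the block size is only an example): --pipe chops standard input into blocks, hands each block to a separate job, and -k keeps the outputs in the original order:

$ cat big-file.txt | parallel --pipe --block 100M -k my_filter > processed.txt

You can also add -j N to cap the number of simultaneous jobs so it matches the number of cores on your machine.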