Here is the problem I'm facing:
The file read/write time itself takes hours, so I would like to find a way to improve the following:
cat file1 file2 file3 ... fileN >> newBigFile
This requires double the disk space: file1 ... fileN takes up 100 GB, newBigFile then takes another 100 GB, and only after that does file1 ... fileN get removed.
The data is already in file1 ... fileN; doing the cat >> incurs read and write time when all I really need is for the hundreds of files to reappear as one file...
If you don't need random access into the final big file (i.e., you just read it through once from start to finish), you can make your hundreds of intermediate files appear as one. Where you would normally do
$ consume big-file.txt
instead do
$ consume <(cat file1 file2 ... fileN)
This uses Unix process substitution, sometimes also called "anonymous named pipes": the shell hands consume a pipe that cat writes into, so the combined data is streamed rather than written out as a second 100 GB copy.
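The name fits: on Linux, bash typically passes consume a path such as /dev/fd/63 that reads from a pipe fed by cat. If your shell lacks process substitution, a rough equivalent is an explicit named pipe; this is only a sketch, and /tmp/bigfile.fifo plus the way consume is invoked are placeholders for your setup:

$ mkfifo /tmp/bigfile.fifo                      # create a FIFO node; it holds no data itself
$ cat file1 file2 ... fileN > /tmp/bigfile.fifo &   # writer feeds the pipe in the background
$ consume /tmp/bigfile.fifo                     # reader streams the concatenated data
$ rm /tmp/bigfile.fifo                          # remove the FIFO node when done

Either way the bytes flow straight from the small files into consume, so no second copy ever lands on disk.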
You may also be able to save time and space by splitting your input and processing it at the same time: GNU Parallel has a --pipe switch that does precisely this. It can also reassemble the outputs into one big file, potentially using less scratch space, since it only needs to keep as many pieces on disk as you have cores at any one time. If you are literally running your hundreds of processes at the same time, Parallel will greatly improve your efficiency by letting you tune the amount of parallelism to your machine. I highly recommend it.
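As a sketch of what that can look like, assuming consume reads its input on stdin and writes its result to stdout (the block size and output name are placeholders to tune for your data):

$ cat file1 file2 ... fileN \
      | parallel --pipe --block 100M -k consume \
      > big-result.txt
# --pipe       split stdin into blocks and feed each block to a consume job on its stdin
# --block 100M approximate size of each block; tune to your data
# -k           keep the outputs in input order when writing the combined result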