Sorting large text data

fodon · Aug 16, 2011 · Viewed 11.3k times

I have a large file (100 million lines of tab-separated values, about 1.5 GB in size). What is the fastest known way to sort this based on one of the fields?

I have tried Hive. I would like to see if this can be done faster using Python.
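A straightforward in-memory approach in Python would look roughly like this (a sketch only: it assumes the whole file fits comfortably in RAM, and the field index and file names are placeholders):

    # Sort a tab-separated file in memory by one field (0-based index 3 = 4th column).
    FIELD = 3
    with open("input.txt", encoding="utf-8") as f:
        lines = f.readlines()
    # list.sort() is a stable Timsort; the key function splits each line once to extract the sort field
    lines.sort(key=lambda line: line.split("\t")[FIELD])
    with open("sorted.txt", "w", encoding="utf-8") as f:
        f.writelines(lines)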

Answer

urschrei · Aug 16, 2011

Have you considered using the *nix sort program? In raw speed, it'll probably be faster than most Python scripts.

Use -t $'\t' to specify the tab separator, -k n to specify the field to sort on (where n is the field number), and -o outputfile to write the result to a new file. Example:

sort -t $'\t' -k 4 -o sorted.txt input.txt

This will sort input.txt on its 4th field and write the result to sorted.txt.
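If you'd still rather drive this from Python, one option is simply to call the same sort command via the standard library (file names and field number below are placeholders matching the example above):

    import subprocess

    # Invoke the system sort; the "\t" argument is passed to -t as a literal tab character.
    subprocess.run(
        ["sort", "-t", "\t", "-k", "4", "-o", "sorted.txt", "input.txt"],
        check=True,  # raise CalledProcessError if sort exits non-zero
    )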