I have a large file (100 million lines of tab-separated values, about 1.5 GB in size). What is the fastest known way to sort it on one of the fields?
I have tried Hive. I would like to see whether this can be done faster using Python.
Have you considered using the *nix sort program? In raw terms, it will probably be faster than most Python scripts.
Use -t $'\t' to specify that the file is tab-separated, -k n,n to sort on field n alone (a bare -k n uses everything from field n to the end of the line as the key), and -o outputfile if you want to write the result to a new file.
Example:
sort -t $'\t' -k 4,4 -o sorted.txt input.txt
This sorts input.txt on its 4th field alone and writes the result to sorted.txt.
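Since the question asks about Python, here is a minimal sketch of driving the same sort program from a Python script rather than sorting in Python itself. The function name sort_tsv_by_field and its numeric parameter are hypothetical, introduced only for illustration. Setting LC_ALL=C is an optional choice on my part: it asks sort for plain byte-order comparison, which is usually faster but can order non-ASCII text differently from your locale.

```python
import os
import subprocess


def sort_tsv_by_field(input_path, output_path, field, numeric=False):
    """Sort a tab-separated file on one 1-based field using the system sort."""
    key = f"{field},{field}"  # restrict the key to this field alone
    if numeric:
        key += "n"  # compare the field as a number instead of as text
    # LC_ALL=C requests byte-order comparison (optional; drop it to keep
    # your locale's collation rules).
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(
        ["sort", "-t", "\t", "-k", key, "-o", output_path, input_path],
        check=True,  # raise CalledProcessError if sort exits with an error
        env=env,
    )
```

Calling sort_tsv_by_field("input.txt", "sorted.txt", 4) sorts input.txt on field 4 as text; pass numeric=True if the field holds numbers, since a text comparison would place 100 before 9.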