I have a huge text file (larger than the available RAM). I need to count the frequency of every word and write each word and its count to a new file. The result should be sorted in descending order of frequency.
My Approach:
I want to know if there are better approaches. I have heard of disk-based hash tables and B+ trees, but I have never tried them.
Note: I have seen similar questions asked on SO, but none of them address the case where the data is larger than memory.
Edit: Based on the comments, I agree that a dictionary should, in practice, fit in the memory of today's computers. But let's take a hypothetical dictionary of words that is too large to fit in memory.
I would go with a map-reduce approach:
hash tables
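As a rough illustration, here is a minimal single-machine sketch of a map-reduce style word count in Python. It is not a definitive implementation: the function name `count_words`, the `num_partitions` parameter, the `\w+` tokenization with lowercasing, and the tab-separated output format are all assumptions made for the example. It also assumes that each hash partition, unlike the whole input, has few enough distinct words to be counted in memory; if that does not hold, raise `num_partitions`.

```python
import heapq
import os
import re
import tempfile
import zlib
from collections import Counter

WORD_RE = re.compile(r"\w+")  # assumed tokenization rule


def count_words(src_path, dst_path, num_partitions=64):
    """Write 'word<TAB>count' lines to dst_path, most frequent first."""
    with tempfile.TemporaryDirectory() as tmp:
        part_paths = [os.path.join(tmp, f"part{i}.txt") for i in range(num_partitions)]
        run_paths = [os.path.join(tmp, f"run{i}.txt") for i in range(num_partitions)]

        # Map: stream the input and route every word to a partition file
        # chosen by its hash, so all copies of a word land in the same file.
        parts = [open(p, "w", encoding="utf-8") for p in part_paths]
        try:
            with open(src_path, encoding="utf-8") as src:
                for line in src:
                    for word in WORD_RE.findall(line.lower()):
                        idx = zlib.crc32(word.encode("utf-8")) % num_partitions
                        parts[idx].write(word + "\n")
        finally:
            for f in parts:
                f.close()

        # Reduce: count one partition at a time (assumed to fit in memory)
        # and save its counts as a run sorted by descending frequency.
        for part_path, run_path in zip(part_paths, run_paths):
            counts = Counter()
            with open(part_path, encoding="utf-8") as part:
                for line in part:
                    counts[line.rstrip("\n")] += 1
            with open(run_path, "w", encoding="utf-8") as run:
                for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
                    run.write(f"{n}\t{word}\n")

        def read_run(path):
            # Lazily yield (count, word) pairs from one sorted run.
            with open(path, encoding="utf-8") as run:
                for line in run:
                    n, word = line.rstrip("\n").split("\t")
                    yield int(n), word

        # Merge: k-way merge of the sorted runs, holding one entry per run.
        runs = [read_run(p) for p in run_paths]
        with open(dst_path, "w", encoding="utf-8") as dst:
            for n, word in heapq.merge(*runs, key=lambda r: (-r[0], r[1])):
                dst.write(f"{word}\t{n}\n")
```

Calling `count_words("big.txt", "counts.txt")` (both file names are placeholders) reads the input once, keeps only one partition's counts in memory at a time, and merges the sorted runs lazily. The same three phases map onto a real map-reduce framework if the data outgrows a single disk.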