Count word frequency of huge text file

java algorithm data-structures text-files word-count

vikky.rk · Feb 7, 2013 · Viewed 12k times · Source

I have a huge text file (larger than the available RAM memory). I need to count the frequency of all words and output the word and the frequency count into a new file. The result should be sorted in the descending order of frequency count.

My Approach:

Sort the given file - external sort
Count the frequency of each word sequentially, store the count in another file (along with the word)
Sort the output file based of frequency count - external sort.

I want to know if there are better approaches to do it. I have heard of disk based hash tables? or B+ trees, but never tried them before.

Note: I have seen similar questions asked on SO, but none of them have to address the issue with data larger than memory.

Edit: Based on the comments, agreed the a dictionary in practice should fit in the memory of today's computers. But lets take a hypothetical dictionary of words, that is huge enough not to fit in the memory.

Answer

I would go with a map reduce approach:

Distribute your text file on nodes, assuming each text in a node can fit into RAM.
Calculate each word frequency within the node. (using hash tables )
Collect each result in a master node and combine them all.

Count word frequency of huge text file

Answer

Related questions