Count word frequency of huge text file

vikky.rk · Feb 7, 2013 · Viewed 12k times

I have a huge text file (larger than the available RAM). I need to count the frequency of all words and output each word with its frequency count to a new file. The result should be sorted in descending order of frequency count.

My Approach:

  1. Sort the given file - external sort
  2. Count the frequency of each word sequentially, and store the count in another file (along with the word) - see the sketch after this list
  3. Sort the output file based on the frequency count - external sort.
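
A minimal sketch of step 2, assuming step 1 has already produced a file with one word per line in sorted order; the file names `sorted.txt` and `counts.txt` are placeholders, not anything from the question. Because identical words are adjacent after the external sort, only the current word and its running count need to be held in memory:

```python
# Step 2 sketch: count runs of identical words in an externally sorted file.
# Assumes step 1 produced "sorted.txt" with one word per line (file names are
# placeholders). Equal words are adjacent, so only one counter lives in memory.

def count_sorted(in_path="sorted.txt", out_path="counts.txt"):
    with open(in_path) as src, open(out_path, "w") as dst:
        current, count = None, 0
        for line in src:
            word = line.strip()
            if not word:
                continue
            if word == current:
                count += 1
            else:
                if current is not None:
                    dst.write(f"{current}\t{count}\n")
                current, count = word, 1
        if current is not None:          # flush the last run
            dst.write(f"{current}\t{count}\n")

if __name__ == "__main__":
    count_sorted()
```

Step 3 can then reuse the same external sort machinery as step 1, this time keyed on the count column.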

I want to know if there are better approaches. I have heard of disk-based hash tables and B+ trees, but I have never tried them before.

Note: I have seen similar questions asked on SO, but none of them address the case where the data is larger than memory.

Edit: Based on the comments, I agree that in practice a dictionary should fit in the memory of today's computers. But let's take a hypothetical dictionary of words that is huge enough not to fit in memory.

Answer

ogzd · Feb 7, 2013

I would go with a map-reduce approach:

  1. Distribute your text file across nodes, assuming the text on each node fits into RAM.
  2. Calculate the word frequencies within each node, using hash tables.
  3. Collect the results at a master node and combine them all (a single-machine sketch of steps 2 and 3 follows below).
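
A minimal single-machine sketch of the same idea, where a chunk of lines plays the role of a node: each chunk is counted with a hash table and the partial counts are then combined. It assumes the merged counter fits in memory (the simplification discussed in the comments); the chunk size and file name are placeholders:

```python
# Sketch of the map-reduce idea on one machine: each "node" is a chunk of
# lines that fits in RAM, counted with a hash table, and the per-chunk
# counters are merged at the end. Chunk size and file name are placeholders.

from collections import Counter
from itertools import islice

def map_count(lines):
    """Per-node step: count word frequencies for one in-memory chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Master step: combine the per-chunk counters."""
    total = Counter()
    for part in partials:
        total.update(part)
    return total

def word_frequencies(path, chunk_lines=100_000):
    partials = []
    with open(path) as src:
        while True:
            chunk = list(islice(src, chunk_lines))  # one "node's" share
            if not chunk:
                break
            partials.append(map_count(chunk))
    return reduce_counts(partials)

if __name__ == "__main__":
    freqs = word_frequencies("big.txt")
    for word, count in freqs.most_common():  # descending by frequency
        print(word, count)
```

In a real distributed setting, `map_count` would run on each node and `reduce_counts` at the master; if the combined dictionary itself did not fit in memory, the partial counts could be spilled to disk and merged with an external sort instead.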