The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence

Morgan Cheng · Oct 9, 2008 · Viewed 95k times

Input: A positive integer K and a large text. The text can be treated as a sequence of words, so we don't have to worry about how to split it into words.
Output: The K most frequent words in the text.

My thinking is as follows.

  1. Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time.

  2. Sort the (word, word-frequency) pairs, using "word-frequency" as the sort key. This takes O(n*lg(n)) time with a normal sorting algorithm.

  3. After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n + n*lg(n) + K). Since K is no larger than n, this is O(n*lg(n)).
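
A minimal JavaScript sketch of steps 1 to 3, assuming the text has already been split into an array of words. The function name `topKBySorting` and the use of a `Map` as the hash table are illustrative choices, not part of the original post.

```javascript
// Sketch of steps 1-3: count with a hash table, sort by frequency, take K.
// `topKBySorting` is a hypothetical name used only for this example.
function topKBySorting(words, k) {
  // Step 1: one O(n) pass to count every word.
  // A Map avoids clashes with inherited Object property names.
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  // Step 2: sort the (word, frequency) pairs by descending frequency.
  const pairs = [...counts.entries()];
  pairs.sort((a, b) => b[1] - a[1]);

  // Step 3: take the first K words.
  return pairs.slice(0, k).map(([word]) => word);
}

// "a" occurs 3 times and "b" twice, so they are the top 2.
console.log(topKBySorting("a b a c b a d".split(" "), 2));
```

The order among words with equal frequency is whatever the sort leaves behind, so ties are not broken in any particular way here.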

We can improve this. We only want the top K words; the frequencies of the other words are of no concern to us. So we can use a partial heap sort. Instead of fully sorting in steps 2) and 3), we change them to:

2') Build a max-heap of (word, word-frequency) pairs with "word-frequency" as the key. It takes O(n) time to build the heap;

3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total time is O(K*lg(n)).

To summarize, this solution costs O(n + K*lg(n)) time.
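
The heap variant could look roughly like the sketch below. JavaScript has no built-in heap, so the example carries its own small array-based max-heap; `siftDown` and `topKByHeap` are hypothetical names, and the input is again assumed to be an array of words.

```javascript
// Sketch of steps 1, 2') and 3'): count, heapify in O(n), then pop K times.
// All names here are illustrative, not from the original post.

// Restore the max-heap property below index i.
// The heap is an array of [word, count] pairs ordered by count.
function siftDown(heap, i, size) {
  while (true) {
    const left = 2 * i + 1;
    const right = left + 1;
    let largest = i;
    if (left < size && heap[left][1] > heap[largest][1]) largest = left;
    if (right < size && heap[right][1] > heap[largest][1]) largest = right;
    if (largest === i) return;
    [heap[i], heap[largest]] = [heap[largest], heap[i]];
    i = largest;
  }
}

function topKByHeap(words, k) {
  // Step 1: O(n) counting pass.
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  // Step 2'): bottom-up heap construction, linear in the number of pairs.
  const heap = [...counts.entries()];
  for (let i = Math.floor(heap.length / 2) - 1; i >= 0; i--) {
    siftDown(heap, i, heap.length);
  }

  // Step 3'): remove the maximum K times, O(lg(n)) per removal.
  const result = [];
  let size = heap.length;
  while (result.length < k && size > 0) {
    result.push(heap[0][0]);
    size--;
    heap[0] = heap[size]; // move the last element to the root
    siftDown(heap, 0, size);
  }
  return result;
}

console.log(topKByHeap("a b a c b a d".split(" "), 2));
```

The words come out in descending order of frequency because each removal takes the current maximum.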

This is just my thought. I haven't found a way to improve step 1).
I hope some information retrieval experts can shed more light on this question.

Answer

Chihung Yu · Mar 12, 2014

This can be done in O(n) time

Solution 1:

Steps:

  1. Count the words using a hash table, which ends up with a structure like this:

    var hash = {
      "I" : 13,
      "like" : 3,
      "meow" : 3,
      "geek" : 3,
      "burger" : 2,
      "cat" : 1,
      "foo" : 100,
      ...
    };

  2. Traverse the hash and find the highest frequency (in this case "foo" with 100), then create an array with indices from 0 up to that frequency.

  3. Traverse the hash again and use each word's number of occurrences as the array index: if there is nothing at that index yet, create an array there; otherwise append the word to the existing array. We end up with an array like:

      0   1      2            3                  100
    [[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
    
  4. Then just traverse the array from the end, and collect the k words.
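
The four steps above can be sketched as follows, assuming the input is an array of words. `topKByBuckets` is a hypothetical name, and a `Map` stands in for the hash; the sample counts are the ones listed in step 1.

```javascript
// Sketch of Solution 1: bucket words by their frequency.
// `topKByBuckets` is an illustrative name, not from the original answer.
function topKByBuckets(words, k) {
  // Step 1: count the words.
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  // Step 2: find the highest frequency and size the array for index 0..max.
  let maxCount = 0;
  for (const c of counts.values()) {
    if (c > maxCount) maxCount = c;
  }
  const buckets = new Array(maxCount + 1);

  // Step 3: buckets[c] collects every word that occurs exactly c times.
  for (const [word, c] of counts) {
    if (!buckets[c]) buckets[c] = [];
    buckets[c].push(word);
  }

  // Step 4: walk from the highest frequency down and collect k words.
  const result = [];
  for (let c = maxCount; c >= 1 && result.length < k; c--) {
    for (const word of buckets[c] || []) {
      if (result.length === k) break;
      result.push(word);
    }
  }
  return result;
}

// Rebuild a word sequence from the sample counts in step 1.
const sample = { I: 13, like: 3, meow: 3, geek: 3, burger: 2, cat: 1, foo: 100 };
const sequence = [];
for (const [word, c] of Object.entries(sample)) {
  for (let i = 0; i < c; i++) sequence.push(word);
}
console.log(topKByBuckets(sequence, 2)); // "foo" (100) first, then "I" (13)
```

The array has at most n + 1 slots because no word can occur more often than the sequence is long, which is what keeps the whole procedure linear in n.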

Solution 2:

Steps:

  1. Same as above
  2. Use a min-heap keyed on the number of occurrences and keep its size at most k. For each word in the hash: 1) if the heap holds fewer than k words, insert the word; 2) otherwise, if the word's number of occurrences is greater than the heap's minimum, remove the minimum and insert the word; 3) otherwise, skip the word.
  3. After traversing the hash, convert the min-heap to an array and return the array.
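
A rough sketch of Solution 2 follows, with a small hand-written min-heap since JavaScript does not ship one. `topKByMinHeap` and its helper names are hypothetical. The final sort of the k survivors is an addition for readable output and is not part of the steps above.

```javascript
// Sketch of Solution 2: keep a min-heap of at most k [word, count] pairs.
// All names are illustrative, not from the original answer.
function topKByMinHeap(words, k) {
  // Step 1: count the words.
  const counts = new Map();
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }

  const heap = []; // min-heap ordered by count; heap[0] is the minimum
  const swap = (i, j) => {
    [heap[i], heap[j]] = [heap[j], heap[i]];
  };
  const siftUp = (i) => {
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent][1] <= heap[i][1]) break;
      swap(parent, i);
      i = parent;
    }
  };
  const siftDown = (i) => {
    while (true) {
      const left = 2 * i + 1;
      const right = left + 1;
      let smallest = i;
      if (left < heap.length && heap[left][1] < heap[smallest][1]) smallest = left;
      if (right < heap.length && heap[right][1] < heap[smallest][1]) smallest = right;
      if (smallest === i) return;
      swap(i, smallest);
      i = smallest;
    }
  };

  // Step 2: offer every (word, count) pair to the heap.
  for (const entry of counts) {
    if (heap.length < k) {
      heap.push(entry);
      siftUp(heap.length - 1);
    } else if (k > 0 && entry[1] > heap[0][1]) {
      heap[0] = entry; // replace the current minimum
      siftDown(0);
    }
  }

  // Step 3: the heap now holds the answer. Sorting it by descending count
  // is an extra O(k*lg(k)) step added here for presentation only.
  return heap.sort((a, b) => b[1] - a[1]).map(([word]) => word);
}

console.log(topKByMinHeap("a b a c b a d".split(" "), 2));
```

Each heap operation costs O(lg(k)), so after the O(n) counting pass this variant takes O(n*lg(k)) in the worst case while using only O(k) extra space beyond the hash.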