I was trying to count duplicate words in a list of 230 thousand words, using a Python dictionary. The code is given below:
word_dict = {}
for words in word_list:
    if words in word_dict.keys():
        word_dict[words] += 1
    else:
        word_dict[words] = 1
The above code took 3 minutes! I then ran the same code over 1.5 million words; after more than 25 minutes it was still running, so I lost patience and terminated it. Then I found that I could use the following code from here (also shown below). The result was surprising: it completed within seconds! So my question is: what makes this way of doing the operation so much faster? I would guess the dictionary creation process takes O(N) time either way. How was the Counter method able to finish in seconds and produce exactly the dictionary I wanted, with each word as a key and its frequency as its value?
from collections import Counter
word_dict = Counter(word_list)
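
For anyone who wants to reproduce the comparison, here is a minimal, self-contained timing sketch; the synthetic word list and the name count_with_dict are just illustrative stand-ins for my real data and code, and the gap you see will depend on your Python version and machine:

import random
import string
import time
from collections import Counter

# Synthetic stand-in for the real data: 230,000 short lowercase "words".
word_list = [''.join(random.choices(string.ascii_lowercase, k=4))
             for _ in range(230_000)]

def count_with_dict(word_list):
    word_dict = {}
    for words in word_list:
        if words in word_dict.keys():  # the membership test from my code above
            word_dict[words] += 1
        else:
            word_dict[words] = 1
    return word_dict

start = time.perf_counter()
count_with_dict(word_list)
print('dict loop:', time.perf_counter() - start, 'seconds')

start = time.perf_counter()
Counter(word_list)
print('Counter:  ', time.perf_counter() - start, 'seconds')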
It's because of this:

if words in word_dict.keys():

In Python 2, .keys() returns a list of all the keys. Scanning a list for membership takes linear time, and doing that scan once per word made your program run in quadratic time! (In Python 3, .keys() returns a view with constant-time membership tests, so there the call is merely redundant rather than catastrophic.)
Try this instead:
if words in word_dict:
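
Put together, the corrected loop looks like this. The in test on a dict is a single average-case O(1) hash lookup, so the whole pass over N words is O(N), the same asymptotic cost as Counter:

word_dict = {}
for words in word_list:
    if words in word_dict:  # O(1) average-case hash lookup, no list scan
        word_dict[words] += 1
    else:
        word_dict[words] = 1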
Also, if you're interested, you can see the Counter implementation for yourself. It's written in regular Python.
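
To give the flavor of it, here is a stripped-down sketch of the counting idea (SimpleCounter is a made-up name for illustration; the real Counter has more features and also uses a C helper for the counting loop when one is available):

class SimpleCounter(dict):
    # Illustrative only: count items with one dict lookup per item.
    def __init__(self, iterable=()):
        super().__init__()
        for item in iterable:
            # get() folds the membership test and the read into one hash lookup.
            self[item] = self.get(item, 0) + 1

print(SimpleCounter(['a', 'b', 'a']))  # {'a': 2, 'b': 1}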