I am using NLTK and trying to count word phrases (n-grams) up to a certain length for a particular document, as well as the frequency of each phrase. I tokenize the string to get the data list.
from nltk.util import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import *
data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"]
bigrams = ngrams(data, 2)
bigrams_c = {}
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1
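The counts can then be displayed by printing the dictionary items, for example:

for item in bigrams_c.items():
    print(item)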
The above code gives output like this:
(('is', 'this'), 1)
(('test', 'this'), 2)
(('a', 'test'), 3)
(('this', 'is'), 4)
(('is', 'not'), 1)
(('real', 'not'), 2)
(('is', 'real'), 2)
(('not', 'a'), 3)
which is partially what I am looking for.
My question is: is there a more convenient way to do this for phrases up to, say, 4 or 5 words in length, without duplicating this code only to change the n-gram size?
Since you tagged this nltk, here's how to do it using nltk's methods, which have some more features than the ones in the standard Python collections:
from nltk import ngrams, FreqDist
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = FreqDist(ngrams(data, size))
Each element of the dictionary all_counts is a dictionary of ngram frequencies. For example, you can get the five most common trigrams like this:
all_counts[3].most_common(5)
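Since FreqDist is a subclass of collections.Counter, you can also look up the count of a specific phrase directly, or report the most common phrases for every size collected above; a minimal sketch using the same data:

# count of one specific bigram (4 with the sample data above)
print(all_counts[2][("this", "is")])

# five most common phrases for each size
for size in 2, 3, 4, 5:
    print(size, all_counts[size].most_common(5))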