In this documentation, there is example using nltk.collocations.BigramAssocMeasures()
, BigramCollocationFinder
,nltk.collocations.TrigramAssocMeasures()
, and TrigramCollocationFinder
.
There is example method find nbest based on pmi for bigram and trigram. example:
finder = BigramCollocationFinder.from_words(
... nltk.corpus.genesis.words('english-web.txt'))
>>> finder.nbest(bigram_measures.pmi, 10)
I know that BigramCollocationFinder
and TrigramCollocationFinder
inherit from AbstractCollocationFinder.
While BigramAssocMeasures()
and TrigramAssocMeasures()
inherit from NgramAssocMeasures.
How can I use the methods(e.g. nbest()
) in AbstractCollocationFinder
and NgramAssocMeasures
for 4-gram, 5-gram, 6-gram, ...., n-gram (like using bigram and trigram easily)?
Should I create class which inherit AbstractCollocationFinder
?
Thanks.
If you want to find the grams beyond 2 or 3 grams you can use scikit package and Freqdist function to get the count for these grams. I tried doing this with nltk.collocations, but I dont think we can find out more than 3-grams score into it. So I rather decided to go with count of grams. I hope this can help u a little bit. Thankz
here is the code
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk
query = "This document gives a very short introduction to machine learning problems"
vect = CountVectorizer(ngram_range=(1,4))
analyzer = vect.build_analyzer()
listNgramQuery = analyzer(query)
listNgramQuery.reverse()
print "listNgramQuery=", listNgramQuery
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print "\nNgramQueryWeights=", NgramQueryWeights
This will give output as
listNgramQuery= [u'to machine learning problems', u'introduction to machine learning', u'short introduction to machine', u'very short introduction to', u'gives very short introduction', u'document gives very short', u'this document gives very', u'machine learning problems', u'to machine learning', u'introduction to machine', u'short introduction to', u'very short introduction', u'gives very short', u'document gives very', u'this document gives', u'learning problems', u'machine learning', u'to machine', u'introduction to', u'short introduction', u'very short', u'gives very', u'document gives', u'this document', u'problems', u'learning', u'machine', u'to', u'introduction', u'short', u'very', u'gives', u'document', u'this']
NgramQueryWeights= <FreqDist: u'document': 1, u'document gives': 1, u'document gives very': 1, u'document gives very short': 1, u'gives': 1, u'gives very': 1, u'gives very short': 1, u'gives very short introduction': 1, u'introduction': 1, u'introduction to': 1, ...>