Print the 10 most frequently occurring words of a text, including and excluding stopwords

user2064809 · Feb 8, 2015 · Viewed 26.7k times

I took this question from here and made some changes. I have the following code:

import nltk

def content_text(text):
    # Keep only the words that are not English stopwords
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return content

How can I print the 10 most frequently occurring words of a text, 1) including and 2) excluding stopwords?

Answer

igorushi · Feb 8, 2015

NLTK provides a FreqDist class for counting word frequencies:

import nltk
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

stopwords = nltk.corpus.stopwords.words('english')
# Lowercase before the stopword check: the NLTK English stopword list is all lowercase
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)

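To extract the 10 most common words:

# most_common(10) returns a list of (word, count) tuples, so unpack the words
mostCommon = [word for word, count in allWordDist.most_common(10)]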
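For trying the idea without downloading any NLTK data, here is a minimal sketch of the same two counts using collections.Counter, which FreqDist subclasses. The regex tokenizer, the tiny SAMPLE_STOPWORDS set, and the top_words helper are all assumptions made for illustration; they stand in for word_tokenize and the NLTK stopword list and are not part of NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer).
SAMPLE_STOPWORDS = {"the", "a", "of", "and", "on", "in", "to", "is"}

def top_words(text, n=10, stopwords=None):
    """Return the n most common lowercase words, optionally skipping stopwords."""
    # Crude regex tokenizer standing in for nltk.tokenize.word_tokenize.
    words = re.findall(r"[a-z']+", text.lower())
    if stopwords is not None:
        words = [w for w in words if w not in stopwords]
    # Counter.most_common returns (word, count) pairs; keep only the words.
    return [word for word, _ in Counter(words).most_common(n)]

sample = "The cat sat on the mat and the dog sat on the cat"
print(top_words(sample))                              # 1) including stopwords
print(top_words(sample, stopwords=SAMPLE_STOPWORDS))  # 2) excluding stopwords
```

Swapping in nltk.FreqDist for Counter and nltk.corpus.stopwords.words('english') for SAMPLE_STOPWORDS should give the NLTK version, since most_common behaves the same way on both.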