How to find the most common words using spaCy?

Harry Loyd · May 16, 2016 · Viewed 15.6k times

I'm using spaCy with Python and it's working fine for tagging each word, but I was wondering whether it's possible to find the most common words in a string. Is it also possible to get the most common nouns, verbs, adverbs and so on?

There's a count_by function included, but I can't seem to get it to run in any meaningful way.

Answer

Paras Dahal · Jan 2, 2017

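I recently had to count the frequency of all the tokens in a text file. You can collect the tokens into a list, filtering by part of speech with the pos_ attribute if you only want certain kinds of words, and then count them with collections.Counter. Here is a simple example: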

import spacy
from collections import Counter

# 'en' was a shortcut name in older spaCy releases; spaCy 3 removed shortcuts,
# so load a model by its full name there, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')
doc = nlp(u'Your text here')

# all tokens that aren't stop words or punctuation
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

# noun tokens that aren't stop words or punctuation
nouns = [token.text for token in doc if not token.is_stop and not token.is_punct and token.pos_ == "NOUN"]

# five most common tokens
word_freq = Counter(words)
common_words = word_freq.most_common(5)

# five most common noun tokens
noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(5)
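
To cover the "verbs, adverbs and so on" part of the question, the same filtering can be done for every part of speech in one pass. The following is only a sketch: common_words_by_pos is a hypothetical helper name, not part of spaCy, and lowercasing the text before counting is my own assumption (drop it if case matters to you). It only relies on the text, pos_, is_stop and is_punct token attributes used above.

```python
from collections import Counter, defaultdict


def common_words_by_pos(tokens, n=5):
    """Return {pos_tag: [(word, count), ...]} with the n most frequent words per tag.

    `tokens` is any iterable of objects exposing the token attributes
    text, pos_, is_stop and is_punct (a spaCy Doc qualifies).
    """
    counts = defaultdict(Counter)
    for token in tokens:
        # Skip stop words and punctuation, as in the example above.
        if token.is_stop or token.is_punct:
            continue
        # Assumption: lowercase so that "Dog" and "dog" are counted together.
        counts[token.pos_][token.text.lower()] += 1
    return {pos: counter.most_common(n) for pos, counter in counts.items()}


# Usage with spaCy (assumes `nlp` was loaded as in the example above):
#   doc = nlp(u'Your text here')
#   by_pos = common_words_by_pos(doc)
#   by_pos.get("VERB", [])   # most common verbs
#   by_pos.get("ADV", [])    # most common adverbs
```

Using .get() with a default avoids a KeyError when the text happens to contain no token with a given tag.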