What is the best stemming method in Python?

PeYoTlL · Jul 9, 2014 · Viewed 56k times

I tried all the nltk stemming methods, but they give me weird results with some words.

Examples

They often cut off the end of a word when they shouldn't:

  • poodle => poodl
  • article => articl

or don't stem very well:

  • easily and easy are not stemmed to the same word
  • leaves, grows, fairly are not stemmed

Do you know of other stemming libraries in Python, or a good dictionary?

Thank you

Answer

Spaceghost · Jul 9, 2014

The results you are getting are (generally) expected for a stemmer in English. You say you tried "all the nltk methods" but when I try your examples, that doesn't seem to be the case.

Here are some examples using the PorterStemmer:

>>> import nltk
>>> ps = nltk.stem.PorterStemmer()  # the module is nltk.stem, not nltk.stemmer
>>> ps.stem('grows')
'grow'
>>> ps.stem('leaves')
'leav'
>>> ps.stem('fairly')
'fairli'

The results are 'grow', 'leav' and 'fairli' which, even if they are not what you wanted, are stemmed versions of the original words.

If we switch to the Snowball stemmer, we have to provide the language as a parameter.

>>> import nltk
>>> sno = nltk.stem.SnowballStemmer('english')
>>> sno.stem('grows')
'grow'
>>> sno.stem('leaves')
'leav'
>>> sno.stem('fairly')
'fair'

The results are the same as before for 'grows' and 'leaves', but 'fairly' is stemmed to 'fair'.

So in both cases (and there are more than two stemmers available in nltk), the words that you say are not stemmed are, in fact, stemmed. The LancasterStemmer will return 'easy' when provided with 'easily' or 'easy' as input.
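Since the stemmers disagree on some words, it can help to run your own word list through all of them and look at the results side by side. Here is a minimal sketch of that, assuming nltk is installed; the helper names compare_stemmers, nltk_stemmers and print_comparison are made up for this example and are not part of nltk.

```python
def compare_stemmers(words, stemmers):
    """Return {word: {stemmer_name: stem}} for every word and every stemmer.

    `stemmers` maps a label to any callable that takes a word and returns a string.
    """
    return {
        word: {name: stem(word) for name, stem in stemmers.items()}
        for word in words
    }


def nltk_stemmers():
    """Build the three NLTK stemmers mentioned in this answer (requires nltk)."""
    # Imported here so compare_stemmers() stays usable without nltk.
    from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

    return {
        "porter": PorterStemmer().stem,
        "snowball": SnowballStemmer("english").stem,
        "lancaster": LancasterStemmer().stem,
    }


def print_comparison(words):
    """Print one row per word showing the stem each stemmer produces."""
    table = compare_stemmers(words, nltk_stemmers())
    for word, stems in table.items():
        row = "  ".join(f"{name}={stem}" for name, stem in stems.items())
        print(f"{word:<10} {row}")
```

Calling print_comparison(['poodle', 'article', 'easily', 'easy', 'leaves', 'grows', 'fairly']) should make it easy to see which stemmer, if any, behaves the way you want on your own vocabulary.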

Maybe you really wanted a lemmatizer? That would return 'article' and 'poodle' unchanged.

>>> import nltk
>>> # needs the WordNet data, e.g. via nltk.download('wordnet')
>>> lemma = nltk.stem.WordNetLemmatizer()
>>> lemma.lemmatize('article')
'article'
>>> lemma.lemmatize('leaves')
'leaf'
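
One thing to keep in mind: lemmatize() assumes the word is a noun unless you pass a pos argument, so verbs and adjectives may come back unchanged. A rough sketch of passing the part of speech along, assuming your words already carry Penn Treebank tags (for example from nltk.pos_tag); the helper names penn_to_wordnet and lemmatize_tagged are hypothetical, not nltk functions.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # everything else is treated as a noun


def lemmatize_tagged(tagged_words):
    """Lemmatize (word, Penn tag) pairs; needs nltk and its WordNet data."""
    # Imported here so penn_to_wordnet() can be used without nltk installed.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_words
    ]
```

For example, lemmatize_tagged([('leaves', 'NNS'), ('grows', 'VBZ')]) looks up 'leaves' as a noun and 'grows' as a verb.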