How do I do word Stemming or Lemmatization?

manixrock picture manixrock · Apr 21, 2009 · Viewed 134.1k times · Source

I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both get less than half right.

See also:

Answer

theycallmemorty picture theycallmemorty · May 4, 2009

If you know Python, The Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus prior to using it. This can be done by:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.