Getting the root word using the WordNet Lemmatizer

Shanika Ediriweera · Sep 3, 2016

For a keyword extractor, I need to map all related forms of a word to one common root word.

How can I convert words to the same root using the Python NLTK lemmatizer?

  • E.g.:
    1. generalized, generalization -> general
    2. optimal, optimized -> optimize (maybe)
    3. configure, configuration, configured -> configure

The Python NLTK lemmatizer returns 'generalize' for 'generalized' and 'generalizing' when the part-of-speech (pos) parameter is set, but not for 'generalization'.

Is there a way to do this?

Answer

Ani Menon · Sep 3, 2016

Use SnowballStemmer:

>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("english")
>>> print(stemmer.stem("generalized"))
general
>>> print(stemmer.stem("generalization"))
general
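
For the keyword-extractor use case in the question, one possible approach is to bucket words by their stem. The following is only a minimal sketch: `group_by_stem` is a hypothetical helper name, not part of NLTK, and it assumes that a Snowball stem is an acceptable stand-in for a "root word". Stems are not guaranteed to be dictionary words.

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer


def group_by_stem(words, language="english"):
    """Group words that share the same Snowball stem (hypothetical helper)."""
    stemmer = SnowballStemmer(language)
    groups = defaultdict(list)
    for word in words:
        # The stem is the dictionary key; the original word is kept as a member.
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    sample = ["generalized", "generalization",
              "configure", "configuration", "configured"]
    for stem, members in group_by_stem(sample).items():
        print(stem, "<-", members)
```

Each key of the returned dictionary can then serve as the shared keyword for every word in its group.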

Note: Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

A general issue I have seen with lemmatizers is that they treat even longer, derived words as lemmas in their own right.

Example, using the WordNet lemmatizer (checked in NLTK):

  • Generalized => Generalize
  • Generalization => Generalization
  • Generalizations => Generalization

No POS tag was given as input in the above cases, so each word was always treated as a noun.
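
If the words come out of a POS tagger, the tag can be passed on to the lemmatizer instead of falling back to the noun default. A rough sketch, assuming Penn Treebank style tags (e.g. `VBN`, `JJ`) as input; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, while `WordNetLemmatizer.lemmatize(word, pos=...)` is the NLTK call:

```python
from nltk.stem import WordNetLemmatizer


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun if unknown)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's own default


def lemmatize_tagged(word, penn_tag, lemmatizer=None):
    """Lemmatize a word using the POS implied by its Penn Treebank tag."""
    # Calling lemmatize() requires the WordNet corpus to be installed.
    lemmatizer = lemmatizer or WordNetLemmatizer()
    return lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(penn_tag))
```

Calling `lemmatize_tagged` needs the WordNet data (for example via `nltk.download("wordnet")`). As the question notes, passing the verb tag is what turns 'generalized' into 'generalize'; 'generalization' is left as it is because it is a lemma of its own, so a stemmer is still needed to merge it with the others.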