For a keyword extractor, I need to map all related words to a single common root word.
How can I convert words to the same root using the Python NLTK lemmatizer?
The NLTK lemmatizer gives 'generalize' for 'generalized' and 'generalizing' when the part-of-speech (pos) tag parameter is used, but not for 'generalization'.
Is there a way to do this?
Use SnowballStemmer:
>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer("english")
>>> print(stemmer.stem("generalized"))
general
>>> print(stemmer.stem("generalization"))
general
Note: Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.
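For the keyword-extractor use case in the question, the stemmer can be used to bucket related words under one key. The following is only a minimal sketch: the helper name `group_by_stem` and the sample word list are made up for illustration and are not part of NLTK.

```python
from collections import defaultdict

from nltk.stem.snowball import SnowballStemmer


def group_by_stem(words, language="english"):
    """Map each stem to the input words that reduce to it (hypothetical helper)."""
    stemmer = SnowballStemmer(language)
    groups = defaultdict(list)
    for word in words:
        # stem() lowercases the word and strips suffixes by rule,
        # without looking at the surrounding context.
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


words = ["generalize", "generalized", "generalizing", "generalization"]
groups = group_by_stem(words)
print(groups)
```

Note that the key is a stem, not necessarily a dictionary word, so it is better suited as an internal grouping key than as text shown to users.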
A general issue I have seen with lemmatizers is that they treat even longer, derived words as lemmas in their own right.
Example: in the WordNet lemmatizer (checked in NLTK),
No POS tag was given as input in the above cases, so each word was always treated as a noun.