I'm lemmatizing the TED dataset transcript and I notice something strange: not all words are being lemmatized. For example,
selected -> select
Which is right.
However, involved !-> involve
and horsing !-> horse
unless I explicitly input the 'v' (Verb) attribute.
In the Python terminal I get the right output, but not in my code:
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'
The relevant section of the code is this:
for l in LDA_Row[0].split('+'):
    w = str(l.split('*')[1])
    word = lmtzr.lemmatize(w)
    wordv = lmtzr.lemmatize(w, 'v')
    print wordv, word
    # if word is not wordv:
    #     print word, wordv
The whole code is here.
What is the problem?
The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() with its default settings, the tag defaults to noun, see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39
To resolve the problem, always POS-tag your data before lemmatizing, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...         lemma = word
...     else:
...         lemma = wnl.lemmatize(word, wntag)
...     print lemma
...
This
be
a
foo
bar
sentence
Note that 'is' is lemmatized to 'be' only when the verb tag is given, i.e.
>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'
To answer the question with words from your examples:
>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print lemma
...
These
sentence
involve
some
horse
around
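One caveat about the loops above: tag[0].lower() turns the Penn Treebank adjective tags (JJ, JJR, JJS) into 'j', which is not in the ['a', 'r', 'n', 'v'] list, so adjectives are passed through unlemmatized, while RP (particle) becomes 'r' and is treated as an adverb. A minimal Python 3 sketch of a more explicit mapping follows. The names penn_to_wordnet and lemmatize_tagged are hypothetical helpers, not part of NLTK, and the lemmatizer is passed in as a callable so the mapping logic can be checked without the WordNet data installed.

```python
# Python 3 sketch. penn_to_wordnet and lemmatize_tagged are hypothetical
# helper names, not part of NLTK.

# First letter of a Penn Treebank tag -> WordNet POS letter.
_PENN_PREFIX_TO_WORDNET = {
    'J': 'a',  # JJ, JJR, JJS -> adjective
    'V': 'v',  # VB, VBD, VBZ, ... -> verb
    'N': 'n',  # NN, NNS, NNP, ... -> noun
}


def penn_to_wordnet(tag):
    """Return 'a', 'v', 'n' or 'r' for a Penn Treebank tag, else None."""
    if tag.startswith('RB'):  # RB, RBR, RBS are adverbs; RP (particle) is not
        return 'r'
    return _PENN_PREFIX_TO_WORDNET.get(tag[:1])


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs, leaving words with unmapped tags as-is.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas
```

With NLTK and its data available, this would be called as lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize). Words tagged as determiners, prepositions and so on are returned unchanged, as in the loops above.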
Note that there are some quirks with WordNetLemmatizer:
Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy:
For an out-of-the-box / off-the-shelf lemmatization solution, you can take a look at https://github.com/alvations/pywsd and at how I've used some try-excepts to catch words that are not in WordNet, see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66