I want to use the WordNet lemmatizer in Python. I have learnt that the default POS tag is NOUN, so it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.
My question is: what is the best way to perform this lemmatization accurately?
I did the POS tagging using nltk.pos_tag, and I am lost on how to convert the Treebank POS tags to WordNet-compatible POS tags. Please help.
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
tagged = nltk.pos_tag(tokens)  # tokens: a list of word strings
The output tags look like NN, JJ, VB, RB. How do I change these to WordNet-compatible tags?
Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data to evaluate?
First of all, you can use nltk.pos_tag() directly without training it.
The function loads a pretrained tagger from a file. In older NLTK releases you can see the file name with nltk.tag._POS_TAGGER (newer releases no longer expose this attribute):
nltk.tag._POS_TAGGER  # 'taggers/maxent_treebank_pos_tagger/english.pickle'
As it was trained with the Treebank corpus, it also uses the Treebank tag set.
The following function would map the treebank tags to WordNet part of speech names:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    # Treebank tags share a first letter per word class (e.g. VB, VBD, VBG).
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''
You can then use the return value with the lemmatizer:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', wordnet.VERB)  # returns 'go'
Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError.
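One way to handle that check is to fall back to a default part of speech whenever the Treebank tag has no WordNet equivalent. The sketch below is only an illustration: the names lemmatize_tagged and demo are hypothetical, the fallback to NOUN is an assumption (chosen because it matches the lemmatizer's own default), and the single-letter constants are written out as the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV so that the mapping works without loading the corpus.

```python
# Values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'


def get_wordnet_pos(treebank_tag, default=NOUN):
    """Map a Treebank tag to a WordNet POS, falling back to `default`."""
    mapping = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}
    # treebank_tag[:1] is '' for an empty tag, so this never raises.
    return mapping.get(treebank_tag[:1], default)


def lemmatize_tagged(tagged, lemmatize):
    """Lemmatize (word, treebank_tag) pairs.

    `lemmatize` is any callable taking (word, pos), for example
    WordNetLemmatizer().lemmatize.
    """
    return [lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]


def demo():
    # Assumes NLTK is installed together with its tagger and WordNet data.
    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    tagged = nltk.pos_tag(['The', 'dogs', 'were', 'barking'])
    return lemmatize_tagged(tagged, WordNetLemmatizer().lemmatize)
```

Because every tag now maps to a valid WordNet POS, the lemmatizer never receives an empty string. Words whose tags fall outside the four classes (determiners, prepositions and so on) are simply lemmatized as nouns, which usually leaves them unchanged.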