Custom tagging with nltk

SpliFF · May 7, 2011

I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and the noun phrases those verbs should apply to. I'm working with nltk but not getting the results I'd hoped for, e.g.:

>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]

In each case it has failed to realise that the first word (select, move and copy) was intended as a verb. I know I can create custom taggers and grammars to work around this, but I'm hesitant to reinvent the wheel when a lot of this stuff is out of my league. I would particularly prefer a solution that could handle non-English languages as well.

So anyway, my question is one of: Is there a better tagger for this type of grammar? Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form? Is there a way to train a tagger? Is there a better way altogether?

Answer

Jacob · May 7, 2011

One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:

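>>> import nltk.tag, nltk.data
>>> # Load NLTK's pretrained default POS tagger to use as the backoff.
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> # Words listed here get the given tag; all other words go to the backoff.
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)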

Then you get

>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
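
If the default tagger cannot be loaded as a backoff object in your setup, the same override-then-fall-back idea can be approximated by post-processing the output of any tagging function. This is a minimal sketch rather than part of the answer above: `tag_with_overrides` and `fake_tagger` are hypothetical names, and `fake_tagger` only stands in for a real tagger such as `nltk.pos_tag`.

```python
def tag_with_overrides(tokens, overrides, fallback):
    """Tag tokens with `fallback`, then replace the tag of any word in `overrides`."""
    tagged = fallback(tokens)
    # Keep the fallback's tag unless the (lower-cased) word has a manual override.
    return [(word, overrides.get(word.lower(), tag)) for word, tag in tagged]


def fake_tagger(tokens):
    # Hypothetical stand-in for a real tagger such as nltk.pos_tag:
    # it simply labels every token as a noun.
    return [(tok, 'NN') for tok in tokens]


# Command words that should always be treated as verbs.
overrides = {'select': 'VB', 'move': 'VB', 'copy': 'VB'}

result = tag_with_overrides(['copy', 'the', 'files'], overrides, fake_tagger)
print(result)
```

Unlike a true backoff chain, the fallback here still tags the overridden word itself, so a context-sensitive tagger may choose neighbouring tags as if its own guess had been kept.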

This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
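
For a small command vocabulary, NLTK's own tagger classes can also be trained directly on hand-labelled sentences, without going through nltk-trainer. The following is a minimal sketch, assuming NLTK is installed; the toy training sentences are invented for illustration and are far too small for real use.

```python
import nltk

# Hand-labelled toy training data (an assumption for illustration only).
train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('copy', 'VB'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ('disk', 'NN')],
    [('move', 'VB'), ('the', 'DT'), ('folder', 'NN')],
]

# Words never seen in training fall through to a tagger that always answers 'NN'.
fallback = nltk.DefaultTagger('NN')

# The unigram tagger learns the most frequent tag for each training word.
tagger = nltk.UnigramTagger(train_sents, backoff=fallback)

result = tagger.tag(['move', 'the', 'files', 'to', 'harddrive'])
print(result)
```

For another language, the same pattern would apply with training sentences and a fallback tag suited to that language's tagset.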