I am trying to do POS tagging using the spaCy module in Python.
Here is my code:
from spacy.en import English, LOCAL_DATA_DIR
import spacy.en
import os

data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)
nlp = English(parser=False, tagger=True, entity=False)

def print_fine_pos(token):
    # Fine-grained part-of-speech tag, e.g. NN, NNS, JJ, VBD
    return token.tag_

def pos_tags(sentence):
    sentence = unicode(sentence, "utf-8")
    tokens = nlp(sentence)
    tags = []
    for tok in tokens:
        tags.append((tok, print_fine_pos(tok)))
    return tags

a = "we had crispy dosa"
print pos_tags(a)
Output:
[(We , u'PRP'), (had , u'VBD'), (crispy , u'NN'), (dosa, u'NN')]
Here it tags crispy as a noun instead of an adjective. However, if I use a test sentence like

a = "we had crispy fries"

it recognizes that crispy is an adjective. Here is the output:
[(we , u'PRP'), (had , u'VBD'), (crispy , u'JJ'), (fries, u'NNS')]
I think the primary reason crispy wasn't tagged as an adjective in the first case is that dosa was tagged as 'NN', whereas fries was tagged as 'NNS' in the second case.
Is there any way I can get crispy to be tagged as an adjective in the first case too?
TL;DR: You should accept the occasional error.
Details:
spaCy's tagger is statistical, meaning that the tags you get are its best estimate based on the data it was shown during training. I would guess those data did not contain the word dosa. The tagger had to guess, and guessed wrong.
There isn't an easy way to correct its output, because it is not using rules or anything else you can modify easily. The model has been trained on a standard corpus of English, which may be quite different from the kind of language you are applying it to (domain). If the error rate is too high for your purposes, you can re-train the model using domain-specific data, but this will be very laborious. Ask yourself what you are trying to achieve, and whether a 3% error rate in PoS tagging is really the worst of your problems.
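For illustration, here is a minimal sketch of what such re-training could look like in a later spaCy version (2.x), assuming the en_core_web_sm model is installed; the training sentences and tag lists are hypothetical placeholders for your own annotated domain data:

# Sketch only: fine-tuning a pretrained tagger on domain examples (spaCy 2.x API).
# A real domain corpus would need many annotated sentences, not a handful.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

TRAIN_DATA = [
    ("we had crispy dosa", {"tags": ["PRP", "VBD", "JJ", "NN"]}),
    ("the dosa was crispy", {"tags": ["DT", "NN", "VBD", "JJ"]}),
]

optimizer = nlp.resume_training()  # keep the pretrained weights (v2.2+)
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)

print([(t.text, t.tag_) for t in nlp("we had crispy dosa")])

Note that fine-tuning on a handful of sentences like this risks degrading accuracy on everything else ("catastrophic forgetting"); in practice you would mix domain examples with text the model already handles well.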
In general, you shouldn't judge the performance of a statistical system on a case-by-case basis. The accuracy of modern English PoS taggers is around 97%, which is roughly the same as that of the average human. You will inevitably get some errors. However, the model's errors will not be the same as human errors, because the two have "learnt" to solve the problem in different ways. Sometimes the model will get confused by things you and I consider obvious, as in your example. That doesn't mean it is bad overall, or that PoS tagging is your real problem.
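If you want to know whether the error rate actually matters for your data, measure it on a sample rather than on individual sentences. A small sketch using the nlp object from your question; gold_data is a hypothetical placeholder for sentences you have hand-tagged yourself:

# Sketch: estimate tagger accuracy on a hand-labelled sample of your own text.
# Assumes your gold tag lists line up with spaCy's tokenisation.
gold_data = [
    ("we had crispy dosa", ["PRP", "VBD", "JJ", "NN"]),
    ("we had crispy fries", ["PRP", "VBD", "JJ", "NNS"]),
]

correct = total = 0
for text, gold_tags in gold_data:
    doc = nlp(unicode(text, "utf-8"))
    for tok, gold in zip(doc, gold_tags):
        total += 1
        if tok.tag_ == gold:
            correct += 1

print("accuracy: %.1f%%" % (100.0 * correct / total))

If the measured rate is acceptable for your downstream task, the occasional dosa mistake is probably not worth the effort of fixing.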