NLTK for Named Entity Recognition

Darth.Vader picture Darth.Vader · Oct 11, 2013 · Viewed 21.1k times · Source

I am trying to use NLTK toolkit to get extract place, date and time from text messages. I just installed the toolkit on my machine and I wrote this quick snippet to test it out:

sentence = "Let's meet tomorrow at 9 pm";
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print nltk.ne_chunk(pos_tags, binary=True)

I was assuming that it will identify the date (tomorrow) and time (9 pm). But, surprisingly it failed to recognize that. I get the following result when I run my above code:

(S (GPE Let/NNP) 's/POS meet/NN tomorrow/NN at/IN 9/CD pm/NN)

Can someone help me understand if I am missing something or NLTK is just not mature enough to tag time and date properly. Thanks!

Answer

Viktor Vojnovski picture Viktor Vojnovski · Oct 16, 2013

The default NE chunker in nltk is a maximum entropy chunker trained on the ACE corpus (http://catalog.ldc.upenn.edu/LDC2005T09). It has not been trained to recognise dates and times, so you need to train your own classifier if you want to do that.

Have a look at http://mattshomepage.com/articles/2016/May/23/nltk_nec/, the whole process is explained very well.

Also, there is a module called timex in nltk_contrib which might help you with your needs. https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py