I need to classify words into their parts of speech (verb, noun, adverb, etc.). I used the following:
nltk.word_tokenize() # to identify the words in a sentence
nltk.pos_tag() # to identify the parts of speech
nltk.ne_chunk() # to identify named entities
The output of this is a tree. For example:
>>> sentence = "I am Jhon from America"
>>> sent1 = nltk.word_tokenize(sentence)
>>> sent2 = nltk.pos_tag(sent1)
>>> sent3 = nltk.ne_chunk(sent2, binary=True)
>>> sent3
Tree('S', [('I', 'PRP'), ('am', 'VBP'), Tree('NE', [('Jhon', 'NNP')]), ('from', 'IN'), Tree('NE', [('America', 'NNP')])])
To access the elements of this tree, I did the following:
>>> sent3[0]
('I', 'PRP')
>>> sent3[0][0]
'I'
>>> sent3[0][1]
'PRP'
But when accessing a Named Entity:
>>> sent3[2]
Tree('NE', [('Jhon', 'NNP')])
>>> sent3[2][0]
('Jhon', 'NNP')
>>> sent3[2][1]
Traceback (most recent call last):
  File "<pyshell#121>", line 1, in <module>
    sent3[2][1]
  File "C:\Python26\lib\site-packages\nltk\tree.py", line 139, in __getitem__
    return list.__getitem__(self, index)
IndexError: list index out of range
I got the above error.
What I want is to get 'NE' as the output, similar to the earlier 'PRP', so that I can identify which word is a named entity. Is there any way of doing this with NLTK in Python? If so, please post the command. Or is there a function in the tree library to do this? I need the node value 'NE'.
This answer may be off base, in which case I'll delete it, since I don't have NLTK installed here to try it, but I think you can just do:
>>> sent3[2].node
'NE'
sent3[2][0] returns the first child of the tree, not the node itself.
Edit: I tried this when I got home, and it does indeed work.
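To tag every word with its entity label in one pass, something like the sketch below may help. It rests on a few assumptions: the helper names node_label and tag_entities are made up for illustration, the tree is built by hand to mirror sent3 rather than produced by nltk.ne_chunk(), and the small Tree stand-in exists only so the snippet runs when NLTK is not installed. As far as I know, NLTK 3 replaced the .node attribute with a .label() method, so the helper tries that first.

```python
try:
    from nltk.tree import Tree  # the real NLTK class, if it is installed
except ImportError:
    # Hypothetical stand-in (not part of NLTK) so the sketch still runs.
    class Tree(list):
        def __init__(self, label, children):
            super().__init__(children)
            self._label = label

        def label(self):
            return self._label


def node_label(subtree):
    """Return the node label of a subtree, whichever NLTK version is in use."""
    if hasattr(subtree, "label"):  # NLTK 3.x style
        return subtree.label()
    return subtree.node  # older style, as used in the answer above


def tag_entities(tree):
    """Yield (word, pos_tag, entity_label); entity_label is None outside an entity."""
    for child in tree:
        if isinstance(child, Tree):
            # A subtree groups the (word, tag) pairs of one chunk.
            for word, tag in child:
                yield word, tag, node_label(child)
        else:
            # A plain (word, tag) tuple that is not inside any chunk.
            word, tag = child
            yield word, tag, None


# Hand-built copy of the tree from the question.
sent3 = Tree('S', [('I', 'PRP'), ('am', 'VBP'),
                   Tree('NE', [('Jhon', 'NNP')]),
                   ('from', 'IN'),
                   Tree('NE', [('America', 'NNP')])])

for word, tag, entity in tag_entities(sent3):
    print(word, tag, entity)
```

Checking isinstance(child, Tree) is what separates a chunk from an ordinary (word, tag) tuple, which is why sent3[2][1] failed in the question: the 'NE' subtree has only one child.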