NLTK Named Entity recognition to a Python list

Zlo picture Zlo · Aug 5, 2015 · Viewed 51.8k times · Source

I used NLTK's ne_chunk to extract named entities from a text:

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."


nltk.ne_chunk(my_sent, binary=True)

But I can't figure out how to save these entities to a list? E.g. –

print Entity_list
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')

Thanks.

Answer

alvas picture alvas · Aug 5, 2015

nltk.ne_chunk returns a nested nltk.tree.Tree object so you would have to traverse the Tree object to get to the NEs.

Take a look at Named Entity Recognition with Regular Expression: NLTK

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>> 
>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
...     for i in chunked:
...             if type(i) == Tree:
...                     current_chunk.append(" ".join([token for token, pos in i.leaves()]))
...             if current_chunk:
...                     named_entity = " ".join(current_chunk)
...                     if named_entity not in continuous_chunk:
...                             continuous_chunk.append(named_entity)
...                             current_chunk = []
...             else:
...                     continue
...     return continuous_chunk
... 
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']


>>> my_sent = "How's the weather in New York and Brooklyn"
>>> get_continuous_chunks(my_sent)
['New York', 'Brooklyn']