Entity Extraction/Recognition with free tools while feeding Lucene Index

Karussell · Sep 17, 2011 · Viewed 17.5k times

I'm currently investigating options for extracting person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is added as metadata and should increase the precision of the search.

E.g. when someone queries 'wicket', they should be able to decide whether they mean the cricket sport or the Apache project. I tried to implement this on my own, with only minor success so far. I have now found a lot of tools, but I'm not sure whether they are suited for this task, which of them integrate well with Lucene, or whether the precision of their entity extraction is high enough.

My questions:

  • Does anyone have experience with any of the tools listed above and their precision/recall? Is training data required and, if so, is it available?
  • Are there articles or tutorials that would help me get started with entity extraction (NER) for each tool?
  • How can they be integrated with Lucene?

Here are some questions related to that subject:

Answer

John Lehmann · Sep 19, 2011

The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, org, location).

For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to "How to use DBPedia to extract Tags/Keywords from content?", where I provide more explanation and mention several tools for disambiguation, including:

These tools often use a language-independent API such as REST. I am not aware that they provide Lucene support directly, but I hope my answer is helpful for the problem you are trying to solve.
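
To make the idea of disambiguating against a knowledge base more concrete, here is a minimal sketch in Python. It is not how any of the tools mentioned work internally. The tiny `KNOWLEDGE_BASE` dictionary, its entity ids and context words, and the field names `entity_ids` and `entity_categories` are all made up for illustration; a real setup would draw candidates and context from something like DBpedia. The sketch picks the candidate whose context words overlap most with the article text and attaches the result as metadata fields on a plain document dict, which could then be handed to whatever indexing client you use.

```python
import re
from collections import Counter

# Hypothetical toy knowledge base: each ambiguous surface form maps to
# candidate entities with a few hand-picked context words. A real system
# would derive candidates and context from a resource such as DBpedia.
KNOWLEDGE_BASE = {
    "wicket": [
        {"id": "Apache_Wicket", "category": "software",
         "context": "apache java web framework component open source project"},
        {"id": "Wicket_(cricket)", "category": "sport",
         "context": "cricket sport bowler batsman stumps bails pitch match"},
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(mention, text):
    """Return the candidate with the largest context-word overlap, or None."""
    candidates = KNOWLEDGE_BASE.get(mention.lower(), [])
    words = Counter(tokenize(text))
    best, best_score = None, 0
    for cand in candidates:
        # Count how often each distinct context word occurs in the text.
        score = sum(words[w] for w in set(tokenize(cand["context"])))
        if score > best_score:
            best, best_score = cand, score
    return best  # None if no candidate shares any word with the text


def annotate(text):
    """Build a document dict carrying entity metadata next to the body text."""
    doc = {"body": text, "entity_ids": [], "entity_categories": []}
    tokens = set(tokenize(text))
    for mention in KNOWLEDGE_BASE:
        if mention in tokens:
            entity = disambiguate(mention, text)
            if entity is not None:
                doc["entity_ids"].append(entity["id"])
                doc["entity_categories"].append(entity["category"])
    return doc


if __name__ == "__main__":
    print(annotate("The bowler hit the wicket in the final cricket match."))
    print(annotate("Wicket is a Java web framework from the Apache project."))
```

With the category stored as its own field, a search for 'wicket' can be narrowed by filtering on that field at query time. Simple word overlap is only a stand-in here; the dedicated disambiguation tools use far richer signals.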