How to use DBPedia to extract Tags/Keywords from content?

Pritam Raut · Jan 20, 2011

I am exploring how I can use Wikipedia's taxonomy information to extract Tags/Keywords from my content.

I found articles about DBpedia. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web.

Has anyone used their web services? Do you know how they work and how reliable they are?

Answer

John Lehmann · Jan 20, 2011

DBpedia is a fantastic, high-quality resource. To turn your content into a set of relevant DBpedia concepts, however, you will need to accurately identify them in your text, which involves at least two steps:

  1. Identify DBpedia concepts in your content: This includes recognizing concept names (and alternate names) in text, and also disambiguating among all possible meanings of each phrase. The term "Sun" may refer to dozens of possible concepts according to its disambiguation page including a star, newspapers, person names, etc. This involves entity identification, classification, and linking.

  2. Identify which of those concepts are interesting: For example, do you want the concept "Definite article" showing up whenever the text includes the word "the" (the page for "The" redirects to it)?
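
To make the two steps concrete, here is a minimal sketch in Python. It is not how any of the tools mentioned in this answer work internally. The lookup table `CANDIDATES`, the `BORING` set, and the concept identifiers are all hypothetical, hand-written stand-ins for data you would derive from DBpedia labels, redirects, and disambiguation pages. Disambiguation is reduced to counting shared context words, which is far cruder than a real entity linker.

```python
import re

# Hypothetical lookup table: surface form -> candidate concepts, each paired
# with a few context words that hint at that meaning. A real system would
# build this from DBpedia labels, redirects, and disambiguation pages.
CANDIDATES = {
    "sun": [
        ("dbpedia:Sun", {"star", "solar", "planet", "light"}),
        ("dbpedia:The_Sun_(United_Kingdom)", {"newspaper", "tabloid", "headline"}),
    ],
    "the": [("dbpedia:Definite_article", set())],
    "planet": [("dbpedia:Planet", {"orbit", "star", "solar"})],
}

# Hypothetical stop list of concepts that are never useful as tags (step 2).
BORING = {"dbpedia:Definite_article"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def link_concepts(text):
    """Step 1: spot known surface forms and pick one concept for each."""
    tokens = tokenize(text)
    context = set(tokens)
    linked = []
    for token in tokens:
        candidates = CANDIDATES.get(token)
        if not candidates:
            continue
        # Naive disambiguation: prefer the candidate whose hint words overlap
        # most with the words of the text (ties go to the first candidate).
        concept, _hints = max(candidates, key=lambda c: len(c[1] & context))
        if concept not in linked:
            linked.append(concept)
    return linked


def extract_tags(text):
    """Step 2: drop the concepts that are not interesting as tags."""
    return [c for c in link_concepts(text) if c not in BORING]
```

Calling `extract_tags("The Sun ran a tabloid headline.")` picks the newspaper sense of "Sun" and drops the definite article, while a sentence about stars and planets resolves "Sun" to the star.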

You may want to consider a preexisting text analytics library or service that supports entity linking to DBpedia. One great tool for topic indexing is Maui, which Alyona Medelyan developed during her PhD at the University of Waikato. Another great open source solution is Wikipedia Miner, by David Milne at the same university.

Two commercial services that link to DBpedia concepts are Zemanta and Extractiv (both allow some level of free use). DBpedia Spotlight is another option. Others that may provide these capabilities are listed at: https://stackoverflow.com/questions/2119279/is-there-a-better-tool-than-opencalais
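
As a rough illustration of what calling such a service looks like, here is a sketch against DBpedia Spotlight's REST interface using only the Python standard library. The endpoint URL, the `text` and `confidence` parameters, and the `Resources` / `@URI` / `@surfaceForm` response keys are assumptions based on the public Spotlight API as I understand it; they may have changed, so check the current Spotlight documentation before relying on them. The function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint; verify against the current DBpedia Spotlight docs.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build (but do not send) an annotate request asking for JSON."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        SPOTLIGHT_URL + "?" + query,
        headers={"Accept": "application/json"},
    )


def parse_resources(payload):
    """Return (surface form, DBpedia URI) pairs from a decoded JSON response.

    Assumes linked entities are listed under a "Resources" key; a response
    without that key is treated as having no entities.
    """
    return [(r["@surfaceForm"], r["@URI"]) for r in payload.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Send the request and return the linked entities (needs network access)."""
    with urllib.request.urlopen(build_request(text, confidence), timeout=10) as resp:
        return parse_resources(json.load(resp))
```

Raising `confidence` should return fewer but more certain links, which is one simple way to address the "which concepts are interesting" problem on the service side.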

Disclosure: I [used to] work at Extractiv (defunct), which is powered by Language Computer Corporation's NLP.