How to parse HTML with Nutch and index a specific tag to Solr?

Amir · Sep 9, 2012 · Viewed 9.3k times

I have installed Nutch and Solr to crawl a website and search it. As you know, Nutch's parse-metatags plugin can index a page's meta tags into Solr (http://wiki.apache.org/nutch/IndexMetatags). Is there any way, through a plugin or otherwise, to index an HTML tag that is not a meta tag? For example:

<div id=something>
      me specific tag
</div>

That is, I want to add a field named something to Solr whose value for this page is "me specific tag".

Any ideas?

Answer

Babu · Apr 14, 2013

I wrote my own plugin for something similar to what you want. The file that maps NutchDocument fields to Solr document fields is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own fields there. You still have to populate those fields somewhere, though, and that is the plugin's job.

Here are some tips for writing the plugin:

  • Read http://wiki.apache.org/nutch/WritingPluginExample, which shows how to build a simple plugin.
  • In your plugin, implement both ParseFilter and IndexingFilter.
  • In YourParseFilter, you can use NodeWalker to find your specific div.
  • Put the parsed information into the page metadata like this:

    page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));

  • In YourIndexingFilter, add the metadata from the page (page.getMetadata) to the NutchDocument:

    doc.add("your_specific_tag", value);

  • Most important: add your_specific_tag to the fields in each of these files:

    • Solr config file schema.xml (and restart Solr)

    <field name="your_specific_tag" type="string" stored="true" indexed="true"/>

    • Nutch config file schema.xml (I don't know whether this is really necessary)
    • Nutch config file solrindex-mapping.xml

    <field dest="your_specific_tag" source="your_specific_tag"/>
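
To make the parse step more concrete, here is a minimal standalone sketch of the kind of DOM walk YourParseFilter would do, plus the ByteBuffer round trip used for the page metadata. It uses only the JDK, not the Nutch API: the class DivTextExtractor and its methods are hypothetical names for illustration, and the recursive search stands in for what you would do with NodeWalker. The sample page quotes the id attribute because the JDK XML parser is strict; the HTML parser Nutch uses is more lenient and may also report element names in upper case, which is why the name comparison ignores case.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: shows the logic a parse filter would contain.
class DivTextExtractor {

    // Depth-first search for the first <div> whose id attribute equals the given id.
    // Returns the trimmed text content, or null if no such div exists.
    static String findDivText(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getNodeName()) && id.equals(el.getAttribute("id"))) {
                return el.getTextContent().trim();
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = findDivText(children.item(i), id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Encode the extracted text the way the putToMetadata call expects it (UTF-8 bytes).
    static ByteBuffer toMetadataValue(String text) {
        return ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    }

    // Decode a metadata value back to a String, as the indexing filter would need to.
    // duplicate() leaves the position of the original buffer untouched.
    static String fromMetadataValue(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    // Parse well-formed XHTML with the JDK parser (assumption: Nutch hands you a DOM already).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"something\">\n      me specific tag\n</div></body></html>";
        String text = findDivText(parse(page), "something");
        ByteBuffer stored = toMetadataValue(text);      // what the parse filter would store
        System.out.println(fromMetadataValue(stored));  // what the indexing filter would pass to doc.add
    }
}
```

In a real plugin, the value returned by findDivText would go into page.putToMetadata in the parse filter, and the decoded string would be passed to doc.add in the indexing filter.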