I have installed Nutch and Solr to crawl a website and search it. As you know, Nutch's parse-metatags plugin can index a page's meta tags into Solr (http://wiki.apache.org/nutch/IndexMetatags). Is there any way, with a plugin or otherwise, to index the content of another HTML tag that is not a meta tag? For example:
<div id=something>
me specific tag
</div>
In other words, I want to add a field named "something" to Solr whose value for this page is "me specific tag".
Any ideas?
I wrote my own plugin for something similar. The file that maps a NutchDocument to a SolrDocument is $NUTCH_HOME/conf/solrindex-mapping.xml, and you can add your own fields there. You still have to populate those fields somewhere, though, and that is the plugin's job.
Here are some tips for the plugin:
Put the information you parsed into the page metadata, like this:
page.putToMetadata(new Utf8("yourKEY"), ByteBuffer.wrap(YourByteArrayParsedFromMetaData));
In your IndexingFilter, read the metadata back from the page (page.getMetadata) and add it to the NutchDocument:
doc.add("your_specific_tag", value);
Most important: declare your_specific_tag as a field in both places, first in the Solr schema (schema.xml) and then in solrindex-mapping.xml:
<field name="your_specific_tag" type="string" stored="true" indexed="true"/>
<field dest="your_specific_tag" source="your_specific_tag"/>
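
To make the data flow in these tips concrete, here is a minimal, self-contained Java sketch. It is not Nutch code: the class name DivTagSketch and its helper methods are hypothetical, plain maps stand in for the page metadata and the NutchDocument, and a regular expression stands in for real HTML parsing (inside an actual plugin you would more likely read the text from the parsed DOM). It only shows the three steps: extract the div text, store it as a ByteBuffer under a key, then decode it and add it under the field name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivTagSketch {

    // Hypothetical helper: returns the trimmed text inside <div id=...>,
    // or null if no such div exists. Regex is only a stand-in for a real
    // HTML parser and does not handle nested divs.
    public static String extractDivText(String html, String id) {
        Pattern p = Pattern.compile(
            "<div(?:\\s[^>]*)?\\sid\\s*=\\s*[\"']?" + Pattern.quote(id)
                + "[\"']?(?:\\s[^>]*)?>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Step 1 of the tips: the value is stored as bytes wrapped in a ByteBuffer.
    public static ByteBuffer encode(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    // Step 2 of the tips: the indexing side turns the bytes back into a string.
    // duplicate() leaves the stored buffer's position untouched.
    public static String decode(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        String html = "<div id=something>\nme specific tag\n</div>";

        // Stand-in for the page metadata (assumption: key -> ByteBuffer).
        Map<String, ByteBuffer> pageMetadata = new HashMap<>();
        String text = extractDivText(html, "something");
        if (text != null) {
            pageMetadata.put("yourKEY", encode(text));
        }

        // Stand-in for the NutchDocument (assumption: field name -> value).
        Map<String, String> doc = new HashMap<>();
        ByteBuffer stored = pageMetadata.get("yourKEY");
        if (stored != null) {
            doc.put("your_specific_tag", decode(stored));
        }

        System.out.println(doc.get("your_specific_tag"));
    }
}
```

In a real plugin the first half would live in your parse filter and the second half in your indexing filter, with the page metadata carrying the value between them.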