Solr indexing following a Nutch crawl fails, reports "Job Failed"

rldrummer picture rldrummer · Feb 7, 2014 · Viewed 8.6k times · Source

I have a site hosted on my local machine that I am attempting to crawl with Nutch and index in Solr (both also on my local machine). I installed Solr 4.6.1 and Nutch 1.7 per the instructions given on the Nutch site (http://wiki.apache.org/nutch/NutchTutorial), and I have Solr running in my browser without issue.

I am running the following command:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 1 -topN 2

The crawl is working fine, but when it attemps to put the data into Solr, it fails with the following output:

Indexer: starting at 2014-02-06 16:29:28
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication


Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I went to the Nutch logs directory and tailed the hadoop.log file, it shows this:

2014-02-06 16:29:28,920 INFO  solr.SolrIndexWriter - Indexing 1 documents
2014-02-06 16:29:28,921 INFO  httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server localhost failed to respond
2014-02-06 16:29:28,921 INFO  httpclient.HttpMethodDirector - Retrying request
2014-02-06 16:29:28,924 WARN  mapred.LocalJobRunner - job_local331896790_0009
java.io.IOException
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:173)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:159)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Connection reset
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
    ... 6 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
    at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
    at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422)

Yet, I am still able to access Solr in my browser just fine. This is my first try at Solr/Nutch - any help from those with more knowledge would be much appreciated. Thanks.

Answer

Diaa picture Diaa · Feb 14, 2014

This happens when not all required fields from nutch are in the schema.xml of solr. Did you add the fields from Nutch's schema.xml?

If you add in the section "fields" the following, things should work:

<field name="id" type="string" stored="true" indexed="true"/>
<!-- core fields -->
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>

<!-- fields for index-basic plugin -->
<field name="host" type="string" stored="false" indexed="true"/>
<field name="url" type="url" stored="true" indexed="true"
    required="true"/>
<field name="content" type="text_general" stored="false" indexed="true"/>
<field name="title" type="text_general" stored="true" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>

<!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="true"
    multiValued="true"/>

<!-- fields for index-more plugin -->
<field name="type" type="string" stored="true" indexed="true"
    multiValued="true"/>
<field name="contentLength" type="long" stored="true"
    indexed="false"/>
<field name="lastModified" type="date" stored="true"
    indexed="false"/>
<field name="date" type="date" stored="true" indexed="true"/>

<!-- fields for languageidentifier plugin -->
<field name="lang" type="string" stored="true" indexed="true"/>

<!-- fields for subcollection plugin -->
<field name="subcollection" type="string" stored="true"
    indexed="true" multiValued="true"/>

<!-- fields for feed plugin (tag is also used by microformats-reltag)-->
<field name="author" type="string" stored="true" indexed="true"/>
<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="feed" type="string" stored="true" indexed="true"/>
<field name="publishedDate" type="date" stored="true"
    indexed="true"/>
<field name="updatedDate" type="date" stored="true"
    indexed="true"/>

<!-- fields for creativecommons plugin -->
<field name="cc" type="string" stored="true" indexed="true"
    multiValued="true"/>

<!-- fields for tld plugin -->    
<field name="tld" type="string" stored="false" indexed="false"/>