Top "Nutch" questions

Nutch is a mature, production-ready web crawler.

Insufficient space for shared memory file when I try to run the nutch generate command

I have been running Nutch crawling commands for the past 3 weeks and now I get the error below when I …

java jvm nutch
no segments* file found

I need to access a Lucene index (created by crawling several webpages using Nutch), but it is giving the error …

java lucene nutch
Nutch on Cygwin: how to set JAVA_HOME

I am trying to run Nutch with Cygwin. I am having problems setting JAVA_HOME. $ export JAVA_HOME='/…

cygwin nutch
ZooKeeper unable to open socket to localhost/0:0:0:0:0:0:0:1:2181

I am using a ZooKeeper ensemble for HBase. ZooKeeper is running on 3 machines, and HBase is also in fully distributed mode. …

apache hbase nutch apache-zookeeper
Web Crawler Algorithm: depth?

I'm working on a crawler and need to understand exactly what is meant by "link depth". Take Nutch for example: …

algorithm web-crawler nutch
How is an aggregator built?

Let's say I want to aggregate information related to a specific niche from many sources (could be travel, technology, or …

web-services aggregation web-crawler nutch
Using Nutch crawler with Solr

Am I able to integrate the Apache Nutch crawler with the Solr index server? Edit: One of our devs came up …

lucene solr nutch
How to open an Ant project (Nutch source) in IntelliJ IDEA?

I want to open the Nutch 2.1 source (http://www.eu.apache.org/dist/nutch/2.1/) in IntelliJ IDEA. Here is an …

ant intellij-idea nutch
An alternative web crawler to Nutch

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution …

search-engine web-crawler nutch
How to parse HTML with Nutch and index a specific tag to Solr?

I have installed Nutch and Solr for crawling a website and searching it; as you know, we can index …

solr nutch apache-tika