An alternative web crawler to Nutch

wassimans · Nov 24, 2010 · Viewed 9.4k times

I'm trying to build a specialised search engine web site that indexes a limited number of web sites. The solution I came up with is:

  • using Nutch as the web crawler,
  • using Solr as the search engine,
  • using Wicket for the front-end and the site logic.

The problem is that I find Nutch quite complex: it's a big piece of software to customise, and detailed documentation (books, recent tutorials, etc.) simply does not exist.

Questions now:

  1. Any constructive criticism about the whole idea of the site?
  2. Is there a good yet simple alternative to Nutch (as the crawling part of the site)?

Thanks

Answer

nate c · Nov 24, 2010

Scrapy is a Python library for crawling web sites. It is fairly small (compared to Nutch) and designed for crawling a limited set of sites. It has a Django-like MVC style that I found pretty easy to customize.