Anyone know of a good Python-based web crawler that I could use?

Matt · Jan 7, 2009 · Viewed 95.8k times

I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open-source crawlers, but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python, and I might end up doing that. If anyone has advice about any of those tools, I'm open to hearing it. I've used Heritrix via its web interface and found it quite cumbersome. I definitely won't be using a browser API for my upcoming project.

Thanks in advance. Also, this is my first SO question!

Answer

RexE · Jan 7, 2009
  • Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely.
  • Scrapy looks like an extremely promising project; it's new.
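
To give a rough idea of what the "fetch a page, parse it, follow the links" approach from the third bullet looks like, here is a minimal sketch. It is not the BeautifulSoup + urllib2 code itself: it assumes Python 3, where urllib2 became urllib.request, and it swaps in the standard library's html.parser for BeautifulSoup so it runs with no extra installs. The names LinkParser, extract_links, fetch and crawl are made up for this illustration, and the sketch ignores robots.txt and rate limiting, which a real crawler should not.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free http(s) links found in html."""
    parser = LinkParser()
    parser.feed(html)
    found = []
    for href in parser.links:
        # Resolve relative links against the page URL, then drop "#fragment".
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            found.append(absolute)
    return found


def fetch(url):
    """Download a page and decode it leniently (assumes UTF-8 text)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=20):
    """Breadth-first crawl that stays on the start URL's host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        visited.append(url)
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling crawl("http://example.com/") returns the list of pages it managed to download, in the order visited. The fetch_page parameter is only there so the download step can be replaced, for instance with a cached or throttled fetcher; for anything beyond a quick script, one of the libraries listed above is probably a better starting point.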