How to generate the start_urls dynamically in crawling?

user1215269 · Feb 17, 2012 · Viewed 18k times

I am crawling a site which may contain a lot of start_urls, like:

http://www.a.com/list_1_2_3.htm

I want to populate start_urls with URLs matching [list_\d+_\d+_\d+\.htm], and extract items from URLs matching [node_\d+\.htm] during crawling.

Can I use CrawlSpider to do this? And how can I generate the start_urls dynamically while crawling?

Answer

juraseg · Apr 30, 2012

The best way to generate URLs dynamically is to override the start_requests method of the spider:

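from scrapy.http import Request

def start_requests(self):
    # Read one URL per line; text mode yields str rather than bytes.
    with open('urls.txt') as urls:
        for line in urls:
            url = line.strip()  # drop the trailing newline
            if url:  # skip blank lines
                yield Request(url, callback=self.parse)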
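If the list pages follow a numeric pattern rather than living in a file, the URLs could also be built in code. The sketch below is only an illustration using the standard library: the helper names generate_list_urls and classify are hypothetical, and the number ranges are assumptions, since the question does not say which values each position in list_N_N_N.htm can take.

```python
import re
from itertools import product

BASE_URL = "http://www.a.com"  # host taken from the question

# Patterns from the question: list pages to follow, node pages to scrape.
LIST_PAGE = re.compile(r"list_\d+_\d+_\d+\.htm$")
NODE_PAGE = re.compile(r"node_\d+\.htm$")


def generate_list_urls(first, second, third):
    """Yield a list-page URL for every combination of the three numbers.

    The three iterables are assumptions; the question does not state
    which values each position can take.
    """
    for a, b, c in product(first, second, third):
        yield f"{BASE_URL}/list_{a}_{b}_{c}.htm"


def classify(url):
    """Return 'list', 'node', or None depending on which pattern matches."""
    if LIST_PAGE.search(url):
        return "list"
    if NODE_PAGE.search(url):
        return "node"
    return None


if __name__ == "__main__":
    for url in generate_list_urls(range(1, 3), [2], [3]):
        print(url, classify(url))
```

Inside the start_requests override above, the loop over the file could be replaced by a loop over generate_list_urls(...), yielding one Request per URL. With a CrawlSpider, the same two patterns would more likely go into Rule(LinkExtractor(allow=...)) entries, with the node rule pointing at a callback that is not named parse, because CrawlSpider uses parse internally to apply its rules.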