I am crawling a site that may have many start URLs, like:
http://www.a.com/list_1_2_3.htm
I want to populate start_urls with URLs matching the pattern list_\d+_\d+_\d+\.htm, and extract items from URLs matching node_\d+\.htm during the crawl. Can I use CrawlSpider for this? And how can I generate the start_urls dynamically while crawling?
The best way to generate URLs dynamically is to override the start_requests method of the spider:
from scrapy.http.request import Request

def start_requests(self):
    # read one URL per line; open in text mode and strip the trailing
    # newline, since Request expects a clean string URL
    with open('urls.txt') as urls:
        for url in urls:
            yield Request(url.strip(), callback=self.parse)
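For the two specific patterns in the question, a CrawlSpider would combine a dynamic start_requests like the one above with Rule/LinkExtractor objects whose allow arguments are those regexes. A minimal stdlib sketch of the pattern matching itself (the candidate URLs and the classify helper are illustrative assumptions, not Scrapy API):

```python
import re

# The two patterns from the question; in a CrawlSpider these would be the
# `allow` arguments of two LinkExtractor rules (one with follow=True for
# list pages, one with a callback that extracts items from node pages).
LIST_RE = re.compile(r'list_\d+_\d+_\d+\.htm')
NODE_RE = re.compile(r'node_\d+\.htm')

def classify(urls):
    """Split candidate URLs the way the two crawl rules would."""
    follow = [u for u in urls if LIST_RE.search(u)]
    parse = [u for u in urls if NODE_RE.search(u)]
    return follow, parse

candidates = [
    'http://www.a.com/list_1_2_3.htm',  # list page: follow it
    'http://www.a.com/node_42.htm',     # item page: extract from it
    'http://www.a.com/about.htm',       # matches neither rule
]
follow, parse = classify(candidates)
print(follow)  # → ['http://www.a.com/list_1_2_3.htm']
print(parse)   # → ['http://www.a.com/node_42.htm']
```

In the spider itself, the same regexes would go into rules such as Rule(LinkExtractor(allow=r'list_\d+_\d+_\d+\.htm'), follow=True) and Rule(LinkExtractor(allow=r'node_\d+\.htm'), callback='parse_item'); Rule and LinkExtractor are Scrapy's documented API, while the callback name here is an assumption.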