I have a URL of the form:
example.com/foo/bar/page_1.html
There are 53 pages in total, and each of them has ~20 rows.
I basically want to get all the rows from all the pages, i.e. ~53*20 items.
I have working code in my parse method that parses a single page and also goes one page deeper per item, to get more info about it:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]
        # get profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with base url since profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request

def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item
The question is, how do I crawl each page?
example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
example.com/foo/bar/page_53.html
You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
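Applied to your spider, a minimal sketch could look like this (it reuses your existing selectors; essentially only the yield at the end of the loop changes, and the commented-out next-page XPath is just a placeholder you would have to adapt to the site's pagination markup):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # ... fill in the other fields exactly as in your current code ...
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        # yield instead of return: the loop keeps running, so every row
        # produces its own profile request
        yield request

    # you could also follow the pagination links from here instead of
    # listing all pages up front; the XPath below is only a placeholder
    # next_page = hxs.select('//a[@class="next"]/@href').extract()
    # if next_page:
    #     yield Request(urljoin_rfc(get_base_url(response), next_page[0]),
    #                   callback=self.parse)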
In your case there is probably a simpler solution: just generate the list of start URLs from a pattern, like this:
class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page
                  for page in xrange(1, 54)]
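Scrapy will then schedule a request for each of those 53 URLs and call parse on every response, so combined with yield in the callback you end up with one profile request per row and roughly 53*20 items overall.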