scrapy - parsing items that are paginated

AlexBrand · Oct 11, 2012 · Viewed 23.4k times

I have a URL of the form:

example.com/foo/bar/page_1.html

There are a total of 53 pages, each one of them has ~20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page and also follows one page deeper per item to get more info about it:

  def parse(self, response):
    hxs = HtmlXPathSelector(response)

    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
      # some items don't have a category associated with them
      try:
        item['category'] = rest.select('td[3]/a/text()').extract()[0]
      except IndexError:
        item['category'] = ''
      item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

      # get profile url
      rel_url = rest.select('td[2]/a/@href').extract()[0]
      # join with base url since profile url is relative
      base_url = get_base_url(response)
      follow = urljoin_rfc(base_url, rel_url)

      request = Request(follow, callback=self.parse_profile)
      request.meta['item'] = item
      return request


  def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is, how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html

Answer

Achim · Oct 11, 2012

You have two options to solve your problem. The general one is to use yield to generate new requests instead of return; that way you can issue more than one request from a single callback. See the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
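For instance, here is a minimal sketch of the yield-based version, using the same Scrapy 0.x APIs as your snippet; the next-page XPath and the DegustaItem import path are placeholders you would adapt to your project:

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc

from degusta.items import DegustaItem  # placeholder import path

class DegustaSpider(BaseSpider):
    name = 'degusta'
    start_urls = ['http://example.com/foo/bar/page_1.html']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        # yield one profile request per row; unlike return,
        # yield keeps the loop going past the first row
        for rest in hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]'):
            item = DegustaItem()
            item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
            rel_url = rest.select('td[2]/a/@href').extract()[0]
            request = Request(urljoin_rfc(get_base_url(response), rel_url),
                              callback=self.parse_profile)
            request.meta['item'] = item
            yield request

        # also follow the pagination link; this XPath is a
        # placeholder for whatever the real "next" anchor is
        next_page = hxs.select('//a[@class="next"]/@href').extract()
        if next_page:
            yield Request(urljoin_rfc(get_base_url(response), next_page[0]),
                          callback=self.parse)

    def parse_profile(self, response):
        item = response.meta['item']
        return item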

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1,54)]
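
Scrapy then schedules all 53 pages up front and calls parse once per response. Note that you still need to yield each profile request from your loop rather than return the first one, or you will only ever get one item per page.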