Scrapy pipeline to export csv file in the right format

W.S. picture W.S. · Apr 29, 2015 · Viewed 20.4k times · Source

I made the improvement according to the suggestion from alexce below. What I need is like the picture below. However each row/line should be one review: with date, rating, review text and link.

I need to let item processor process each review of every page.
Currently TakeFirst() only takes the first review of the page. So 10 pages, I only have 10 lines/rows as in the picture below.

enter image description here

Spider code is below:

import scrapy
from amazon.items import AmazonItem

class AmazonSpider(scrapy.Spider):
   name = "amazon"
   allowed_domains = ['amazon.co.uk']
   start_urls = [
    'http://www.amazon.co.uk/product-reviews/B0042EU3A2/'.format(page) for      page in xrange(1,114)

]

def parse(self, response):
    for sel in response.xpath('//*[@id="productReviews"]//tr/td[1]'):
        item = AmazonItem()
        item['rating'] = sel.xpath('div/div[2]/span[1]/span/@title').extract()
        item['date'] = sel.xpath('div/div[2]/span[2]/nobr/text()').extract()
        item['review'] = sel.xpath('div/div[6]/text()').extract()
        item['link'] = sel.xpath('div/div[7]/div[2]/div/div[1]/span[3]/a/@href').extract()

        yield item

Answer

Frank Martin picture Frank Martin · Apr 30, 2015

I started from scratch and the following spider should be run with

scrapy crawl amazon -t csv -o Amazon.csv --loglevel=INFO

so that opening the CSV-File with a spreadsheet shows for me

enter image description here

Hope this helps :-)

import scrapy

class AmazonItem(scrapy.Item):
    rating = scrapy.Field()
    date = scrapy.Field()
    review = scrapy.Field()
    link = scrapy.Field()

class AmazonSpider(scrapy.Spider):

    name = "amazon"
    allowed_domains = ['amazon.co.uk']
    start_urls = ['http://www.amazon.co.uk/product-reviews/B0042EU3A2/' ]

    def parse(self, response):

        for sel in response.xpath('//table[@id="productReviews"]//tr/td/div'):

            item = AmazonItem()
            item['rating'] = sel.xpath('./div/span/span/span/text()').extract()
            item['date'] = sel.xpath('./div/span/nobr/text()').extract()
            item['review'] = sel.xpath('./div[@class="reviewText"]/text()').extract()
            item['link'] = sel.xpath('.//a[contains(.,"Permalink")]/@href').extract()
            yield item

        xpath_Next_Page = './/table[@id="productReviews"]/following::*//span[@class="paging"]/a[contains(.,"Next")]/@href'
        if response.xpath(xpath_Next_Page):
            url_Next_Page = response.xpath(xpath_Next_Page).extract()[0]
            request = scrapy.Request(url_Next_Page, callback=self.parse)
            yield request