Scrapy text encoding

mindcast · Feb 7, 2012

Here is my spider:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem

class vriskoSpider(CrawlSpider):
    name = 'vrisko'
    allowed_domains = ['vrisko.gr']
    start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
    # Raw string so the regex backslashes are not treated as string escapes.
    rules = (Rule(SgmlLinkExtractor(allow=(r'\?page=\d')), 'parse_start_url', follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
        vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
        return vriskoit

My problem is that the returned strings are Unicode and I want to encode them as UTF-8. I don't know the best way to do this. I have tried several approaches without success.

Thank you in advance!

Answer

Lacek · Dec 27, 2016

Scrapy 1.2.0 introduced a new setting, FEED_EXPORT_ENCODING. Setting it to utf-8 makes the JSON feed output keep non-ASCII characters as they are instead of writing them as \uXXXX escape sequences.

That is, add the following to your settings.py:

FEED_EXPORT_ENCODING = 'utf-8'
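
To see what "escaped" means here, the sketch below uses only Python's standard json module rather than Scrapy itself. The sample item and its values are made up for illustration, and the comparison is an analogy for the effect of the setting, not a description of how Scrapy's exporter is implemented.

```python
import json

# Hypothetical item resembling what the spider in the question might return.
item = {"eponimia": ["Γιατρός"], "address": ["Κορδελιό"]}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as-is, which is
# comparable to the feed output with FEED_EXPORT_ENCODING = 'utf-8'.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same data; only the written text differs.
assert json.loads(escaped) == json.loads(readable) == item
```

The escaped output is still valid JSON and loads back to the same strings, so the setting only changes how readable the file is. If you would rather not edit settings.py, the same setting can be passed for a single run with the -s option, for example: scrapy crawl vrisko -o items.json -s FEED_EXPORT_ENCODING=utf-8 (the output file name here is only an example).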