I just started programming Python. I want to use scrapy to create a bot,and it showed TypeError: Object of type 'bytes' is not JSON serializable when I run the project.
import json
import codecs
class W3SchoolPipeline(object):
def __init__(self):
self.file = codecs.open('w3school_data_utf8.json', 'wb', encoding='utf-8')
def process_item(self, item, spider):
line = json.dumps(dict(item)) + '\n'
# print line
self.file.write(line.decode("unicode_escape"))
return item
from scrapy.spiders import Spider
from scrapy.selector import Selector
from w3school.items import W3schoolItem
class W3schoolSpider(Spider):
name = "w3school"
allowed_domains = ["w3school.com.cn"]
start_urls = [
"http://www.w3school.com.cn/xml/xml_syntax.asp"
]
def parse(self, response):
sel = Selector(response)
sites = sel.xpath('//div[@id="navsecond"]/div[@id="course"]/ul[1]/li')
items = []
for site in sites:
item = W3schoolItem()
title = site.xpath('a/text()').extract()
link = site.xpath('a/@href').extract()
desc = site.xpath('a/@title').extract()
item['title'] = [t.encode('utf-8') for t in title]
item['link'] = [l.encode('utf-8') for l in link]
item['desc'] = [d.encode('utf-8') for d in desc]
items.append(item)
return items
Traceback:
TypeError: Object of type 'bytes' is not JSON serializable
2017-06-23 01:41:15 [scrapy.core.scraper] ERROR: Error processing {'desc': [b'\x
e4\xbd\xbf\xe7\x94\xa8 XSLT \xe6\x98\xbe\xe7\xa4\xba XML'],
'link': [b'/xml/xml_xsl.asp'],
'title': [b'XML XSLT']}
Traceback (most recent call last):
File
"c:\users\administrator\appdata\local\programs\python\python36\lib\site-p
ackages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\LZZZZB\w3school\w3school\pipelines.py", line 19, in process_item
line = json.dumps(dict(item)) + '\n'
File
"c:\users\administrator\appdata\local\programs\python\python36\lib\json\_
_init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File
"c:\users\administrator\appdata\local\programs\python\python36\lib\json\e
ncoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File
"c:\users\administrator\appdata\local\programs\python\python36\lib\json\e
ncoder.py", line 257, in iterencode
return _iterencode(o, 0)
File
"c:\users\administrator\appdata\local\programs\python\python36\lib\
json\encoder.py", line 180, in default
o.__class__.__name__)
TypeError: Object of type 'bytes' is not JSON serializable
You are creating those bytes
objects yourself:
item['title'] = [t.encode('utf-8') for t in title]
item['link'] = [l.encode('utf-8') for l in link]
item['desc'] = [d.encode('utf-8') for d in desc]
items.append(item)
Each of those t.encode()
, l.encode()
and d.encode()
calls creates a bytes
string. Do not do this, leave it to the JSON format to serialise these.
Next, you are making several other errors; you are encoding too much where there is no need to. Leave it to the json
module and the standard file object returned by the open()
call to handle encoding.
You also don't need to convert your items
list to a dictionary; it'll already be an object that can be JSON encoded directly:
class W3SchoolPipeline(object):
def __init__(self):
self.file = open('w3school_data_utf8.json', 'w', encoding='utf-8')
def process_item(self, item, spider):
line = json.dumps(item) + '\n'
self.file.write(line)
return item
I'm guessing you followed a tutorial that assumed Python 2, you are using Python 3 instead. I strongly suggest you find a different tutorial; not only is it written for an outdated version of Python, if it is advocating line.decode('unicode_escape')
it is teaching some extremely bad habits that'll lead to hard-to-track bugs. I can recommend you look at Think Python, 2nd edition for a good, free, book on learning Python 3.