I have a JSON file storing some user information, including id, name and url. The file looks like this:
{"link": "https://www.example.com/user1", "id": 1, "name": "user1"}
{"link": "https://www.example.com/user1", "id": 2, "name": "user2"}
This file was written by a scrapy spider. Now I want to read the urls from the json file and scrape each user's webpage. But I cannot load the data from the json file.
At this time, I have no idea how to get these urls. I think I should read the lines from the json file first. I tried the following code in Python shell:
import json
f = open('links.jl')
line = json.load(f)
I got the following error message:
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 138 column 497 (char 498 - 67908)
I searched online, and the results suggested that the JSON file may have formatting issues. But the file was created and populated with items by a scrapy pipeline. Does anybody have a clue what caused the error, and how to solve it? Any suggestions on reading the urls?
Thanks a lot.
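Those are JSON lines, as the exporter name implies: each line of the file is a separate JSON document, so json.load() parses the first object and then reports everything after it as "Extra data".
Take a look in scrapy.contrib.exporter (scrapy.exporters in Scrapy 1.0 and later) and see the difference between JsonItemExporter and JsonLinesItemExporter.
This should do the trick: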
import json

lines = []
with open('links.jl', 'r') as f:
    # Parse each line as its own JSON document
    for line in f:
        lines.append(json.loads(line))
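To get from the parsed lines to the urls the question asks about, something along these lines might work. It is only a sketch: `read_links` is a hypothetical helper name, and the default key "link" is taken from the sample data in the question (the file stores the url under "link", not "url"). Blank lines are skipped on the assumption that the file may end with an empty line.

```python
import json


def read_links(path, key="link"):
    """Return the value stored under `key` for every JSON line in `path`.

    `read_links` and the default key "link" are illustrative assumptions
    based on the sample file shown in the question.
    """
    links = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                # Skip blank lines, e.g. a trailing empty line
                continue
            item = json.loads(line)  # one JSON object per line
            links.append(item[key])
    return links
```

Calling read_links('links.jl') would then return a plain list of urls, which could be used as the start_urls of the second spider.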