How to read lines from a json file in scrapy

Olivia · Dec 24, 2012 · Viewed 7.2k times

I have a json file storing some user information including id, name and url. The json file looks like this:

{"link": "https://www.example.com/user1", "id": 1, "name": "user1"}
{"link": "https://www.example.com/user1", "id": 2, "name": "user2"}

This file was written by a Scrapy spider. Now I want to read the URLs from it and scrape each user's page, but I cannot load the data from the file.

I assume I need to read the lines from the file first, so I tried the following code in the Python shell:

import json    
f = open('links.jl')    
line = json.load(f)

I got the following error message:

raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 138 column 497 (char 498 - 67908)

I searched online, and the results suggest that the file may have formatting problems. But the file was created and populated with items by a Scrapy pipeline. Does anybody know what causes the error and how to fix it? Any suggestions on reading the URLs?

Thanks a lot.

Answer

marius_5 · Apr 16, 2013

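That file is in JSON Lines format, as the exporter name implies: each line is a separate JSON object. json.load() expects the whole file to be a single JSON document, so it stops after the first line and reports "Extra data".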

Take a look at scrapy.contrib.exporter (scrapy.exporters in Scrapy 1.0 and later) and compare JsonItemExporter with JsonLinesItemExporter.
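To make the difference between the two formats concrete, here is a rough sketch that imitates both output shapes using only the standard library. It does not call Scrapy's exporters. The variable names are made up for illustration, and the sample records are copied from the question.

```python
import json

# Sample records copied from the question.
items = [
    {"link": "https://www.example.com/user1", "id": 1, "name": "user1"},
    {"link": "https://www.example.com/user1", "id": 2, "name": "user2"},
]

# Shape of a single-document JSON export: one array holding every item.
as_json = json.dumps(items)

# Shape of a JSON Lines export: one independent object per line.
as_json_lines = "\n".join(json.dumps(item) for item in items) + "\n"

# The array parses in one call.
from_json = json.loads(as_json)

# The JSON Lines text does not: the parser finishes the first object
# and then finds more data after it.
try:
    json.loads(as_json_lines)
    whole_file_error = None
except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
    whole_file_error = str(exc)

# Parsing one line at a time works.
from_json_lines = [json.loads(line) for line in as_json_lines.splitlines()]
```

Here whole_file_error holds an "Extra data" message like the one in the question, while from_json_lines contains both records.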

This should do the trick:

import json

lines = []

# Each line of a .jl file is its own JSON document, so parse line by line.
with open('links.jl', 'r') as f:
    for line in f:
        lines.append(json.loads(line))
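
Since the goal is the URLs, the same loop can collect only the "link" field. The following is a minimal sketch. read_links is a hypothetical helper name, and it assumes every record has a "link" key, as in the sample from the question. It also skips blank lines, which the loop above would choke on.

```python
import json


def read_links(path):
    """Return the "link" value of every record in a JSON Lines file.

    Hypothetical helper: assumes each non-blank line is a JSON object
    with a "link" key.
    """
    links = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines, e.g. a trailing newline
                continue
            record = json.loads(line)
            links.append(record["link"])
    return links
```

Calling read_links('links.jl') would then give a plain list of URLs, which could for instance be assigned to a spider's start_urls.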