Web scraping Google News with Python

Jiyda Moussa · Mar 21, 2013 · Viewed 35.4k times

I am creating a web scraper for different news outlets. For the New York Times and the Guardian it was easy, since they have their own APIs.

Now I want to scrape results from the newspaper GulfTimes.com. Their website does not provide an advanced search, so I resorted to Google News. However, the Google News API has been deprecated. What I want is to retrieve the number of results from an advanced search such as keyword = "Egypt", begin_date="10/02/2011" and end_date="10/05/2011".

This is feasible in the Google News UI: set the source to "Gulf Times", enter the query and dates, and count the results manually. But when I try to do the same from Python, I get a 403 error, which is understandable.

Any idea how I could do this? Or is there another service besides Google News that would allow it? Keep in mind that I would issue almost 500 requests at once.

import json
import urllib2
import cookielib
import re
from bs4 import BeautifulSoup


def run():
   Query = "Egypt"
   Month = "3"
   FromDay = "2"
   ToDay = "4"
   Year = "13"
   url='https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q='+Query+'&as_occt=any&as_drrb=b&as_mindate='+Month+'%2F'+FromDay+'%2F'+Year+'&as_maxdate='+Month+'%2F'+ToDay+'%2F'+Year+'&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'
   cj = cookielib.CookieJar()
   opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
   request = urllib2.Request(url)   
   response = opener.open(request)
   htmlFile = BeautifulSoup(response)
   print htmlFile


run()

Answer

alecxe · Mar 21, 2013

You can use the excellent requests library:

import requests

URL = 'https://www.google.com/search?pz=1&cf=all&ned=us&hl=en&tbm=nws&gl=us&as_q={query}&as_occt=any&as_drrb=b&as_mindate={month}%2F{from_day}%2F{year}&as_maxdate={month}%2F{to_day}%2F{year}&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F13%2Ccd_max%3A3%2F2%2F13&as_nsrc=Gulf%20Times&authuser=0'


def run(**params):
    response = requests.get(URL.format(**params))
    # Function-call form of print works on both Python 2 and Python 3.
    print(response.content)
    print(response.status_code)


run(query="Egypt", month=3, from_day=2, to_day=2, year=13)

With this you'll get status_code=200.
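Note that the URL template above hard-codes the tbs date range (3/1/13 to 3/2/13), so that part does not follow the dates passed to run(). Here is a minimal Python 3 sketch, standard library only, that derives every date field from the same arguments. The helper names build_search_url and build_request are hypothetical, the parameter names are simply copied from the URL above, and the browser-like User-Agent is an assumption about why the default client gets refused, not documented behaviour:

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://www.google.com/search"


def build_search_url(query, source, month, from_day, to_day, year):
    """Build the search URL with all date fields derived from the arguments."""
    min_date = "{}/{}/{}".format(month, from_day, year)
    max_date = "{}/{}/{}".format(month, to_day, year)
    # Parameter names are copied from the URL in the answer, not from any
    # official documentation.
    params = [
        ("pz", "1"),
        ("cf", "all"),
        ("ned", "us"),
        ("hl", "en"),
        ("tbm", "nws"),
        ("gl", "us"),
        ("as_q", query),
        ("as_occt", "any"),
        ("as_drrb", "b"),
        ("as_mindate", min_date),
        ("as_maxdate", max_date),
        ("tbs", "cdr:1,cd_min:{},cd_max:{}".format(min_date, max_date)),
        ("as_nsrc", source),
        ("authuser", "0"),
    ]
    # urlencode percent-escapes "/", ":" and "," and turns spaces into "+".
    return BASE_URL + "?" + urlencode(params)


def build_request(url):
    """Wrap the URL in a Request carrying a browser-like User-Agent."""
    # Assumption: the header value is illustrative only.
    return Request(url, headers={"User-Agent": "Mozilla/5.0"})


url = build_search_url("Egypt", "Gulf Times", 3, 2, 4, 13)
request = build_request(url)
print(url)
```

The sketch only builds the request; passing it to urllib.request.urlopen (or passing the URL to requests.get) would actually send it.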

Also, take a look at the Scrapy project. Few tools make web scraping simpler.
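Whichever client you end up with, the roughly 500 queries mentioned in the question are probably better sent one after another with a pause in between than all at once. A minimal sketch, assuming you supply your own fetch function; fetch_all, its parameters, and the 2-second default delay are hypothetical choices, not anything a particular service prescribes:

```python
import time


def fetch_all(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL in order, pausing between calls.

    Returns a dict mapping each URL to its result, or to the exception
    raised for that URL, so one failure does not stop the whole batch.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first.
            sleep(delay_seconds)
        try:
            results[url] = fetch(url)
        except Exception as error:
            # Record the failure and carry on with the remaining URLs.
            results[url] = error
    return results
```

Here fetch could be, for example, a small function that calls requests.get(url) and returns response.status_code. The sleep argument is injectable only so the pacing can be checked without actually waiting.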