Scraping and parsing Google search results using Python

Flake picture Flake · Oct 12, 2011 · Viewed 57.5k times · Source

I asked a question on realizing a general idea to crawl and save webpages. Part of the original question is: how to crawl and save a lot of "About" pages from the Internet.

With some further research, I got some choices to go ahead with both on scraping and parsing (listed at the bottom).

Today, I ran into another Ruby discussion about how to scrape from Google search results. This provides a great alternative for my problem which will save all the effort on the crawling part.

The new question are: in Python, to scrape Google search results for a given keyword, in this case "About", and finally get the links for further parsing. What are the best choices of methods and libraries to go ahead with? (in measure of easy-to-learn and easy-to-implement).

p.s. in this website, the exactly same thing is implemented, but closed and ask for money for more results. I'd prefer to do it myself if no open-source available and learn more Python in the meanwhile.

Oh, btw, advices for parsing the links from search results would be nice, if any. Still, easy-to-learn and easy-to-implement. Just started learning Python. :P


Final update, problem solved. Code using xgoogle, please read note in the section below in order to make xgoogle working.

import time, random
from xgoogle.search import GoogleSearch, SearchError

f = open('a.txt','wb')

for i in range(0,2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    #Try not to annnoy Google, with a random short wait
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")

print "Done"
f.close()

Note on xgoogle (below answered by Mike Pennington): The latest version from it's Github does not work by default already, due to changes in Google search results probably. These two replies (a b) on the home page of the tool give a solution and it is currently still working with this tweak. But maybe some other day it may stop working again due to Google's change/block.


Resources known so far:

  • For scraping, Scrapy seems to be a popular choice and a webapp called ScraperWiki is very interesting and there is another project extract it's library for offline/local usage. Mechanize was brought up quite several times in different discussions too.

  • For parsing HTML, BeautifulSoup seems to be the one of the most popular choices. Of course. lxml too.

Answer

Mike Pennington picture Mike Pennington · Oct 12, 2011

You may find xgoogle useful... much of what you seem to be asking for is there...