Web scraping tutorial using Python 3?

vt2424253 · May 28, 2013 · Viewed 10.6k times

I am trying to learn Python 3.x so that I can scrape websites. People have recommended that I use Beautiful Soup 4 or lxml.html. Could someone point me in the right direction for a tutorial or examples of using BeautifulSoup with Python 3.x?

Thank you for your help.

Answer

Hartley Brody · Aug 5, 2013

I've actually just written a full guide on web scraping that includes some sample code in Python. I wrote and tested it on Python 2.7, but both of the packages I used (requests and BeautifulSoup) are fully compatible with Python 3 according to the Wall of Shame.
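
If you're on Python 3, the main thing to know is that BeautifulSoup 4 is imported from the bs4 package. Here's a minimal sketch (assuming both packages are installed, e.g. with pip) just to confirm the two libraries work together on Python 3:

import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4 lives in the bs4 package

# fetch a page and parse it with the built-in html.parser
response = requests.get("http://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# print the page title to confirm the parse worked
print(soup.title.string)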

Here's some code to get you started with web scraping in Python:

import sys
import requests
from bs4 import BeautifulSoup


def scrape_google(keyword):

    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )

    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }

    # make the request to the search URL, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r

    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)

    # convert the plaintext HTML markup into a DOM-like structure that we can search
    soup = BeautifulSoup(r.text, "html.parser")

    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")

    # iterate over each of the result wrapper elements
    for result in results:

        # the main link is an <h3> element with class="r"
        result_anchor = result.find("h3", "r").find("a")

        # print out each link in the results
        print(result_anchor.contents)


if __name__ == "__main__":

    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"

    scrape_google(keyword)
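
To try it out, run the script with a keyword as the first command-line argument (it falls back to "web scraping" if you don't pass one), or import the function and call it from your own code. For example, assuming you've saved the script as scrape_google.py (the filename is just a placeholder):

# from the command line:
#   python scrape_google.py "python 3 tutorial"

# or programmatically, from another script in the same directory
from scrape_google import scrape_google

scrape_google("python 3 tutorial")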

If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then this article on transitioning from Python 2 to Python 3 might be helpful.
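
For context, these are a couple of the Python 2 to 3 changes you're most likely to hit while scraping (an illustrative sketch, not an exhaustive list; see the linked article for a fuller treatment):

# print is a function in Python 3 (Python 2 allowed: print "hello")
print("hello")

# urllib2 is gone; its functionality moved into urllib.request
from urllib.request import urlopen

# urlopen() returns bytes in Python 3, so decode explicitly to get a str
html_bytes = urlopen("http://example.com").read()
html_text = html_bytes.decode("utf-8")
print(html_text[:100])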