Passing web data into Beautiful Soup - Empty list

user3885774 · Jul 31, 2014 · Viewed 13.9k times

I've rechecked my code and compared it with similar examples of opening a URL and passing the web data into Beautiful Soup. For some reason my code doesn't return anything, even though it appears to be in the correct form:

>>> from bs4 import BeautifulSoup

>>> from urllib3 import poolmanager

>>> connectBuilder = poolmanager.PoolManager()

>>> content = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')

>>> content
<urllib3.response.HTTPResponse object at 0x00000000032EC390>

>>> soup = BeautifulSoup(content)

>>> soup.title
>>> soup.title.name
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'name'
>>> soup.p
>>> soup.get_text()
''

>>> content.data
a stream of data follows...

As shown, urlopen() returns an HTTP response, which is captured by the variable content. It makes sense that the status of the response can be read from it, but after it's passed into Beautiful Soup, the web data doesn't get converted into a usable Beautiful Soup object (variable soup). You can see that I've tried to read a few tags and the text: soup.title and soup.p return nothing, and get_text() returns an empty string.

Strangely, when I access the web data via content.data, the data shows up, but that isn't useful if I can't get Beautiful Soup to parse it. What is the problem? Thanks.

Answer

Padraic Cunningham · Jul 31, 2014

If you just want to scrape the page, requests will get the content you need:

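from bs4 import BeautifulSoup
import requests

# Fetch the page; r.content holds the raw response body as bytes
r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
# Naming the parser explicitly avoids the "no parser specified" warning in newer bs4 releases
soup = BeautifulSoup(r.content, 'html.parser')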

In [59]: soup.title
Out[59]: <title>Beautiful Soup: We called him Tortoise because he taught us.</title>

In [60]: soup.title.name
Out[60]: 'title'
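If you would rather stay with urllib3, the question already hints at a fix: content.data holds the page, so that is what can be handed to Beautiful Soup instead of the response object. A likely explanation for the empty result, assuming urllib3's default behaviour of preloading the body, is that the body has already been consumed into .data by the time Beautiful Soup tries to read from the response object, so nothing is left to parse. The sketch below is only an illustration; the helper names soup_from_bytes and fetch_soup are hypothetical, and 'html.parser' is just one possible parser choice.

```python
from bs4 import BeautifulSoup
import urllib3


def soup_from_bytes(body: bytes) -> BeautifulSoup:
    """Parse raw HTML bytes using Python's built-in HTML parser."""
    return BeautifulSoup(body, "html.parser")


def fetch_soup(url: str) -> BeautifulSoup:
    """Fetch a URL with urllib3 and parse the response body.

    Hypothetical helper for illustration only.
    """
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    # response.data is the already-downloaded body as bytes; pass that
    # to Beautiful Soup rather than the HTTPResponse object itself.
    return soup_from_bytes(response.data)
```

With this in place, fetch_soup('http://www.crummy.com/software/BeautifulSoup/').title should return the title tag rather than None, much like the requests version above.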