Getting HTML with Pycurl

Sinthet · Jul 2, 2011 · Viewed 21.6k times

I've been trying to retrieve a page of HTML using pycurl so that I can parse it for relevant information using str.split and some for loops. I know pycurl retrieves the HTML, since it prints it to the terminal. However, if I try to do something like

html = str(c.perform())  

The variable will just hold a string which says "None".

How can I use pycurl to get the html, or redirect whatever it sends to the console so it can be used as a string as described above?

Thanks a lot to anyone who has any suggestions!

Answer

Corey Goldberg · Jul 2, 2011

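c.perform() returns None; by default libcurl writes the response body to stdout, which is why you see it in the terminal. To capture it, give pycurl a write callback. This will send a request and store/print the response body: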

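# Python 2 code (StringIO module, print statement)
from StringIO import StringIO
import pycurl

url = 'http://www.google.com/'

# in-memory buffer that collects the response body
storage = StringIO()
c = pycurl.Curl()
c.setopt(c.URL, url)
# pycurl calls storage.write with each chunk of the body
c.setopt(c.WRITEFUNCTION, storage.write)
c.perform()  # returns None; the data arrives through the callback
c.close()
content = storage.getvalue()
print content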
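
The snippet above is Python 2. On Python 3, pycurl hands the callback bytes rather than str, so a rough equivalent could look like the sketch below. The helper names fetch_body and to_text are made up for illustration, and the UTF-8 default is an assumption, not something detected from the response.

```python
from io import BytesIO


def fetch_body(url):
    """Return the raw response body for url as bytes (hypothetical helper)."""
    # Imported inside the function so the decoding helper below can be
    # used even where pycurl is not installed.
    import pycurl

    buffer = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(c.URL, url)
        # pycurl passes each chunk of the body to this callback as bytes
        c.setopt(c.WRITEFUNCTION, buffer.write)
        c.perform()  # returns None; the data arrives through the callback
    finally:
        c.close()
    return buffer.getvalue()


def to_text(body, encoding="utf-8"):
    """Decode the body bytes; the encoding is an assumption, not detected."""
    return body.decode(encoding, errors="replace")
```

Something like to_text(fetch_body('http://www.google.com/')) should then give a str that can be passed to str.split as described in the question.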

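If you want to store the response headers as well, add this before c.perform() (with the same callback, they end up in the same buffer as the body):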

c.setopt(c.HEADERFUNCTION, storage.write)
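
If the headers should stay separate from the body, one option is to give each callback its own buffer. The following is a minimal Python 3 sketch under that assumption; fetch_with_headers and parse_headers are hypothetical names, and the parser is deliberately simple (a repeated header name keeps only its last value).

```python
from io import BytesIO


def fetch_with_headers(url):
    """Return (raw_headers, body) as bytes, each collected in its own buffer."""
    # Imported inside the function so parse_headers can be used
    # even where pycurl is not installed.
    import pycurl

    headers = BytesIO()
    body = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(c.URL, url)
        c.setopt(c.HEADERFUNCTION, headers.write)  # one call per header line
        c.setopt(c.WRITEFUNCTION, body.write)      # one call per body chunk
        c.perform()
    finally:
        c.close()
    return headers.getvalue(), body.getvalue()


def parse_headers(raw):
    """Turn raw header bytes into a dict keyed by lower-cased header name."""
    parsed = {}
    for line in raw.decode("iso-8859-1").splitlines():
        if ":" not in line:
            continue  # skips the status line and the blank separator line
        name, value = line.split(":", 1)
        parsed[name.strip().lower()] = value.strip()
    return parsed
```

With this, parse_headers(raw_headers).get('content-type') can be used to check the declared charset before decoding the body.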