I'm scraping a page with Python's pyquery, and I'm kinda confused by the types it returns, and in particular how to iterate over a list of results.
If my HTML looks a bit like this:
<div class="formwrap">blah blah <h3>Something interesting</h3></div>
<div class="formwrap">more rubbish <h3>Something else interesting</h3></div>
How do I get the inside of the <h3> tags, one by one, so I can process them? I'm trying:
results_page = pq(response.read())
formwraps = results_page(".formwrap")
print type(formwraps)
print type([formwraps])
for my_div in [formwraps]:
    print type(my_div)
    print my_div("h3").text()
This produces:
<class 'pyquery.pyquery.PyQuery'>
<type 'list'>
<class 'pyquery.pyquery.PyQuery'>
Something interesting something else interesting
It looks like there's no actual iteration going on. How can I pull out each element individually?
Extra question from a newbie: what are the square brackets around [a] doing? It looks like it converts a special Pyquery object to a list. Is [] a standard Python operator?
------UPDATE--------
I've found an 'each' function in the pyquery docs. However, I don't understand how to use it for what I want. Say I just want to print out the content of the <h3>. This produces a syntax error: why?
formwraps.each(lambda e: print e("h3").text())
Since pyquery 1.2.3, you can use the items() method of a PyQuery object to go through each matched element as its own PyQuery object:
print(type(formwraps.items()))
for my_div in formwraps.items():
    print(my_div("h3").text())
The items() method returns a generator, and this works on both Python 2 and 3.