I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care about them; I just want to get the internal text.
For example, for:
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
How can I extract:
Red
Blue
Yellow
Light green
Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the internal tags in advance - I want to deal with any that may occur.
Is there a 'just get the visible HTML' type of method in BeautifulSoup?
----UPDATE------
On advice, trying:
soup = BeautifulSoup(open("test.html"))
p_tags = soup.findAll('p',text=True)
for i, p_tag in enumerate(p_tags):
    print str(i) + p_tag
But that doesn't help - it prints out:
0Red
1
2Blue
3
4Yellow
5
6Light
7green
8
Short answer: soup.findAll(text=True)
This has already been answered here on StackOverflow and in the BeautifulSoup documentation.
UPDATE:
To clarify, a working piece of code:
>>> txt = """\
... <p>Red</p>
... <p><i>Blue</i></p>
... <p>Yellow</p>
... <p>Light <b>green</b></p>
... """
>>> import BeautifulSoup
>>> BeautifulSoup.__version__
'3.0.7a'
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> for node in soup.findAll('p'):
...     print ''.join(node.findAll(text=True))
...
Red
Blue
Yellow
Light green
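For anyone on a newer setup, here is a minimal sketch of the same idea. It assumes Beautiful Soup 4 (the bs4 package) and Python 3 rather than the 3.0.7a release shown above. In that version findAll is spelled find_all, and get_text() joins all the text nodes under a tag, so the manual ''.join is not needed. The variable names html and texts are just for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

html = """\
<p>Red</p>
<p><i>Blue</i></p>
<p>Yellow</p>
<p>Light <b>green</b></p>
"""

# Use the parser from the standard library so no extra parser package is needed.
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node beneath the tag,
# so nested tags such as <i> and <b> never have to be named in advance.
texts = [p.get_text() for p in soup.find_all("p")]

for text in texts:
    print(text)
```

This should print the same four lines as the session above. If the paragraphs contain stray whitespace, get_text(strip=True) trims each text fragment, though it also removes the spaces between fragments, so it is probably not what you want for a case like "Light green".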