I try to extract "THIS IS MY TEXT" from the following HTML:
<html>
<body>
<table>
<td class="MYCLASS">
<!-- a comment -->
<a hef="xy">Text</a>
<p>something</p>
THIS IS MY TEXT
<p>something else</p>
</br>
</td>
</table>
</body>
</html>
I tried it this way:
soup = BeautifulSoup(html)
for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
print hit.text
But I get all the text between all nested Tags plus the comment.
Can anyone help me to just get "THIS IS MY TEXT" out of this?
Learn more about how to navigate through the parse tree in BeautifulSoup
. Parse tree has got tags
and NavigableStrings
(as THIS IS A TEXT). An example
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>
To move down the parse tree you have contents
and string
.
contents is an ordered list of the Tag and NavigableString objects contained within a page element
if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]
For the above, that is to say you can get
soup.b.string
# u'one'
soup.b.contents[0]
# u'one'
For several children nodes, you can have for instance
pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']
so here you may play with contents
and get contents at the index you want.