Python BeautifulSoup extract text between element

ɥɔǝnq ɹǝƃloɥ picture ɥɔǝnq ɹǝƃloɥ · May 30, 2013 · Viewed 171.7k times · Source

I try to extract "THIS IS MY TEXT" from the following HTML:

<html>
<body>
<table>
   <td class="MYCLASS">
      <!-- a comment -->
      <a hef="xy">Text</a>
      <p>something</p>
      THIS IS MY TEXT
      <p>something else</p>
      </br>
   </td>
</table>
</body>
</html>

I tried it this way:

soup = BeautifulSoup(html)

for hit in soup.findAll(attrs={'class' : 'MYCLASS'}):
    print hit.text

But I get all the text between all nested Tags plus the comment.

Can anyone help me to just get "THIS IS MY TEXT" out of this?

Answer

kiriloff picture kiriloff · May 30, 2013

Learn more about how to navigate through the parse tree in BeautifulSoup. Parse tree has got tags and NavigableStrings (as THIS IS A TEXT). An example

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

To move down the parse tree you have contents and string.

  • contents is an ordered list of the Tag and NavigableString objects contained within a page element

  • if a tag has only one child node, and that child node is a string, the child node is made available as tag.string, as well as tag.contents[0]

For the above, that is to say you can get

soup.b.string
# u'one'
soup.b.contents[0]
# u'one'

For several children nodes, you can have for instance

pTag = soup.p
pTag.contents
# [u'This is paragraph ', <b>one</b>, u'.']

so here you may play with contents and get contents at the index you want.