Only extracting text from this element, not its children

Dragon · Feb 14, 2011

I want to extract only the text from the top-most element of my soup; however, soup.text also returns the text of all the child elements.

I have:

import BeautifulSoup
soup=BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text

This prints yesno. I want simply 'yes'.

What's the best way of achieving this?

Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.

Answer

jbochi · Feb 14, 2011

What about .find(text=True)?

>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
u'no'

EDIT:

I think that I've understood what you want now. Try this:

>>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
u'yes'
>>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
u'yes'
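
For anyone on Python 3, the same idea can be sketched with the newer bs4 package (Beautiful Soup 4), which is an assumption here since the session above uses the old BeautifulSoup 3 module. recursive=False restricts the search to the tag's direct children, so text inside <b> is skipped. own_text is a hypothetical helper name, and it uses find_all rather than find so that several text fragments around child tags are all collected:

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Return only the text nodes that are direct children of `tag`."""
    # string=True matches text nodes; recursive=False stops the search
    # from descending into child tags such as <b>.
    return "".join(tag.find_all(string=True, recursive=False))


for markup in ("<html>yes<b>no</b></html>", "<html><b>no</b>yes</html>"):
    # The built-in "html.parser" builder is assumed here; other parsers
    # may wrap the content in <body>, which would change the top-level tag.
    soup = BeautifulSoup(markup, "html.parser")
    print(own_text(soup.html))  # yes
```

Older bs4 releases spell the string= argument as text=, as in the answer above.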