I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. But BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed:
>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n Some paragraph\n </p>'
This solution is a hack. There should be a better, more obvious way to do it.
Do you mean getting everything in between the body tags?
In this case you can use:
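import urllib2  # Python 2; on Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

# 'some_site' is a placeholder for the URL to fetch
page = urllib2.urlopen('some_site').read()
soup = BeautifulSoup(page, 'html.parser')
body = soup.find('body')
# Direct children of <body> that are tags; bare text nodes are not included
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)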
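If what you want is the inner HTML as a single string rather than a list of tags, decode_contents() may be closer to what you're after. Here is a minimal sketch, assuming a complete document and the built-in html.parser; the sample markup and the variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = (
    "<html><head><title>t</title></head>"
    "<body><p>Some paragraph</p>tail text</body></html>"
)

# Naming the parser explicitly avoids depending on whichever one is installed.
soup = BeautifulSoup(html, "html.parser")

# decode_contents() renders the children of a tag, without the tag itself.
inner_html = soup.body.decode_contents()

# Roughly the same thing by hand: join the string form of each direct child.
# .contents includes text nodes, unlike findChildren().
joined = "".join(str(child) for child in soup.body.contents)

print(inner_html)
```

Note that html.parser does not add html, head, or body tags to a bare fragment, so soup.body would be None for input like '<p>Some paragraph</p>'; in that case the soup itself already holds just the fragment.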