Currently I have code that does something like this:
soup = BeautifulSoup(value)
for tag in soup.findAll(True):
if tag.name not in VALID_TAGS:
tag.extract()
soup.renderContents()
Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep the contents inside when calling soup.renderContents()?
Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So, you could do something like this:
html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags:
for match in soup.findAll(tag):
match.replaceWithChildren()
print soup
Looks like it behaves like you want it to and is fairly straightforward code (although it does make a few passes through the DOM, but this could easily be optimized.)