Python code to remove HTML tags from a string

Bruno Rocha - rochacbruno picture Bruno Rocha - rochacbruno · Mar 12, 2012 · Viewed 238.8k times · Source

I have a text like this:

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

using pure Python, with no external module I want to have this:

>>> print remove_tags(text)
Title A long text..... a link

I know I can do it using lxml.html.fromstring(text).text_content() but I need to achieve the same in pure Python using builtin or std library for 2.6+

How can I do that?

Answer

c24b picture c24b · Oct 19, 2012

Using a regex

Using a regex, you can clean everything inside <> :

import re

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

Some HTML texts can also contain entities, that are not enclosed in brackets such as '&nsbm'. If that is the case then you might want to write the regex as

cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

This link contains more details on this.

Using BeautifulSoup

You could also use BeautifulSoup additional package to find out all the raw text

You will need to explicitly set a parser when calling BeautifulSoup I recommend "lxml" as mentioned in alternative answers (much more robust than the default one (i.e. available without additional install) 'html.parser'

from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text

But it doesn't prevent you from using external libraries, so I recommend the first solution.