Decode HTML entities in Python string?

jkp picture jkp · Jan 18, 2010 · Viewed 234.8k times · Source

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m".

Answer

luc picture luc · Jan 18, 2010

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI html.parser.HTMLParser.unescape is deprecated, and was supposed to be removed in 3.5, although it was left in by mistake. It will be removed from the language soon.


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m