Decoding HTML entities with Python

python unicode character-encoding content-type beautifulsoup

KeyboardInterrupt · Jul 30, 2009 · Viewed 21.5k times

I'm trying to decode HTML entities in headlines from NYTimes.com and I cannot figure out what I am doing wrong.

Take for example:

"U.S. Adviser&#8217;s Blunt Memo on Iraq: Time &#8216;to Go Home&#8217;"

I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success.

Answer

>>> from HTMLParser import HTMLParser
>>> print HTMLParser().unescape('U.S. Adviser&#8217;s Blunt Memo on Iraq: '
...                             'Time &#8216;to Go Home&#8217;')
U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’

HTMLParser.unescape() is undocumented in Python 2. Python 3.4+ fixes this by exposing the same functionality as the documented function html.unescape().
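For reference, here is a minimal sketch of the same decoding that should run on either Python 2 or Python 3.4+. The try/except import and the variable names headline and decoded are my own illustration, not part of the original answer; on Python 3 alone, a plain "from html import unescape" is enough.

```python
# -*- coding: utf-8 -*-
try:
    # Python 3.4+: documented standard-library function
    from html import unescape
except ImportError:
    # Python 2 fallback: the undocumented HTMLParser method
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# The headline from the question, with numeric character references
headline = ("U.S. Adviser&#8217;s Blunt Memo on Iraq: "
            "Time &#8216;to Go Home&#8217;")

# &#8217; and &#8216; become the curly quotes U+2019 and U+2018
decoded = unescape(headline)
```

After this runs, decoded holds the headline with real curly quotes in place of the &#8217; and &#8216; references, matching the output shown in the answer.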
