How to parse unicode strings with minidom?

dariopy picture dariopy · Mar 16, 2011 · Viewed 15.8k times · Source

I'm trying to parse a bunch of xml files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parsestring():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?

Answer

Blender picture Blender · Mar 16, 2011

Try to decode it:

> print u'abcdé'.encode('utf-8')
> abcdé

> print u'abcdé'.encode('utf-8').decode('utf-8')
> abcdé