Write xml utf-8 file with utf-8 data with ElementTree

Question 1

Write xml utf-8 file with utf-8 data with ElementTree

python elementtree

c0m4 · Apr 6, 2012 · Viewed 21.4k times · Source

Answer

Answer

codecs.open expects Unicode strings to be written to the file object and it will handle encoding to UTF-8. ElementTree's write encodes the Unicode strings to UTF-8 byte strings before sending them to the file object. Since the file object wants Unicode strings, it is coercing the byte string back to Unicode using the default ascii codec and causing the UnicodeDecodeError.

Just do this:

#expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write('testunicode.xml',encoding="UTF-8",xml_declaration=True)
#expfile.close()

Question 2

I'm trying to write an xml file with utf-8 encoded data using ElementTree like this:

#!/usr/bin/python                                                                       
# -*- coding: utf-8 -*-                                                                   

import xml.etree.ElementTree as ET
import codecs

testtag = ET.Element('unicodetag')
testtag.text = u'Töreboda' #The o is really &#246; (o with two dots over). No idea why SO dont display this
expfile = codecs.open('testunicode.xml',"w","utf-8-sig")
ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
expfile.close()

This blows up with the error

Traceback (most recent call last):
  File "unicodetest.py", line 10, in <module>
    ET.ElementTree(testtag).write(expfile,encoding="UTF-8",xml_declaration=True)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 815, in write
    serialize(write, self._root, encoding, qnames, namespaces)    
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Using the "us-ascii" encoding instead works fine, but don't preserve the unicode characters in the data. What is happening?

Write xml utf-8 file with utf-8 data with ElementTree

Answer

Related questions