How to create a temporary file with Unicode encoding?

dbarbosa picture dbarbosa · May 8, 2012 · Viewed 15.1k times · Source

When I use open() to open a file, I am not able to write unicode strings. I have learned that I need to use codecs and open the file with Unicode encoding (see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data).

Now I need to create some temporary files. I tried to use the tempfile library, but it doesn't have any encoding option. When I try to write any unicode string in a temporary file with tempfile, it fails:

#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
  fh.write(u"Hello World: ä")
  fh.seek(0)
  for line in fh:
    print line

How can I create a temporary file with Unicode encoding in Python?

Edit:

  1. I am using Linux and the error message that I get for this code is:

    Traceback (most recent call last):
      File "tmp_file.py", line 5, in <module>
        fh.write(u"Hello World: ä")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 13: ordinal not in range(128)
    
  2. This is just an example. In practice I am trying to write a string that some API returned.

Answer

dfb picture dfb · May 8, 2012

Everyone else's answers are correct, I just want to clarify what's going on:

The difference between the literal 'foo' and the literal u'foo' is that the former is a string of bytes and the latter is the Unicode object.

First, understand that Unicode is the character set. UTF-8 is the encoding. The Unicode object is the about the former—it's a Unicode string, not necessarily a UTF-8 one. In your case, the encoding for a string literal will be UTF-8, because you specified it in the first lines of the file.

To get a Unicode string from a byte string, you call the .encode() method:

>>>> u"ひらがな".encode("utf-8") == "ひらがな"
True

Similarly, you could call your string.encode in the write call and achieve the same effect as just removing the u.

If you didn't specify the encoding in the top, say if you were reading the Unicode data from another file, you would specify what encoding it was in before it reached a Python string. This would determine how it would be represented in bytes (i.e., the str type).

The error you're getting, then, is only because the tempfile module is expecting a str object. This doesn't mean it can't handle unicode, just that it expects you to pass in a byte string rather than a Unicode object—because without you specifying an encoding, it wouldn't know how to write it to the temp file.