Usage of unicode() and encode() functions in Python

xralf · Apr 23, 2012

I have a problem with the encoding of the path variable when inserting it into an SQLite database. I tried to solve it with the encode("utf-8") method, which didn't help. Then I used the unicode() function, which gives me type unicode.

print type(path)                  # <type 'unicode'>
path = path.replace("one", "two") # <type 'str'>
path = path.encode("utf-8")       # <type 'str'> strange
path = unicode(path)              # <type 'unicode'>

I finally got the unicode type, but I still get the same error that occurred when the path variable was of type str:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

Could you help me solve this error and explain the correct usage of encode("utf-8") and unicode()? I often struggle with them.

EDIT:

This execute() statement raised the error:

cur.execute("update docs set path = :fullFilePath where path = :path", locals())

I forgot to change the encoding of the fullFilePath variable, which suffers from the same problem, but I'm quite confused now. Should I use only unicode(), only encode("utf-8"), or both?

I can't use

fullFilePath = unicode(fullFilePath.encode("utf-8"))

because it raises this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 32: ordinal not in range(128)

Python version is 2.7.2

Answer

newtover · Apr 23, 2012

In Python 2, str represents text as bytes, while unicode represents text as characters.

You decode bytes (str) into unicode, and you encode unicode into bytes (str) using a specific encoding.

That is:

>>> 'abc'.decode('utf-8')  # str to unicode
u'abc'
>>> u'abc'.encode('utf-8') # unicode to str
'abc'
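
This also explains the UnicodeDecodeError in the question: in Python 2, calling unicode() on a str without naming an encoding (or calling encode() on a str, which first decodes it implicitly) uses the ASCII codec, so any non-ASCII byte such as 0xc5 fails. Below is a minimal sketch of the same failure, written for Python 3, where the decode step is always explicit. The file name is a made-up example, not the asker's actual path.

```python
# Hypothetical file name; its UTF-8 form contains the byte 0xc5.
raw = "příklad.txt".encode("utf-8")   # bytes, as they might arrive from the OS

# Decoding with the ASCII codec fails on the first non-ASCII byte,
# which is what the implicit conversion in Python 2 effectively does.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Naming the real encoding gives the text back without error.
text = raw.decode("utf-8")
print(repr(ascii_error))
print(text)
```

In Python 2 the equivalent fix would be path.decode("utf-8") or unicode(path, "utf-8"), applied once to a str, rather than encode() followed by unicode().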

UPD Sep 2020: This answer was written when Python 2 was still the most widely used version. In Python 3, the byte-string type is called bytes, and the text type formerly called unicode is called str.

>>> b'abc'.decode('utf-8') # bytes to str
'abc'
>>> 'abc'.encode('utf-8')  # str to bytes
b'abc'
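
Applied to the SQLite case from the question, one possible approach is to decode any byte strings once, at the boundary, and pass only text to execute(). Below is a rough sketch for Python 3. The docs table, the path column, and the variable names come from the question. The to_text helper, the in-memory database, and the sample path are assumptions made for illustration.

```python
import sqlite3

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as str, decoding bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table docs (path text)")

# Sample path given as UTF-8 bytes; decode it once, then work with text only.
path = to_text(b"/data/one/p\xc5\x99\xc3\xadklad.txt")
cur.execute("insert into docs (path) values (:path)", {"path": path})

# Both parameters are text, so sqlite3 accepts them without complaint.
fullFilePath = path.replace("one", "two")
cur.execute(
    "update docs set path = :fullFilePath where path = :path",
    {"fullFilePath": fullFilePath, "path": path},
)
conn.commit()

stored = cur.execute("select path from docs").fetchone()[0]
print(stored)
```

Passing an explicit dictionary instead of locals() is a design choice in this sketch: it makes it obvious which values reach the database and which of them still need decoding.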