How to read excel Unicode characters using Python

Trebia Project. picture Trebia Project. · Apr 21, 2015 · Viewed 12.1k times · Source

I am receiving an Excel file whose content I cannot influence. It contains some Unicode characters like "á" or "é".

My code has been unchanged, but I migrated from Eclipse Juno to LiClipse together to a migration to a different python package (2.6 from 2.5). In principle the specific package I am using has a working version on win32com package.

When I read the Excel file my code is crashing when extracting and converting to to strings using str(). The console output is the following:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 89: ordinal not in range(128)

Being more concrete I perform the following:

Read the Excel:

  xlApp = Dispatch("Excel.Application")

  excel = xlApp.Workbooks.Open(excel_location)

in an internal loop I extract the value of the cell:

cell_value = self.excel.ActiveSheet.Cells(excel_line + 1, excel_column + 1)

and finally, if I try to convert cell_value to str, crashes:

print str(cell_value)

If I go to the Excel and remove the non-ASCII characters everything is working smoothly. I have tried this encode proposal. Any other solution I have googled proposes saving the file in a specific format, that I can't do.

What puzzles me is that the code was working before with the same input Excel but this change to LiClipse and 2.6 Python killed everything.

Any idea how can I progress?

Answer

Aaron Digulla picture Aaron Digulla · Apr 21, 2015

This is a common problem when working with UTF-8 encoded Unicode data in Python 2.x. The handling of this has changed in a few places between 2.4 and 2.7, so it's no surprise that you suddenly get an error.

The source of the error is print: In Python 2.x, print doesn't try to assume what encoding your terminal supports. It just plays save and assumes that ascii is the only supported charset (which means characters between 0 and 127 are fine, everything else gives an error).

Now you convert a COMObject to a string. str is just a bunch of bytes (values 0 to 255) as far as Python 2.x is concerned. It doesn't have an encoding.

Combining the two is a recipe for trouble. When Python prints, it tries to validate the input (the string) and suddenly finds UTF-8 encoded characters (UTF-8 adds these odd \xe1 markers which tells the decoder that the next byte is special in some way; check Wikipedia for the gory details).

That's when the ascii encoder says: Sorry, can't help you there.

That means you can work with this value, compare it and such, but you can't print it. A simple fix for the printing problem is:

s = str(cell_value) # Convert COM -> UTF-8 encoded string
print repr(s) # repr() converts anything to ascii

If your terminal supports UTF-8, then you need to tell Python about it:

import sys
import codecs

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

You should also have a look at sys.stdout.encoding which tells what Python currently thinks the output encoding is/should be. When Python 2 is properly configured (like on modern Linux distributions), then the correct codec for output should be used automatically.

Related: