Unicode to UTF8 for CSV Files - Python via xlrd

anschauung picture anschauung · Jul 27, 2009 · Viewed 25.6k times · Source

I'm trying to translate an Excel spreadsheet to CSV using the Python xlrd and csv modules, but am getting hung up on encoding issues. Xlrd produces output from Excel in Unicode, and the CSV module requires UTF-8.

I imaging that this has nothing to do with the xlrd module: everything works fine outputing to stdout or other outputs that don't require a specific encoding.

The worksheet is encoded as UTF-16-LE, according to book.encoding

The simplified version of what I'm doing is:

from xlrd import *
import csv
b = open_workbook('file.xls')
s = b.sheet_by_name('Export')
bc = open('file.csv','w')
bcw = csv.writer(bc,csv.excel,b.encoding)
for row in range(s.nrows):
    this_row = []
    for col in range(s.ncols):
        this_row.append(s.cell_value(row,col))
    bcw.writerow(this_row)

This produces the following error, about 740 lines in:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 5: ordinal not in range(128)

The value is seems to be getting hung up on is "516-777316" -- the text in the original Excel sheet is "516-7773167" (with a 7 on the end)

I'll be the first to admit that I have only a vague sense of how character encoding works, so most of what I've tried so far are various fumbling permutations of .encode and .decode on the s.cell_value(row,col)

If someone could suggest a solution I would appreciate it -- even better if you could provide an explanation of what's not working and why, so that I can more easily debug these problems myself in the future.

Thanks in advance!

EDIT:

Thanks for the comments so far.

When I user this_row.append(s.cell(row,col)) (e.g. s.cell instead of s.cell_value) the entire document writes without errors.

The output isn't particularly desirable (text:u'516-7773167'), but it avoids the error even though the offending characters are still in the output.

This makes me think that the challenge might be in xlrd after all.

Thoughts?

Answer

Alex Martelli picture Alex Martelli · Jul 27, 2009

I expect the cell_value return value is the unicode string that's giving you problems (please print its type() to confirm that), in which case you should be able to solve it by changing this one line:

this_row.append(s.cell_value(row,col))

to:

this_row.append(s.cell_value(row,col).encode('utf8'))

If cell_value is returning multiple different types, then you need to encode if and only if it's returning a unicode string; so you'd split this line into a few lines:

val = s.cell_value(row, col)
if isinstance(val, unicode):
    val = val.encode('utf8')
this_row.append(val)