How to read excel cell and retain or detect its format in Python

Ian picture Ian · Apr 20, 2016 · Viewed 16.7k times · Source

I am given the an excel file which contains some text formatting. Some can be bold, some italic, some are supercase1, and some other formats (but not as many as the three mentioned).

Examples:

  • Ku'lah 2ku.lah v; definition: some def; usage: some usage;
  • He'lahsa 2he.lah.sa n; definition: some def; usage: some usage;
  • And so on

Now, since this cell is to be made as dictionary (real, human, dictionary) database entry, I would like to retain the format of the cell, as it will be beneficial to tell the usage of the word (such as bold in the above case indicating the word type: v (verb) and italic indicating new section).

But it is all in the excel cell.

When I try to simply read the excel file directly using database tool like Toad for Oracle, the format is gone!

  1. Is there any way to read the excel file and yet retain the format?
  2. Alternatively, is there any way to detect the formatting? As long as we can detect the format, I can simply replace the text with some HTML format like <b>v</b> and that will be my work. I only want to know how we retain or detect the excel cell text format in Python. (in particular are these three formats: bold, italic, and supercase)

Edit:

I try to get the text format with xlrd package, but I can't seem to find a way to get the text format style as the cell object only consists of: ctype, value, and xf_index. It has no info about the text format, and when I create the instance with the formatting_info=True:

book = xlrd.open_workbook("HuluHalaDict.xlsx", sys.stdout, 0, xlrd.USE_MMAP, None, None, \
                          formatting_info=True, on_demand=False, ragged_rows=False)

I got the following error:

NotImplementedError: formatting_info=True not yet implemented

Raised by this line in the xlsx.py file of the xlrd package:

if formatting_info:
    raise NotImplementedError("formatting_info=True not yet implemented")

Which I found it strange, since I use version 0.9.4 xlrd (latest) and the documentation says that since version above 0.6.1, the formatting info is included:

Default Formatting

Default formatting is applied to all empty cells (those not described by a cell record). Firstly row default information (ROW record, Rowinfo class) is used if available. Failing that, column default information (COLINFO record, Colinfo class) is used if available. As a last resort the worksheet/workbook default cell format will be used; this should always be present in an Excel file, described by the XF record with the fixed index 15 (0-based). By default, it uses the worksheet/workbook default cell style, described by the very first XF record (index 0). Formatting features not included in xlrd version 0.6.1

Rich text i.e. strings containing partial bold italic and underlined text, change of font inside a string, etc. See OOo docs s3.4 and s3.2 Asian phonetic text (known as "ruby"), used for Japanese furigana. See OOo docs s3.4.2 (p15) Conditional formatting. See OOo docs s5.12, s6.21 (CONDFMT record), s6.16 (CF record) Miscellaneous sheet-level and book-level items e.g. printing layout, screen panes. Modern Excel file versions don't keep most of the built-in "number formats" in the file; Excel loads formats according to the user's locale. Currently xlrd's emulation of this is limited to a hard-wired table that applies to the US English locale. This may mean that currency symbols, date order, thousands separator, decimals separator, etc are inappropriate. Note that this does not affect users who are copying XLS files, only those who are visually rendering cells.

Did I make any mistake here? My code is simply as shown:

book = xlrd.open_workbook("HuluHalaDict.xlsx", sys.stdout, 0, xlrd.USE_MMAP, None, None, \
                          formatting_info=True, on_demand=False, ragged_rows=False)

Edit 2:

The example shown in the post shows that it creates the class instance (book) with formatting_info=True. But I check it in my implementation. It raises the error above. Any idea?

Answer

alec_djinn picture alec_djinn · Apr 20, 2016

I suggest you the library xlrd https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966

On GitHub here https://github.com/python-excel/xlrd

You can find an easy example on how to use xlrd to determine the font style here Using XLRD module and Python to determine cell font style (italics or not)

Here a practical example:

from xlrd import open_workbook

path = '/Users/.../Desktop/Workbook1.xls'
wb = open_workbook(path, formatting_info=True)
sheet = wb.sheet_by_name("Sheet1")
cell = sheet.cell(0, 0) # The first cell
print("cell.xf_index is", cell.xf_index)
fmt = wb.xf_list[cell.xf_index]
print("type(fmt) is", type(fmt))
print("Dumped Info:")
fmt.dump()

It outputs the following:

cell.xf_index is 62
type(fmt) is <class 'xlrd.formatting.XF'>
Dumped Info:
_alignment_flag: 0
_background_flag: 0
_border_flag: 0
_font_flag: 1
_format_flag: 0
_protection_flag: 0
alignment (XFAlignment object):
    hor_align: 0
    indent_level: 0
    rotation: 0
    shrink_to_fit: 0
    text_direction: 0
    text_wrapped: 0
    vert_align: 2
background (XFBackground object):
    background_colour_index: 65
    fill_pattern: 0
    pattern_colour_index: 64
border (XFBorder object):
    bottom_colour_index: 0
    bottom_line_style: 0
    diag_colour_index: 0
    diag_down: 0
    diag_line_style: 0
    diag_up: 0
    left_colour_index: 0
    left_line_style: 0
    right_colour_index: 0
    right_line_style: 0
    top_colour_index: 0
    top_line_style: 0
font_index: 6
format_key: 0
is_style: 0
lotus_123_prefix: 0
parent_style_index: 0
protection (XFProtection object):
    cell_locked: 1
    formula_hidden: 0
xf_index: 62

Where _font_flag: 1 indicates that is Bold