I'm using xlrd
to process .xls files, and openpyxl
to process .xlsx files, and this is working well.
Then I'm handed what is ostensibly a .xls file, so I try to xlrd.open_workbook()
, and get:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'
I take a look at this question, and I surmise that my file, although ending with extension .xls, must actually be a .xlsx. And indeed, I can view it in a text editor:
<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
:
:
:
(for privacy reasons, I can't post the whole file, but it's probably not required for our analysis).
So I surmise that if I just copy (cp
) it to a .xlsx, I should be able to open it with openpyxl.load_workbook()
, but I get:
BadZipfile: File is not a zip file
If it's actually an xls (unlikely) but can't be opened with xlrd
, and if it is atcually an xlsx but can't be opened with openpyxl
, even after I cp
it to a .xlsx, what to do?
Note: If I open up the .xls in Excel, save it as a .xlsx, and retry with openpyxl
, it does load fine, but this manual step is not a luxury I will have in the executing of my program.
One thing is clear: The file you're trying to open has a different format than its extension suggests.
As you already know, Excel file formats include (but are not limited to) xls
and xlsx
.
The Excel 2003 format (xls
) is a binary format. This means that if you open a xls
file with a text editor, you'll just see gibberish.
The Excel 2007 format (xlsx
) is quite different. A xlsx
file is a zip file with a bunch of XML files inside. You can use a zip archiver to extract the contents of the xlsx
file. Then, you can edit the XML files using any text editor. However, opening a xlsx
file directly with a text editor is like opening a zip
file with a text editor: You'll just see gibberish.
The fact that you can open your file with a text editor (and read its contents) shows that it's neither a xls
file nor a xlsx
file. Your file is neither a binary file nor a zip file, it's a plain XML file.
Moreover, this error message says a lot.
BadZipfile: File is not a zip file
It means that openpyxl
is trying to open your file as a xlsx
file and therefore a zip
file. But when it tries to extract its contents, it fails, because your file isn't even a zip file.
But if the file is neither a xlsx
file nor a xls
file, how can Microsoft Excel read it? I wondered that too. After some research, I believe your file has the XML Spreadsheet 2003 file format. This example looks very similar to the file content you posted. Since Microsoft Excel supports this format, it's no wonder that it can read your file.
Unfortunately, Python libraries such as xlrd
and openpyxl
only support xls
and xlsx
file formats, so they won't be able to read your file. I think you'll just have to manually convert it to a supported format.