I'm making a Python script which parses data files. The parsed data is then sent to an Excel file. The data can be rather huge. I'm looking at 10 to 20 columns, but the number of rows can be well over 100.000.
Writing this amount of data to Excel with win32com takes a considerable amount of time. My first step was to iterate over every cell in the Excel file, which is very time consuming. After some digging I found out how I can write a row with a single call, reducing the time needed drastically.
However, when I need to send 100.000 rows of data to Excel, it will still take a lot of time. I'm pretty sure that I can further speed this up, by sending the complete array with a single call. However I'm not able to achieve this so far.
See the code below, which demonstrates the issue. The code shows the differences in time. However, the third step, sending the complete array with a single call to a range is not resulting in the correct data in Excel. What am I doing wrong?
import win32com.client
import time
#create data array
row = range(0,10)
i = 0
data_array = []
while i < 1000:
data_array.append(row)
i += 1
#write the array to an excel file
excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True
excel.ScreenUpdating = False
book = excel.Workbooks.Add()
sh1 = book.Worksheets(1)
sh2 = book.Worksheets(2)
sh3 = book.Worksheets(3)
#double loop, writing individual cells
print "Writing with double loop to inidividual cells."
start = time.time()
row = 0
for line in data_array:
row += 1
col = 0
for field in line:
col += 1
sh1.Cells(row, col).Value = field
print "Processing time: " + str(time.time() - start) + " seconds."
#single loop, writing a row to a range
print "Writing with double loop to inidividual cells."
start = time.time()
row = 0
for line in data_array:
row += 1
sh2.Range(sh2.Cells(row,1), sh2.Cells(row, len(line))).Value = line
print "Processing time: " + str(time.time() - start) + " seconds."
#no loop, write array to range
print "Writing with double loop to inidividual cells."
start = time.time()
try:
sh3.Range(sh3.Cells(row,1), sh3.Cells(len(data_array), len(data_array[0]))).Value = data_array
print "Processing time: " + str(time.time() - start) + " seconds."
except:
print "Failed to write array to excel file."
excel.ScreenUpdating = True
sheet = None
book = None
excel.Quit()
excel = None
I have been doing some research on this. And have been coming to some interesting conclusions.
There are indeed multiple solutions around to write data to excel from python. I ended up focussing on three modules.
win32com.client
Works slow. However the document can be opened in excel. So the end result is available for the user in excel to start working on. Not great for large amounts of data.
Writing 10,000 rows with 10 columns takes on my computer (core i5) app. 70 seconds.
import win32com.client
import time
#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
data_array.append(row)
i += 1
#write the array to an excel file
excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True
excel.ScreenUpdating = False
book = excel.Workbooks.Add()
sheet = book.Worksheets(1)
#single loop, writing a row to a range
print "Writing using win32com.client"
start = time.time()
row = 0
for line in data_array:
row += 1
sheet.Range(sheet.Cells(row,1), sheet.Cells(row, len(line))).Value = line
print "Processing time: " + str(time.time() - start) + " seconds."
print "Completed: " + str(time.time() - start) + " seconds."
excel.ScreenUpdating = True
sheet = None
book = None
excel.Quit()
excel = None
openpyxl
Is a bit faster, but still not great. This module is slow on transfering the data to the excel memory object, but save it really fast. It creates the 10,000 lines with 10 columns in 22.3 seconds, and saves the file in another 0.5 seconds. When I test this with 100,000 rows and 10 columns. The data is created in 228.3 seconds, and saving the file is done in another 2.9 seconds. Fairly slow, but the file saving is fast. Hence openpyxl is probably good for making changes (formatting) to existing data, although I have not testing this yet. Another advantage is that coding with openpyxl is easier then using win32com.client.
import openpyxl
import sys
import time
#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
data_array.append(row)
i += 1
#create an excel workbook and sheet object
book = openpyxl.Workbook(optimized_write = True)
#single loop, writing rows
start = time.time()
print "Writing with single loop using openpyxl"
sheet = book.create_sheet()
for line in data_array:
sheet.append(line)
print "Processing time: " + str(time.time() - start) + " seconds."
#close and save the file.
book.save('openpyxl.xlsx')
print "Total time: " + str(time.time() - start) + " seconds."
I am facing another issue with openpyxl. In my real tool, openpyxl is having problems with saving big amounts of data (> 10,000 rows). This I have not figured out yet, and maybe I won't bother to look further in to it.
PyExcelerate
This one is first of all fast. It creates the 10,000 rows and 10 columns in just 0.17 seconds. Saving the file however takes 2.2 seconds. By far the fastest option of the three. When I try to save 100,000 rows and 10 columns with this module, the excel data is created in just 1.8 seconds. However saving the file now takes 21.7 seconds. Hence, this module is really fast, but there is a penalty in writing the file. Overall still the fastest of them all. Another advantage of PyExcelerate is that coding is very easy, easier again as openpyxl.
import pyexcelerate
import sys
import time
#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
data_array.append(row)
i += 1
print sys.version
#create an excel workbook and sheet object
book = pyexcelerate.Workbook()
#single loop, writing rows
start = time.time()
print "Writing with single loop using PyExcelerate"
book.new_sheet("data", data = data_array)
print "Processing time: " + str(time.time() - start) + " seconds."
#close and save the file.
book.save('pyexcelerate.xlsx')
print "Total time: " + str(time.time() - start) + " seconds."
So my conclusion is that PyExcelerate is by far the fastest. The advantage of win32com.client is that the created excel file can be openend in excel to make the created data available for the user to start working with it. Openpyxl could be interesting to make style changes after creating it. However, I have not tested this yet. Hence, it could be beneficial to combine win32com.client, openpyxl and PyExcelerate in one application.