I have written a script which has to read lot of excel files from a folder (around 10,000). This script loads the excel file (some of them has more than 2,000 rows) and reads one column to count the number of rows (checking stuff). If the number of rows is not equal to a given number, it writes the warning in a log.
The problem comes when the script reads more than 1,000 excel files. It's then when it throws memory error, and I don't know where could be the problem. Previously, the script read two csv file with 14,000 rows and stores it in a list. These lists contain an identificator for the excel file and its respective number of rows. If this number of rows is not equal to the number of rows of the excel file, it writes the warning. Could be the problem the reading of these lists?
I'm using openpyxl to load the workbooks, do I need to close them before open the next?
This is my code:
# -*- coding: utf-8 -*-
import os
from openpyxl import Workbook
import glob
import time
import csv
from time import gmtime,strftime
from openpyxl import load_workbook
folder = ''
conditions = 0
a = 0
flight_error = 0
condition_error = 0
typical_flight_error = 0
SP_error = 0
cond_numbers = []
with open('Conditions.csv','rb') as csv_name: # Abre el fichero csv donde estarán las equivalencias
csv_read = csv.reader(csv_name,delimiter='\t')
for reads in csv_read:
cond_numbers.append(reads)
flight_TF = []
with open('vuelo-TF.csv','rb') as vuelo_TF:
csv_read = csv.reader(vuelo_TF,delimiter=';')
for reads in csv_read:
flight_TF.append(reads)
excel_files = glob.glob('*.xlsx')
for excel in excel_files:
print "Leyendo excel: "+excel
wb = load_workbook(excel)
ws = wb.get_sheet_by_name('Control System')
flight = ws.cell('A7').value
typical_flight = ws.cell('B7').value
a = 0
for row in range(6,ws.get_highest_row()):
conditions = conditions + 1
value_flight = int(ws.cell(row=row,column=0).value)
value_TF = ws.cell(row=row,column=1).value
value_SP = int(ws.cell(row=row,column=4).value)
if value_flight == '':
break
if value_flight != flight:
flight_error = 1 # Si no todos los flight numbers dentro del vuelo son iguales
if value_TF != typical_flight:
typical_flight_error = 2 # Si no todos los typical flight dentro del vuelo son iguales
if value_SP != 100:
SP_error = 1
for cond in cond_numbers:
if int(flight) == int(cond[0]):
conds = int(cond[1])
if conds != int(conditions):
condition_error = 1 # Si el número de condiciones no se corresponde con el esperado
for vuelo_TF in flight_TF:
if int(vuelo_TF[0]) == int(flight):
TF = vuelo_TF[1]
if typical_flight != TF:
typical_flight_error = 1 # Si el vuelo no coincide con el respectivo typical flight
if flight_error == 1:
today = datetime.datetime.today()
time = today.strftime(" %Y-%m-%d %H.%M.%S")
log = open('log.txt','aw')
message = time+': Los flight numbers del vuelo '+str(flight)+' no coinciden.\n'
log.write(message)
log.close()
flight_error = 0
if condition_error == 1:
today = datetime.datetime.today()
time = today.strftime(" %Y-%m-%d %H.%M.%S")
log = open('log.txt','aw')
message = time+': El número de condiciones del vuelo '+str(flight)+' no coincide. Condiciones esperadas: '+str(int(conds))+'. Condiciones obtenidas: '+str(int(conditions))+'.\n'
log.write(message)
log.close()
condition_error = 0
if typical_flight_error == 1:
today = datetime.datetime.today()
time = today.strftime(" %Y-%m-%d %H.%M.%S")
log = open('log.txt','aw')
message = time+': El vuelo '+str(flight)+' no coincide con el typical flight. Typical flight respectivo: '+TF+'. Typical flight obtenido: '+typical_flight+'.\n'
log.write(message)
log.close()
typical_flight_error = 0
if typical_flight_error == 2:
today = datetime.datetime.today()
time = today.strftime(" %Y-%m-%d %H.%M.%S")
log = open('log.txt','aw')
message = time+': Los typical flight del vuelo '+str(flight)+' no son todos iguales.\n'
log.write(message)
log.close()
typical_flight_error = 0
if SP_error == 1:
today = datetime.datetime.today()
time = today.strftime(" %Y-%m-%d %H.%M.%S")
log = open('log.txt','aw')
message = time+': Hay algún Step Percentage del vuelo '+str(flight)+' menor que 100.\n'
log.write(message)
log.close()
SP_error = 0
conditions = 0
The if statements of the end are for checking and writing warning logs.
I'm using windows xp with 8 gb RAM and intel xeon w3505 (two cores, 2,53 GHz).
The default implementation of openpyxl will store all the accessed cells into memory. I will suggest you to use the Optimized reader (link - https://openpyxl.readthedocs.org/en/latest/optimized.html) instead
In code:-
wb = load_workbook(file_path, use_iterators = True)
While loading a workbook pass use_iterators = True
. Then access the sheet and cells like:
for row in sheet.iter_rows():
for cell in row:
cell_text = cell.value
This will reduce the memory footprint to 5-10%
UPDATE: In version 2.4.0 use_iterators = True
option is removed. In newer versions openpyxl.writer.write_only.WriteOnlyWorksheet
is introduced for dumping large amounts of data.
from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()
# now we'll fill it with 100 rows x 200 columns
for irow in range(100):
ws.append(['%d' % i for i in range(200)])
# save the file
wb.save('new_big_file.xlsx')
Not tested the below code just copied from the above link.
Thanks @SdaliM for the information.