I'm trying to create a progress bar for reading Excel data into pandas using tqdm. I can do this easily with a CSV using the chunksize argument, like so:
data_reader = pd.read_csv(path, chunksize=1000)
df_list = []
for chunk in tqdm(data_reader, total=200):
    df_list.append(chunk)
Which updates the progress bar for every chunk of 1,000 rows out of 200 chunks in total. pd.read_excel, however, no longer has a chunksize argument. Is there an alternative?
Edit: I've read the question about reading an Excel file in chunks (Reading a portion of a large xlsx file with python); however, read_excel no longer has a chunksize argument, and pd.ExcelFile.parse is equivalent. I am wondering whether there is an alternative to the chunksize argument, or another way to create an iterable to loop over chunks while they are being read in.
If you want to add a progress indicator, you could use the .tell() method of file objects. That's of course not quite accurate, but maybe it gives your users enough accuracy to estimate how long a coffee break they can take :-)
So here is the plan: basically, open your Excel file with open and pass the resulting file object to pd.read_excel. According to the docs this should be possible, and I just verified it with a simple example for an xlsx file.
At the beginning you evaluate the size of the file, e.g. by:

import io

fp = open(path, "rb")       # open the xlsx file in binary mode
fp.seek(0, io.SEEK_END)     # set the file cursor to the end of the file
fp_len = fp.tell()          # the cursor position is now the file size in bytes
fp.seek(0, io.SEEK_SET)     # set the file cursor back to the beginning
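The sizing step can be exercised stand-alone, e.g. against a temporary file (the tempfile and the 4096-byte payload below are just stand-ins for your real xlsx file):

```python
import io
import tempfile

# Stand-in for the real xlsx file: a temporary file with a known size.
with tempfile.TemporaryFile() as fp:
    fp.write(b"x" * 4096)       # pretend this is the xlsx content

    fp.seek(0, io.SEEK_END)     # cursor to the end of the file
    fp_len = fp.tell()          # position == file size in bytes
    fp.seek(0, io.SEEK_SET)     # cursor back to the start

    print(fp_len)               # -> 4096
```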
With this setup, you have two possibilities:

1. Start a separate thread that periodically calls fp.tell() on the file object you opened for the xlsx file, or
2. write your own file-like class that wraps the actual file object (at least its read method) and updates the progress bar synchronously, so you don't need an extra thread. Your class would just need to pass the method calls on to the actual file object; in that sense you could compare it to a proxy object.

I have to admit that 2 is kind of dirty. But I'm convinced that both methods would work, because I just verified that pd.read_excel really can read from a file object (io.BufferedReader), including xlsx files, which are afaik zipped files. This method would just not be so accurate, because the file pointer might not move linearly with time, depending on things like fluctuations in the compression rate (some parts of the file might be compressible at a higher rate than others).
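The second option could be sketched roughly as follows. The class name ProgressFileReader and the on_progress callback are my own invention, not an established API; with tqdm you would pass bar.update as the callback. Since xlsx is a zip archive, the reader may also seek around, so the bar is only approximate, as noted above:

```python
import io

class ProgressFileReader:
    """Proxy around a binary file object that reports how many bytes
    were consumed via an on_progress callback (e.g. tqdm's bar.update)."""

    def __init__(self, fp, on_progress):
        self._fp = fp
        self._on_progress = on_progress

    def read(self, size=-1):
        data = self._fp.read(size)
        self._on_progress(len(data))   # advance the progress bar
        return data

    # pandas' Excel reader may also seek/tell, so forward those unchanged
    def seek(self, offset, whence=io.SEEK_SET):
        return self._fp.seek(offset, whence)

    def tell(self):
        return self._fp.tell()

    def __getattr__(self, name):       # forward everything else (close, ...)
        return getattr(self._fp, name)

# Self-contained demo with an in-memory "file" instead of a real xlsx:
chunks = []
reader = ProgressFileReader(io.BytesIO(b"abcdef"), chunks.append)
reader.read(4)                         # reports 4 bytes read
reader.read()                          # reports the remaining 2 bytes
print(sum(chunks))                     # -> 6
```

With a real file you would then do something like pd.read_excel(ProgressFileReader(fp, bar.update)) inside with open(path, "rb") as fp, tqdm(total=fp_len) as bar: — a hypothetical sketch, not tested against every pandas version.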