Is there a chunksize argument for read_excel in pandas?

jsxgd picture jsxgd · Jun 26, 2017 · Viewed 7.4k times · Source

I'm trying to create a progress bar for reading excel data into pandas using tqdm. I can do this easily with a csv using the chunksize argument like so:

data_reader = pd.read_csv(path,
                          chunksize = 1000)

for row in tqdm(data_reader,
                total = 200):
    df_list = []
    df_list.append(row)

Which updates the progress bar for every chunk of 1000 out of 200 total chunks. pd.read_excel, however, no longer has a chunksize argument. Is there an alternative?

Edit: I've read the question re: reading an excel file in chunks (Reading a portion of a large xlsx file with python), however, read_excel does not have a chunksize argument anymore and pd.ExcelFile.parse is equivalent. I am wondering if there is an alternative to the chunksize argument or another way to create an iterable to loop over chunks while they are being read in.

Answer

jottbe picture jottbe · Jul 24, 2019

If you want to add a progress indicator, you could use the .tell() method of file objects. That's of course not quite accurate, but maybe it gives your users enough accuracy to estimate, how long a coffee break they can make :-)

So here is the plan: basically open your excel file with open and pass the result object to pd.read_excel. According to the docs, this should be possible, and I just verified it with a simple example for an xlsx file.

At the beginning you evaluate the size of the file. E.g by:

import io
fp.seek(0, io.SEEK_END) # set the file cursor to the end of the file
fp_len= fp.tell()
fp.seek(0, io.SEEK_SET) # set the file cursor back to the beginning of the file

With this setup, you have two possibilities:

  1. Either you create a thread, that updates the progress bar from time to time by calling fp.tell() on the file object, you opened for the xlsx file, or
  2. create your own wrapper, that provides the methods, pandas needs to read the data (at least a read method) and updates the progress bar synchronously, so you don't need am extra thread. Your class would just need to pass the method calls to the actual file class. In that sense you could compare it with a proxy object.

I have to admit, that 2 is kind of dirty. But I'm convinced that both methods would work, because I just verified, that pd.read_excel really can read from a file object (io.BufferedReader), also xlsx files, which are afaik zipped files. This method would just not be so accurate, because the file pointer might not move linear with time depending on things like fluctuations in the compression rate (some parts of the file might be compressable with a higher rate, than others).