How to read data into a Python dataframe without concatenating?

Geet · Sep 8, 2016 · Viewed 16.9k times

I want to read the file f (file size: 85GB) in chunks into a dataframe. The following code was suggested.

import pandas as pd

chunksize = 5
TextFileReader = pd.read_csv(f, chunksize=chunksize)

However, this code gives me a TextFileReader, not a dataframe. Also, because of the memory limit, I don't want to concatenate these chunks to convert the TextFileReader into a dataframe. Please advise.

Answer

Sayali Sonawane · Sep 8, 2016

Since you are processing an 85GB CSV file, reading all of it in chunks and then combining those chunks into a single dataframe will certainly hit the memory limit. Try a different approach: filter the data as you read it. For example, if the dataset has 600 columns and you only need 50 of them, read only those 50 columns; this saves a lot of memory. Process the rows as you read them. If you need to filter the data first, use a generator function. The yield keyword makes a function a generator function, which means it does no work until you start looping over it.
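A minimal sketch of that idea, assuming made-up column names (id, city, amount, notes) and a tiny in-memory CSV standing in for the real file. The helper filtered_chunks is hypothetical; usecols and chunksize are regular read_csv parameters. Only the needed columns are parsed, each chunk is filtered inside a generator, and a running total is kept so the chunks are never concatenated.

```python
import io
import pandas as pd

# Hypothetical stand-in for the 85GB file; in practice pass the file path instead.
csv_text = (
    "id,city,amount,notes\n"
    "1,Pune,10,a\n"
    "2,Delhi,25,b\n"
    "3,Pune,40,c\n"
    "4,Mumbai,5,d\n"
    "5,Pune,15,e\n"
)

def filtered_chunks(source, columns, chunksize):
    """Yield one filtered DataFrame per chunk; nothing is read until iteration starts."""
    reader = pd.read_csv(source, usecols=columns, chunksize=chunksize)
    for chunk in reader:
        # Keep only the rows of interest (the filter condition is an assumption).
        yield chunk[chunk["city"] == "Pune"]

total = 0
rows_kept = 0
for part in filtered_chunks(io.StringIO(csv_text), ["city", "amount"], chunksize=2):
    # Aggregate per chunk; only one small chunk is held in memory at a time.
    total += int(part["amount"].sum())
    rows_kept += len(part)

print(total, rows_kept)
```

For a real file, a chunksize in the tens or hundreds of thousands of rows is usually more practical than 2; the small value here only keeps the example readable.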

For more information on generator functions, see: Reading a huge .csv file

For efficient filtering, refer to: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3

For processing a smaller dataset:

Approach 1: Convert the reader object to a dataframe directly:

full_data = pd.concat(TextFileReader, ignore_index=True)

Passing ignore_index=True to concat gives the result a fresh 0 to n-1 index, which guards against duplicate index labels.

Approach 2: Iterate over the reader or use get_chunk to get a dataframe.

When you pass chunksize to read_csv, the return value is an iterable object of type TextFileReader.

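# Read the next 3 rows as a DataFrame
df = TextFileReader.get_chunk(3)

# Or iterate: each chunk is a DataFrame of up to chunksize rows
for chunk in TextFileReader:
    print(chunk)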

Source : http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

df = TextFileReader.get_chunk(1)

get_chunk already returns a DataFrame, so this reads the next row into a one-row dataframe; wrapping the result in pd.DataFrame(...) is not needed.

Checking the total number of chunks in TextFileReader (this loop reads the whole file and exhausts the reader, so call read_csv again before reusing it):

number_of_chunks = 0
for chunk in TextFileReader:
    number_of_chunks += 1

print(number_of_chunks)

If the file is larger, I would not recommend the second approach. For example, if the CSV file contains 100,000 records, chunksize=5 will create 20,000 chunks.
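The arithmetic behind that number can be checked on a small scale. The sketch below uses a synthetic 1,000-row CSV (an assumption, standing in for the real file) and a hypothetical helper count_chunks; the chunk count is simply the row count divided by chunksize, rounded up, so a larger chunksize means far fewer iterations.

```python
import io
import math
import pandas as pd

# Synthetic 1,000-row CSV standing in for the real file (assumption for illustration).
n_rows = 1_000
csv_text = "value\n" + "\n".join(str(i) for i in range(n_rows)) + "\n"

def count_chunks(text, chunksize):
    """Count how many chunks read_csv yields for a given chunksize."""
    reader = pd.read_csv(io.StringIO(text), chunksize=chunksize)
    return sum(1 for _ in reader)

small = count_chunks(csv_text, 5)     # many tiny chunks
large = count_chunks(csv_text, 250)   # a few larger chunks
expected_small = math.ceil(n_rows / 5)

print(small, large, expected_small)
```

By the same formula, 100,000 records with chunksize=5 gives 100,000 / 5 = 20,000 chunks, while chunksize=10,000 would give only 10.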