Pandas SQL chunksize

Nitin Kumar picture Nitin Kumar · Aug 5, 2015 · Viewed 31.9k times · Source

This is more of a question on understanding than programming. I am quite new to Pandas and SQL. I am using pandas to read data from SQL with some specific chunksize. When I run a sql query e.g. import pandas as pd

df = pd.read_sql_query('select name, birthdate from table1', chunksize = 1000)

What I do not understand is when I do not give a chunksize, data is stored in the memory and I can see the memory growing however, when I give a chunksize the memory usage is not that high.

I have is that this df now contains a number of arrays which I can access as

for df_array in df:
    print df.head(5)

What I do not understand here is if the entire result of the SQL statement is kept in memory i.e. df is an object carrying multiple arrays or if these are like pointers pointing towards a temp table created by SQL query.

I would be very glad to develop some understanding about how this process is actually working.

Answer

prusya picture prusya · Aug 5, 2015

Let's consider two options and what happens in both cases:

  1. chunksize is None(default value):
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize is None
    • pandas tells database that it wants to receive all rows of the result table at once
    • database returns all rows of the result table
    • pandas stores the result table in memory and wraps it into a data frame
    • now you can use the data frame
  2. chunksize in not None:
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize has some value
    • pandas creates a query iterator(usual 'while True' loop which breaks when database says that there is no more data left) and iterates over it each time you want the next chunk of the result table
    • pandas tells database that it wants to receive chunksize rows
    • database returns the next chunksize rows from the result table
    • pandas stores the next chunksize rows in memory and wraps it into a data frame
    • now you can use the data frame

For more details you can see pandas\io\sql.py module, it is well documented