I have trouble querying a table of > 5 million records from MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting to much data into memory.
This works:
import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "SELECT TOP 2000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
It returns this error:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a dataframe
from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:
read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other methods to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work to execute a SELECT * query. Surely there is a simpler approach.
As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql
to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying