What's the most efficient way to convert a MySQL result set to a NumPy array?

thegreatt picture thegreatt · Aug 15, 2011 · Viewed 20k times · Source

I'm using MySQLdb and Python. I have some basic queries such as this:

c=db.cursor()
c.execute("SELECT id, rating from video")
results = c.fetchall()

I need "results" to be a NumPy array, and I'm looking to be economical with my memory consumption. It seems like copying the data row by row would be incredibly inefficient (double the memory would be required). Is there a better way to convert MySQLdb query results into the NumPy array format?

The reason I'm looking to use the NumPy array format is because I want to be able to slice and dice the data easily, and it doesn't seem like python is very friendly to multi-dimensional arrays in that regard.

e.g. b = a[a[:,2]==1] 

Thanks!

Answer

sirlark picture sirlark · Aug 15, 2013

This solution uses Kieth's fromiter technique, but handles the two dimensional table structure of SQL results more intuitively. Also, it improves on Doug's method by avoiding all the reshaping and flattening in python data types. Using a structured array we can read pretty much directly from the MySQL result into numpy, cutting out python data types almost entirely. I say 'almost' because the fetchall iterator still produces python tuples.

There is one caveat though, but it's not a biggie. You must know the data type of your columns and the number of rows in advance.

Knowing the column types should be obvious, since you know what the query is presumably, otherwise you can always use curs.description, and a map of the MySQLdb.FIELD_TYPE.* constants.

Knowing the row count means you have to use client side cursor (which is the default). I don't know enough about the internals of MySQLdb and the MySQL client libraries, but my understanding is that the entire result is fetched into client side memory when using client side cursors, although I suspect there's actually some buffering and caching involved. This would mean using double memory for the result, once for the cursor copy and once for the array copy, so it's probably a good idea to close the cursor as soon as possible to free up the memory if the result set is large.

Strictly speaking, you don't have to provide the number of rows in advance, but doing so means the array memory is allocated once off in advance, and not continuously resized as more rows come in from the iterator which is meant to provide a huge performance boost.

And with that, some code

import MySQLdb
import numpy

conn = MySQLdb.connect(host='localhost', user='bob', passwd='mypasswd', db='bigdb')
curs = conn.cursor() #Use a client side cursor so you can access curs.rowcount
numrows = curs.execute("SELECT id, rating FROM video")

#curs.fetchall() is the iterator as per Kieth's answer
#count=numrows means advance allocation
#dtype='i4,i4' means two columns, both 4 byte (32 bit) integers
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype=('i4,i4'))

print A #output entire array
ids = A['f0'] #ids = an array of the first column
              #(strictly speaking it's a field not column)
ratings = A['f1'] #ratings is an array of the second colum

See the numpy documentation for dtype and the link above about structured arrays for how to specify column data types, and column names.