I wonder what the best way of normalizing/standardizing a numpy recarray is.
To make it clear, I'm not talking about a mathematical matrix, but a record array that also has e.g. textual columns (such as labels).
a = np.genfromtxt("iris.csv", delimiter=",", dtype=None)
print(a.shape)
> (150,)
As you can see, I cannot e.g. process a[:,:-1] as the shape is one-dimensional.
The best I found is to iterate over all columns:
for nam in a.dtype.names[:-1]:
    col = a[nam]
    a[nam] = (col - col.min()) / (col.max() - col.min())
Any more elegant way of doing this? Is there some method such as "normalize" or "standardize" somewhere?
There are a number of ways to do it, but some are cleaner than others.
Usually, in numpy, you keep the string data in a separate array.
(Things are a bit more low-level than, say, R's data frame. You typically just wrap things up in a class for the association, but keep different data types separate.)
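To illustrate the "wrap things up in a class" idea, here is a minimal sketch. The class name `LabeledData`, its method, and the sample rows are all made up for illustration; they are not part of numpy.

```python
import numpy as np


class LabeledData:
    """Hypothetical container pairing a 2D float array with a 1D label array."""

    def __init__(self, values, labels):
        self.values = np.asarray(values, dtype=float)
        self.labels = np.asarray(labels)
        if len(self.values) != len(self.labels):
            raise ValueError("values and labels must have the same number of rows")

    def min_max_normalized(self):
        """Return a new LabeledData with every column rescaled to [0, 1]."""
        vmin = self.values.min(axis=0)
        span = np.ptp(self.values, axis=0)  # max - min per column
        return LabeledData((self.values - vmin) / span, self.labels)


# Made-up sample rows, for illustration only.
sample = LabeledData(
    [[5.1, 3.5], [4.9, 3.0], [6.3, 3.3]],
    ["setosa", "setosa", "virginica"],
)
scaled = sample.min_max_normalized()
```

Because the numeric data lives in an ordinary 2D array, the usual axis-wise operations work directly. Note that a constant column would give a zero span and a division by zero, which this sketch does not handle.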
Honestly, numpy isn't optimized for handling "flexible" datatypes such as this (though it can certainly do it). Things like pandas provide a better interface for "spreadsheet-like" data (and pandas is just a layer on top of numpy).
However, structured arrays (which is what you have here) will allow you to slice them column-wise when you pass in a list of field names (e.g. data[['col1', 'col2', 'col3']]).
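As a small sketch of selecting fields by name, the snippet below builds a tiny structured array by hand. The field names and values are invented stand-ins for the iris columns.

```python
import numpy as np

# Made-up structured array standing in for the iris data.
data = np.array(
    [(5.1, 3.5, "setosa"), (4.9, 3.0, "setosa"), (6.3, 3.3, "virginica")],
    dtype=[("sepal_length", "f8"), ("sepal_width", "f8"), ("species", "U10")],
)

# A list of field names selects several "columns" at once.
subset = data[["sepal_length", "sepal_width"]]

print(subset.dtype.names)  # the two selected field names
print(subset.shape)        # still one-dimensional: one record per row

# A single field name gives an ordinary 1D array for that column.
print(data["sepal_width"])
```

The result of a multi-field selection is still a one-dimensional structured array, so it has to be converted before 2D operations such as `min(axis=0)` make sense.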
At any rate, one way is to do something like this:
import numpy as np
from numpy.lib import recfunctions as rfn

# Same call as in the question (np.recfromcsv is deprecated in NumPy 2.0)
data = np.genfromtxt('iris.csv', delimiter=',', dtype=None)
# In this case, it's just all but the last, but we could be more general
# This must be a list and not a tuple, though.
float_fields = list(data.dtype.names[:-1])
float_dat = data[float_fields]
# Now we just need to convert it to a "regular" 2D array...
# (.view(float) is unreliable on multi-field selections since NumPy 1.16)
float_dat = rfn.structured_to_unstructured(float_dat)
# And we can normalize columns as usual.
normalized = (float_dat - float_dat.min(axis=0)) / np.ptp(float_dat, axis=0)
However, this is far from ideal. If you want to do the operation in-place (as you currently are) the easiest solution is what you already have: Just iterate over the field names.
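If you iterate over the field names, the loop can be wrapped in a small helper that also covers standardization. This is only a sketch: `normalize_fields_inplace` and its `method` argument are hypothetical names, not a numpy function, and the sample array is made up.

```python
import numpy as np


def normalize_fields_inplace(arr, method="minmax"):
    """Rescale every floating-point field of a structured array in place.

    Hypothetical helper: non-float fields (e.g. text labels) are skipped.
    """
    for name in arr.dtype.names:
        if arr.dtype[name].kind != "f":
            continue  # leave strings, ints, etc. untouched
        col = arr[name]
        if method == "minmax":
            arr[name] = (col - col.min()) / (col.max() - col.min())
        elif method == "zscore":
            arr[name] = (col - col.mean()) / col.std()
        else:
            raise ValueError("method must be 'minmax' or 'zscore'")


# Made-up records, for illustration only.
dtype = [("sepal_length", "f8"), ("sepal_width", "f8"), ("species", "U10")]
rows = [(5.1, 3.5, "setosa"), (4.9, 3.0, "setosa"), (6.3, 3.3, "virginica")]

a = np.array(rows, dtype=dtype)
normalize_fields_inplace(a)             # min-max scaling to [0, 1]

b = np.array(rows, dtype=dtype)
normalize_fields_inplace(b, "zscore")   # zero mean, unit variance
```

Checking the field's dtype kind avoids hard-coding "all but the last column", at the cost of skipping integer columns, which cannot hold fractional results in place anyway.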
Incidentally, using pandas, you'd do something like this:
import pandas
data = pandas.read_csv('iris.csv', header=None)
float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)
data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)
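A variation on the pandas approach, sketched under the assumption that every numeric column should be scaled: `select_dtypes` picks the numeric columns instead of relying on the label being last. The in-memory CSV text is made up so the snippet runs without a file.

```python
import io

import pandas as pd

# Made-up rows standing in for iris.csv (no header line).
csv_text = "5.1,3.5,setosa\n4.9,3.0,setosa\n6.3,3.3,virginica\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Pick the numeric columns by dtype rather than by position.
num_cols = df.select_dtypes(include="number").columns
nums = df[num_cols]

# Min-max normalize those columns and write them back.
df[num_cols] = (nums - nums.min()) / (nums.max() - nums.min())
```

Swapping the last line for `(nums - nums.mean()) / nums.std()` would standardize instead; note that pandas' `std` uses the sample standard deviation (`ddof=1`) by default, unlike numpy's.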