How to store scaling parameters for later use

LetsPlayYahtzee · Mar 11, 2016 · Viewed 8.6k times

I want to apply the scaling that scikit-learn's sklearn.preprocessing.scale offers, to center a dataset that I will use to train an SVM classifier.

How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?

I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?

Answer

Ami Tavory · Mar 11, 2016

I think that the best way is to pickle it after fitting, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and a scaler; by pickling a (possibly compound) stage, you make things more generic. The sklearn documentation on model persistence discusses how to do this.
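For instance, a minimal sketch of the pickle route (the file name scaler.pkl and the toy training data are just illustrative):

import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data once, then serialize it.
X_train = np.array([[1., 2.], [3., 4.], [5., 6.]])
scaler = StandardScaler().fit(X_train)

with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# Later, in the classification script: load the fitted scaler and
# apply the same standardization parameters to new data.
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

X_new = np.array([[2., 3.]])
print(scaler.transform(X_new))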

Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:

scale_ : ndarray, shape (n_features,)
    Per-feature relative scaling of the data.
    New in version 0.17: scale_ is recommended instead of the deprecated std_.

mean_ : array of floats, shape (n_features,)
    The mean value for each feature in the training set.

The following short snippet illustrates this:

from sklearn import preprocessing
import numpy as np

# Fit the scaler on a single-feature column vector of shape (4, 1).
s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)

print(s.mean_, s.scale_)
# [2.5] [1.11803399]
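If you store only these two arrays, the standardization can be reproduced by hand, since with the default settings transform() computes (x - mean_) / scale_. A small sketch (the data X is illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., 2, 3, 4]]).T
s = StandardScaler().fit(X)

# Applying (x - mean_) / scale_ manually matches transform().
manual = (X - s.mean_) / s.scale_
print(np.allclose(manual, s.transform(X)))  # True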