I have a matrix where each column has mean 0 and std 1:
In [67]: x_val.std(axis=0).min()
Out[67]: 0.99999999999999922
In [71]: x_val.std(axis=0).max()
Out[71]: 1.0000000000000007
In [72]: x_val.mean(axis=0).max()
Out[72]: 1.1990408665951691e-16
In [73]: x_val.mean(axis=0).min()
Out[73]: -9.7144514654701197e-17
The number of non-zero coefficients changes if I use the normalize option:
In [74]: l = Lasso(alpha=alpha_perc70).fit(x_val, y_val)
In [81]: sum(l.coef_!=0)
Out[81]: 47
In [84]: l2 = Lasso(alpha=alpha_perc70, normalize=True).fit(x_val, y_val)
In [93]: sum(l2.coef_!=0)
Out[93]: 3
It seems to me that normalize just sets the variance of each column to 1. It is strange that the results change so much, given that my data already has variance 1.
So what does normalize=True actually do?
This is due to an (or a potential [1]) inconsistency in the concept of scaling in sklearn.linear_model.base.center_data: if normalize=True, then it will divide by the norm of each column of the design matrix, not by the standard deviation. For what it's worth, the keyword normalize=True will be deprecated from sklearn version 0.17.
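To see why that matters even for standardized data, note that for a zero-mean column the Euclidean norm and the standard deviation differ exactly by a factor of np.sqrt(n_samples), which is where the scaling factor below comes from. A small illustration of my own:

import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100)
x = (x - x.mean()) / x.std()      # zero mean, unit std

# For zero-mean x: ||x||_2 = sqrt(sum(x**2)) = sqrt(n) * std(x)
print(np.linalg.norm(x))          # ~10.0, i.e. sqrt(100)
print(x.std() * np.sqrt(len(x)))  # the same value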
Solution: Do not use normalize=True. Instead, build a sklearn.pipeline.Pipeline and prepend a sklearn.preprocessing.StandardScaler to your Lasso object. That way you do not even need to perform your initial scaling.
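For example, a minimal sketch using your x_val and y_val (the alpha value here is just a placeholder):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

pipe = Pipeline([
    ('scaler', StandardScaler()),  # standardizes each column: mean 0, std 1
    ('lasso', Lasso(alpha=0.1)),   # placeholder penalty
])
pipe.fit(x_val, y_val)
print(sum(pipe.named_steps['lasso'].coef_ != 0))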
Note that the data loss term in the sklearn implementation of Lasso is scaled by n_samples. Thus the minimal penalty yielding an all-zero solution is alpha_max = np.abs(X.T.dot(y)).max() / n_samples (for normalize=False).
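As a quick sanity check of this formula, here is a self-contained sketch of my own; it assumes centered data (matching your situation), so the intercept plays no role:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p = 50, 10
X = rng.randn(n, p)
X = (X - X.mean(0)) / X.std(0)  # standardize columns
y = rng.randn(n)
y -= y.mean()                   # center the target

alpha_max = np.abs(X.T.dot(y)).max() / n
print(Lasso(alpha=alpha_max).fit(X, y).coef_)         # all zeros
print(Lasso(alpha=0.99 * alpha_max).fit(X, y).coef_)  # at least one non-zero coefficient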
[1] I say potential inconsistency because normalize is associated with the word norm and is thus at least linguistically consistent :)
[Stop reading here if you don't want the details]
Here is some copy-and-pasteable code reproducing the problem:
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
n_samples, n_features, n_active_vars = 20, 10, 5
X = rng.randn(n_samples, n_features)
X = (X - X.mean(0)) / X.std(0)  # standardize each column: mean 0, std 1

beta = rng.randn(n_features)
beta[rng.permutation(n_features)[:n_active_vars]] = 0.  # zero out some coefficients
y = X.dot(beta)

print(X.std(0))   # all ones
print(X.mean(0))  # all numerically zero

lasso1 = Lasso(alpha=.1)
print(lasso1.fit(X, y).coef_)
lasso2 = Lasso(alpha=.1, normalize=True)
print(lasso2.fit(X, y).coef_)
In order to understand what is going on, now observe that
lasso1.fit(X / np.sqrt(n_samples), y).coef_ / np.sqrt(n_samples)
is equal to
lasso2.fit(X, y).coef_
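This can be checked numerically with a small sketch of my own, reusing X, y, lasso1 and lasso2 from the block above:

coef_scaled = lasso1.fit(X / np.sqrt(n_samples), y).coef_ / np.sqrt(n_samples)
coef_normalized = lasso2.fit(X, y).coef_
print(np.allclose(coef_scaled, coef_normalized))  # True, up to solver tolerance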
Hence, scaling the design matrix and appropriately rescaling the coefficients by np.sqrt(n_samples) converts one model into the other. This can also be achieved by acting on the penalty: a lasso estimator with normalize=True, with its penalty scaled down by np.sqrt(n_samples), acts like a lasso estimator with normalize=False (on your type of data, i.e. already standardized to std=1).
lasso3 = Lasso(alpha=.1 / np.sqrt(n_samples), normalize=True)
print(lasso3.fit(X, y).coef_)  # yields the same coefficients as lasso1.fit(X, y).coef_