I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and that which I have seen elsewhere, instead of simply conducting cross validation on all of the training data it is advised to split it up into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set and then the hyperparameter alpha is tuned on the basis of results from cross validation of the validation set. Finally, the accepted model is used on the test set to give a realistic view oh how it will perform in reality. Seperating the concerns out here is a preventative measure against overfitting.
Actual Question
Does Lasso CV conform to the above protocol or does it just somehow train the model paramaters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.
If you use sklearn.cross_validation.cross_val_score
with a sklearn.linear_model.LassoCV
object, then you are performing nested cross-validation. cross_val_score
will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.cross_validation.KFold
). The train set will be passed to the LassoCV
, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LassoCV
X = np.random.randn(20, 10)
y = np.random.randn(len(X))
cv_outer = KFold(len(X), n_folds=5)
lasso = LassoCV(cv=3) # cv=3 makes a KFold inner splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)
Answer: no, LassoCV
will not do all the work for you, and you have to use it in conjunction with cross_val_score
to obtain what you want. This is at the same time the reasonable way of implementing such objects, since we can also be interested in only fitting a hyperparameter optimized LassoCV
without necessarily evaluating it directly on another set of held out data.