I'm trying to build my first KNN classifier using scikit-learn. I've been following the User Guide and other online examples, but there are a few things I'm unsure about. For this post, let's use the following:
X = data, Y = target
1) Most introductions to machine learning that I've read say you want a training set, a validation set, and a test set. From what I understand, cross-validation lets you combine the training and validation sets to train the model, and then you test it on the test set to get a score. However, I have seen papers where, in a lot of cases, the authors just cross-validate on the entire dataset and report the CV score as the accuracy. I understand that in an ideal world you would want to test on separate data, but if this is legitimate I would like to cross-validate on my entire dataset and report those scores.
2) So starting the process
I define my KNN classifier as follows:
knn = KNeighborsClassifier(algorithm='brute')
I search for the best n_neighbors using:
clf = GridSearchCV(knn, parameters, cv=5)
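where parameters is a dict of candidate values, something like this (the exact range here is just an illustration, not my real grid):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid: try odd k from 1 to 29
parameters = {'n_neighbors': list(range(1, 30, 2))}

knn = KNeighborsClassifier(algorithm='brute')
clf = GridSearchCV(knn, parameters, cv=5)
```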
Now if I say:
clf.fit(X, Y)
I can check the best parameter using:
clf.best_params_
and then I can get a score:
clf.score(X, Y)
But, as I understand it, this hasn't cross-validated the model, since it only gives one score?
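Poking around, it looks like the fitted GridSearchCV also keeps the per-fold scores in cv_results_. A minimal sketch (iris is just a stand-in for my X, Y, and the grid is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, Y = load_iris(return_X_y=True)  # stand-in for my real data/target
parameters = {'n_neighbors': list(range(1, 15))}  # illustrative grid

clf = GridSearchCV(KNeighborsClassifier(algorithm='brute'), parameters, cv=5)
clf.fit(X, Y)

# One mean 5-fold CV accuracy per candidate k...
print(clf.cv_results_['mean_test_score'])
# ...and the individual fold scores for each candidate (split0 ... split4)
print(clf.cv_results_['split0_test_score'])
```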
If clf.best_params_ gives n_neighbors = 14, could I then go on:
knn2 = KNeighborsClassifier(n_neighbors=14, algorithm='brute')
cross_val_score(knn2, X, Y, cv=5)
Now I know the data has been cross-validated, but I don't know whether it is legitimate to use clf.fit to find the best parameter and then use cross_val_score with a new KNN model?
3) I understand that the 'proper' way to do it would be as follows
Split into X_train, X_test, Y_train, Y_test; scale the train set -> apply that transform to the test set
knn = KNeighborsClassifier(algorithm='brute')
clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X_train, Y_train)
clf.best_params_
and then I can get a score:
clf.score(X_test, Y_test)
In this case, is the score calculated using the best parameter?
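For concreteness, here is the whole of step 3 as I understand it, as a runnable sketch (iris is a stand-in for my data, and the grid values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)  # stand-in for my real data/target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0, stratify=Y)

# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

parameters = {'n_neighbors': list(range(1, 20))}  # illustrative grid
clf = GridSearchCV(KNeighborsClassifier(algorithm='brute'), parameters, cv=5)
clf.fit(X_train, Y_train)

# With refit=True (the default), score() uses the best estimator,
# i.e. the model refit on all of X_train with the winning n_neighbors
print(clf.best_params_)
print(clf.score(X_test, Y_test))
```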
I hope that this makes sense. I've been trying to find as much as I can without posting but I have come to the point where I think it would be easier to get some direct answers.
In my head, I am trying to get some cross-validated scores using the whole dataset, but also use a grid search (or something similar) to fine-tune the parameters.
Thanks in advance
Yes, you can CV on your entire dataset; it is viable. But I would still suggest that you at least split your data into two sets, one for CV and one for testing.
The .score function is supposed to return a single float value, according to the documentation: the score of the best estimator (which is the best-scoring estimator you get from fitting your GridSearchCV) on the given X, Y.
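In other words (a quick check, using iris as stand-in data and a small illustrative grid):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, Y = load_iris(return_X_y=True)
clf = GridSearchCV(KNeighborsClassifier(algorithm='brute'),
                   {'n_neighbors': [3, 5, 7]}, cv=5)  # illustrative grid
clf.fit(X, Y)

# clf.score delegates to the refit best estimator, so these two
# numbers are identical
assert clf.score(X, Y) == clf.best_estimator_.score(X, Y)
```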
Hope that makes things clearer :)