Grid search parameters and cross-validated data sets with a KNN classifier in scikit-learn

browser · Nov 16, 2016

I'm trying to build my first KNN classifier using scikit-learn. I've been following the User Guide and other online examples, but there are a few things I'm unsure about. For this post, let's use the following:

X = data
Y = target

1) Most introductions to machine learning that I've read say you want a training set, a validation set, and a test set. From what I understand, cross-validation lets you combine the training and validation sets to train the model, and then you should test it on the test set to get a score. However, I have seen papers where, in a lot of cases, you just cross-validate on the entire dataset and then report the CV score as the accuracy. I understand that in an ideal world you would want to test on separate data, but if this is legitimate, I would like to cross-validate on my entire dataset and report those scores.
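
For reference, by cross-validating on the entire dataset I mean something like this (just a sketch, with 5 folds as an example):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(algorithm='brute')
scores = cross_val_score(knn, X, Y, cv=5)  # one accuracy score per fold
print(scores.mean(), scores.std())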

2) So, starting the process:

I define my KNN classifier as follows:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(algorithm='brute')

I search for the best n_neighbors using

from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(knn, parameters, cv=5)
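
(parameters here is my dict of candidate values. I haven't posted my actual grid, but for illustration it would be something along the lines of:)

parameters = {'n_neighbors': list(range(1, 31))}  # hypothetical example, not my real grid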

Now if I say

clf.fit(X, Y)

I can check the best parameter using

clf.best_params_

and then I can get a score

clf.score(X, Y)

But, as I understand it, this hasn't cross-validated the model, since it only gives one score?

If I have seen from clf.best_params_ that the best n_neighbors is 14, could I now go on with

knn2 = KNeighborsClassifier(n_neighbors=14, algorithm='brute')
cross_val_score(knn2, X, Y, cv=5)

Now I know the data has been cross-validated, but I don't know if it is legitimate to use clf.fit to find the best parameter and then use cross_val_score with a new KNN model?

3) I understand that the 'proper' way to do it would be as follows:

Split into X_train, X_test, Y_train, Y_test; scale the training set, then apply that same transform to the test set.
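
In code, that preprocessing step would look something like this (assuming StandardScaler purely as an example of a scaler, and an example split size):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
scaler = StandardScaler().fit(X_train)  # fit the scaler on the training set only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)       # apply the same transform to the test set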

knn = KNeighborsClassifier(algorithm='brute')
clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X_train, Y_train)
clf.best_params_

and then I can get a score

clf.score(X_test, Y_test)

In this case, is the score calculated using the best parameter?


I hope that this makes sense. I've been trying to find out as much as I can without posting, but I have come to the point where I think it would be easier to get some direct answers.

In my head, I am trying to get some cross-validated scores using the whole dataset, but also use a grid search (or something similar) to fine-tune the parameters.
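
The closest pattern I've come across is nesting the grid search inside cross_val_score, though I'm not sure whether this is the right approach (the grid is again a made-up example):

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

inner = GridSearchCV(KNeighborsClassifier(algorithm='brute'),
                     {'n_neighbors': list(range(1, 31))},  # hypothetical grid
                     cv=5)
scores = cross_val_score(inner, X, Y, cv=5)  # outer CV around the tuned model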

Thanks in advance

Answer

nitheism · Nov 17, 2016
  1. Yes, you can CV on your entire dataset; it is viable. But I would still suggest that you at least split your data into two sets, one for CV and one for testing (see the sketch after this list).

  2. According to the documentation, the .score function returns a single float value: the score of the best estimator (the best-scoring estimator you get from fitting your GridSearchCV) on the given X, Y.

  3. If you saw that the best parameter is 14, then yes, you can go on with using it in your model; but if you gave the search more parameters, you should set all of them (I say that because you haven't given your parameters list). And yes, it is legitimate to check your CV once again, just in case, to see whether this model is as good as it should be.
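
To make points 1-3 concrete, here is a minimal sketch (the split size and the parameter grid are made up, since you haven't posted yours):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# hold out a test set; cross-validation happens on the rest (point 1)
X_cv, X_test, Y_cv, Y_test = train_test_split(X, Y, test_size=0.2)

knn = KNeighborsClassifier(algorithm='brute')
parameters = {'n_neighbors': list(range(1, 31))}  # hypothetical grid
clf = GridSearchCV(knn, parameters, cv=5)
clf.fit(X_cv, Y_cv)

print(clf.best_params_)           # e.g. {'n_neighbors': 14}
print(clf.best_score_)            # mean cross-validated score of the best candidate (point 2)
print(clf.score(X_test, Y_test))  # the refit best estimator, scored on held-out data (point 3)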

Hope that makes things clearer :)