Cross Validation in Weka

Titus Pullo · May 3, 2012 · Viewed 58.1k times

I've always thought from what I read that cross validation is performed like this:

In k-fold cross-validation, the original sample is randomly partitioned into k subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds then can be averaged (or otherwise combined) to produce a single estimation

So k models are built, and the final one is the average of those. The Weka guide, however, says that each model is always built using ALL of the data set. So how does cross-validation in Weka work? Is the model built from all the data, and does "cross-validation" just mean that k folds are created, the model is evaluated on each fold, and the final output is simply the averaged result over the folds?
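To make that reading concrete, here is a minimal Java sketch of the manual k-fold loop described above, using Weka's Instances.trainCV/testCV helpers; the ARFF file name and the choice of J48 as the learner are just placeholders:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ManualKFold {
    public static void main(String[] args) throws Exception {
        // Load a dataset (file name is a placeholder) and mark the last attribute as the class.
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);

        int k = 10;
        Instances randomized = new Instances(data);
        randomized.randomize(new Random(1));            // shuffle before partitioning into folds

        double sumPctCorrect = 0;
        for (int fold = 0; fold < k; fold++) {
            Instances train = randomized.trainCV(k, fold);  // the other k-1 folds
            Instances test  = randomized.testCV(k, fold);   // the held-out fold

            J48 model = new J48();                      // a fresh model is built for every fold
            model.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(model, test);
            sumPctCorrect += eval.pctCorrect();
        }
        // The k per-fold results are combined into a single estimate.
        System.out.println("Average accuracy over " + k + " folds: " + (sumPctCorrect / k));
    }
}
```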

Answer

Rushdi Shams · May 10, 2012

So, here is the scenario again: you have 100 labeled data points.

Use training set

  • Weka will take the 100 labeled data points
  • it will apply an algorithm to build a classifier from these 100 data points
  • it applies that classifier AGAIN to these same 100 data points
  • it provides you with the performance of the classifier (applied to the same 100 data points from which it was developed); a code sketch of this option follows the list
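As a minimal sketch of this option in the Weka API (assuming the J48 learner and a placeholder ARFF file), the classifier is built from all the data and then evaluated on that same data:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainingSetEvaluation {
    public static void main(String[] args) throws Exception {
        // Load the labeled data points (file name is a placeholder).
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build the classifier from all of the data.
        J48 classifier = new J48();
        classifier.buildClassifier(data);

        // Apply it AGAIN to the same data and report the performance.
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(classifier, data);
        System.out.println(eval.toSummaryString("=== Evaluation on training set ===", false));
    }
}
```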

Use 10-fold CV

  • Weka takes the 100 labeled data points

  • it produces 10 equal-sized sets; each set is divided into two groups: 90 labeled data points are used for training and 10 labeled data points are used for testing

  • it builds a classifier with an algorithm from the 90 training data points and applies it to the 10 test data points of set 1

  • it does the same thing for sets 2 to 10 and produces 9 more classifiers

  • it averages the performance of the 10 classifiers produced from the 10 equal-sized (90 training / 10 testing) sets; a code sketch follows this list
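In the Weka API this corresponds to a single call to Evaluation.crossValidateModel, which does the fold splitting, the 10 training runs, and the averaging internally. A minimal sketch, assuming J48 and a placeholder ARFF file:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TenFoldCV {
    public static void main(String[] args) throws Exception {
        // Load the labeled data points (file name is a placeholder).
        Instances data = DataSource.read("mydata.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // crossValidateModel builds and tests 10 classifiers internally
        // and reports the averaged performance.
        J48 classifier = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));
        System.out.println(eval.toSummaryString("=== 10-fold cross-validation ===", false));
    }
}
```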

Let me know if that answers your question.