Java Weka: How to specify split percentage?

rishi · Feb 4, 2013

I have written code that creates a model and saves it, and it works fine. My understanding is that, by default, the data is split into 10 folds. Instead, I want the data to be split into two sets (training and testing) when I create the model. In the Weka UI, I can do this with the "Percentage split" radio button. I want to know how to do the same through code, with 80% of the data used for training and 20% for testing. Here is my code:

        FilteredClassifier model = new FilteredClassifier();
        model.setFilter(new StringToWordVector());
        model.setClassifier(new NaiveBayesMultinomial());
        try {
            model.buildClassifier(trainingSet);
        } catch (Exception e1) { // TODO Auto-generated catch block
            e1.printStackTrace();
        }

        ObjectOutputStream oos = new ObjectOutputStream(
                new FileOutputStream(
                        "/Users/me/models/MyModel.model"));
        oos.writeObject(model);
        oos.flush();
        oos.close();

Here, trainingSet is an already populated Instances object. Can someone help me with this?

Thanks in advance!

Answer

Jan Eglinger · Feb 5, 2013

In the UI class ClassifierPanel's method startClassifier(), I found the following code:

// Percent split

int trainSize = (int) Math.round(inst.numInstances() * percent
    / 100);
int testSize = inst.numInstances() - trainSize;
Instances train = new Instances(inst, 0, trainSize);
Instances test = new Instances(inst, trainSize, testSize);

So after randomizing your dataset...

trainingSet.randomize(new java.util.Random(0));

... I suggest you split your trainingSet in the same way:

int trainSize = (int) Math.round(trainingSet.numInstances() * 0.8);
int testSize = trainingSet.numInstances() - trainSize;
Instances train = new Instances(trainingSet, 0, trainSize);
Instances test = new Instances(trainingSet, trainSize, testSize);

Then use Classifier#buildClassifier(Instances data) to train the classifier on the 80% training portion only:

model.buildClassifier(train);
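The shuffle-then-cut arithmetic above can be checked without Weka on the classpath. The following is only a minimal sketch on a plain java.util.List: the class name PercentSplitSketch and its helper methods are hypothetical, and Collections.shuffle stands in for Instances#randomize, so the resulting order will not match what Weka produces for the same seed. Only the size computation mirrors the snippets above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper class, not part of Weka.
public class PercentSplitSketch {

    /** Number of training items for a given percentage, rounded as in the snippets above. */
    public static int trainSize(int numInstances, double percent) {
        return (int) Math.round(numInstances * percent / 100);
    }

    /** Shuffles a copy of the data with a fixed seed, then cuts it into [train, test]. */
    public static <T> List<List<T>> split(List<T> data, double percent, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        // Stand-in for trainingSet.randomize(new Random(seed)) in Weka.
        Collections.shuffle(shuffled, new Random(seed));

        int cut = trainSize(shuffled.size(), percent);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));               // training part
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // testing part
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            data.add(i);
        }
        List<List<Integer>> parts = split(data, 80.0, 0L);
        System.out.println("train size = " + parts.get(0).size()); // train size = 8
        System.out.println("test size  = " + parts.get(1).size()); // test size  = 2
    }
}
```

Using a fixed seed keeps the split reproducible between runs, which is the same reason the answer passes new java.util.Random(0) to randomize. Because the test part is whatever remains after the cut, the two parts never overlap and together cover the whole dataset.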

UPDATE: thanks to @ChengkunWu's answer, I added the randomizing step above.