I am fairly new to Weka and even more new to Weka on the command line. I find documentation is poor and I am struggling to figure out a few things to do. For example, want to take two .arff files, one for training, one for testing and get an output of predictions for the missing labels in the test data.
How can I do this?
I have this code as a starting block
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier
-t "training_file_with_missing_values.arff"
-T "test_file_with_missing_values.arff"
-F weka.filters.unsupervised.attribute.ReplaceMissingValues -- -c last
-W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a
Running that code gives me "Illegal option -c last" and I am not sure why. I am also not going to be using MLP as NN tend to be too slow when I have a few thousand features from the text data. I know how to change it to another classifier though (like NB or libSVM so that is good).
But I am not sure how to add multiple filters in one call as I also need to add the StringToWordVector filter (and possibly the Reorder filter to make the class the last, instead of first attribute).
And then how do I get it actually output me the prediction labels of each class? And then store so those in an arff with the initial data.
Weka is not really the shining example of documentation, but you can still find valuable information about it on their sites. You should start with the Primer. I understand that you want to classify text files, so you should also have a look at Text categorization with WEKA. There is also a new Weka documentation site.
[Edit: Wikispaces has shut down and Weka hasn't brought up the sites somewhere else, yet, so I've modified the links to point at the Google cache. If someone reads this and a new Weka Wiki is up, feel free to edit the links and remove this note.]
The command line you posted in your question contains an error. I know, you copied it from my answer to another question, but I also just noticed it. You have to omit the -- -c last
, because the ReplaceMissingValue
filter doesn't like it.
In the Primer it says:
weka.filters.supervised
Classes below weka.filters.supervised in the class hierarchy are for supervised filtering, i.e. taking advantage of the class information. A class must be assigned via -c, for WEKA default behaviour use
-c last
.
but ReplaceMissingValue
is an unsupervised filter, as is StringToWordVector
.
Adding multiple filter is also no problem, that is what the MultiFilter
is for. The command line can get a bit messy, though: (I chose RandomForest
here, because it is a lot faster than NN).
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
-t ~/weka-3-7-9/data/ReutersCorn-train.arff \
-T ~/weka-3-7-9/data/ReutersCorn-test.arff \
-F "weka.filters.MultiFilter \
-F weka.filters.unsupervised.attribute.StringToWordVector \
-F weka.filters.unsupervised.attribute.Standardize" \
-W weka.classifiers.trees.RandomForest -- -I 100 \
Here is what the Primer says about getting the prediction:
However, if more detailed information about the classifier's predictions are necessary, -p # outputs just the predictions for each test instance, along with a range of one-based attribute ids (0 for none).
It is a good convention to put those general options like -p 0
directly after the class you're calling, so the command line would be
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
-t ~/weka-3-7-9/data/ReutersCorn-train.arff \
-T ~/weka-3-7-9/data/ReutersCorn-test.arff \
-p 0 \
-F "weka.filters.MultiFilter \
-F weka.filters.unsupervised.attribute.StringToWordVector \
-F weka.filters.unsupervised.attribute.Standardize" \
-W weka.classifiers.trees.RandomForest -- -I 100 \
But as you can see, WEKA can get very complicated when calling it from the command line. This is due to the tree structure of WEKA classifiers and filters. Though you can run only one classifier/filter per command line, it can be structured as complex as you like. For the above command, the structure looks like this:
The FilteredClassifier will initialize a filter on the training data set, filter both training and test data, then train a model on the training data and classify the given test data.
FilteredClassifier
|
+ Filter
|
+ Classifier
If we want multiple filters, we use the MultiFilter, which is only one filter, but it calls multiple others in the order they were given.
FilteredClassifier
|
+ MultiFilter
| |
| + StringToWordVector
| |
| + Standardize
|
+ RandomForest
The hard part of running something like this from the command line is assigning the desired options to the right classes, because often the option names are the same. For example, the -F
option is used for the FilteredClassifier
and the MultiFilter
as well, so I had to use quotes to make it clear which -F belongs to what filter.
In the last line, you see that the option -I 100
, which belongs to the RandomForest
, can't be appended directly, because then it would be assigned to FilteredClassifier
and you will get Illegal options: -I 100
. Hence, you have to add --
before it.
Adding the predicted class label is also possible, but even more complicated. AFAIK this can't be done in one step, but you have to train and save a model first, then use this one for predicting and assigning new class labels.
Training and saving the model:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier \
-t ~/weka-3-7-9/data/ReutersCorn-train.arff \
-d rf.model \
-F "weka.filters.MultiFilter \
-F weka.filters.unsupervised.attribute.StringToWordVector \
-F weka.filters.unsupervised.attribute.Standardize" \
-W weka.classifiers.trees.RandomForest -- -I 100 \
This will serialize the model of the trained FilteredClassifier
to the file rf.model
. The important thing here is that the initialized filter will also be serialized, otherwise the test set wouldn't be compatible after filtering.
Loading the model, making predictions and saving it:
java -classpath weka.jar weka.filters.supervised.attribute.AddClassification \
-serialized rf.model \
-classification \
-remove-old-class \
-i ~/weka-3-7-9/data/ReutersCorn-test.arff \
-o pred.arff \
-c last