libSVM provides tools for scaling data, but in scikit-learn (whose SVC classifier should be based on libSVM) I can't find a way to scale my data.
Basically I want to use 4 features, of which 3 range from 0 to 1 and the last one is a "big" highly variable number.
If I include the fourth feature in libSVM (using the easy.py script which scales my data automatically) I get some very nice results (96% accuracy). If I include the fourth variable in Scikit-Learn the accuracy drops to ~78% - but if I exclude it, I get the same results I get in libSVM when excluding that feature. Therefore I am pretty sure it's a problem of missing scaling.
How do I replicate libSVM's scaling step programmatically (i.e. without calling svm-scale)?
You have that functionality in sklearn.preprocessing:
>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
Each feature (column) of the scaled data then has zero mean and unit variance.
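If you also need to apply the same scaling to held-out data, a scaler object that remembers the training statistics may be more convenient than a one-off call to scale. The sketch below is only an illustration: the toy data and variable names are made up, and it assumes svm-scale was run with its default target range of [-1, 1], which MinMaxScaler(feature_range=(-1, 1)) should approximate. StandardScaler is the object-based counterpart of the zero-mean, unit-variance scaling shown above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: three features in [0, 1] plus one large,
# highly variable feature, mirroring the setup in the question.
rng = np.random.RandomState(0)
X_train = np.hstack([rng.rand(40, 3), rng.rand(40, 1) * 1e4])
y_train = (X_train[:, 3] > 5e3).astype(int)
X_test = np.hstack([rng.rand(10, 3), rng.rand(10, 1) * 1e4])

# Option 1: standardize each feature to zero mean and unit variance.
# The pipeline fits the scaler on the training data only and reuses
# those statistics whenever it predicts on new data.
std_clf = make_pipeline(StandardScaler(), SVC())
std_clf.fit(X_train, y_train)

# Option 2: rescale each feature to [-1, 1], assumed here to match
# svm-scale's default range.
range_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_ranged = range_scaler.fit_transform(X_train)
X_test_ranged = range_scaler.transform(X_test)  # reuses training min/max

range_clf = make_pipeline(MinMaxScaler(feature_range=(-1, 1)), SVC())
range_clf.fit(X_train, y_train)
predictions = range_clf.predict(X_test)
```

Either way, the important part is that the test data is transformed with the parameters learned from the training data, not rescaled independently.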