I'm using the SMOTE filter in WEKA to balance my data.
I have doubts about the two parameters nearestNeighbors and percentage.
nearestNeighbors -- The number of nearest neighbors to use.
percentage -- The percentage of SMOTE instances to create.
How should I set them?
I thought the number of neighbors was the number of synthetic samples it is going to create.
So what's the meaning of percentage? It should be less than or equal to the number of neighbors, right? Or is it the percentage of synthetic samples that are considered?
For example:
If I put 10 neighbors and 200% what will happen?
Can anyone give me some examples of correct use?
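The nearestNeighbors parameter says how many nearest neighbor instances (surrounding the currently considered instance) are candidates for building an in-between synthetic instance. The default value is 5. Thus, for each real instance, its 5 nearest neighbors are looked up, and a new synthetic instance is computed by interpolating between the real instance and one of those neighbors, as described in the original paper. This parameter does not control how many synthetic instances are created.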
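To make the role of the neighbors concrete, here is a minimal sketch of the interpolation step in plain Python. It is not Weka's code: the function names, the toy data and the use of Euclidean distance on numeric attributes are assumptions made for illustration only.

```python
import math
import random

def k_nearest(target, points, k):
    """Return the k points closest to target (Euclidean), excluding target itself."""
    others = [p for p in points if p is not target]
    return sorted(others, key=lambda p: math.dist(target, p))[:k]

def synthesize(target, neighbors, rng):
    """Build one synthetic point between target and one randomly chosen neighbor."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random position on the segment, in [0, 1)
    return tuple(t + gap * (n - t) for t, n in zip(target, neighbor))

# Hypothetical minority-class instances with two numeric attributes.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (5.0, 5.0), (0.9, 0.8)]

rng = random.Random(42)
base = minority[0]
neighbors = k_nearest(base, minority, k=3)  # plays the role of nearestNeighbors
new_point = synthesize(base, neighbors, rng)
print(neighbors)
print(new_point)
```

A larger k only widens the pool of neighbors a synthetic instance can be drawn towards; it does not change how many instances are produced.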
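The percentage parameter says how many synthetic instances are created, relative to the size of the class being oversampled. By default this is the class with the fewest instances; you can select a different class, such as the majority class, by setting the -C option (the class value index). The default value is 100. This means that if you have 25 instances in your minority class, another 25 instances are created synthetically from them (using their nearest neighbors' values). With 200%, 50 synthetic instances are created, and so on. The percentage is therefore independent of nearestNeighbors and does not have to be less than or equal to it: with 10 neighbors and 200%, each minority instance yields two synthetic instances, each built using one of its 10 nearest neighbors.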
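The arithmetic can be written down as a short Python sketch. The function name is hypothetical, and truncating to a whole number for percentages that are not multiples of 100 is an assumption, not necessarily what Weka does in that case.

```python
def synthetic_count(class_size, percentage):
    """Number of synthetic instances for a class of the given size.

    Assumption: fractional results are truncated to a whole number.
    """
    return int(class_size * percentage / 100)

minority_size = 25
for pct in (100, 200, 300):
    created = synthetic_count(minority_size, pct)
    print(f"{pct}%: {created} synthetic, {minority_size + created} minority instances in total")
```

Note that nearestNeighbors does not appear in this calculation at all.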
For further information, refer to the Weka documentation of the SMOTE filter and to the original paper by Chawla et al. (2002), where the whole method is explained in depth.
In my experience, the Weka SMOTE filter on its own only oversamples. So you can additionally apply the supervised SpreadSubsample filter afterwards to undersample the majority class instances.
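As a rough illustration of what undersampling after SMOTE achieves, here is a plain random undersampling sketch in Python. It is not the SpreadSubsample implementation; the function name, the labels and the class counts are made up for the example.

```python
import random
from collections import Counter

def undersample_class(labels, target_label, keep, seed=1):
    """Return sorted indices to keep after randomly dropping instances of target_label."""
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target_label]
    other_idx = [i for i, y in enumerate(labels) if y != target_label]
    kept = rng.sample(target_idx, min(keep, len(target_idx)))
    return sorted(other_idx + kept)

# Hypothetical class counts after oversampling: 100 majority, 50 minority.
labels = ["neg"] * 100 + ["pos"] * 50
kept_idx = undersample_class(labels, "neg", keep=50)
balanced = Counter(labels[i] for i in kept_idx)
print(balanced)
```

Combining both steps lets you meet in the middle instead of generating a very large number of synthetic instances.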