SVM Classification - minimum number of input sets for each class

machine-learning classification svm training-data

Amol Joshi · Feb 17, 2010 · Viewed 9.2k times · Source

I'm trying to build an app that detects which images on a webpage are advertisements. Once I detect them, I won't allow them to be displayed on the client side.

Based on the help I got on this Stack Overflow question, I decided that an SVM is the best approach for my goal.

So I have coded an SVM and SMO myself. The dataset, which I got from the UCI data repository, has 3280 instances ( Link to Dataset ). Around 400 of them belong to the class representing advertisement images, and the rest represent non-advertisement images.

Right now I'm taking the first 2800 input sets and training the SVM on them. But after looking at the accuracy rate I realised that most of those 2800 input sets are from the non-advertisement class, so I'm getting very good accuracy for that class only.

So what can I do here? About how many input sets should I give the SVM for training, and how many of them from each class?

Thanks. Cheers. (I made a new question because the context is different from my previous question, Optimization of Neural Network input data.)

Thanks for the reply. I want to check whether I'm deriving the C values for the ad and non-ad classes correctly. Please give me feedback on this.

[image]

Or you can see the doc version here.

You can see the graph for y1 equal to y2 here: [image]

and for y1 not equal to y2 here: [image]

Answer

There are two ways of going about this. One would be to balance the training data so it includes an equal number of advertisement and non-advertisement images. This could be done by either oversampling the 400 advertisement images or undersampling the thousands of non-advertisement images. Since training time can increase dramatically with the number of data points used, you should probably first try undersampling the non-advertisement images and create a training set with the 400 ad images and 400 randomly selected non-advertisements.
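As a rough illustration of the undersampling option, here is a minimal Python sketch. The function name `undersample`, the +1/-1 label encoding, and the toy data (400 ads and 2880 non-ads, approximating the counts in the question) are assumptions made for the example, not part of the original dataset or code.

```python
import random


def undersample(samples, labels, minority_label, seed=0):
    """Keep every minority sample plus an equally sized random
    subset of the majority samples, then shuffle the result."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Draw as many majority samples as there are minority samples.
    kept = rng.sample(majority, min(len(minority), len(majority)))
    indices = minority + kept
    # Shuffle so the training set is not ordered by class.
    rng.shuffle(indices)
    return [samples[i] for i in indices], [labels[i] for i in indices]


# Toy stand-in for the dataset: +1 = ad, -1 = non-ad (assumed encoding).
labels = [1] * 400 + [-1] * 2880
samples = list(range(len(labels)))  # placeholders for feature vectors

X_balanced, y_balanced = undersample(samples, labels, minority_label=1)
```

Drawing the non-ads at random, rather than taking the first rows of the file, also avoids the problem described in the question, where the first 2800 instances came mostly from one class.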

The other solution would be to use a weighted SVM, so that margin errors for the ad images are penalised more heavily than those for non-ads. In the LIBSVM package this is done with the -wi flag. From your description of the data (roughly 2880 non-ads to 400 ads), you could try weighting the ad images about 7 times more heavily than the non-ads.
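The factor of about 7 is simply the ratio of the class sizes. A small sketch of that arithmetic follows, assuming labels +1 for ads and -1 for non-ads and the approximate counts from the question; the helper names `class_weights` and `libsvm_weight_flags` are hypothetical. In LIBSVM's `svm-train`, `-wi weight` sets the C parameter of class i to weight*C, so the result can be passed as one flag per class label.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class by (size of largest class) / (size of this class),
    so the largest class gets weight 1 and rarer classes get more."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


def libsvm_weight_flags(weights):
    """Format the weights as svm-train options, one -w<label> per class."""
    ordered = sorted(weights.items(), reverse=True)
    return " ".join(f"-w{label} {weight:g}" for label, weight in ordered)


# Approximate counts from the question: +1 = ad, -1 = non-ad (assumed encoding).
labels = [1] * 400 + [-1] * 2880

weights = class_weights(labels)
flags = libsvm_weight_flags(weights)
print(flags)  # -w1 7.2 -w-1 1
```

The inverse-frequency ratio is only a starting point; the weight for the ad class is usually worth tuning together with C on a held-out set.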
