Bag of words training and testing opencv, matlab

Mario picture Mario · Jul 22, 2012 · Viewed 14.7k times · Source

I'm implementing Bag Of Words in opencv by using SIFT features in order to make a classification for a specific dataset. So far, I have been apple to cluster the descriptors and generate the vocabulary. As I know, I have to train SVM ... but i have some questions which i'm really confused about them. The major problem is the concept behind the implementations, these are my questions:

1- When I extract the features and then create the vocabulary, shall I extract the features for all the objects (let's say 5 objects)and put them in one file, so I make all of them in a one vocabulary file that has all the words? and how I will separate them later on when I do the classification?

2- How to implement the SVM? I know the functions that are used in openCV but how?

3- I can do the work in MATLAB, which I mean the implementation of the SVM training, but is there any code available that can guide me through my work? I have seen the code used by Andrea Vedaldi, here but he is working only with one class each time and another issue that he is not showing how to create the .mat file that he's using in his exercises. All other implementations that I could find, they are not using SVM. So, can you guide in this point too!

Thank you

Answer

Alceu Costa picture Alceu Costa · Jul 24, 2012

Local features

When you work with SIFT, you usually want to extract local features. What does that means? You have your image and from this image you will locate points from which you will extract local feature vectors. A local feature vector is just a vector consisting of numerical values that describes the visual information of the image region from which it was extracted. Although the number of local feature vectors that you can extract from image A does not need to be the same as the number of feature vectors that you can extract from image B, the number components of a local feature vector (i.e. its dimensionality) is always the same.

Now, if you want to use your local feature vectors to classify images you have a problem. In traditional image classification, each image is described by a global feature vector, which, in the context of machine learning, can be seen as a set of numerical attributes. However, when you extract a set of local feature vectors you don't have a global representation of each image which is required for image classification. A technique that can be employed to solve this problem is the bag of words, also known as bag of visual words (BoW).

Bag of visual words

Here's the (very) simplified BoW algorithm:

  1. Extract the SIFT local feature vectors from your set of images;

  2. Put all this local feature vectors into a single set. At this point you don't even need to store from which image each local feature vector was extracted;

  3. Apply a clustering algorithm (e.g. k-means) over the set of local feature vectors in order to find centroid coordinates and assign an id to each centroid. This set of centroids will be your vocabulary;

  4. The global feature vector will be a histogram that counts how many times each centroid occurred in each image. To compute the histogram find the nearest centroid for each local feature vector.

Image Classification

Here I am assuming that your problem is the following:

You have as input a set of labeled images and a set of non-labeled images which you want to assign a label based on its visual appearance. Suppose your problem is to classify landscape photography. You image labels could be, for example, “mountains”, “beach” or “forest”.

The global feature vector extracted from each image (i.e. its bag of visual words) can be seen as a set of numerical attributes. This set of numerical attributes representing the visual characteristics of each image and the corresponding image labels can be used to train classifier. For example, you could use a data mining software such as Weka, which has an implementation of SVM, known as SMO, to solve your problem.

Basically, you only have to format the global feature vectors and corresponding image labels according to the ARFF file format, which is, basically, a CSV of global feature vectors followed by image label.