Sequential feature selection in MATLAB

Mohamad Ibrahim · Nov 27, 2011

Can somebody explain how to use the MATLAB function `sequentialfs`?

It looks straightforward, but I do not know how to design a function handle for it. Any clue?

Answer

Sam Roberts · Nov 28, 2011

Here's a simpler example than the one in the documentation.

First, let's create a very simple dataset. We have some class labels y: 500 are from class 0 and 500 are from class 1, in random order.

>> y = [zeros(500,1); ones(500,1)];
>> y = y(randperm(1000));

And we have 100 variables x that we want to use to predict y. 99 of them are just random noise, but one of them is highly correlated with the class label.

>> x = rand(1000,99);
>> x(:,100) = y + rand(1000,1)*0.1;

Now let's say we want to classify the points using linear discriminant analysis. If we were to do this directly without applying any feature selection, we would first split the data up into a training set and a test set:

>> xtrain = x(1:700, :); xtest = x(701:end, :);
>> ytrain = y(1:700); ytest = y(701:end);

Then we would classify them:

>> ypred = classify(xtest, xtrain, ytrain);

And finally we would measure the error rate of the prediction:

>> sum(ytest ~= ypred)
ans =
     0

and in this case we get perfect classification.
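(As an aside, in newer MATLAB releases the classify function is superseded by fitcdiscr from the Statistics and Machine Learning Toolbox. A hedged equivalent of the step above, assuming that toolbox is available, would be:)

```matlab
% Equivalent prediction using the newer fitcdiscr interface
% (Statistics and Machine Learning Toolbox, R2014a and later)
mdl = fitcdiscr(xtrain, ytrain);   % train a linear discriminant model
ypred = predict(mdl, xtest);       % predict class labels for the test set
sum(ytest ~= ypred)                % misclassification count, as before
```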

To make a function handle that sequentialfs can use, just put these pieces together. sequentialfs expects a function of the form fun(xtrain, ytrain, xtest, ytest) that returns a scalar criterion to be minimized; here, the number of misclassified test points:

>> f = @(xtrain, ytrain, xtest, ytest) sum(ytest ~= classify(xtest, xtrain, ytrain));

Then pass it, together with the data, into sequentialfs:

>> fs = sequentialfs(f,x,y)
fs =
  Columns 1 through 16
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  Columns 17 through 32
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  Columns 33 through 48
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  Columns 49 through 64
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  Columns 65 through 80
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  Columns 81 through 96
     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  Columns 97 through 100
     0     0     0     1

The final 1 in the output indicates that variable 100 is, as expected, the best predictor of y among the variables in x.
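Since fs is a logical row vector, you can read off the indices of the selected columns directly with find:

```matlab
% fs is a 1-by-100 logical vector; find returns the selected column indices
selected = find(fs);
% selected is 100 in this example: only the last variable was chosen
```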

The example in the documentation for sequentialfs is a little more complex, mostly because the predicted class labels there are strings rather than numeric values, so ~strcmp is used to calculate the error rate rather than ~=. It also uses cross-validation to estimate the error rate, rather than the direct evaluation above.
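As a sketch of that cross-validated variant (not the documentation's exact code), you can supply your own partition through the 'cv' option of sequentialfs, using cvpartition; the variable names below are carried over from the example above:

```matlab
% 10-fold cross-validation, stratified by the class labels in y
c = cvpartition(y, 'k', 10);
opts = statset('display', 'iter');   % optionally print progress at each step
fs = sequentialfs(f, x, y, 'cv', c, 'options', opts);
```

With this call, sequentialfs evaluates the criterion function f on each of the 10 train/test splits and sums the misclassification counts, giving a more robust error estimate than a single fixed split.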