Support vector machine in Python using libsvm example of features

Usi Usi picture Usi Usi · Jun 23, 2015 · Viewed 8.1k times · Source

I have scraped a lot of ebay titles like this one:

Apple iPhone 5 White 16GB Dual-Core

and I have manually tagged all of them in this way

B M C S NA

where B=Brand (Apple) M=Model (iPhone 5) C=Color (White) S=Size (Size) NA=Not Assigned (Dual Core)

Now I need to train a SVM classifier using the libsvm library in python to learn the sequence patterns that occur in the ebay titles.

I need to extract new value for that attributes (Brand, Model, Color, Size) by considering the problem as a classification one. In this way I can predict new models.

I want to considering this features:

* Position
- from the beginning of the title
- to the end of the listing
* Orthographic features
- current word contains a digit
- current word is capitalized 
....

I can't understand how can I give all this info to the library. The official doc lacks a lot of information

My class are Brand, Model, Size, Color, NA

what does the input file of the SVM algo must contain?

how can I create it? could I have an example of that file considering the 4 features that I put as example in my question? Can I also have an example of the code that I must use to elaborate the input file ?

* UPDATE * I want to represent these features... How can I must do?

  1. Identity of the current word

I think that I can interpret it in this way

0 --> Brand
1 --> Model
2 --> Color
3 --> Size 
4 --> NA

If I know that the word is a Brand I will set that variable to 1 (true). It is ok to do it in the training test (because I have tagged all the words) but how can I do that for the test set? I don't know what is the category of a word (this is why I'm learning it :D).

  1. N-gram substring features of current word (N=4,5,6) No Idea, what does it means?

  2. Identity of 2 words before the current word. How can I model this feature?

Considering the legend that I create for the 1st feature I have 5^(5) combination)

00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44

How can I convert it to a format that the libsvm (or scikit-learn) can understand?

  1. Membership to the 4 dictionaries of attributes

Again how can I do it? Having 4 dictionaries (for color, size, model and brand) I thing that I must create a bool variable that I will set to true if and only if I have a match of the current word in one of the 4 dictionaries.

  1. Exclusive membership to dictionary of brand names

I think that like in the 4. feature I must use a bool variable. Do you agree?

Answer

Julia Schwarz picture Julia Schwarz · Jun 28, 2015

Here's a step-by-step guide for how to train an SVM using your data and then evaluate using the same dataset. It's also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At the url you can also see the output of the intermediate data and the resulting accuracy (it's an iPython notebook)

Step 0: Install dependencies

You need to install the following libraries:

  • pandas
  • scikit-learn

From command line:

pip install pandas
pip install scikit-learn

Step 1: Load the data

We will use pandas to load our data. pandas is a library for easily loading data. For illustration, we first save sample data to a csv and then load it.

We will train the SVM with train.csv and get test labels with test.csv

import pandas as pd

train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""


with open('train.csv', 'w') as output:
    output.write(train_data_contents)

train_dataframe = pd.read_csv('train.csv')

Step 2: Process the data

We will convert our dataframe into numpy arrays which is a format that scikit- learn understands.

We need to convert the labels "B", "M", "C",... to numbers also because svm does not understand strings.

Then we will train a linear svm with the data

import numpy as np

train_labels = train_dataframe.class_label
labels = list(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)

print "train labels: "
print train_labels
print 
print "train features:"
print train_features

We see here that the length of train_labels (5) exactly matches how many rows we have in trainfeatures. Each item in train_labels corresponds to a row.

Step 3: Train the SVM

from sklearn import svm
classifier = svm.SVC()
classifier.fit(train_features, train_labels)

Step 4: Evaluate the SVM on some testing data

test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""

with open('test.csv', 'w') as output:
    output.write(test_data_contents)

test_dataframe = pd.read_csv('test.csv')

test_labels = test_dataframe.class_label
labels = list(set(test_labels))
test_labels = np.array([labels.index(x) for x in test_labels])

test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)

results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
recall = num_correct / len(test_labels)
print "model accuracy (%): ", recall * 100, "%"

Links & Tips

You should be able to take this code and replace train.csv with your training data, test.csv with your testing data, and get predictions for your test data, along with accuracy results.

Note that since you're evaluating using the data you trained on the accuracy will be unusually high.