Prepare data for text classification using Scikit Learn SVM

user1906856 · Dec 18, 2012 · Viewed 31k times

I'm trying to apply the SVM from scikit-learn to classify tweets I collected. There are two categories, call them A and B. For now, I have all the tweets sorted into two text files, 'A.txt' and 'B.txt'. However, I'm not sure what type of input data the scikit-learn SVM expects. I have a dictionary with the labels (A and B) as its keys and, as values, a dictionary of features (unigrams) and their frequencies. Sorry, I'm really new to machine learning and not sure what I should do to get the SVM to work. I found that the SVM takes a numpy.ndarray as its input data. Do I need to create one from my own data? Should it look something like this?

Labels    features    frequency
  A        'book'        54
  B       'movies'       32

Any help is appreciated.

Answer

ogrisel · Dec 18, 2012

Have a look at the documentation on text feature extraction.
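To make the expected input shape concrete, here is a minimal sketch of feature extraction. It assumes the tweets are available as Python lists (the `tweets_a` and `tweets_b` lists below are made-up stand-ins for the lines of 'A.txt' and 'B.txt'). The point it illustrates is that the classifier wants one row per tweet and one column per unigram, plus a separate list of labels, rather than one row per (label, feature) pair as in the table in the question.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the lines read from A.txt and B.txt.
tweets_a = ["I loved this book", "reading a good book tonight"]
tweets_b = ["great movies this weekend", "two movies and popcorn"]

# One entry per tweet, with a parallel list of labels.
texts = tweets_a + tweets_b
labels = ["A"] * len(tweets_a) + ["B"] * len(tweets_b)

# CountVectorizer tokenizes the raw text and counts unigrams.
# The result is a scipy sparse matrix of shape (n_tweets, n_unigrams).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# If unigram counts are already stored as one dict per tweet,
# DictVectorizer builds the same kind of matrix from them.
per_tweet_counts = [{"book": 1, "loved": 1}, {"movies": 2}]
dict_vectorizer = DictVectorizer()
X_from_dicts = dict_vectorizer.fit_transform(per_tweet_counts)
```

Note that a single dictionary of frequencies per category, as described in the question, collapses all tweets of a class into one sample; the counts would need to be kept per tweet for either vectorizer to be useful.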

Also have a look at the text classification example.

There is also a tutorial here:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In particular, don't focus too much on SVM models (especially not sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
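As a rough sketch of how those simpler models can be tried side by side, the snippet below wraps each one in a pipeline with a vectorizer. The tiny `texts` and `labels` lists are invented for illustration, the choice of `TfidfVectorizer` is an assumption rather than a recommendation from this answer, and `LinearSVC` is included only as a linear SVM to compare against, not as something named above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data: one string per tweet and one label per tweet.
texts = [
    "I loved this book",
    "reading a good book tonight",
    "great movies this weekend",
    "two movies and popcorn",
]
labels = ["A", "A", "B", "B"]

# Candidate linear models; LinearSVC is an addition for comparison.
models = {
    "Perceptron": Perceptron(),
    "LogisticRegression": LogisticRegression(),
    "BernoulliNB": BernoulliNB(),
    "LinearSVC": LinearSVC(),
}

predictions = {}
for name, classifier in models.items():
    # The pipeline vectorizes raw text and then fits the classifier,
    # so the same object can later be applied to unseen raw tweets.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(texts, labels)
    predictions[name] = pipeline.predict(["a new book to read"])
```

With real data, the comparison should be made on held-out tweets (for example with `sklearn.model_selection.cross_val_score`) rather than on a single hand-written example.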