text classification methods? SVM and decision tree

zsh picture zsh · Jul 2, 2013 · Viewed 9.6k times · Source

i have a training set and i want to use a classification method for classifying other documents according to my training set.my document types are news and categories are sports,politics,economic and so on.

i understand naive bayes and KNN completely but SVM and decision tree are vague and i dont know if i can implement this method by myself?or there is applications for using this methods?

what is the best method i can use for classifying docs in this way?

thanks!

Answer

Freya Ren picture Freya Ren · Jul 2, 2013
  • Naive Bayes

Though this is the simplest algorithm and everything is deemed independent, in real text classification case, this method work great. And I would try this algorithm first for sure.

  • KNN

KNN is for clustering rather than classification. I think you misunderstand the conception of clustering and classification.

  • SVM

SVM has SVC(classification) and SVR(Regression) algorithms to do class classification and prediction. It sometime works good, but from my experiences, it has bad performance in text classification, as it has high demands for good tokenizers (filters). But the dictionary of the dataset always has dirty tokens. The accuracy is really bad.

  • Random Forest (decision tree)

I've never try this method for text classification. Because I think decision tree need several key nodes, while it's hard to find "several key tokens" for text classification, and random forest works bad for high sparse dimensions.

FYI

These are all from my experiences, but for your case, you have no better ways to decide which methods to use but to try every algorithm to fit your model.

Apache's Mahout is a great tool for machine learning algorithms. It integrates three aspects' algorithms: recommendation, clustering, and classification. You could try this library. But you have to learn some basic knowledge about Hadoop.

And for machine learning, weka is a software toolkit for experiences which integrates many algorithms.