I want a machine to learn to categorize short texts

atp · Apr 23, 2010 · Viewed 9k times

I have a ton of short stories about 500 words long and I want to categorize them into one of, let's say, 20 categories:

  • Entertainment
  • Food
  • Music
  • etc

I can hand-classify a bunch of them, but eventually I want to use machine learning to guess the categories. What's the best way to approach this? Is there a standard machine learning approach I should be using? I don't think a decision tree would work well since it's text data... I'm completely new to this field.

Any help would be appreciated, thanks!

Answer

bayer · Apr 23, 2010

A naive Bayes classifier will most probably work for you. The method is like this:

  • Fix a number of categories and get a training data set of (document, category) pairs.
  • The data vector for a document will be something like a bag of words. For example, take the 100 most common words, excluding words like "the" and "and". Each word gets a fixed component of your data vector (e.g. "food" is position 5). A feature vector is then an array of booleans, each indicating whether that word occurs in the corresponding document.
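
A minimal sketch of that feature extraction, in case it helps. The names `STOP_WORDS`, `tokenize`, `build_vocabulary` and `to_feature_vector` are made up for illustration, and the stop-word list is a tiny placeholder, not a real one:

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, size=100):
    """Return the `size` most common non-stop words across all documents."""
    counts = Counter(
        word
        for doc in documents
        for word in tokenize(doc)
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(text, vocabulary):
    """Boolean vector: element i is True if vocabulary[i] occurs in the text."""
    words = set(tokenize(text))
    return [word in words for word in vocabulary]

# Toy usage: two tiny "documents" and a vocabulary of three words.
docs = ["The pizza and the pasta", "A guitar solo and a drum solo"]
vocab = build_vocabulary(docs, size=3)
vec = to_feature_vector("pasta with a guitar", vocab)
```

The vocabulary is built once from the training documents and then reused unchanged for every new document, so that position i always refers to the same word.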

Training:

  • From your training set, calculate the prior probability of every class: P(C) = number of documents of class C / total number of documents.
  • Calculate the probability of a feature in a class: P(F|C) = number of documents of the class with the given feature (e.g. the word "food" is in the text) / number of documents in the class.
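
The two counting steps above could be sketched roughly as follows. The function name `train` and the `smoothing` parameter are my own additions: with `smoothing=0` the code computes exactly the two ratios above, while the default of 1 (add-one smoothing) is an assumption that keeps a word never seen in a class from getting a probability of exactly 0.

```python
from collections import Counter, defaultdict

def train(examples, smoothing=1.0):
    """Estimate P(C) and P(F_i|C) from (feature_vector, category) pairs.

    `smoothing` is an assumption, not part of the recipe above: it is added
    to the counts so that no estimated probability is exactly 0 or 1.
    """
    class_counts = Counter(category for _, category in examples)
    n_features = len(examples[0][0])

    # feature_counts[c][i] = number of documents of class c containing word i
    feature_counts = defaultdict(lambda: [0] * n_features)
    for features, category in examples:
        for i, present in enumerate(features):
            if present:
                feature_counts[category][i] += 1

    total = len(examples)
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {
        c: [
            (feature_counts[c][i] + smoothing) / (class_counts[c] + 2 * smoothing)
            for i in range(n_features)
        ]
        for c in class_counts
    }
    return priors, likelihoods

# Toy training set: two boolean features, two classes.
examples = [
    ([True, False], "Food"),
    ([True, True], "Food"),
    ([False, True], "Music"),
]
priors, likelihoods = train(examples)
```

The `2 * smoothing` in the denominator reflects that each feature has two possible values (word present or absent).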

Decision:

  • Given an unclassified document with features F1, ..., Fn (one per word in your vocabulary), the probability P(C|F1, ..., Fn) of it belonging to class C is proportional to P(C) * P(F1|C) * P(F2|C) * ... * P(Fn|C). For a word that does not occur in the document, use 1 - P(F|C) as its factor. Pick the C that maximizes this term.
  • Since multiplying many small probabilities can underflow numerically, you can use the sum of the logs instead, which is maximized at the same C: log P(C) + log P(F1|C) + log P(F2|C) + ... + log P(Fn|C).
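
A rough sketch of the decision step using the log form. The function name `classify` and the example numbers are made up for illustration, and the code assumes every P(F|C) is strictly between 0 and 1 (for instance because the counts were smoothed), since log(0) is undefined:

```python
import math

def classify(features, priors, likelihoods):
    """Return the class with the highest log-score for a boolean feature vector."""
    best_class, best_score = None, -math.inf
    for category, prior in priors.items():
        score = math.log(prior)
        for present, p in zip(features, likelihoods[category]):
            # The factor is p if the word is present and 1 - p if it is absent.
            score += math.log(p if present else 1.0 - p)
        if score > best_score:
            best_class, best_score = category, score
    return best_class

# Hypothetical, already-estimated parameters for two classes and two words.
priors = {"Food": 0.5, "Music": 0.5}
likelihoods = {"Food": [0.8, 0.1], "Music": [0.2, 0.7]}
label = classify([True, False], priors, likelihoods)
```

Here the first word is typical of "Food" documents and the second of "Music" documents, so a document containing only the first word is assigned to "Food".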