I have a set of Book objects. The Book class is defined as follows:
import java.util.ArrayList;
class Book {
    String title;
    ArrayList<String> taglist; // tags assumed here to be plain strings
}
Here title is the title of the book, for example "Javascript for dummies",
and taglist is the list of tags for it, in our example: Javascript, jquery, "web dev", ...
The books cover different subjects: IT, biology, history, ... Each book has a title and a set of tags describing it.
I have to automatically classify those books into separate sets by topic, for example:
IT BOOKS :
HISTORY BOOKS :
BIOLOGY BOOKS :
Do you know a classification algorithm or method suited to this kind of problem?
One solution is to use an external API to determine the category of the text, but the problem is that the books are in different languages: French, Spanish, English, ...
This looks like a reasonably straightforward keyword-based classification task. Since you're using Java, good packages to consider for this would be Classifier4J, Weka, or Lucene Mahout.
Classifier4J
Classifier4J supports classification using naive Bayes and a vector space model.
As seen in this source code snippet on training and scoring using its naive Bayes classifier, the package is reasonably easy to use. It's also distributed under the liberal Apache Software License.
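To make the naive Bayes idea concrete, here is a minimal sketch in plain Java that treats each tag as a keyword. It does not use Classifier4J's API; the class name TagNaiveBayes, its methods, and the sample topics and tags are all hypothetical, and it assumes you have a few hand-labeled books per topic to train on.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of Classifier4J): multinomial naive Bayes over book tags.
class TagNaiveBayes {
    private final Map<String, Map<String, Integer>> tagCounts = new HashMap<>(); // topic -> tag -> count
    private final Map<String, Integer> totalTags = new HashMap<>();  // topic -> number of tags seen
    private final Map<String, Integer> bookCounts = new HashMap<>(); // topic -> number of training books
    private final Set<String> vocabulary = new HashSet<>();          // every distinct tag seen in training
    private int totalBooks = 0;

    // Record one hand-labeled book.
    void train(String topic, List<String> tags) {
        bookCounts.merge(topic, 1, Integer::sum);
        totalBooks++;
        Map<String, Integer> counts = tagCounts.computeIfAbsent(topic, t -> new HashMap<>());
        for (String raw : tags) {
            String tag = raw.toLowerCase(Locale.ROOT);
            counts.merge(tag, 1, Integer::sum);
            totalTags.merge(topic, 1, Integer::sum);
            vocabulary.add(tag);
        }
    }

    // Return the topic with the highest log-posterior, or null if nothing was trained.
    String classify(List<String> tags) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String topic : bookCounts.keySet()) {
            // Log prior: share of training books that belong to this topic.
            double score = Math.log((double) bookCounts.get(topic) / totalBooks);
            Map<String, Integer> counts = tagCounts.get(topic);
            int total = totalTags.getOrDefault(topic, 0);
            for (String raw : tags) {
                int count = counts.getOrDefault(raw.toLowerCase(Locale.ROOT), 0);
                // Add-one smoothing; the extra +1 reserves mass for tags never seen in training.
                score += Math.log((count + 1.0) / (total + vocabulary.size() + 1));
            }
            if (score > bestScore) {
                bestScore = score;
                best = topic;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TagNaiveBayes nb = new TagNaiveBayes();
        nb.train("IT", Arrays.asList("Javascript", "jquery", "web dev"));
        nb.train("IT", Arrays.asList("java", "programming"));
        nb.train("HISTORY", Arrays.asList("rome", "empire", "war"));
        nb.train("BIOLOGY", Arrays.asList("cells", "genetics"));
        System.out.println(nb.classify(Arrays.asList("jquery", "programming"))); // prints "IT"
    }
}
```

Because the sketch only counts tags, it works the same way for French, Spanish, or English books, as long as the training set contains tags in each of those languages.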
Weka
Weka is a very popular tool for data mining. An advantage of using it is that you can readily experiment with many different machine learning models for categorizing the books into topics. These include naive Bayes, decision trees, support vector machines, k-nearest neighbor, logistic regression, and even a rule-set-based learner.
You'll find a tutorial on using Weka for text categorization here.
Weka is, however, distributed under the GPL, so you won't be able to use it in closed source software that you want to distribute. You could still use it to back a web service.
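As an illustration of one of the simpler models in that list, here is a rough sketch of k-nearest neighbor over tag sets in plain Java. It is not Weka code: the class TagKnn, the choice of Jaccard similarity as the distance measure, and the sample data are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not Weka code): k-nearest neighbor over book tag sets.
class TagKnn {
    // One labeled training book: its topic plus its normalized tag set.
    private static final class Example {
        final String topic;
        final Set<String> tags;
        Example(String topic, Set<String> tags) { this.topic = topic; this.tags = tags; }
    }

    private final List<Example> examples = new ArrayList<>();
    private final int k;

    TagKnn(int k) { this.k = k; }

    // Lower-case and trim tags so that "Jquery" and "jquery " compare equal.
    private static Set<String> normalize(List<String> tags) {
        Set<String> out = new HashSet<>();
        for (String t : tags) out.add(t.toLowerCase(Locale.ROOT).trim());
        return out;
    }

    void add(String topic, List<String> tags) {
        examples.add(new Example(topic, normalize(tags)));
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return (double) inter.size() / union;
    }

    // Majority vote among the k most similar training books; null if there are none.
    String classify(List<String> tags) {
        Set<String> query = normalize(tags);
        List<Example> sorted = new ArrayList<>(examples);
        // Most similar first; the sort is stable, so ties keep insertion order.
        sorted.sort((x, y) -> Double.compare(jaccard(y.tags, query), jaccard(x.tags, query)));
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (Example e : sorted.subList(0, Math.min(k, sorted.size()))) {
            int v = votes.merge(e.topic, 1, Integer::sum);
            if (best == null || v > votes.get(best)) best = e.topic;
        }
        return best;
    }

    public static void main(String[] args) {
        TagKnn knn = new TagKnn(3);
        knn.add("IT", Arrays.asList("Javascript", "jquery", "web dev"));
        knn.add("IT", Arrays.asList("java", "programming", "web dev"));
        knn.add("HISTORY", Arrays.asList("rome", "empire"));
        knn.add("BIOLOGY", Arrays.asList("cells", "genetics"));
        System.out.println(knn.classify(Arrays.asList("jquery", "web dev", "css"))); // prints "IT"
    }
}
```

With Weka itself you would not write this by hand; the point is only to show what such a learner does with tag data before you start comparing models.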
Lucene Mahout
Mahout is designed for doing machine learning on very large datasets. It's built on top of Apache Hadoop and supports supervised classification using naive Bayes.
You'll find a tutorial covering how to use Mahout for text classification here.
Like Classifier4J, Mahout is distributed under the liberal Apache Software License.