I have over 15,000 text documents on one specific topic. I would like to build a model from them so that I can present it with new, arbitrary text documents on various topics and have it tell me whether each new document is on the same topic.
I tried out sklearn.naive_bayes.MultinomialNB, sklearn.svm.classes.LinearSVC and others, but I have the following problem:
These algorithms require training data with more than one label or category, and I only have web pages covering one specific topic. The other documents are unlabeled and cover many different topics.
I would appreciate any guidance on how to train a model with only one label, or on how to proceed in general. What I have so far is:
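from sklearn.naive_bayes import MultinomialNB

# X_train, y_train and X_test are the vectorized documents and their labels
c = MultinomialNB()
c.fit(X_train, y_train)
c.predict(X_test)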
Thank you very much.
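What you're looking for is one-class classification, which scikit-learn provides as sklearn.svm.OneClassSVM. It is fitted on examples of a single class only and then flags new samples as either similar to that class or as outliers. For more information, check the corresponding scikit-learn documentation.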
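
As a rough illustration, here is a minimal sketch of how that could look for text. The TF-IDF features, the linear kernel, the value of nu, and the toy documents are all assumptions made for the example, not something taken from your setup; with real data you would fit on your 15,000 on-topic documents and tune nu (roughly, the fraction of training documents you are willing to treat as outliers).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for the on-topic training documents (one class only).
on_topic_docs = [
    "training a support vector machine classifier on text data",
    "naive bayes is a simple classifier for text documents",
    "feature extraction with tf-idf for document classification",
    "cross validation helps to evaluate a machine learning classifier",
    "a linear classifier can separate documents by topic",
    "text classification with machine learning models",
    "evaluating a classifier on held out text documents",
    "machine learning models need training data",
]

# Hypothetical new documents of unknown topic.
new_docs = [
    "which classifier works best for text classification",
    "a recipe for baking chocolate cake with butter and sugar",
]

# Vectorize the text, then fit a one-class SVM on the single known class.
# No labels are passed to fit(): every training document is assumed on-topic.
model = make_pipeline(
    TfidfVectorizer(),
    OneClassSVM(kernel="linear", nu=0.1),  # nu is an assumed starting value
)
model.fit(on_topic_docs)

# predict() returns +1 for documents judged similar to the training class
# and -1 for documents judged to be outliers (off-topic).
preds = model.predict(new_docs)
for doc, label in zip(new_docs, preds):
    print(label, doc)
```

The results depend heavily on nu, the kernel, and the features, so it is worth checking the predictions against a small hand-labeled sample of on-topic and off-topic documents before relying on them.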