LDA topic modeling - Training and testing

tan · Jun 22, 2012 · Viewed 22k times

I have read about LDA, and I understand the mathematics of how the topics are generated when one inputs a collection of documents.

References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the "topics" expressed by the documents in that collection. Thus, by using the LDA algorithm and a Gibbs sampler (or Variational Bayes), I can input a set of documents and get the topics as output. Each topic is a set of terms with assigned probabilities.
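For concreteness, here is a minimal sketch of that "documents in, topics out" workflow in Python using gensim (the toy corpus and parameter choices are illustrative; gensim's `LdaModel` uses online Variational Bayes, but Gibbs-sampling implementations look the same from the user's side):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: each document is a list of tokens
# (real use needs proper tokenization, stopword removal, etc.).
docs = [
    ["cat", "dog", "pet", "animal"],
    ["dog", "leash", "walk", "pet"],
    ["stock", "market", "trade", "price"],
    ["price", "stock", "invest", "market"],
]

dictionary = Dictionary(docs)                       # term <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Each topic is a set of terms with assigned probabilities, as described above.
for topic_id, terms in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [(term, round(prob, 3)) for term, prob in terms])
```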

What I don't understand is: if the above is true, why do many topic modeling tutorials talk about separating the dataset into training and test sets?

Can anyone explain to me the steps (the basic concept) of how LDA can be used to train a model, which can then be used to analyze a separate test dataset?

Answer

gregamis · Jun 26, 2012

Splitting the data into training and testing sets is a common step in evaluating the performance of a learning algorithm. It's more clear-cut for supervised learning, wherein you train the model on the training set, then see how well its classifications on the test set match the true class labels. For unsupervised learning, such evaluation is a little trickier. In the case of topic modeling, a common measure of performance is perplexity. You train the model (like LDA) on the training set, and then you see how "perplexed" the model is on the testing set. More specifically, you measure how well the word counts of the test documents are represented by the topics' word distributions.
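As a rough sketch of that procedure, reusing the `corpus` and `dictionary` from the gensim example above (the train/test split shown here is an arbitrary placeholder; gensim's `log_perplexity` returns a per-word likelihood bound, which it converts to perplexity as 2^(-bound)):

```python
import numpy as np
from gensim.models import LdaModel

# Hold out some documents for testing, train on the rest.
train_corpus, test_corpus = corpus[:-2], corpus[-2:]

lda = LdaModel(corpus=train_corpus, id2word=dictionary, num_topics=2, passes=10)

# Per-word likelihood bound on the held-out documents (higher is better);
# the corresponding perplexity is 2 ** (-bound), so lower perplexity is better.
bound = lda.log_perplexity(test_corpus)
print("per-word bound:", bound)
print("perplexity:", np.exp2(-bound))
```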

Perplexity is good for relative comparisons between models or parameter settings, but its numeric value doesn't really mean much. I prefer to evaluate topic models using the following, somewhat manual, evaluation process:

  1. Inspect the topics: Look at the highest-likelihood words in each topic. Do they sound like they form a cohesive "topic", or just some random group of words?
  2. Inspect the topic assignments: Hold out a few random documents from training and see what topics LDA assigns to them. Manually inspect the documents and the top words in the assigned topics. Does it look like the topics really describe what the documents are actually talking about? (Both steps are sketched in code after this list.)
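A quick sketch of both inspection steps, continuing the gensim example from above (the held-out document is hypothetical):

```python
# 1. Inspect the topics: print the top words in each learned topic.
for topic_id in range(lda.num_topics):
    print("topic", topic_id, ":", lda.print_topic(topic_id, topn=10))

# 2. Inspect the topic assignments for an unseen document.
held_out = ["stock", "price", "market", "crash"]  # hypothetical unseen document
bow = dictionary.doc2bow(held_out)
for topic_id, prob in lda.get_document_topics(bow):
    print("topic", topic_id, "weight", round(prob, 3))
```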

I realize that this process isn't as nice and quantitative as one might like, but to be honest, the applications of topic models are rarely quantitative either. I suggest evaluating your topic model according to the problem you're applying it to.

Good luck!