What's the disadvantage of LDA for short texts?

Shuguang Zhu · Apr 22, 2015 · Viewed 11.8k times

I am trying to understand why Latent Dirichlet Allocation (LDA) performs poorly in short-text environments like Twitter. I've read the paper 'A biterm topic model for short text', but I still do not understand "the sparsity of word co-occurrences".

From my point of view, the generative part of LDA is reasonable for any kind of text, but what causes bad results in short texts is the sampling procedure. My guess is that LDA samples a topic for a word based on two parts: (1) the topics of the other words in the same document, and (2) the topic assignments of other occurrences of this word. Since part (1) of a short text cannot reflect the document's true topic distribution, this causes a poor topic assignment for each word.
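To make these two parts concrete, here is a minimal sketch of the standard collapsed Gibbs sampling update for LDA as I understand it (my own illustration with made-up names and counts, not code from any particular implementation); `n_dk` carries part (1) and `n_kw` carries part (2):

```python
import numpy as np

def gibbs_update(d, w, n_dk, n_kw, n_k, alpha, beta, V):
    """Sample a new topic for one occurrence of word w in document d.

    n_dk[d, k]: words in doc d assigned to topic k        -> part (1)
    n_kw[k, w]: occurrences of word w assigned to topic k -> part (2)
    n_k[k]:     total words assigned to topic k
    (Counts are assumed to already exclude the current token.)
    """
    # P(z = k) ∝ (doc-topic count + alpha) * (topic-word count + beta) / (topic total + V*beta)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    p /= p.sum()
    return np.random.choice(len(p), p=p)

# Toy example: 2 docs, 3 topics, vocabulary of 4 words (all counts made up).
n_dk = np.array([[3., 1., 0.], [0., 2., 2.]])
n_kw = np.array([[2., 1., 0., 1.], [1., 2., 0., 0.], [0., 0., 1., 1.]])
n_k = n_kw.sum(axis=1)
print(gibbs_update(d=0, w=1, n_dk=n_dk, n_kw=n_kw, n_k=n_k, alpha=0.1, beta=0.01, V=4))
```

In a tweet-length document, `n_dk[d]` is built from only a handful of tokens, which is why I suspect part (1) is so noisy.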

If you have any ideas about this question, please feel free to post them and help me understand.

Answer

Xiaohui Yan · Apr 28, 2015

Probabilistic models such as LDA exploit statistical inference to discover latent patterns in data. In short, they infer model parameters from observations. For instance, imagine a black box containing many balls of different colors. You draw some balls out of the box and then infer the distribution of the colors. That is a typical process of statistical inference, and its accuracy depends on the number of observations you have.
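A rough simulation of this analogy (the color proportions here are made up for illustration) shows how the estimation error shrinks as the number of drawn balls grows:

```python
import numpy as np

rng = np.random.default_rng(0)
true_dist = np.array([0.5, 0.3, 0.2])  # true proportions of red, green, blue balls

for n in (10, 100, 10000):
    draws = rng.choice(3, size=n, p=true_dist)    # draw n balls from the box
    est = np.bincount(draws, minlength=3) / n     # empirical color distribution
    print(n, est, np.abs(est - true_dist).sum())  # L1 error shrinks as n grows
```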

Now consider the problem of applying LDA to short texts. LDA models a document as a mixture of topics, and each word is then drawn from one of its topics. You can imagine a black box containing tons of words generated from such a model. Now you see a short document with only a few words. The observations are obviously too few to infer the parameters. This is the data sparsity problem we mentioned.
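As a minimal sketch of this generative story (all parameter values are invented for illustration), even the document's own latent topic counts are a poor estimate of its topic mixture when the document is tweet-sized:

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, alpha = 5, 1000, 0.1
phi = rng.dirichlet(np.full(V, 0.01), size=K)  # topic-word distributions
theta = rng.dirichlet(np.full(K, alpha))       # the document's topic mixture

for doc_len in (8, 500):  # a tweet-sized document vs. a long article
    z = rng.choice(K, size=doc_len, p=theta)   # latent topic of each word
    words = [rng.choice(V, p=phi[k]) for k in z]
    emp = np.bincount(z, minlength=K) / doc_len
    print(doc_len, np.round(np.abs(emp - theta).sum(), 3))  # error in recovering theta
```

Real inference only sees `words`, not `z`, so the situation for an 8-word document is even worse than this suggests.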

Actually, besides the lack of observations, the problem also comes from the over-complexity of the model: usually, a more flexible model requires more observations to infer. The Biterm Topic Model tries to make topic inference easier by reducing the model complexity. First, it models the whole corpus as a mixture of topics, since inferring the topic mixture over the corpus is easier than inferring the topic mixture over each short document. Second, it supposes that each biterm is drawn from a single topic. Inferring the topic of a biterm is also easier than inferring the topic of a single word in LDA, since more context is included.
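For concreteness, a biterm is just an unordered pair of words co-occurring in the same short text; a hypothetical helper to extract them might look like this:

```python
from itertools import combinations

def biterms(doc_tokens):
    """All unordered word pairs co-occurring in a short text.

    Since each biterm is assumed to be drawn from one topic, an n-word
    tweet contributes n*(n-1)/2 co-occurrence observations to the
    corpus-level topic mixture, instead of n isolated words.
    """
    return list(combinations(doc_tokens, 2))

print(biterms(["apple", "iphone", "release", "event"]))
# [('apple', 'iphone'), ('apple', 'release'), ('apple', 'event'),
#  ('iphone', 'release'), ('iphone', 'event'), ('release', 'event')]
```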

I hope the explanation makes sense to you. Thanks for mentioning our paper.