Document topical distribution in Gensim LDA

Moses Xu picture Moses Xu · Jun 26, 2013 · Viewed 10.7k times · Source

I've derived a LDA topic model using a toy corpus as follows:

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

texts = [[word for word in document.lower().split()] for document in documents]
dictionary = corpora.Dictionary(texts)

id2word = {}
for word in dictionary.token2id:    
    id2word[dictionary.token2id[word]] = word

I found that when I use a small number of topics to derive the model, Gensim yields a full report of topical distribution over all potential topics for a test document. E.g.:

test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

However when I use a large number of topics, the report is no longer complete:

test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]

It seems to me that topics with a probability less than some threshold (I observed 0.01 to be more specific) are omitted form the output.

I'm wondering if this behaviour is due to some aesthetic considerations? And how can I get the distribution of the probability mass residual over all other topics?

Thank you for your kind answer!

Answer

Moses Xu picture Moses Xu · Oct 23, 2013

Read the source and it turns out that topics with probabilities smaller than a threshold are ignored. This threshold is with a default value of 0.01.