How to predict the topic of a new query with a trained LDA model in gensim?

Animesh Pandey · Apr 28, 2013 · Viewed 13.3k times

I have trained an LDA topic model on a corpus using gensim.

Going through the tutorial on the gensim website (this is not the whole code):

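import re
from gensim import corpora

# punctuation_string, stoplist and lda (the trained model) are defined
# earlier in the full script and are not shown here.
question = 'Changelog generation from Github issues?'

# Lowercase the query and strip punctuation.
temp = question.lower()
for ch in punctuation_string:
    temp = temp.replace(ch, '')

# Tokenize and drop stop words (re.LOCALE is not valid with str patterns).
words = re.findall(r'\w+', temp, flags=re.UNICODE)
important_words = [w for w in words if w not in stoplist]
print(important_words)

# Map the tokens to a bag-of-words vector and query the model.
dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = dictionary.doc2bow(important_words)
print(dictionary)
print(ques_vec)
print(lda[ques_vec])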

This is the output that I get:

['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]

I don't know how the last output helps me find the likely topic for the question!

Please help!

Answer

Animesh Pandey · Apr 30, 2013

I have written a function in Python that gives the possible topic for a new query:

import re
from gensim import corpora

# punctuation_string, stoplist and lda (the trained model) must already be defined.
def getTopicForQuery(question):
    # Lowercase the query and strip punctuation.
    temp = question.lower()
    for ch in punctuation_string:
        temp = temp.replace(ch, '')

    # Tokenize and drop stop words (re.LOCALE is not valid with str patterns).
    words = re.findall(r'\w+', temp, flags=re.UNICODE)
    important_words = [w for w in words if w not in stoplist]

    # Load the dictionary built from the training corpus.
    dictionary = corpora.Dictionary.load('questions.dict')

    # Convert the tokens to a bag-of-words vector.
    ques_vec = dictionary.doc2bow(important_words)

    # Topic distribution of the query: a list of (topic_id, probability) pairs.
    topic_vec = lda[ques_vec]

    # Sort the topics by probability, highest first.
    topic_vec = sorted(topic_vec, key=lambda pair: pair[1], reverse=True)

    # Top word of the most probable topic.
    final = lda.print_topic(topic_vec[0][0], 1)

    # The format is "probability*word" (newer gensim versions quote the word).
    question_topic = final.split('*')

    return question_topic[1]

Before going through this, do refer to this link!

In the initial part of the code, the query is pre-processed so that it is stripped of stop words and unnecessary punctuation.
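The pre-processing step can be tried on its own. The following is a minimal sketch, not the original script: `STOPLIST` is a hypothetical stop-word set (the real `stoplist` is not shown in the post), `preprocess` is an illustrative helper name, and `string.punctuation` stands in for `punctuation_string`.

```python
import re
import string

# Hypothetical stop-word set; the original `stoplist` is not shown in the post.
STOPLIST = {"from", "the", "a", "of", "for", "to", "in"}

def preprocess(question, stoplist=STOPLIST, punctuation=string.punctuation):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    text = question.lower()
    for ch in punctuation:
        text = text.replace(ch, "")
    words = re.findall(r"\w+", text)
    return [w for w in words if w not in stoplist]

tokens = preprocess("Changelog generation from Github issues?")
print(tokens)  # ['changelog', 'generation', 'github', 'issues']
```

With this assumed stop-word set, the result matches the first line of output shown in the question.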

Then the dictionary that was built from our own database is loaded.

Then we convert the tokens of the new query to a bag-of-words vector, and the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec], where lda is the trained model as explained in the link referred to above.
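To make the bag-of-words step concrete, here is a small pure-Python sketch of the kind of vector the conversion produces: sorted (token_id, count) pairs, with tokens that are missing from the dictionary dropped. It is an illustration, not gensim's implementation. `to_bow` is a hypothetical helper, and the `token2id` mapping is assumed: the ids are borrowed from the output in the question, but which word maps to which id is a guess.

```python
from collections import Counter

# Hypothetical token -> id mapping; in the post it comes from 'questions.dict'.
token2id = {"changelog": 514, "generation": 3625, "github": 3626, "issues": 3627}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# "unseenword" is not in the mapping, so it is silently ignored.
ques_vec = to_bow(
    ["changelog", "generation", "github", "issues", "unseenword"], token2id
)
print(ques_vec)  # [(514, 1), (3625, 1), (3626, 1), (3627, 1)]
```

One consequence worth keeping in mind: a query made up only of words the dictionary has never seen yields an empty vector, so the model has nothing specific to infer from.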

The distribution is then sorted by topic probability. The function takes the topic with the highest probability and returns that topic's top word, question_topic[1].
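The ranking step can be checked without a trained model by using the distribution printed in the question. This is a sketch under stated assumptions: `topic_vec` is copied from that output, while the `final` string is a made-up example of the "probability*word" format, not real model output.

```python
# (topic_id, probability) pairs copied from the output in the question.
topic_vec = [
    (4, 0.20400000000000032),
    (11, 0.20400000000000032),
    (19, 0.20263215848547525),
    (29, 0.20536784151452539),
]

# Sort by probability, highest first, and take the first entry.
ranked = sorted(topic_vec, key=lambda pair: pair[1], reverse=True)
best_topic_id, best_prob = ranked[0]
print(best_topic_id, round(best_prob, 3))  # 29 0.205

# Hypothetical "probability*word" string, standing in for lda.print_topic output.
final = "0.050*github"
# Split off the word; strip quotes in case the library wraps the word in them.
top_word = final.split("*")[1].strip('"')
print(top_word)  # github
```

Note that the four probabilities here are all close to 0.2, so topic 29 wins only by a small margin; for a query like this it may be more informative to look at the whole ranked list rather than just the first entry.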