How to predict the topic of a new query using a trained LDA model using gensim?

Question 1

How to predict the topic of a new query using a trained LDA model using gensim?

python nlp lda topic-modeling gensim

Animesh Pandey · Apr 28, 2013 · Viewed 13.3k times · Source

Answer

Answer

I have written a function in python that gives the possible topic for a new query:

def getTopicForQuery (question):
    temp = question.lower()
    for i in range(len(punctuation_string)):
        temp = temp.replace(punctuation_string[i], '')

    words = re.findall(r'\w+', temp, flags = re.UNICODE | re.LOCALE)

    important_words = []
    important_words = filter(lambda x: x not in stoplist, words)

    dictionary = corpora.Dictionary.load('questions.dict')

    ques_vec = []
    ques_vec = dictionary.doc2bow(important_words)

    topic_vec = []
    topic_vec = lda[ques_vec]

    word_count_array = numpy.empty((len(topic_vec), 2), dtype = numpy.object)
    for i in range(len(topic_vec)):
        word_count_array[i, 0] = topic_vec[i][0]
        word_count_array[i, 1] = topic_vec[i][1]

    idx = numpy.argsort(word_count_array[:, 1])
    idx = idx[::-1]
    word_count_array = word_count_array[idx]

    final = []
    final = lda.print_topic(word_count_array[0, 0], 1)

    question_topic = final.split('*') ## as format is like "probability * topic"

    return question_topic[1]

Before going through this do refer this link!

In the initial part of the code, the query is being pre-processed so that it can be stripped off stop words and unnecessary punctuations.

Then, the dictionary that was made by using our own database is loaded.

We, then, we convert the tokens of the new query to bag of words and then the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec] where lda is the trained model as explained in the link referred above.

The distribution is then sorted w.r.t the probabilities of the topics. The topic with the highest probability is then displayed by question_topic[1].

Question 2

I have trained a corpus for LDA topic modelling using gensim.

Going through the tutorial on the gensim website (this is not the whole code):

question = 'Changelog generation from Github issues?';

temp = question.lower()
for i in range(len(punctuation_string)):
    temp = temp.replace(punctuation_string[i], '')

words = re.findall(r'\w+', temp, flags = re.UNICODE | re.LOCALE)
important_words = []
important_words = filter(lambda x: x not in stoplist, words)
print important_words
dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = []
ques_vec = dictionary.doc2bow(important_words)
print dictionary
print ques_vec
print lda[ques_vec]

This is the output that I get:

['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]

I don't know how the last output is going to help me find the possible topic for the question !!!

Please help!

How to predict the topic of a new query using a trained LDA model using gensim?

Answer

Related questions