I have trained a corpus for LDA topic modelling using gensim.
Going through the tutorial on the gensim website (this is not the whole code):
question = 'Changelog generation from Github issues?';
temp = question.lower()
for i in range(len(punctuation_string)):
temp = temp.replace(punctuation_string[i], '')
words = re.findall(r'\w+', temp, flags = re.UNICODE | re.LOCALE)
important_words = []
important_words = filter(lambda x: x not in stoplist, words)
print important_words
dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = []
ques_vec = dictionary.doc2bow(important_words)
print dictionary
print ques_vec
print lda[ques_vec]
This is the output that I get:
['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]
I don't know how the last output is going to help me find the possible topic for the question
!!!
Please help!
I have written a function in python that gives the possible topic for a new query:
def getTopicForQuery (question):
temp = question.lower()
for i in range(len(punctuation_string)):
temp = temp.replace(punctuation_string[i], '')
words = re.findall(r'\w+', temp, flags = re.UNICODE | re.LOCALE)
important_words = []
important_words = filter(lambda x: x not in stoplist, words)
dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = []
ques_vec = dictionary.doc2bow(important_words)
topic_vec = []
topic_vec = lda[ques_vec]
word_count_array = numpy.empty((len(topic_vec), 2), dtype = numpy.object)
for i in range(len(topic_vec)):
word_count_array[i, 0] = topic_vec[i][0]
word_count_array[i, 1] = topic_vec[i][1]
idx = numpy.argsort(word_count_array[:, 1])
idx = idx[::-1]
word_count_array = word_count_array[idx]
final = []
final = lda.print_topic(word_count_array[0, 0], 1)
question_topic = final.split('*') ## as format is like "probability * topic"
return question_topic[1]
Before going through this do refer this link!
In the initial part of the code, the query is being pre-processed so that it can be stripped off stop words and unnecessary punctuations.
Then, the dictionary that was made by using our own database is loaded.
We, then, we convert the tokens of the new query to bag of words and then the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec]
where lda
is the trained model as explained in the link referred above.
The distribution is then sorted w.r.t the probabilities of the topics. The topic with the highest probability is then displayed by question_topic[1]
.