How to extract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in Python gensim?

nlp text-mining lda gensim

Thomas N T · Sep 9, 2015 · Viewed 9.1k times

LDA Original Output

Uni-grams
- topic1: scuba, water, vapor, diving
- topic2: dioxide, plants, green, carbon

Required Output

Bi-gram topics
- topic1: scuba diving, water vapor
- topic2: green plants, carbon dioxide

Any idea?

Answer

Assuming you have a dict called docs that maps each document to its list of words, you can extend each list with its bigrams (or trigrams, etc.) using nltk.util.ngrams or a function of your own:

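from nltk.util import ngrams

# docs maps a document name to its list of word tokens
for doc in docs:
    # append each adjacent word pair as a single "w1_w2" token
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]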
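If you would rather not depend on nltk, the "own function" option could look roughly like the sketch below. The helper name with_bigrams, its keep_unigrams flag, and the sample docs are made up for illustration; they are not part of nltk or gensim. Setting keep_unigrams=False drops the single words, which is probably closer to the bigram-only topics asked for in the question.

```python
def with_bigrams(tokens, keep_unigrams=True):
    """Return underscore-joined adjacent word pairs, optionally after the original tokens."""
    # zip(tokens, tokens[1:]) yields every pair of neighbouring words
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams if keep_unigrams else bigrams


# Hypothetical sample data in the same shape as the docs dict above
docs = {
    "doc1": ["scuba", "diving", "water", "vapor"],
    "doc2": ["green", "plants", "carbon", "dioxide"],
}

# Bigram-only token lists, one per document
bigram_docs = {name: with_bigrams(tokens, keep_unigrams=False)
               for name, tokens in docs.items()}
```

Note that this pairs every neighbouring word, so a pair like diving_water shows up next to scuba_diving; filtering out rare pairs before training is a separate step.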

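Then you build a gensim Dictionary from the values of docs, convert each token list to a bag-of-words vector with doc2bow, and pass the result to the LDA model as the corpus. Bigrams joined by underscores are thus treated as single tokens.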
