I have read that the most common technique for topic modeling (extracting possible topics from text) is Latent Dirichlet Allocation (LDA).
However, I am wondering whether it is a good idea to try topic modeling with Word2Vec, since it clusters words in vector space. Couldn't the clusters therefore be regarded as topics?
Do you think it makes sense to follow this approach for research purposes? In the end, what I am interested in is extracting keywords from text according to topics.
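To make the idea concrete, here is a minimal sketch of what I have in mind, assuming gensim's Word2Vec and scikit-learn's KMeans; the toy corpus and all parameters are placeholders, not recommended settings:

```python
# Sketch: train Word2Vec on a toy corpus, cluster the word vectors with
# k-means, and treat each cluster as a "topic".
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Toy corpus: in practice these would be real tokenized documents.
sentences = [
    ["cat", "dog", "pet", "animal"],
    ["stock", "market", "price", "trade"],
] * 100  # repeated so Word2Vec has enough co-occurrence data

model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1, workers=1)

n_topics = 2  # hypothetical number of "topics" (clusters)
words = list(model.wv.index_to_key)
vectors = model.wv[words]

kmeans = KMeans(n_clusters=n_topics, n_init=10, random_state=1).fit(vectors)

# Group words by cluster label; each cluster plays the role of a topic,
# and its member words could serve as that topic's keywords.
for topic in range(n_topics):
    print(f"Topic {topic}:", [w for w, l in zip(words, kmeans.labels_) if l == topic])
```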
You might want to look at the following papers:
Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, vol. 3, pp. 299-313. [CODE]
Yang Liu, Zhiyuan Liu, Tat-Seng Chua and Maosong Sun. 2015. Topical Word Embeddings. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2418-2424. [CODE]
The first paper integrates word embeddings into the LDA model and into the one-topic-per-document Dirichlet Multinomial Mixture (DMM) model. It reports significant improvements in topic coherence, document clustering, and document classification, especially on small corpora or short texts (e.g., tweets).
The second paper is also interesting. It uses LDA to assign a topic to each word, and then employs Word2Vec to learn word embeddings based on both the words and their assigned topics.
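To illustrate the gist of that second approach, here is a rough, simplified sketch (not the paper's actual implementation) using gensim's LdaModel and Word2Vec: each word is tagged with its most probable LDA topic, and embeddings are then learned over the resulting word_topic pseudo-tokens. The corpus and parameters are again placeholders.

```python
# Rough sketch of the Topical Word Embeddings idea: tag each word with an
# LDA topic, then train Word2Vec on "word_topic" pseudo-tokens.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [
    ["cat", "dog", "pet", "animal"],
    ["stock", "market", "price", "trade"],
] * 50  # toy corpus; real use needs tokenized documents

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=1)

def topic_of(word):
    # Most probable topic for a word type; the paper assigns topics per
    # token occurrence, so this is a deliberate simplification.
    probs = lda.get_term_topics(dictionary.token2id[word], minimum_probability=0.0)
    return max(probs, key=lambda t: t[1])[0] if probs else 0

tagged_docs = [[f"{w}_{topic_of(w)}" for w in d] for d in docs]
twe = Word2Vec(tagged_docs, vector_size=50, min_count=1, seed=1, workers=1)

# Nearest neighbours of a topic-tagged word in the learned space.
print(twe.wv.most_similar(tagged_docs[0][0]))
```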