Document similarity: Vector embedding versus Tf-Idf performance?

Alec Matusis · Mar 7, 2017 · Viewed 7.8k times

I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches:

  1. A vector embedding (word2vec, GloVe, or fastText): average the word vectors in each document and compare documents with cosine similarity.

  2. Bag-of-words: tf-idf or one of its variants, such as BM25. (A rough sketch of both baselines follows this list.)
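For concreteness, here is a minimal sketch of what I mean by each approach; `word_vectors` stands in for any pre-loaded embedding model (e.g. gensim `KeyedVectors`), so that name and the dimension argument are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "a cat lay on a rug"]  # toy corpus

# Approach 2: bag-of-words with tf-idf weights + cosine similarity
tfidf = TfidfVectorizer().fit_transform(docs)
print("tf-idf cosine:", cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# Approach 1: average the word vectors of a document + cosine similarity
def avg_vector(text, word_vectors, dim):
    vecs = [word_vectors[w] for w in text.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```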

Will one of these yield significantly better results? Has anyone done a quantitative comparison of tf-idf versus averaged word2vec vectors for document similarity?

Is there another approach that allows the document vectors to be refined dynamically as more text is added?

Answer

user8001497 · May 12, 2017
  1. doc2vec or word2vec?

According to the article [Learning Semantic Similarity for Very Short Texts, 2015, IEEE], the performance of doc2vec (paragraph2vec) is poor for short documents.

  2. Short documents?

If you want to compare the similarity of short documents, vectorizing each document via word2vec is a good option.

  3. How to construct the document vector?

For example, you can construct a document vector as a tf-idf-weighted average of its word vectors, as sketched below.
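A minimal sketch of such a weighted average; here `word_vectors` is assumed to be a loaded embedding model (e.g. gensim `KeyedVectors`) and `idf` a word-to-idf mapping (e.g. taken from a fitted `TfidfVectorizer`):

```python
import numpy as np
from collections import Counter

def doc_vector(tokens, word_vectors, idf, dim=300):
    """tf-idf-weighted average of the word vectors of one document."""
    vec = np.zeros(dim)
    total_weight = 0.0
    for word, tf in Counter(tokens).items():
        if word in word_vectors and word in idf:
            weight = tf * idf[word]          # tf-idf weight of this word
            vec += weight * word_vectors[word]
            total_weight += weight
    # Normalize by the total weight so document length does not dominate.
    return vec / total_weight if total_weight > 0 else vec
```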

  4. Similarity measure

In addition, I recommend using TS-SS rather than cosine or Euclidean similarity.

For details, see the article "A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering" or the summary on GitHub below.

https://github.com/taki0112/Vector_Similarity
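For reference, here is a minimal NumPy sketch of TS-SS following the formulas in that paper (the 10-degree offset comes from the paper; note that TS-SS is a dissimilarity, so a lower value means more similar):

```python
import numpy as np

def ts_ss(a, b):
    """TS-SS dissimilarity between two vectors (lower = more similar)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos = np.dot(a, b) / (na * nb)
    # Angle between the vectors in degrees, plus 10 degrees so the
    # triangle/sector areas never collapse to zero (as in the paper).
    theta = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))) + 10.0
    # Triangle's Area Similarity (TS)
    ts = na * nb * np.sin(np.radians(theta)) / 2.0
    # Sector's Area Similarity (SS)
    ed = np.linalg.norm(a - b)   # Euclidean distance
    md = abs(na - nb)            # magnitude difference
    ss = np.pi * (ed + md) ** 2 * theta / 360.0
    return ts * ss

print(ts_ss(np.array([1.0, 0.0]), np.array([0.9, 0.1])))  # small value = similar
```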

Thank you.