I have a large collection of texts, where each text is growing rapidly. I need to implement a similarity search over them.
The idea is to embed each word with word2vec and represent each text as a normalized vector obtained by summing the embeddings of the words in it. Subsequent additions to a text would then only refine its vector: the new words' vectors are added to the existing sum, and the result is re-normalized.
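A minimal sketch of this scheme, assuming `embeddings` is a plain dict mapping each word to a NumPy array (e.g. loaded from a pretrained word2vec model); the names `TextVector`, `add_words`, and `normalized` are illustrative, not from any library:

```python
import numpy as np

class TextVector:
    """Running sum of word embeddings for one text.

    Keeps the unnormalized sum so new words can be added
    incrementally; normalization happens only on read.
    """
    def __init__(self, dim):
        self.sum = np.zeros(dim)

    def add_words(self, words, embeddings):
        # Accumulate the embedding of every known word.
        for w in words:
            if w in embeddings:
                self.sum += embeddings[w]

    def normalized(self):
        # Unit-length vector; for unit vectors, dot product == cosine similarity.
        norm = np.linalg.norm(self.sum)
        return self.sum / norm if norm > 0 else self.sum

# Toy 3-dimensional "word2vec" table, purely illustrative.
embeddings = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.8, 0.2, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}

doc = TextVector(dim=3)
doc.add_words(["cat", "dog"], embeddings)
print(doc.normalized())               # initial text vector
doc.add_words(["car"], embeddings)    # text grows; just add the new word
print(doc.normalized())               # refined vector
```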
Is it possible to use Elasticsearch for cosine similarity by storing only the coordinates of each text's normalized vector in a document? If so, what is the proper index structure for such a search?
This Elasticsearch plugin implements a score function (dot product) for vectors stored using the delimited_payload token filter.
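As a sketch of what the index could look like under that approach (assuming Elasticsearch 7+; the index name `texts`, field name `vector`, and analyzer name `payload_analyzer` are just examples, and the plugin may have its own requirements): each document stores its vector as whitespace-separated `dimension|value` tokens, and the `delimited_payload` token filter keeps each value as a payload that a scoring script can read back.

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster

# Analyzer that splits "dim|value" tokens and stores the value as a payload.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "payload_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["delimited_payload"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "text",
                "term_vector": "with_positions_payloads",
                "analyzer": "payload_analyzer",
            }
        }
    },
}
requests.put(f"{ES}/texts", json=index_body).raise_for_status()

# Each coordinate becomes a "dimension|value" token, e.g. "0|0.12 1|-0.85 2|0.51".
doc_vector = [0.12, -0.85, 0.51]
payload = " ".join(f"{i}|{x}" for i, x in enumerate(doc_vector))
requests.put(f"{ES}/texts/_doc/1", json={"vector": payload}).raise_for_status()
```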
The complexity of this search is linear in the number of documents, which is worse than tf-idf on a term query: for a term query, ES first looks up candidates in the inverted index and only then scores them with tf-idf, so tf-idf is not executed on all the documents of the index. With the vector representation, you are searching for the document whose vector has the lowest cosine distance to the query, so every document has to be scored, without the advantages of the inverted index.
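To make that linear scan concrete, here is what the scoring effectively amounts to, assuming all document vectors are unit-normalized and stacked in a NumPy matrix (the names `top_k`, `doc_matrix`, and the sizes are illustrative):

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=10):
    """Brute-force cosine search: score every row of doc_matrix.

    With unit-length vectors, cosine similarity is just a dot product,
    but the work is still O(n_docs * dim) -- no inverted index prunes it.
    """
    scores = doc_matrix @ query_vec          # one dot product per document
    best = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    return best, scores[best]

# 100k documents with 300-dim unit vectors, purely illustrative.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100_000, 300))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query = docs[42]                 # a query identical to document 42
ids, sims = top_k(query, docs, k=3)
print(ids, sims)                 # doc 42 should rank first with similarity ~1.0
```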