How to cluster similar sentences using BERT

somethingstrang picture somethingstrang · Apr 10, 2019 · Viewed 12.9k times · Source

For ElMo, FastText and Word2Vec, I'm averaging the word embeddings within a sentence and using HDBSCAN/KMeans clustering to group similar sentences.

A good example of the implementation can be seen in this short article: http://ai.intelligentonlinetools.com/ml/text-clustering-word-embedding-machine-learning/

I would like to do the same thing using BERT (using the BERT python package from hugging face), however I am rather unfamiliar with how to extract the raw word/sentence vectors in order to input them into a clustering algorithm. I know that BERT can output sentence representations - so how would I actually extract the raw vectors from a sentence?

Any information would be helpful.

Answer

Subham Kumar picture Subham Kumar · Jul 12, 2020

You can use Sentence Transformers to generate the sentence embeddings. These embeddings are much more meaningful as compared to the one obtained from bert-as-service, as they have been fine-tuned such that semantically similar sentences have higher similarity score. You can use FAISS based clustering algorithm if number of sentences to be clustered are in millions or more as vanilla K-means like clustering algorithm takes quadratic time.