Is it possible to use Google BERT to calculate similarity between two textual documents?

goodboy · Sep 11, 2019 · Viewed 13.5k times

Is it possible to use Google BERT to calculate the similarity between two textual documents? As I understand it, BERT's input is supposed to be sentences of limited size. Some works use BERT to calculate similarity between sentences, for example:

https://github.com/AndriyMulyar/semantic-text-similarity

https://github.com/beekbin/bert-cosine-sim
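
(For context, the sentence-level approach in those repos boils down to something like the sketch below - the bert-base-uncased checkpoint and mean pooling are my own assumptions for illustration, not necessarily what either repo does.)

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentence):
        # Tokenize, run through BERT, and mean-pool the token embeddings.
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)

    a = embed("The cat sat on the mat.")
    b = embed("A cat was sitting on a rug.")
    print(float(torch.nn.functional.cosine_similarity(a, b, dim=0)))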

Is there an implementation of BERT that takes large documents (documents with thousands of words) as input instead of sentences?

Answer

birdmw · Feb 4, 2020

BERT is not limited to determining whether one sentence follows another. That is just ONE of the GLUE tasks, and there are a myriad more. ALL of the GLUE tasks (and SuperGLUE) are getting knocked out of the park by ALBERT.

BERT (and ALBERT, for that matter) is the absolute state of the art in Natural Language Understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bidirectional, attention-based encoder built on the Transformer, the architecture from the Google Brain paper Attention Is All You Need. Also see the visual breakdown of the Transformer model.
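
A quick way to see the "contextual" part in action - a minimal sketch, assuming the bert-base-uncased checkpoint from the Hugging Face transformers library - is to embed the same word in two different sentences and compare the vectors:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def word_vector(sentence, word):
        # Contextual embedding of the first occurrence of `word` in `sentence`.
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state.squeeze(0)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return hidden[tokens.index(word)]

    river = word_vector("i sat on the bank of the river.", "bank")
    money = word_vector("i deposited money at the bank.", "bank")
    # Unlike a static GloVe vector, the two "bank" embeddings differ by context.
    print(float(torch.nn.functional.cosine_similarity(river, money, dim=0)))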

This is a fundamentally new way of looking at natural language that doesn't use RNNs, LSTMs, tf-idf, or any of that stuff. We aren't relying on static word or document vectors anymore. GloVe (Global Vectors for Word Representation) fed into LSTMs is old. Doc2Vec is old.

BERT is reeeeeallly powerful - like, pass the Turing test easily powerful. Take a look at SuperGLUE, which just came out. Scroll to the bottom and look at how insane those tasks are. THAT is where NLP is at.

Okay, so now that we have dispensed with the idea that tf-idf is state of the art - you want to take documents and look at their similarity? I would use ALBERT on Databricks in two steps (a sketch combining them follows the list):

  1. Perform either extractive or abstractive summarization: https://pypi.org/project/bert-extractive-summarizer/ (NOTICE HOW BIG THOSE DOCUMENTS OF TEXT ARE) and reduce each document down to a summary.

  2. In a separate step, take the two summaries and score them with the STS-B task from GLUE (page 3 of the GLUE paper).
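
Here is a rough sketch of how those two steps could be wired together - using the bert-extractive-summarizer package for step 1, and plain mean-pooled BERT embeddings with cosine similarity as a simple stand-in for a properly STS-B fine-tuned model in step 2. The file paths, model name, and ratio are my own placeholder assumptions:

    import torch
    from transformers import AutoTokenizer, AutoModel
    from summarizer import Summarizer  # pip install bert-extractive-summarizer

    # Long documents (thousands of words each); paths are placeholders.
    document_a = open("doc_a.txt").read()
    document_b = open("doc_b.txt").read()

    # Step 1: reduce each document to an extractive summary.
    summarizer = Summarizer()
    summary_a = summarizer(document_a, ratio=0.2)
    summary_b = summarizer(document_b, ratio=0.2)

    # Step 2: score the two summaries for semantic similarity.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")

    def embed(text):
        # Mean-pool BERT's token embeddings into a single vector.
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)

    score = torch.nn.functional.cosine_similarity(embed(summary_a), embed(summary_b), dim=0)
    print(float(score))

For real use, you would swap the embed-and-cosine step for a model actually fine-tuned on STS-B, which is what the GLUE task measures.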

Now, we are talking about absolutely bleeding-edge technology here (ALBERT came out just in the last few months). You will need to be extremely proficient to get through this, but it CAN be done, and I believe in you!!