Spacy, Strange similarity between two sentences

Mr.D · Aug 31, 2018 · Viewed 12.5k times

I have downloaded the en_core_web_lg model and am trying to find the similarity between two sentences:

import spacy

nlp = spacy.load('en_core_web_lg')

search_doc = nlp("This was very strange argument between american and british person")

main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

print(main_doc.similarity(search_doc))

This returns a very strange value:

0.9066019751888448

These two sentences should not be 90% similar; they have very different meanings.

Why is this happening? Do I need to add some kind of additional vocabulary to make the similarity result more reasonable?

Answer

Johannes Filter · Jan 6, 2019

spaCy constructs a sentence embedding by averaging the word embeddings of the tokens. An ordinary sentence contains many words that carry little meaning (called stop words), and these dominate the average, so even unrelated sentences end up with similar vectors. You can remove the stop words like this:

search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")

search_doc_no_stop_words = nlp(' '.join([str(t) for t in search_doc if not t.is_stop]))
main_doc_no_stop_words = nlp(' '.join([str(t) for t in main_doc if not t.is_stop]))

print(search_doc_no_stop_words.similarity(main_doc_no_stop_words))

Or keep only the nouns, since they carry most of the information:

# doc is any parsed Doc, e.g. search_doc or main_doc
doc_nouns = nlp(' '.join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']]))
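To see why averaging pushes the score up, here is a minimal sketch that does not use spaCy at all. The 2-D vectors, the word list, and the helper names (`TOY_VECTORS`, `average`, `cosine`, `similarity`) are all made up for illustration; real en_core_web_lg vectors have 300 dimensions and different values. The point is only that shared stop words pull both averages in the same direction:

```python
import math

# Toy 2-D "word vectors". These values are invented for illustration and
# are NOT taken from any spaCy model.
TOY_VECTORS = {
    "the": (1.0, 1.0),
    "was": (0.9, 1.1),
    "and": (1.1, 0.9),
    "argument": (1.0, -1.0),
    "school": (-1.0, 1.0),
}
STOP_WORDS = {"the", "was", "and"}


def average(words):
    """Component-wise mean of the toy vectors for the given words."""
    vecs = [TOY_VECTORS[w] for w in words]
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity(words_a, words_b, drop_stop_words=False):
    """Cosine similarity of the averaged vectors, optionally without stop words."""
    if drop_stop_words:
        words_a = [w for w in words_a if w not in STOP_WORDS]
        words_b = [w for w in words_b if w not in STOP_WORDS]
    return cosine(average(words_a), average(words_b))


# Two "sentences" that share every stop word but differ in their content word.
sent_a = ["the", "argument", "was", "and"]
sent_b = ["the", "school", "was", "and"]

with_stop = similarity(sent_a, sent_b)
without_stop = similarity(sent_a, sent_b, drop_stop_words=True)

print("with stop words:   ", round(with_stop, 3))
print("without stop words:", round(without_stop, 3))
```

In this toy setup the two content words point in opposite directions, yet the score with stop words included is still high because three of the four vectors in each average are identical. Dropping the stop words exposes the difference, which is the same effect the filtering above aims for on real sentences.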