I have downloaded the en_core_web_lg model and am trying to find the similarity between two sentences:
import spacy

nlp = spacy.load('en_core_web_lg')
search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")
print(main_doc.similarity(search_doc))
This returns a very strange value:
0.9066019751888448
These two sentences should not be 90% similar; they have very different meanings.
Why is this happening? Do I need to add some kind of additional vocabulary to make the similarity result more reasonable?
spaCy constructs the sentence embedding by averaging the token embeddings. Since an ordinary sentence contains a lot of relatively meaningless words (called stop words), the average gets dragged toward them and you get poor results. You can remove them like this:
search_doc = nlp("This was very strange argument between american and british person")
main_doc = nlp("He was from Japan, but a true English gentleman in my eyes, and another one of the reasons as to why I liked going to school.")
# Re-parse both texts with the stop words removed
search_doc_no_stop_words = nlp(' '.join([str(t) for t in search_doc if not t.is_stop]))
main_doc_no_stop_words = nlp(' '.join([str(t) for t in main_doc if not t.is_stop]))
print(search_doc_no_stop_words.similarity(main_doc_no_stop_words))
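If you want to confirm that the document vector really is just the average of the token vectors, here is a minimal sanity check (a sketch, assuming NumPy is installed alongside spaCy):

import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')
doc = nlp("This was very strange argument between american and british person")

# The Doc vector should match the mean of the individual token vectors,
# which is why filler words pull unrelated sentences closer together.
token_vectors = np.array([t.vector for t in doc])
print(np.allclose(doc.vector, token_vectors.mean(axis=0)))  # expected: True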
Alternatively, keep only the nouns and proper nouns, since they carry most of the information:
doc_nouns = nlp(' '.join([str(t) for t in doc if t.pos_ in ['NOUN', 'PROPN']]))
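Applied to both documents from the question (the variable names here are just for illustration), that would look like:

search_doc_nouns = nlp(' '.join([str(t) for t in search_doc if t.pos_ in ['NOUN', 'PROPN']]))
main_doc_nouns = nlp(' '.join([str(t) for t in main_doc if t.pos_ in ['NOUN', 'PROPN']]))
print(search_doc_nouns.similarity(main_doc_nouns))

The exact score will vary with the model version, but it should drop noticeably once only the content-bearing words are compared.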