I am using Python and scikit-learn to find the cosine similarity between two strings (specifically, names). The program finds the similarity score between two strings, but when the strings are abbreviated it produces undesirable output.
For example, with String1 = "K KAPOOR" and String2 = "L KAPOOR", the cosine similarity score is 1 (the maximum), even though the two strings are entirely different names. Is there a way to modify this in order to get the desired results?
My code is:
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 9 14:40:21 2015
@author: gauge
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ("K KAPOOR", "L KAPOOR")

# Build TF-IDF vectors for the two names
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# print(tfidf_matrix.shape)

# Compare the first name against both names
cs = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
print(cs)
As mentioned in the other answer, the cosine similarity is one because the two strings have the exact same representation.
That means that this code:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
produces, well:
print(tfidf_matrix.toarray())
[[ 1.]
[ 1.]]
This means that the two strings/documents (here the rows in the array) have the same representation.
That is because the TfidfVectorizer tokenizes your documents into word tokens and, by default, keeps only tokens with at least 2 characters, so the one-letter initials "K" and "L" are discarded and both names reduce to the single token "kapoor".
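You can verify this by inspecting the vocabulary the vectorizer learned; with the documents and tfidf_vectorizer from the code above, it should contain only the surname:
print(tfidf_vectorizer.vocabulary_)
# something like {'kapoor': 0} -- the one-letter initials never make it in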
So you could do one of the following (a short sketch comparing both options follows below):
Use:
tfidf_vectorizer = TfidfVectorizer(analyzer="char")
to get character n-grams instead of word n-grams.
Change the token pattern so that it keeps one-letter tokens:
tfidf_vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
This is just a simple modification of the default pattern you can see in the documentation. Note that I had to escape the \b occurrences in the regular expression, because in a non-raw string \b is read as a backspace character, which was giving me an 'empty vocabulary' error.
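For reference, here is a minimal sketch putting both options side by side on the two names from the question (the variable names are mine). Exact scores depend on your scikit-learn version, but both should now come out below 1, with the character-level score still fairly high since the two names share most of their characters:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ("K KAPOOR", "L KAPOOR")

# Option 1: character n-grams instead of word tokens
char_vectorizer = TfidfVectorizer(analyzer="char")
char_matrix = char_vectorizer.fit_transform(documents)
print(cosine_similarity(char_matrix[0:1], char_matrix))

# Option 2: word tokens, but keep one-letter tokens as well
word_vectorizer = TfidfVectorizer(token_pattern=u'(?u)\\b\\w+\\b')
word_matrix = word_vectorizer.fit_transform(documents)
print(cosine_similarity(word_matrix[0:1], word_matrix))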
Hope this helps.