I computed the tf-idf of my documents based on terms. Then I applied LSA to reduce the dimensionality of the term space. 'similarity_dist' contains negative values (see table below). How can I compute a cosine distance in the range 0 to 1?
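import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import cosine_similarity

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, tokenizer=tokenize_and_stem, stop_words='english')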
%time tf = tf_vectorizer.fit_transform(descriptions)
print(tf.shape)
svd = TruncatedSVD(100)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
tfidf_desc = lsa.fit_transform(tfidf_matrix_desc)
explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))
similarity_dist = cosine_similarity(tfidf_desc)
pd.DataFrame(similarity_dist, index=descriptions.index, columns=descriptions.index).head(10)
print(tfidf_matrix_desc.min(), tfidf_matrix_desc.max())
# 0.0 0.736443429828
print(tfidf_desc.min(), tfidf_desc.max())
# -0.518015429416 0.988306783341
print(similarity_dist.max(), similarity_dist.min())
# 1.0 -0.272010919022
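cosine_similarity ranges from -1 to 1. The tf-idf matrix itself is non-negative, but the LSA (SVD) projection produces vectors with negative components, which is why negative similarities appear.
Cosine distance is defined as:
cosine_distance = 1 - cosine_similarity
Hence cosine_distance ranges from 0 to 2.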
See https://en.wikipedia.org/wiki/Cosine_similarity
Cosine distance is a term often used for the complement in positive space, that is: D_C(A,B) = 1 - S_C(A,B).
Note: if you need it in the range 0 to 1, you can use cosine_distance / 2. This maps identical directions to 0, orthogonal vectors to 0.5, and opposite directions to 1.
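As a minimal sketch of that rescaling, the snippet below computes pairwise cosine similarity with plain NumPy, converts it to a distance, and halves it. The function name cosine_distance_01 and the toy vectors are made up for illustration and are not part of the question's code; it also assumes no row is all zeros, since a zero vector has no defined direction.

```python
import numpy as np

def cosine_distance_01(X):
    """Pairwise cosine distance rescaled to the range 0 to 1 (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # L2-normalise each row so that dot products equal cosine similarities.
    # Assumes every row has a non-zero norm.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Clip to guard against tiny floating-point overshoot outside [-1, 1].
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)
    distance = 1.0 - similarity   # range 0 to 2
    return distance / 2.0         # range 0 to 1

# Toy vectors with negative components, similar in spirit to LSA output.
toy = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [-1.0, 0.0]])
scaled = cosine_distance_01(toy)
print(scaled)
```

With the question's variables, the same result should follow from (1 - similarity_dist) / 2, since similarity_dist already holds the pairwise cosine similarities.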