I am looking at working on an NLP project, in any programming language (though Python will be my preference).
I want to take two documents and determine how similar they are.
The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See esp. Introduction to Information Retrieval, which is free and available online.
TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is as easy as
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [open(f) for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
or, if the documents are plain strings,
>>> corpus = ["I'd like an apple",
... "An apple a day keeps the doctor away",
... "Never compare an apple to an orange",
... "I prefer scikit-learn to Orange",
... "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T
though Gensim may have more options for this kind of task.
See also this question.
[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]
From above, pairwise_similarity
is a Scipy sparse matrix that is square in shape, with the number of rows and columns equal to the number of documents in the corpus.
>>> pairwise_similarity
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 17 stored elements in Compressed Sparse Row format>
You can convert the sparse array to a NumPy array via .toarray()
or .A
:
>>> pairwise_similarity.toarray()
array([[1. , 0.17668795, 0.27056873, 0. , 0. ],
[0.17668795, 1. , 0.15439436, 0. , 0. ],
[0.27056873, 0.15439436, 1. , 0.19635649, 0.16815247],
[0. , 0. , 0.19635649, 1. , 0.54499756],
[0. , 0. , 0.16815247, 0.54499756, 1. ]])
Let's say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus
. You can find the index of the most similar document by taking the argmax of that row, but first you'll need to mask the 1's, which represent the similarity of each document to itself. You can do the latter through np.fill_diagonal()
, and the former through np.nanargmax()
:
>>> import numpy as np
>>> arr = pairwise_similarity.toarray()
>>> np.fill_diagonal(arr, np.nan)
>>> input_doc = "The scikit-learn docs are Orange and Blue"
>>> input_idx = corpus.index(input_doc)
>>> input_idx
4
>>> result_idx = np.nanargmax(arr[input_idx])
>>> corpus[result_idx]
'I prefer scikit-learn to Orange'
Note: the purpose of using a sparse matrix is to save (a substantial amount of space) for a large corpus & vocabulary. Instead of converting to a NumPy array, you could do:
>>> n, _ = pairwise_similarity.shape
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()
3