Python LSA with Sklearn

Schweigerama · Jun 2, 2015 · Viewed 12.7k times

I'm currently trying to implement LSA with sklearn to find synonyms across multiple documents. Here is my code:

#import the essential tools for lsa
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
#other imports
from os import listdir

#load data
datafolder = 'data/'
filenames = []
for file in listdir(datafolder):
    if file.endswith(".txt"):
        filenames.append(datafolder+file)

#Document-Term Matrix
cv = CountVectorizer(input='filename',strip_accents='ascii')
dtMatrix = cv.fit_transform(filenames).toarray()
print(dtMatrix.shape)
#get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0+
featurenames = cv.get_feature_names()
print(featurenames)

#Tf-idf Transformation
tfidf = TfidfTransformer()
tfidfMatrix = tfidf.fit_transform(dtMatrix).toarray()
print(tfidfMatrix.shape)

#SVD
#n_components is recommended to be 100 by Sklearn Documentation for LSA
#http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
svd = TruncatedSVD(n_components = 100)
svdMatrix = svd.fit_transform(tfidfMatrix)

print(svdMatrix)

#Cosine-Similarity
#cosine = cosine_similarity(svdMatrix[1:2], svdMatrix)  #the slice keeps the row 2-D

Now here is my problem: the term-document matrix and the tf-idf matrix have the same shape, (27, 3099): 27 documents and 3099 words. After the singular value decomposition, the shape of the matrix is (27, 27). I know you can calculate the cosine similarity of two rows to get their similarity, but I don't think I can get the similarity of two words in my documents by doing that with the SVD matrix.

Can someone explain to me what the SVD matrix represents and how I can use it to find synonyms in my documents?

Thanks in advance.

Answer

Bruce Chou · Jun 2, 2015

SVD is a dimensionality reduction tool, which means it reduces the number of your features to a smaller, more representative set of components.
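As a rough illustration of that reduction, here is a minimal sketch. The toy corpus, the variable names and the choice of `n_components=2` are assumptions made for the example and are not taken from the question; `TfidfVectorizer` is used as a shortcut for `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus standing in for the text files in the question.
docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

# Sparse tf-idf matrix with one row per document and one column per term.
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Keep only 2 components (an arbitrary choice for this tiny corpus).
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(tfidf_matrix)

print(tfidf_matrix.shape)  # (number of documents, number of terms)
print(doc_vectors.shape)   # (number of documents, 2)
```

The number of rows is unchanged; only the number of columns shrinks from the vocabulary size to `n_components`.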

From the source code on GitHub:

def fit_transform(self, X, y=None):
    """Fit LSI model to X and perform dimensionality reduction on X.
    Parameters
    ----------
    X : {array-like, sparse matrix}, shape (n_samples, n_features)
        Training data.
    Returns
    -------
    X_new : array, shape (n_samples, n_components)
        Reduced version of X. This will always be a dense array.
    """

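We can see that the returned matrix has one row per sample (here, one per document) and a reduced number of components as columns. A matrix built from 27 documents has rank at most 27, so no more than 27 components can be extracted, which is why you get (27, 27) even though 100 were requested. You can then use a distance measure such as cosine similarity to determine the similarity of any two rows, that is, of any two documents.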
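Since the question is about words rather than documents, one possible approach (a sketch, not something stated in the docstring above) is to read the term representation from the fitted model's `components_` attribute, which has shape (n_components, n_features), and compare its transposed rows. The toy corpus and the helper `most_similar` are hypothetical, and whether to additionally scale the term vectors by the singular values is a modelling choice this sketch leaves out. Keep in mind that LSA similarity reflects co-occurrence across documents, so high-scoring pairs are related terms and not necessarily synonyms.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the text files in the question.
docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(tfidf_matrix)  # one row per document

# Document-to-document similarity: compare rows of the reduced matrix.
doc_sim = cosine_similarity(doc_vectors)

# Term vectors: components_ is (n_components, n_terms), so its transpose
# has one row per term in the same reduced space.
term_vectors = svd.components_.T
term_sim = cosine_similarity(term_vectors)

# vocabulary_ maps each term to its column index in the tf-idf matrix.
vocab = vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)  # terms ordered by column index


def most_similar(term, top_n=3):
    """Return the top_n terms closest to `term` by cosine similarity."""
    idx = vocab[term]
    order = np.argsort(-term_sim[idx])  # indices sorted by descending similarity
    ranked = [(terms[j], float(term_sim[idx, j])) for j in order if j != idx]
    return ranked[:top_n]


print(most_similar("cat"))
```

With a real corpus you would pass your own matrix and a larger `n_components`; the results on a five-sentence corpus are only meant to show the mechanics.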

This page also gives an easy example of LSA via SVD.