I have a list of documents and the tf-idf score for each unique word in the entire corpus. How do I visualize that on a 2-D plot to gauge how many clusters I will need when I run k-means?
Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
sentence_list = ["Hi how are you", "Good morning" ...]
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
num_samples, num_features = vectorized.shape
print("num_samples: %d, num_features: %d" % (num_samples, num_features))
num_clusters=10
As you can see, I am able to transform my sentences into a tf-idf document-term matrix, but I am unsure how to plot the tf-idf scores as data points.
I was thinking:
Thanks
I am doing something similar at the moment: plotting the tf-idf scores for a dataset of texts in 2D. My approach, like the suggestions in other comments, is to reduce the tf-idf matrix to two dimensions with PCA and t-SNE from scikit-learn.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
num_clusters = 10
num_seeds = 10
max_iterations = 300
labels_color_map = {
    0: '#20b2aa', 1: '#ff7373', 2: '#ffe4e1', 3: '#005073', 4: '#4d0404',
    5: '#ccc0ba', 6: '#4700f9', 7: '#f6f900', 8: '#00f91d', 9: '#da8c49'
}
pca_num_components = 2
tsne_num_components = 2
# texts_list = some array of strings for which TF-IDF is being computed
# calculate tf-idf of texts
tf_idf_vectorizer = TfidfVectorizer(analyzer="word", use_idf=True, smooth_idf=True, ngram_range=(2, 3))
tf_idf_matrix = tf_idf_vectorizer.fit_transform(texts_list)
# create k-means model with custom config
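clustering_model = KMeans(
    n_clusters=num_clusters,
    n_init=num_seeds,  # number of random centroid initializations to try
    max_iter=max_iterations
)
labels = clustering_model.fit_predict(tf_idf_matrix)
# print(labels)
# PCA and t-SNE below work on a dense array; toarray() returns a plain ndarray
X = tf_idf_matrix.toarray()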
# PCA plot
reduced_data = PCA(n_components=pca_num_components).fit_transform(X)
# print(reduced_data)
fig, ax = plt.subplots()
for index, instance in enumerate(reduced_data):
    # print(instance, index, labels[index])
    pca_comp_1, pca_comp_2 = reduced_data[index]
    color = labels_color_map[labels[index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
plt.show()
# t-SNE plot
embeddings = TSNE(n_components=tsne_num_components)
Y = embeddings.fit_transform(X)
# color each point by its k-means label; without c, the cmap argument has no effect
plt.scatter(Y[:, 0], Y[:, 1], c=labels, cmap=plt.cm.Spectral)
plt.show()
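For anyone who wants to try this without their own dataset, here is a minimal self-contained sketch of the same idea: compute tf-idf, cluster with k-means, project to 2D with PCA, and color the points by cluster label in a single scatter call. The toy corpus, the helper names `project_and_cluster` and `plot_clusters`, and the choice of two clusters are all my own assumptions for illustration, not part of the code above.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only here so the sketch runs on its own.
toy_texts = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "my cat chased the kitten",
    "stock prices fell on the market today",
    "the market rallied as stock prices rose",
    "investors watched stock prices on the market",
]


def project_and_cluster(texts, n_clusters, random_state=0):
    """Return (2-D PCA coordinates, k-means labels) for a list of texts."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # Cluster in the full tf-idf space, not in the 2-D projection.
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=random_state
    ).fit_predict(tfidf)
    # PCA needs dense input, so convert the sparse matrix first.
    coords = PCA(n_components=2, random_state=random_state).fit_transform(
        tfidf.toarray()
    )
    return coords, labels


def plot_clusters(coords, labels):
    """Scatter the 2-D coordinates, one color per cluster label."""
    fig, ax = plt.subplots()
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10")
    ax.set_xlabel("PCA component 1")
    ax.set_ylabel("PCA component 2")
    return fig


coords, labels = project_and_cluster(toy_texts, n_clusters=2)
fig = plot_clusters(coords, labels)
```

Call `plt.show()` afterwards to display the figure. Keep in mind that a 2-D projection discards most of the variance in a high-dimensional tf-idf matrix, so treat the plot as a rough visual hint about the number of clusters rather than a reliable estimate.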