Python - Calculate Hierarchical clustering of word2vec vectors and plot the results as a dendrogram

Shlomi Schwartz · Jan 4, 2017 · Viewed 7k times

I've generated a 100D word2vec model from my domain text corpus, merging common phrases (for example, good bye => good_bye). Then I extracted the vectors of 1000 desired words.

So I have a numpy.array of 1000 vectors, like so:

[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
 [-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
 ...
 ...[1000 Vectors]
]

And words array like so:

["hello","hi","bye","good_bye"...1000]

I ran K-Means on my data, and the results I got made sense:

import numpy as np
from sklearn.cluster import KMeans

X = np.array(words_vectors)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
for idx, label in enumerate(kmeans.labels_):
    print(label, words[idx])

--- Output ---
0 hello
0 hi
1 bye
1 good_bye

0 = greeting, 1 = farewell

However, some words made me think that hierarchical clustering is more suitable for the task. I tried using AgglomerativeClustering. Unfortunately, for this Python newbie, things got complicated and I got lost.

How can I cluster my vectors so that the output is a dendrogram, more or less like the one found on this wiki page?

Answer

Antoine Reinhold Bertrand · Jan 5, 2017

I had the same problem until now! Your post kept coming up whenever I searched online (keywords: hierarchy clustering on word2vec), so I felt I had to offer a solution that might work.

import gensim
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

sentences = ['hi', 'hello', 'hi hello', 'goodbye', 'bye', 'goodbye bye']
sentences_split = [s.lower().split(' ') for s in sentences]

# min_count=2 keeps only words that occur at least twice
model = gensim.models.Word2Vec(sentences_split, min_count=2)

# gensim >= 4.0 names; older releases used model.wv.syn0 and model.wv.index2word
vectors = model.wv.vectors
vocab = model.wv.index_to_key

# complete linkage on standardized Euclidean distance
l = linkage(vectors, method='complete', metric='seuclidean')

# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    l,
    leaf_rotation=90.,  # rotates the leaf (word) labels
    leaf_font_size=16.,  # font size for the leaf labels
    orientation='left',  # leaves on the y axis, distance on the x axis
    leaf_label_func=lambda v: vocab[v]  # map a leaf index to its word
)
plt.show()
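
If the vectors are already sitting in a plain array, as in the question, the same idea can probably be applied without going through a gensim model. The following is only a minimal sketch: the toy 3D vectors stand in for the real 100D ones, and the average linkage, the cosine metric, and the cut at two clusters are my assumptions for illustration, not something the question specifies.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy stand-ins for the question's data (hypothetical values, not real word2vec output).
words = ["hello", "hi", "bye", "good_bye"]
X = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.0],
    [-1.0, 0.1, 0.9],
    [-0.9, 0.0, 1.0],
])

# Average linkage on cosine distance (an assumed choice; other methods/metrics work too).
Z = linkage(X, method="average", metric="cosine")

# no_plot=True returns the tree layout without drawing anything.
# Drop it and call matplotlib's plt.show() to render the dendrogram
# with the words as leaf labels.
tree = dendrogram(Z, labels=words, orientation="left", no_plot=True)
leaf_order = tree["ivl"]  # leaf labels in the order they appear along the axis

# Cut the tree into at most 2 flat clusters, comparable to the K-Means labels.
flat = fcluster(Z, t=2, criterion="maxclust")
for label, word in zip(flat, words):
    print(label, word)
```

Passing labels=words replaces the leaf_label_func lookup used above, and fcluster gives flat cluster ids if you still want something comparable to the K-Means output.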