Recently, I've been trying to find the closest word to an embedding. The two most notable ways of doing this are by cosine distance or Euclidean distance.
I'm trying to find out how to efficiently compute the cosine distance for a tensor of shape [batch_size x embedding_size].
One approach is to unpack the tensor and then compute the cosine distance:
import tensorflow as tf

# embedding is shape [vocab_size x embedding_size]
# batch_array is shape [batch_size x embedding_size]
array_list = tf.unpack(batch_array)        # list of batch_size vectors
word_class_list = tf.unpack(embedding)     # list of vocab_size vectors
index_list_of_closest_word = []
for eacharray in array_list:
    list_of_distances = []
    for eachwordclass in word_class_list:
        # cosine_distance is a user-defined helper; argmax below assumes it
        # returns a similarity-style score (higher = closer)
        list_of_distances.append(cosine_distance(eacharray, eachwordclass))
    index_list_of_closest_word.append(tf.argmax(tf.pack(list_of_distances), 0))
However, this approach is terribly inefficient. Is there perhaps a more efficient way to do this? I know word2vec does this pretty fast, and TensorFlow, with the power of a GPU, should be able to do these batch calculations in parallel.
Thanks!
The cosine similarity formula is:

similarity = cos(θ) = (A · B) / (||A|| ||B||)
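For intuition, here is a minimal NumPy sketch of that formula for two single vectors (the vector values are made up purely for illustration):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
# dot product divided by the product of the L2 norms
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)  # approximately 0.9746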
The inputs you have are:
embedding: the embedding matrix, of shape [vocab_size, embedding_size]
batch_array: a batch of embeddings for which you want to find the closest words, of shape [batch_size, embedding_size]
embedding = tf.placeholder(tf.float32, [vocab_size, embedding_size])
batch_array = tf.placeholder(tf.float32, [batch_size, embedding_size])
To compute the cosine similarity, you can first L2-normalize both inputs (you may want to store the normed embedding, as you are going to reuse it a lot):
normed_embedding = tf.nn.l2_normalize(embedding, dim=1)
normed_array = tf.nn.l2_normalize(batch_array, dim=1)
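For reference, tf.nn.l2_normalize(x, dim=1) divides each row by its L2 norm. A roughly equivalent manual version (ignoring the small epsilon that l2_normalize adds for numerical stability) would be:

# manual row-wise L2 normalization, roughly equivalent to tf.nn.l2_normalize(..., dim=1)
row_norms = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
normed_embedding_manual = embedding / row_norms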
Then you have to compute the dot products of all words (vocab_size in total) vs. all arrays from the batch (batch_size in total):
cosine_similarity = tf.matmul(normed_array, tf.transpose(normed_embedding, [1, 0]))
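Equivalently, you can let matmul handle the transpose; this should behave the same, just written more compactly:

cosine_similarity = tf.matmul(normed_array, normed_embedding, transpose_b=True)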
Finally, you can compute the argmax for each element of the batch:
closest_words = tf.argmax(cosine_similarity, 1) # shape [batch_size], type int64
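Putting it together, here is a minimal end-to-end sketch using the TensorFlow 1.x-style API from the snippets above; the vocab_size, batch_size, and embedding_size values and the random inputs are just placeholders for illustration:

import numpy as np
import tensorflow as tf

vocab_size, batch_size, embedding_size = 1000, 4, 128  # illustrative sizes

embedding = tf.placeholder(tf.float32, [vocab_size, embedding_size])
batch_array = tf.placeholder(tf.float32, [batch_size, embedding_size])

normed_embedding = tf.nn.l2_normalize(embedding, dim=1)
normed_array = tf.nn.l2_normalize(batch_array, dim=1)

cosine_similarity = tf.matmul(normed_array, tf.transpose(normed_embedding, [1, 0]))
closest_words = tf.argmax(cosine_similarity, 1)  # shape [batch_size], type int64

with tf.Session() as sess:
    indices = sess.run(closest_words, feed_dict={
        embedding: np.random.rand(vocab_size, embedding_size),
        batch_array: np.random.rand(batch_size, embedding_size),
    })
    print(indices)  # one vocab index per batch element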