I wanted to understand how fastText creates vectors for sentences. According to issue 309, the sentence vector is obtained by averaging the word vectors.
In order to confirm this, I wrote the following script:
import numpy as np
import fastText as ft
# Loading model for Finnish.
model = ft.load_model('cc.fi.300.bin')
# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (one + two) / 2
# Checking if the two approaches yield the same result.
is_equal = np.array_equal(one_two, one_two_avg)
# Printing the result.
print(is_equal)
# Result: False
But it seems the two vectors are not equal.
Why aren't both values the same? Is it related to the way I am averaging the vectors, or is there something I am missing?
First, you missed the fact that get_sentence_vector is not just a simple average. Before fastText sums the word vectors, each one is divided by its L2 norm, and only vectors with a positive L2 norm take part in the average.
Second, a sentence always ends with an EOS token, so if you do the calculation manually you need to include EOS before averaging.
Try this (I assume the L2 norm of each word vector is positive):
def l2_norm(x):
    return np.sqrt(np.sum(x**2))

def div_norm(x):
    norm_value = l2_norm(x)
    if norm_value > 0:
        return x * (1.0 / norm_value)
    else:
        return x
# Getting word vectors for 'one' and 'two'.
one = model.get_word_vector('yksi')
two = model.get_word_vector('kaksi')
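# Getting the EOS vector; a sentence always ends with EOS.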
eos = model.get_word_vector('\n')
# Getting the sentence vector for the sentence "one two" in Finnish.
one_two = model.get_sentence_vector('yksi kaksi')
one_two_avg = (div_norm(one) + div_norm(two) + div_norm(eos)) / 3
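Because this is floating-point arithmetic, an exact comparison with np.array_equal can fail even when the computation matches, so a tolerance-based check is more appropriate. A minimal check, reusing one_two and one_two_avg from above (the tolerance value is my own choice):

# Compare the manual average with get_sentence_vector up to a small tolerance.
print(np.allclose(one_two, one_two_avg, atol=1e-6))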
You can see the source code here, or the discussion here.
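If you want to apply the same steps to an arbitrary sentence, here is a minimal sketch (the helper name manual_sentence_vector is mine, not part of the fastText API; it reuses l2_norm from above and assumes at least one vector has a positive norm):

def manual_sentence_vector(model, sentence):
    # Word vectors for each whitespace-separated token, plus the EOS vector.
    vectors = [model.get_word_vector(w) for w in sentence.split()]
    vectors.append(model.get_word_vector('\n'))
    # Keep vectors with a positive L2 norm, normalize them, then average.
    normalized = [v / l2_norm(v) for v in vectors if l2_norm(v) > 0]
    return np.mean(normalized, axis=0)

manual = manual_sentence_vector(model, 'yksi kaksi')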