What does the vector of a word in word2vec represent?

user168983 · Nov 20, 2014 · Viewed 11.3k times

word2vec is an open-source tool by Google:

  • For each word it provides a vector of float values. What exactly do they represent?

  • There is also a paper on paragraph vectors. Can anyone explain how they use word2vec to obtain a fixed-length vector for a paragraph?

Answer

Cedias · Dec 2, 2014

TL;DR: Word2Vec builds word projections (embeddings) in a latent space of N dimensions, N being the size of the word vectors obtained. The float values represent the coordinates of the words in this N-dimensional space.

The main idea behind latent space projections, that is, placing objects in a different, continuous space, is that each object gets a representation (a vector) with more useful computational properties than the raw object itself.

For words, what's useful is that you get a dense vector space which encodes similarity (e.g. the vector for tree is more similar to the vector for wood than to the vector for dancing). This contrasts with classical sparse one-hot or "bag-of-words" encodings, which treat each word as its own dimension, making words orthogonal by design (i.e. tree, wood and dancing are all the same distance from each other).
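To make the contrast concrete, here is a minimal sketch comparing cosine similarity under the two encodings. The dense vectors are hand-picked, hypothetical values chosen for illustration; they are not the output of a trained word2vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot encoding: one dimension per vocabulary word.
one_hot = {
    "tree":    [1.0, 0.0, 0.0],
    "wood":    [0.0, 1.0, 0.0],
    "dancing": [0.0, 0.0, 1.0],
}

# Hypothetical dense vectors (assumed values, not trained embeddings).
dense = {
    "tree":    [0.9, 0.8, 0.1],
    "wood":    [0.8, 0.9, 0.2],
    "dancing": [0.1, 0.2, 0.9],
}

# Every pair of distinct one-hot vectors is orthogonal.
one_hot_tree_wood = cosine(one_hot["tree"], one_hot["wood"])
one_hot_tree_dancing = cosine(one_hot["tree"], one_hot["dancing"])

# Dense vectors can place related words closer than unrelated ones.
dense_tree_wood = cosine(dense["tree"], dense["wood"])
dense_tree_dancing = cosine(dense["tree"], dense["dancing"])
```

With one-hot vectors both similarities are zero, so the encoding cannot say that tree is closer to wood than to dancing. With the dense toy vectors, the tree/wood similarity is clearly higher than the tree/dancing one.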

Word2Vec algorithms do this:

Imagine that you have a sentence:

The dog has to go ___ for a walk in the park.

You obviously want to fill the blank with the word "outside", but "out" would also work. The w2v algorithms are inspired by this idea. You'd like all words that can fill in the blank to end up near each other, because they belong together. This is called the Distributional Hypothesis. As a result, the words "out" and "outside" will be closer together, whereas a word like "carrot" will be farther away.
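The fill-in-the-blank idea can be sketched as building (context, target) training examples, roughly in the spirit of the CBOW variant of word2vec. The function name `cbow_examples` and the window size of 2 are assumptions made for this illustration, not part of the word2vec tool itself.

```python
def cbow_examples(tokens, window=2):
    """Return one (context_words, target_word) pair per token position."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        examples.append((left + right, target))
    return examples

template = "the dog has to go {} for a walk in the park"

# Collect the context seen around each candidate filler word.
contexts = {}
for filler in ("outside", "out"):
    tokens = template.format(filler).split()
    for context, target in cbow_examples(tokens):
        if target == filler:
            contexts[filler] = context
```

Both "outside" and "out" are seen with the same surrounding words, so a model trained to predict the target from its context is pushed to give them similar vectors.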

This is sort of the "intuition" behind word2vec. For a more theoretical explanation of what's going on, I'd suggest reading:

For paragraph vectors, the idea is the same as in w2v. Each paragraph can be represented by its words. Two models are presented in the paper.

  1. In a "bag-of-words" way (the PV-DBOW model), where one fixed-length paragraph vector is used to predict the words of the paragraph.
  2. By adding a fixed-length paragraph token to the word contexts (the PV-DM model). By backpropagating the gradient, the paragraph vector gets "a sense" of what's missing, which brings paragraphs with the same words/topic "missing" close together.

Bits from the article:

The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. [...] The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph
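As a rough illustration of the "averaged or concatenated" step in the quote, the sketch below combines one paragraph vector with a few context word vectors. All values and helper names are hypothetical, and the prediction layer that would consume these features is left out.

```python
def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concatenate(vectors):
    """Join vectors end to end into one longer vector."""
    return [x for vector in vectors for x in vector]

# Hypothetical 2-dimensional vectors (assumed values, not trained).
paragraph_vec = [0.5, -0.5]
context_word_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# The paragraph token is treated like one more word in the context.
inputs = [paragraph_vec] + context_word_vecs

avg_features = average(inputs)       # same size as a single vector
cat_features = concatenate(inputs)   # size grows with the context length
```

Averaging keeps the input size fixed regardless of how many context words are used, while concatenation preserves word order at the cost of a larger input.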

For a full understanding of how these vectors are built, you'll need to learn how neural nets are built and how the backpropagation algorithm works. (I'd suggest starting with this video and Andrew Ng's Coursera class.)

NB: Softmax is just a fancy way of saying classification; each word in the w2v algorithms is considered a class. Hierarchical softmax and negative sampling are tricks to speed up softmax and handle a large number of classes.
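A minimal sketch of both ideas, assuming a toy vocabulary of three words and made-up scores: a full softmax turns one score per word into a probability per class, while a negative-sampling style loss only looks at the true word and a few sampled "wrong" words instead of the whole vocabulary.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(target_score, negative_scores):
    """Binary-classification loss: true word vs. a few sampled negatives."""
    loss = -math.log(sigmoid(target_score))
    loss -= sum(math.log(sigmoid(-s)) for s in negative_scores)
    return loss

# Hypothetical vocabulary and scores for the blank in the example sentence.
vocab = ["outside", "out", "carrot"]
scores = [2.0, 1.5, -1.0]

probs = softmax(scores)                       # one probability per class (word)
predicted = vocab[probs.index(max(probs))]

# Only the true word and one sampled negative are scored here.
loss = negative_sampling_loss(scores[0], [scores[2]])
```

The softmax cost grows with the vocabulary size because every word needs a score, whereas the negative-sampling loss touches only the target and the sampled negatives.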