I am using the Doc2Vec class from gensim in Python to convert a document to a vector.
An example of usage:

```python
model = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)
```
How should I interpret the `size` parameter? I know that if I set `size = 100`, the length of the output vector will be 100, but what does that mean? For instance, if I increase `size` to 200, what is the difference?
Word2Vec learns a distributed representation of a word, which essentially means that multiple neurons together capture a single concept (a concept can be word meaning, sentiment, part of speech, etc.), and a single neuron also contributes to multiple concepts.
These concepts are learnt automatically rather than pre-defined, so you can think of them as latent/hidden. For the same reason, the word vectors can be reused across multiple applications.
The larger the `size` parameter, the greater your neural network's capacity to represent these concepts, but the more data is required to train the vectors (since they are initialised randomly). Without a sufficient number of sentences or enough computing power, it's better to keep `size` small.
Doc2Vec uses a slightly different neural network architecture than Word2Vec, but the meaning of `size` is analogous.