I have corpora of classified text. From each document I create a vector whose components are the word weights in that document, computed as TF-IDF values. Next I build a model in which every class is represented by a single vector: the model has as many vectors as there are classes in the corpora, and each component of a model vector is computed as the mean of the corresponding component values taken from all vectors in that class. For an unclassified vector I determine similarity with a model vector by computing the cosine between the two vectors.
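The setup described above (class centroids over TF-IDF vectors, classification by cosine similarity) can be sketched with plain numpy. The function names, toy matrix, and class labels below are illustrative, not from the original post:

```python
import numpy as np

def build_centroids(X, labels):
    """Average the TF-IDF vectors of each class into one model vector."""
    classes = sorted(set(labels))
    centroids = np.array([X[np.array(labels) == c].mean(axis=0) for c in classes])
    return classes, centroids

def cosine_classify(x, classes, centroids):
    """Return the class whose centroid has the highest cosine similarity to x."""
    sims = centroids @ x / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(x) + 1e-12)
    return classes[int(np.argmax(sims))]

# Toy TF-IDF matrix: 4 documents x 3 terms, two classes
X = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.0],
              [0.0, 0.1, 0.9],
              [0.1, 0.0, 0.8]])
labels = ["sports", "sports", "tech", "tech"]

classes, centroids = build_centroids(X, labels)
print(cosine_classify(np.array([0.7, 0.1, 0.1]), classes, centroids))  # → sports
```

This is essentially the Rocchio / nearest-centroid classifier with cosine as the similarity measure.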
Questions:
1) Can I use the Euclidean distance between an unclassified vector and a model vector to compute their similarity?
2) Why can't Euclidean distance be used as a similarity measure instead of the cosine of the angle between two vectors, and vice versa?
Thanks!
One informal but rather intuitive way to think about this is to consider the two components of a vector: direction and magnitude.
Direction is the "preference" / "style" / "sentiment" / "latent variable" of the vector, while the magnitude is how strong it is towards that direction.
When classifying documents we'd like to categorize them by their overall sentiment, so we use the angular distance.
Euclidean distance is susceptible to documents being clustered by their L2 norm (their magnitude) instead of their direction. I.e. vectors pointing in quite different directions can end up clustered together just because their distances from the origin are similar.
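A tiny sketch of that failure mode (the toy vectors are my own illustration): two documents about the same topic but of very different lengths are identical under cosine yet far apart under Euclidean distance, while an off-topic document of similar length is Euclidean-closer:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

short_doc = np.array([1.0, 0.0])   # topic A, few words
long_doc  = np.array([10.0, 0.0])  # same topic A, ten times the word counts
other_doc = np.array([0.0, 1.0])   # different topic, length similar to short_doc

# Same direction: cosine says identical, Euclidean says far apart
print(cosine(short_doc, long_doc))           # → 1.0
print(np.linalg.norm(short_doc - long_doc))  # → 9.0

# Different direction but similar magnitude: Euclidean prefers the wrong neighbour
print(np.linalg.norm(short_doc - other_doc))  # ≈ 1.41, much less than 9.0
print(cosine(short_doc, other_doc))           # → 0.0
```

Normalizing all vectors to unit length removes the magnitude effect, after which Euclidean distance and cosine similarity rank neighbours identically.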