Cosine similarity when one of the vectors is all zeros

Sebastian Widz · Nov 2, 2014

How can I express the cosine similarity ( http://en.wikipedia.org/wiki/Cosine_similarity ) when one of the vectors is all zeros?

v1 = [1, 1, 1, 1, 1]

v2 = [0, 0, 0, 0, 0]

When we calculate according to the classic formula we get division by zero:

Let d1 = 0 0 0 0 0 0
Let d2 = 1 1 1 1 1 1
Cosine Similarity (d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)

dot(d1, d2) = (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) = 0

||d1|| = sqrt((0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2) = 0

||d2|| = sqrt((1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2) = 2.44948974278

Cosine Similarity (d1, d2) = 0 / ((0) * (2.44948974278))
                           = 0 / 0
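
The same computation as a minimal sketch in plain Python. The function name `cosine_similarity` is only illustrative and not taken from any library; the point is that the zero norm makes the denominator zero, so the division fails rather than returning a number.

```python
import math

def cosine_similarity(a, b):
    """Classic formula: dot(a, b) / (||a|| * ||b||), with no special cases."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [0, 0, 0, 0, 0, 0]
d2 = [1, 1, 1, 1, 1, 1]

try:
    result = cosine_similarity(d1, d2)
except ZeroDivisionError:
    # 0 / 0 is undefined; Python raises instead of returning a value
    result = None

print(result)
```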

I want to use this similarity measure in a clustering application, where I will often need to compare such vectors, including [0, 0, 0, 0, 0] vs. [0, 0, 0, 0, 0].

Do you have any experience with this? Since this is a similarity (not a distance) measure, should I use special cases such as

d( [1, 1, 1, 1, 1]; [0, 0, 0, 0, 0] ) = 0

d([0, 0, 0, 0, 0]; [0, 0, 0, 0, 0] ) = 1

And what about

d([1, 1, 1, 0, 0]; [0, 0, 0, 0, 0] ) = ? etc.

Answer

Has QUIT--Anony-Mousse · Nov 2, 2014

If you have all-zero vectors, cosine is the wrong similarity function for your application.

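Cosine distance is essentially equivalent to squared Euclidean distance on L2-normalized data (the two differ only by a constant factor of 2). That is, you normalize every vector to unit length, then compute the squared Euclidean distance. An all-zero vector has no direction and cannot be normalized to unit length, which is why the measure is undefined for it.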
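
For unit-length vectors a and b, ||a - b||^2 = 2 * (1 - cos(a, b)). A small numeric check of that identity, as a sketch; the helper names and the sample vectors are made up for illustration.

```python
import math

def l2_normalize(v):
    """Scale v to unit length; fails with ZeroDivisionError for an all-zero v."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = [1.0, 2.0, 0.0, 3.0]
b = [2.0, 0.0, 1.0, 1.0]

lhs = squared_euclidean(l2_normalize(a), l2_normalize(b))
rhs = 2.0 * cosine_distance(a, b)
print(math.isclose(lhs, rhs))
```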

Another benefit of cosine is performance: on very sparse, high-dimensional data it is faster to compute than Euclidean distance. It benefits from sparsity quadratically rather than just linearly, because only coordinates that are non-zero in both vectors contribute to the dot product.
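
One way to picture the sparsity benefit, sketched with vectors stored as `{index: value}` dictionaries (a representation assumed here purely for illustration): the dot product only needs the indices present in both vectors, while a naive Euclidean difference has to visit every index present in either one.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the vector with fewer non-zero entries
    return sum(value * b[i] for i, value in a.items() if i in b)

def sparse_squared_euclidean(a, b):
    """Naive squared Euclidean distance over the union of non-zero indices."""
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in a.keys() | b.keys())

a = {3: 1.0, 1000: 2.0, 50000: 1.0}
b = {7: 4.0, 1000: 3.0}

print(sparse_dot(a, b))                # only index 1000 is shared
print(sparse_squared_euclidean(a, b))  # visits indices 3, 7, 1000 and 50000
```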

While you obviously can try to hack the similarity to be 0 when exactly one vector is zero, and maximal when the two are identical, that won't really solve the underlying problem.
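
Such a special-cased similarity might look like the sketch below. The function name is hypothetical, and the return values (0 for exactly one zero vector, 1 for two zero vectors) are the conventions proposed in the question, not a standard definition.

```python
import math

def cosine_similarity_with_zero_hack(a, b):
    """Cosine similarity with ad hoc values where the formula is undefined."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 and norm_b == 0.0:
        return 1.0  # convention: two all-zero vectors count as identical
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing else
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

zeros = [0, 0, 0, 0, 0]
print(cosine_similarity_with_zero_hack([1, 1, 1, 1, 1], zeros))
print(cosine_similarity_with_zero_hack(zeros, zeros))
print(cosine_similarity_with_zero_hack([1, 1, 1, 0, 0], zeros))
```

These values are a choice rather than a consequence of the formula, so whether they mean anything depends on the data.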

Don't choose the distance by what you can easily compute.

Instead, choose the distance such that the result has a meaning on your data. If the value is undefined, it has no meaning.

Sometimes it works to discard all-zero vectors as meaningless data anyway (e.g. when analyzing noisy Twitter data, a Tweet that is all numbers and no words). Sometimes it doesn't.
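
If discarding is acceptable for your data, the filtering step could be sketched as follows. The helper name `split_zero_vectors` is made up for this example; it keeps the indices of the dropped vectors so they can be reported or handled separately.

```python
def split_zero_vectors(vectors):
    """Separate all-zero vectors from the rest before clustering."""
    nonzero, zero_indices = [], []
    for i, v in enumerate(vectors):
        if any(x != 0 for x in v):
            nonzero.append(v)
        else:
            zero_indices.append(i)  # remember which inputs were dropped
    return nonzero, zero_indices

data = [[1, 1, 1, 1, 1], [0, 0, 0, 0, 0], [1, 1, 1, 0, 0]]
kept, dropped = split_zero_vectors(data)
print(kept)
print(dropped)
```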