I have two vectors with 11 dimensions.
a <- c(-0.012813841, -0.024518383, -0.002765056, 0.079496744, 0.063928973,
0.476156960, 0.122111977, 0.322930189, 0.400701256, 0.454048860,
0.525526219)
b <- c(0.64175768, 0.54625694, 0.40728261, 0.24819750, 0.09406221,
0.16681692, -0.04211932, -0.07130129, -0.08182200, -0.08266852,
-0.07215885)
library(lsa)
cosine_sim <- cosine(a, b)
which returns:
-0.05397935
I used cosine() from the lsa package.
For some values I am getting a negative cosine_sim, like the one above. I am not sure how the similarity can be negative; I expected it to be between 0 and 1.
Can anyone explain what is going on here?
The nice thing about R is that you can often dig into the functions and see for yourself what is going on. If you type cosine (without any parentheses, arguments, etc.), R prints out the body of the function. Poking through it (which takes some practice), you can see that there is a bunch of machinery for computing the pairwise similarities of the columns of a matrix (i.e., the bit wrapped in the if (is.matrix(x) && is.null(y)) condition), but the key line of the function is
crossprod(x, y)/sqrt(crossprod(x) * crossprod(y))
Let's pull this out and apply it to your example:
> crossprod(a,b)/sqrt(crossprod(a)*crossprod(b))
[,1]
[1,] -0.05397935
> crossprod(a)
[,1]
[1,] 1
> crossprod(b)
[,1]
[1,] 1
So, you're using vectors that are already normalized, which means the denominator is 1 and you just have crossprod to look at. In your case this is equivalent to
> sum(a*b)
[1] -0.05397935
(for real matrix operations, crossprod is much more efficient than constructing the equivalent operation by hand).
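As @Jack Maney's answer says, the dot product of two vectors (which is |a|*|b|*cos(theta), where |a| and |b| are the Euclidean lengths of the vectors, not R's length(), and theta is the angle between them) can be negative: it is negative whenever the angle is larger than 90 degrees ...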
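To make the sign behaviour concrete, here is a minimal base-R sketch. The helper cosine_by_hand is a hypothetical name introduced for illustration (it is not part of lsa), and the 2-d vectors are made up so that the angles are easy to read off.

```r
# Hypothetical helper (not part of lsa): cosine similarity from first principles,
# i.e. dot product divided by the product of the Euclidean norms.
cosine_by_hand <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

u <- c(1, 0)
cosine_by_hand(u, c(2, 0))    # same direction: 1
cosine_by_hand(u, c(0, 3))    # perpendicular: 0
cosine_by_hand(u, c(-1, 0))   # opposite direction: -1
cosine_by_hand(u, c(-1, 1))   # 135 degrees apart: -1/sqrt(2), about -0.707
```

The result only stays in [0, 1] when every component of both vectors is non-negative (as with raw term counts); your vectors contain negative entries, so the full range [-1, 1] is possible.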
For what it's worth, I suspect that the cosine function in lsa might be more easily/efficiently implemented for matrix arguments as as.dist(crossprod(x)) (at least when the columns of x are already normalized) ...
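As a rough sketch of that matrix idea (not lsa's actual implementation), the following normalizes the columns first and then takes crossprod. The name pairwise_cosine and the small matrix m are assumptions made up for this example.

```r
# Hypothetical helper: pairwise cosine similarities between the columns of m.
pairwise_cosine <- function(m) {
  norms <- sqrt(colSums(m^2))        # Euclidean norm of each column
  unit <- sweep(m, 2, norms, "/")    # divide each column by its norm
  crossprod(unit)                    # t(unit) %*% unit: all pairwise dot products
}

m <- cbind(p = c(1, 0), q = c(0, 2), r = c(-3, 0))
sims <- pairwise_cosine(m)
sims            # full symmetric matrix with 1s on the diagonal
as.dist(sims)   # lower triangle only, as suggested above
```

Here p and r point in opposite directions, so their entry is -1, while q is perpendicular to both and gets 0.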
edit: in comments on a now-deleted answer below, I suggested that the square of the cosine-distance measure might be appropriate if one wants a similarity measure on [0,1] -- this would be analogous to using the coefficient of determination (r^2) rather than the correlation coefficient (r) -- but that it might also be worth going back and thinking more carefully about the purpose/meaning of the similarity measures to be used ...
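A small sketch of the squared measure, again with a hypothetical helper (cos_sim) and made-up 2-d vectors, shows both what it buys you and what it throws away:

```r
# Hypothetical helper: plain cosine similarity in base R.
cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

u <- c(1, 0)
v <- c(-1, 1)   # 135 degrees from u
w <- c(-2, 0)   # exactly opposite to u

cos_sim(u, v)     # -1/sqrt(2), about -0.707
cos_sim(u, v)^2   # 0.5, now on [0, 1]
cos_sim(u, w)^2   # 1: opposite vectors score the same as identical directions
```

Squaring discards the sign, so vectors pointing in opposite directions count as maximally similar; whether that is acceptable depends on what the similarity is supposed to mean in your application.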