Cosine similarity and tf-idf

N00programmer · Jun 6, 2011 · Viewed 51.6k times

I am confused by the following comment about TF-IDF and Cosine Similarity.

I was reading up on both, and then on the wiki page for Cosine Similarity I found this sentence: "In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°."

Now I'm wondering....aren't they 2 different things?

Is tf-idf already inside the cosine similarity? If yes, then what the heck - I can only see the dot product and Euclidean lengths.

I thought tf-idf was something you could do before running cosine similarity on the texts. Did I miss something?

Answer

rcreswick · Oct 27, 2013

TF-IDF is just a way to measure the importance of tokens in text; it's a very common way to turn a document into a list of numbers (the term vector that provides one edge of the angle you're getting the cosine of).

To compute cosine similarity, you need two document vectors; the vectors represent each unique term with an index, and the value at that index is some measure of how important that term is to the document and to the broader goal of comparing documents.
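As a rough sketch (not part of the original answer), the cosine computation itself looks something like this in Python, assuming each document has already been turned into a dict mapping terms to weights:

import math

def cosine_similarity(vec_a, vec_b):
    # vec_a and vec_b map each term to its weight (raw count, tf-idf score, ...)
    terms = set(vec_a) | set(vec_b)
    dot = sum(vec_a.get(t, 0.0) * vec_b.get(t, 0.0) for t in terms)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

The interesting question is what those weights should be, which is where TF-IDF comes in.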

You could simply count the number of times each term occurred in the document (Term Frequency), and use that integer result for the term score in the vector, but the results wouldn't be very good. Extremely common terms (such as "is", "and", and "the") would cause lots of documents to appear similar to each other. (Those particular examples can be handled by using a stopword list, but other common terms that are not general enough to be considered a stopword cause the same sort of issue. On Stackoverflow, the word "question" might fall into this category. If you were analyzing cooking recipes, you'd probably run into issues with the word "egg".)
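For reference, the raw counting itself is trivial; here's a hypothetical sketch (not from the original answer):

from collections import Counter

def term_frequencies(text):
    # naive whitespace tokenization; a real system would also strip
    # punctuation and possibly remove stopwords
    return Counter(text.lower().split())

term_frequencies("the egg and the flour")  # Counter({'the': 2, 'egg': 1, 'and': 1, 'flour': 1})

It's the weighting of those counts, not the counting, that makes the difference.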

TF-IDF adjusts the raw term frequency by taking into account how frequently each term occurs in general (the Document Frequency). Inverse Document Frequency is usually the log of the total number of documents divided by the number of documents the term occurs in:

idf(term) = log( total number of documents / number of documents containing the term )

Think of the 'log' as a minor nuance that helps things work out in the long run -- it grows as its argument grows, so if the term is rare, the IDF will be high (lots of documents divided by very few documents); if the term is common, the IDF will be low (lots of documents divided by lots of documents ~= 1).
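Numerically (a hedged sketch; the counts are made up just to show the shape of the curve):

import math

def idf(total_docs, docs_with_term):
    return math.log(total_docs / docs_with_term)

idf(100, 2)   # rare term:   ~3.91 (high)
idf(100, 95)  # common term: ~0.05 (low; the ratio is close to 1, and log(1) = 0)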

Say you have 100 recipes, and all but one requires eggs. Now you get three more documents that all contain the word "egg": once in the first document, twice in the second, and once in the third. The term frequency for 'egg' in each new document is 1 or 2, and the document frequency is 99 (or, arguably, 102, if you count the new documents; let's stick with 99).

The TF-IDF of 'egg' is:

1 * log (100/99) = 0.01    # document 1
2 * log (100/99) = 0.02    # document 2
1 * log (100/99) = 0.01    # document 3

These are all pretty small numbers; in contrast, let's look at another word that occurs in only 9 of the 100 recipes in your corpus: 'arugula'. It occurs once in the first doc, twice in the second, and does not occur in the third document.

The TF-IDF for 'arugula' is:

1 * log (100/9) = 2.40  # document 1
2 * log (100/9) = 4.81  # document 2
0 * log (100/9) = 0     # document 3
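If you want to check those numbers yourself, they work out with the natural log (a quick sketch, not from the original answer):

import math

def tf_idf(tf, total_docs, docs_with_term):
    return tf * math.log(total_docs / docs_with_term)

tf_idf(1, 100, 99)  # egg, document 1     -> ~0.01
tf_idf(2, 100, 99)  # egg, document 2     -> ~0.02
tf_idf(1, 100, 9)   # arugula, document 1 -> ~2.40
tf_idf(2, 100, 9)   # arugula, document 2 -> ~4.81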

'arugula' is really important for document 2, at least compared to 'egg'. Who cares how many times egg occurs? Everything contains egg! These term vectors are a lot more informative than simple counts, and they will result in documents 1 & 2 being much closer together (with respect to document 3) than they would be if simple term counts were used. In this case, the same result would probably arise (hey! we only have two terms here), but the difference would be smaller.
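To make that concrete, here is a toy comparison using the cosine_similarity sketch from above on just these two terms (again, an illustrative sketch rather than part of the original answer):

raw_1 = {"egg": 1, "arugula": 1}
raw_2 = {"egg": 2, "arugula": 2}
raw_3 = {"egg": 1, "arugula": 0}

tfidf_1 = {"egg": 0.01, "arugula": 2.40}
tfidf_2 = {"egg": 0.02, "arugula": 4.81}
tfidf_3 = {"egg": 0.01, "arugula": 0.0}

cosine_similarity(raw_1, raw_3)      # ~0.71   raw counts still make documents 1 and 3 look fairly similar
cosine_similarity(tfidf_1, tfidf_3)  # ~0.004  tf-idf pushes document 3 far away
cosine_similarity(tfidf_1, tfidf_2)  # ~1.0    documents 1 and 2 stay very close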

The take-home here is that TF-IDF generates more useful measures of a term in a document, so you don't focus on really common terms (stopwords, 'egg'), and lose sight of the important terms ('arugula').