TFIDF calculating confusion

python data-mining text-processing information-retrieval tf-idf

badc0re · May 20, 2013 · Viewed 7.8k times

I found the following code on the internet for calculating TFIDF:

https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1+" to the denominator in the function def idf(word, documentList) so I won't get a division-by-zero error:

return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))

But I am confused about two things:

I get negative values in some cases. Is this correct?
I am confused by lines 62, 63 and 64.

Code:

documentNumber = 0
for word in documentList[documentNumber].split(None):
    words[word] = tfidf(word, documentList[documentNumber], documentList)

Should TFIDF be calculated on the first document only?

Answer

No, negative values are not correct. Tf-idf is tf, a non-negative value, times idf, a non-negative value, so it can never be negative. This code seems to be implementing the erroneous definition of tf-idf that was on Wikipedia for years (it has since been fixed).
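To see where the negative values can come from, here is a minimal sketch. It assumes whitespace tokenization and uses a hypothetical helper `num_docs_containing` as a stand-in for the linked script's `numDocsContaining`. With "1+" in the denominator, a word that occurs in every document gives log(N / (N + 1)), which is below zero. The `idf_smoothed` variant is only one possible way to keep idf non-negative, not necessarily what the linked code or any particular library does.

```python
import math


def num_docs_containing(word, document_list):
    """Count the documents in which `word` occurs at least once (hypothetical helper)."""
    return sum(1 for doc in document_list if word in doc.split())


def idf_plus_one(word, document_list):
    """idf with "1 +" in the denominator, as in the question."""
    n_docs = len(document_list)
    return math.log(n_docs / (1 + num_docs_containing(word, document_list)))


def idf_smoothed(word, document_list):
    """Assumed alternative: also add 1 inside the log, so the argument is always > 1."""
    n_docs = len(document_list)
    return math.log(1 + n_docs / (1 + num_docs_containing(word, document_list)))


docs = ["the cat sat", "the dog ran", "the bird flew"]

print(idf_plus_one("the", docs))  # log(3/4): negative, "the" is in all 3 documents
print(idf_plus_one("cat", docs))  # log(3/2): positive, "cat" is in 1 document
print(idf_smoothed("the", docs))  # log(1 + 3/4): positive
```

Since tf is a count or a frequency and never negative, the sign of tf-idf follows the sign of idf, so any negative result traces back to the idf term.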
