Normalizing TF-IDF results

John · Jul 1, 2012 · Viewed 16.3k times

I would like to normalize the TF-IDF results that I get from the following code:

    for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
        TermFreqVector tfv = ir.getTermFreqVector(docNum, "contents");
        if (tfv == null) {
            // ignore empty fields
            continue;
        }
        String[] tterms = tfv.getTerms();
        int termCount = tterms.length;
        int[] freqs = tfv.getTermFrequencies();
        for (int t = 0; t < termCount; t++) {
            // cast to double so the ratio is not truncated by integer division
            double idf = (double) ir.numDocs() / ir.docFreq(new Term("contents", tterms[t]));
            System.out.println(" " + tterms[t] + " " + freqs[t] * Math.log(idf));
        }
    }

The output of this code is:

area 0.0
areola 5.877735781779639
ari 3.9318256327243257
art 1.6094379124341003
artifici 1.0986122886681098
assign 2.1972245773362196
associ 3.295836866004329
assur 1.9459101490553132
averag 1.0986122886681098
avoid 0.6931471805599453
.
.
.

Any help would be much appreciated. Thank you.

Answer

Has QUIT--Anony-Mousse · Jul 5, 2012

A common approach is to normalize by document size: instead of using the raw term counts (absolute frequencies), you use the relative frequencies.

Let freqsum be the sum over your frequencies array. Then use

freqs[t]/(double)freqsum*Math.log(idf)
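
Applied to the loop from your question, that looks roughly like this (a sketch using the same Lucene 3.x TermFreqVector API and "contents" field as above):

    // document size: total number of term occurrences in this document
    int freqsum = 0;
    for (int f : freqs) {
        freqsum += f;
    }

    for (int t = 0; t < termCount; t++) {
        double idf = (double) ir.numDocs() / ir.docFreq(new Term("contents", tterms[t]));
        // relative frequency instead of the raw count
        double tfidf = freqs[t] / (double) freqsum * Math.log(idf);
        System.out.println(" " + tterms[t] + " " + tfidf);
    }

This way the weights are comparable across documents of different lengths.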

To avoid this kind of confusion, I recommend using the terminology:

  • term counts for the "absolute frequencies"
  • relative frequency for the word-in-document ratio

instead of the ambiguous term "term frequency".

Historically, if you look up Salton and Yang, "On the specification of term values in automatic indexing" (1973), the term refers to absolute counts. Cosine similarity removes the scale anyway, so there it does not matter. Modern systems such as Lucene try to control the influence of document length more carefully.
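
As a minimal illustration of the scale point (plain Java with a hypothetical cosine helper, not a Lucene API): multiplying one vector by a constant does not change the cosine, because the factor cancels in both the dot product and the norm.

    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            nx += x[i] * x[i];
            ny += y[i] * y[i];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    // cosine(new double[]{1, 2, 3}, y) == cosine(new double[]{2, 4, 6}, y),
    // so absolute counts and relative frequencies give the same cosine similarity.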