I would like to normalize the tf-idf results that I get from the following code:
for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
    TermFreqVector tfv = ir.getTermFreqVector(docNum, "contents");
    if (tfv == null) {
        // ignore empty fields
        continue;
    }
    String[] tterms = tfv.getTerms();
    int termCount = tterms.length;
    int[] freqs = tfv.getTermFrequencies();
    for (int t = 0; t < termCount; t++) {
        // cast to double so the idf ratio is not truncated by integer division
        double idf = (double) ir.numDocs() / ir.docFreq(new Term("contents", tterms[t]));
        System.out.println(" " + tterms[t] + " " + freqs[t] * Math.log(idf));
    }
}
The output of this code is:
area 0.0
areola 5.877735781779639
ari 3.9318256327243257
art 1.6094379124341003
artifici 1.0986122886681098
assign 2.1972245773362196
associ 3.295836866004329
assur 1.9459101490553132
averag 1.0986122886681098
avoid 0.6931471805599453
...
Any help would be much appreciated. Thank you.
A common approach is to normalize by document size: instead of using the term counts (absolute frequencies), you use the relative frequencies.
Let freqsum be the sum over your frequencies array. Then use

freqs[t] / (double) freqsum * Math.log(idf)
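Putting it together, here is a minimal sketch of your loop with this normalization applied. It assumes the same pre-4.0 Lucene API as your code (TermFreqVector etc.); freqsum is simply the document length as defined above:

// Sketch: same loop as in the question, but using relative term frequencies.
for (int docNum = 0; docNum < ir.numDocs(); docNum++) {
    TermFreqVector tfv = ir.getTermFreqVector(docNum, "contents");
    if (tfv == null) {
        continue;
    }
    String[] tterms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    // Document length as the sum of all term counts in this document.
    int freqsum = 0;
    for (int f : freqs) {
        freqsum += f;
    }
    for (int t = 0; t < tterms.length; t++) {
        double idf = (double) ir.numDocs() / ir.docFreq(new Term("contents", tterms[t]));
        // Relative frequency instead of the absolute count.
        double tfidf = freqs[t] / (double) freqsum * Math.log(idf);
        System.out.println(" " + tterms[t] + " " + tfidf);
    }
}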
To avoid this type of confusion, I recommend using the terminology "term count" for the absolute counts and "relative term frequency" for the normalized values, instead of the ambiguous term "term frequency".
I know that historically, if you look up Salton and Yang, "On the specification of term values in automatic indexing" (1973), they refer to absolute counts. Cosine similarity removes the scale anyway, so there it does not matter. Modern systems like Lucene try to control the influence of document length better.
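To see why cosine similarity removes the scale: multiplying one vector by a constant multiplies both the dot product and that vector's norm by the same constant, which cancels. A small self-contained sketch in plain Java, independent of Lucene:

// Cosine similarity of two term-weight vectors of equal length.
static double cosine(double[] x, double[] y) {
    double dot = 0.0, normX = 0.0, normY = 0.0;
    for (int i = 0; i < x.length; i++) {
        dot += x[i] * y[i];
        normX += x[i] * x[i];
        normY += y[i] * y[i];
    }
    return dot / (Math.sqrt(normX) * Math.sqrt(normY));
}
// cosine(x, y) is unchanged if every entry of x is divided by freqsum:
// dot and Math.sqrt(normX) shrink by the same factor, which cancels out.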