I'm trying to compute tf-idf value of each term in a document. So, I iterate through the terms in a document and want to find the frequency of the term in the whole corpus and the number of documents in which the term appears. Following is my code:
//@param index path to index directory
//@param docNbr the document number in the index
public void readingIndex(String index, int docNbr) {
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
Document doc = reader.document(docNbr);
System.out.println("Processing file: "+doc.get("id"));
Terms termVector = reader.getTermVector(docNbr, "contents");
TermsEnum itr = termVector.iterator(null);
BytesRef term = null;
while ((term = itr.next()) != null) {
String termText = term.utf8ToString();
long termFreq = itr.totalTermFreq(); //FIXME: this only return frequency in this doc
long docCount = itr.docFreq(); //FIXME: docCount = 1 in all cases
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);
}
reader.close();
}
Although the documentation says totalTermFreq() returns the total number of occurrences of this term across all documents, when testing I found it only returns the frequency of the term in the document given by docNbr. and docFreq() always return 1.
How can I get frequency of a term across the whole index?
Update Of course, I can create a map to map a term to its frequency. Then iterate through each document to count the total number of time a term occur. However, I thought Lucene should have a built in method for that purpose. Thank you,
IndexReader.TotalTermFreq(Term)
will provide this for you. Your calls to the similar methods on the TermsEnum
are indeed providing the stats for all documents, in the enumeration. Using the reader should get you the stats for all the documents in the index itself. Something like:
String termText = term.utf8ToString();
Term termInstance = new Term("contents", term);
long termFreq = reader.totalTermFreq(termInstance);
long docCount = reader.docFreq(termInstance);
System.out.println("term: "+termText+", termFreq = "+termFreq+", docCount = "+docCount);