I am building a search engine and want to use tf*idf to rank my XML documents against a search query. How do I implement it, and where should I start? Any help is appreciated.
I did this in the past, and I used Lucene to get the TF*IDF data.
It took a fair amount of fiddling around though, so if other people know of easier solutions, use those.
Start by looking at TermFreqVector and other classes in org.apache.lucene.index.
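If you would rather see the arithmetic before wiring up Lucene, here is a minimal plain-Java sketch of tf*idf ranking. It does not use Lucene and is not how Lucene scores internally. It assumes the text has already been extracted from your XML into strings, and it uses raw term counts for tf and ln(N / df) for idf, which is only one of several common variants. The class and method names (TfIdfSketch, tokenize, documentFrequencies, score) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of Lucene.
final class TfIdfSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    // A stand-in for a real analyzer (no stemming, no stop words).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // For each term, count how many documents contain it at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Score of one document for a query: the sum over query terms of
    // tf(term, doc) * ln(n / df(term)), where n is the number of documents.
    static double score(List<String> query, List<String> doc,
                        Map<String, Integer> df, int n) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        double total = 0.0;
        for (String term : query) {
            int freq = tf.getOrDefault(term, 0);
            int docFreq = df.getOrDefault(term, 0);
            if (freq == 0 || docFreq == 0) {
                continue; // term absent from this document or the whole corpus
            }
            total += freq * Math.log((double) n / docFreq);
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy corpus; in practice these strings would come from your XML files.
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("apple banana apple"));
        docs.add(tokenize("banana cherry"));
        docs.add(tokenize("cherry date"));

        Map<String, Integer> df = documentFrequencies(docs);
        List<String> query = tokenize("apple banana");

        for (int i = 0; i < docs.size(); i++) {
            System.out.println("doc " + i + ": " + score(query, docs.get(i), df, docs.size()));
        }
    }
}
```

To rank, compute the score for every document and sort in descending order. Note that a term appearing in every document gets an idf of zero under this formula, so it contributes nothing to the ranking; smoothed idf variants avoid that if it matters for your data.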