Does NLTK have TF-IDF implemented?

alvas · Apr 10, 2015

There are TF-IDF implementations in scikit-learn and gensim.

There are also simple standalone implementations, such as "Simple implementation of N-Gram, tf-idf and Cosine similarity in Python".

To avoid reinventing the wheel,

  • Is there really no TF-IDF in NLTK?
  • Are there sub-packages that we can use to implement TF-IDF in NLTK? If there are, how?

This blog post says NLTK doesn't have it. Is that true? http://www.bogotobogo.com/python/NLTK/tf_idf_with_scikit-learn_NLTK.php

Answer

yvespeirsman · Apr 10, 2015

The NLTK TextCollection class (in nltk.text) has a tf_idf method for computing the tf-idf of a term in a given text. The documentation is here, and the source is here. However, it says "may be slow to load", so using scikit-learn may be preferable.
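
As a rough illustration, here is a minimal sketch of calling that method. The three toy documents are made up for the example, and it assumes each text can be passed as a plain list of tokens. As far as I can tell from the source, tf is the term count divided by the text length and idf is the natural log of (number of texts / number of texts containing the term), so treat the comments as my reading of it rather than a specification.

```python
import math

from nltk.text import TextCollection

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "loudly"],
]

# Build the collection from a list of token lists.
collection = TextCollection(docs)

# tf-idf of "cat" in the first document.
# Assumed formula: tf = count / len(text), idf = log(N / texts containing term).
score = collection.tf_idf("cat", docs[0])
print(score)

# "the" occurs in every document, so its idf (and tf-idf) should be zero.
common = collection.tf_idf("the", docs[0])
print(common)
```

Note that this scores one term in one text per call. For a full document-term matrix over a large corpus, scikit-learn's vectorizer is likely the more practical choice, as the answer suggests.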