Does gensim.corpora.Dictionary have term frequency saved?

alvas · Oct 11, 2017 · Viewed 8.3k times

From gensim.corpora.Dictionary, it's possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):

    from nltk.corpus import brown  # needs the Brown corpus: nltk.download('brown')
    from gensim.corpora import Dictionary

    # Treat each sentence (a list of tokens) as one document.
    documents = brown.sents()
    brown_dict = Dictionary(documents)

    # The token with id 100 in the dictionary: 'these'
    print(f'The word "{brown_dict[100]}" appears in {brown_dict.dfs[100]} documents')

[out]:

The word "these" appears in 1213 documents

There is also the filter_n_most_frequent(remove_n) method, which removes the n most frequent tokens:

filter_n_most_frequent(remove_n)

Filter out the ‘remove_n’ most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

Does filter_n_most_frequent remove the n most frequent tokens based on document frequency or on term frequency?

If it's the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?

Answer

ubadub · Oct 17, 2017

No, gensim.corpora.Dictionary does not save term frequency. The source code shows that the class only stores the following member variables:

    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix

This means that wherever the class refers to frequency, it means document frequency, never term frequency, because term frequency is never stored globally. This applies to filter_n_most_frequent(remove_n), which ranks tokens by their values in self.dfs, as well as to every other method.
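
If you do need corpus-wide term frequencies, one option is to sum the per-document counts yourself. The following is only a minimal sketch, not part of gensim: the helper names term_frequencies and document_frequencies are made up for illustration, and bow_corpus is a hand-written stand-in for what [dictionary.doc2bow(doc) for doc in documents] would return (a list of (token_id, count) pairs per document). It also shows that the two rankings can disagree.

```python
from collections import Counter


def term_frequencies(bow_corpus):
    """Sum per-document counts: token id -> total occurrences in the corpus."""
    totals = Counter()
    for bow in bow_corpus:
        for token_id, count in bow:
            totals[token_id] += count
    return totals


def document_frequencies(bow_corpus):
    """Count documents per token id (the quantity Dictionary.dfs holds)."""
    totals = Counter()
    for bow in bow_corpus:
        # Each token id appears at most once per bag-of-words document.
        totals.update(token_id for token_id, _ in bow)
    return totals


# Hypothetical toy data standing in for doc2bow output.
bow_corpus = [
    [(0, 5), (1, 1)],  # doc 0: token 0 five times, token 1 once
    [(1, 1), (2, 1)],  # doc 1: tokens 1 and 2 once each
    [(1, 1)],          # doc 2: token 1 once
]

tf = term_frequencies(bow_corpus)
df = document_frequencies(bow_corpus)

print(tf.most_common(1))  # [(0, 5)]: token 0 leads by term frequency
print(df.most_common(1))  # [(1, 3)]: token 1 leads by document frequency
```

With the dictionary from the question, bow_corpus would be built as [brown_dict.doc2bow(sent) for sent in documents], and brown_dict[token_id] maps an id back to its word. Later gensim releases also appear to document a cfs (collection frequencies) attribute on Dictionary, so it is worth checking your installed version before computing this by hand.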