Does gensim.corpora.Dictionary have term frequency saved?
With gensim.corpora.Dictionary, it's possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):
from nltk.corpus import brown
from gensim.corpora import Dictionary
documents = brown.sents()
brown_dict = Dictionary(documents)
# The word with id 100 in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')
[out]:
The word "these" appears in 1213 documents
There is also the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:
filter_n_most_frequent(remove_n)
Filter out the ‘remove_n’ most frequent tokens that appear in the documents. After the pruning, shrink resulting gaps in word ids.
Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
Does filter_n_most_frequent remove the n most frequent tokens based on document frequency or on term frequency?
If it's the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?
No, gensim.corpora.Dictionary does not save term frequency. You can see the source code here. The class only stores the following member variables:
self.token2id = {} # token -> tokenId
self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared
self.num_docs = 0 # number of documents processed
self.num_pos = 0 # total number of corpus positions
self.num_nnz = 0 # total number of non-zeroes in the BOW matrix
This means that everything in the class defines frequency as document frequency, never term frequency, since the latter is never stored globally. That applies to filter_n_most_frequent(remove_n) as well as to every other method.
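If you do need term frequencies, one option is to count them yourself alongside the dictionary. The following is a minimal sketch using only the standard library, not anything gensim provides: the helper name count_frequencies and the toy documents are made up for illustration. It shows how the two counts differ for the same tokenized input.

```python
from collections import Counter


def count_frequencies(documents):
    """Return (term_freq, doc_freq) Counters for an iterable of tokenized documents.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for doc in documents:
        term_freq.update(doc)       # every occurrence of a token counts
        doc_freq.update(set(doc))   # each document counts at most once per token
    return term_freq, doc_freq


# Toy corpus (assumed data): each document is a list of tokens.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "and", "a", "dog"],
]

term_freq, doc_freq = count_frequencies(docs)
print(term_freq["the"], doc_freq["the"])  # 3 2
print(term_freq["a"], doc_freq["a"])      # 2 1
```

The same function should work on brown.sents() from the question, since that is also an iterable of token lists. A ranking by term_freq can differ from a ranking by doc_freq, which is why it matters which one a filtering method uses.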