First let's extract the TF-IDF scores per term per document:
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
Printing it out:
for doc in corpus_tfidf:
print doc
[out]:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
If we want to find the "saliency" or "importance" of the words within this corpus, can we simple do the sum of the tf-idf scores across all documents and divide it by the number of documents? I.e.
>>> tfidf_saliency = Counter()
>>> for doc in corpus_tfidf:
... for word, score in doc:
... tfidf_saliency[word] += score / len(corpus_tfidf)
...
>>> tfidf_saliency
Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})
Looking at the output, could we assume that the most "prominent" word in the corpus is:
>>> dictionary[7]
u'system'
>>> dictionary[8]
u'survey'
>>> dictionary[26]
u'graph'
If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?
The interpretation of TF-IDF in corpus is the highest TF-IDF in corpus for a given term.
Find the Top Words in corpus_tfidf.
topWords = {}
for doc in corpus_tfidf:
for iWord, tf_idf in doc:
if iWord not in topWords:
topWords[iWord] = 0
if tf_idf > topWords[iWord]:
topWords[iWord] = tf_idf
for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
if i == 6: break
Output comparison cart:
NOTE: Could'n use gensim
, to create a matching dictionary
with corpus_tfidf
.
Can only display Word Indizies.
Question tfidf_saliency topWords(corpus_tfidf) Other TF-IDF implentation
---------------------------------------------------------------------------
1: Word(7) 0.121 1: Word(13) 0.640 1: paths 0.376019
2: Word(8) 0.111 2: Word(27) 0.632 2: intersection 0.376019
3: Word(26) 0.108 3: Word(28) 0.632 3: survey 0.366204
4: Word(29) 0.100 4: Word(8) 0.628 4: minors 0.366204
5: Word(9) 0.090 5: Word(29) 0.628 5: binary 0.300815
6: Word(14) 0.087 6: Word(11) 0.544 6: generation 0.300815
The calculation of TF-IDF takes always the corpus in account.
Tested with Python:3.4.2