I have a document in my Elasticsearch index with the following id: AVosj8FEIaetdb3CXpP-
I'm trying to get the tf-idf of every word in one of its fields. I did the following:
GET /cnn/cnn_article/AVosj8FEIaetdb3CXpP-/_termvectors
{
"fields" : ["author_wording"],
"term_statistics" : true,
"field_statistics" : true
}
The response I got is:
{
"_index": "dailystormer",
"_type": "dailystormer_article",
"_id": "AVosj8FEIaetdb3CXpP-",
"_version": 3,
"found": true,
"took": 1,
"term_vectors": {
"author_wording": {
"field_statistics": {
"sum_doc_freq": 3408583,
"doc_count": 16111,
"sum_ttf": 7851321
},
"terms": {
"318": {
"doc_freq": 4,
"ttf": 4,
"term_freq": 1,
"tokens": [
{
"position": 121,
"start_offset": 688,
"end_offset": 691
}
]
},
"742": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 122,
"start_offset": 692,
"end_offset": 695
}
]
},
"9971": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 123,
"start_offset": 696,
"end_offset": 700
}
]
},
"a": {
"doc_freq": 14921,
"ttf": 163268,
"term_freq": 11,
"tokens": [
{
"position": 1,
"start_offset": 13,
"end_offset": 14
},
...
"you’re": {
"doc_freq": 1112,
"ttf": 1647,
"term_freq": 1,
"tokens": [
{
"position": 80,
"start_offset": 471,
"end_offset": 477
}
]
}
}
}
}
}
It returns some interesting fields, such as the term frequency (tf), but not the tf-idf. Should I compute it myself? Is that a good idea? How can I do so?
Yes, the response gives you what you need:
- tf, the term frequency: how often the term occurs in this document's field (term_freq in the response). There is also ttf, the total term frequency, i.e. the sum of the term's tf over all documents.
- df, the document frequency: the number of documents that contain the term (doc_freq in the response).
You need to decide whether you want tf-idf computed over only this field or over all fields; the statistics in the response are reported per field. To compute tf-idf you need to do the following:
tf-idf = tf * idf
where
idf = log (N / df)
and N = doc_count
from the field_statistics in your response. The term vectors API does not return a tf-idf value itself, so you need to compute it yourself.
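As a rough illustration of the formula above, here is a minimal Python sketch that computes tf-idf for every term of one field, given the parsed JSON of a _termvectors response. The function name tf_idf_from_termvectors is hypothetical, the natural logarithm is an assumption (the formula does not fix a base), and the sample dictionary only copies a few values from the response in the question.

```python
import math


def tf_idf_from_termvectors(response, field):
    """Return {term: tf * log(N / df)} for one field of a _termvectors response.

    Hypothetical helper: it assumes the request was made with
    term_statistics and field_statistics set to true, so that
    doc_freq and doc_count are present.
    """
    field_data = response["term_vectors"][field]
    n_docs = field_data["field_statistics"]["doc_count"]  # N in the formula

    scores = {}
    for term, stats in field_data["terms"].items():
        tf = stats["term_freq"]  # occurrences in this document
        df = stats["doc_freq"]   # documents containing the term
        # Natural log is an assumption; any base only rescales the scores.
        scores[term] = tf * math.log(n_docs / df)
    return scores


# Trimmed sample built from values shown in the question's response.
sample_response = {
    "term_vectors": {
        "author_wording": {
            "field_statistics": {"doc_count": 16111},
            "terms": {
                "742": {"doc_freq": 1, "term_freq": 1},
                "a": {"doc_freq": 14921, "term_freq": 11},
            },
        }
    }
}

scores = tf_idf_from_termvectors(sample_response, "author_wording")
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(score, 4))
```

With this weighting, the rare term "742" scores higher than the very common term "a", even though "a" occurs 11 times in the document, because "a" appears in almost every document and its idf is close to zero.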