Elasticsearch: getting the tf-idf of every term in a given document

mel · Feb 14, 2017 · Viewed 8.9k times

I have a document in my Elasticsearch index with the following id: AVosj8FEIaetdb3CXpP-. I'm trying to get the tf-idf of every word in its fields. I did the following:

GET /cnn/cnn_article/AVosj8FEIaetdb3CXpP-/_termvectors
{
  "fields" : ["author_wording"],
  "term_statistics" : true,
  "field_statistics" : true
}

The response I got is:

{
  "_index": "dailystormer",
  "_type": "dailystormer_article",
  "_id": "AVosj8FEIaetdb3CXpP-",
  "_version": 3,
  "found": true,
  "took": 1,
  "term_vectors": {
    "author_wording": {
      "field_statistics": {
        "sum_doc_freq": 3408583,
        "doc_count": 16111,
        "sum_ttf": 7851321
      },
      "terms": {
        "318": {
          "doc_freq": 4,
          "ttf": 4,
          "term_freq": 1,
          "tokens": [
            {
              "position": 121,
              "start_offset": 688,
              "end_offset": 691
            }
          ]
        },
        "742": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 122,
              "start_offset": 692,
              "end_offset": 695
            }
          ]
        },
        "9971": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 123,
              "start_offset": 696,
              "end_offset": 700
            }
          ]
        },
        "a": {
          "doc_freq": 14921,
          "ttf": 163268,
          "term_freq": 11,
          "tokens": [
            {
              "position": 1,
              "start_offset": 13,
              "end_offset": 14
            },
            ...
            "you’re": {
          "doc_freq": 1112,
          "ttf": 1647,
          "term_freq": 1,
          "tokens": [
            {
              "position": 80,
              "start_offset": 471,
              "end_offset": 477
            }
          ]
        }
      }
    }
  }
}

It returns some interesting fields, such as the term frequency (tf), but not the tf-idf. Should I compute it myself? Is that a good idea? How can I do so?

Answer

Mysterion · Feb 14, 2017

Yes, the response gives you tf, the term frequency (term_freq, how often the term occurs in this field of this document), and df, the document frequency (doc_freq, how many documents contain the term). It also gives ttf, the total term frequency, i.e. the sum of the term's tf across all documents for this field. You need to decide whether you want to calculate tf-idf for this field only or across all fields. To compute tf-idf, do the following:

tf-idf = tf * idf

where

idf = log (N / df)

and N = doc_count from the field_statistics in your response (the number of documents that contain this field). Elasticsearch does not return tf-idf in this response, so you need to compute it yourself.
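As a minimal sketch of that calculation, the snippet below applies tf * log(N / df) to a parsed _termvectors response. The function name tf_idf_from_termvectors and the trimmed sample dictionary are hypothetical, made up for illustration; the numbers in the sample are copied from the response in the question. The natural logarithm is an assumption, since the formula above does not fix a base.

```python
import math


def tf_idf_from_termvectors(response, field):
    """Return {term: tf-idf} for one field of a parsed _termvectors response.

    Hypothetical helper: uses tf * log(N / df) with the natural log,
    which is an assumption (the base only rescales the scores).
    """
    field_vectors = response["term_vectors"][field]
    # N: number of documents that contain this field
    n_docs = field_vectors["field_statistics"]["doc_count"]

    scores = {}
    for term, stats in field_vectors["terms"].items():
        tf = stats["term_freq"]  # occurrences in this document's field
        df = stats["doc_freq"]   # documents containing the term
        scores[term] = tf * math.log(n_docs / df)
    return scores


# Trimmed sample built from two terms of the response in the question
sample = {
    "term_vectors": {
        "author_wording": {
            "field_statistics": {
                "sum_doc_freq": 3408583,
                "doc_count": 16111,
                "sum_ttf": 7851321,
            },
            "terms": {
                "318": {"doc_freq": 4, "ttf": 4, "term_freq": 1},
                "a": {"doc_freq": 14921, "ttf": 163268, "term_freq": 11},
            },
        }
    }
}

scores = tf_idf_from_termvectors(sample, "author_wording")
# Print terms from highest to lowest tf-idf
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}\t{score:.3f}")
```

The rare term "318" ends up with a higher score than the very common "a", even though "a" occurs 11 times in the document. Note that the term and field statistics in a _termvectors response are, by default, taken from the shard that holds the document, so the resulting values are approximate on a multi-shard index.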