Best way to index arbitrary attribute-value pairs in Elasticsearch

Serhat Ozgel · Feb 18, 2015 · Viewed 7k times

I am trying to index documents in Elasticsearch that consist of attribute-value pairs. Example documents:

{
    id: 1,
    name: "metamorphosis",
    author: "franz kafka"
}

{
    id: 2,
    name: "techcorp laptop model x",
    type: "computer",
    memorygb: 4
}

{
    id: 3,
    name: "ss2014 formal shoe x",
    color: "black",
    size: 42,
    price: 124.99
}

Then, I need queries like:

1. "author" EQUALS "franz kafka"
2. "type" EQUALS "computer" AND "memorygb" GREATER THAN 4
3. "color" EQUALS "black" OR ("size" EQUALS 42 AND "price" LESS THAN 200.00)

What is the best way to store these documents so that they can be queried efficiently? Should I store them exactly as shown in the examples? Or should I store them like:

{
    fields: [
        { "type": "computer" },
        { "memorygb": 4 }
    ]
}

or like:

{
    fields: [
        { "key": "type", "value": "computer" },
        { "key": "memorygb", "value": 4 }
    ]
}

And how should I map my indices so that I can run both equality and range queries?

Answer

smnh · Oct 20, 2017

If someone is still looking for an answer, I wrote a post about how to index arbitrary data into Elasticsearch and then search by specific fields and values, all without blowing up your index mapping.

The post: http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch/

In short, you need to create a special index, described in the post. Then flatten your data using the flattenData function (https://gist.github.com/smnh/30f96028511e1440b7b02ea559858af4). The flattened data can then be safely indexed into that Elasticsearch index.
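The exact index definition is in the post. As a rough sketch only, a mapping along the following lines would be consistent with the field names and query types used in this answer (flatData as a nested field, term on key, match on value_string, range on value_long). The typeless mapping layout (Elasticsearch 7 and later), the value_double field, and the dynamic: "strict" setting are assumptions here, not taken from the post.

```javascript
// Hypothetical index body. Field names are inferred from the queries in this
// answer; everything else is an assumption.
const indexBody = {
    mappings: {
        // Assumption: reject unmapped fields so the mapping cannot grow.
        dynamic: "strict",
        properties: {
            flatData: {
                type: "nested", // each key/value entry is matched on its own
                properties: {
                    key:          { type: "keyword" }, // exact match via term
                    type:         { type: "keyword" },
                    key_type:     { type: "keyword" },
                    value_string: { type: "text" },    // analyzed, queried via match
                    value_long:   { type: "long" },    // supports range queries
                    value_double: { type: "double" }   // assumed field for decimals such as price
                }
            }
        }
    }
};

console.log(JSON.stringify(indexBody, null, 4));
```

Because every attribute ends up in the same small set of fields, the mapping stays the same size no matter how many distinct attribute names the documents contain.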

For example:

flattenData({
    id: 1,
    name: "metamorphosis",
    author: "franz kafka"
});

Will produce:

[
    {
        "key": "id",
        "type": "long",
        "key_type": "id.long",
        "value_long": 1
    },
    {
        "key": "name",
        "type": "string",
        "key_type": "name.string",
        "value_string": "metamorphosis"
    },
    {
        "key": "author",
        "type": "string",
        "key_type": "author.string",
        "value_string": "franz kafka"
    }
]

And

flattenData({
    id: 2,
    name: "techcorp laptop model x",
    type: "computer",
    memorygb: 4
});

Will produce:

[
    {
        "key": "id",
        "type": "long",
        "key_type": "id.long",
        "value_long": 2
    },
    {
        "key": "name",
        "type": "string",
        "key_type": "name.string",
        "value_string": "techcorp laptop model x"
    },
    {
        "key": "type",
        "type": "string",
        "key_type": "type.string",
        "value_string": "computer"
    },
    {
        "key": "memorygb",
        "type": "long",
        "key_type": "memorygb.long",
        "value_long": 4
    }
]
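For readers who want to see the idea without opening the gist, here is a simplified sketch of such a flattening step. It is not the gist's implementation: the function name flattenDataSketch is hypothetical, the "long" and "string" type names follow the output above, and the "double" and "boolean" names as well as the dotted paths for nested objects are assumptions.

```javascript
// Simplified sketch, NOT the gist's implementation.
// "long" and "string" follow the output shown above; "double" and "boolean" are guesses.
function detectType(value) {
    if (typeof value === "string") return "string";
    if (typeof value === "boolean") return "boolean";
    if (typeof value === "number") {
        return Number.isInteger(value) ? "long" : "double";
    }
    return null; // unsupported in this sketch (null, arrays, functions, ...)
}

function flattenDataSketch(data, prefix = "") {
    const result = [];
    for (const [name, value] of Object.entries(data)) {
        const key = prefix ? `${prefix}.${name}` : name;
        if (value !== null && typeof value === "object" && !Array.isArray(value)) {
            // Assumption: nested objects become dotted key paths.
            result.push(...flattenDataSketch(value, key));
            continue;
        }
        const type = detectType(value);
        if (type === null) continue; // skip values this sketch cannot classify
        result.push({
            key,
            type,
            key_type: `${key}.${type}`,
            [`value_${type}`]: value
        });
    }
    return result;
}

console.log(JSON.stringify(flattenDataSketch({ id: 3, color: "black", price: 124.99 }), null, 4));
```

Storing each value in a field named after its type is what keeps a string and a number under the same attribute name from colliding in the mapping.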

Then you can build Elasticsearch queries to query your data. Every query should specify both the key and the type of the value. If you are unsure which keys or types the index contains, you can run an aggregation to find out; this is also discussed in the post.
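The aggregation itself is covered in the post. As a hedged sketch, a request body like the one below would list the key and type combinations present in the index, assuming flatData.key_type is mapped as an exact-value field. The aggregation names flat_data and key_types and the bucket limit of 100 are arbitrary choices for illustration.

```javascript
// Hypothetical request body: count flatData entries per "key.type" combination.
const keyTypesAggregation = {
    size: 0, // no hits needed, only aggregation buckets
    aggs: {
        flat_data: {
            nested: { path: "flatData" },
            aggs: {
                key_types: {
                    // Assumes flatData.key_type is an exact-value (keyword-like) field.
                    terms: { field: "flatData.key_type", size: 100 }
                }
            }
        }
    }
};

console.log(JSON.stringify(keyTypesAggregation, null, 4));
```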

For example, to find a document where author == "franz kafka" you need to execute the following query:

{
    "query": {
        "nested": {
            "path": "flatData",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"flatData.key": "author"}},
                        {"match": {"flatData.value_string": "franz kafka"}}
                    ]
                }
            }
        }
    }
}
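Since every condition has the same shape (a term on the key plus a condition on one value field), the query body can be generated by a small helper. The sketch below reproduces the query above; the helper name nestedClause is hypothetical.

```javascript
// Hypothetical helper: one nested clause that pairs a key with a condition on
// one of the value_* fields of the same flatData entry.
function nestedClause(key, valueCondition) {
    return {
        nested: {
            path: "flatData",
            query: {
                bool: {
                    must: [
                        { term: { "flatData.key": key } },
                        valueCondition
                    ]
                }
            }
        }
    };
}

const authorQuery = {
    query: nestedClause("author", { match: { "flatData.value_string": "franz kafka" } })
};

console.log(JSON.stringify(authorQuery, null, 4));
```

Note that match analyzes its input and, with the default operator, matches entries containing either "franz" or "kafka". If strict equality is required, a match_phrase query or an exact-value sub-field may be a better fit, depending on how value_string is mapped.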

To find documents where type == "computer" and memorygb > 4 you need to execute the following query:

{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "flatData",
                        "query": {
                            "bool": {
                                "must": [
                                    {"term": {"flatData.key": "type"}},
                                    {"match": {"flatData.value_string": "computer"}}
                                ]
                            }
                        }
                    }
                },
                {
                    "nested": {
                        "path": "flatData",
                        "query": {
                            "bool": {
                                "must": [
                                    {"term": {"flatData.key": "memorygb"}},
                                    {"range": {"flatData.value_long": {"gt": 4}}}
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

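Here, because we want the same document to match both conditions, we use an outer bool query with a must clause wrapping two nested queries. Each nested query is evaluated against one flatData entry at a time, so a single nested query could not require both the type entry and the memorygb entry at once.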
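The third query from the question, color == "black" OR (size == 42 AND price < 200.00), follows the same pattern with should in place of must at the top level. The sketch below is one way to assemble it; the helper name keyCondition is hypothetical, and it assumes that decimal values such as price are stored in a value_double field, which this answer does not show.

```javascript
// Hypothetical helper: nested clause matching one flatData entry by key and value.
function keyCondition(key, valueCondition) {
    return {
        nested: {
            path: "flatData",
            query: {
                bool: {
                    must: [{ term: { "flatData.key": key } }, valueCondition]
                }
            }
        }
    };
}

const colorIsBlack = keyCondition("color", { match: { "flatData.value_string": "black" } });
const sizeIs42 = keyCondition("size", { term: { "flatData.value_long": 42 } });
// Assumption: decimals such as price live in a value_double field.
const priceBelow200 = keyCondition("price", { range: { "flatData.value_double": { lt: 200.0 } } });

const thirdQuery = {
    query: {
        bool: {
            // OR: at least one of the two branches has to match.
            should: [
                colorIsBlack,
                { bool: { must: [sizeIs42, priceBelow200] } } // AND branch
            ],
            minimum_should_match: 1
        }
    }
};

console.log(JSON.stringify(thirdQuery, null, 4));
```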