Mongodb performance difference between Hash and Ascending indices (Any reason not to use hash in a not ordered field?)

J-Rou picture J-Rou · Feb 4, 2015 · Viewed 9.7k times · Source

In mongodb there are multiple types of index. For this question I'm interested in the ascending (or descending) index which can be used for sorting and the hash index which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source) ensuring "a more even distribution of data"(source)

I know that you can't create an index like: db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) because you get an error

{
    "createdCollectionAutomatically" : true,
    "numIndexesBefore" : 1,
    "errmsg" : "exception: Currently only single field hashed index supported.",
    "code" : 16763,
    "ok" : 0
}

My question:

Between the indices:

  1. db.test.ensureIndex( { "key": 1 } )

  2. db.test.ensureIndex( { "key": "hashed" } )

For the query db.products.find( { key: "a" } ), which one is more performant?, is the hashed key O(1)


How I got to the question:

Before I knew that you could not have multi-key indices with hashed, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered if the hashed index was more performant than the ascending one (hash usually is O(1)). I left the key as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of is the hashed index faster for searches by a key stayed in my mind.

The situation in which I made the index was:

I had a collection that contained a sorted list of documents classified by keys.

e.g. {key: a, sortOrder: 1, ...}, {key: a, sortOrder: 2, ...}, {key: a, sortOrder: 3, ...}, {key: b, sortOrder: 1, ...}, {key: b, sortOrder: 2, ...}, ...

Since I used the key to classify and the sortOrder for pagination, I always queried filtering with one value for the key and using the sortOrder for the order of the documents.

That means that I had two possible queries:

  • For the first page db.products.find( { key: "a" } ).limit(10).sort({"sortOrder", 1})
  • And for the other pages db.products.find( { key: "a" , sortOrder: { $gt: 10 } } ).limit(10).sort({"sortOrder", 1})

In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.

Answer

Wan Bachtiar picture Wan Bachtiar · May 4, 2017

For the query db.products.find( { key: "a" } ), which one is more performant?

Given that field key is indexed in both cases, the complexity index search itself would be very similar. As the value of a would be hashed, and stored in the index tree.

If we're looking for the overal performance cost, the hashed version would incur an extra (negligible) cost of hashing the value of a before matching the value in the index tree. See also mongo/db/index/hash_access_method.h

Also, hashed index would not be able to utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, like those with low cardinality (eg, country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of second field.

Any reason not to use hash in a non-ordered field?

Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.

Hashed index is commonly used for a specific case of sharding. When a shard key value is a monotonically increasing/decreasing value, the distribution of data would likely to go into one shard only. This is where a hashed shard key would be able to improve the distribution of writes. It's a minor trade-off to greatly improve your sharding cluster. See also Hashed vs Ranged Sharding.

is it worth to insert a random hash or value with the document, and use that for sharding instead of a hash generated on the _id ?

Whether it's worth it, depends on the use case. A custom hash value would mean that any query for the hash value would have to go through a custom hashing code i.e. application.

The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.