Elasticsearch: multiple indexes vs. one index with types for different data sets?

burzum · Jan 22, 2013 · Viewed 45.9k times

I have an application built on the MVC pattern, and I would now like to index several of its models. Each model has a different data structure.

  • Is it better to use multiple indexes, one for each model, or a single index with one type per model? I think either approach would also require a different search query. I have only just started with this.

  • Are there performance differences between the two approaches, depending on whether the data set is small or huge?

I would test the second question myself if somebody could recommend some good sample data for that purpose.

Answer

Jonathan Moo · Jan 28, 2013

The two approaches have different implications.

Assuming you are using Elasticsearch's default settings, having one index per model significantly increases your number of shards: one index uses 5 shards, so 5 data models use 25 shards, while 5 object types in a single index still use only 5 shards.
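The shard arithmetic above can be checked with a small sketch. The function name `total_shards` and its parameters are hypothetical, and the default of 5 shards per index is simply the figure quoted in this answer; it may differ in your Elasticsearch version. Replica shards are not counted in the answer's numbers, so the `replicas` parameter defaults to 0 here.

```python
def total_shards(num_indices, shards_per_index=5, replicas=0):
    """Count the shards created for a set of indices.

    shards_per_index=5 is the default quoted in the answer (an assumption
    about your cluster); replicas=0 counts primary shards only.
    """
    return num_indices * shards_per_index * (1 + replicas)


# One index per model, 5 models
print(total_shards(5))  # 25
# One index holding 5 types
print(total_shards(1))  # 5
```

If each primary shard also has one replica, both totals double, which makes the gap between the two layouts larger in absolute terms.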

Implications for having each data model as index:

  • Searching within a single index is efficient and fast, because each shard holds less data when the data is spread across separate indices.
  • Searching a combination of data models across 2 or more indices generates overhead, because the query has to be sent to more shards across indices and the results have to be merged before being sent back to the user.
  • Not recommended if your data set is small, since every additional shard costs extra storage and the performance gain is marginal.
  • Recommended if your data set is big and your queries take a long time to process, since dedicated shards store each model's data and Elasticsearch has less to work through per query.
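As a rough illustration of the cross-index overhead described above, the sketch below builds the request path for one search that spans several indices (Elasticsearch accepts a comma-separated list of index names in the URL) and counts how many primary shards that search fans out to. The helper names and the index names `users`, `posts`, and `comments` are hypothetical, and 5 shards per index is the assumption taken from this answer.

```python
def multi_index_search_path(indices):
    """Build the path for one search request spanning several indices."""
    if not indices:
        raise ValueError("at least one index name is required")
    return "/" + ",".join(indices) + "/_search"


def shards_queried(indices, shards_per_index=5):
    """Primary shards one search fans out to (5 per index is assumed)."""
    return len(indices) * shards_per_index


models = ["users", "posts", "comments"]  # hypothetical index names
print(multi_index_search_path(models))   # /users,posts,comments/_search
print(shards_queried(models))            # 15
print(shards_queried(["users"]))         # 5
```

A search limited to one model's index touches 5 shards, while the combined search touches 15, which is where the extra coordination cost comes from.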

Implications for having each data model as an object type within an index:

  • More data is stored within the 5 shards of the index, which means less overhead when you query across different data models, but each shard will be significantly bigger.
  • With more data in each shard, Elasticsearch takes longer to search, since there are more documents to filter.
  • Not recommended if you know you will be working through 1 terabyte of data and you are not distributing it across different indices or multiple shards in your Elasticsearch index settings.
  • Recommended for small data sets, because you will not waste storage space for a marginal performance gain; each shard takes up space on your hardware.
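For the single-index layout, Elasticsearch releases from the time of this answer let a search be restricted to one or more types by putting them in the URL after the index name. The sketch below only builds that path; the helper name, the index name `app`, and the type names are hypothetical. Later Elasticsearch releases deprecated and then removed mapping types, so treat this as an illustration of the older API rather than something to rely on today.

```python
def typed_search_path(index, types=()):
    """Build a search path restricted to some types within one index.

    With no types given, the search covers every type in the index.
    """
    if types:
        return "/" + index + "/" + ",".join(types) + "/_search"
    return "/" + index + "/_search"


print(typed_search_path("app", ["user"]))          # /app/user/_search
print(typed_search_path("app", ["user", "post"]))  # /app/user,post/_search
print(typed_search_path("app"))                    # /app/_search
```

Whichever types are named, the request still goes to the same shards of the one index, so the shard count stays constant while the number of documents per shard grows.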

What counts as too much data versus a small amount? Typically it depends on the processor speed and RAM of your hardware, the amount of data you store in each field of your Elasticsearch mapping, and your query requirements; using many facets in your queries will slow down your response time significantly. There is no straightforward answer, and you will have to benchmark according to your needs.
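One way to start benchmarking is a small timing harness that runs the same query several times and reports the median latency. This is a minimal sketch: `median_latency` and `fake_query` are hypothetical names, and `fake_query` is only a stand-in workload that you would replace with a real search call against each index layout.

```python
import statistics
import time


def median_latency(run_query, repeats=20):
    """Run a query callable several times; return median seconds per call."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_query():
    # Placeholder workload; swap in a real search request to your cluster.
    sum(range(10_000))


print(f"median latency: {median_latency(fake_query):.6f} s")
```

The median is used instead of the mean so that a single slow run (for example, a cold cache on the first request) does not dominate the result.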