Scaling of ElasticSearch

Alfe picture Alfe · Apr 30, 2013 · Viewed 10.9k times · Source

I'm searching for information on how ElasticSearch would scale with the amount of data in its indexes and am surprised how little I can find on that topic. Maybe some experience from the crowd here can help me.

We are currently using CloudSearch to index ≈ 7 million documents; in CloudSearch this results in 2 instances of type m2.xlarge. We are considering switching to ElasticSearch instead to reduce the cost. But all I find on the scaling of ElasticSearch is that it does scale well, can be distributed over several instances etc.

But what kind of machine (memory, disc) would I need for this kind of data?

How would that change if I increased the amount of data by the factor of 12 (≈ 80 million documents)?

Answer

Felipe Hummel picture Felipe Hummel · Apr 30, 2013

As Javanna said, it depends. Mostly on: (1) rate of indexing; (2) size of documents; (3) rate and latency requirements for searches; and (4) type of searches.

Considering this, the best we can help is giving examples. On our site (news monitoring) we:

  1. Index more than 100 docs per minute. We have, currently, near 50 million documents. I've also heard of ES indexes with hundreds of millions of documents.
  2. Documents are news articles with some metadata, not short but not that large.
  3. Our search latency varies between ~50ms (for normal and rare terms) up to 800ms for common terms (stopwords, we index them). This variation is largely due to our custom scoring (thanks to Lucene/ES support for customizing it) and to the fact the dataset (inverted lists) do not fit entirely in memory (OS cache). So when it hits a cached inverted list, it's faster.
  4. We do OR queries with a lot of terms which are one of the hardest. Also we do faceting on two single-valued fields. And have some experiments with date facet (to show rate of publication through time).

We do all this with 4 EC2's m1.large instances. And now we're planning moving to ES, just released, 0.9 version to get all the goodies and performance improvements of Lucene 4.0.

Now leaving examples aside. ElasticSearch is pretty scalable. It is very simple to create an index with N shards and M replicas, and then create X machines with ES. It will distribute all shards and replicas accordingly. You can change the number of replicas anytime you want (for each index).

One downside is that you can't change the number of shards after the index creation. But you can still "overshard" it beforehand to leave room for scaling when needed. Or create a new index with the right number of shards and reindex everything (we do this).

Finally, ElasticSearch (and also Solr) uses, under the hood, the Lucene Search library, which is very mature and well known library.