I'm searching for information on how ElasticSearch would scale with the amount of data in its indexes and am surprised at how little I can find on that topic. Maybe some experience from the crowd here can help me.
We are currently using CloudSearch to index ≈ 7 million documents; in CloudSearch this results in 2 instances of type m2.xlarge. We are considering switching to ElasticSearch instead to reduce the cost. But all I can find on the scaling of ElasticSearch is that it scales well and can be distributed over several instances, etc.
But what kind of machine (memory, disc) would I need for this kind of data?
How would that change if I increased the amount of data by the factor of 12 (≈ 80 million documents)?
As Javanna said, it depends, mostly on: (1) the rate of indexing; (2) the size of the documents; (3) the rate and latency requirements for searches; and (4) the type of searches.
Considering this, the best we can do is give examples. On our site (news monitoring) we run OR queries with a lot of terms, which are among the hardest, and we do faceting on two single-valued fields. We also have some experiments with the date facet (to show the rate of publication over time). We do all this with 4 EC2 m1.large instances, and we're now planning to move to the just-released ES 0.90 to get all the goodies and performance improvements of Lucene 4.0.
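To make that workload concrete, here is a rough sketch of the kind of request we mean, sent straight to the REST API with Python's requests library. The index name and the field names (text, source, published_at) are made up for illustration, and the facets syntax shown is the 0.90-era API (later versions replaced facets with aggregations):

```python
import json
import requests

ES_URL = "http://localhost:9200"   # assumed local ES endpoint
INDEX = "articles"                 # hypothetical index name

# An OR-style query: a bool query where any of the "should" terms may match,
# combined with a terms facet on a single-valued field and a date_histogram
# facet to show the rate of publication over time.
body = {
    "query": {
        "bool": {
            "should": [
                {"term": {"text": "election"}},
                {"term": {"text": "parliament"}},
                {"term": {"text": "referendum"}},
            ]
        }
    },
    "facets": {
        "by_source": {"terms": {"field": "source"}},
        "per_day": {"date_histogram": {"field": "published_at", "interval": "day"}},
    },
}

resp = requests.post(
    f"{ES_URL}/{INDEX}/_search",
    data=json.dumps(body),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```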
Now, leaving examples aside: ElasticSearch is pretty scalable. It is very simple to create an index with N shards and M replicas and then spin up X machines running ES; it will distribute all the shards and replicas accordingly. You can also change the number of replicas at any time (per index).
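As a minimal sketch of what that looks like against the REST API (the index name and the shard and replica counts are just example values):

```python
import json
import requests

ES_URL = "http://localhost:9200"   # assumed local ES endpoint
INDEX = "articles"                 # hypothetical index name
HEADERS = {"Content-Type": "application/json"}

# Create the index with a fixed number of primary shards and an initial
# replica count; ES spreads the shards and replicas across the nodes.
create_body = {
    "settings": {
        "number_of_shards": 8,
        "number_of_replicas": 1,
    }
}
requests.put(f"{ES_URL}/{INDEX}", data=json.dumps(create_body), headers=HEADERS)

# Later, change the replica count on the live index; only the number of
# primary shards is fixed at creation time.
update_body = {"index": {"number_of_replicas": 2}}
requests.put(f"{ES_URL}/{INDEX}/_settings", data=json.dumps(update_body), headers=HEADERS)
```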
One downside is that you can't change the number of shards after index creation. But you can still "overshard" beforehand to leave room for scaling later, or create a new index with the right number of shards and reindex everything into it (we do this).
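A rough sketch of that reindexing step, using the scan and bulk helpers from the official Python client (elasticsearch-py, assumed installed; the index names, shard count, and client-version details are assumptions and vary between releases):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk

es = Elasticsearch("http://localhost:9200")   # assumed local ES endpoint

OLD_INDEX = "articles"      # hypothetical existing index
NEW_INDEX = "articles_v2"   # new index with the shard count we actually want

# 1. Create the new index with the desired number of primary shards.
es.indices.create(index=NEW_INDEX, body={
    "settings": {"number_of_shards": 16, "number_of_replicas": 1}
})

# 2. Stream every document out of the old index with scan/scroll and
#    write it into the new index through the bulk API.
def actions():
    for hit in scan(es, index=OLD_INDEX, query={"query": {"match_all": {}}}):
        yield {
            "_index": NEW_INDEX,
            "_id": hit["_id"],
            "_source": hit["_source"],
        }

bulk(es, actions())
```

Once the copy finishes, searches can be pointed at the new index (an index alias makes that switch transparent to clients).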
Finally, ElasticSearch (and also Solr) uses Apache Lucene under the hood, which is a very mature and well-known search library.