I'm trying to think of ways to scale our Elasticsearch setup. Do people use multiple client nodes on an Elasticsearch cluster and put them behind a load balancer/reverse proxy like Nginx? Other ideas would be great.
So I'd start by recapping the three different kinds of nodes you can configure in Elasticsearch:
Data Node - node.data is set to true and node.master is set to false - these are the core nodes of an Elasticsearch cluster, where the data is stored.
Dedicated Master Node - node.data is set to false and node.master is set to true - these are responsible for managing the cluster state.
Client Node - node.data is set to false and node.master is set to false - these respond to client data requests, querying for results from the data nodes and gathering the data to return to the client.
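To make the mapping between the two settings and the node types concrete, here is a minimal Python sketch. The function name `node_role` is made up for illustration and is not part of Elasticsearch; the fourth combination (both settings true) is not one of the three types above and is labeled separately.

```python
def node_role(data: bool, master: bool) -> str:
    """Classify a node from its node.data / node.master settings.

    Hypothetical helper for illustration only; not an Elasticsearch API.
    """
    if data and not master:
        return "data node"
    if master and not data:
        return "dedicated master node"
    if not data and not master:
        return "client node"
    # Both true: the node holds data and is also master-eligible,
    # which is none of the three dedicated types described above.
    return "combined data and master-eligible node"


if __name__ == "__main__":
    for data, master in [(True, False), (False, True), (False, False), (True, True)]:
        print(f"node.data={data}, node.master={master} -> {node_role(data, master)}")
```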
By splitting the functions into 3 different base node types you get a great degree of granularity and control in managing the scale of your cluster. Because each node type handles a more isolated set of responsibilities, you are better able to tune each one and to scale it appropriately.
For data nodes, scaling is a function of handling indexing and query load, along with making certain you have enough storage allocated to each node. You'll want to monitor storage usage and disk throughput for each node, along with CPU and memory usage. You want to avoid configurations where you run out of disk, or saturate disk throughput, while still having substantial excess CPU and memory, or the reverse, where memory and CPU max out but you have lots of disk available. The best way to determine this is through some benchmarking of typical indexing and querying activities.
For master nodes, you should always have at least 3 and should always have an odd number. The quorum should be set to N/2 + 1 (using integer division), where N is the number of master nodes. This way you don't run into split brain issues with your cluster. Dedicated master nodes tend not to be heavily loaded, so they can be quite small.
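The quorum arithmetic can be sketched in a few lines of Python. The helper names here are made up for illustration. The sketch also shows why an even count buys nothing: 4 master nodes tolerate the same single failure as 3. In Elasticsearch versions that use Zen discovery, the resulting value is typically what goes into the discovery.zen.minimum_master_nodes setting; check the documentation for your version.

```python
def master_quorum(master_nodes: int) -> int:
    """Return N // 2 + 1, the minimum number of master-eligible nodes
    that must agree before a master can be elected."""
    if master_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_nodes // 2 + 1


def tolerated_failures(master_nodes: int) -> int:
    """How many master nodes can be lost while a quorum is still reachable."""
    return master_nodes - master_quorum(master_nodes)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(f"{n} masters: quorum={master_quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```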
For client nodes, you can indeed put them behind a load balancer, or use DNS entries to point to them. They are easily scaled up and down by just adding more to the cluster, and you should add them both for redundancy and as you see CPU and memory usage climb. They don't need much disk.
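If you don't want to run a separate load balancer, the same idea can be approximated in the application by rotating requests across the client nodes. This is only a rough sketch of round-robin selection: the host names are hypothetical, and it does no health checking or retrying, which a real load balancer such as Nginx would handle for you.

```python
from itertools import cycle

# Hypothetical client node addresses; substitute your own.
CLIENT_NODES = [
    "http://es-client-1:9200",
    "http://es-client-2:9200",
    "http://es-client-3:9200",
]


class RoundRobin:
    """Hand out client node URLs in rotation."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one client node is required")
        self._nodes = cycle(nodes)

    def next_node(self) -> str:
        return next(self._nodes)


if __name__ == "__main__":
    balancer = RoundRobin(CLIENT_NODES)
    for _ in range(5):
        print("send next request to", balancer.next_node())
```

Scaling up then means adding another entry to the list (or another upstream server in the load balancer configuration) once a new client node has joined the cluster.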
No matter what your configuration, in addition to benchmarking likely loads ahead of time, I'd strongly advise close monitoring of CPU, memory and disk - ES is easy to start rolling out, but it does need watching as you scale into larger numbers of transactions and more nodes. Dealing with a yellow or red status cluster due to node failures from memory or disk exhaustion is not a lot of fun.
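As a starting point for that monitoring, here is a small Python sketch that turns a cluster health response into an alert message. It assumes you have already fetched and decoded the JSON returned by the _cluster/health endpoint; the function name and the sample values are made up for illustration, and in practice you would feed this into whatever alerting you already use.

```python
from typing import Optional


def health_alert(health: dict) -> Optional[str]:
    """Return an alert message for a decoded _cluster/health response,
    or None when the cluster reports green."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    nodes = health.get("number_of_nodes", "?")
    if status == "green":
        return None
    if status == "yellow":
        return f"WARNING: cluster yellow, {unassigned} unassigned shard(s), {nodes} node(s)"
    if status == "red":
        return f"CRITICAL: cluster red, {unassigned} unassigned shard(s), {nodes} node(s)"
    return f"UNKNOWN: unexpected cluster status {status!r}"


if __name__ == "__main__":
    # Made-up sample response for illustration.
    sample = {"status": "yellow", "number_of_nodes": 5, "unassigned_shards": 3}
    print(health_alert(sample))
```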
I'd read this article closely for some background:
http://elastic.co/guide/en/elasticsearch/reference/current/modules-node.html
Plus this series of articles:
http://elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html