Courier Fetch: shards failed

Carlos Vega · May 5, 2015 · Viewed 42.7k times

Why do I get these warnings after adding more data to my Elasticsearch instance? The warnings are different every time I browse the dashboard.

"Courier Fetch: 30 of 60 shards failed."

[Screenshot: Example 1]

[Screenshot: Example 2]

More details:

It's a single node running CentOS 7.1.

/etc/elasticsearch/elasticsearch.yml

index.number_of_shards: 3
index.number_of_replicas: 1

bootstrap.mlockall: true

threadpool.bulk.queue_size: 1000
indices.fielddata.cache.size: 50%
threadpool.index.queue_size: 400
index.refresh_interval: 30s

index.number_of_shards: 5
index.number_of_replicas: 1
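
Note that index.number_of_shards and index.number_of_replicas each appear twice in that file; with duplicate keys, the last occurrence normally wins, and these settings only apply to indices created after the change. A quick way to see what each existing index actually got (a minimal check, assuming the default localhost:9200 endpoint):

curl 'localhost:9200/_cat/indices?v&h=health,index,pri,rep'
# pri/rep show the shard and replica counts each index was created with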

/usr/share/elasticsearch/bin/elasticsearch.in.sh

ES_HEAP_SIZE=3G

# I use this garbage collector instead of the default one.

JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
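
To confirm the heap size and GC flag actually took effect, the nodes-info API reports the running JVM's settings (a minimal sketch, again assuming localhost:9200):

curl 'localhost:9200/_nodes/jvm?pretty'
# jvm.mem.heap_max_in_bytes should correspond to ES_HEAP_SIZE (~3 GB here)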

cluster status

{
  "cluster_name" : "my_cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 61,
  "active_shards" : 61,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 61
}
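
The yellow status and the 61 unassigned shards are expected on a one-node cluster with index.number_of_replicas: 1: a replica can never be allocated on the same node as its primary. A hedged sketch of dropping replicas on all existing indices (this clears the yellow status, though it is a separate issue from the fetch failures):

curl -XPUT 'localhost:9200/_settings' -d '{
  "index" : { "number_of_replicas" : 0 }
}'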

cluster details

{
  "cluster_name" : "my_cluster",
  "nodes" : {
    "some weird number" : {
      "name" : "ES 1",
      "transport_address" : "inet[localhost/127.0.0.1:9300]",
      "host" : "some host",
      "ip" : "150.244.58.112",
      "version" : "1.4.4",
      "build" : "c88f77f",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 7854,
        "max_file_descriptors" : 65535,
        "mlockall" : false
      }
    }
  }
}

I'm curious about the "mlockall" : false, because in the yml I did set bootstrap.mlockall: true.
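
(mlockall usually comes back false because the process isn't allowed to lock memory, not because the yml setting was ignored. On CentOS, one common fix, with paths that vary per install so treat this as an assumption, is to raise the memlock limit for the elasticsearch user and restart:

# /etc/security/limits.conf
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

# after restarting, re-check:
curl 'localhost:9200/_nodes/process?pretty'   # "mlockall" should now be true
)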

logs

lots of lines like:

org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@a9a34f5
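
That exception means a thread pool queue overflowed: the search pool's default queue is exactly 1000, matching the "queue capacity 1000" in the log. To see which pool is rejecting, the _cat API exposes per-pool counters (column names as documented for the 1.x _cat API; adjust the header list to taste):

curl 'localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'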

Answer

Philip · Sep 3, 2015

For me, tuning the thread pool search queue_size solved the issue; I tried a number of other things, and this is the one that worked.

I added this to my elasticsearch.yml

threadpool.search.queue_size: 10000

and then restarted elasticsearch.
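
In the 1.x series, thread pool settings are also dynamically updatable, so as an alternative to editing the yml and restarting, you could apply the same change through the cluster settings API (a sketch, assuming a transient setting is acceptable):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : { "threadpool.search.queue_size" : 10000 }
}'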

Reasoning... (from the docs)

A node holds several thread pools in order to improve how thread memory consumption is managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.

and for search in particular...

For count/search operations. Defaults to fixed with a size of int((# of available_processors * 3) / 2) + 1, queue_size of 1000.
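
For example, on a 4-core machine that works out to int((4 * 3) / 2) + 1 = 7 search threads sharing a single queue of 1000 pending requests; a busy Kibana dashboard fanning out one request per shard can overflow that queue quickly, and each rejected request shows up as a failed shard in the Courier Fetch warning.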

For more information, see the Elasticsearch thread pool documentation.

I had trouble finding this information, so I hope this helps others!