I'm attempting to improve performance on a suite that tests against ElasticSearch.
The tests take a long time because Elasticsearch does not update it's indexes immediately after updating. For instance, the following code runs without raising an assertion error.
from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')
# Asumming that this is a clean and empty elasticsearch instance
elasticsearch.update(
index='blog',
doc_type=,'blog'
id=1,
body={
....
}
)
results = elasticsearch.search()
assert not results
# results are not populated
Currently out hacked together solution to this issue is dropping a time.sleep
call into the code, to give ElasticSearch some time to update it's indexes.
from time import sleep
from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')
# Asumming that this is a clean and empty elasticsearch instance
elasticsearch.update(
index='blog',
doc_type=,'blog'
id=1,
body={
....
}
)
# Don't want to use sleep functions
sleep(1)
results = elasticsearch.search()
assert len(results) == 1
# results are now populated
Obviously this isn't great, as it's rather failure prone, hypothetically if ElasticSearch takes longer than a second to update it's indexes, despite how unlikely that is, the test will fail. Also it's extremely slow when you're running 100s of tests like this.
My attempt to solve the issue has been to query the pending cluster jobs to see if there are any tasks left to be done. However this doesn't work, and this code will run without an assertion error.
from elasticsearch import Elasticsearch
elasticsearch = Elasticsearch('es.test')
# Asumming that this is a clean and empty elasticsearch instance
elasticsearch.update(
index='blog',
doc_type=,'blog'
id=1,
body={
....
}
)
# Query if there are any pending tasks
while elasticsearch.cluster.pending_tasks()['tasks']:
pass
results = elasticsearch.search()
assert not results
# results are not populated
So basically, back to my original question, ElasticSearch updates are not immediate, how do you wait for ElasticSearch to finish updating it's index?
As of version 5.0.0, elasticsearch has an option:
?refresh=wait_for
on the Index, Update, Delete, and Bulk api's. This way, the request won't receive a response until the result is visible in ElasticSearch. (Yay!)
See https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-refresh.html for more information.
edit: It seems that this functionality is already part of the latest Python elasticsearch api: https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.index
Change your elasticsearch.update to:
elasticsearch.update(
index='blog',
doc_type='blog'
id=1,
refresh='wait_for',
body={
....
}
)
and you shouldn't need any sleep or polling.