I see that the following API will do delete by query in Elasticsearch - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
I want to do the same with the Elasticsearch bulk API. I can already use bulk to upload docs with
es.bulk(body=json_batch)
but I am not sure how to invoke delete by query using the Python bulk API for Elasticsearch.
The elasticsearch-py bulk API does allow you to delete records in bulk by including '_op_type': 'delete' in each record. However, if you want to delete by query you still need to make two queries: one to fetch the records to be deleted, and another to delete them.
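For illustration, a delete action for the bulk() helper only needs the document's metadata, not its body. The sketch below builds one from a search hit; to_delete_action is a hypothetical helper name, and the sample hit in the usage line is made up.

```python
def to_delete_action(hit):
    """Build a bulk delete action from a search hit, keeping only metadata."""
    action = {
        '_op_type': 'delete',     # tells the bulk helper to delete, not index
        '_index': hit['_index'],
        '_id': hit['_id'],
    }
    # Clusters that still use mapping types also need the type to address the doc.
    if '_type' in hit:
        action['_type'] = hit['_type']
    return action


# Made-up hit, shaped like what a search or scan returns with _source=False
example = to_delete_action({'_index': 'logs', '_type': 'event', '_id': '42', '_score': None})
```

Copying only the metadata fields keeps each action small and avoids sending search-only keys such as _score back to the cluster.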
The easiest way to do this in bulk is to use the Python module's scan() helper, which wraps the Elasticsearch Scroll API so you don't have to keep track of _scroll_id values. Use it with the bulk() helper as a replacement for the deprecated delete_by_query():
from elasticsearch.helpers import bulk, scan

# es is the Elasticsearch client instance, as in es.bulk(...) above
bulk_deletes = []
for result in scan(es,
                   query=es_query_body,  # same as the search() body parameter
                   index=ES_INDEX,
                   doc_type=ES_DOC,
                   _source=False,
                   track_scores=False,
                   scroll='5m'):
    # Turn each hit into a delete action for the bulk helper
    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

bulk(es, bulk_deletes)
Since _source=False is passed, the document body is not returned, so each result is pretty small. However, if you do have memory constraints, you can batch this pretty easily:
BATCH_SIZE = 100000

i = 0
bulk_deletes = []
for result in scan(...):
    if i == BATCH_SIZE:
        # Flush the full batch before collecting the next one
        bulk(es, bulk_deletes)
        bulk_deletes = []
        i = 0
    result['_op_type'] = 'delete'
    bulk_deletes.append(result)
    i += 1

# Delete whatever is left in the final, partial batch
bulk(es, bulk_deletes)
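As an alternative sketch, the delete actions can be produced lazily by a generator, since the bulk() helper accepts any iterable and sends it in chunks of chunk_size actions. The names delete_actions and delete_matching are hypothetical, the chunk_size value is arbitrary, and doc_type is left out on the assumption of a cluster without mapping types.

```python
def delete_actions(hits):
    """Lazily turn search hits into bulk delete actions."""
    for hit in hits:
        hit['_op_type'] = 'delete'
        yield hit


def delete_matching(es, query_body, index):
    """Stream a delete for every document matching query_body (hypothetical helper).

    The import is local so delete_actions can be used without the client installed.
    """
    from elasticsearch.helpers import bulk, scan

    hits = scan(es, query=query_body, index=index, _source=False, scroll='5m')
    # bulk() pulls from the generator one chunk at a time, so the full
    # list of hits is never held in memory at once
    return bulk(es, delete_actions(hits), chunk_size=1000)


# Made-up hits to show the shape of the generated actions
sample_actions = list(delete_actions([{'_index': 'logs', '_id': '1'},
                                      {'_index': 'logs', '_id': '2'}]))
```

With this approach there is no manual counter to maintain; tune chunk_size to trade request size against the number of round trips.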