Elasticsearch Python API: Delete documents by query

sysuser · Nov 7, 2014

I see that the following API performs a delete by query in Elasticsearch: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html

But I want to do the same with the Elasticsearch bulk API. I can already use bulk to upload docs with

es.bulk(body=json_batch)

However, I am not sure how to invoke delete by query using the Python bulk API for Elasticsearch.

Answer

drs · Jan 24, 2016

The elasticsearch-py bulk API does allow you to delete records in bulk by including '_op_type': 'delete' in each record. However, if you want to delete by query, you still need to make two requests: a search to fetch the records to be deleted, and a bulk request to delete them.
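To make the '_op_type': 'delete' record shape concrete, here is a minimal sketch of the action dicts that the bulk() helper accepts. The helper name delete_action, the index name, and the IDs are hypothetical and chosen only for illustration. The '_type' key is added only when a doc_type is given, since whether it is needed depends on your Elasticsearch version.

```python
def delete_action(index, doc_id, doc_type=None):
    """Build one delete action in the dict shape the bulk() helper accepts.

    Hypothetical helper for illustration; not part of elasticsearch-py.
    """
    action = {"_op_type": "delete", "_index": index, "_id": doc_id}
    if doc_type is not None:
        # Only include the type when the caller supplies one.
        action["_type"] = doc_type
    return action


# Assumed index name and document IDs, for illustration only.
actions = [delete_action("my-index", doc_id) for doc_id in ("1", "2", "3")]

# With an existing client `es`, these could then be sent with:
#   from elasticsearch.helpers import bulk
#   bulk(es, actions)
```

A delete action carries only metadata, so no document body is needed alongside it.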

The easiest way to do this in bulk is to use the Python module's scan() helper, which wraps the Elasticsearch Scroll API so you don't have to keep track of _scroll_ids. Use it with the bulk() helper as a replacement for the deprecated delete_by_query():

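from elasticsearch.helpers import bulk, scan

# es is an existing Elasticsearch client; es_query_body, ES_INDEX and ES_DOC are your own values
bulk_deletes = []
for result in scan(es,
                   query=es_query_body,  # same as the search() body parameter
                   index=ES_INDEX,
                   doc_type=ES_DOC,
                   _source=False,
                   track_scores=False,
                   scroll='5m'):

    # Reuse the hit's _index/_type/_id metadata and mark it as a delete action
    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

# Send the deletes through the same client that ran the scan
bulk(es, bulk_deletes)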

Since _source=False is passed, the document body is not returned, so each result is small. However, if you do have memory constraints, you can batch the deletes:

BATCH_SIZE = 100000

i = 0
bulk_deletes = []
for result in scan(...):

    # Flush the accumulated deletes once a full batch has been collected
    if i == BATCH_SIZE:
        bulk(es, bulk_deletes)
        bulk_deletes = []
        i = 0

    result['_op_type'] = 'delete'
    bulk_deletes.append(result)

    i += 1

# Flush whatever is left over from the last partial batch
bulk(es, bulk_deletes)
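As an alternative to counting batches by hand, the bulk() helper accepts any iterable of actions and sends them in chunks whose size is set by its chunk_size parameter, so a generator can feed it one hit at a time. The sketch below assumes hits shaped like those scan() yields. The names delete_actions and sample_hits are hypothetical, and the sample data stands in for a real scan() call.

```python
def delete_actions(hits):
    """Yield one bulk delete action per search hit.

    Hypothetical helper: copies only the metadata needed to address the
    document and drops everything else (score, sort values, and so on).
    """
    for hit in hits:
        action = {"_op_type": "delete", "_index": hit["_index"], "_id": hit["_id"]}
        if "_type" in hit:
            # Older clusters return a _type per hit; pass it through if present.
            action["_type"] = hit["_type"]
        yield action


# Stand-in for the hits that scan() would yield (made-up values).
sample_hits = [
    {"_index": "logs", "_type": "event", "_id": "a1", "_score": None, "sort": [0]},
    {"_index": "logs", "_id": "b2"},
]

actions = list(delete_actions(sample_hits))

# With a real client `es`, the generator could be streamed straight into bulk():
#   from elasticsearch.helpers import bulk, scan
#   bulk(es, delete_actions(scan(es, query=es_query_body, index=ES_INDEX,
#                                _source=False)), chunk_size=1000)
```

Because the generator is consumed lazily, only about one chunk of actions needs to be held in memory at a time, which serves the same purpose as the manual BATCH_SIZE counter.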