Elasticsearch scrolling using the Python client

Dror · Jul 28, 2014 · Viewed 22.1k times

When scrolling in Elasticsearch, it is important to pass the latest scroll_id with each scroll request:

The initial search request and each subsequent scroll request returns a new scroll_id — only the most recent scroll_id should be used.

The following example (taken from here) puzzles me. First, the scrolling initialization:

rs = es.search(index=['tweets-2014-04-12', 'tweets-2014-04-13'],
               scroll='10s',
               search_type='scan',
               size=100,
               preference='_primary_first',
               body={
                   "fields": ["created_at", "entities.urls.expanded_url", "user.id_str"],
                   "query": {
                       "wildcard": {"entities.urls.expanded_url": "*.ru"}
                   }
               })
sid = rs['_scroll_id']

and then the looping:

tweets = []
while (1):
    try:
        rs = es.scroll(scroll_id=sid, scroll='10s')
        tweets += rs['hits']['hits']
    except:
        break

It works, but I don't see where sid is updated... I believe that it happens internally, in the python client; but I don't understand how it works...

Answer

Ryan Widmaier · Aug 14, 2020

This is an old question, but for some reason it came up first when searching for "elasticsearch python scroll". The Python client provides a helper function that does all the work for you. It is a generator function that yields each document while managing the underlying scroll IDs.

https://elasticsearch-py.readthedocs.io/en/master/helpers.html#scan

Here is an example of usage:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

query = {
    "query": {"match_all": {}}
}

es = Elasticsearch(...)
for hit in scan(es, index="my-index", query=query):
    print(hit["_source"]["field"])
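
If you would rather manage the scroll yourself, the key point from the question is that the scroll_id has to be re-read from every response. Below is a rough sketch of such a loop, not the actual implementation of helpers.scan. The name iter_scroll is made up for illustration, es is assumed to be a client exposing search, scroll and clear_scroll the way elasticsearch-py does, and it assumes a regular scroll search whose first response already contains hits (so no search_type='scan').

```python
def iter_scroll(es, index, query, scroll="1m", size=100):
    """Yield every hit for `query`, sending the newest scroll_id each time.

    `iter_scroll` is a hypothetical helper; `es` is assumed to behave like
    an elasticsearch-py client (search, scroll, clear_scroll).
    """
    # The initial search opens the scroll context and returns the first page.
    rs = es.search(index=index, body=query, scroll=scroll, size=size)
    sid = rs.get("_scroll_id")
    try:
        # An empty page means the scroll is exhausted.
        while rs["hits"]["hits"]:
            for hit in rs["hits"]["hits"]:
                yield hit
            rs = es.scroll(scroll_id=sid, scroll=scroll)
            # Each response may carry a new scroll_id; always keep the latest.
            sid = rs.get("_scroll_id", sid)
    finally:
        # Release the scroll context on the server once we are done.
        if sid is not None:
            es.clear_scroll(scroll_id=sid)
```

It would be used the same way as scan above, for example `for hit in iter_scroll(es, "my-index", query): ...`. Stopping on an empty page also avoids relying on a bare except to end the loop.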