Find all occurrences of a key in nested dictionaries and lists

Matt Swain picture Matt Swain · Mar 21, 2012 · Viewed 82.3k times · Source

I have a dictionary like this:

{ "id" : "abcde",
  "key1" : "blah",
  "key2" : "blah blah",
  "nestedlist" : [ 
    { "id" : "qwerty",
      "nestednestedlist" : [ 
        { "id" : "xyz",
          "keyA" : "blah blah blah" },
        { "id" : "fghi",
          "keyZ" : "blah blah blah" }],
      "anothernestednestedlist" : [ 
        { "id" : "asdf",
          "keyQ" : "blah blah" },
        { "id" : "yuiop",
          "keyW" : "blah" }] } ] } 

Basically a dictionary with nested lists, dictionaries, and strings, of arbitrary depth.

What is the best way of traversing this to extract the values of every "id" key? I want to achieve the equivalent of an XPath query like "//id". The value of "id" is always a string.

So from my example, the output I need is basically:

["abcde", "qwerty", "xyz", "fghi", "asdf", "yuiop"]

Order is not important.

Answer

hexerei software picture hexerei software · Apr 15, 2015

I found this Q/A very interesting, since it provides several different solutions for the same problem. I took all these functions and tested them with a complex dictionary object. I had to take two functions out of the test, because they had to many fail results and they did not support returning lists or dicts as values, which i find essential, since a function should be prepared for almost any data to come.

So i pumped the other functions in 100.000 iterations through the timeit module and output came to following result:

0.11 usec/pass on gen_dict_extract(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6.03 usec/pass on find_all_items(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.15 usec/pass on findkeys(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1.79 usec/pass on get_recursively(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.14 usec/pass on find(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0.36 usec/pass on dict_extract(k,o)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

All functions had the same needle to search for ('logging') and the same dictionary object, which is constructed like this:

o = { 'temparature': '50', 
      'logging': {
        'handlers': {
          'console': {
            'formatter': 'simple', 
            'class': 'logging.StreamHandler', 
            'stream': 'ext://sys.stdout', 
            'level': 'DEBUG'
          }
        },
        'loggers': {
          'simpleExample': {
            'handlers': ['console'], 
            'propagate': 'no', 
            'level': 'INFO'
          },
         'root': {
           'handlers': ['console'], 
           'level': 'DEBUG'
         }
       }, 
       'version': '1', 
       'formatters': {
         'simple': {
           'datefmt': "'%Y-%m-%d %H:%M:%S'", 
           'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
         }
       }
     }, 
     'treatment': {'second': 5, 'last': 4, 'first': 4},   
     'treatment_plan': [[4, 5, 4], [4, 5, 4], [5, 5, 5]]
}

All functions delivered the same result, but the time differences are dramatic! The function gen_dict_extract(k,o) is my function adapted from the functions here, actually it is pretty much like the find function from Alfe, with the main difference, that i am checking if the given object has iteritems function, in case strings are passed during recursion:

def gen_dict_extract(key, var):
    if hasattr(var,'iteritems'):
        for k, v in var.iteritems():
            if k == key:
                yield v
            if isinstance(v, dict):
                for result in gen_dict_extract(key, v):
                    yield result
            elif isinstance(v, list):
                for d in v:
                    for result in gen_dict_extract(key, d):
                        yield result

So this variant is the fastest and safest of the functions here. And find_all_items is incredibly slow and far off the second slowest get_recursivley while the rest, except dict_extract, is close to each other. The functions fun and keyHole only work if you are looking for strings.

Interesting learning aspect here :)