I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short documents, with ID numbers as keys and their text content as values.
To perform an 'AND' search for two terms, I thus need to intersect their postings lists (dictionaries). What is a clear (not necessarily overly clever) way to do this in Python? I started out by trying it the long way with `iter`:
p1 = index[term1]
p2 = index[term2]
i1 = iter(p1)
i2 = iter(p2)
while ...  # not sure of the 'iter != end' syntax in this case
...
A little-known fact is that you don't need to construct `set`s to do this:
In Python 2:
In [78]: d1 = {'a': 1, 'b': 2}
In [79]: d2 = {'b': 2, 'c': 3}
In [80]: d1.viewkeys() & d2.viewkeys()
Out[80]: {'b'}
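In Python 3, replace `viewkeys` with `keys`; the same renaming applies to `viewvalues` and `viewitems`, which become `values` and `items`. Note that only the keys and items views are set-like; the values view does not support `&`.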
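For reference, a minimal Python 3 sketch of the same intersection, reusing the `d1` and `d2` from the example above. The variable names `common_keys` and `common_items` are only illustrative:

```python
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'c': 3}

# keys() returns a set-like view in Python 3, so & works directly
common_keys = d1.keys() & d2.keys()

# items() is also set-like, provided the values are hashable;
# a pair is kept only if both key and value match
common_items = d1.items() & d2.items()

print(common_keys)
print(common_items)
```

Both results are ordinary `set` objects, not views.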
From the documentation of `viewitems`:
In [113]: d1.viewitems??
Type: builtin_function_or_method
String Form:<built-in method viewitems of dict object at 0x64a61b0>
Docstring: D.viewitems() -> a set-like object providing a view on D's items
For larger `dict`s this is also slightly faster than constructing `set`s and then intersecting them:
In [122]: d1 = {i: rand() for i in range(10000)}
In [123]: d2 = {i: rand() for i in range(10000)}
In [124]: timeit d1.viewkeys() & d2.viewkeys()
1000 loops, best of 3: 714 µs per loop
In [125]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2
1000 loops, best of 3: 805 µs per loop
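If you want to repeat this comparison on Python 3 outside IPython, something like the following should work. It is only a sketch: it assumes `random.random` as a stand-in for `rand()`, the helper names `with_views` and `with_sets` are made up for illustration, and the timings will differ by machine and Python version, so no numbers are claimed here.

```python
import random
import timeit

d1 = {i: random.random() for i in range(10000)}
d2 = {i: random.random() for i in range(10000)}

def with_views():
    # intersect the key views directly
    return d1.keys() & d2.keys()

def with_sets():
    # build two sets first, then intersect them
    return set(d1) & set(d2)

# total time in seconds for 200 calls of each variant
t_views = timeit.timeit(with_views, number=200)
t_sets = timeit.timeit(with_sets, number=200)
print(f"views: {t_views:.4f}s  sets: {t_sets:.4f}s (200 runs each)")
```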
For smaller `dict`s `set` construction is faster:
In [126]: d1 = {'a': 1, 'b': 2}
In [127]: d2 = {'b': 2, 'c': 3}
In [128]: timeit d1.viewkeys() & d2.viewkeys()
1000000 loops, best of 3: 591 ns per loop
In [129]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2
1000000 loops, best of 3: 477 ns per loop
We're comparing nanoseconds here, which may or may not matter to you. In any case, you get back a `set`, so using `viewkeys`/`keys` eliminates a bit of clutter.
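Applied to the inverted index from the question, an 'AND' search could look roughly like this in Python 3. The sample `index` contents and the function name `and_search` are hypothetical, chosen only to match the structure described in the question (term -> {doc ID -> text}):

```python
# Hypothetical sample index: term -> {document ID -> document text}
index = {
    "python": {1: "python dict tricks", 2: "python set basics"},
    "dict": {1: "python dict tricks", 3: "dict views explained"},
}

def and_search(index, term1, term2):
    """Return {doc_id: text} for documents containing both terms."""
    # a term missing from the index has an empty postings dict
    p1 = index.get(term1, {})
    p2 = index.get(term2, {})
    shared_ids = p1.keys() & p2.keys()
    # rebuild a postings dict restricted to the shared document IDs
    return {doc_id: p1[doc_id] for doc_id in shared_ids}

result = and_search(index, "python", "dict")
print(result)
```

Using `index.get(term, {})` means a term that is not in the index simply yields an empty result instead of raising `KeyError`; whether that is the behaviour you want depends on your program.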