Efficient set intersection of a collection of sets in C++

Paresh picture Paresh · Oct 13, 2012 · Viewed 12.4k times · Source

I have a collection of std::set. I want to find the intersection of all the sets in this collection, in the fastest manner. The number of sets in the collection is typically very small (~5-10), and the number of elements in each set is is usually less than 1000, but can occasionally go upto around 10000. But I need to do these intersections tens of thousands of time, as fast as possible. I tried to benchmark a few methods as follows:

  1. In-place intersection in a std::set object which initially copies the first set. Then for subsequent sets, it iterates over all element of itself and the ith set of the collection, and removes items from itself as needed.
  2. Using std::set_intersection into a temporary std::set, swap contents to a current set, then again find intersection of the current set with the next set and insert into the temp set, and so on.
  3. Manually iterate over all the elements of all sets like in 1), but using a vector as the destination container instead of std::set.
  4. Same as in 4, but using a std::list instead of a vector, suspecting a list will provide faster deletions from the middle.
  5. Using hash sets (std::unordered_set) and checking for all items in all sets.

As it turned out, using a vector is marginally faster when the number of elements in each set is small, and list is marginally faster for larger sets. In-place using set is a substantially slower than both, followed by set_intersection and hash sets. Is there a faster algorithm/datastructure/tricks to achieve this? I can post code snippets if required. Thanks!

Answer

Dietmar Kühl picture Dietmar Kühl · Oct 13, 2012

You might want to try a generalization of std::set_intersection(): the algorithm is to use iterators for all sets:

  1. If any iterator has reached the end() of its corresponding set, you are done. Thus, it can be assumed that all iterators are valid.
  2. Take the first iterator's value as the next candidate value x.
  3. Move through the list of iterators and std::find_if() the first element at least as big as x.
  4. If the value is bigger than x make it the new candidate value and search again in the sequence of iterators.
  5. If all iterators are on value x you found an element of the intersection: Record it, increment all iterators, start over.