Union vs Unionwith in HashSet

Max.Futerman picture Max.Futerman · Aug 28, 2017 · Viewed 14.9k times · Source

What the difference between HashSet.Union vs HashSet.Unionwith when i combine 2 hashsets.

I am trying to combine like this:

HashSet<EngineType> enginesSupportAll = _filePolicyEvaluation.EnginesSupportAll;
        enginesSupportAll = enginesSupportAll != null ? new HashSet<EngineType>(engines.Union(enginesSupportAll)) : enginesSupportAll;

what is the best method for this example and why?

Answer

Kasper van den Berg picture Kasper van den Berg · Nov 16, 2017

Given HashSet<T> A and HashSet<T> B there are four ways to AB:

  1. new HashSet<t>(A.Union(B))
    (see HashSet<T&>(IEnumerable<T>) and Enumerable.Union<T>(IEnumerable<T>, IEnumerable<T>))
  2. A.UnionWith(B)
  3. HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
  4. new HashSet<t>(A.Concat(B))
    (see Enumerable.Concat<T>(IEnumerable<T>, IEnumerable<T>))

Each has its advantages and disadvantages:

  • 1 and 4 are expressions that results in a HashSet whereas 2 and 3 are statements or statement blocks.
    Expression 1 and 4 can be used in more places than 2 and 3. For example, using 2 or 3 in a linq query syntax expression is cumbersome:
    from x in setofSetsA as IEnumerable<HashSet<T>> from y in setOfSetsB as IEnumerable<HashSet<T>> select x.UnionWith(y) will not work since UnionWith returns void.
  • 1, 3, and 4 preserve A and B as they are and return a fresh set whereas 2 modifies A.
    There are situations where modifying one of the original sets is bad and there are situations where at least one of the original sets can be modified without negative consequences.
  • Computational cost:

    A.UnionWith(B)
    (≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|B|))

    HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
    (≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|A| + |B|))

    HashSet<T>(A.Concat(B))
    (≈ O(log(|A∪B|) * |A∪B|) + O(|A| + |B|))

    HashSet<T>(A.Union(B))
    (≈ 2 * O(log(|A∪B|) * |A∪B|) + O(|A| + |B| + |A∪B|))

    The next section delves into the reference source to see the basis of these performance estimates.

Performance

HashSet<T>

In union option 1, 3, and 4 the constructor HashSet<T>(IEnumerable<T>, IEqualityComparer<T>) is used to create a HashSet<T> from an IEnumerable<T>. If the passed IEnumerable<T> has a Count property —i.e. if it is an ICollection<T>—, this property is used to set the size of the new HashSet:

int suggestedCapacity = 0;
ICollection<T> coll = collection as ICollection<T>;
if (coll != null) {
    suggestedCapacity = coll.Count;
}
Initialize(suggestedCapacity);

-- HashSet.cs line 136–141

The [Count()][10] method is never called. So if the IEnumerable's count can be retrieved without effort it is used to reserve capacity; otherwise the HashSet grows and reallocates when new elements are added.
In option 1 A.Union(B) and option 4 A.Concat(B) are not ICollection<T> therefore the created HashSet will grow and reallocate a few (≈ log(|A∪B|)) times. Option 3 can use the Count of A.

The constructor calls UnionWith to fill the new empty HashSet:

this.UnionWith(collection);

-- HashSet.cs line 143

UnionWith(IEnumerable<T>) iterates over the elements in the IEnumerable<T> passed as argument and calls AddIfNotPresent(T) for each.

AddIfNotPresent(T) inserts elements and ensures that duplicates are never inserted into the set.
HashSet<T> is implemented as an array of slots, m_slots, and an array of buckets, m_buckets. A bucket contains just an int index into the m_slots array. Per bucket the Slots in m_slots form a linked list with an index to the next Slot in m_slots.

AddIfNotPresent(T) jumps to the correct bucket and then traverses its linked list to check if the element is already present:

for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
    if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, value)) {
        return false;
    }
}

-- HashSet.cs line 968–975

Next a free index is found and a slot is reserved. First the list of free slots, m_freelist, is checked. When there are no slots in the freelist, the next empty slot in the m_slots array is used. More capacity is reserved (via IncreaseCapacity()) if there are no slots in the freelist and no empty slots:

int index;
if (m_freeList >= 0) {
    index = m_freeList;
    m_freeList = m_slots[index].next;
}
else {
    if (m_lastIndex == m_slots.Length) {
        IncreaseCapacity();
        // this will change during resize
        bucket = hashCode % m_buckets.Length;
    }
    index = m_lastIndex;
    m_lastIndex++;
}

-- HashSet.cs line 977–990

AddIfNotPresent(T) has three operations that demand some computation: the call to object.GetHashCode(), calling object.Equals(object) when there is a collision, and IncreaseCapacity(). The actual adding of an element only incurs the cost of setting some pointers and some ints.

When the HashSet<T> needs to IncreaseCapacity() the capacity is at least doubled. Therefore we can conclude that on average a HashSet<T> is filled for 75%. If the hashes are evenly distributed, the expectancy of a hash collision is also 75%.

SetCapacity(int, bool), called by IncreaseCapacity(), is the most expensive: it allocates new arrays, it copies the old slot array to the new array, and it recalculates the bucket lists:

Slot[] newSlots = new Slot[newSize];
if (m_slots != null) {
    Array.Copy(m_slots, 0, newSlots, 0, m_lastIndex);
}

...

int[] newBuckets = new int[newSize];
for (int i = 0; i < m_lastIndex; i++) {
    int bucket = newSlots[i].hashCode % newSize;
    newSlots[i].next = newBuckets[bucket] - 1;
    newBuckets[bucket] = i + 1;
}
m_slots = newSlots;
m_buckets = newBuckets;

-- HashSet.cs line 929–949

Option 1 and 4 (new HashSet<T>(A.Union(B))) will result in slightly more calls to IncreaseCapacity(). The cost —without the cost of A.Union(B) or A.Concat(B)— is approximately O(log(|A∪B|) * |A∪B|).
Whereas when using option 2 (A.UnionWith(B)) or option 3 (HashSet<T> C = new HashSet<T>(A); C.UnionWith(B)) we get a 'discount' of log(|A|) on the cost: O((log(|A∪B|) - log(|A|)) * |A∪B|). It pays (slightly) to use the largest set as the target into witch the other is merged.

Enumerable<T>.Union(IEnumerable<T>)

Enumerable<T>.Union(IEnumerable<T>) is implemented via UnionIterator<T>(IEnumerable<T>, IEnumerable<T>, IEqualityComparer<T>).
The UnionIterator uses Set<T> —an internal class in Enumerable.cs— that is very similar to HashSet<T>. UnionIterator lazily Add(T)s items from A and B to this Set<T> and yields the elements if they could be added. The work is done in Find(T, bool) which is similar to HashSet<T>.AddIfNotPresent(T). Check if the element is already present:

int hashCode = InternalGetHashCode(value);
for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].next) {
    if (slots[i].hashCode == hashCode && comparer.Equals(slots[i].value, value)) return true;
}

-- Enumerable.cs line 2423–2426

Find a free index and reserve a slot:

int index;
if (freeList >= 0) {
    index = freeList;
    freeList = slots[index].next;
}
else {
    if (count == slots.Length) Resize();
    index = count;
    count++;
}
int bucket = hashCode % buckets.Length;
slots[index].hashCode = hashCode;
slots[index].value = value;
slots[index].next = buckets[bucket] - 1;
buckets[bucket] = index + 1;

-- Enumerable.cs line 2428–2442

Resize() is similar to IncreaseCapacity(). The biggest difference between the two is that Resize() does not use a prime number for the number of buckets, so with bad GetHashCode() there is a slightly higher chance for collisions. Code of Resize():

int newSize = checked(count * 2 + 1);
int[] newBuckets = new int[newSize];
Slot[] newSlots = new Slot[newSize];
Array.Copy(slots, 0, newSlots, 0, count);
for (int i = 0; i < count; i++) {
    int bucket = newSlots[i].hashCode % newSize;
    newSlots[i].next = newBuckets[bucket] - 1;
    newBuckets[bucket] = i + 1;
}
buckets = newBuckets;
slots = newSlots;

-- Enumerable.cs line 2448–2458

The performance cost of A.Union(B) is not significantly different from that of HashSet<T> C = new HashSet<T>(); C.UnionWith(A); C.UnionWith(B);. In option 1 (new HashSet<T>(A.Union(B))) the same HashSet is created twice resulting in a very costly 2 * O(log(|A∪B|) * (|A∪B|)). Option 4 results from knowing how HashSet<T>(IEnumerable<T>) and Enumerable.Union(IEnumerable<T>, IEnumerable<T>) are implemented. It avoids the redundant A.Union(B) leading to a cost of O(log(|A∪B|) * |A∪B|).