What the difference between HashSet.Union
vs HashSet.Unionwith
when i combine 2 hashsets.
I am trying to combine like this:
HashSet<EngineType> enginesSupportAll = _filePolicyEvaluation.EnginesSupportAll;
enginesSupportAll = enginesSupportAll != null ? new HashSet<EngineType>(engines.Union(enginesSupportAll)) : enginesSupportAll;
what is the best method for this example and why?
Given HashSet<T>
A and HashSet<T>
B there are four ways to A ∪ B:
new HashSet<t>(A.Union(B))
HashSet<T&>(IEnumerable<T>)
and
Enumerable.Union<T>(IEnumerable<T>, IEnumerable<T>)
)A.UnionWith(B)
HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
new HashSet<t>(A.Concat(B))
Enumerable.Concat<T>(IEnumerable<T>, IEnumerable<T>)
)Each has its advantages and disadvantages:
HashSet
whereas 2 and 3 are statements or statement blocks.from x in setofSetsA as IEnumerable<HashSet<T>>
from y in setOfSetsB as IEnumerable<HashSet<T>>
select x.UnionWith(y)
will not work since UnionWith
returns void.Computational cost:
A.UnionWith(B)
(≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|B|))
≤
HashSet<T> C = new HashSet<T>(A); C.UnionWith(B);
(≈ O((log(|A∪B|) - log(|A|)) * |A∪B|) + O(|A| + |B|))
≤
HashSet<T>(A.Concat(B))
(≈ O(log(|A∪B|) * |A∪B|) + O(|A| + |B|))
≤
HashSet<T>(A.Union(B))
(≈ 2 * O(log(|A∪B|) * |A∪B|) + O(|A| + |B| + |A∪B|))
The next section delves into the reference source to see the basis of these performance estimates.
HashSet<T>
In union option 1, 3, and 4 the constructor HashSet<T>(IEnumerable<T>, IEqualityComparer<T>)
is used to create a HashSet<T>
from an IEnumerable<T>
.
If the passed IEnumerable<T>
has a Count
property —i.e. if it is an ICollection<T>
—, this property is used to set the size of the new HashSet
:
int suggestedCapacity = 0; ICollection<T> coll = collection as ICollection<T>; if (coll != null) { suggestedCapacity = coll.Count; } Initialize(suggestedCapacity);
The [Count()][10]
method is never called. So if the IEnumerable
's count can be retrieved without effort it is used to reserve capacity; otherwise the HashSet
grows and reallocates when new elements are added.
In option 1 A.Union(B)
and option 4 A.Concat(B)
are not ICollection<T>
therefore the created HashSet
will grow and reallocate a few (≈ log(|A∪B|)) times. Option 3 can use the Count
of A.
The constructor calls UnionWith
to fill the new empty HashSet
:
this.UnionWith(collection);
UnionWith(IEnumerable<T>)
iterates over the elements in the IEnumerable<T>
passed as argument and calls AddIfNotPresent(T)
for each.
AddIfNotPresent(T)
inserts elements and ensures that duplicates are never inserted into the set.
HashSet<T>
is implemented as an array of slots, m_slots
, and an array of buckets, m_buckets
. A bucket contains just an int
index into the m_slots
array. Per bucket the Slot
s in m_slots
form a linked list with an index to the next Slot
in m_slots
.
AddIfNotPresent(T)
jumps to the correct bucket and then traverses its linked list to check if the element is already present:
for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) { if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, value)) { return false; } }
Next a free index is found and a slot is reserved. First the list of free slots, m_freelist
, is checked. When there are no slots in the freelist, the next empty slot in the m_slots
array is used. More capacity is reserved (via IncreaseCapacity()
) if there are no slots in the freelist and no empty slots:
int index; if (m_freeList >= 0) { index = m_freeList; m_freeList = m_slots[index].next; } else { if (m_lastIndex == m_slots.Length) { IncreaseCapacity(); // this will change during resize bucket = hashCode % m_buckets.Length; } index = m_lastIndex; m_lastIndex++; }
AddIfNotPresent(T)
has three operations that demand some computation: the call to object.GetHashCode()
, calling object.Equals(object)
when there is a collision, and IncreaseCapacity()
. The actual adding of an element only incurs the cost of setting some pointers and some ints.
When the HashSet<T>
needs to IncreaseCapacity()
the capacity is at least doubled. Therefore we can conclude that on average a HashSet<T>
is filled for 75%. If the hashes are evenly distributed, the expectancy of a hash collision is also 75%.
SetCapacity(int, bool)
, called by IncreaseCapacity()
, is the most expensive: it allocates new arrays, it copies the old slot array to the new array, and it recalculates the bucket lists:
Slot[] newSlots = new Slot[newSize]; if (m_slots != null) { Array.Copy(m_slots, 0, newSlots, 0, m_lastIndex); } ... int[] newBuckets = new int[newSize]; for (int i = 0; i < m_lastIndex; i++) { int bucket = newSlots[i].hashCode % newSize; newSlots[i].next = newBuckets[bucket] - 1; newBuckets[bucket] = i + 1; } m_slots = newSlots; m_buckets = newBuckets;
Option 1 and 4 (new HashSet<T>(A.Union(B))
) will result in slightly more calls to IncreaseCapacity()
. The cost —without the cost of A.Union(B)
or A.Concat(B)
— is approximately O(log(|A∪B|) * |A∪B|).
Whereas when using option 2 (A.UnionWith(B)
) or option 3 (HashSet<T> C = new HashSet<T>(A); C.UnionWith(B)
) we get a 'discount' of log(|A|) on the cost: O((log(|A∪B|) - log(|A|)) * |A∪B|). It pays (slightly) to use the largest set as the target into witch the other is merged.
Enumerable<T>.Union(IEnumerable<T>)
Enumerable<T>.Union(IEnumerable<T>)
is implemented via UnionIterator<T>(IEnumerable<T>, IEnumerable<T>, IEqualityComparer<T>)
.
The UnionIterator
uses Set<T>
—an internal class in Enumerable.cs
— that is very similar to HashSet<T>
. UnionIterator
lazily Add(T)
s items from A and B to this Set<T>
and yields
the elements if they could be added. The work is done in Find(T, bool)
which is similar to HashSet<T>.AddIfNotPresent(T)
. Check if the element is already present:
int hashCode = InternalGetHashCode(value); for (int i = buckets[hashCode % buckets.Length] - 1; i >= 0; i = slots[i].next) { if (slots[i].hashCode == hashCode && comparer.Equals(slots[i].value, value)) return true; }
Find a free index and reserve a slot:
int index; if (freeList >= 0) { index = freeList; freeList = slots[index].next; } else { if (count == slots.Length) Resize(); index = count; count++; } int bucket = hashCode % buckets.Length; slots[index].hashCode = hashCode; slots[index].value = value; slots[index].next = buckets[bucket] - 1; buckets[bucket] = index + 1;
Resize()
is similar to IncreaseCapacity()
. The biggest difference between the two is that Resize()
does not use a prime number for the number of buckets, so with bad GetHashCode()
there is a slightly higher chance for collisions. Code of Resize()
:
int newSize = checked(count * 2 + 1); int[] newBuckets = new int[newSize]; Slot[] newSlots = new Slot[newSize]; Array.Copy(slots, 0, newSlots, 0, count); for (int i = 0; i < count; i++) { int bucket = newSlots[i].hashCode % newSize; newSlots[i].next = newBuckets[bucket] - 1; newBuckets[bucket] = i + 1; } buckets = newBuckets; slots = newSlots;
The performance cost of A.Union(B)
is not significantly different from that of HashSet<T> C = new HashSet<T>(); C.UnionWith(A); C.UnionWith(B);
. In option 1 (new HashSet<T>(A.Union(B))
) the same HashSet
is created twice resulting in a very costly 2 * O(log(|A∪B|) * (|A∪B|)). Option 4 results from knowing how HashSet<T>(IEnumerable<T>)
and Enumerable.Union(IEnumerable<T>, IEnumerable<T>)
are implemented. It avoids the redundant A.Union(B)
leading to a cost of O(log(|A∪B|) * |A∪B|).