What's better for creating distinct data structures: HashSet or Linq's Distinct()?

MegaMatt · Jun 9, 2011 · Viewed 11.2k times

I'm wondering whether I can get a consensus on which method is the better approach to creating a distinct set of elements: a C# HashSet or using IEnumerable's .Distinct(), which is a Linq function?

Let's say I'm looping through query results from the DB with a DataReader, and my options are to add the objects I construct to a List<SomeObject> or to a HashSet<SomeObject>. With the List option, I would wind up having to do something like:

myList = myList.Distinct().ToList<SomeObject>();

With the HashSet, my understanding is that adding elements to it takes care of the non-duplication by itself, assuming you've overridden the GetHashCode() and Equals() methods in SomeObject. I'm concerned mainly with the risks and performance aspects of the options.
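For example, I imagine SomeObject would look something like this (Id is just a hypothetical identity field for illustration, not my actual schema):

```csharp
using System;
using System.Collections.Generic;

var a = new SomeObject { Id = 1, Name = "first" };
var b = new SomeObject { Id = 1, Name = "second" };

// b is treated as a duplicate of a because they share an Id.
var set = new HashSet<SomeObject> { a, b };
Console.WriteLine(set.Count);   // 1

public class SomeObject
{
    public int Id { get; set; }       // hypothetical identity field
    public string Name { get; set; }

    // Two instances are equal when their Ids match.
    public override bool Equals(object obj) =>
        obj is SomeObject other && other.Id == Id;

    // Equal objects must produce equal hash codes.
    public override int GetHashCode() => Id.GetHashCode();
}
```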

Thanks.

Answer

nawfal · Nov 22, 2012

Anthony Pegram has said it best: use the right tool for the job. I say this because a Distinct or a HashSet isn't that different when it comes to performance. Use a HashSet when the collection should always hold only distinct items; it also tells the programmer that you can't add duplicates to it. Use a normal List<T> and .Distinct() on it when you will have to add duplicates and remove them later. The intention matters.
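To illustrate the difference in intent, a minimal sketch (plain ints here, nothing from the question's domain):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// HashSet: duplicates are rejected at insert time.
var set = new HashSet<int> { 1, 2 };
bool added = set.Add(2);                 // false: already present

// List + Distinct: duplicates are allowed, removed on demand.
var list = new List<int> { 1, 2, 2 };
var unique = list.Distinct().ToList();   // [1, 2]

Console.WriteLine($"{added}, {unique.Count}");   // False, 2
```

Note that Add also tells you immediately whether the item was new, which Distinct can't do.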

In general,

a) A HashSet may not do any good if you're adding new objects from the db and you haven't specified a custom Equals of your own. Every object from the db can be a new instance for your hash set (if you are just new-ing them up), and that will lead to duplicates in the collection. In that case, use a normal List<T>.

b) If you do have an equality comparer defined for the hash set, and your collection should always hold only distinct objects, use a hash set.
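As a sketch, such a comparer can also live outside the type instead of overriding Equals/GetHashCode on it (SomeObject and Id are assumed placeholder names, not from the question):

```csharp
using System;
using System.Collections.Generic;

var hash = new HashSet<SomeObject>(new SomeObjectIdComparer());
hash.Add(new SomeObject { Id = 1 });
hash.Add(new SomeObject { Id = 1 });   // rejected: same Id
Console.WriteLine(hash.Count);         // 1

public class SomeObject
{
    public int Id { get; set; }   // hypothetical identity field
}

// Keeps the equality logic out of SomeObject itself.
public class SomeObjectIdComparer : IEqualityComparer<SomeObject>
{
    public bool Equals(SomeObject x, SomeObject y) => x?.Id == y?.Id;
    public int GetHashCode(SomeObject obj) => obj.Id.GetHashCode();
}
```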

c) If you do have an equality comparer defined for the hash set, and you want only distinct objects from the db, but the collection need not always hold only distinct objects (i.e., duplicates need to be added later), a faster approach is to get the items from the db into a hash set and then return a regular list from it.

d) The best thing you can do is give the task of removing duplicates to the database; that's the right tool, and it's first class!

As for performance differences, in my testing I always found HashSet to be faster, but only marginally. That's to be expected, since with the List approach you have to first add everything and then run Distinct on it.

Test method: Starting with two general functions,

public static void Benchmark(Action method, int iterations = 10000)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < iterations; i++)
        method();

    sw.Stop();
    Console.WriteLine(sw.Elapsed.TotalMilliseconds);
}

public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
    if (count < 0)
        throw new ArgumentOutOfRangeException("count");

    var ret = Enumerable.Empty<T>();

    for (var i = 0; i < count; i++)
        ret = ret.Concat(lst);

    return ret.ToList();
}

Implementation:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }
});

~3300 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list.Clear();
    foreach (var item in d)
    {
        list.Add(item);
    }

    list = list.Distinct().ToList();
});

~5800 ms

A difference of 2.5 seconds is not bad for a list of 10000 objects when iterated another 10000 times. For normal cases the difference will be hardly noticeable.

Possibly the best approach for you, with your current design:

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();

Benchmark(() =>
{
    hash.Clear();
    foreach (var item in d)
    {
        hash.Add(item);
    }

    list = hash.ToList();
});

~3300 ms

There isn't any significant difference, as you can see.


Partly unrelated: after posting this answer, I was curious to know the best approach for removing duplicates from a normal list.

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();

Benchmark(() =>
{
    hash = new HashSet<int>(d);
});

~3900 ms

var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();

Benchmark(() =>
{
    list = d.Distinct().ToList();
});

~3200 ms

Here the right tool, Distinct, is faster than the hackish HashSet! Perhaps it's the overhead of creating a hash set.


I have tested various other combinations as well, such as reference types and lists without duplicates, and the results are consistent.