SortedSet / SortedList with better LINQ performance?

Max · Feb 3, 2013

Let's say we have a sorted collection such as SortedSet or SortedList with many (10M+) elements. Lots of querying is happening, so performance matters. From runtime comparisons, I'm under the impression that LINQ to Objects doesn't take advantage of the sorting and therefore misses potential performance gains.

First example - counting the elements in a range:

        var mySortedSet1 = new SortedSet<int>();
        // populate ...
        int rangeCount = (from n in mySortedSet1
                          where ((n >= 1000000000) && (n <= 2000000000))
                          select n).Count();

I'm not exactly sure what LINQ to Objects does here internally; in the worst case it checks every single element, which would be O(n). This can be done a lot faster by taking advantage of the sorting, with a binary search for the lower and upper bounds in O(log n).
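As a rough illustration of the binary-search idea, here is a minimal sketch. It assumes the data sits in an indexable, ascending-sorted list (a sorted List<int>, an array, or SortedList<TKey,TValue>.Keys) ordered by the type's default comparison. SortedSet<T> has no indexer, so this does not apply to it directly. The names SortedRange, LowerBound, UpperBound and CountInRange are made up for this example and are not framework APIs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the framework.
public static class SortedRange
{
    // Index of the first element >= value, or sorted.Count if there is none.
    public static int LowerBound<T>(IList<T> sorted, T value) where T : IComparable<T>
    {
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid].CompareTo(value) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Index of the first element > value, or sorted.Count if there is none.
    public static int UpperBound<T>(IList<T> sorted, T value) where T : IComparable<T>
    {
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (sorted[mid].CompareTo(value) <= 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Number of elements in the inclusive range [min, max], using two binary searches.
    // Assumes 'sorted' is ascending under T's default comparison.
    public static int CountInRange<T>(IList<T> sorted, T min, T max) where T : IComparable<T>
    {
        return Math.Max(0, UpperBound(sorted, max) - LowerBound(sorted, min));
    }
}
```

With a sorted List<int> named mySortedList (also hypothetical), SortedRange.CountInRange(mySortedList, 1000000000, 2000000000) would replace the Where(...).Count() query with two O(log n) searches.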

Second example - SelectMany over list of sets:

        var myListOfSortedSets = new List<SortedSet<int>>();
        // populate...

        var q = myListOfSortedSets.SelectMany(s => s).OrderBy(s => s);
        foreach (var n in q)
        {
            Console.WriteLine(n);
        }

If LINQ to Objects were to take advantage of the sorting, it could effectively zipper-merge all the sorted sets into one large sorted list in O(n) (more precisely O(n log k) for k sets). The .OrderBy on the result could then be ignored as the list is already sorted.

Instead, SelectMany concatenates all the sorted sets into one large (now unsorted) sequence, which then requires another O(n log n) sort. This can easily be verified by removing the .OrderBy and observing the order in which the elements are written to the console.
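To make the zipper-merge idea concrete, here is a minimal sketch of a k-way merge. It assumes every input sequence is already sorted ascending by the default comparer for T, and it uses PriorityQueue<TElement, TPriority>, which is only available from .NET 6 onward. SortedMerge and MergeSorted are names invented for this example. With k inputs and n elements in total the merge costs O(n log k), and it streams results lazily instead of buffering and re-sorting.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of LINQ.
public static class SortedMerge
{
    // Lazily merges sequences that are each sorted ascending under Comparer<T>.Default.
    public static IEnumerable<T> MergeSorted<T>(IEnumerable<IEnumerable<T>> sources)
    {
        // Each queue entry is an enumerator, prioritised by its current element.
        var heap = new PriorityQueue<IEnumerator<T>, T>();
        try
        {
            foreach (var source in sources)
            {
                var first = source.GetEnumerator();
                if (first.MoveNext()) heap.Enqueue(first, first.Current);
                else first.Dispose();
            }

            while (heap.TryDequeue(out var current, out var value))
            {
                yield return value;
                // Re-insert the enumerator keyed by its next element, if it has one.
                if (current.MoveNext()) heap.Enqueue(current, current.Current);
                else current.Dispose();
            }
        }
        finally
        {
            // Dispose enumerators left over if the caller stops iterating early.
            while (heap.TryDequeue(out var leftover, out _)) leftover.Dispose();
        }
    }
}
```

Under those assumptions, SortedMerge.MergeSorted<int>(myListOfSortedSets) could stand in for myListOfSortedSets.SelectMany(s => s).OrderBy(s => s). It is only valid when all sets share the same default ordering.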

My question is: is there already an alternative, more efficient implementation of LINQ to SortedSet/SortedList out there?

i4o looks very interesting, but it seems to require secondary index collections to improve query performance on the original collection. I just want queries on my sorted collections to run faster by taking advantage of the sorting.

Answer

jessehouwing · Feb 3, 2013

The problem for LINQ is that it can't know the sorted set is ordered exactly the same way as the query expects. Since any ordered collection can be created with an IComparer / IComparable / Comparison<T>, there is no knowing that > 500000 actually makes sense. Maybe you've got a custom method on the comparer that first sorts by Odd/Even, then by number. In that case the elements matching the predicate would no longer form one contiguous run, and an O(n) scan is required in all cases.
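A small sketch of that Odd/Even scenario shows the problem. EvenFirstComparer and ComparerDemo are hypothetical names written for this illustration only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: even numbers sort before odd ones, ascending within each group.
public sealed class EvenFirstComparer : IComparer<int>
{
    public int Compare(int x, int y)
    {
        int parityX = x & 1, parityY = y & 1;
        return parityX != parityY ? parityX.CompareTo(parityY) : x.CompareTo(y);
    }
}

public static class ComparerDemo
{
    // Returns the elements 1..6 in the order the SortedSet enumerates them.
    public static List<int> EnumerationOrder()
    {
        var set = new SortedSet<int>(new EvenFirstComparer()) { 1, 2, 3, 4, 5, 6 };
        return new List<int>(set);
    }
}
```

The set enumerates as 2, 4, 6, 1, 3, 5. A numeric predicate such as n >= 3 matches 4, 6, 3 and 5, which do not sit next to each other in that order, so a binary search for numeric bounds would give wrong results on this set.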

So to be on the safe side, LINQ needs to iterate through all elements in the collection, even when it is sorted in some way. The default .Where implementation does not contain an optimization for ordered collections.

It might be possible to create an optimized version which keeps the existing ordering in mind while iterating, but it will be very difficult to do and to make it work in all cases.

You could create a Between method that uses the GetViewBetween method of SortedSet to return a new pre-ordered collection. Or you could just use the standard .Where as you normally would for any non-pre-sorted set.
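Such a Between method might look like the sketch below. The extension class and method name are assumptions made for this example; GetViewBetween itself is the real SortedSet<T> method. It treats both bounds as inclusive, interprets them with the set's own comparer, and returns a view backed by the original set rather than a copy.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical extension class; only GetViewBetween comes from the framework.
public static class SortedSetExtensions
{
    // Returns the elements in the inclusive range [lower, upper] as a view over 'set'.
    // GetViewBetween throws ArgumentException if lower is greater than upper
    // according to the set's comparer.
    public static SortedSet<T> Between<T>(this SortedSet<T> set, T lower, T upper)
    {
        if (set == null) throw new ArgumentNullException(nameof(set));
        return set.GetViewBetween(lower, upper);
    }
}
```

For the range query from the question this would read mySortedSet1.Between(1000000000, 2000000000).Count, and further LINQ operators such as .Where can be chained onto the view. Whether Count on the view is actually cheaper than a full scan is worth measuring on your runtime.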

LINQ to SQL and Entity Framework make use of IQueryable and will actually translate your LINQ query to SQL and let the server handle the indexing, sorting, filtering etc.