Let's say we have a sorted collection such as SortedSet or SortedList with many (10M+) elements. Lots of querying is happening, so performance matters. From runtime comparisons, I get the impression that LINQ to Objects doesn't take advantage of the sorting, and therefore misses out on potential performance gains.
First example - counting the elements in a range:
var mySortedSet1 = new SortedSet<int>();
// populate ...
int rangeCount = (from n in mySortedSet1
                  where n >= 1000000000 && n <= 2000000000
                  select n).Count();
I'm not exactly sure what LINQ to Objects does here internally, but in the worst case it checks every single element, which is O(n). This can be done much faster by taking advantage of the sorting: binary-search for the lower and upper bounds in O(log n).
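To illustrate the binary-search approach (a sketch, assuming the values also live in a plain sorted List<int> with distinct elements; RangeCount is a hypothetical helper, not part of LINQ):

```csharp
using System;
using System.Collections.Generic;

static class SortedRange
{
    // Counts the elements in [lower, upper] of a sorted list of distinct
    // values using two binary searches: O(log n) instead of a linear
    // Where(...).Count().
    public static int RangeCount(List<int> sorted, int lower, int upper)
    {
        int lo = sorted.BinarySearch(lower);
        if (lo < 0) lo = ~lo;            // first index >= lower

        int hi = sorted.BinarySearch(upper);
        hi = hi < 0 ? ~hi : hi + 1;      // first index > upper

        return hi - lo;
    }
}
```

For example, RangeCount(new List<int> { 1, 3, 5, 7 }, 3, 6) returns 2 (it counts 3 and 5). List<T>.BinarySearch returns the bitwise complement of the insertion point when the value is not found, which is what makes the bound computation work.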
Second example - SelectMany over list of sets:
var myListOfSortedSets = new List<SortedSet<int>>();
// populate...
var q = myListOfSortedSets.SelectMany(s => s).OrderBy(n => n);
foreach (var n in q)
{
Console.WriteLine(n);
}
If LINQ to Objects were to take advantage of the sorting, it could zipper-merge the k sorted sets into one sorted sequence in O(n log k) (O(n) for a pair of sets). The .OrderBy on the result could then be skipped, as the sequence is already sorted.
Instead, SelectMany concatenates all the sorted sets into one large (now unsorted) sequence, which then requires another O(n log n) sort. This is easy to verify by removing the .OrderBy and observing the order in which the elements are written to the console.
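The zipper-merge could look something like this (a sketch of a hypothetical MergeSorted extension, not a LINQ method; it uses PriorityQueue<TElement, TPriority>, which is available since .NET 6):

```csharp
using System.Collections.Generic;

static class SortedMerge
{
    // Merges already-sorted sequences into one sorted sequence in O(n log k),
    // where k is the number of sequences, avoiding a full O(n log n) re-sort.
    public static IEnumerable<T> MergeSorted<T>(
        this IEnumerable<IEnumerable<T>> sources, IComparer<T>? comparer = null)
    {
        comparer ??= Comparer<T>.Default;

        // The heap holds one enumerator per sequence, prioritized by its
        // current element, so the globally smallest element is always on top.
        var heap = new PriorityQueue<IEnumerator<T>, T>(comparer);

        foreach (var source in sources)
        {
            var e = source.GetEnumerator();
            if (e.MoveNext()) heap.Enqueue(e, e.Current);
        }

        while (heap.TryDequeue(out var e, out var value))
        {
            yield return value;
            if (e.MoveNext()) heap.Enqueue(e, e.Current);
        }
    }
}
```

With this, myListOfSortedSets.MergeSorted() yields the elements in sorted order with no .OrderBy needed. (For brevity the sketch does not dispose the enumerators; production code should.)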
My question is: is there already an alternative, more efficient implementation of LINQ to SortedSet/SortedList out there?
i4o looks very interesting, but it seems to require secondary index collections to improve query performance on the original collection. I just want queries on my sorted collections to run faster by taking advantage of the sorting.
The problem for LINQ is that it can't know the sorted set is ordered in exactly the way the query expects. Since any ordered collection can be created with an IComparer<T>, IComparable<T> or Comparison<T>, there is no way of knowing that > 500000 actually makes sense. Maybe you've got a custom comparer that sorts first by odd/even and then by number; in that case the numeric order would be completely scrambled, and O(n) is required in all cases.
So to be on the safe side, LINQ has to iterate through all elements in the collection, even when it is sorted in some way. The default .Where implementation contains no optimization for ordered collections.
It might be possible to create an optimized version that keeps the existing ordering in mind while iterating, but it would be very difficult to make it work correctly in all cases.
You could create a Between method that uses the GetViewBetween method of SortedSet<T> to return a pre-ordered subset view. Or you could add the standard .Where clause, as you normally would for any non-pre-sorted set.
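Such a Between helper might look like this (a sketch; Between is a hypothetical extension method, while GetViewBetween is the real SortedSet<T> API):

```csharp
using System.Collections.Generic;

static class SortedSetExtensions
{
    // Returns the elements in [lower, upper] as a live view of the set,
    // located in O(log n) via the set's own tree structure rather than a
    // linear Where scan.
    public static SortedSet<T> Between<T>(this SortedSet<T> set, T lower, T upper)
        => set.GetViewBetween(lower, upper);
}
```

Then mySortedSet1.Between(1000000000, 2000000000).Count gives the range count from the first example. (Note that on some framework versions counting the view may still enumerate its elements, so measure on your target runtime.)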
LINQ to SQL and Entity Framework make use of IQueryable and will actually translate your LINQ query to SQL, letting the server handle the indexing, sorting, filtering, etc.