Some Performance Notes on Enumerable LINQ Operators
This post is the continuation of Layering enumerators; I wanted to have a much shorter post, but it seemed to me like this would be much more useful if I had all the notes in a single post.
First, some caveats. This post deals with the LINQ operators and friends as they work on IEnumerable<T>, with the extension methods of the Enumerable class. LINQ to Objects, if you will. Now, I'm not an expert on this library in particular, but I've used it in the past, and I'd like to share some notes on how some decisions may impact performance.
Many of the observations change significantly if you work with IQueryable<T> and databases, because they typically have specialized auxiliary data structures at their disposal. They may not support all methods, though.
Still, it's interesting that many of these consideration apply across a variety of software components that ultimately have some notion of iterator (like XmlReader for example). Also, some operators may work better on lists, as they have random access and keep a count of elements.
Sometimes you might also see changes in performance characteristics depending on whether you use some overload that takes in additional selectors or filter. For example, the implementation of Last() works great on list, as it just picks the last item, but Last(predicate) ends up consuming its complete input.
Without any further ado, here we go.
- Conditional Search
- Peeking
- Stricly Streaming
- Streaming with State
- Consume Sequence, Single Result (Aggregates)
- Consume Sequence, Multiple Results
- Multiple Input Streams, Streaming
- Multiple Input Streams, Fully Consuming One
- Throttled
- Generators
Conditional Search
There are some that look for the first element that satisfies or fails to satisfy a condition. These may consume all their input before returning a result, but it depends on the data.
Method
Description
All
Determines whether all elements of a sequence satisfy a condition.
Any
Determines whether any element of a sequence satisfies a condition.
Contains
Determines whether a sequence contains a specified element by using the default equality comparer.
ElementAt
Returns the element at a specified index in a sequence.
ElementAtOrDefault
Returns the element at a specified index in a sequence or a default value if the index is out of range.
Peeking
These methods look at the first one or two elements and then return a single result.
Method
Description
Any
Determines whether a sequence contains any elements.
First
Returns the first element of a sequence.
FirstOrDefault
Returns the first element of a sequence, or a default value if the sequence contains no elements.
Single
Returns the only element of a sequence, and throws an exception if there is not exactly one element in the sequence.
SingleOrDefault
Returns the only element of a sequence, or a default value if the sequence is empty; this method throws an exception if there is more than one element in the sequence.
Strictly Streaming
These operators consume every element, but don't need to process the whole input before enumerating their own results - that is, they process 'as they go', and thus maintain stream-ability and their resource usage does not depend on the size of the input sequence.
Method
Description
Cast
Converts the elements of an IEnumerable to the specified type.
Concat
Concatenates two sequences.
DefaultIfEmpty
Returns the elements of the specified sequence or the type parameter's default value in a singleton collection if the sequence is empty.
OfType
Filters the elements of an IEnumerable based on a specified type.
Select
Projects each element of a sequence into a new form.
SelectMany
Projects each element of a sequence to an IEnumerable<T> and flattens the resulting sequences into one sequence.
Where
Filters a sequence of values based on a predicate.
Of these operators, SelectMany has the interesting ability to produce *more* elements than its input sequence.
Streaming with State
There is one operator here that streams its results in the sense that it need not consume its input sequence before beginning to yield elements, however it does have 'memory' and uses an intermediate data structure that grows proportial to the size of the input sequence.
Distinct is like Select in that it processes one item at a time in a streaming fashion, but it has to remember whether any item has been seen before and skip it if so (that's how you get the 'Distinct-ness').
If all the elements in the input sequence are different, you'll end up with the same elements held in a temporary collection.
Method
Description
Distinct
Returns distinct elements from a sequence by using the default equality comparer to compare values.
A way to improve this could be to check whether your results are ordered by the same criteria used for distinct-ness, in which case you can keep just a reference to the last value you saw, and skip the values that are the same (so you have to remember at-most-one element rather than at-most-all).
Consume Sequence, Single Result (Aggregates)
There's a group of methods that perform an aggregate calculation over the source and thus must process every element. They consume all their input before returning a single-value result. So there is no "streaming" per-se, but you didn't want a stream of values anyway - presumably you were after that one aggregate value.
Method
Description
Aggregate
Applies an accumulator function over a sequence.
Average
Computes the average of a sequence of T values.
Count
Returns the number of elements in a sequence.
Sum
Computes the sum of a sequence of T values.
Max
Returns the maximum value in a sequence of T values.
Min
Returns the minimum value in a sequence of T values.
LongCount
Returns an Int64 that represents the total number of elements in a sequence.
Despite not being an aggregate, the following two operators also consume all input and return a single result. These work much better on lists, by the way.
Method
Description
Last
Returns the last element of a sequence.
LastOrDefault
Returns the last element of a sequence, or a default value if the sequence contains no elements.
Consume Sequence, Multiple Results
This is a group of methods that consume all the input sequence before returning another sequence of results.
You can figure out that GroupBy has to work this way, as the first and last elements could be in the very first group, which is returned in the very first enumeration step.
Similarly for Reverse, the first item to yield is the last one in the sequence, so the whole sequence needs to be buffered before anything can be returned.
Method
Description
GroupBy
Groups the elements of a sequence according to a specified key selector function.
Reverse
Inverts the order of the elements in a sequence.
The ordering methods are clearly in the same bucket as GroupBy - the last element of the sequence may be the first in the specified order, so everything needs to be buffered before the first element can be yielded.
Method
Description
OrderBy
Sorts the elements of a sequence in ascending order according to a key.
OrderByDescending
Sorts the elements of a sequence in descending order according to a key.
ThenBy
Performs a subsequent ordering of the elements in a sequence in ascending order according to a key.
ThenByDescending
Performs a subsequent ordering of the elements in a sequence in descending order, according to a key.
Multiple Input Streams, Streaming
The only operators that take two input sequences and stream through both of them are SequenceEqual, Union and Zip.
Method
Description
SequenceEqual
Determines whether two sequences are equal by comparing the elements by using the default equality comparer for their type.
Union
Produces the set union of two sequences by using the default equality comparer.
Zip
Merges two sequences by using the specified predicate function.
Union is a bit like Distinct in that it also ends up with a side collection of "everything we've seen so far", to remove duplicates, but it proceses each input sequence in order in a streaming fashion.
Multiple Input Streams, Fully Consuming One
There is a group of methods that take more than one input sequence; typically they complete consume one into some helper structured before doing anything else, and then stream the other.
Method
Description
Except
Produces the set difference of two sequences by using the default equality comparer to compare values.
GroupJoin
Correlates the elements of two sequences based on equality of keys and groups the results. The default equality comparer is used to compare keys.
Intersect
Produces the set intersection of two sequences by using the default equality comparer to compare values.
Join
Correlates the elements of two sequences based on matching keys. The default equality comparer is used to compare keys.
Except is similar to Distinct in that to process each element, a lookaside is used to figure out whether it has been seen before or not. In the Except case, an element from the first input sequence is skipped if it's found in the second sequence. So the operator works by first collecting all distinct values from the second sequence, and then streaming the results of the first, checking that they haven't been seen before.
This has the side effect, by the way, of not returning duplicates for elements that appear many times in the first input sequence. Except, Intersect and Union are all set operations, and so they don't return duplicates.
var a = new int[] { 1, 1, 2, 3 };
var b = new int[] { 2 };
foreach (var n in a.Except(b)) {
Console.WriteLine(n);
}
// Prints: 1, 3
GroupJoin consumes all of the inner enumerator into groups with lookups on the inner keys, then streams through the outer elements picks their keys and returning the groups from the inner sequence.
Intersect is similar to Except in that it consumes the second sequence into a collection and then streams through the first one, checking for presence. The difference is that Intersect yields an element if it has been seen before, rather than skipping it.
Join consumes all of the inner sequence into a group lookup based on the key, then streams the result selector with outer element along with all the items with a matching key.
Throttled
Some operators take direct input into how much they will process.
Method
Description
Skip
Bypasses a specified number of elements in a sequence and then returns the remaining elements.
SkipWhile
Bypasses elements in a sequence as long as a specified condition is true and then returns the remaining elements.
Take
Returns a specified number of contiguous elements from the start of a sequence.
TakeWhile
Returns elements from a sequence as long as a specified condition is true.
Generators
There are a few methods that even't take an input stream, but just generate values.
Method
Description
Empty
Returns an empty IEnumerable <T> that has the specified type argument.
Repeat
Generates a sequence that contains one repeated value.
Range
Generates a sequence of integral numbers within a specified range.
Hopefully this information will come in handy when evaluating when to use the Enumerable methods and when to roll your own for specific scenario and data requirements. At some future point, I hope to be able to talk about how Parallel LINQ extensions can affect some of these.
Enjoy!
Update: fixing intra-post links (hopefully).
Comments
Anonymous
May 06, 2010
SelectMany could also return fewer elements than the input sequence. q.SelectMany(x => new object[] { })Anonymous
May 07, 2010
@Keith - true. I was focusing more on the 'size' aspect, and there are a few operators that can yield less elements than their input (all the ones that filter for example, the sets that remove duplicates, etc), but SelectMany is one of the few that can yield more. If there's something that helps you do less work, that's generally always welcome, in perf and in life :)