Condividi tramite


Some Performance Notes on Enumerable LINQ Operators

This post is the continuation of Layering enumerators; I wanted to have a much shorter post, but it seemed to me like this would be much more useful if I had all the notes in a single post.

First, some caveats. This post deals with the LINQ operators and friends as they work on IEnumerable<T>, with the extension methods of the Enumerable class. LINQ to Objects, if you will. Now, I'm not an expert on this library in particular, but I've used it in the past, and I'd like to share some notes on how some decisions may impact performance.

Many of the observations change significantly if you work with IQueryable<T> and databases, because they typically have specialized auxiliary data structures at their disposal. They may not support all methods, though.

Still, it's interesting that many of these consideration apply across a variety of software components that ultimately have some notion of iterator (like XmlReader for example). Also, some operators may work better on lists, as they have random access and keep a count of elements.

Sometimes you might also see changes in performance characteristics depending on whether you use some overload that takes in additional selectors or filter. For example, the implementation of Last() works great on list, as it just picks the last item, but Last(predicate) ends up consuming its complete input.

Without any further ado, here we go.

Conditional Search
There are some that look for the first element that satisfies or fails to satisfy a condition. These may consume all their input before returning a result, but it depends on the data.

Method

Description

All

Determines whether all elements of a sequence satisfy a condition.

Any

Determines whether any element of a sequence satisfies a condition.

Contains

Determines whether a sequence contains a specified element by using the default equality comparer.

ElementAt

Returns the element at a specified index in a sequence.

ElementAtOrDefault

Returns the element at a specified index in a sequence or a default value if the index is out of range.

Peeking
These methods look at the first one or two elements and then return a single result.

Method

Description

Any

Determines whether a sequence contains any elements.

First

Returns the first element of a sequence.

FirstOrDefault

Returns the first element of a sequence, or a default value if the sequence contains no elements.

Single

Returns the only element of a sequence, and throws an exception if there is not exactly one element in the sequence.

SingleOrDefault

Returns the only element of a sequence, or a default value if the sequence is empty; this method throws an exception if there is more than one element in the sequence.

Strictly Streaming
These operators consume every element, but don't need to process the whole input before enumerating their own results - that is, they process 'as they go', and thus maintain stream-ability and their resource usage does not depend on the size of the input sequence.

Method

Description

Cast

Converts the elements of an IEnumerable to the specified type.

Concat

Concatenates two sequences.

DefaultIfEmpty

Returns the elements of the specified sequence or the type parameter's default value in a singleton collection if the sequence is empty.

OfType

Filters the elements of an IEnumerable based on a specified type.

Select

Projects each element of a sequence into a new form.

SelectMany

Projects each element of a sequence to an IEnumerable<T> and flattens the resulting sequences into one sequence.

Where

Filters a sequence of values based on a predicate.

Of these operators, SelectMany has the interesting ability to produce *more* elements than its input sequence.

Streaming with State
There is one operator here that streams its results in the sense that it need not consume its input sequence before beginning to yield elements, however it does have 'memory' and uses an intermediate data structure that grows proportial to the size of the input sequence.

Distinct is like Select in that it processes one item at a time in a streaming fashion, but it has to remember whether any item has been seen before and skip it if so (that's how you get the 'Distinct-ness').

If all the elements in the input sequence are different, you'll end up with the same elements held in a temporary collection.

Method

Description

Distinct

Returns distinct elements from a sequence by using the default equality comparer to compare values.

A way to improve this could be to check whether your results are ordered by the same criteria used for distinct-ness, in which case you can keep just a reference to the last value you saw, and skip the values that are the same (so you have to remember at-most-one element rather than at-most-all).

Consume Sequence, Single Result (Aggregates)
There's a group of methods that perform an aggregate calculation over the source and thus must process every element. They consume all their input before returning a single-value result. So there is no "streaming" per-se, but you didn't want a stream of values anyway - presumably you were after that one aggregate value.

Method

Description

Aggregate

Applies an accumulator function over a sequence.

Average

Computes the average of a sequence of T values.

Count

Returns the number of elements in a sequence.

Sum

Computes the sum of a sequence of T values.

Max

Returns the maximum value in a sequence of T values.

Min

Returns the minimum value in a sequence of T values.

LongCount

Returns an Int64 that represents the total number of elements in a sequence.

Despite not being an aggregate, the following two operators also consume all input and return a single result. These work much better on lists, by the way.

Method

Description

Last

Returns the last element of a sequence.

LastOrDefault

Returns the last element of a sequence, or a default value if the sequence contains no elements.

Consume Sequence, Multiple Results
This is a group of methods that consume all the input sequence before returning another sequence of results.

You can figure out that GroupBy has to work this way, as the first and last elements could be in the very first group, which is returned in the very first enumeration step.

Similarly for Reverse, the first item to yield is the last one in the sequence, so the whole sequence needs to be buffered before anything can be returned.

Method

Description

GroupBy

Groups the elements of a sequence according to a specified key selector function.

Reverse

Inverts the order of the elements in a sequence.

The ordering methods are clearly in the same bucket as GroupBy - the last element of the sequence may be the first in the specified order, so everything needs to be buffered before the first element can be yielded.

Method

Description

OrderBy

Sorts the elements of a sequence in ascending order according to a key.

OrderByDescending

Sorts the elements of a sequence in descending order according to a key.

ThenBy

Performs a subsequent ordering of the elements in a sequence in ascending order according to a key.

ThenByDescending

Performs a subsequent ordering of the elements in a sequence in descending order, according to a key.

Multiple Input Streams, Streaming
The only operators that take two input sequences and stream through both of them are SequenceEqual, Union and Zip.

Method

Description

SequenceEqual

Determines whether two sequences are equal by comparing the elements by using the default equality comparer for their type.

Union

Produces the set union of two sequences by using the default equality comparer.

Zip

Merges two sequences by using the specified predicate function.

Union is a bit like Distinct in that it also ends up with a side collection of "everything we've seen so far", to remove duplicates, but it proceses each input sequence in order in a streaming fashion.

Multiple Input Streams, Fully Consuming One
There is a group of methods that take more than one input sequence; typically they complete consume one into some helper structured before doing anything else, and then stream the other.

Method

Description

Except

Produces the set difference of two sequences by using the default equality comparer to compare values.

GroupJoin

Correlates the elements of two sequences based on equality of keys and groups the results. The default equality comparer is used to compare keys.

Intersect

Produces the set intersection of two sequences by using the default equality comparer to compare values.

Join

Correlates the elements of two sequences based on matching keys. The default equality comparer is used to compare keys.

Except is similar to Distinct in that to process each element, a lookaside is used to figure out whether it has been seen before or not. In the Except case, an element from the first input sequence is skipped if it's found in the second sequence. So the operator works by first collecting all distinct values from the second sequence, and then streaming the results of the first, checking that they haven't been seen before.

This has the side effect, by the way, of not returning duplicates for elements that appear many times in the first input sequence. Except, Intersect and Union are all set operations, and so they don't return duplicates.

var a = new int[] { 1, 1, 2, 3 };
var b = new int[] { 2 };
foreach (var n in a.Except(b)) {
  Console.WriteLine(n);
}
// Prints: 1, 3

GroupJoin consumes all of the inner enumerator into groups with lookups on the inner keys, then streams through the outer elements picks their keys and returning the groups from the inner sequence.

Intersect is similar to Except in that it consumes the second sequence into a collection and then streams through the first one, checking for presence. The difference is that Intersect yields an element if it has been seen before, rather than skipping it.

Join consumes all of the inner sequence into a group lookup based on the key, then streams the result selector with outer element along with all the items with a matching key.

Throttled
Some operators take direct input into how much they will process.

Method

Description

Skip

Bypasses a specified number of elements in a sequence and then returns the remaining elements.

SkipWhile

Bypasses elements in a sequence as long as a specified condition is true and then returns the remaining elements.

Take

Returns a specified number of contiguous elements from the start of a sequence.

TakeWhile

Returns elements from a sequence as long as a specified condition is true.

Generators
There are a few methods that even't take an input stream, but just generate values.

Method

Description

Empty

Returns an empty IEnumerable <T> that has the specified type argument.

Repeat

Generates a sequence that contains one repeated value.

Range

Generates a sequence of integral numbers within a specified range.

Hopefully this information will come in handy when evaluating when to use the Enumerable methods and when to roll your own for specific scenario and data requirements. At some future point, I hope to be able to talk about how Parallel LINQ extensions can affect some of these.

Enjoy!

Update: fixing intra-post links (hopefully).

Comments

  • Anonymous
    May 06, 2010
    SelectMany could also return fewer elements than the input sequence. q.SelectMany(x => new object[] { })

  • Anonymous
    May 07, 2010
    @Keith - true. I was focusing more on the 'size' aspect, and there are a few operators that can yield less elements than their input (all the ones that filter for example, the sets that remove duplicates, etc), but SelectMany is one of the few that can yield more. If there's something that helps you do less work, that's generally always welcome, in perf and in life :)