Computers are dumb

 

A few short takes today, from questions I've received recently about LINQ in C# 3.0.

The first question was "in the following code, does it really check every single non-negative integer, or does it use the knowledge that once you're beyond ten, you can stop iterating? "

var smallNumbers = Enumerable.Range(0, int.MaxValue).Where(n => n < 10);
foreach (int i in smallNumbers)
Console.WriteLine(i);

The former. You asked LINQ to Objects to apply a predicate to a sequence and that's what it does. If the sequence has two billion elements in it, it runs the predicate two billion times.

You certainly could build your own range object which had a Where method that took an expression tree. Your Where method could then analyze the range and the expression tree in an attempt to build a new range object that iterated a smaller range. Analyzing simple predicates would be pretty easy; more complex predicates would have to be turned back into delegates.

Of course, if it hurts when you do that, don't do that. if you already know the range you want, just restrict it in the range constructor in the first place!

The second question was "which is the preferred coding style? " :

if (0 != customers.Count()) ...

vs

if (customers.Any()) ...

On efficiency grounds alone the latter is far preferable. Suppose you have a box which may contain between zero and two billion pennies. You want to take some action if there are any pennies in the box, but you don't care how many there are, only whether there are zero or more. In that scenario, clearly it makes no sense whatsoever to count them all. The Count method doesn't know that all you care about is whether it's going to return zero or not. You asked it to count, so it counts.

(I am reminded of an old joke: on a bank teller's first day, he is asked to count a stack of twenties and verify that there are one hundred in the stack. He starts counting them from one pile into another: one, two, three, ... but when he gets to fifty, he stops and says to his boss "I'll stop here -- if it's right all the way up to fifty, odds are good it's right the whole way.")

Even leaving the obvious efficiency concern aside, the latter code reads better. It more clearly reflects the intention of the programmer.

The third question was "when iterating through a collection, I noticed that the former code is much faster than the latter code. Any insight as to why? "

myEventLogEntryCollection.CopyTo(myEntriesArray,0);
for(int i=0; i < myEntriesArray.Length, i++) {...}

vs

foreach(EventLogEntry myEntry in myEventLogEntryCollection) {...}

This made no sense to me. Sure, there will be differences. The former code is using more memory because it makes a copy of the collection; initializing that memory is going to take significant time if the collection is large, but I could see how maybe the array access could be faster than the collection iterator in the loop. Perhaps the improvement in iteration is fast enough that for a large collection, the larger initialization cost was made up. But the user was reporting "much faster", not "slightly faster". I asked for more details.

What exactly were the timings? Three seconds vs sixteen seconds. Holy goodness! I had figured we were talking about microseconds of difference here, not actual humans sitting around watching the clock time. What on earth was the user doing? Accessing event logs from a remote machine over a slow network.

Aha! Now it is clear what is going on here. The iteration time is being dominated by the round trip to the remote machine; the former code makes one very expensive remote call that takes three seconds. The latter code makes hundreds of cheaper remote calls, possibly several per loop iteration, that take a fraction of a second each but which add up to more expense in the long run.

My "insights" were therefore:

  • You can sometimes trade increased memory use for decreased time.  
  • The cost of fetching n items by making m requests might be dominated by m, not n.
  • If you don't actually describe the relevant features of the problem, it is impossible to get a correct analysis.

I blogged about these three questions not because I got them all in the same day, but because there is a common thread to the answers: computers are dumb. They do exactly what you tell them. If you want performance savings by eliminating unnecessary work through logical deductions about what work can be eliminated, you're going to have to be the one to do make those deductions!

Comments

  • Anonymous
    May 09, 2008
    The first questioner should look into Take/TakeWhile. Example: var smallNumbers = Enumerable.Range(0, int.MaxValue).TakeWhile(n => n < 10);

  • Anonymous
    May 09, 2008
    "Even leaving the obvious efficiency concern aside, the latter code reads better. It more clearly reflects the intention of the programmer." Isn't that half of the point of LINQ, to allow for a limited degree of functional and declarative programming such that the "what" becomes more important than the "how"? Unless you have a significant, measurable performance difference, write what you mean!

  • Anonymous
    May 09, 2008
    The comment has been removed

  • Anonymous
    May 09, 2008
    @Jacob: Was going to suggest to same, developers just don't know what to use most of the time :(

  • Anonymous
    May 10, 2008
    An english version is available here . Après avoir lu ce post d'Eric Lippert , je me suis rappelé que

  • Anonymous
    May 11, 2008
    "Accessing event logs from a remote machine over a slow network" What kind of developer wouldn't have seen that coming?

  • Anonymous
    May 11, 2008
    A novice? Developers have to start somewhere, you know.

  • Anonymous
    May 11, 2008
    I'm wondering about this one: if (0 != customers.Count()) ... vs if (customers.Any()) ... does Count() always count all the records in the collection, or number of items is stored somewhere?

  • Anonymous
    May 11, 2008
    @B0rG - well, if customers implements IList, then the Enumerable.Count() method takes a shortcut and reads the Count property. But for a general IEnumerable, there's no guarantee that the extension Count() method will find a quicker way to count it than to enumerate the whole collection. And it's perfectly possible to implement an IEnumerable that never terminates, so Count() on a general IEnumerable may simply never return. Calling Count() is essentially asking 'how many times do I have to call MoveNext() on an enumerator before I get an answer of 'false'?'. asking for 'Any()' is simply asking 'does the first call to MoveNext() on an enumerator return true?'

  • Anonymous
    May 11, 2008
    The for-loop iteration is faster I belive because for built in System.Array's the JITer is able to emit efficient machine code that doesn't do range checking for each an every index look-up. I believe for this to work the boundary check must be inline to the 'for', that is to say:   int[] data = new int[] { ... };   for (int i = 0; i < data.Length; i++)      Process(data[i]);   /// Details boring rather than:   int[] data = new int[] { ... };   int length = data.Length;   for (int i = 0; i < length; i++)      Process(data[i]);   // Details boring But like all undocumented optimizations it would dangerous to over rely on this JIT optimization.

  • Anonymous
    May 13, 2008
    An english version is available here . Après avoir lu ce post d'Eric Lippert , je me suis rappelé que

  • Anonymous
    May 22, 2008
    Rather than place the links to the most recent C# team content directly in Community Convergence , I

  • Anonymous
    February 10, 2011
    Some excellent points, especially about "•If you don't actually describe the relevant features of the problem, it is impossible to get a correct analysis.". My personal favorite is when a developer make an "optimization" that increases BOTH memory & Time <eek>