More on PerfProductivity: The devil is in the details...
I enjoyed reading (and forwarding) the comments on my recent post on the productivity performance tradeoff… Many of you (rightly so) pointed out that the details really mater in such a discussion. So I picked one that we are struggling with right now. And I think it is generally interesting to see how enumerator works in List<T>…
Productivity:
List<T> maintains the one of the same invariants that ArrayList offers: You are not allowed to modify the list while it is being enumerated. We do this in order to cheaply provide a level of error checking. It is too expensive for List<T> to be resilient to such changes so we do just enough runtime checks to ensure the invariant is enforced. The common scenario this helps with is:
List<string> l = new List<string>();
l.Add("Joel");
l.Add("KCwalina");
l.Add("Kit");
l.Add("Peter");
l.Add("Joel");
l.Add("Ahmed");
l.Add("Joe");
foreach (string name in l)
{
if (name.Contains("K"))
{
l.Remove(name); //This is not allowed
}
}
With the checks you get an InvalidOperationException in the loop iteration right after Remove is called, without the checks you would get unpredictable results based on if you have enumerated passed the element or not. These are very subtle bugs to find. If you have ever gotten this exception you can appreciate that it is easy to get into this case without meaning to. As a test I took out the check that enforces this invariant and I was surprised that I got a NullReferenceException on the 7th (last) iteration through the loop on the if-branch. Any idea why? I for one would find this a subtle bug to go find-and-fix.
Performance:
Of course, this check comes at a cost. To implement this invariant we use a “version stamp” on each instance of List<T>… Every time you modify the list we increment the version stamp. When you get the enumerator (or the foreach statement does it on our behalf) the enumerator stores that version stamp away. Then each time you (or the foreach statement on your behalf) calls MoveNext() the snapped version stamp is checked against the current version stored in the list being enumerated. If they are different we throw. Here is the code for MoveNext() from the List<T> enumerator:
public bool MoveNext()
{
if (version != list._version) throw new InvalidOperationException();
if (index == endIndex)
return false;
index++;
return true;
}
That is an extra if-branch each time through the loop. For a small number of items the overhead of setting up the loop dwarfs this cost, but for large list this check does show up, although I have never seen it show up top on the list in a real-world scenario.
Feedback
So, what do you think? Does the benefit of checking the invariant at runtime outweigh its cost? If we were able to do this check only if you are running with the debugger enabled in VS would that be good-enough or is there value in having this check always on?
Comments
- Anonymous
June 13, 2004
I believe the beneift of checking outweighs the cost in this scenario for two reasons. First, no matter how well you write the documentation there will be a signifigant percentage of people who do not realize this invariant exists. Secondly, I'll learn pretty quickly about this invariant if I write a foreach statement like the one in your first example. I've learned it, and I'll never try to modify the collection inside a foreach statement again. Then I make my program multi-threaded, and once in a blue moon a thread does modify the collection while a second thread is iterating. In this case the check could save a large amount of debugging effort.
I believe it has to be consistent in debug and release builds. Turning off asserts for C++ production code always made me uneasy. Yes, the error message meant nothing to the end user, but at least the program didn't throw up a screen full of numbers which looked normal but meant nothing. - Anonymous
June 13, 2004
A design philosophy that has worked for me in the past is : if a design decision fork has two equally good branches, it's probably better to keep the decision open and allow the final user to decide. (when possible !! )
In this example, I'd take the check out of the loop and expose a property - HasChanged - which will allow the developer to explicitly check within the loop, if the object has changed since the enumeration started.
This works as long as developer productivity doesn't become a synonym for developer ignorance - allowing developers to get away with not knowing the behaviour of the objects they are using. - Anonymous
June 13, 2004
I believe in this case legacy reigns. You can't modify the behavior to remove the checking because it already exists in other API's and should stay there. Now, you mentioned getting a null reference exception, which is obvious in your scenario, because the endindex item is getting cached because you are assuming the length of the list isn't going to change. Even worse, if you did a Remove and a TrimToSize, you'd eventually get an OOB exception.
Now, that said, an invariant enumerator would have to not cache endindex and instead use the .Count property (or most liekly the private count field) each time through MoveNext. Else you can't safely enumerate even while changes are occuring. Is changing the enumerator to use .Count intead of an endindex going to slow it down sufficiently that we should just be checking the version anyway?
I'd say make an GetUncheckedEnumerator method for power users, but that won't get enabled through C#'s foreach statement. As that is the case, it is worthless. If I want enumeration of a List<T> I'll just use a for loop. Performance means for loop, Productivity means foreach. So I'm not seeing the trade-off anymore, only a choice. - Anonymous
June 13, 2004
I'm not an expert in .NET, but isn't it possible to specify the code to generate by using special attributes for the variable declaration. Would this be too hard and impractical to implement? Or .NET attributes were not meant to be used this way to customize compiler output? Or it's generally bad to allow developers choose to skip checking in order to get faster code?
In any case I would think turning check off in release builds could work, but then most likely you would have to find a good way to notify all developers when such change in the output code takes place. Of course documentation will outline this "feature", but I would not be very surprised if not every single developer examined the documentation thoroughly. And people tend to forget anyway. And since this could potentially change the expected behavior of the application people will have to be aware of this taking place. So imho the biggest problem would be to get the word to the masses. And of course there is the multithreaded problem described earlier. - Anonymous
June 13, 2004
Default the optimization to on perhaps, but expost it (and others) as compiler switches. In this case, for those that want that extra speed and know they are using the List correctly (or are willing to deal with the ramifications), then let them. - Anonymous
June 13, 2004
Think it is good to have that catch, that it is one of the subtle bugs developers usually fall in, and having an exception can save you a lot of debugging time. - Anonymous
June 13, 2004
Take Outs: The Omega Installment - Anonymous
June 13, 2004
The comment has been removed - Anonymous
June 13, 2004
The comment has been removed - Anonymous
June 14, 2004
Keep the check for productivity, use the for loop for performance if necessary. Also to get around the bugs with removing elements as you walk a collection, just walk it in the reverse direction. - Anonymous
June 14, 2004
While I thank you for a specific case, I think this question is moot.
We already have the choice (as pointed out above) :
1. Use foreach(...). I'm relying on a shortcut to produce easy code without having to take control of all the details.
2. Use for(...). I'm taking control of the details, and I know enough about how the class (List<t> in this case) works to enumerate it myself. I'm doing that because I understand the issues, and need the benefits that all this extra work will get me.
It seems to me that developers in case one are clearly saying "productivity for me, please", and are willing to take a small perf hit. In case #2 the developer is saying "I know the risks, give me control", and can address thier own perf issues.
I'd be more interested in an example you have where we can't have it both ways. I think the discussion may be a bit more heated. Personally, when choosing between perf and productivity, I want both.. :) - Anonymous
June 14, 2004
I agree essentially with Justin. We created some base collection classes in VG.net and had to make the exact same decision. We decided to remove the version number, not because it makes the loop slower, but because it makes every instance of the collection larger. This is a more important factor. That version number is not retaining useful information most of the time, and is wasteful. Why force everyone to store it?
As Justin points out, you can eliminate any possible exceptions by making the enumerator more robust, checking to see if the Count changed. This does not prevent changes to the list, but actually adds flexibility -- limited changes might be allowed. If people really want to change the list and iterate at the same time, they can always use a copy, or for (int i = 0; i <....).
So put error checking in the enumerator, and let users use the "for" to get max speed.
This argument applies to an ArrayList type of class, not a true linked list. - Anonymous
June 14, 2004
Correction: in the enumerator, don't check to see if Count changed -- just check to see if the enumerator is going out of bounds, by using Count. - Anonymous
June 15, 2004
Let's see... if we extend the performance argument, I should take out all the error checking in my code and all will be good and lightning fast as well. Sounds good to me; no more writing those pesky if statements.
I just don't buy it. For our customers, stability and security go hand-in-glove. Without the error checking, you miss the obvious logic error (which is based on implementation details). Would the error be found in testing? Maybe - depends on the tests performed. Would the error be found in the field? Definitely. That's not a good place or time for discovering nuances about the underying library.
If statements comparing values in memory for error checking are essentually free. Thank the hard working engineers at Intel for their instruction pipeline optimizations. Please sprinkle on liberally. - Anonymous
June 16, 2004
I agree with Poust:
Unless you can formalize what it means to change an underlying collection while iterating over it, then you should probably throw as an unsupported operation.
For example, if I am iterating over a tree then I expect to get every element in the tree. If someone removes an element which causes a rebalance then it's almost certainly gauranteed that the iterator will be in a hosed position at that point and that continuing the iteration will not give me all the correc telements (it might return an old element again, or it may miss elements). Because of this it's better to throw and make the consumer realize that they are almost definitely not doing the right thing.
If you need to expose mutability semantics that work in the presense of iteration, then it needs to be something that works hand-in-hand with the iterator. For that reason I see those operations as being better suited on the iterator. If you want to add an element, tell the iterator. This will ensure that after the operation the iterator and underlying collection will be in a sane state. - Anonymous
June 17, 2004
Keep the check in. If we've got performance problems that need work, they are likely far bigger than this little check. - Anonymous
July 19, 2006
Don't feel bad if you don't know what the Halloween problem is.&nbsp; According to the Transact SQL Blog,...