Spot the defect: Bad comparisons, part one
The mutable List<T> class provides an in-place sort method which can take a comparison delegate. It's quite handy to be able to sort a list into order by being able to compare any two elements, but you have to make sure you get it right.
First off, what are the requirements of the comparison delegate? They are clearly documented: the comparison takes two elements and returns a 32 bit signed integer. If the first element is greater than the second then the integer is greater than zero. If the first element is less than the second then the integer is less than zero. If the first element is equal to the second then the integer is zero.
See if you can figure out why each of these comparisons can give bad results.
Comparison #1: Putting on my top hat:
enum Clothes
{
Hat,
Tie,
Socks,
Pocketwatch,
Vest,
Shirt,
Shoes,
Cufflinks,
Gloves,
Tailcoat,
Underpants,
Trousers
}
static int Compare(Clothes x, Clothes y)
{
const int xGreater = 1;
const int yGreater = -1;
// If x has to go on after y then x is the greater
// If y has to go on after x then y is the greater
// Otherwise, they are equal
switch (x)
{
case Clothes.Tie:
if (y == Clothes.Shirt) return xGreater;
break;
case Clothes.Socks:
if (y == Clothes.Shoes) return yGreater;
break;
case Clothes.Pocketwatch:
if (y == Clothes.Shirt || y == Clothes.Vest) return xGreater;
break;
case Clothes.Vest:
if (y == Clothes.Shirt) return xGreater;
if (y == Clothes.Tailcoat || y == Clothes.Pocketwatch) return yGreater;
break;
case Clothes.Shirt:
if (y == Clothes.Tie || y == Clothes.Pocketwatch ||
y == Clothes.Vest || y == Clothes.Cufflinks || y == Clothes.Tailcoat)
return yGreater;
break;
case Clothes.Shoes:
if (y == Clothes.Trousers || y == Clothes.Socks || y == Clothes.Underpants)
return xGreater;
break;
case Clothes.Cufflinks:
if (y == Clothes.Shirt) return xGreater;
break;
case Clothes.Tailcoat:
if (y == Clothes.Vest || y == Clothes.Shirt) return xGreater;
break;
case Clothes.Underpants:
if (y == Clothes.Trousers || y == Clothes.Shoes) return yGreater;
break;
case Clothes.Trousers:
if (y == Clothes.Underpants) return xGreater;
if (y == Clothes.Shoes) return yGreater;
break;
}
return 0;
}
OK, before you read on, can you figure out what the defect here is? It seems perfectly straightforward: if two things have to be ordered then they are ordered, and if not, then we say we don't care by setting them equal.
Does that work?
.
.
.
.
.
.
.
If you actually try it out on a real list of randomly ordered clothes, you'll find that much of the time this sorts the list into a senseless order, not into an order that preserves nice properties like shoes going on after trousers. Why?
An undocumented but extremely important assumption of the sorting algorithm is that the comparison function provides a consistent, total order. Part of providing a total order means that the comparison must preserve the invariant that things equal to the same are equal to each other. This idea that if you don't care about the order of two things then you can call them equal is simply false. In our example Tie is equal to Hat and Hat is equal to Shirt, and therefore the sort algorithm is justified in believing that Tie should be equal to Shirt, but it isn't.
Suppose, for example, the first element is "Hat". A sort algorithm is perfectly justified in scanning the entire list, determining that everything is equal to the first element, and concluding that therefore it must already be sorted. Clearly a list where every element is equal to the first element is sorted! The actual implementation of QuickSort in the BCL is quite complex and has several clever tricks in it that help optimize the algorithm for common cases, such as subsets of the list already being in sorted order. If the comparison is not consistent then it is easy to fool those heuristics into doing the wrong thing. And in fact, some sort algorithms will go into infinite loops or crash horribly if you give them an incomplete comparison function.
The right algorithm for sorting a set with a partial order is topological sort; use the right tool for the job.
Next time: we do the same thing, backwards
Comments
Anonymous
January 19, 2011
The $64K question here is why was this perfectly reasonable assumption not documented in the first place? I imagine many people might find words like "transitive" and "consistent total order" scary, but still.Anonymous
January 19, 2011
Problem with the initial specification in comments: // If x has to go on after y then x is the greater // If y has to go on before x then y is the greater They should both be "before" or "after", yes?Anonymous
January 19, 2011
@Anton. Sorting is Hard, let's go shopping. /barbie The docs call out that it uses QuickSort. wikipedia ( en.wikipedia.org/.../Quicksort ) helpfully identifies it as a comparison sort ( en.wikipedia.org/.../Comparison_sort ) "A comparison sort is a type of sorting algorithm that only reads the list elements through a single abstract comparison operation (often a "less than or equal to" operator) that determines which of two elements should occur first in the final sorted list. The only requirement is that the operator obey two of the properties of a total order: 1. if a ≤ b and b ≤ c then a ≤ c (transitivity) 2. for all a and b, either a ≤ b or b ≤ a (totalness or trichotomy). " If someone can't be bothered to understand even the most basics aspects of an algorithm before they use it I'm not overly forgiving.Anonymous
January 19, 2011
Eric, Your point brings tough memories... I used stl sort algorithms I forgot for which version to sort doubles. And some of the doubles where NaNs, default < was used for comparison and sure enough under some arrangement of doubles I would get sporadic data and sometimes access violation (lucky case)Anonymous
January 20, 2011
The comment has been removedAnonymous
January 20, 2011
Are there bonus points for guessing what's going to be wrong with the code in the next post before it's been posted, based on "Next time: we do the same thing, backwards"? Psychic deblogging?Anonymous
January 20, 2011
So are we going to get to see your JS toposort ported to C#?Anonymous
January 20, 2011
Gabe: replace "var" with "int", "for" with "foreach" and you are almost done with porting :)Anonymous
January 20, 2011
This type of comparison function can also end up with stuff like a<b, b<c, and a=c. So a comes before b, b comes before c, but a and c are equal, so b comes before a.Anonymous
January 20, 2011
... resulting in an infinite loop while the sort function tries to get a before b and b before a at the same time... fun fun fun.Anonymous
January 20, 2011
"In our example Tie is equal to Hat and Hat is equal to Shirt, and therefore the sort algorithm is justified in believing that Tie should be equal to Shirt, but it isn't." I love this analogy - the one tweak I would make is to use Trousers and Underpants as the two things that get mixed up, just because I like the idea of a guy getting into that position while already wearing a top hat.Anonymous
January 20, 2011
I just wrote the following pex test for fun: [PexMethod] public static void Transitive(ClothSort.Clothes x, ClothSort.Clothes y, ClothSort.Clothes z) { PexAssume.IsTrue(ClothSort.Compare(x, y) <= 0); PexAssume.IsTrue(ClothSort.Compare(y, z) <= 0); PexAssert.IsTrue(ClothSort.Compare(x, z) <= 0); } It found 9 unique inputs which violate the transitivity law. example (tie, hat, shirt) are not transitive under equality. The other properties however hold: [PexMethod] public static void Reflexive(ClothSort.Clothes x) { Assert.IsTrue(ClothSort.Compare(x, x) == 0); } [PexMethod] public static void Symmetric(ClothSort.Clothes x, ClothSort.Clothes y) { var cmp1 = ClothSort.Compare(x, y); var cmp2 = ClothSort.Compare(y, x); Assert.IsTrue(cmp1 == 0 && cmp2 == 0 || cmp1 < 0 && cmp2 > 0 || cmp1 > 0 && cmp2 < 0); } I love pex. It was so easy to test this comparison function. research.microsoft.com/.../downloads.aspxAnonymous
January 23, 2011
@tobi Hi was curios about pex and how it works and for my surprise you can try it online, so I just copied pasted the code to the web but it find only one exception, did I do something wrong? (I hope this permanent link works) www.pexforfun.com/default.aspxAnonymous
January 24, 2011
I find it peculiar that you have declared constants xGreater and yGreater. Using 1 and -1 would be better as it would be clearer what value you are returning.Anonymous
February 04, 2011
One trap with such comparisons is that the comparison operators on floating-point types don't provide a total ordering. In particular NaN isn't equal to itself, and isn't either larger or smaller than any number. And when abusing Sort as shuffle you need to make sure you generate the random number only once, and not on each comparison. The browser choice screen fell for this one.