次の方法で共有


I can make it arbitrarily fast if I don’t actually have to make it work.

Digging way back into my pre-Microsoft days, I was recently reminded of a story that I believe was told to me by Mary Shaw back when I took her Computer Optimization class at Carnegie-Mellon…

During the class, Mary told an anecdote about a developer “Sue” who found a bug in another developer’s “Joe” code that “Joe” introduced with a performance optimization.  When “Sue” pointed the bug out to “Joe”, his response was “Oops, but it’s WAY faster with the bug”.  “Sue” exploded “If it doesn’t have to be correct, I can calculate the result in 0 time!” [1].

Immediately after telling this anecdote, she discussed a contest that the CS faculty held for the graduate students every year.  Each year the CS faculty posed a problem to the graduate students with a prize awarded to the grad student who came up with the most efficient (fastest) solution to the problem.  She then assigned the exact same problem to us:

“Given a copy of the “Declaration of Independence”, calculate the 10 most common words in the document”

We all went off and built programs to parse the words in the document, inserting them into a tree (tracking usage) and read off the 10 most frequent words.  The next assignment was “Now make it fast – the 5 fastest apps get an ‘A’, the next 5 get a ‘B’, etc.”

So everyone in the class (except me :)) went out and rewrote their apps to use a hash table so that their insertion time was constant and then they optimized the heck out of their hash tables[2].

After our class had our turn, Mary shared the results of what happened when the CS grad students were presented with the exact same problem.

Most of them basically did what most of the students in my class did – built hash tables and tweaked them.  But a couple of results stood out.

  • The first one simply hard coded the 10 most common words in their app and printed them out.  This was disqualified because it was perceived as breaking the rules.
  • The next one was quite clever.  The grad student in question realized that they could write the program much faster if they wrote it in assembly language.  But the rules of the contest required that they use Pascal for the program.  So the grad student essentially created an array on the stack and introduced a buffer overflow and he loaded his assembly language program into the buffer and used that as a way of getting his assembly language version of the program to run.  IIRC he wasn’t disqualified but he didn’t win because he circumvented the rules (I’m not sure, it’s been more than a quarter century since Mary told the class this story).
  • The winning entry was even more clever.  He realized that he didn’t actually need to track all the words in the document.  Instead he decided to track only some of the words in the document in a fixed array.  His logic was that each of the 10 most frequent words were likely to appear in the first <n> words in the document so all he needed to do was to figure out what "”n” is and he’d be golden.

 

So the moral of the story is “Yes, if it doesn’t have to be correct, you can calculate the response in 0 time.  But sometimes it’s ok to guess and if you guess right, you can get a huge performance benefit from the result”. 

 

 

[1] This anecdote might also come from Jon L. Bentley’s “Writing Efficient Programs”, I’ll be honest and say that I don’t remember where I heard it (but it makes a great introduction to the subsequent story).

[2] I was stubborn and decided to take my binary tree program and make it as efficient as possible but keep the basic structure of the solution (for example, instead of comparing strings, I calculated a hash for the string and compared the hashes to determine if strings matched).  I don’t remember if I was in the top 5 but I was certainly in the top 10.  I do know that my program beat out most of the hash table based solutions.

Comments

  • Anonymous
    September 29, 2009
    The comment has been removed

  • Anonymous
    September 29, 2009
    The comment has been removed

  • Anonymous
    September 29, 2009
    Without knowing about the text in advance, the winning entry can't know that the input text doesn't end with a new word repeated 1000 times. IMO it still falls into the "fast but wrong" category. If you were allowed to write the code with assumptions about the input text then the entries which hardcoded the results should have been allowed (at least if the question explicitly said that the Declaration of Independence was the one and only input). I guess "wrong" is a fuzzy thing. Maybe the winning entry is "right often enough" and Joe's bug that caused Sue to explode was more egregious than that.

  • Anonymous
    September 29, 2009
    Interesting story about calculating the wrong answer in 0 time. I've read something similar in Kernighan and Plauger's book on programming style. There they mention how many programmers don't use bound-checking compilers because they're inefficient. KP then quip: "Presumably this means that it is vital to get the wrong answers quickly." I loved that sentence!

  • Anonymous
    September 29, 2009
    This is a somewhat odd contest, since the winner is the person who most accurately guesses judges' interpretation of the ambiguous rules. They should have either required a general solution, or stated the rules in an otherwise non-ambiguous way. For example, they could have said that the program must get 8 out of the top 10 words right on a body of text unknown to the students. But, the moral you concluded with is still a good moral. :-)

  • Anonymous
    September 29, 2009
    So if the Declaration had read "When in the course of human events... inalienable rights... etc... we mutually pledge to each other our lives, our fortunes and our sacred potato potato potato potato potato potato potato potato potato potato potato potato potato potato potato potato potato potato potato potato", the winning entry would in fact not have won. Also, the United States might look very different today, but that's another matter. If you argue that the nature of the input is part of the specification, it's hard to argue that the guy who hard-coded the answer was breaking the rules. The problem would have to be formulated much more carefully to make the winning entry unambiguously adhering to the rules and the hard-coding entry unambiguously in violation of them. After all, you can make things arbitrarily fast if you can redefine "make it work".

  • Anonymous
    September 29, 2009
    The comment has been removed

  • Anonymous
    September 29, 2009
    The comment has been removed

  • Anonymous
    September 30, 2009
    This reminds me of recurring tests that were given at my school. It was a basic programming test, with exercises of increasing difficult, starting at such simple things as "reverse a string", up to moderately difficult ones like "count the number of islands of the same character in an ASCII image". The test was administered every month, with different exercises each time, to all students that hadn't gottent yet the requested mark. Now, the subject of each exercise had a small section with the list of allowed external functions, which generally amounted to "nothing except write() to output the result in the console". Some exercises allowed malloc(), but after a few months, everybody was used to seeing "nothing" and didn't read the section anymore. One day, the subject of the last exercise was "Implement the MD5 algorithm", followed by a copy-paste of the specification of the algorithm. You could see the expression change on the face of most students as they read the subject and started to despair... except for a me and a few other students that had read the "allowed functions" section of the subject and started grinning. That day, it read "all of libc". It took me one minute to type "man 3 md5" and write the three calls to MD5_Init/MD5_Update/MD5_Final. Now that was an easy subject... (I got the points for it of course !) The next month the MD5 exercise was back, but with a "nothing" limitation again.

  • Anonymous
    September 30, 2009
    AKA "If I can't be right I might as well be fast."

  • Anonymous
    September 30, 2009
    I'd be pretty upset if my hard coded program was disqualified for breaking the rules while the winning program determined "n" in advance, unless the programmer discovered a value of "n" that worked for all documents.  This seems rather unlikely in light of potato potato potato potato potato potato potato potato potato.

  • Anonymous
    September 30, 2009
    The comment has been removed

  • Anonymous
    September 30, 2009
    The comment has been removed

  • Anonymous
    September 30, 2009
    The thing that you should remember about running programming contests is that you need to give inputs that are likely to cause the programs to fail. Students often struggle with error checking and edge cases. I'd bet most of the submissions would have fallen apart if the input had fewer than ten words.

  • Anonymous
    October 01, 2009
    The comment has been removed

  • Anonymous
    October 02, 2009
    The point of the excercise is to return the top 10 used words in the Decl. as quickly as possible. I'd argue he's the -only- one who understood the point of the exercise.

  • Anonymous
    October 02, 2009
    This reminds me of a solution I did in college. The instructor asked us to write our own brand new sort algorithm. Most students changed bubble sort of quick sort etc... I turned in what I learned later was the six pack sort. For the array, Random the whole array, check if each array variable is less then it's neighbour. If not, Random again. The instructor laughed and gave me an A, due to it was possible to be more efficient then quick sort, but in reality could never ever finish.

  • Anonymous
    October 04, 2009
    This reminds me of the debate about voting machines.  There always seems to be a lobby group who insist on getting election results quickly and aren't too worried about accuracy.  Given that specification, I always recommend a random number generator; not only is it fast, but it's also much more efficient than actual voting. :-)

  • Anonymous
    October 04, 2009
    @Scott: that's yet another good illustration of the pitfalls of specifications, because "the top 10 most common words" is a meaningless phrase when you have less than 10 words to create that list from. An argument can be made that the programs would be allowed to simply fail in that case (in whatever way they choose), since there's no correct answer. Of course, it's trivial to extend the concept of a "top 10" to allow for less entries in such a case, or to reformulate the problem so that the problematic "top 10" doesn't appear. You (and many others) probably expect the programmer to do this on his own initiative, since nobody likes a failing program if there's a reasonable way to make it not fail. In fact, I'd say the most valuable thing to take away from this exercise is not the actual algorithm but an object lesson in how specifications should be done, can be done, will be done and how the pragmatic programmer has to cope with reality. Programming is not mechanically translating specifications into programs -- that's the compiler's job.

  • Anonymous
    October 05, 2009
    Hmm, hard-coding the results is exactly what I'd do in 'live' code, given that specification. I use that technique occasionally. If I can convert a rather complicated algorithm into a lookup table, that's usually a win.  So I write a program(1) to read the raw data and write out a C/C++ structure; this gets compiled into program(2). Program(1) will be executed once so efficiency is not a concern. Program(2) is nice and fast because it has a simpler algorithm. More skilled persons that I might write program(1) with C++ template metaprogramming.

  • Anonymous
    October 07, 2009
    A classic challenge problem ... see Bentley, J.,  Knuth, D., and McIlroy, D. (1986) Programming pearls: a literate program Communications of the ACM 29(6) (direct link http://cacm.acm.org/magazines/1986/6/10222-programming-pearls/pdf?dl=no). I still lean toward the shell script. Even though the run time is longer, the development time is very fast.

  • Anonymous
    October 07, 2009
    The comment has been removed

  • Anonymous
    October 07, 2009
    Of note, guessing is exactly the same thing as "getting it wrong most of the time." Guessing may make the code faster, but it hardly makes it any more useful.

  • Anonymous
    October 15, 2009
    if (input.size < SOME_THRESHOLD) {  use_lookup_table(input); // hard to be faster than that } else {  use_scalable_but_more_complicated_solution(...); } In the Declaration of Independence case, the size of the input is known. I would have been interested to see a grading scale that tested performance across a range of inputs, some of which were very large (War and Peace comes to mind.)

  • Anonymous
    October 15, 2009
    The comment has been removed

  • Anonymous
    November 10, 2009
    That exact same story and almost the punchline occurs in Code Complete, two devs arguing over the speed of an algorithm.  "Your code takes x seconds per input, mine is much faster."  "Yes, but your code doesn't work, if mine doesn't I can have it return instantly."  Or something to that effect.  I wonder if this is some sort of development story archetype.

  • Anonymous
    December 14, 2009
    I'd argue that, based on the information stated here, the hard-coded answer is the right solution. If they wanted something different, they should have asked, "Given any text document..." I remember a million years ago on the ACM programming team, a problem was, "Print the decimal value of 80 factorial." On then-current hardware, it was impossible to do this within the two-minute time limit. The right answer was to do it quick and dirty, let it run, and then submit a program that just prints the (hard-coded) answer.