Unintended consequences of adding APIs to the system

Yesterday, I wrote about a trick to reduce the number of bits in a number by one.

It turns out that I've only ever had one opportunity to use this trick (although I ran into an instance of it when code reviewing some stuff the other day), back when I was writing the NT 3.1 network filesystem.

One of the things that makes writing a network client so hard is that you need to map legacy versions of your network protocol to the semantics that your customers require.  In this case, I needed to map between the "how much disk space is free on the drive" SMB request and the NT filesystem query that retrieves the amount of disk space on the drive.

It turns out that the legacy servers returned this information in terms of bytes per sector, sectors per cluster and clusters per disk (because that was how MS-DOS maintained that information).

NT maintained the information in a different form, and the conversion between the two forms required that I divide the interim results by bytes per sector.

Well, if the bytes per sector was a power of two (which it almost always was), I could do the division trivially (by shifting the interim result right by the base-2 logarithm of the sector size).
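A minimal sketch of that fast path, written in modern C for brevity (the `uint64_t` type used here is exactly what the compilers of the time lacked). The function names are hypothetical and this is not the original filesystem code; it only illustrates the power-of-two test and the shift:

```c
#include <stdint.h>

/* Returns log2(value) if value is a power of two, or -1 otherwise.
   (value & (value - 1)) clears the lowest set bit, so the result is
   zero only when exactly one bit was set. */
int log2_if_power_of_two(uint32_t value)
{
    int shift = 0;

    if (value == 0 || (value & (value - 1)) != 0)
        return -1;
    while ((value >>= 1) != 0)
        shift++;
    return shift;
}

/* Divides total by bytes_per_sector, shifting when the sector size is a
   power of two. The caller is assumed to pass a nonzero sector size. */
uint64_t divide_by_sector_size(uint64_t total, uint32_t bytes_per_sector)
{
    int shift = log2_if_power_of_two(bytes_per_sector);

    if (shift >= 0)
        return total >> shift;           /* fast path: a single shift */
    return total / bytes_per_sector;     /* slow path: a real division */
}
```

The slow path is the only place a true 64-bit by 32-bit division is needed.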

Unfortunately, if the bytes per sector was NOT a power of two, I was stuck with a division.  Since the interim value was a 64bit number, that meant that I had to do a 64bit by 32bit division.  At the time I was writing the code, the compilers didn't have support for 64bit arithmetic; for 64bit math, we had a number of internal "RTL" functions that would do the math.  We had functions for addition, subtraction and multiplication, but nobody had ever written a version that did division.

I'd never done numeric processing before, but I needed the function, so...  I went to Dave Cutler (since he knows everything) and asked him how to do the math.  He told me, and I trotted off to implement his algorithm.

Since the code was only used in a special case that was likely never to occur outside of the test labs, I didn't particularly spend a lot of time optimizing the function - I faithfully coded the algorithm in C - two versions, one which did a 64x64 division, another that did a 64x32 division.  I could have written an assembly language version of the 64x32 version (since the processor supported 64x32 division natively), but we were REALLY concerned about portability and since it wasn't going to be called that often, I went with the more portable solution (and I'd been told that if I ever wrote stuff in assembly language, I'd have my thumbs slowly roasted over a red hot hibachi).
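For illustration, here is one way a portable 64-by-32 division can be written without any native 64-bit type: plain bit-by-bit (restoring) long division over a value stored as two 32-bit halves. This is a sketch under stated assumptions, not the routine described above; the post doesn't say which algorithm was used, and the `u64_parts` type and `divide_64_by_32` name are made up for the example.

```c
#include <stdint.h>

/* Hypothetical 64-bit value stored as two 32-bit halves, for a compiler
   with no native 64-bit integer type. */
typedef struct {
    uint32_t low;
    uint32_t high;
} u64_parts;

/* Divides a 64-bit dividend by a nonzero 32-bit divisor, one bit at a
   time. Returns the quotient and stores the remainder in *remainder. */
u64_parts divide_64_by_32(u64_parts dividend, uint32_t divisor,
                          uint32_t *remainder)
{
    u64_parts quotient = {0, 0};
    uint32_t rem = 0;
    int bit;

    for (bit = 63; bit >= 0; bit--) {
        /* Remember the bit that is about to be shifted out of rem. */
        uint32_t carry = rem >> 31;
        uint32_t next = (bit >= 32) ? (dividend.high >> (bit - 32)) & 1u
                                    : (dividend.low >> bit) & 1u;

        /* Bring the next dividend bit down into the remainder. */
        rem = (rem << 1) | next;

        /* If a bit was carried out, the true remainder is 2^32 + rem,
           which always exceeds the divisor; the unsigned subtraction
           wraps to the correct 32-bit result in that case. */
        if (carry || rem >= divisor) {
            rem -= divisor;
            if (bit >= 32)
                quotient.high |= 1u << (bit - 32);
            else
                quotient.low |= 1u << bit;
        }
    }

    *remainder = rem;
    return quotient;
}
```

With 64 loop iterations per call, a routine in this style is perfectly adequate for a rarely taken code path and painful inside an inner loop.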

So I wrote the API, wrote enough unit tests to confirm to myself that it worked, checked it in as a new RTL API and went away satisfied that I'd done my first piece of numerical processing.  I was SO happy :)

At this point, anyone reading this post who was on the NT team at the time is starting to see where this is going.

 

About three years later (long after I'd left NT and moved on to Exchange), I heard a story about something that happened at some point after I'd left the team.  You see, Dave spent a fair amount of time during the NT 3.5 to NT 4 period doing analysis of the bottlenecks in the system and looking for ways of making the system faster.

At some point, his investigations brought him to (you guessed it) my little division functions.  It turns out that one of the core components in the system (I think it was GDI, but I'm not sure) decided that they needed to divide a 64bit number by a 32bit number.

Because the compiler didn't have native support for 64bit arithmetic, they went looking through the same RTL functions I did, and they discovered my 64x32 division routine.  So they started using the routine.

In their innermost loop.

All of a sudden, the silly routine I'd written to solve a problem that would never be seen in real life became a part of the inner loop of a performance critical function inside the OS.

 

When Cutler discovered this bottleneck, he looked at the code in my little routine and he went through the roof.  I understand he started cursing my name repeatedly before he rewrote it in assembly language.

 

So the moral of the story is: When you're writing code that's going into a system library, make darned sure that it's written to be as performant as humanly possible, because you never know if someone's going to find your one-off piece of code and put it in their innermost loop.

Comments

  • Anonymous
    October 14, 2005
    You should have added a comment to the code:

    /* Don't use this function */

    :)
  • Anonymous
    October 14, 2005
    Larry,

    I know this is really off topic, but I used to work with VAXes and I went to the PDC where NT was launched. I saw Dave and a few of the other DEC/Microsoft guys. (I think the guy who worked on device drivers was called Mark L.)

    How many of the original 13-14 Ex-DEC people that Dave brought to Microsoft are still there?
  • Anonymous
    October 14, 2005
    Now on topic...

    One of the key phrases I use to judge ideas is "Wouldn't it be cool if..." NO IT WOULDN'T!!!

    Your story reminds me of another phrase I should add to my list. "But nobody would ever use it for/like that."

    Right...

    Many of the problems I run into are those new and creative ways people find to use existing software.
  • Anonymous
    October 14, 2005
    That's a great story.

    Dave Cutler has quite a reputation. Did people on the NT team - even MS veterans such as yourself - in any way feel intimidated about going to Dave to ask for help?
  • Anonymous
    October 14, 2005
    I suppose Dave didn't remember telling you NOT to write it in assembly when he was cursing your name.
  • Anonymous
    October 14, 2005
    If you really need a division and don't test that it's fast enough, then you as a coder are broken. I mean, sure, not everyone has to know about pure speed optimization, but just knowing that division can be costly would be fine ;)
  • Anonymous
    October 14, 2005
    Oops...

    It seems that I've written a lot of functions in that fashion: I write something to solve one problem, assume it won't be used frequently, don't take it too seriously, and leave it in the library as is.

    The fortunate thing is that because I'm writing in C#, I can always add an XML comment to say this is experimental...
  • Anonymous
    October 14, 2005
    Glad you finally got this one in. I guess I have to wait a few years for some of mine....
  • Anonymous
    October 15, 2005
    Of all the artificial pauses I've seen, this was definitely top-ten material...

    "In their innermost loop.".

    I think this story illuminates two very important things.

    1. No matter what you think, your code will often outlive both its purpose and even you. Many, perhaps even most of us, daily use code written by designers and developers now dead. Larry expected, assumed, this code to only be used in test labs. As most should be aware by now, assumption is the mother of all fsckups. :-)

    2. This also displays a lack of proficiency of the GDI developer(s?) (at that time). Modifying or adding new functionality to performance-critical areas requires a certain amount of experience, insight, vigilance and willingness to research. Adding such a division (a division at all is bad enough) in a performance-critical piece of code without looking up at least the most common implementation (x86) seems to me like grounds for "We need to talk".

    I mean, for the user-land audio stuff Larry & co (finally) are making for Vista (and LH?), you don't do one (or heaven forbid, more) division/sample/audio-stream. Do you? :-)

    There is a third thing it says, implicit but still:
    Division is slow, really slow. :-)
  • Anonymous
    October 15, 2005
    Mike, we made the entire audio pipeline floating point. ANY manipulation of the audio samples is slow :)

    On the other hand, it turns out that by moving to floating point, even though individual operations may be somewhat slower, you need to do fewer of them, and you can do them more accurately (for instance, adjusting the volume on an audio stream is done by multiplying each sample by a floating point amplitude scalar - to do the same operation with integral math requires at least a division).

    And there were many things that were done wrong by the people who wrote NT 3.1. IMHO, the biggest one was that one significant team chose to write all their code in C++. Unfortunately, the compiler didn't support C++ natively, so instead they used the cfront C++-to-C transmogrifier and fed the output of cfront to the C compiler. Needless to say, the quality of the resulting compiled code wasn't the highest.
  • Anonymous
    October 15, 2005
    The comment has been removed
  • Anonymous
    October 15, 2005
    Mike, Why go as low as MMX?

    SSE, SSE2 and SSE3 are there specifically to replace the FPU functionality, which should have been considered deprecated for a long time now.

    Even the Microsoft 2003+ compiler generates all the floating point code in SSE when compiled with the correct flags.

    Taking this into account, it would be a shame NOT to write the complete pipeline in floating point in the year 2005.
    Floating point processing and variables make everything much more sane; especially for audio, I believe fixed point should be put to rest.

    I would hate to think that Larry and his crew aren't implementing specific optimized code for SSE and its variants.. but I'm sure they do, right Larry? :)
  • Anonymous
    October 16, 2005
    90% of the engine just pumps bits from place to place, so it doesn't change anything.

    The 10% that does look at the values is optimized.
  • Anonymous
    October 16, 2005
    The comment has been removed
  • Anonymous
    October 17, 2005
    What greenlight said, but without the smilie. I mean, if you anticipated someone else needing to use it, optimizing it would be a good use of time, but every function can't be optimized for speed.
  • Anonymous
    October 17, 2005
    Great story. In the past, you mentioned that Gordon Letwin was the OS guru at MS. Did he participate in the NT development ?
  • Anonymous
    October 17, 2005
    > I could have written an assembly language
    > version of the 64x32 version (since the
    > processor supported 64x32 division
    > natively),
    The native support of 64x32 division in i386 is incomplete: if the quotient doesn't fit in 32 bits, the CPU generates an exception.
    I mean, you'd have to implement some maths even if you were allowed to code in assembly language.
  • Anonymous
    October 18, 2005
    "I went to Dave Cutler (since he knows everything)"

    Larry: I'm getting to the party a little late, but I hope you'll see this comment.

    Was that written sincerely? Or were your eyes pointed a little bit skyward when you wrote that? ;-)

    Also, is "performant" microsoftie-speak for "good performance"? Is that an adjective or a noun?
  • Anonymous
    October 18, 2005
    Tim Smith: Mark Lucovsky came w/ Cutler from DEC. If you've never read Showstopper!--a pretty good book despite its faults--there are a few pages in there about him.

    I hope this link works. Start at the bottom of the page and read the next few pages: http://www.amazon.com/gp/reader/0029356717/ref=sib_vae_pg_95/002-8335644-9960825?%5Fencoding=UTF8&keywords=lucovsky&p=S030&twc=23&checkSum=0C5LiET6qE83Is95N7Th3BDa6O%2Fifzyb48gcLnlRLEE%3D#reader-page
  • Anonymous
    October 18, 2005
    Vy: Letwin was the OS/2 architect at Microsoft and didn't get along with Cutler very well since their two products (initially) competed with each other. I'm sure Larry could answer better than me, but I don't think he was involved in the NT project.
  • Anonymous
    October 18, 2005
    Larry: It probably doesn't hurt to mention that the "C++ team" was Chuck Whitmer's graphics group. I didn't know that about the cfront transmogrifier, though. Very interesting!

    Weren't there any native x86 C++ compilers you could've used in the late 80's? Surely Intel had one?...
  • Anonymous
    October 18, 2005
    Mike: Nice tip, thanks! You can count me as someone who didn't know that. Then again, I don't write very many tight inner loops where speed is an issue.

    A quick question: since (my) compiler naturally promotes real numbers to doubles, is there a difference between
    float invRange = (float)1.0 / m_range;
    and
    float invRange = 1 / (float)m_range;
    (where m_range is an int) Is one better than the other?

    (I think I'd better stop now. 4 or 5 posts in the same topic is more than I deserve.)
  • Anonymous
    October 18, 2005
    The comment has been removed
  • Anonymous
    October 18, 2005
    Dan, it's not my place to lay blame or point fingers (except at my own foibles), I've made it a point not to name names when a team in the past has made mistakes.

    I try very hard not to criticize other people for the work they've done, especially if I wasn't on the relevant team.

    Similarly, I'm not about to discuss the politics between Gordon and Dave. I have too much respect for both of them to do that.
  • Anonymous
    October 18, 2005
    The comment has been removed
  • Anonymous
    October 18, 2005
    Dan: NT wasn't being written for x86 in the late 80's, so it didn't matter if there was a C++ compiler from Intel. I think MIPS (Jazz machines?) was the big NT dev platform then.
  • Anonymous
    October 19, 2005
    Gabe, not true.

    MIPS was certainly the first platform on which NT booted, but x86 was VERY close behind. I was self hosting NT in the very late 80's running on x86 machines.
  • Anonymous
    October 19, 2005
    Just nitpickin', but was MIPS really the first platform on which NT booted? The following seems to suggest there was real i860 h/w that booted NT before MIPS, even if it didn't live up to expectations.

    http://www.winsupersite.com/reviews/winserver2k3_gold1.asp
  • Anonymous
    October 19, 2005
    Mike, I'm pretty sure we had the R4000 version booting before the i860 version (but I'm not totally sure). We had an i860 emulator that ran (very slowly) under OS/2, most of the initial development was done on that, but we had realized that the i860 was a dead end before we got silicon.
  • Anonymous
    October 20, 2005
    The comment has been removed
  • Anonymous
    October 21, 2005
    "...When you're writing code that's going into a system library, make darned sure that it's written to be as performant as humanly possible.."

    Or use an OO language (C++, for example) and make your function(s) "private", thus explicitly telling others you did not intend anybody else to use your functions. It may be faster (a lot faster, perhaps) than making every function in question as performant as possible.. :)