When new features expose old bugs.

Not quite "Riffing on Raymond" but he just wrote about this, and it reminded me of a story that was related to me by the dev lead for the security team (the guys who own the LSA and authentication in Windows, not the SWI team) here at Microsoft.

Raymond recently wrote about one of the changes for 32bit applications with the /LARGEADDRESSAWARE flag set - on 64bit windows, they get access to a full 32bit address space (4G) instead of the 31.5bit address space (3G) to which they previously had access.

In a much earlier post I'd written about transferring a pointer between processes, and why you'd want to do that.  Well, it turns out that two of the Windows components that need to do this are the LSA and SSPI components.  For those that don't know, the LSA is the component that creates the initial user token when the user logs on (and maintains the local accounts database, etc.) and SSPI is a generalized authentication package API.  So the devlead had to do essentially the same thing that the RPC team had done - ensuring that their pointers could be marshalled across processes and guaranteeing that the shared memory structures were sized appropriately.  The work was done way back in 2000, for the first release of 64bit windows (on the Itanium), and mostly forgotten (as is all code that works reliably).

Until the very end of the testing of Win2K3 SP1. I mean the VERY end of testing.  The builds were in escrow.  For those that aren't familiar with the escrow process, when a build is in escrow, it means that it's almost ready to ship - the final test suites are being run, and the only bugs that will be fixed are bugs that would literally cause us to recall the product from manufacturing.

And the security lead got a report of an SSL failure in one of our server applications running under Wow64 (32bit application on Win64).  The application would work just fine for a really long time, and all of a sudden they'd start getting SSL failures. 

Now on NT, SSL is implemented by an SSPI provider.  Running the app under the debugger showed that they were getting a STATUS_ACCESS_VIOLATION error when they were trying to copy information from a client certificate from the server process into the LSA process.  Hmm.  That error only occurs when dealing with a bad pointer.

It turns out that way back in 2000, when the 32-bit->64-bit marshalling code was originally written, 32bit applications ran the same on 64bit Windows as they did on 32bit Windows - they had access to only 2G of address space (I believe this was even before the /LARGEADDRESSAWARE flag).  Later on in the process, the decision was made to grant these applications access to the entire 4G address space (thus allowing the Win64 platform to provide benefits to 32bit applications that are address space sensitive like Microsoft Exchange and SQL Server).

And it turns out that the code to convert pointers from the 32bit space to the 64bit space did its conversion using LONG datatypes.  And if the high bit was set on the 32bit value, it was quite happily sign extended into a really big negative number on the 64bit side.  This wasn't a problem when the 32bit apps only had access to 2G of address space, since the high bit was never set.  Even once they had access to 4G, the bug only showed up in very limited cases - essentially the problem could only show up when the app had used up 2G of address space (since NT tends to allocate from the bottom of the address space to the top).  And then only when it used a particular subset of the LSA or SSPI APIs.
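To make the sign extension concrete, here's a minimal sketch of the two ways to widen a 32bit pointer value.  This is not the actual LSA or SSPI code (I'm not reproducing that here); the function names are made up for illustration, and int32_t stands in for LONG, which is a signed 32bit type on Windows.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the real marshalling code. */

/* Buggy pattern: the 32bit pointer value passes through a signed 32bit
 * type (like LONG) before being widened, so the high bit is copied into
 * the upper 32 bits of the result.
 * Note: converting an out-of-range value to int32_t is implementation-
 * defined in C, but mainstream compilers wrap it as two's complement. */
static uint64_t widen_signed(uint32_t ptr32)
{
    int32_t as_long = (int32_t)ptr32;   /* high bit becomes the sign bit */
    return (uint64_t)(int64_t)as_long;  /* sign extended to 64 bits */
}

/* Fixed pattern: keep the value unsigned all the way through, so the
 * upper 32 bits of the result are always zero. */
static uint64_t widen_unsigned(uint32_t ptr32)
{
    return (uint64_t)ptr32;             /* zero extended to 64 bits */
}
```

For an address below 2G, say 0x7FFF0000, both versions return 0x000000007FFF0000, which is why the bug hid for so long.  For an address above 2G, say 0x80010000, the signed version returns 0xFFFFFFFF80010000 (a bad pointer) while the unsigned version returns 0x0000000080010000.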

Whoops.  One way of looking at this is that the marshalling logic wasn't /LARGEADDRESSAWARE - it had an example of the same kind of bug that prevents the /LARGEADDRESSAWARE flag from being the default. Double whoops.

So the LSA developers went around and quickly found the cases with the error and fixed them all.  And, having been burned by this once, the base team went around and took a hard look at all of their integer conversions and found a few of those as well.

And of course, we now have BVT test cases that will attempt to catch this kind of error in the future.

Comments

  • Anonymous
    June 06, 2005
    Wouldn't unit testing have exposed these errors?
  • Anonymous
    June 06, 2005
    Travis,
    Highly unlikely. Unit tests exist to test for normal functionality. And this only occurred under relatively abnormal circumstances - witness the fact that it lay dormant for 5ish years.

    It's important to note that this only happened after 2G of address space had been consumed (not necessarily memory, but address space). And that wasn't a part of our normal test suites. That's since been fixed, as far as I know.
  • Anonymous
    June 06, 2005
    Quick addition, LARGEADDRESSAWARE was possible before the year 2000, I am fairly certain. I think it was introduced with SP3, for Enterprise Edition only. That would have made it available in early 1997, which means the flag would have been in place as well. If I'm not mistaken, SQL 6.0 and Exchange 5.5 were the first two apps to make use of it.
  • Anonymous
    June 06, 2005
    The comment has been removed
  • Anonymous
    June 06, 2005
    The effect of the bug was that SSL connections would fail, I believe (given the symptoms listed).

    So after a certain amount of runtime, the system would fail to accept new clients. Which is totally unacceptable.

    This was a must-fix.
  • Anonymous
    June 06, 2005
    Actually, 3GB is not 31.5 bits, it's 31.5849 (=ln(3)/ln(2) + 30). Sorry, just had to say it.
  • Anonymous
    June 07, 2005
    If we're straining at gnats over whether it's 31.5 or 31.58(something something) bits, how does one have half a bit in the first place?
  • Anonymous
    June 07, 2005
    "how does one have half a bit in the first place?"

    For an example, take a look at arithmetic encoding:
    http://michael.dipperstein.com/arithmetic/arithmetic.html
  • Anonymous
    June 07, 2005
    > how does one have half a bit in the first
    > place?

    I was going to say that my cat half bit off more than she could chew, but something's missing from that.

    So I guess we'll have to ask a horse owner.
  • Anonymous
    June 07, 2005
    Speaking as the owner of two horses....

    Our horses chew on their bits, but they've never bit off more than they can chew. But if they could choose, they would chew on carrots, not bits.
  • Anonymous
    June 19, 2005
    The comment has been removed
  • Anonymous
    July 12, 2005
    When new features expose old bugs we are not really "engineering"
  • Anonymous
    July 29, 2005
    Apologies for reviving an old thread, google found this page for me just today.

    I'm also running into this 'sign-extend the 32bit pointer' problem. We have code that calls from a 32bit WOW64 module into a 64bit kernel driver. With /LARGEADDRESSAWARE and pointers above 2GB the kernel will get a bad address.

    My question is, does anyone know if there is a compiler fix for this? It's possible to work around in my code, but the 'right thing' to do seems a compiler fix.