What's wrong with this code, part 10 - the answers

Yesterday's article was a bit of a trick question, but was a real world example.  Our group encountered this in some code we were testing last week (in some pre-production code - it was not part of any product).

Somewhat surprisingly, it turns out that the code, as written, wasn't incorrect.  The hr = ERROR_NOT_SUPPORTED line was in fact correct - the code was using the fact that it was a success error code as a signal to a higher level component.  As many of you pointed out, that looked like an absolute error, and I was bitten by it as well: I was making an unrelated modification to the routine and noticed the hr = ERROR_NOT_SUPPORTED; expression and fixed it to be hr = HRESULT_FROM_WIN32(ERROR_NOT_SUPPORTED);

But that turned out to be a HORRIBLE mistake.  Even though the code had passed my unit tests, and several other developers had tested the code, one of our testers encountered a fairly reliable test failure on his machines.  One of the developers in our group (after spending a long day debugging) finally chased the problem down to this change.  It turns out that on certain audio cards, PerformAnOperation() was returning a failure (but only on those audio cards).

And my change propagated a failure code out of the routine (which used to be a SUCCESS return, with an informational error code).  And, because I'd changed the SUCCESS error code to a FAILED error code, the caller of this routine didn't handle the error correctly.

So what was wrong?  The mistake was that the semantics associated with returning a success return code was not called out anywhere in the code.  The code was correct (or rather, functioning as intended), but some of the assumptions associated with the code were not documented.  So someone (me) came along later and "fixed" the code and thus introduced a bug.  Correcting the function was as simple as documenting the assumption...

// ----------------------------------------------------------------------// Function:// CThing1::OnSomethingHappening()//// Description:// Called when something happens//// Return:// S_OK if successful.  If an error occurs while performing an operation, returns a success // error code with the error number set to ERROR_NOT_SUPPORTED//// ----------------------------------------------------------------------HRESULT CThing1::OnSomethingHappening(){    HRESULT hr;        :        :    <Do Some Stuff>        :        // Perform some operation...    hr = PerformAnOperation();    if (FAILED(hr))        hr = ERROR_NOT_SUPPORTED;  // Swallow the error from PerformAnOperation and return an indicator to the caller.    IF_FAILED_JUMP(hr, Error);Exit:    return hr;Error:    goto Exit;}

If that had been done, it would have saved us a great deal of pain.

And, as I mentioned above, this was NOT a problem in any production systems.  And we're reviewing our unit tests to see how we can improve them to ensure that problems like this issue get caught even earlier.

Ok, Kudos and comments...

Michael Ruck spotted the root cause of the error: That the failure was not specified, and thus the next poor schlump (me) coming along into the code fixes what appears to be a simple bug and thus breaks something.

Some other comments: Skywing was, of course the first person to notice the ERROR_NOT_SUPPORTED (he/she was also the first person to post) - I sort-of used him as bait because I intentionally didn't comment that the hr = ERROR_NOT_SUPPORTED was not an error - upon reflection, I realize that that was a mistake.

A number of people were concerned about the IF_FAILED_JUMP macro, that's just a convention we use in my group, I should have edited it out since it complicated the problem, that was my bad.

Another group of people were concerned about the goto Error/goto Exit paradigm.  This is actually a convention I picked up several years ago - you move all the error handling to the end of the function (out-of-line) and then jump back to a common "cleanup" or "exit" label.  That allows the "normal" case to be in-line and isolates the error handling logic to one location.

Skywing also pointed out that returning ERROR_NOT_SUPPORTED STILL isn't correct, the facility should have been set to FACILITY_WIN32, he's absolutely right.

A number of other people suggested that the routine should be changed to return either BOOL or bool, but (as was immediately pointed out) that throws away error code information.

And again, several people mentioned throwing exceptions, the code's neutral w.r.t. exception handling (although I disagree with the whole "throw an exception to return an error" paradigm - that's a different issue).

Comments

  • Anonymous
    March 16, 2005
    Isnt this an example of how horrendous a system based on error codes can get.

    With exceptions, if it succeeds it succeeds. If there is an exception it has to be handled, without exception
  • Anonymous
    March 16, 2005
    The comment has been removed
  • Anonymous
    March 16, 2005
    Michael,
    Thank you. We so often concentrate on the code bits and ignore the non-code bits, but they all go together.

    That's actually why I put the link in to my "programming with style" entries in as a hint.
  • Anonymous
    March 16, 2005
    There are some APIs that return HRESULTs but are documented as something like "returns S_OK on success". Since becoming the WinVerifyTrust tester I've taken that to heart. When that's what MSDN or the header says, I check for S_OK and treat anything else as failure. I don't use SUCCEEDED in those cases.
    The reason I do this has to do with the garbage lagacy code underneath WinVerifyTrust that sometimes returns an actual HRESULT, sometimes does HRESULT_FROM_WIN32, and sometimes just bubbles up an NTSTATUS or a Windows error DWORD instead of an actual HRESULT. Yeah - it's a mess, but the header is the contract and the header specifically tells callers to check vs S_OK, which works.

    Given all of that:
    Without knowing what PerformAnOperation can return, I have to assume that it might return a success code that is not S_OK. So I'm sticking by what I said yesterday when I claimed that the bug is that OnSomethingHappening doesn't necessarily return S_OK on success as it claims it does.
  • Anonymous
    March 16, 2005
    Drew, you're right (on all points). I also try to only check for S_OK (but I get pushback from the rest of my group).

    And you're right, the issue was that the contract for OnSomethingHappening as documented by the headers was incorrect.

    Sorry if I missed that - I got more responses yesterday than I could do justice to.
  • Anonymous
    March 16, 2005
    The comments should probably be changed, then. The part about S_OK on success should be removed/replaced/whatever.

    Believe it or not, if OnSomethingHappening had swallowed any additional status info and returned S_OK for all successes in the original version you posted yesterday I would have considered this function to have been correct. Poorly designed, certainly, and almost certainly I'd complain to the dev about it, but it would work as promised.

    I'm not claiming my statements reflect the way that HRESULTs are suposed to work. My comments here are all heavily biased by my experience (horror!) with real-world, brittle APIs.

    I wonder if I'm getting too jaded to be a tester.
  • Anonymous
    March 16, 2005
    The comment has been removed
  • Anonymous
    March 16, 2005
    Why is a constant with a name beginning with ERROR_ a success result? I'd expect everything called ERROR_* to be a failure.
  • Anonymous
    March 16, 2005
    The comment has been removed
  • Anonymous
    March 16, 2005
    "A number of other people suggested that the routine should be changed to return either BOOL or bool, but (as was immediately pointed out) that throws away error code information."

    How does only returning S_OK or ERROR_NOT_SUPPORTED convey any more information than a boolean would? If other return values are also valid, then the documentation for the function is still incomplete.
  • Anonymous
    March 16, 2005
    ERROR_NOT_SUPPORTED is not a success code. It is a number, whose value is unknown and not guaranteed. It is not <0 or =0 or >0. If Microsoft changed ERROR_NOT_SUPPORTED to 0xfedcba98 (and NULL to 0xeaeaeaea, come to that) and recompiled Windows, Windows should still work... as long, of course, as application software vendors followed suit.
    And it is pretty simple to bulletproof an application so that is not crucially dependent on the half-understood policies of another software vendor.
  • Anonymous
    March 16, 2005
    The comment has been removed
  • Anonymous
    March 20, 2005
    It really is horrible that you criticise exceptions and then do someone similar using macros and gotos! Readability conventions that use gotos?! What year is it?
  • Anonymous
    March 21, 2005
    You're 100% right Asd. I should have expunged the IF_FAILED before I posted it because it complicates the discussion.

    However I firmly believe that using "goto" dramatically enhances the readability of code IF it's used in a limited way. And the contortions you need to go through to avoid goto's can make code be totally unreadable.

    When was the last time you saw:
    do
    {
    ...<something>
    ...if (FAILED(result))
    ......break;
    } while (0);
    ?

    Someone inserted a gratuitous do/while loop just to enable the use of break as a replacement for goto.
  • Anonymous
    May 29, 2009
    PingBack from http://paidsurveyshub.info/story.php?title=larry-osterman-s-weblog-what-s-wrong-with-this-code-part-10-the