What to do when the unexpected happens

Quite recently I made a pretty visible change to the C# language service for Beta2.  For many operations that you can perform in the IDE (like generate method stub, or trying to set a named breakpoint) we will throw an exception in the case of failure.  These exceptions can happen for many reasons, especially as we call into many COM interfaces any of which can return a failure HRESULT which we won’t know how to deal with.  In the root of all code paths that can lead to an exception being thrown we have a catch for our root exception type so that we can pop up a message saying that we were unable to do what you requested.  While these exceptions are unexpected (as we believe they should be) we do not believe that it is wise for the entire application to be torn down just because we couldn’t do something like replace some text in the editor. 
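
A minimal sketch of the root catch described above, written in portable C++ rather than taken from the actual language service. The names `LanguageServiceException`, `RunUserOperation` and the `notify` callback are hypothetical; they only illustrate catching at the root of a code path and telling the user, instead of letting the failure take down the application.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical root exception type (an assumption, not the real class).
struct LanguageServiceException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs one user-initiated operation. A LanguageServiceException is caught
// here, at the root of the code path, and surfaced through `notify` so the
// rest of the application keeps running. Returns true on success.
bool RunUserOperation(const std::function<void()>& operation,
                      const std::function<void(const std::string&)>& notify) {
    try {
        operation();
        return true;
    } catch (const LanguageServiceException& e) {
        notify(std::string("Unable to complete the requested operation: ") + e.what());
        return false;
    }
}
```

Only the root exception type is caught in this sketch, so anything else still propagates to whatever handler sits above it.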

 

So what have we changed?  Well starting with very recent builds we will now pop up a Watson dialog asking if you would send that information to us.  The dialog looks like the regular Watson dialog which you (hopefully) don’t know and love:

 

 

 

However, there are significant differences (though you probably won’t notice them since you’re so used to what this dialog normally means).  First of all, this isn’t a regular Watson dialog that indicates that the program terminated abnormally.  Rather we are in a state where we know something has gone wrong but we can recover just fine from the issue.  We try to indicate this by telling you that the error is recoverable and that no information has been lost.  Second of all, we will only ask you to do this once while VS is running.  This is so that we don’t annoy you every time we encounter a problem, and also to ensure that in pathological cases, after you dismiss this dialog, you don’t just get another one a second later if we’re in a loop that’s failing over and over again.
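
The "only ask once while VS is running" rule can be sketched as a small gate. This is an illustration under assumptions, not the shipped code: `OncePerSessionPrompt` and `TryPrompt` are made-up names, and the dialog is represented by a callback.

```cpp
#include <atomic>
#include <functional>

// Hypothetical gate that lets the report prompt through at most once per
// process lifetime, so a failing loop cannot pop the dialog repeatedly.
class OncePerSessionPrompt {
public:
    // Invokes `showDialog` only on the first call; returns true if it ran.
    bool TryPrompt(const std::function<void()>& showDialog) {
        bool expected = false;
        if (!shown_.compare_exchange_strong(expected, true)) {
            return false;  // already prompted during this session
        }
        showDialog();
        return true;
    }

private:
    std::atomic<bool> shown_{false};
};
```

The atomic flag is there so that two threads failing at the same moment still produce a single prompt.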

 

So why are we doing this?  Well, as much as we’d like to think that QA and good dev work will prevent all errors, we know that that’s simply not the case.  Users will encounter errors and things won’t work and it’s just going to suck for them.  By having this information get sent back to us during this beta we can see what real issues users are running into and we can know that we have to focus on that code and really investigate and determine why it’s failing.  How do we do that?  Well, when one of those exceptions is thrown the information that will get sent to us is the call-stack that happened when the error occurred.  The Watson system will collect those call-stacks and make them available to us to investigate.  It will also do very convenient things like calculating how many times this issue has been hit ever, how many times it was hit with the same callstack, etc.  It will also let us send out a survey to the user so that if we need some more info then the next batch of people who hit this will get asked if they can give us some extra information that might help us with this.  For example, we might ask them “are you using a source control system?  Did that system change the readonly status of the file outside the IDE? Etc.”

 

By sending this information to us you let us start working on issues that are actually affecting you, and we can do that in order of importance (which we can determine by the quantity of cases, what operation you’re being prevented from doing, and other factors).

 

So that’s all lead-up to the debate we’ve been having internally.  We felt that this was a safe action to take because when these exceptions are thrown we know for certain that something has gone wrong.  We asked someone to do something and they returned a failure, which caused us to abort the current action.  However, there’s another type of error that occurs inside the IDE.  Specifically with our VS2003 codebase (which didn’t use exceptions) we used a very common coding idiom of:

 

if (FAILED(hr = SomeCall(…))) {
    VSFAILF("…");
    return hr;
}

 

Basically what we’re doing is emulating exceptions in a codebase that didn’t use them.  After every call we check an error value and if we can’t handle it we propagate it up.  We also assert (VSFAILF) so that if we’re in a debug build we’ll stop executing so we can attach a debugger.  Recently I’ve started switching that code to use a macro instead:

 

#define IfFailAssertReturnHR(expr) \
    if (FAILED(hr = (expr))) { \
        VSFAILF(""); \
        return hr; \
    }

 

This macro has many benefits over the preceding code, but I don’t really want to get into them now.  One benefit that could be added is that we could change the macro to include the line “Watson::ReportUnexpectedCondition()”, which would then proffer the dialog I showed earlier.
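
Here is a rough sketch of what the macro might look like with that extra line. It is built to compile without the Windows headers, so `HRESULT`, `S_OK`, `E_FAIL` and `FAILED` are local stand-ins, `VSFAILF` is stubbed out, and the body of `Watson::ReportUnexpectedCondition` is a placeholder that only counts calls. `ReplaceText` and `GenerateMethodStub` are hypothetical callers. The `do { } while (0)` wrapper is an addition of this sketch so the macro behaves as a single statement.

```cpp
#include <cstdint>

// Stand-ins for the Windows definitions (assumptions for portability).
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_FAIL = static_cast<HRESULT>(0x80004005u);
#define FAILED(hr) (static_cast<HRESULT>(hr) < 0)

// Stub for the debug-build assert used in the post.
#define VSFAILF(msg) ((void)0)

namespace Watson {
// Placeholder hook: the real one would raise the report dialog.
int g_reportCount = 0;
inline void ReportUnexpectedCondition() { ++g_reportCount; }
}  // namespace Watson

// Expects a local `HRESULT hr` in the enclosing function.
#define IfFailAssertReturnHR(expr)               \
    do {                                         \
        if (FAILED(hr = (expr))) {               \
            VSFAILF("");                         \
            Watson::ReportUnexpectedCondition(); \
            return hr;                           \
        }                                        \
    } while (0)

// Hypothetical callee that fails on demand.
HRESULT ReplaceText(bool editorRejects) { return editorRejects ? E_FAIL : S_OK; }

// Hypothetical caller: the failure is reported once and then propagated.
HRESULT GenerateMethodStub(bool editorRejects) {
    HRESULT hr = S_OK;
    IfFailAssertReturnHR(ReplaceText(editorRejects));
    return hr;
}
```

Because the hook lives in the macro, every call site that already uses it would start reporting without further edits.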

 

So why wouldn’t we do this?  Well, unfortunately, unlike the exception based code base we have where we know that an exception is something bad, these failure cases are not always unexpected.  For example, inside the IntelliSense engine one might be trying to iterate over the members of a type.  However, some of those members might be inaccessible to the caller because they’re private.  Unfortunately to indicate that a member cannot be retrieved an error HRESULT will get returned.  That error then gets bubbled up and might cause some other operation to fail.  This is rather unfortunate, but we can’t fix up this code to deal with these different concepts without an inordinate amount of code churn.
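
To make that failure mode concrete, here is a small sketch under assumptions. `Member`, `GetMember` and `CollectMemberNames` are invented for illustration, and the choice of `E_ACCESSDENIED` as the returned code is a guess, since the post does not say which HRESULT the real code uses. The type aliases are stand-ins for the Windows headers.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins for the Windows definitions (assumptions for portability).
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_ACCESSDENIED = static_cast<HRESULT>(0x80070005u);
inline bool Failed(HRESULT hr) { return hr < 0; }

struct Member {
    std::string name;
    bool isPrivate;
};

// Hypothetical accessor: an inaccessible member is signalled with a
// failure HRESULT, even though nothing has actually gone wrong.
HRESULT GetMember(const Member& m, std::string* out) {
    if (m.isPrivate) return E_ACCESSDENIED;
    *out = m.name;
    return S_OK;
}

// Propagates every failure upward, so a single private member aborts the
// whole enumeration and looks exactly like a genuine error to the caller.
HRESULT CollectMemberNames(const std::vector<Member>& members,
                           std::vector<std::string>* names) {
    for (const Member& m : members) {
        std::string name;
        HRESULT hr = GetMember(m, &name);
        if (Failed(hr)) return hr;  // routine condition, reported as failure
        names->push_back(name);
    }
    return S_OK;
}
```

A reporting hook placed on the `Failed(hr)` branch would fire here for a perfectly ordinary type with a private member.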

 

So this means that 95% of the time when we get one of these asserts it’s a bad thing, but 5% of the time it’s actually ok and we’re just getting the assert because of poor architectural decisions.  This is rather high in my mind and it means that if you’re running the IDE (and thus performing millions of operations behind the scenes) you have a good chance of running into this.  If we then submit this report we’re likely to miss an important error and we’re also probably going to annoy you every time you run the Beta2.

 

So what to do instead?  Well, Watson has the concept of a “queued report”.  Rather than interrupting you when the problem happens the information is logged to a queue.  Later on (usually after a reboot in my experience) you’ll get a message saying something like “The following applications experienced problems during execution.  Would you like to send the information blah blah blah…”  This would allow us to still report immediately about significant problems (which also has the added benefit of letting you know that while we were unable to do what you just asked we’ll now know about it and we’ll hopefully be able to do something) while getting valuable feedback about these other errors that people are experiencing that we are recovering from.
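
The split between immediate and queued reports could look roughly like this. It is a sketch of the idea only and does not use the real Watson API; `Severity`, `ReportRouter`, `Report` and `Drain` are hypothetical names.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical classification of the two kinds of failure in the post.
enum class Severity { Exception, FailedHResult };

class ReportRouter {
public:
    explicit ReportRouter(std::function<void(const std::string&)> promptNow)
        : promptNow_(std::move(promptNow)) {}

    // Exceptions interrupt the user right away; failed HRESULTs are only
    // logged to a queue so the user is not disturbed mid-task.
    void Report(Severity severity, const std::string& callStack) {
        if (severity == Severity::Exception) {
            promptNow_(callStack);
        } else {
            queued_.push_back(callStack);
        }
    }

    // Hands over everything queued so far (for a later, batched prompt)
    // and leaves the queue empty.
    std::vector<std::string> Drain() {
        std::vector<std::string> out;
        out.swap(queued_);
        return out;
    }

private:
    std::function<void(const std::string&)> promptNow_;
    std::vector<std::string> queued_;
};
```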

 

Of course, like all things nowadays we’re incredibly wary about making changes (especially highly visible ones) so close to Beta2.  God forbid I have a bug in my code which submits this stuff to Watson and now instead of gracefully recovering from an exception like we normally do we crash 10% of the time.  So it’s a tough call between asking ourselves if the information gathered is worth the code risks, and also if the fundamental change in behavior will be ok with our users. 

 

Personally, I feel that it’s worth it.  We’re doing this for a beta and the point of a beta (in my mind) is to find out what needs work so that we can get everything up to extremely high quality for the final release.  We depend on users to help us with that by trying out the product and reporting back to us the problems they have.  But now we can make that process a lot easier by automatically collecting information that is incredibly valuable.  While exact repro steps are the most ideal for diagnosing and fixing issues, callstacks to problem areas are the next best thing.

 

How do you feel about this sort of thing happening to you while you use the next beta?  Also, how do you feel about this sort of thing happening while you use the final product?  Is it worth sending us the information so that we can better address issues that you’re running into?  Or is the intrusiveness just far too much to handle?  What other feelings or concerns do you think exist with this?

Comments

  • Anonymous
    January 26, 2005
    Awesome stuff, we use exactly the same style of error reporting in our .NET application - we have the ability to report a caught exception or an unexpected condition (it's also hooked up to the top level exception handlers). We can either prompt the user to enter a description of what they were doing at the time (they usually just type particular four letter words), or just submit the error silently. The error reports include callstacks, product versions, system and environment information (memory, GDI resources, disk space, etc), as well as a list of recently executed SQL commands, keystrokes and control navigation, and a number of other pieces of data relevant to the application.

    If what you're proposing means that an error that occurs while we're developing can get reported to Microsoft and as a result gets fixed, I say bring it on - our experience as an end user will be all the better for it in the long run.

    Besides, we do the same thing to our users, so it would be hypocritical of us to object to Microsoft doing the same to us :)
  • Anonymous
    January 26, 2005
    I like the idea of using Watson infrastructure to report non-critical problems. I would even tolerate it in a shipping product if it's not very intrusive (e.g. there could be a "automatically send any reports and don't ask me again" option or something).

    Maybe the OS Watson should also start reporting recoverable problems? I could see something like this:

    Application XYZ is impacting performance by generating too many page faults/using lots of memory/etc. Do you want to submit an error report?
  • Anonymous
    January 26, 2005
    It might confuse users... I'm used to the app being closed and my unsaved work being lost when I see this dialog; but the behaviour in this case will be different to what I expect. I'd recommend changing the dialog, rewording it to make it clearer... if this is going to be used elsewhere in VS for reporting purposes, the difference between the Watsons should be visually obvious (i.e. 1 means, the app is closing, your work is lost, the other means the app is ok, your work is ok, but please report the issue). Could this not be incorporated into customer feedback in some way...

    Overall I think it's a good idea in principle, but MS don't tend to ship service packs for the VS.net 1.x versions; back in the day, each VS release prior to .net would be periodically service packed. This was a Good Thing. Then VS .net came along in 2001 & the practise of patching bugs stopped (well, apart from VS 1.1, which was really just a service pack, although marketed as a new version). So, we've had two versions (or perhaps 1 and a half versions) in around 4 years. I've used VS for serious dev from beta1 (Dec 2000), so it probably seems a lot longer to me than it should.

    What I'm trying to say is, great, take diagnostic info to improve the product, but if you only ship a version once every year / 2 years then it's pretty much useless to me - I still have to live with the bugs!
  • Anonymous
    January 26, 2005
    I agree with all points made by Adam.

    In particular, make the dialog clearly different from the 'real' watson dialog, the example image looks too much like the normal 'crash dialog', also for seasoned developers.

    I don't think you should use the 'queued report'. It will be almost impossible for someone to remember what the heck they were doing, unless asked straight away. And, as a 'drawback' of the stability of windows these days, who reboots their computers regularly anymore? I think I reboot my main development machine at the office at most once per three weeks or so. Getting a message about sending an error report for something that happened two weeks ago is, to put it mildly, confusing.
  • Anonymous
    January 26, 2005
    I think for a regular application non critical error reporting would be too in your face, but for Visual Studio the target audience is very different and this sort of scheme works well. I recently experienced a crash with Lutz Roeder's Reflector - it packaged the exception and asked if it could be sent with an optional user feedback comment (as to what I was doing). I sent the report and was contacted about the issue shortly after; we managed to quickly narrow it down to a 1.0 compiled plugin.

    To me it makes sense to have such a system, and while the level of interaction that I had with Reflector may not scale to Studio I think you can leverage the intrinsic knowledge of your userbase and they will appreciate the involvement (especially with an optional don't bother me again style switch), especially as Adam states there are actually fixes forthcoming (and not just hotfixes - they aren't good enough).

    I would also agree on the issue of reports on reboots - I currently pretty much only reboot for Windows Updates and if I've got a lot of work on I'll even wait a few days on that (with the reminder dialog in the corner of my screen). In fact I'd say the same about Studio itself often it will be left on for weeks at a time when I have a major project on.
  • Anonymous
    January 26, 2005
    I also agree that this dialog needs distinguishing from those that are displayed when a fatal error occurs.

    How about having a setting (in Tools->Options say) to
    (a) Report all non-fatal errors silently
    (b) Ask what I want to do about a particular error
    (c) Never ask and never report non-fatal errors.
  • Anonymous
    January 26, 2005
    Over all, diagnostics is a very good idea.

    What is missing is diagnostics response. I mean, when I submit an error report/dump/trace (whatever), I will never know (today) if it's a known problem, there is a fix or workaround or something. This is something Dr. Watson/Mr.Hyde does not do.

    It would be more useful, if the reporting system could also give me some feedback, like "Yes we know this. Look this Q or download this fix" or "This a new one" (so you save the work to google it).

  • Anonymous
    January 26, 2005
    The comment has been removed
  • Anonymous
    January 27, 2005
    The comment has been removed
  • Anonymous
    January 27, 2005
    Won't it dramatically affect the perceived quality of the product to constantly inform the user that something went wrong?
  • Anonymous
    January 27, 2005
    On my previous - clearly this applies only to production.

    On a beta, the more ways and times I can send you fault information, the better. I love the "queued report" thing, but for beta2 I wouldn't even care on that... I'd always send.

    While I wouldn't want you to crash in an instance where you used to gracefully and silently fail, I'd rather that 10% bump (during the beta) than have you guys never know about a scenario that fails, since that scenario may become somewhat common down the road. That's what a beta is for, in my opinion. I know that perception of quality and stability is important, even at beta2, but quality of RTM is much more important to me (especially since it may never be patched :) )
  • Anonymous
    January 27, 2005
    Lots of things to respond to. I'll start with Paul.

    Paul: "Won't it dramatically affect the perceived quality of the product to constantly inform the user that something went wrong? "

    In return I might ask you: "won't it dramatically affect the perceived quality of the product to constantly fail to do something that you asked us to do?"

    Should we look professional while acting incompetent, or should we recognize our shortcomings and make a best effort to try to fix them?
  • Anonymous
    January 27, 2005
    Pavel: "I like the idea of using Watson infrastructure to report non-critical problems. I would even tolerate it in a shipping product if it's not very intrusive (e.g. there could be a "automatically send any reports and don't ask me again" option or something). "

    Those options do exist in my implementation, but we haven't exposed them (and most likely won't for beta2). UI changes are very difficult to make considering all the issues like localization and accessibility that have to go into them. However, I will see if something like this can be done for the release.
  • Anonymous
    January 27, 2005
    Adam: "It might confuse users... I'm used to the app being closed and my unsaved work being lost when I see this dialog; but the behaviour in this case will be different to what I expect. I'd recommend changing the dialog, rewording it to make it clearer... if this is going to be used elsewhere in VS for reporting purposes, the difference between the Watsons should be visually obvious (i.e. 1 means, the app is closing, your work is lost, the other means the app is ok, your work is ok, but please report the issue). Could this not be incorporated into customer feedback in some way... "

    I agree with you about the confusion. This dialog is owned by the Watson team, so the only customization that can be done is through the hooks that they give us. I'll see if I can make enough changes so that it appears quite different from the regular crash dialog.

    "Overall I think it's a good idea in principle, but MS don't tend to ship service packs for the VS.net 1.x versions; back in the day, each VS release prior to .net would be periodically service packed. This was a Good Thing. Then VS .net came along in 2001 & the practise of patching bugs stopped (well, apart from VS 1.1, which was really just a service pack, although marketed as a new version). So, we've had two versions (or perhaps 1 and a half versions) in around 4 years. I've used VS for serious dev from beta1 (Dec 2000), so it probably seems a lot longer to me than it should.

    What I'm trying to say is, great, take diagnostic info to improve the product, but if you only ship a version once every year / 2 years then it's pretty much useless to me - I still have to live with the bugs! "

    Understood. I'm not involved in the patching process for VS so I'm not sure why things have been so different for 2002 and 2003. You rightly perceived that 2003 was really a big service pack for 2002 (which is why it only cost something like $10 to get). I'm hoping we do better for 2005.
  • Anonymous
    January 27, 2005
    Luc: "I don't think you should use the 'queued report'. It will be almost impossible for someone to remember what the heck they were doing, unless asked straight away."

    Luc, if we need info from the user then we can tell watson to not queue it up and instead ask the user for data. It's only when we don't need any extra info that it would be immediately queued.

    "And, as a 'drawback' of the stability of windows these days, who reboots their computers regularly anymore? I think I reboot my main development machine at the office at most once per three weeks or so. Getting a message about sending an error report for something that happened two weeks ago is, to put it mildly, confusing. "

    True, but is that really such a negative?
  • Anonymous
    January 27, 2005
    Alex: Thanks for the observation about our expected userbase. I agree that the VS audience is probably the best to understand and work with us to work out these kinks. It's nice when you're a developer writing software for developers.

    Also, the anecdote about Reflector was very encouraging.
  • Anonymous
    January 27, 2005
    Laura: "Over all, diagnostics is a very good idea.

    What is missing is diagnostics response. I mean, when I submit an error report/dump/trace (whatever), I will never know (today) if it's a known problem, there is a fix or workaround or something. This is something Dr. Watson/Mr.Hyde does not do.

    It would be more useful, if the reporting system could also give me some feedback, like "Yes we know this. Look this Q or download this fix" or "This a new one" (so you save the work to google it). "

    That's an excellent suggestion. I'll send that to the Ladybug team (msdn/ProductFeedback). It might be nice, if we do ship service packs, to then list the bugs that have been fixed since the previous release.
  • Anonymous
    January 27, 2005
    Philip: "If I had any hope that doing #1 would cause the bug to stop six months down the road, I'd do it. But why waste one minute times thousands of users for no reason? "

    So true.

    I'm doing this so that we will be able to fix problems. If that doesn't happen I'll be very unhappy. At the very least I know we'll be fixing issues from beta2, and I hope that we do that as well for the released product.
  • Anonymous
    January 27, 2005
    Personally, I have no problem sending error info back to the product group if it results in better products and service packs. This is true for betas or RTMs. Not only do I want VS to work very well for me, but also the MS Anti-spyware beta and the MSN Desktop Search beta. In short: I'm responding well to the "help me, help you" request. :-)
  • Anonymous
    January 27, 2005
    The comment has been removed
  • Anonymous
    January 28, 2005
    Paul: A couple of things should temper this opinion. First of all, we're making this change for the beta, and not necessarily for the final release. Second of all, when I've used products like Eclipse and Netbeans I've experienced times when an exception was thrown and caught, and a top-level dialog popped up to inform me about it.

    So if there are people who delight in finding out that beta products have bugs in them... well I guess that's just unfortunate. In the meantime I think we owe it to our user base to not provide the appearance of a solid product but to actually provide a solid product instead.

    I think it's worth the risk.

    We'll re-evaluate this for the final release when the audience becomes much more widespread.
  • Anonymous
    January 28, 2005
    I think this is a mistake. Don't put a debug button there and don't make it in any way/shape/form similar to Dr. Watson. Reason: if I had seen this dialog w/o reading this blog, I would let VS send the error report and then I would promptly close VS. I would simply not trust it to be in a good state. If you can get user's OK to send reports upfront, I wouldn't even ask that.
  • Anonymous
    January 31, 2005
    Please give me an option to always submit these for VS - then I won't be bothered by the dialogs disrupting my thinking, I won't perceive that the app is buggy, and I can know that the real bugs that I'm hitting are being looked at.

    It would be even better if I could know if my issue(s) had been looked at / fixed - then I would absolutely love this feature and really feel the improved quality enabled by it (like the Reflector case).
  • Anonymous
    January 31, 2005
    James: The options for configuring this are specified in my later blog post on this topic. By setting the regkeys:

    HKCUSOFTWAREMICROSOFTVISUALSTUDIO8.0CSHARPEDITORWatson_SendExceptionReportWithoutPrompting - true

    HKCUSOFTWAREMICROSOFTVISUALSTUDIO8.0CSHARPEDITORWatson_SendFailureReportWithoutPrompting - true

    You will never be bothered while you're in the middle of something.
  • Anonymous
    February 03, 2005
    Hmm .. Will you trust me or nope - but I've just seen this dialog on my PC.

    It was not a happy experience for me.
    Actually I do not care that much about what it looks like or what it is supposed to do.

    The biggest problem for me is the IDE hang for at least a minute (or two or three) before showing this dialog.

    And this is definitely Cyrus's fault "ModName: cslangsvc.dll" ;-)

    I've just created a repro and will file a bug report on Product Feedback.
  • Anonymous
    February 03, 2005
    Done.

    FDBK21068

    http://lab.msdn.microsoft.com/ProductFeedback/viewFeedback.aspx?feedbackId=FDBK21068
  • Anonymous
    February 03, 2005
    AT: your product feedback was marked as spam. Can you send it to me through the contact link.

    And yes, we know that there will be a hang while the minidump is collected. We feel that that's acceptable behavior during the beta given the enormous feedback it will give us about problems you are having.
  • Anonymous
    February 03, 2005
    I'm sorry. It was me who marked it as spam to remove from database.

    I've clicked Submit twice and got two exactly the same reports on Product Feedback.

    Take a look on FDBK21069
  • Anonymous
    February 10, 2005
    The comment has been removed