Resilience is NOT necessarily a good thing

I just ran into this post by Eric Brechner, who is the director of Microsoft's Engineering Excellence center.

What really caught my eye was his opening paragraph:

I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it's better to crash and let Watson report the error than it is to catch the exception and try to correct it.

Wow.  I'm not going to mince words: What a profoundly stupid assertion to make.  Of course it's better to crash and let the OS handle the exception than to try to continue after an exception.

 

I have a HUGE issue with the concept that an application should catch exceptions[1] and attempt to correct them.  In my experience, handling exceptions and attempting to continue is a recipe for disaster.  At best, it turns an easily debuggable problem into one that takes hours of debugging to resolve.  At its worst, exception handling can either introduce security holes or render security mitigations irrelevant.

I have absolutely no problems with fail fast (which is what Eric suggests with his "Restart" option).  I think that restarting a process after the process crashes is a great idea (as long as you have a way to prevent crashes from spiraling out of control).  In Windows Vista, Microsoft built this functionality directly into the OS with the Restart Manager: if your application calls the RegisterApplicationRestart API, the OS will offer to restart your application if it crashes or becomes unresponsive.  This concept also shows up in the service restart options in the ChangeServiceConfig2 API (if a service crashes, the OS will restart it if you've configured the OS to restart it).
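
The "prevent crashes from spiraling out of control" condition can be sketched as a restart budget: allow only a fixed number of restarts inside a sliding time window, then give up. This is a minimal sketch in portable C++, not the Restart Manager's or the service control manager's actual policy. The RestartBudget class and its parameters are hypothetical names made up for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>

// Hypothetical helper (not a Windows API): decides whether a supervisor
// should restart a crashed process, based on how often it restarted recently.
class RestartBudget {
public:
    using Clock = std::chrono::steady_clock;

    RestartBudget(std::size_t max_restarts, std::chrono::seconds window)
        : max_restarts_(max_restarts), window_(window) {}

    // Call when a crash is observed at `now`. Returns true (and counts the
    // restart against the budget) if restarting is still allowed; returns
    // false if the process has already been restarted too often.
    bool allow_restart(Clock::time_point now) {
        // Forget restarts that have fallen out of the sliding window.
        while (!restarts_.empty() && now - restarts_.front() > window_) {
            restarts_.pop_front();
        }
        if (restarts_.size() >= max_restarts_) {
            return false;  // Crash loop: stop restarting and leave it down.
        }
        restarts_.push_back(now);
        return true;
    }

private:
    std::size_t max_restarts_;
    std::chrono::seconds window_;
    std::deque<Clock::time_point> restarts_;
};
```

A supervisor would call allow_restart(RestartBudget::Clock::now()) each time the child process dies and relaunch it only when the call returns true.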

I also agree with Eric's comment that asserts that cause crashes have no business living in production code, and I have no problems with asserts logging a failure and continuing (assuming that there's someone who is actually going to look at the log and can understand its contents; otherwise the logs just consume disk space).

 

But I simply can't wrap my head around the idea that it's ok to catch exceptions and continue to run.  Back in the days of Windows 3.1 it might have been a good idea, but after the security fiascos of the early 2000s, any thoughts that you could continue to run after an exception has been thrown should have been removed forever.

The bottom line is that when an exception is thrown, your program is in an unknown state.  Attempting to continue in that unknown state is pointless and potentially extremely dangerous - you literally have no idea what's going on in your program.  Your best bet is to let the OS exception handler dump core and hopefully your customers will submit those crash dumps to you so you can post-mortem debug the problem.  Any other attempt at continuing is a recipe for disaster.

 

-------

[1] To be clear: I'm not necessarily talking about C++ exceptions here, just structured exceptions.  For some C++ and C# exceptions, it's ok to catch the exception and continue, assuming that you understand the root cause of the exception.  But if you don't know the exact cause of the exception you should never proceed.  For instance, if your binary tree class throws a "Tree Corrupt" exception, you really shouldn't continue to run, but if opening a file throws a "file not found" exception, it's likely to be ok.  For structured exceptions, I know of NO circumstance under which it is appropriate to continue running.
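
The distinction drawn in this footnote can be sketched in portable C++. This is a minimal sketch under stated assumptions: FileNotFound, TreeCorrupt, read_first_line and load_or_default are hypothetical names invented for illustration, not part of any library, and for brevity the sketch treats any failure to open the file as "not found".

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical exception types, for illustration only.
struct FileNotFound : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct TreeCorrupt : std::logic_error {
    using std::logic_error::logic_error;
};

// Returns the first line of the file. Throws FileNotFound if the file
// cannot be opened (an expected, well-understood failure).
std::string read_first_line(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        throw FileNotFound("cannot open " + path);
    }
    std::string line;
    std::getline(in, line);
    return line;
}

// Runs `op` and recovers only from the one failure whose cause is fully
// understood. There is deliberately no catch (...): anything else, such as
// TreeCorrupt, propagates to the caller. If nobody catches it,
// std::terminate ends the process, which is the "crash" behavior.
template <typename Op>
std::string load_or_default(Op op, const std::string& fallback) {
    try {
        return op();
    } catch (const FileNotFound&) {
        return fallback;
    }
}
```

For example, load_or_default([] { return read_first_line("settings.txt"); }, "defaults") falls back when the file is missing, while an operation that throws TreeCorrupt is not swallowed.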

 

Edit: Cleaned up wording in the footnote.

Comments

  • Anonymous
    May 01, 2008
    I like the principle: "You should handle an exception only if you know what to do with it."

  • Anonymous
    May 01, 2008
    Doug: Works for me, but only for C++ exceptions (and RPC exceptions, which are essentially the same as C++ exceptions except they're propagated by SEH).

  • Anonymous
    May 01, 2008
    Larry, I think I may be a little unclear so I ask for your help. In your example of the binary search tree, if it throws a tree corrupt exception, what would be wrong with wiping out the tree, making a new tree, and starting over? Also, I assume that other exceptions, such as a thrown exception because an access database you are trying to connect to doesn't exist in which case you tell the user to either enter a new path or give them the option to close the program gracefully, are not what you are talking about. Thank you for your assistance in helping me understand. P.S. I like the idea of "handle an exception if and only if you know what to do with it."  But I would extend that to languages such as those of .Net and Java.  Then again, you may have managed environments in a whole new category.   JamesNT

  • Anonymous
    May 01, 2008
    Maybe I'm not reading you right, but are you saying that if someone powers down Google's datacenter and my C#-implemented browser gets some sort of TimeoutException from the TCP stack, the correct thing is for my browser to crash?  

  • Anonymous
    May 01, 2008
    JamesNT: That might be ok, IF you can guarantee that the only cause of the tree corruption failure is that the tree's internal state is corrupt.   But if the tree corruption error is thrown because of something else (I don't know, maybe it was because of an error in an underlying heap manager that was rethrown as a tree corruption error), you can't. And that's exactly my point.  When you encounter an exception you don't FULLY understand, you can make NO assumptions about the state of the process.  And the only safe action to take at that point is to die and let the OS restart you if possible. The "access database you are trying to connect to doesn't exist" scenario is analogous to my "file not found" example - in that case, the exception really isn't "exceptional", it's just a mechanism used by the database library to communicate an error and you handle it just like you handle any other error.

  • Anonymous
    May 01, 2008
    Reliability is a complicated thing.  There's a tradeoff between availability and integrity,  and that tradeoff becomes more severe as a system becomes larger and more distributed.  UNIX tends to choose availability over integrity,  and Windows does the opposite. You're more likely to find some funny characters at the end of a file on a UNIX system after a crash,  and more likely to have a Windows machine give up the ghost or let a badly written application lock up your desktop for a few minutes. Life-critical systems can't shut down just because something unexpected happened.  Neither can large scale web sites or e-commerce systems.  There's a whole art of system recovery,  partitioning of corruption,  and having the system stay in a 'sane' state that isn't necessarily correct. People have different expectations for desktop apps:  people expect to have them crash and lose their work.  That's one of the reasons why the world is giving up on desktop apps.

  • Anonymous
    May 01, 2008
    John: You're confusing exceptions and errors (it's really easy to confuse the two). Exceptions are supposed to be used to handle exceptional events (like corrupted internal state).  They're not the same as errors (which are used to express "normal" failures).   The only kind of exception handling that is unilaterally bad is structured exception handling (except in VERY limited circumstances like handling RPC failures and kernel mode probes of user mode addresses).   See my footnote: C++ and C# and Java exceptions might be ok IF you can guarantee you know the reason for the failure. I'm not aware of any networking stacks that use SEH to represent network failures.

  • Anonymous
    May 01, 2008
    It's not that hard: if it's a known exception that can be handled, handle it.  If it's an unknown/unexpected exception, crash and report.

  • Anonymous
    May 01, 2008
    From that article, I didn't get the idea that continuing from exceptions was considered a good practice. I only got the idea that using Watson alone to handle crashes is not sufficient. MSN

  • Anonymous
    May 01, 2008
    It depends on your application domain.  In my desktop, userland world, crashing is a wonderful option. In my brother's medical device world, crashing means a kid stops breathing.  The FDA kinda insists that software in such devices fails in a safe way.  Crashing isn't a safe way if it's providing life support to the user.  Restarting the process may or may not be depending on the situation. Aside from that, I agree that assertions only belong in debug builds.  But if you did have something assertion-like in a release build, then it should be treated just as critically as an exception.  If your assertion failed, then your program is in an unknown, illegal, or improper state--just as it would be if an exception is thrown.  Even if the assertion itself is the bug, others who wrote the code that follows may be counting on the assumption it represents.  Report and bail out.

  • Anonymous
    May 01, 2008
    I agree with the statement "You should handle an exception only if you know what to do with it.". Somehow, it seems to me that the more you try, the harder you fall. If you intend to make a system more reliable, the few failures will be even bigger headaches.

  • Anonymous
    May 01, 2008
    I do believe I see where Larry is coming from now.  Exceptions for things such as missing files, incorrect database passwords, and things of that nature you can handle yourself since either you know the answer, can give the user a chance to answer what needs to be done (i.e. enter the correct password or path), or can allow the program to exit gracefully. But for those exceptions where you don't have the slightest idea as to what could have happened, don't try to continue since that is analogous to ignoring there is a problem.  Let the program die in flames, then open up a formal investigation to see what happened. Programs that attempt to continue after an unknown exception actually sound dubious when you think about it. JamesNT

  • Anonymous
    May 01, 2008
    I think Eric just picked a really bad quote to start off his article with.  His article doesn't really advocate "catching the exception and try to correct it" as the quote may suggest.  The closest thing to it that was advocated was "retry"ing an operation, and the examples he described have nothing to do w/ catching an exception and correcting it. The overall point of the article is really to make error recovery less disruptive to the user experience, and that applications need to be written with that in mind. This kinda reminds me of the MobileSafari browser on the iPhone and iPod Touch.  There have been several times where it clearly crashed, but what happens is that the OS simply closes the browser without telling the user anything.  It builds up the crash dumps silently on the device and those get sent to Apple when you sync the device thru iTunes (of course they don't call it crash dumps, but something like "customer data to improve the software").  I honestly don't think this business of "hiding" the fact that it crashed is really that much of an improvement, but I can see users getting fooled into thinking that things are working better than they actually do, and well, user perception is king.  (As an anecdote, this doesn't always work anyway; once my iPod Touch actually wound up in a hard freeze that required a full power off/power on to reset the device.)

  • Anonymous
    May 01, 2008
    Hang on....under Windows I thought structured exceptions were the basic exception type, and that C++ exceptions were implemented as structured exceptions. But you're saying that C++ exceptions are not implemented as SEs?

  • Anonymous
    May 01, 2008
    Exceptions should be used for their original intended purpose: exceptional circumstances. All those C++ exceptions for "errors" like file not found are just unnecessary complexity; they should be replaced by error codes. When an application catches an exception, it should save as much of the work in progress as possible, and then exit and let error reporting take over.

  • Anonymous
    May 01, 2008
    Karellen, on some platforms C++ exceptions are implemented with SEH.  But that's one of those "ok" scenarios (as long as you never catch the SEH exception).

  • Anonymous
    May 01, 2008
    I spent a decade or so working on a large server application (one that you might remember).  It was written in C, meaning that C++ exceptions were not available, and used SEH to handle rare but survivable events: out of memory, database update errors, etc.  We very carefully filtered exceptions so that we only caught exceptions that we had raised.  System raised exceptions (e.g., bad pointer deref) we very carefully would not handle. By your argument above what we did was "profoundly stupid" because we dared to catch an SEH exception.  That doesn't seem right. I think you're conflating SEH exceptions with system exceptions, and C++ exceptions with app ones, but that isn't always the case. I prefer the "don't catch what you don't understand" rule above, with the proviso that most times you don't understand as much as you think.  If an app is adhering to the very strict rule "only catch what you raised yourself", why is it "stupid" to do so using SEH and perfectly OK using C++ exceptions?  Is it still stupid when SEH is the only exception mechanism available?

  • Anonymous
    May 01, 2008
    While I 100% agree with the main topic of your article, the footnote is somewhat erroneous. I assume that you are familiar with C++ exception safety guarantees theory. If C++ code provides at least basic exception safety (and if it doesn't you shouldn't be using it, just like you don't use code with known buffer overflows!) then catching any C++ exception is always safe. Whether it is wise to continue as if nothing had happened is another matter, but you will never get to the broken-invariants state that is possible with plain SEH. In C# and Java, where exceptions are used for both SEH-like and C++-like purposes, you are right: only some exceptions are safe.

  • Anonymous
    May 01, 2008
    The comment has been removed

  • Anonymous
    May 01, 2008
    I definitely agree that catching and eating something like an unexpected access violation is a really bad idea, and that the best course of action is to eventually crash out. What I have to take specific exception to, is the idea that the app should just defer to the OS dump mechanism. This is useless to those of us who can't get WinQual accounts and is fairly inefficient if you need to get information that isn't covered by a normal minidump. It's often very valuable to log application-level information and to present an enhanced explanation to the user, and for typical user-level applications I don't think this unreasonably impacts security. The WER dialog is pretty much useless to the user, whereas in an app-customized report I can often give some indication as to what might have triggered the crash and how to avoid it. What I would love is the ability to have the OS auto-launch a second process whenever it sees a crash in my app's process, with the app frozen so the second app can analyze it safely. Unfortunately, unless I'm mistaken, the main choices only seem to be in-process handling (SEH or WER callback), or outright termination in case of severe failures.

  • Anonymous
    May 01, 2008
    Phaeron: I've forwarded your suggestion to the dev lead for the Watson team, it's an interesting idea.

  • Anonymous
    May 01, 2008
    I think you've misread Eric's post and set up a bit of a strawman here. His point is that instead of crashing out to the OS exception handling, you should inform the user that an error has occurred and (at the limit, his 5th R) offer to get the user a new device. So fair enough; he has some fairly odd ideas about how to handle errors (perhaps open a browser window at dell.com to buy a new PC?), but his point is that you should let the user know an error has occurred; log it; return the user to a known state; and continue from there, rather than just crashing with a useless error dialog that most users will have no idea how to respond to. (Well, they do; just click "Don't Send") You've based your post on the idea that to "catch the exception and try to correct it" means to continue from an unknown state, which is not what he said.

  • Anonymous
    May 01, 2008
    Phaeron: Isn't that what happens if you install WinDbg as the system post mortem debugger instead of Watson, or have I misunderstood?

  • Anonymous
    May 02, 2008
    Phaeron: The Watson lead indicated to me that there is no cost to setting up a Winqual account, so he was confused about why a developer (or team of developers) wouldn't be able to get one.  Is it the effort of getting a code signing cert? Steve: My point is that if you "let the user know an error has occurred", you're running code while your process is in an unknown state (the code to let the user know and log the error).  Let the OS let the user know an error has occurred and restart your app.   The Watson team has literally spent years working on figuring out how to reliably and safely dump core from a corrupted process; it's an extraordinarily hard problem whose solution should not be re-invented.

  • Anonymous
    May 02, 2008
    IIRC, one valid use of catching and continuing SEH exceptions was when using VirtualAlloc to reserve a huge array and then only commit a page or so at a time.

  • Anonymous
    May 02, 2008
    Upon re-reading Eric's article, I think Larry has misread his intent.  Eric is talking about error handling in environments that must be resilient.  Many of his examples are of non-exceptional errors. And other than his first R (Retry to get past a transient condition), the rest of his advice (Restart, Reboot, Reimage, Replace) is about getting the process and system back into a valid state. I don't think the points of view you two are arguing are that different.  I think he starts out provocatively, and that's what a lot of people are reacting to.  He's not saying you can trust the state.  He's saying you have to keep your service up, so find a way to get back to a known good state.

  • Anonymous
    May 02, 2008
    "let the OS exception handler dump core" Or have your own exception handler dump core.

  • Anonymous
    May 02, 2008
    The comment has been removed

  • Anonymous
    May 02, 2008
    The comment has been removed

  • Anonymous
    May 02, 2008
    Catch an exception, continue running, but only to save the current document - and then exit. For that to work you need, of course, to keep the document in a consistent state in memory...

  • Anonymous
    May 02, 2008
    Tony: That's the rub.  How do you know that the document is in a consistent state in memory? It's not an easy challenge.  That's why I'm continuing to state that crashing is the right thing to do.

  • Anonymous
    May 02, 2008
    The comment has been removed

  • Anonymous
    May 02, 2008
    >[Larry]: How do you know that the document is in a consistent state in memory?  It's not an easy challenge.  That's why I'm continuing to state that crashing is the right thing to do. I read that as "it's hard, therefore we shouldn't try".  But I know that's not what you mean, because you have some great examples of exception handling done right. I think many of the commenters on both blogs (Eric's and yours) are looking at the problem too coarsely; too black and white.  How do you know that the document is in a consistent state in memory?  By designing in consistency checks.  Little extra validation routines that take advantage of a little extra redundancy built into your document's memory representation. This isn't really novel work.  It's just work that hasn't traditionally been a priority at Microsoft outside of specialized teams.  But I believe, and I think this is Eric's point as well, that Trustworthy Computing includes Highly Available Software and that we need to tackle the hard challenges associated with that.

  • Anonymous
    May 02, 2008
    One question. When the JIT translates callvirt into x86 assembly, it turns it into "mov eax, [ecx]" followed by "call whatever". The mov will raise an exception if 'this' is null, because the processor will attempt to load from address 0 into eax. However, this exception is caught and eventually turned into a NullReferenceException. But [ecx] could equally reside outside the committed memory, and that would be turned into a NullReferenceException too.

  • Anonymous
    May 03, 2008
    Badar: I think that it operates on the assumption that in the managed code all pointers to objects must be valid or null because you have no way to generate pointer to object which points into some random memory.

  • Anonymous
    May 03, 2008
    How does Watson decide what memory to include in the dump sent to Winqual? Sometimes I needed to check a structure whose reference was passed as a parameter a little above in the call stack. Unfortunately, most of the time that memory was not included in the dump.

  • Anonymous
    May 03, 2008
    Alan, I'm willing to concede that for some exceptions it may be possible to catch them and terminate after saving state.  But for many of them (like STATUS_ACCESS_VIOLATION, which is likely to be the most common one) there is no safe way of running ANY additional code in the process.  We have far too many examples of security vulnerabilities caused by people trying to be resilient in the face of an access violation error for that kind of practice to be considered safe. Jiri, by default Watson generates a minidump, which consists of the stacks for each of the threads and the thread contexts for each thread.  You may add additional data to the dump with WerRegisterMemoryBlock.  See here for more details: http://msdn2.microsoft.com/en-us/library/ms678713(VS.85).aspx Phaeron: As far as I know, the NT kernel doesn't have many asserts that are live.  Some of them (paged fault at raised IRQL) are sort-of asserts, but they exist because (a) in all circumstances, a page fault at IRQL2 is a bug, and more importantly (b) there is no way to satisfy the page fault.

  • Anonymous
    May 03, 2008
    > You may add additional data to the dump with WerRegisterMemoryBlock. What I would like is equivalent of MiniDumpWithIndirectlyReferencedMemory. While that might be doable with the WerRegisterMemoryBlock and manual stack walk, I do not think that attempting to do that from the exception handler or crashing process is a good idea.

  • Anonymous
    May 03, 2008
    Phaeron: The main reason the NT Kernel bluescreens (i.e. 'asserts') is to protect the disk data and metadata from corruption. I disagree with Larry's point that assertions should not be in released code.  I just think we don't do it for performance reasons.  

  • Anonymous
    May 04, 2008
    >[Larry]: But for the many of them (like STATUS_ACCESS_VIOLATION, which is likely to be the most common one) there is no safe way of running ANY additional code in the process. I think that statement also is too absolute (too black and white).  Even Dave LeBlanc points out (in a post linked above) that it's not about risk avoidance, it's about risk management -- shades of grey. But you yourself point out the perfect counterexample: catching access violations while probing parameters across an untrusted->trusted boundary.  In other words, there are patterns and practices where even catching access violations, in a controlled way, can increase both resiliency and security.

  • Anonymous
    May 04, 2008
    The comment has been removed

  • Anonymous
    May 04, 2008
    The comment has been removed

  • Anonymous
    May 04, 2008
    The comment has been removed

  • Anonymous
    May 04, 2008
    Alan, ah, that makes a lot more sense to me - I use locally scoped exception handlers a lot (I live in an error-code based world where exceptions are evil).  As I've said before, in certain limited scenarios (kernel/user parameter probes, RPC error handling, etc) locally scoped handlers can be quite useful.   I HAD assumed (based on the recovery behaviors that Eric was proposing) that you and he were promoting the idea of global top level exception handlers.

  • Anonymous
    May 05, 2008
    The comment has been removed

  • Anonymous
    May 07, 2008
    @HagenP: Microsoft typically can debug problems that happen on end-user PCs if the markers are clear enough (i.e. the crash happens close enough to the cause of the corruption) through the Watson facility.  From the perspective of a developer trying to diagnose crashes in the field, having something crash early is good. And by the time the product ships, there should be no known states that the programmer coulda, woulda, shoulda covered, but didn't.  If the feature in question is not of that quality level, then it should be pulled out or the release delayed.  At least this is the case in the commercial world.

  • Anonymous
    May 08, 2008
    @nksingh: "And by the time the product ships, there should be no known states that the programmer coulda, woulda, shoulda covered, but didn't." Of course we must apply the statement to all software involved, including OS, drivers, etc. If all these are covered, all known states handled, then the only thing LEFT to cause a crash is faulty hardware, correct? "If the feature in question is not of that quality level, then it should be pulled out or the release delayed.  At least this is the case in the commercial world." I totally agree with you here. The key word here being "should". Unfortunately, modern software is so complex that with this precondition nothing could be shipped anymore. Including Operating Systems.

  • Anonymous
    May 08, 2008
    Tell me about it.  I've just recently started working with an internal project maintained by another group of developers.  All of their public APIs are wrapped in code like this:

        __declspec(dllexport) BOOL SomeFunction(...)
        {
            BOOL bResult = FALSE;
            __try
            {
                __try
                {
                    // do stuff
                }
                __finally
                {
                    // clean up any resources
                }
            }
            __except(EXCEPTION_EXECUTE_HANDLER)
            {
                // log some generic exception message
            }
            return bResult;
        }

    And they validate all of their parameters with IsValidWritePtr().  Needless to say I hate my life.

  • Anonymous
    May 08, 2008
    The comment has been removed

  • Anonymous
    May 08, 2008
    Wow Simon, long time no hear :).

  • Anonymous
    May 08, 2008
    I think that, like many other issues in writing commercial software, this one is compounded by conflicting interests from the business side and the software side: Business side wants software that never crashes, ever. Creates a better customer experience. Software side wants software that is easier to debug, since it will be more maintainable and improve faster. However, speaking from a purely theoretical standpoint, is it not possible to have an exception handler that is located in read-only memory, allocates its own memory for storing variables to avoid corruption, and that examines the data included with the exception and the entire program state to determine exactly what is inconsistent about it, and possibly restore it to a consistent state? Is this something the Watson team have considered? Or would it be deemed a potential security vulnerability?

  • Anonymous
    May 10, 2008
    The comment has been removed

  • Anonymous
    May 11, 2008
    The comment has been removed

  • Anonymous
    May 12, 2008
    Jay: I'm not on the Watson team, so I can't explain the problems.  But the Watson folks have been trying to solve the problem (generating reliable crash dumps from corrupted processes) for a long time and it's still not perfect.  It's a very hard problem, and thus not one to be undertaken lightly.   Friday: Interesting, I've never had a problem with Word's autorecovery.  Go figure. Triangle: You can't make any assumptions when an exception happened.  You might be able to write out the minidump, but how would you "fix" the state of memory?

  • Anonymous
    May 12, 2008
    The comment has been removed

  • Anonymous
    May 12, 2008
    Triangle: On a platform with DEP (or NX or W^X) enabled, you might (not always, but sometimes - it depends on what was happening at the time of the crash).  The problem is that Windows runs on platforms where DEP is not supported, and DEP is optional - not every process running on the machine has NX enabled. I'm not an area expert, but I have talked to the area expert, and he's pretty adamant that you can't trust the state of the process.

  • Anonymous
    May 12, 2008
    Triangle: I suspect one of the problems is that it isn't just your exception handling code that has to work.  If your handler is going to actually do anything, it has to call API functions, and (it is my understanding that) the API libraries also keep data in the process memory space. If the system heap is corrupt, or the handle table, or whatever, most API functions aren't going to be safe to call, and it is likely to be difficult to work out which ones are. Microsoft might be able to address this by creating a separate limited API for this specific purpose, but probably the only way to get very much done safely would be to launch a new process to examine the crashed one, as Phaeron suggested.

  • Anonymous
    May 12, 2008
    Harry, it's my understanding that the watson guys solved it by writing an app (werfault) that generates the dump information and processes it.  Essentially the same idea but much more reliable.

  • Anonymous
    May 13, 2008
    Larry, I did read your footnote, and it said "For some C++ and C# exceptions, it's ok to catch the exception and continue", but that "For structured exceptions, I know of NO circumstance under which it is appropriate to continue running." Given that we were writing in C, the C++ and C# exception handling mechanisms were not available.  If we wanted to use exceptions at all, SEH was our only option.  You know as well as I that "it will all be fine if you re-write your entire app in {Lisp|Prolog|C#|today's new language}" is rarely, if ever, a feasible alternative. We were as careful as possible (given the bugs in the SEH mechanism at the time!) to catch only those exceptions that we had raised, and what we did still doesn't seem "profoundly stupid". You could argue that C++ & C# exceptions are safer than SEH, for the standard reasons that separate address spaces are better, in that they avoid accidental collisions between different bodies of code. I think it's better, though, to argue that you should only handle your own exceptions.  C++ and C# might make that automatic, but when they're unavailable it's still possible to use SEH properly.  That's not what you argued, though. Don P.S.  In the 14 years I worked on that code, the Windows guys never once complained to me about exception handling, although possibly only because they were too busy complaining about how I used heaps.

  • Anonymous
    May 13, 2008
    Don, 14 years ago, you probably made the right decision.  And if your exception filter is carefully constructed, it's possible to do it right.   It's also possible to ride a motorcycle without a helmet or to walk a tightrope from one 100 story building to another or to strap a jet engine on your back and rollerskate. But I wouldn't advise any of the above unless you were REALLY careful and knew exactly what you were doing.

  • Anonymous
    May 14, 2008
    The comment has been removed

  • Anonymous
    May 14, 2008
    The comment has been removed

  • Anonymous
    May 14, 2008
    Larry, from MSDN, for RegisterApplicationRecoveryCallback: "If the application encounters an unhandled exception or becomes unresponsive, Windows Error Reporting (WER) calls the specified recovery callback." To me, that looks as if the crashed application is called.  I understand that another instance can be created automatically by the manager later. My problem (and yours, if I understand your article) is with the idea that it can ever be a good idea to try to continue when an unknown error has occurred. Restarting the app is ok, the application recovery callback is the stupid idea.

  • Anonymous
    June 08, 2008
    So if I write a kernel mode driver to access the CPU MSR registers, it is OK to BSOD the OS because someone using the driver has attempted to access an MSR register which doesn't exist on a certain CPU? Bear in mind that new MSRs are added as new CPUs are made, and the driver cannot be responsible for validating the input. The only thing the driver can do is to catch the exception and return an error, just like it does at present:

        __try
        {
            __writemsr(reg, value);
        }
        __except (EXCEPTION_EXECUTE_HANDLER)
        {
            return GetExceptionCode();
        }

    I also prefer the early crash but sometimes it is not necessary. As for Restart Manager, it is a nice idea but useless in its current incarnation. Once the application crashes the context has been lost. If there was no (auto) save and thus the work has been lost, then the user can restart the application on his own; there is no benefit in offering to restart it automatically. What would be useful, however, is if the Restart Manager could periodically take snapshots of the registered and running application and restore the snapshot on crash.

  • Anonymous
    June 10, 2008
    The comment has been removed

  • Anonymous
    June 16, 2008
    Monday, May 12, 2008 10:23 AM by LarryOsterman "Friday: Interesting, I've never had a problem with Word's autorecovery.  Go figure." Interesting.  I guess that means it works in English language versions of Word.  I've never seen it work though, just like Friday.
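
One suggestion from the comments above, "designing in consistency checks" backed by "a little extra redundancy" in the document's memory representation, can be sketched in portable C++. This is only a sketch under stated assumptions: the Document class, its FNV-1a checksum, and safe_to_emergency_save are hypothetical names chosen for illustration, and the checksum is recomputed in full on every edit for simplicity. It also does not answer the objection raised in the thread: the check itself still runs inside the suspect process, and it only detects damage to the data it covers.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory document that carries a little redundancy:
// a checksum maintained alongside the text it describes.
class Document {
public:
    void append(const std::string& line) {
        lines_.push_back(line);
        checksum_ = compute_checksum();
    }

    // True if the stored checksum still matches the content.
    bool is_consistent() const { return checksum_ == compute_checksum(); }

    // Demo hook only: simulates a stray write by flipping one bit in the
    // last character of the last line without updating the checksum.
    void corrupt_for_demo() {
        if (!lines_.empty() && !lines_.back().empty()) {
            lines_.back().back() ^= 0x01;
        }
    }

private:
    static constexpr std::uint32_t kFnvOffset = 2166136261u;
    static constexpr std::uint32_t kFnvPrime = 16777619u;

    // 32-bit FNV-1a over every line, with a separator byte mixed in
    // after each line so that line boundaries affect the result.
    std::uint32_t compute_checksum() const {
        std::uint32_t h = kFnvOffset;
        for (const std::string& line : lines_) {
            for (unsigned char c : line) {
                h ^= c;
                h *= kFnvPrime;
            }
            h ^= 0xFFu;
            h *= kFnvPrime;
        }
        return h;
    }

    std::vector<std::string> lines_;
    std::uint32_t checksum_ = kFnvOffset;  // checksum of an empty document
};

// Attempt a last-chance save only if the document passes its own check;
// otherwise the caller should skip the save and just exit.
bool safe_to_emergency_save(const Document& doc) {
    return doc.is_consistent();
}
```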