Error handling, part 4: error information recap
Picking up the series on error handling again, here is a summary of what a good error handling system should contain and do (starting with a recap of the previous posts and then going further):
- Errors should be nestable, with the high-level information at the top and the full details at the bottom.
- Errors should contain a machine-readable ID (a numeric ID or a class).
- Errors should contain a human-readable explanation of the error and of the ways to fix it. When the error involves some named object (such as a file), the text of the error should contain the name of the object.
- The human-readable explanation should be localizable.
- The error should contain a human-readable constant ID. This might be the same as the machine-readable numeric ID, or a separate numeric ID used by the localization subsystem, or some fixed text string in ASCII. It allows the support engineers to understand the error messages even if they're reported by a user in a different locale.
- The machine-readable ID should not be tied to the formatting of the message. I.e. if at some point the developer decides to add one more piece of formatted information to the error message, that usually changes the message ID used by the localization subsystem. But it should not change the machine-readable ID, or all the code that uses that ID would have to be recompiled.
- The ID namespace should be modular. I.e. each software module should have its own namespace for defining the error IDs, and the full ID must include the module identity and the error identity.
- There should be a way to request only the machine-readable error ID for the machine-handled errors from the frequently called functions. Formatting the messages is expensive, so if a function is to be polled 100K times a second, you really wouldn't want it to format the EAGAIN as a human-readable message. Even with approaches like ETW, which pass the arguments through and leave the formatting itself until much later, the same problem still applies: just building the arguments may be more expensive than the formatting. So you'd still want to get the machine-readable ID without an ETW message.
- There should be an easy way to log the errors, to get their text for displaying in the dialog windows, and to read the written logs.
- When the errors are reported or logged, they should carry information about where they came from. I.e. if you have 50 threads reusing some library code, when you get an error, you'd want to know in which thread it happened. I'm not entirely sure yet whether this should be a part of the errors themselves or be added during logging. My first cut at it adds this information when the error gets reported (i.e. written to a log), and it seems to work fairly well, but the other way may have its advantages.
- There should be a way to report the messages of varying severity (errors, warnings, informational), combine them in the nestable form, and propagate the severity to the root of the nesting.
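To make a few of these points concrete, here is a minimal sketch in Python. It is not one of the subsystems mentioned in this series; every name in it (Err, Severity, wrap, poll_once, the "app", "fs" and "io" module names and their numeric codes) is made up for the illustration. It shows nestable errors, a machine-readable ID split into a module namespace and a code, severity propagating to the root of the nesting, and a cheap path that returns only the ID without formatting any text. Localization, logging, and the thread information are left out.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    # Hypothetical severity levels; a higher value is more severe.
    INFO = 0
    WARNING = 1
    ERROR = 2


@dataclass
class Err:
    # (module, code) together form the machine-readable ID. It is kept
    # separate from the text, so the text can change without affecting
    # the code that checks the ID.
    module: str
    code: int
    text: str
    severity: Severity = Severity.ERROR
    nested: List["Err"] = field(default_factory=list)

    def full_id(self) -> str:
        # The full ID includes the module identity and the error identity.
        return f"{self.module}:{self.code}"

    def wrap(self, child: "Err") -> "Err":
        # Nest the detailed error under this higher-level one and
        # propagate the worst severity up to the root.
        self.nested.append(child)
        self.severity = max(self.severity, child.severity)
        return self

    def render(self, indent: int = 0) -> str:
        # High-level information on top, full details below it.
        pad = "  " * indent
        lines = [f"{pad}[{self.full_id()}] {self.severity.name}: {self.text}"]
        for child in self.nested:
            lines.append(child.render(indent + 1))
        return "\n".join(lines)


# Assumed ID for a "try again" condition in a made-up "io" module.
EAGAIN_ID = ("io", 11)


def poll_once(detailed: bool = False):
    # A stand-in for a frequently polled function that has nothing to
    # return yet. By default it hands back only the machine-readable ID
    # and formats no message at all.
    if not detailed:
        return EAGAIN_ID
    return Err("io", 11, "resource temporarily unavailable, retry later",
               Severity.INFO)


low = Err("fs", 2, "cannot open file 'config.txt': no such file",
          Severity.WARNING)
top = Err("app", 100, "failed to load the configuration",
          Severity.INFO).wrap(low)
print(top.render())
```

In this sketch the root message starts out as informational and becomes a warning once the nested warning is attached, which is the propagation described in the last item of the list. A real system would probably also want the text built lazily or looked up by a localization ID, which the sketch does not attempt.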
So far I've made a couple of attempts at writing error reporting subsystems. Neither of them implements all of these principles (I've only formulated them just now), but each covers a decent subset. I'll talk about them in the next installments.