Summary
In this module, we've discussed the post-incident review in depth. This is one of our most powerful tools for turning the incidents we all experience into the fuel for leveling up our operations practice. It's a key part of increasing our reliability.
We've explored some of the aspects of complex systems that make incidents inevitable. Given this inevitability, it makes sense to focus not only on trying to prevent a catastrophe, but also on how we can respond to one. That gives us an incentive to find and use tools that can help improve our response as part of the analysis phase in the incident lifecycle.
This is where the post-incident review comes into play. After getting a good sense of what a post-incident review is (and isn't) and what purpose it serves, we explored the characteristics and components that make it effective.
Next came a discussion of the process and how to get started using tools available in Azure.
To improve the chances of success, we then explored how to avoid the common traps people fall into when running post-incident reviews, along with some good practices that can help make your own reviews effective.