Introduction

The Dickerson Hierarchy of Reliability offers a map for navigating reliability challenges: what needs to be addressed and in what order. As with other hierarchies of this sort, the level you're on needs to be solid before you move up the pyramid.

This module addresses the level roughly in the middle of the pyramid. Having addressed your monitoring and your incident response (perhaps with the help of other Learn modules in this learning path), you now have the opportunity to focus on principles and practices that can help you level up your operations practice.

In this module, we'll be focusing on post-incident reviews that can help you learn from failure, resulting in improved reliability.

When you've completed this module, you will:

  • Discover the importance of learning from incidents.
  • Understand the aspects of complex systems that make learning from failure important.
  • Learn when and how to conduct a post-incident review.
  • Understand the purpose and goals of a post-incident review.
  • Know the components that go into a good post-incident review.
  • Be aware of common traps to avoid.
  • Identify helpful practices to conduct a better review.

An introductory story

To set the scene for this module, here's a true story (or half of it, actually; we'll get to the second part later in this module):

During World War II, the B-17 "Flying Fortress" aircraft was involved in a series of accidents. We don't know all of the details of these accidents, and we don't know exactly how many there were. It was wartime, and many of the details were secret and remain secret. What we know is that there were a significant number of similar incidents, involving many individual aircraft and—if it helps when talking about such a serious topic—we can be almost certain that nobody was gravely injured in any of them.

In each case, what would happen is this: a B-17 would come in to land, would land successfully, and then, either on the runway or while taxiing back to the hangar, something strange would happen. Something serious would happen. The B-17 would be on the ground, and all of a sudden the landing gear would retract and the plane would collapse onto the runway.

In each case, the investigators would look for evidence of mechanical or electrical failure, and in each case, they couldn't find any. So, what they concluded was that this was a case of pilot error, that the pilots had mistakenly retracted the landing gear.

Here are two additional pieces of information. First, the investigators were correct that no mechanical or electrical failures had occurred. Second, the accidents kept happening.

This information might leave you dissatisfied with the initial conclusion reached about these accidents, and perhaps wondering whether this is the whole story. In this module, we're going to propose that something is missing from this conclusion and from the investigations that led to it.