A critical junction in support issues: Root Cause vs. Relief
In the lifetime of a case, particularly one of high impact, a point is reached where a decision is necessary regarding the direction of the case. This decision impacts how the situation is approached going forward. The decision that must be made is:
“Do I want Root Cause or Relief?”
This decision is important because there are trade-offs that have to be made.
You might ask, “What is the difference?” In the case of relief, the goal is to restore service as quickly as possible: determine what is failing and stop the failure from occurring. In the case of root cause, effort is put forth to understand the full sequence of events and conditions causing the failure in order to pinpoint the specific action that resulted in it. Action is then taken to address the real reason for the failure, not just its symptoms.
So here is a clichéd analogy (BTW, this is a fictitious story):
Every 2-3 months my car’s voltage regulator goes bad. An average mechanic will simply locate the failed part and replace it, and every 2-3 months the process is repeated. A better mechanic will realize after the second time that something must be causing the voltage regulator to go bad. After diagnosing the system she discovers the alternator is not producing enough voltage to satisfy the needs of the vehicle. The alternator is not performing within spec because the coils are starting to corrode, so she replaces the alternator.
When it comes to computers, there is an additional challenge to Root Cause Analysis (RCA) vs. relief. The process of providing relief (replacing the voltage regulator) in most cases destroys the information necessary (the alternator) to perform root cause analysis. Relief may be to reboot the computer, but once the system is available again there may not be enough information in the various logs to determine what happened. Why not? Well, that is the never-ending dilemma of performance versus supportability. There are strong arguments on both sides of that debate that I don’t want to cover here. RCA is a labor- and time-intensive process: collecting information and examining the system in a failed state requires additional time, which isn’t acceptable to companies operating under Service Level Agreements (SLAs). In the majority of cases it takes multiple occurrences of the problem in the customer’s environment before RCA can be performed.
It is a common practice of mine to be very clear with customers when we arrive at that critical junction. This typically comes when state information could be lost as a result of the actions that provide relief (rebooting, killing a process, restoring, etc.). So can you have both Root Cause and Relief? I have to acknowledge the possibility; however, in my experience the same steps provide both less than roughly 15% of the time.
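One way to soften the trade-off is to grab whatever state is cheap to collect before taking the relief step. The sketch below is only an illustration under stated assumptions: `capture_state` and `relieve_with_snapshot` are hypothetical helper names, the log paths are whatever you choose to pass in, and copying a few logs is far less than a memory dump or a live debug of the failed system, so it will not guarantee root cause.

```python
import json
import platform
import shutil
import time
from pathlib import Path


def capture_state(dest_dir, log_paths=(), notes=""):
    """Save a small snapshot (hypothetical helper) before a relief action."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, log_paths):
        if src.is_file():
            # copy2 also copies timestamps, useful when rebuilding a timeline.
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    manifest = {
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "host": platform.node(),
        "platform": platform.platform(),
        "notes": notes,
        "copied_logs": copied,
    }
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def relieve_with_snapshot(relief_action, dest_dir, log_paths=(), notes=""):
    """Capture state first, then run the relief action (restart, kill, restore)."""
    manifest = capture_state(dest_dir, log_paths, notes)
    relief_action()  # anything this destroys has already been copied
    return manifest
```

The ordering is the whole point: the snapshot is taken before the relief action runs, so the time it costs is known up front and can be weighed against the SLA.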
I am curious to hear back how Root Cause vs. Relief is perceived and valued by your company when dealing with issues.
Comments
Anonymous
January 30, 2005
The real question is how do you satisfy a customer that wants both RCA and Relief? More often than not we don't have the luxury of providing an RCA because of SLAs, but we are still required to provide one even though we don't know what it really was.
Anonymous
January 31, 2005
But at what point is it a loss for Microsoft? If you do not fully investigate the RCA and just provide Relief, Microsoft has just lost out on the chance to further their ability to root out the causes of software issues. For example, right now I have been working with MS on a case for over 43 hours (phone hours). MS has a relief answer, but I am determined to get the RCA because I want to further our database of troubleshooting and to save us the call in the future. After all, if I experience the issue once, chances are good that it will come back in the future, and I need to save my company the $245 support call. On the other hand, I have had issues where I just want relief and no RCA. So it is a 50/50 issue, I guess.
Anonymous
February 01, 2005
Steve, you are absolutely right. Microsoft does consistently lose out on performing RCA on issues. It has to do with the decisions that are made by the customer during the incident regarding pursuing RCA or relief. We have to honor what the customer wants to do, and we can only be diligent about communicating this when we believe the junction has been reached, where further steps would jeopardize the ability to provide RCA.
Unfortunately with computers, you typically lose evidence of the problem in the steps of providing relief... catch-22...
Anonymous
February 02, 2005
The comment has been removed
Anonymous
February 03, 2005
For me it depends in large part on two factors.
1. The level of business impact the issue is causing. If it is a critical system, root cause (while important) generally has to take a backseat to relief. If the problem has happened before, the balance shifts somewhat I suppose.
2. The confidence level of the engineer (and of me in the engineer's ability) that, in losing RCA data, the steps will actually provide relief.
Anonymous
February 09, 2005
It seems to be a catch-22 for either side. Businesses cannot afford the downtime and don't care to investigate the RCA. On the other hand, IT typically has to be able to identify the root cause to properly fix the problem.
Event logs and performance logs can typically lead you to the precise cause of failure. But in some cases, it may take an additional call to MS. It's important to relay everything that has happened to the support engineer in order to get a solid RCA.
All you're really doing when contacting MS is bringing in an additional resource for assistance. All they know is what you tell them. The same goes for any outsourced consultant.
Although uptime is the most important thing, identifying the root cause has to be addressed at all times to fully support and update any SLA.
Anonymous
February 10, 2005
The comment has been removed