Measuring Risk in Application Portfolio Management

I decided to take a few minutes of my vacation time to catch up on my reading, and I read through Mike Walker's article on MSDN on APM and EA.  It is an interesting and useful article. (I'd give it a B-).

One thing that I'd like to highlight in the practice of application portfolio management is risk management, an area that Mike implicitly touches on, but which I believe is fundamental to the business case for APM.

You see, there is nothing wrong with owning a bunch of stuff.  Think about it: how many chairs does your company own?  How many desks?  How often does the company spend money to replace every chair in every office?  If your business is typical, the answer to that question may very well be "never."

Yet, we do see projects where a company will replace four billing systems with a single billing system.  That happens.  Clearly, owning an application portfolio is different from owning an inventory of assets.

Key among the differences is risk... especially risk to business continuity.  There are many other factors, of course, and Mike covers some of them quite well in his article, but I want to focus on risk and risk management.

There is a substantial intersection between Application Portfolio Management and Risk Management.  However, I suspect that some folks who read this may not be familiar with the field of risk management.  From Wikipedia, here is a fairly good definition:

Risk management is the human activity which integrates recognition of risk, risk assessment, developing strategies to manage it, and mitigation of risk using managerial resources.

The strategies include transferring the risk to another party, avoiding the risk, reducing the negative effect of the risk, and accepting some or all of the consequences of a particular risk.

By way of example:

When you look at an inventory of chairs, you have risks.  If a chair gets old, and breaks, and an employee is injured, then the business faces insurance claims.  Morale suffers.  Productivity may decline due to lost work time and morale.  If the incident is public, then the company's reputation may suffer.

Managing that risk involves understanding the kinds of things that can go wrong (falls, wounds, productivity decline, etc.) and determining the factors about a chair that may lead to them (poor condition, missing parts, wobbling, etc.).  If you collect this information about your inventory, and then you group your chairs according to these attributes, you might get a few classes of chairs: excellent, workable, frail, and dangerous.

With each category, you can determine the risk to the business for owning it.  Clearly, the risk to own dangerous chairs is higher than the risk to own workable chairs.  While it doesn't make sense to replace every chair, these statistics can provide an excellent business case for replacing the dangerous chairs (right away) and the frail chairs (over a finite period of time).

We use essentially the same process for applications.

What are the things that can happen to the business if an application fails?  Let's list out those things, and then create a set of attributes that an application has that will help to differentiate some applications from others.

Risk scenarios --> Attributes --> Data collection --> categorization

Within each category, you can determine the risks to the business that need to be mitigated.

Note that you can have many hierarchies, many categorizations.  You can group applications by their lifecycle stage (Strategic, Core, Maintain, and Sunset), and that is certainly useful for combining APM with PPM.  In other words, it is useful to know how much of your planned budget is devoted to improving strategic applications.  (Mike mentions this in his MSDN article, with different definitions that we don't actually use internally in Microsoft IT.)

Another useful categorization is application impact on operations.  The attribute to measure is the speed at which a failure would impact the operations of the business:

  • Instant (< 6 hours)
  • Immediate (< 2 days)
  • Rapid (< 10 days)
  • Serious (< 60 days)
  • Corrosive (within 9 months)
  • Hidden (gradual impacts on quality of customer experience or regulatory compliance)
  • Competitive (no impact on operations, but potential impact on ability to compete)
  • None (no one will miss this app if it goes away)

This is far more useful than a subjective measure like "strategic" or "core" when determining the value of investment in an application, and it also shows something else: the serious problems that may arise from a lack of investment.
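To make the bands listed above concrete, here is a minimal Python sketch that maps an estimated time-to-impact onto them.  The names (Impact, classify_impact, hours_to_impact) are hypothetical, and the sketch assumes that "9 months" means roughly 270 days.  The last three categories are not time based, so the sketch expects a person to assign them by hand through the fallback argument.

```python
from enum import Enum
from typing import Optional


class Impact(Enum):
    INSTANT = "Instant"
    IMMEDIATE = "Immediate"
    RAPID = "Rapid"
    SERIOUS = "Serious"
    CORROSIVE = "Corrosive"
    HIDDEN = "Hidden"
    COMPETITIVE = "Competitive"
    NONE = "None"


# Upper bounds, in hours, for the time-based bands (taken from the list above).
_TIME_BANDS = [
    (6, Impact.INSTANT),
    (2 * 24, Impact.IMMEDIATE),
    (10 * 24, Impact.RAPID),
    (60 * 24, Impact.SERIOUS),
]

# Assumption: "within 9 months" is treated as 9 * 30 days.
_CORROSIVE_LIMIT_HOURS = 9 * 30 * 24


def classify_impact(hours_to_impact: Optional[float],
                    fallback: Impact = Impact.NONE) -> Impact:
    """Return the impact band for an estimated time until operations suffer.

    Pass hours_to_impact=None when a failure has no measurable operational
    deadline; the caller then supplies Hidden, Competitive, or None by hand.
    """
    if hours_to_impact is None:
        return fallback
    for limit, band in _TIME_BANDS:
        if hours_to_impact < limit:
            return band
    if hours_to_impact <= _CORROSIVE_LIMIT_HOURS:
        return Impact.CORROSIVE
    return fallback


if __name__ == "__main__":
    print(classify_impact(4).value)
    print(classify_impact(None, fallback=Impact.HIDDEN).value)
```

The point of encoding the bands this way is that the only judgment call left is the estimate itself: how long could the business run without the application?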

A terrific example was described in CIO magazine a while back: Comair Airlines kept putting off investments in a new crew management system, only to have the system crash during a heavy Christmas season, a failure that literally grounded the airline.

No one in the business would have considered a crew scheduling system to be 'strategic' and so an investment portfolio that breaks things down by how 'strategic' an application is would not have favored the replacement of that application.  On the other hand, a categorization that captures the application's impact on operations would clearly have placed that application in the Immediate category. 

Of course, correct categorization is only the first step.  Now you have to determine the risk of failure.

Categorization --> risk of failure --> cost to business --> priority for mitigation

By determining how likely an application is to fail, based on its risk categorization, you can select the applications that most need attention.  Now, that attention does not have to involve a rewrite.  There are lots of ways to mitigate risk.  You can transfer the risk by making someone outside the business responsible for handling that business capability.  You can reduce the risk of failure by introducing redundancy or failover.  You can reduce the cost to the business by moving non-essential decisions off one application and onto another, more reliable, application.
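One common way to turn likelihood of failure and cost to the business into a priority list is to rank applications by expected loss, meaning the likelihood multiplied by the cost.  That formula is an assumption for this sketch, not something the process above prescribes, and the application names and figures below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class AppRisk:
    name: str
    failure_probability: float  # hypothetical: chance of a serious failure in a year, 0..1
    cost_of_failure: float      # hypothetical: estimated cost to the business per failure

    @property
    def expected_loss(self) -> float:
        # Assumed scoring rule: likelihood times cost.
        return self.failure_probability * self.cost_of_failure


def prioritize(apps: Iterable[AppRisk]) -> List[AppRisk]:
    """Order applications so the largest expected loss comes first."""
    return sorted(apps, key=lambda app: app.expected_loss, reverse=True)


if __name__ == "__main__":
    # Invented portfolio, for illustration only.
    portfolio = [
        AppRisk("intranet wiki", 0.50, 10_000),
        AppRisk("crew scheduling", 0.20, 20_000_000),
        AppRisk("billing", 0.05, 5_000_000),
    ]
    for app in prioritize(portfolio):
        print(f"{app.name}: {app.expected_loss:,.0f}")
```

Notice that the application most likely to fail (the wiki) lands at the bottom of the list, because almost nothing is lost when it does.  Mitigations that lower either number, such as failover for the probability or moving decisions elsewhere for the cost, can be compared by how far they move an application down this ranking.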

Mitigation review --> Comparison of alternatives --> investment in mitigation

I am not an employee of Comair, and I have no desire to criticize.  Their case is very public, but there are many more failures in IT that impact operations that are not so well described.  I refer to their misfortune as an example for us all to learn from.  In that vein: perhaps if there were a graph that showed the amount of investment against high risk applications, as opposed to the amount of investment against 'strategic' applications, then it would have helped to seal the business case for mitigating, and ultimately preventing, the heavy losses that the company faced when IT failed to keep the system running.
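A graph like that starts from a simple roll-up of planned spend by category.  The following is a rough sketch under stated assumptions: each application is represented as a hypothetical (category, planned budget) pair, and "high risk" is taken to mean the Instant and Immediate impact bands.  The budget figures are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def investment_by_category(apps: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum planned budget per category from (category, budget) pairs."""
    totals: Dict[str, float] = defaultdict(float)
    for category, budget in apps:
        totals[category] += budget
    return dict(totals)


def share_of_investment(totals: Dict[str, float], categories: Set[str]) -> float:
    """Fraction of the total budget that goes to the given categories."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0.0
    return sum(totals.get(c, 0.0) for c in categories) / grand_total


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    planned = [
        ("Instant", 100.0),
        ("Immediate", 100.0),
        ("Competitive", 500.0),
        ("None", 300.0),
    ]
    totals = investment_by_category(planned)
    high_risk = share_of_investment(totals, {"Instant", "Immediate"})
    print(f"Share of budget on high-risk applications: {high_risk:.0%}")
```

Running the same roll-up twice, once keyed by impact band and once keyed by lifecycle stage, gives the two views side by side.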

The key here is that IT has to work closely with the business, something that IT folks are not very good at and that business folks often fail to see the value of.  But when IT shows that some applications deserve mitigation, and works as a partner to reduce the risks those applications face, the business will willingly invest in the mitigations that are needed.  This is the visibility gap that APM can fill.

Success requires a conversation between IT and the business, one that Enterprise Architecture must foster.  And this is one area where EA and APM intersect.  One of many, but an important area that we must not forget. 

EA + APM + Proper Measurement = Risk Management
