Process Maturity: When it comes to HA, it’s more important than Technology
When most of us think about system availability and maximizing uptime, our technical backgrounds and interests often pull us into thoughts of highly available stretched clusters, redundant servers, SQL mirroring and replication, dual data centers, and the like. It’s easy to get all technical and forget for a moment about the people and process matters that actually affect High Availability (HA) more than technology. In my 13 years of experience supporting SAP-on-Microsoft and other critical line-of-business (LOB) systems, I’ve found that system availability relies not on MSCS or hot-pluggable power supplies but rather on good old-fashioned IT process discipline. A lack of process diligence and maturity significantly constrains SAP IT operations, which in turn affects business system availability – and results in those nasty 3 a.m. calls about the system being down. Gartner found several years ago that 80% of availability issues related to mission-critical systems were the result of process and people matters, not technology. If you sit back a minute and think about it, I bet you could have told Gartner the same thing.
Process discipline is one of the few change levers that helps you not only achieve operational excellence but also pursue real innovation. How? Well, indirectly to be sure. I’m talking about the kind of innovation that comes as a result of being able to focus beyond merely running the business well, beyond worrying about whether the system will stay up and available through month-end close. If my SAP, JD Edwards, Oracle EBS, or other LOB system is highly available, I can use it as the backbone for my next project, e.g., an MS BI project focused on analyzing sales trends, or an interoperability or collaboration project focused on helping my occasional users access SAP through SharePoint, or a workflow automation project using BizTalk to quickly extend business processes to other LOB applications – projects that can make a real difference in how people get their work done.
So, getting back to process discipline, I’ve outlined quite a few processes that in my experience hold the keys to maintaining a highly available mission-critical system. In no particular order, these include:
· Availability monitoring
· Downtime costing
· Technical change management
· Technical change testing/promote-to-production
· The failover/failback process itself
· Business/data recovery
· Data archiving
· Backup and recovery/restore
· Backup retention
· Offsite storage
· Architecture design
· Capacity planning
· Performance management
· Systems management (at every layer in the technology stack: datacenter facilities, network, server, disk, OS, database, SAP application layer, integration points, and overall SAP or other LOB business processes)
· Security management (again, at every layer in the stack)
· New server deployment
· New storage deployment
· IT staffing
· IT training
· Knowledge management
· Technical documentation
· Help desk issue tracking and escalation
· Issue remediation
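To make the first two items on this list (availability monitoring and downtime costing) a bit more concrete, here is a minimal Python sketch of the arithmetic behind them. The function names and the example figures are hypothetical, chosen only for illustration; your own downtime cost per hour has to come from the business, not from IT.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_pct(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the percentage of the period during which the system was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


def downtime_cost(downtime_hours: float, cost_per_hour: float) -> float:
    """Return the cost of an outage, assuming a flat (hypothetical) hourly cost."""
    return downtime_hours * cost_per_hour


if __name__ == "__main__":
    # Illustrative numbers only: a 4-hour outage at an assumed 10,000 per hour.
    print(f"Availability over a year: {availability_pct(4):.3f}%")
    print(f"Cost of the outage: {downtime_cost(4, 10_000):,.0f}")
```

A flat hourly cost is a simplification; an outage during month-end close will usually cost far more than one on a quiet weekend, which is exactly the kind of detail a downtime costing process should capture.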
The most complex HA issues arise when two or three of these process areas converge in a perfect mess of a storm. For example:
· If your Capacity Planning process is excellent but is not married to seamless, actionable deployment and change management processes, you’ll have difficult, outage-prone infrastructure on your hands.
· If your HA Failover/Failback processes are poorly documented, you’ll be lucky to keep your data intact, much less avoid a lot of missteps and downtime, if you’re missing a key subject matter expert during the HA Failback process.
· In the same way, if your DR Crash Kit (a key component of your technical documentation) is not well-maintained (including current-state and process documentation as well as up-to-date software images), only your top-notch SMEs will get you through the disaster – if indeed they’re even available.
· If your Promote to Production process lacks a load testing component (across the board, from hardware/firmware, through network, software, database, and application layers of the SAP technology stack), you’ll be at the mercy of your end users as they create their own load and test your system’s scalability during the peak pre-holiday season.
· If your primary DBA leaves you with little notice and you suffer a fatal SAN disk controller failure that scrambles the database, you’d better have a good Staffing Plan that encompasses on-call and emergency resources.
· If only a single person can approve all your proposed technical changes, that person had better never go on vacation unless you’re OK with that emergency patch sitting in a queue for two more weeks.
· If there’s no penalty for implementing unapproved changes (I’d recommend termination for a second offense), and your change management process is speedier than it should be, you’d better have an amazingly proactive systems monitoring and management process in place. And a great backout process as well.
· If knowledge is maintained in the heads and on the laptops of only a very few IT associates, rather than in a robustly protected and highly accessible knowledge repository, you’ll hamper troubleshooting, the speed with which OJT and informal training can be absorbed, cross-training, potential new-employee onboarding, and so on.
· If any of these processes are missing key components (such as the promote-to-production process being devoid of a load testing component, as mentioned earlier), personnel workload balancing can't help but become an issue. Why? Because your most senior folks will find themselves overworked trying to cobble together a system that stays up while their less capable colleagues enjoy additional time off because they’re unable to take on more load and responsibilities.
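Several of these examples boil down to the same problem: a critical task that only one person (or nobody) can perform. As a rough illustration, the Python sketch below flags such tasks from a simple coverage map. The task names, people, and the `thinly_covered` helper are all hypothetical; in practice the map would come from your own staffing plan and knowledge repository.

```python
from typing import Dict, List, Set


def thinly_covered(coverage: Dict[str, Set[str]], minimum: int = 2) -> List[str]:
    """Return the tasks that fewer than `minimum` people can perform, sorted by name."""
    return sorted(task for task, people in coverage.items() if len(people) < minimum)


# Hypothetical coverage map: task -> people able to perform it today.
coverage = {
    "approve technical changes": {"pat"},
    "execute HA failback": {"lee", "sam"},
    "restore production database": {"lee"},
    "maintain DR crash kit": set(),
}

if __name__ == "__main__":
    for task in thinly_covered(coverage):
        names = ", ".join(sorted(coverage[task])) or "nobody"
        print(f"At risk: {task} (covered by: {names})")
```

A check like this proves nothing about how good your processes are, but it does show where one vacation or one resignation would leave you exposed.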
So the next time you’re asked to increase your system’s availability through a new HA-enhancing piece of technology, ask yourself first whether or not your processes are mature enough to even make the new technology investment worthwhile. One mistake executing a poorly designed, tested, or incomplete process, or one extended leave of absence, or one failed laptop containing the only copy of the company’s DR or failback/failover plan – combined with another process (or technology, or people) failure – could easily result in hours if not several days of unplanned downtime. Will your end users understand? Probably not, but they’re a forgiving lot for the most part. More to the point, is your job worth the risk?
george