Resiliency and continuity overview
How does Microsoft ensure business continuity if a disaster or other threat to service availability occurs?
Microsoft's Enterprise Resilience and Crisis Management (ERCM) team oversees business continuity management and disaster recovery activities across Microsoft services and cloud offerings. Representatives from Microsoft business units coordinate with the ERCM team to develop business continuity plans and validate compliance with business continuity requirements.
The Business Continuity Management (BCM) lifecycle is at the core of our BCM methodology. This three-phase process is designed to be adaptable so it can be implemented by a wide variety of business models across Microsoft. It begins with an Assessment phase to identify critical processes and objectives that should be included in the business continuity program. The Assessment phase also requires a Business Impact Analysis (BIA). The Planning phase focuses on developing and implementing resilience and recovery strategies and documenting them in official business continuity plans. Finally, Capability Validation tests business continuity plans and their implementations to verify effectiveness and identify potential improvements.
Business continuity strategies for Microsoft online services rely on hardware, network, and datacenter redundancy. Data replication between datacenters provides high availability and reliability during a catastrophic incident. It also increases resilience to routine incidents such as isolated hardware failure or data corruption.
How does Microsoft test business continuity and disaster recovery plans?
Microsoft's Enterprise Resilience and Crisis Management (ERCM) policy stipulates that all Microsoft business continuity and disaster recovery plans must be tested, updated, and reviewed on an annual basis. Microsoft online services test their business continuity plans at least annually per ERCM policies. After Action reports are created and reviewed to validate test results and to inform plan updates in response to any problems discovered during testing.
To validate resilience and recovery strategies against a wide range of potential incidents, the ERCM Program defines multiple categories of test scenarios affecting people, locations, and technology. The level of validation required for each service is based on the service's criticality, with more critical services receiving more rigorous validation. Each Microsoft online service team tests their business continuity plan according to ERCM guidelines to measure the plan's effectiveness and the service team's readiness to execute the plan.
Per ERCM guidelines, annual reviews of business continuity plans and capability validation must take place within 12 months of the last review. Capability validation must include review of supporting documentation, such as the BIA, to ensure it remains accurate. Microsoft makes capability validation results for select Microsoft online services available to our customers through quarterly reports.
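The 12-month review window described above comes down to simple date arithmetic. The following is a minimal sketch of such a check, not a description of any ERCM tooling: the function names `add_months` and `review_overdue` and the `window_months` parameter are hypothetical and chosen only for illustration.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the last day of the month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def review_overdue(last_review: date, today: date, window_months: int = 12) -> bool:
    """Return True if today falls after last_review plus the review window.

    Assumption: a review performed exactly on the deadline still counts as on time.
    """
    return today > add_months(last_review, window_months)
```

Under these assumptions, a plan last reviewed on June 15, 2023 is still within its window on June 15, 2024 and becomes overdue the following day.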
How do Microsoft online services ensure system capacity meets demand?
Capacity planning helps service teams allocate the resources necessary to support Microsoft online service availability. Regular capacity planning is required as part of Microsoft's ERCM program. Service teams review capacity data quarterly and during emergency situations that warrant additional capacity review.
The raw data for capacity planning is maintained by each service team and includes metrics like system processing, memory, and hardware capacity. Scheduled reviews use a model of the system's current capacity and test it against projected needs in emergency situations. If the model indicates gaps in capacity, proposed changes to system capacity are submitted to service team leadership for review. Approved changes are incorporated into a new model before implementation by service team engineers.
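The comparison of current capacity against projected needs can be pictured as a gap calculation. The sketch below is illustrative only and does not represent how any Microsoft service team models capacity: the `CapacityMetric` class, the `find_gaps` function, the `headroom` parameter, and the sample figures are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class CapacityMetric:
    name: str
    current: float    # capacity available today (hypothetical units)
    projected: float  # projected need in an emergency scenario


def find_gaps(metrics, headroom=0.0):
    """Return {metric name: shortfall} where projected need exceeds current capacity.

    `headroom` is an assumed safety margin, expressed as a fraction of projected need.
    """
    gaps = {}
    for m in metrics:
        required = m.projected * (1 + headroom)
        if required > m.current:
            gaps[m.name] = required - m.current
    return gaps


# Hypothetical sample data: processing capacity falls short, memory does not.
metrics = [
    CapacityMetric("cpu_cores", current=1000, projected=1200),
    CapacityMetric("memory_gb", current=4096, projected=3000),
]
print(find_gaps(metrics))  # {'cpu_cores': 200.0}
```

In this picture, any non-empty result would correspond to the gaps that prompt a proposed capacity change for leadership review.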
How do Microsoft online services maintain service availability during routine system failures?
Microsoft online services achieve service resilience through redundant architecture, data replication, and automated integrity checking. Redundant architecture involves deploying multiple instances of a service on geographically and physically separate hardware, providing increased fault tolerance for Microsoft online services. Data replication ensures there are always multiple copies of customer data in different fault zones, allowing critical customer data to be recovered if it is corrupted, lost, or even accidentally deleted by the customer. Automated integrity checking increases data availability by automatically restoring data impacted by many kinds of physical or logical corruption.
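One common way to combine replication with integrity checking is to compare each copy against a known checksum and overwrite damaged copies from a healthy one. The sketch below shows that general technique under stated assumptions; it is not Microsoft's implementation. The `repair_replicas` function, the zone names, and the choice of SHA-256 are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a blob."""
    return hashlib.sha256(data).hexdigest()


def repair_replicas(replicas: dict[str, bytes], expected: str) -> list[str]:
    """Overwrite corrupted replicas with a healthy copy.

    `replicas` maps a (hypothetical) fault-zone name to its copy of the data.
    Returns the zones that were repaired; raises if no copy matches `expected`.
    """
    healthy = [zone for zone, data in replicas.items() if checksum(data) == expected]
    if not healthy:
        raise ValueError("no healthy replica available")
    good_copy = replicas[healthy[0]]
    repaired = []
    for zone, data in replicas.items():
        if checksum(data) != expected:
            replicas[zone] = good_copy  # restore from the healthy copy
            repaired.append(zone)
    return repaired
```

For example, with three copies of which one has been altered, calling `repair_replicas` with the original checksum restores the altered copy and reports its zone. A single surviving healthy copy is enough to repair the others, which is why copies are kept in separate fault zones.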
Related external regulations & certifications
Microsoft's online services are regularly audited for compliance with external regulations and certifications. Refer to the following table for validation of controls related to resiliency and continuity.
Azure and Dynamics 365
External audits | Section | Latest report date |
---|---|---|
ISO 27001 Statement of Applicability, Certificate | A.17.1: Information security continuity; A.17.2: Redundancies | April 8, 2024 |
ISO 22301 Certificate | All controls | April 8, 2024 |
SOC 1, SOC 2, SOC 3 | BC-1: Business continuity plans; BC-3: Business continuity and disaster recovery procedures; BC-4: BCDR testing; BC-7: Datacenter business continuity plans; BC-8: Datacenter business continuity testing; BC-9: Datacenter resiliency assessment; DS-5: Backup key service components; DS-6: Redundancy of critical components; DS-7: Automatic replication of customer data; DS-8: Backup schedule; DS-9: Backup restoration procedures; DS-11: Offsite backups; DS-14: Automatic restoration of customer services | May 20, 2024 |
Microsoft 365
External audits | Section | Latest report date |
---|---|---|
FedRAMP | CP-2: Contingency plan; CP-3: Contingency training; CP-4: Contingency plan testing; CP-6: Alternate storage site; CP-7: Alternate processing site; CP-9: Information system backup; CP-10: Information system recovery and reconstitution | August 21, 2024 |
ISO 27001 Statement of Applicability, Certificate | A.17.1: Information security continuity; A.17.2: Redundancies | March 2024 |
ISO 22301 Certificate | All controls | March 2024 |
SOC 1, SOC 2 | CA-49: Backup policies; CA-50: Business continuity; CA-51: Data replication | August 1, 2024 |
SOC 3 | CUEC-09: EXO email restoration | January 23, 2024 |
Resources
- Microsoft Cloud ERCM: Business Continuity and Disaster Recovery Plan Validation Report FY24 Q4
- Enterprise Resilience and Crisis Management (ERCM) Program