Data resiliency in Microsoft 365
Given the complex nature of cloud computing, Microsoft is mindful that it's not a case of if things go wrong, but rather when. We design our cloud services to maximize reliability and minimize the negative effects on customers when things do go wrong. We have moved beyond the traditional strategy of relying on complex physical infrastructure, and we have built redundancy directly into our cloud services. We use a combination of less complex physical infrastructure and more intelligent software that builds data resiliency into our services and delivers high availability to our customers.
Resiliency and recoverability are built in
Building in resiliency and recovery starts with the assumption that underlying infrastructure and processes will fail at some point: hardware fails, humans make mistakes, and software has bugs. While it would be incorrect to say that software developers weren't thinking about these things before the cloud, a typical IT implementation handled these issues differently:
- First, hardware and infrastructure protections were significant. Running datacenters with 99.99% reliability required significant power and network redundancy, and servers were implemented with hardware-based clustering, dual power supplies, dual network interfaces, and the like.
- Second, process was paramount. Operations teams maintained rigorous procedures, change windows were employed, and there was often significant project management overhead.
- Third, deployment took place at a glacial pace. Deploying code without owning the source meant waiting for patch releases, and major version releases involved hardware replacement and significant capital outlay. Moreover, the only way to correct a problem was to roll back. As a result, most IT organizations would deploy only major releases to avoid the work of keeping up to date.
- Finally, the scale of deployed systems and the level of their interconnectedness were historically much smaller than they are now.
Today, customers expect continuous innovation from Microsoft without compromising quality, which is one of the reasons Microsoft's services and software are built with resiliency and recoverability in mind.
Microsoft 365 data resiliency principles
Resiliency refers to the ability of a cloud-based service to withstand certain types of failures and yet remain fully functional from the customers' perspective. Data resiliency means that no matter what failures occur within Microsoft 365, critical customer data remains intact and unaffected. To that end, Microsoft 365 services have been designed around five specific resiliency principles:
- There is critical and noncritical data. Noncritical data (for example, whether a message was read) can be dropped in rare failure scenarios. Critical data (for example, customer data such as email messages) should be protected at extreme cost. As a design goal, delivered mail messages are always critical, and things like whether a message has been read are noncritical.
- Copies of customer data must be separated into different fault zones, or into as many fault domains as possible (for example, separate datacenters, or resources accessible by a single set of credentials, such as a process, server, or operator), to provide failure isolation.
- Critical customer data must be monitored for violations of any part of atomicity, consistency, isolation, and durability (ACID).
- Customer data must be protected from corruption. It must be actively scanned or monitored, repairable, and recoverable.
- Most data loss results from customer actions, so customers must be able to recover on their own using a GUI that lets them restore accidentally deleted items.
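To make the fault-domain and corruption-protection principles above more concrete, here is a minimal Python sketch. It is illustrative only and does not describe how Microsoft 365 is implemented. The names `Replica`, `write_replicas`, and `scan_and_repair`, the fault-domain labels such as `dc-a`, and the use of SHA-256 checksums are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest used to detect corruption (an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Replica:
    fault_domain: str  # hypothetical label, for example a datacenter name
    data: bytes
    digest: str  # checksum recorded when the copy was written


def write_replicas(data: bytes, fault_domains: list[str]) -> list[Replica]:
    """Store one copy per fault domain; refuse to place two copies together."""
    if len(set(fault_domains)) != len(fault_domains):
        raise ValueError("each copy must live in a distinct fault domain")
    digest = checksum(data)
    return [Replica(fd, data, digest) for fd in fault_domains]


def scan_and_repair(replicas: list[Replica]) -> list[str]:
    """Scan every copy, repair corrupted ones from a healthy copy,
    and return the fault domains that needed repair."""
    healthy = [r for r in replicas if checksum(r.data) == r.digest]
    if not healthy:
        raise RuntimeError("no healthy copy left; restore from backup")
    source = healthy[0]
    repaired = []
    for r in replicas:
        if checksum(r.data) != r.digest:
            r.data, r.digest = source.data, source.digest
            repaired.append(r.fault_domain)
    return repaired


# Usage: corrupt one copy, then let the scan find and repair it.
replicas = write_replicas(b"message body", ["dc-a", "dc-b", "dc-c"])
replicas[1].data = b"message b0dy"  # simulate silent corruption in dc-b
print(scan_and_repair(replicas))  # ['dc-b']
```

Because each copy sits in a different fault domain, a failure that damages one copy leaves the others available as a repair source, which is the point of the isolation principle. A production system would also need scheduling, alerting, and a fallback for the case where every copy is damaged.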
By building our cloud services to these principles, coupled with robust testing and validation, Microsoft 365 is able to meet and exceed the requirements of customers while providing a platform for continuous innovation and improvement.