你当前正在访问 Microsoft Azure Global Edition 技术文档网站。 如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站,请访问 https://docs.azure.cn。
Protect your cloud estate
This article provides best practices for maintaining the reliability and security of your Azure cloud estate. Reliability ensures your cloud services remain operational with minimal downtime. Security safeguards the confidentiality, integrity, and availability of your resources. Both reliability and security are critical for successful cloud operations.
Manage reliability
Reliability management involves using redundancy, replication, and defined recovery strategies to minimize downtime and protect your business. Table 1 provides an example of three workload priorities, reliability requirements (uptime SLO, max downtime, redundancy, load balancing, replication), and example scenarios that align with service-level objectives (SLOs)
Table 1. Example of workload priority and reliability requirements.
Priority | Business impact | Minimum uptime SLO | Max downtime per month | Architecture redundancy | Load balancing | Data replication and backups | Example scenario |
---|---|---|---|---|---|---|---|
High (mission-critical) | Immediate and severe effects on company reputation or revenue. | 99.99% | 4.32 minutes | Multi-region & Multiple availability zones in each region | Active-active | Synchronous, cross-region data replication & backups for recovery | Mission-critical baseline |
Medium | Measurable effects on company reputation or revenue. | 99.9% | 43.20 minutes | Multiple region & Multiple availability zones in each region | Active-passive | Asynchronous, cross-region data replication & backups for recovery | Reliable web app pattern |
Low | No effect on company reputation, processes, or profits. | 99% | 7.20 hours | Single region & Multiple availability zones | Availability zone redundancy | Synchronous data replication across availability zones & backups for recovery | App Service baseline Virtual machine baseline |
Identify reliability responsibilities
Reliability responsibilities vary by deployment model. Use the following table to identify your management responsibilities for infrastructure (IaaS), platform (PaaS), software (SaaS), and on-premises deployments.
Responsibility | On-premises | IaaS (Azure) | PaaS (Azure) | SaaS |
---|---|---|---|---|
Data | ✔️ | ✔️ | ✔️ | ✔️ |
Code and runtime | ✔️ | ✔️ | ✔️ | |
Cloud resources | ✔️ | ✔️ | ✔️ | |
Physical hardware | ✔️ |
For more information, see Shared responsibility for reliability.
Define reliability requirements
Clearly defined reliability requirements are critical for uptime targets, recovery, and data loss tolerance. Follow these steps to define reliability requirements:
Prioritize workloads. Assign high, medium (default), or low priorities to workloads based on business criticality and financial investment levels. Regularly review priorities to maintain alignment with business goals.
Assign uptime service level objective (SLO) to all workloads. Establish uptime targets according to workload priority. Higher-priority workloads require stricter uptime goals. Your SLO influences your architecture, data management strategies, recovery processes, and costs.
Identify service level indicators (SLIs). Use SLIs to measure uptime performance against your SLO. Examples include service health monitoring and error rates.
Assign a recovery time objective (RTO) to all workloads. The RTO defines the maximum acceptable downtime for your workload. RTO should be shorter than your annual downtime allowance. For example, an uptime SLO 99.99% requires less than 52 minutes of annual downtime (4.32 minutes per month). Follow these steps:
Estimate the number of failures. Estimate how often you think each workload might fail per year. For workloads with operational history, use your SLIs. For new workloads, perform a failure mode analysis to get an accurate estimate.
Estimate the RTO. Divide your annual allowable downtime by the estimated number of failures. If you estimate four failures per year, then your RTO must be 13 minutes or less (52 minutes / 4 failures = 13-minute RTO).
Test your recovery time. Track the average time it takes to recover during failover tests and live failures. The time it takes you to recover from failure must be less than your RTO. If your business continuity solution takes hours to
Define recovery point objectives (RPO) for all workloads. Determine how much data loss your business can tolerate. This objective influences how frequently you replicate and back up your data.
Define workload reliability targets. For workload reliability targets, see the Well-Architected Framework’s Recommendations for defining reliability targets.
Manage data reliability
Data reliability involves data replication (replicas) and backups (point in time copies) to maintain availability and consistency. See Table 2 for examples of workload priority aligned with data reliability targets.
Table 2. Workload priority with example data reliability configurations.
Workload priority | Uptime SLO | Data replication | Data backups | Example scenario |
---|---|---|---|---|
High | 99.99% | Synchronous data replication across regions Synchronous data replication across availability zones |
High frequency, cross-region backups. Frequency should support RTO and RPO. | Mission-critical data platform |
Medium | 99.9% | Synchronous data replication across regions Synchronous data replication across availability zones |
Cross-region backups. Frequency should support RTO and RPO. | Database and storage solution in the Reliable Web App pattern |
Low | 99% | Synchronous data replication across availability zones | Cross-region backups. Frequency should support RTO and RPO. | Data resiliency in baseline web app with zone redundancy |
Your approach must align the data reliability configurations with the RTO and RPO requirements of your workloads. Follow these steps:
Manage data replication. Replicate your data synchronously or asynchronously according to your workload’s RTO and RPO requirements.
Data distribution Data replication Load balancing configuration Across availability zones Synchronous (near real-time) Most PaaS services handle cross-zone load balancing natively Across regions (active-active) Synchronous Active-active load balancing Across regions (active-passive) Asynchronous (periodic) Active-passive configuration For more information, see Replication: Redundancy for data.
Manage data backups. Backups are for disaster recovery (service failure), data recovery (deletion or corruption), and incident response (security). Backups must support your RTO and RPO requirements for each workload. Choose backup solutions that align with your RTO and RPO goals. Prefer Azure’s built-in solutions, such as Azure Cosmos DB and Azure SQL Database native backups. For other cases, including on-premises data, use Azure Backup. For more information, see Backup.
Design workload data reliability. For workload data reliability design, see the Well-Architected Framework Data partitioning guide and Azure service guides (start with the Reliability section).
Manage code and runtime reliability
Code and runtime are workload responsibilities. Follow the Well-Architected Framework’s self-healing and self-preservation guide.
Manage cloud resources reliability
Managing the reliability of your cloud resources often requires architecture redundancy (duplicate service instances) and an effective load-balancing strategy. See Table 3 for examples of architecture redundancy aligned with workload priority.
Table 3. Workload priority and architecture redundancy examples.
Workload priority | Architecture redundancy | Load balancing approach | Azure load balancing solution | Example scenario |
---|---|---|---|---|
High | Two regions & availability zones | Active-active | Azure Front Door (HTTP) Azure Traffic Manager (non-HTTP) |
Mission-critical baseline application platform |
Medium | Two regions & availability zones | Active-passive | Azure Front Door (HTTP) Azure Traffic Manager (non-HTTP) |
Reliable web app pattern architecture guidance |
Low | Single region & availability zones | Across availability zones | Azure Application Gateway Add Azure Load Balancer for virtual machines |
App Service baseline Virtual machine baseline |
Your approach must implement architecture redundancy to meet the reliability requirements of your workloads. Follow these steps:
Estimate the uptime of your architectures. For each workload, calculate the composite SLA. Only include services that could cause the workload to fail (critical path). Follow these steps:
Gather the Microsoft uptime SLAs for every service on the critical path of your workload.
If you have no independent critical paths, calculate single-region composite SLA by multiplying the uptime percentages of each relevant service. If you have independent critical paths, move to step 3 before calculating.
When two Azure services provide independent critical paths, apply the independent critical paths formula to those services.
For multi-region applications, input the single-region composite SLA (N) into the multi-region uptime formula.
Compare your calculated uptime with your uptime SLO. Adjust service tiers or architecture redundancy if necessary.
Use case Formula Variables Example Explanation Single-region uptime estimate N = S1 × S2 × S3 × … × Un N: Composite SLA of Azure services on a single-region critical path.
S: SLA uptime percentage of each Azure service.
n: Total number of Azure services on critical path.N = 99.99% (app) × 99.95% (database) × 99.9% (cache) Simple workload with app (99.99%), database (99.95%), and cache (99.9%) in a single critical path. Independent critical paths estimate S1 x 1 - [(1 - S2) × (1 - S3)] S: SLA uptime percentage for Azure services providing independent critical paths. 99.99% (app) × (1 - [(1 - 99.95% database) × (1 - 99.9% cache)]) Two independent critical paths. Either database (99.95%) or cache (99.9%) can fail without downtime. Multi-region uptime estimate M = 1 - (1 - N)^R M: Multi-region uptime estimate.
N: Single-region composite SLA.
R: Number of regions used.If N = 99.95% and R = 2, then M = 1 - (1 - 99.95%)^2 Workload deployed in two regions. Adjust service tiers. Before modifying architectures, evaluate whether different Azure service tiers (SKUs) can meet your reliability requirements. Some Azure service tiers can have different uptime SLAs, such as Azure Managed Disks.
Add architecture redundancy. If your current uptime estimate falls short of your SLO, increase redundancy:
Use multiple availability zones. Configure your workloads to use multiple availability zones. How availability zones improve your uptime can be difficult to estimate. Only a select number of services have uptime SLAs that account for availability zones. Where SLAs account for availability zones, use them in your uptime estimates. See the following table for some examples.
Azure service type Azure services with Availability Zone SLAs Compute Platform App Service,
Azure Kubernetes Service,
Virtual MachinesDatastore Azure Service Bus,
Azure Storage Accounts,
Azure Cache for Redis,
Azure Files Premium TierDatabase Azure Cosmos DB,
Azure SQL Database,
Azure Database for MySQL,
Azure Database for PostgreSQL,
Azure Managed Instance for Apache CassandraLoad Balancer Application Gateway Security Azure Firewall Use multiple regions. Multiple regions are often necessary to meet uptime SLOs. Use global load balancers (Azure Front Door or Traffic Manager) for traffic distribution. Multi-region architectures require careful data consistency management.
Manage architecture redundancy. Decide how to use redundancy: You can use architecture redundancy as part of daily operations (active). Or you can use architecture redundancy in disaster recovery scenarios (passive). For examples, see Table 3.
Load balance across availability zones. Use all availability actively. Many Azure PaaS services manage load balancing across availability zones automatically. IaaS workloads must use an internal load balancer to load balance across availability zones.
Load balance across regions. Determine whether multi-region workloads should run active-active or active-passive based on reliability needs.
Manage service configurations. Consistently apply configurations across redundant instances of Azure resources, so the resources behave in the same way. Use infrastructure as code to maintain consistency. For more information, see Duplicate resource configuration.
Design workload reliability. For workload reliability design, see the Well-Architected Framework:
Workload reliability Guidance Reliability pillar Highly available multi-region design
Designing for redundancy
Using availability zones and regionsService guide Azure service guides (start with the Reliability section)
For more information, see Redundancy.
Manage business continuity
Recovering from a failure requires a clear strategy to restore services quickly and minimize disruption to maintain user satisfaction. Follow these steps:
Prepare for failures. Create separate recovery procedures for workloads based on high, medium, and low priorities. Data reliability, code and runtime reliability, and cloud resource reliability are the foundation of preparing for failure. Select other recovery tools to assist with business continuity preparation. For example, use Azure Site Recovery for on-premises and virtual-machine based server workloads.
Test and document recovery plan. Regularly test your failover and failback processes to confirm your workloads meet recovery time objectives (RTO) and recovery point objectives (RPO). Clearly document each step of the recovery plan for easy reference during incidents. Verify that recovery tools, such as Azure Site Recovery, consistently meet your specified RTO.
Detect failures. Adopt a proactive approach to identifying outages quickly, even if this method increases false positives. Prioritize customer experience by minimizing downtime and maintaining user trust.
Monitor failures. Monitor workloads to detect outages within one minute. Use Azure Service Health and Azure Resources Health and use Azure Monitor alerts to notify relevant teams. Integrate these alerts with Azure DevOps or IT Service Management (ITSM) tools.
Collect service level indicators (SLIs). Track performance by defining and gathering metrics that serve as SLIs. Ensure your teams use these metrics to measure workload performance against your service level objectives (SLOs).
Respond to failures. Align your recovery response to the workload priority. Implement failover procedures to reroute requests to redundant infrastructure and data replicas immediately. Once systems stabilize, resolve the root cause, synchronize data, and execute failback procedures. For more information, see Failover and failback.
Analyze failures. Identify the root causes of the issues and then address the problem. Document any lessons and make the necessary changes.
Manage workload failures. For workload disaster recovery, see the Well-Architected Framework's disaster recovery guide and Azure service guides (start with the Reliability section).
Azure reliability tools
Use case | Solution |
---|---|
Data replication, backup, and business continuity | Azure service guides (start with the Reliability section) Quick reference: Azure Cosmos DB Azure SQL Database Azure Blob Storage Azure Files |
Data backup | Azure Backup |
Business continuity (IaaS) | Azure Site Recovery |
Multi-region load balancer | Azure Front Door (HTTP) Azure Traffic Manager (non-HTTP) |
Multi-availability zone load balancer | Azure Application Gateway (HTTP) Azure Load Balancer (non-HTTP) |
Manage security
Use an iterative security process to identify and mitigate threats in your cloud environment. Follow these steps:
Manage security controls
Manage your security controls to detect threats to your cloud estate. Follow these steps:
Standardize security tooling. Use standardized tools to detect threats, fix vulnerabilities, investigate issues, secure data, harden resources, and enforce compliance at scale. Refer to Azure security tools.
Baseline your environment. Document the normal state of your cloud estate. Monitor security and document network traffic patterns and user behaviors. Use Azure security baselines and Azure service guides to develop baseline configurations for services. This baseline makes it easier to detect anomalies and potential security weaknesses.
Apply security controls. Implement security measures, such as access controls, encryption, and multifactor authentication, strengthens the environment and reduces the probability of compromise. For more information, see Manage security.
Assign security responsibilities. Designate responsibility for security monitoring across your cloud environment. Regular monitoring and comparisons to the baseline enable quick identification of incidents, such as unauthorized access or unusual data transfers. Regular updates and audits keep your security baseline effective against evolving threats.
For more information, see CAF Secure.
Manage security incidents
Adopt a process and tools to recover from security incidents, such as ransomware, denial of service, or threat actor intrusion. Follow these steps:
Prepare for incidents. Develop an incident response plan that clearly defines roles for investigation, mitigation, and communication. Regularly test the effectiveness of your plan. Evaluate and implement vulnerability management tools, threat detection systems, and infrastructure monitoring solutions. Reduce your attack surface through infrastructure hardening and create workload-specific recovery strategies. See Incident response overview and Incident response playbooks.
Detect incidents. Use security information and event management (SIEM) tool, like Microsoft Sentinel, to centralize your security data. Use Microsoft Sentinel’s security orchestration, automation, and response capabilities (SOAR) to automate routine security tasks. Integrate threat intelligence feeds into your SIEM to gain insights into adversary tactics relevant to your cloud environment. Use Microsoft Defender for Cloud to regularly scan Azure for vulnerabilities. Microsoft Defender integrates with Microsoft Sentinel to provide a unified view of security events.
Respond to incidents. Immediately activate your incident response plan upon detecting an incident. Quickly start investigation and mitigation procedures. Activate your disaster recovery plan to restore affected systems, and clearly communicate incident details to your team.
Analyze security incidents. After each incident, review threat intelligence and update your incident response plan based on lessons learned and insights from public resources, such as the MITRE ATT&CK knowledge base. Evaluate the effectiveness of your vulnerability management and detection tools and refine strategies based on post-incident analysis.
For more information, see Manage incident response (CAF Secure).
Azure security tools
Security capability | Microsoft solution |
---|---|
Identity and access management | Microsoft Entra ID |
Role-based access control | Azure role-based access control |
Threat detection | Microsoft Defender for Cloud |
Security information management | Microsoft Sentinel |
Data security and governance | Microsoft Purview |
Cloud resource security | Azure security baselines |
Cloud governance | Azure Policy |
Endpoint security | Microsoft Defender for Endpoint |
Network security | Azure Network Watcher |
Industrial security | Microsoft Defender for IoT |