Protect your cloud estate

This article provides best practices for maintaining the reliability and security of your Azure cloud estate. Reliability ensures your cloud services remain operational with minimal downtime. Security safeguards the confidentiality, integrity, and availability of your resources. Both reliability and security are critical for successful cloud operations.

Manage reliability

Reliability management involves using redundancy, replication, and defined recovery strategies to minimize downtime and protect your business. Table 1 provides an example of three workload priorities, reliability requirements (uptime SLO, max downtime, redundancy, load balancing, replication), and example scenarios that align with service-level objectives (SLOs).

Table 1. Example of workload priority and reliability requirements.

| Priority | Business impact | Minimum uptime SLO | Max downtime per month | Architecture redundancy | Load balancing | Data replication and backups | Example scenario |
| --- | --- | --- | --- | --- | --- | --- | --- |
| High (mission-critical) | Immediate and severe effects on company reputation or revenue. | 99.99% | 4.32 minutes | Multi-region & multiple availability zones in each region | Active-active | Synchronous, cross-region data replication & backups for recovery | Mission-critical baseline |
| Medium | Measurable effects on company reputation or revenue. | 99.9% | 43.20 minutes | Multi-region & multiple availability zones in each region | Active-passive | Asynchronous, cross-region data replication & backups for recovery | Reliable web app pattern |
| Low | No effect on company reputation, processes, or profits. | 99% | 7.20 hours | Single region & multiple availability zones | Availability zone redundancy | Synchronous data replication across availability zones & backups for recovery | App Service baseline; Virtual machine baseline |
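
The monthly downtime figures in Table 1 follow directly from the uptime SLO. The following is a minimal sketch of that arithmetic, assuming a 30-day month (an assumption that happens to reproduce the table's values). The function name `max_downtime_minutes` is hypothetical and used only for illustration.

```python
def max_downtime_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Return the downtime allowance, in minutes, for an uptime SLO over a period.

    Assumes a fixed-length period (30 days by default); real calendar months vary.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


# Print the allowance for the three example SLOs in Table 1.
for slo in (99.99, 99.9, 99.0):
    minutes = max_downtime_minutes(slo)
    print(f"{slo}% uptime -> {minutes:.2f} minutes ({minutes / 60:.2f} hours) per 30-day month")
```

Changing `period_days` to 365 gives the annual allowance used when estimating a recovery time objective.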

Identify reliability responsibilities

Reliability responsibilities vary by deployment model. Use the following table to identify your management responsibilities for infrastructure (IaaS), platform (PaaS), software (SaaS), and on-premises deployments.

| Responsibility | On-premises | IaaS (Azure) | PaaS (Azure) | SaaS |
| --- | --- | --- | --- | --- |
| Data | ✔️ | ✔️ | ✔️ | ✔️ |
| Code and runtime | ✔️ | ✔️ | ✔️ | |
| Cloud resources | ✔️ | ✔️ | ✔️ | |
| Physical hardware | ✔️ | | | |

For more information, see Shared responsibility for reliability.

Define reliability requirements

Clearly defined reliability requirements are critical for uptime targets, recovery, and data loss tolerance. Follow these steps to define reliability requirements:

  1. Prioritize workloads. Assign high, medium (default), or low priorities to workloads based on business criticality and financial investment levels. Regularly review priorities to maintain alignment with business goals.

  2. Assign an uptime service level objective (SLO) to all workloads. Establish uptime targets according to workload priority. Higher-priority workloads require stricter uptime goals. Your SLO influences your architecture, data management strategies, recovery processes, and costs.

  3. Identify service level indicators (SLIs). Use SLIs to measure uptime performance against your SLO. Examples include service health monitoring and error rates.

  4. Assign a recovery time objective (RTO) to all workloads. The RTO defines the maximum acceptable downtime for your workload. RTO should be shorter than your annual downtime allowance. For example, an uptime SLO of 99.99% requires less than 52 minutes of annual downtime (4.32 minutes per month). Follow these steps:

    1. Estimate the number of failures. Estimate how often you think each workload might fail per year. For workloads with operational history, use your SLIs. For new workloads, perform a failure mode analysis to get an accurate estimate.

    2. Estimate the RTO. Divide your annual allowable downtime by the estimated number of failures. If you estimate four failures per year, then your RTO must be 13 minutes or less (52 minutes / 4 failures = 13-minute RTO).

    3. Test your recovery time. Track the average time it takes to recover during failover tests and live failures. The time it takes you to recover from failure must be less than your RTO. If your business continuity solution takes hours to

  5. Define recovery point objectives (RPO) for all workloads. Determine how much data loss your business can tolerate. This objective influences how frequently you replicate and back up your data.

  6. Define workload reliability targets. For workload reliability targets, see the Well-Architected Framework’s Recommendations for defining reliability targets.
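
The RTO estimate in step 4 can be sketched as follows. This is an illustrative calculation, not a prescribed tool: the function names are hypothetical, a 365-day year is assumed, and the failure count is whatever estimate you derive from your SLIs or failure mode analysis.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored for simplicity


def annual_downtime_minutes(slo_percent: float) -> float:
    """Annual downtime allowance, in minutes, for an uptime SLO given as a percentage."""
    return MINUTES_PER_YEAR * (1 - slo_percent / 100)


def rto_minutes(slo_percent: float, failures_per_year: int) -> float:
    """Upper bound on RTO: annual downtime allowance divided by estimated failures."""
    if failures_per_year <= 0:
        raise ValueError("failures_per_year must be a positive integer")
    return annual_downtime_minutes(slo_percent) / failures_per_year


def recovery_meets_rto(measured_recovery_minutes: float, rto: float) -> bool:
    """True when the measured recovery time stays under the RTO."""
    return measured_recovery_minutes < rto


rto = rto_minutes(99.99, failures_per_year=4)
print(f"RTO upper bound: {rto:.2f} minutes")
print("10-minute recovery meets RTO:", recovery_meets_rto(10, rto))
```

The unrounded allowance for 99.99% is 52.56 minutes per year, so this sketch yields an RTO of about 13.14 minutes. The text above rounds down to 52 minutes and 13 minutes, which is the more conservative target.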

Manage data reliability

Data reliability involves data replication (replicas) and backups (point-in-time copies) to maintain availability and consistency. See Table 2 for examples of workload priority aligned with data reliability targets.

Table 2. Workload priority with example data reliability configurations.

| Workload priority | Uptime SLO | Data replication | Data backups | Example scenario |
| --- | --- | --- | --- | --- |
| High | 99.99% | Synchronous data replication across regions; Synchronous data replication across availability zones | High frequency, cross-region backups. Frequency should support RTO and RPO. | Mission-critical data platform |
| Medium | 99.9% | Synchronous data replication across regions; Synchronous data replication across availability zones | Cross-region backups. Frequency should support RTO and RPO. | Database and storage solution in the Reliable Web App pattern |
| Low | 99% | Synchronous data replication across availability zones | Cross-region backups. Frequency should support RTO and RPO. | Data resiliency in baseline web app with zone redundancy |

Your approach must align the data reliability configurations with the RTO and RPO requirements of your workloads. Follow these steps:

  1. Manage data replication. Replicate your data synchronously or asynchronously according to your workload’s RTO and RPO requirements.

    | Data distribution | Data replication | Load balancing configuration |
    | --- | --- | --- |
    | Across availability zones | Synchronous (near real-time) | Most PaaS services handle cross-zone load balancing natively |
    | Across regions (active-active) | Synchronous | Active-active load balancing |
    | Across regions (active-passive) | Asynchronous (periodic) | Active-passive configuration |

    For more information, see Replication: Redundancy for data.

  2. Manage data backups. Backups are for disaster recovery (service failure), data recovery (deletion or corruption), and incident response (security). Backups must support your RTO and RPO requirements for each workload. Choose backup solutions that align with your RTO and RPO goals. Prefer Azure’s built-in solutions, such as Azure Cosmos DB and Azure SQL Database native backups. For other cases, including on-premises data, use Azure Backup. For more information, see Backup.

  3. Design workload data reliability. For workload data reliability design, see the Well-Architected Framework Data partitioning guide and Azure service guides (start with the Reliability section).
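
To check whether a backup schedule supports a workload's RPO, one simple model treats the interval between backups as the worst-case data loss: a failure just before the next backup loses everything written since the previous one. The sketch below applies that model. It is a simplification that ignores backup duration and any replication that reduces data loss, and the workload names and intervals are made-up examples.

```python
from datetime import timedelta


def backup_supports_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case data loss (one full backup interval) fits within the RPO."""
    return backup_interval <= rpo


# Hypothetical workloads: (name, backup interval, RPO). Values are illustrative only.
workloads = [
    ("orders-db", timedelta(minutes=15), timedelta(minutes=30)),
    ("reporting-db", timedelta(hours=24), timedelta(hours=4)),
]

for name, interval, rpo in workloads:
    status = "OK" if backup_supports_rpo(interval, rpo) else "backup too infrequent for RPO"
    print(f"{name}: interval={interval}, RPO={rpo} -> {status}")
```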

Manage code and runtime reliability

Code and runtime are workload responsibilities. Follow the Well-Architected Framework’s self-healing and self-preservation guide.

Manage cloud resources reliability

Managing the reliability of your cloud resources often requires architecture redundancy (duplicate service instances) and an effective load-balancing strategy. See Table 3 for examples of architecture redundancy aligned with workload priority.

Table 3. Workload priority and architecture redundancy examples.

| Workload priority | Architecture redundancy | Load balancing approach | Azure load balancing solution | Example scenario |
| --- | --- | --- | --- | --- |
| High | Two regions & availability zones | Active-active | Azure Front Door (HTTP); Azure Traffic Manager (non-HTTP) | Mission-critical baseline application platform |
| Medium | Two regions & availability zones | Active-passive | Azure Front Door (HTTP); Azure Traffic Manager (non-HTTP) | Reliable web app pattern architecture guidance |
| Low | Single region & availability zones | Across availability zones | Azure Application Gateway; add Azure Load Balancer for virtual machines | App Service baseline; Virtual machine baseline |

Your approach must implement architecture redundancy to meet the reliability requirements of your workloads. Follow these steps:

  1. Estimate the uptime of your architectures. For each workload, calculate the composite SLA. Only include services that could cause the workload to fail (critical path). Follow these steps:

    1. Gather the Microsoft uptime SLAs for every service on the critical path of your workload.

    2. If you have no independent critical paths, calculate single-region composite SLA by multiplying the uptime percentages of each relevant service. If you have independent critical paths, move to step 3 before calculating.

    3. When two Azure services provide independent critical paths, apply the independent critical paths formula to those services.

    4. For multi-region applications, input the single-region composite SLA (N) into the multi-region uptime formula.

    5. Compare your calculated uptime with your uptime SLO. Adjust service tiers or architecture redundancy if necessary.

    | Use case | Formula | Variables | Example | Explanation |
    | --- | --- | --- | --- | --- |
    | Single-region uptime estimate | N = S1 × S2 × S3 × … × Sn | N: Composite SLA of Azure services on a single-region critical path. S: SLA uptime percentage of each Azure service. n: Total number of Azure services on critical path. | N = 99.99% (app) × 99.95% (database) × 99.9% (cache) | Simple workload with app (99.99%), database (99.95%), and cache (99.9%) in a single critical path. |
    | Independent critical paths estimate | S1 × (1 - [(1 - S2) × (1 - S3)]) | S: SLA uptime percentage for Azure services providing independent critical paths. | 99.99% (app) × (1 - [(1 - 99.95% database) × (1 - 99.9% cache)]) | Two independent critical paths. Either database (99.95%) or cache (99.9%) can fail without downtime. |
    | Multi-region uptime estimate | M = 1 - (1 - N)^R | M: Multi-region uptime estimate. N: Single-region composite SLA. R: Number of regions used. | If N = 99.95% and R = 2, then M = 1 - (1 - 99.95%)^2 | Workload deployed in two regions. |

  2. Adjust service tiers. Before modifying architectures, evaluate whether different Azure service tiers (SKUs) can meet your reliability requirements. Some Azure service tiers can have different uptime SLAs, such as Azure Managed Disks.

  3. Add architecture redundancy. If your current uptime estimate falls short of your SLO, increase redundancy:

    1. Use multiple availability zones. Configure your workloads to use multiple availability zones. How availability zones improve your uptime can be difficult to estimate. Only a select number of services have uptime SLAs that account for availability zones. Where SLAs account for availability zones, use them in your uptime estimates. See the following table for some examples.

      | Azure service type | Azure services with Availability Zone SLAs |
      | --- | --- |
      | Compute Platform | App Service, Azure Kubernetes Service, Virtual Machines |
      | Datastore | Azure Service Bus, Azure Storage Accounts, Azure Cache for Redis, Azure Files Premium Tier |
      | Database | Azure Cosmos DB, Azure SQL Database, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Managed Instance for Apache Cassandra |
      | Load Balancer | Application Gateway |
      | Security | Azure Firewall |

    2. Use multiple regions. Multiple regions are often necessary to meet uptime SLOs. Use global load balancers (Azure Front Door or Traffic Manager) for traffic distribution. Multi-region architectures require careful data consistency management.

  4. Manage architecture redundancy. Decide how to use redundancy: You can use architecture redundancy as part of daily operations (active). Or you can use architecture redundancy in disaster recovery scenarios (passive). For examples, see Table 3.

    1. Load balance across availability zones. Use all availability zones actively. Many Azure PaaS services manage load balancing across availability zones automatically. IaaS workloads must use an internal load balancer to load balance across availability zones.

    2. Load balance across regions. Determine whether multi-region workloads should run active-active or active-passive based on reliability needs.

  5. Manage service configurations. Consistently apply configurations across redundant instances of Azure resources, so the resources behave in the same way. Use infrastructure as code to maintain consistency. For more information, see Duplicate resource configuration.

  6. Design workload reliability. For workload reliability design, see the Well-Architected Framework:

    | Workload reliability | Guidance |
    | --- | --- |
    | Reliability pillar | Highly available multi-region design; Designing for redundancy; Using availability zones and regions |
    | Service guide | Azure service guides (start with the Reliability section) |

For more information, see Redundancy.

Manage business continuity

Recovering from a failure requires a clear strategy to restore services quickly and minimize disruption to maintain user satisfaction. Follow these steps:

  1. Prepare for failures. Create separate recovery procedures for workloads based on high, medium, and low priorities. Data reliability, code and runtime reliability, and cloud resource reliability are the foundation of preparing for failure. Select other recovery tools to assist with business continuity preparation. For example, use Azure Site Recovery for on-premises and virtual-machine based server workloads.

  2. Test and document recovery plan. Regularly test your failover and failback processes to confirm your workloads meet recovery time objectives (RTO) and recovery point objectives (RPO). Clearly document each step of the recovery plan for easy reference during incidents. Verify that recovery tools, such as Azure Site Recovery, consistently meet your specified RTO.

  3. Detect failures. Adopt a proactive approach to identifying outages quickly, even if this method increases false positives. Prioritize customer experience by minimizing downtime and maintaining user trust.

    1. Monitor failures. Monitor workloads to detect outages within one minute. Use Azure Service Health and Azure Resource Health, and use Azure Monitor alerts to notify relevant teams. Integrate these alerts with Azure DevOps or IT Service Management (ITSM) tools.

    2. Collect service level indicators (SLIs). Track performance by defining and gathering metrics that serve as SLIs. Ensure your teams use these metrics to measure workload performance against your service level objectives (SLOs).

  4. Respond to failures. Align your recovery response to the workload priority. Implement failover procedures to reroute requests to redundant infrastructure and data replicas immediately. Once systems stabilize, resolve the root cause, synchronize data, and execute failback procedures. For more information, see Failover and failback.

  5. Analyze failures. Identify the root causes of the issues and then address the problem. Document any lessons and make the necessary changes.

  6. Manage workload failures. For workload disaster recovery, see the Well-Architected Framework's disaster recovery guide and Azure service guides (start with the Reliability section).
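
As an illustration of measuring an SLI against an SLO (step 3), the sketch below computes an uptime percentage from recorded outage intervals within a measurement window. It assumes outage records are available as start and end timestamps and that outages don't overlap. The function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

Outage = Tuple[datetime, datetime]  # (start, end) of a recorded outage


def uptime_sli(window_start: datetime, window_end: datetime, outages: Iterable[Outage]) -> float:
    """Return uptime as a percentage of the window, clipping outages to the window."""
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0:
        raise ValueError("window_end must be after window_start")
    downtime_seconds = sum(
        (min(end, window_end) - max(start, window_start)).total_seconds()
        for start, end in outages
        if end > window_start and start < window_end  # skip outages outside the window
    )
    return 100 * (1 - downtime_seconds / window_seconds)


# Hypothetical 30-day window with one 20-minute outage.
start = datetime(2024, 1, 1)
end = start + timedelta(days=30)
outages = [(datetime(2024, 1, 10, 8, 0), datetime(2024, 1, 10, 8, 20))]

sli = uptime_sli(start, end, outages)
slo = 99.9
print(f"Uptime SLI: {sli:.3f}% (SLO {slo}%) -> {'met' if sli >= slo else 'missed'}")
```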

Azure reliability tools

| Use case | Solution |
| --- | --- |
| Data replication, backup, and business continuity | Azure service guides (start with the Reliability section). Quick reference: Azure Cosmos DB, Azure SQL Database, Azure Blob Storage, Azure Files |
| Data backup | Azure Backup |
| Business continuity (IaaS) | Azure Site Recovery |
| Multi-region load balancer | Azure Front Door (HTTP); Azure Traffic Manager (non-HTTP) |
| Multi-availability zone load balancer | Azure Application Gateway (HTTP); Azure Load Balancer (non-HTTP) |

Manage security

Use an iterative security process to identify and mitigate threats in your cloud environment.

Manage security controls

Manage your security controls to detect threats to your cloud estate. Follow these steps:

  1. Standardize security tooling. Use standardized tools to detect threats, fix vulnerabilities, investigate issues, secure data, harden resources, and enforce compliance at scale. Refer to Azure security tools.

  2. Baseline your environment. Document the normal state of your cloud estate. Monitor security and document network traffic patterns and user behaviors. Use Azure security baselines and Azure service guides to develop baseline configurations for services. This baseline makes it easier to detect anomalies and potential security weaknesses.

  3. Apply security controls. Implement security measures, such as access controls, encryption, and multifactor authentication. These measures strengthen the environment and reduce the probability of compromise. For more information, see Manage security.

  4. Assign security responsibilities. Designate responsibility for security monitoring across your cloud environment. Regular monitoring and comparisons to the baseline enable quick identification of incidents, such as unauthorized access or unusual data transfers. Regular updates and audits keep your security baseline effective against evolving threats.

For more information, see CAF Secure.

Manage security incidents

Adopt a process and tools to recover from security incidents, such as ransomware, denial of service, or threat actor intrusion. Follow these steps:

  1. Prepare for incidents. Develop an incident response plan that clearly defines roles for investigation, mitigation, and communication. Regularly test the effectiveness of your plan. Evaluate and implement vulnerability management tools, threat detection systems, and infrastructure monitoring solutions. Reduce your attack surface through infrastructure hardening and create workload-specific recovery strategies. See Incident response overview and Incident response playbooks.

  2. Detect incidents. Use a security information and event management (SIEM) tool, like Microsoft Sentinel, to centralize your security data. Use Microsoft Sentinel’s security orchestration, automation, and response (SOAR) capabilities to automate routine security tasks. Integrate threat intelligence feeds into your SIEM to gain insights into adversary tactics relevant to your cloud environment. Use Microsoft Defender for Cloud to regularly scan Azure for vulnerabilities. Microsoft Defender integrates with Microsoft Sentinel to provide a unified view of security events.

  3. Respond to incidents. Immediately activate your incident response plan upon detecting an incident. Quickly start investigation and mitigation procedures. Activate your disaster recovery plan to restore affected systems, and clearly communicate incident details to your team.

  4. Analyze security incidents. After each incident, review threat intelligence and update your incident response plan based on lessons learned and insights from public resources, such as the MITRE ATT&CK knowledge base. Evaluate the effectiveness of your vulnerability management and detection tools and refine strategies based on post-incident analysis.

For more information, see Manage incident response (CAF Secure).

Azure security tools

| Security capability | Microsoft solution |
| --- | --- |
| Identity and access management | Microsoft Entra ID |
| Role-based access control | Azure role-based access control |
| Threat detection | Microsoft Defender for Cloud |
| Security information management | Microsoft Sentinel |
| Data security and governance | Microsoft Purview |
| Cloud resource security | Azure security baselines |
| Cloud governance | Azure Policy |
| Endpoint security | Microsoft Defender for Endpoint |
| Network security | Azure Network Watcher |
| Industrial security | Microsoft Defender for IoT |