Reliability best practices in Azure Monitor

In the cloud, we acknowledge that failures happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. Use the following information to monitor your virtual machines and their client workloads for failure.

This article describes Reliability for Azure Monitor as part of the Azure Well-Architected Framework. The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a workload. The framework consists of five pillars of architectural excellence:

  • Reliability
  • Security
  • Cost Optimization
  • Operational Excellence
  • Performance Efficiency

Azure Monitor Logs

Log Analytics workspaces offer a high degree of reliability. The ingestion pipeline, which sends collected data to the Log Analytics workspace, validates that the workspace has successfully processed each log record before removing the record from the pipeline. If the ingestion pipeline isn’t available, the agents that send the data buffer the logs and retry sending them for many hours.
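The buffer-and-retry behavior described above is a form of at-least-once delivery. The following is a minimal Python sketch of that pattern, not the actual agent or pipeline implementation. The `RetryBuffer` class and the `send` callback are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable, Deque, List


class RetryBuffer:
    """Hypothetical at-least-once buffer: a record leaves the buffer only
    after the receiver acknowledges it."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        self._send = send                    # returns True when the receiver acknowledges
        self._pending: Deque[str] = deque()  # records not yet acknowledged

    def enqueue(self, record: str) -> None:
        self._pending.append(record)

    def flush(self) -> int:
        """Try to deliver pending records in order and stop at the first failure.

        Returns the number of records acknowledged during this call.
        """
        delivered = 0
        while self._pending:
            record = self._pending[0]        # peek, do not remove yet
            if not self._send(record):
                break                        # keep the record for the next retry
            self._pending.popleft()          # remove only after acknowledgment
            delivered += 1
        return delivered

    @property
    def pending(self) -> List[str]:
        return list(self._pending)
```

A caller would invoke `flush()` on a timer. While the receiver is unavailable, records stay queued in order and are delivered once it recovers.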

Azure Monitor Logs features that enhance resilience

Azure Monitor Logs offers several features that enhance workspace resilience to various types of issues. You can use these features individually or in combination, depending on your needs.

In-region protection using availability zones

Each Azure region that supports availability zones has a set of datacenters equipped with independent power, cooling, and networking infrastructure.

Azure Monitor Logs availability zones are redundant, which means that Microsoft spreads service requests and replicates data across different zones in supported regions. If an incident affects one zone, Microsoft uses a different availability zone in the region instead, automatically. You don't need to take any action because switching between zones is seamless.

In most regions, Azure Monitor Logs availability zones support data resilience, which means your stored data is protected against data loss related to zonal failures, but service operations might still be impacted by regional incidents. If the service is unable to run queries, you can't view the logs until the issue is resolved.

A subset of the availability zones that support data resilience also support service resilience, which means that Azure Monitor Logs service operations - for example, log ingestion, queries, and alerts - can continue in the event of a zone failure.

Availability zones protect against infrastructure-related incidents, such as storage failures. They don’t protect against application-level issues, such as faulty code deployments or certificate failures, which impact the entire region.

Backup of data from specific tables using continuous export

You can continuously export data sent to specific tables in your Log Analytics workspace to Azure storage accounts.

The storage account you export data to must be in the same region as your Log Analytics workspace. To keep your ingested logs protected and accessible even if the workspace region is down, use a geo-redundant storage account, as explained in Configuration recommendations.

The export mechanism doesn’t provide protection from incidents impacting the ingestion pipeline or the export process itself.
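If the geo-redundant account is also configured for read access to the secondary region, Azure Storage exposes a read-only secondary endpoint whose host name appends `-secondary` to the account name. The sketch below only derives that URL. The helper name and the example account name are hypothetical, and the default `blob.core.windows.net` suffix is an assumption that applies to the public cloud only.

```python
import re


def secondary_blob_endpoint(account_name: str,
                            suffix: str = "blob.core.windows.net") -> str:
    """Return the read-only secondary blob endpoint for a read-access
    geo-redundant storage account (hypothetical helper)."""
    # Storage account names are 3-24 characters, lowercase letters and digits.
    if not re.fullmatch(r"[a-z0-9]{3,24}", account_name):
        raise ValueError(f"invalid storage account name: {account_name!r}")
    return f"https://{account_name}-secondary.{suffix}"


# Example with a made-up account name:
# secondary_blob_endpoint("contosologs")
```

Reads against this endpoint are served from the secondary region, so exported logs remain readable while the primary region is unavailable.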

Note

You can access data in a storage account from Azure Monitor Logs using the externaldata operator. However, the exported data is stored in five-minute blobs and analyzing data spanning multiple blobs can be cumbersome. Therefore, exporting data to a storage account is a good data backup mechanism, but having the backed up data in a storage account is not ideal if you need it for analysis in Azure Monitor Logs. You can query large volumes of blob data using Azure Data Explorer, Azure Data Factory, or any other storage access tool.
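To illustrate what querying an exported blob involves, the following sketch builds the text of an `externaldata` query for a single blob. It is a string builder only and doesn't call any Azure API. The function name, column list, and blob URL are hypothetical, and `multijson` is an assumed ingestion format for the exported JSON, so verify it against your own blobs.

```python
from typing import Dict


def build_externaldata_query(blob_url: str, columns: Dict[str, str]) -> str:
    """Build a KQL externaldata query for one blob (hypothetical helper).

    columns maps column names to KQL types, for example
    {"TimeGenerated": "datetime"}.
    """
    if not columns:
        raise ValueError("at least one column is required")
    if '"' in blob_url:
        raise ValueError("blob URL must not contain double quotes")
    schema = ", ".join(f"{name}:{kql_type}" for name, kql_type in columns.items())
    return (
        f"externaldata({schema})\n"
        f'[h@"{blob_url}"]\n'          # h@ marks the string as obfuscated in traces
        'with(format="multijson")'     # assumed format for exported JSON
    )
```

At one blob per five minutes, a query covering a single hour would need about twelve such URLs, which is why this approach suits spot checks rather than routine analysis.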

Cross-regional data protection and service resilience using workspace replication (preview)

Workspace replication (preview) is the most extensive resilience solution as it replicates the Log Analytics workspace and incoming logs to another region.

Workspace replication protects both your logs and the service operations, and allows you to continue monitoring your systems in the event of infrastructure or application-related region-wide incidents.

In contrast with availability zones, which Microsoft manages end-to-end, you need to monitor your primary workspace's health and decide when to switch over to the workspace in the secondary region and back.
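Because the switchover decision is yours, it helps to define the rule in advance. The sketch below is one possible decision rule, assuming you already have some periodic health probe of the primary workspace. The class names and thresholds are hypothetical, and the code only recommends a region; it doesn't perform the switch.

```python
from dataclasses import dataclass


@dataclass
class SwitchoverPolicy:
    failure_threshold: int = 3    # consecutive failed probes before switching over
    recovery_threshold: int = 5   # consecutive healthy probes before switching back


class SwitchoverAdvisor:
    """Tracks probe results and recommends which region should be active."""

    def __init__(self, policy: SwitchoverPolicy) -> None:
        self.policy = policy
        self.active = "primary"
        self._failures = 0
        self._successes = 0

    def observe(self, primary_healthy: bool) -> str:
        """Record one probe of the primary workspace and return the
        recommended active region ("primary" or "secondary")."""
        if primary_healthy:
            self._successes += 1
            self._failures = 0
        else:
            self._failures += 1
            self._successes = 0

        if self.active == "primary" and self._failures >= self.policy.failure_threshold:
            self.active = "secondary"
        elif self.active == "secondary" and self._successes >= self.policy.recovery_threshold:
            self.active = "primary"
        return self.active
```

Requiring consecutive results in both directions avoids flapping between regions during intermittent failures.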

Design checklist

  • To ensure service and data resilience to region-wide incidents, enable workspace replication.
  • To ensure in-region protection against datacenter failure, create your workspace in a region that supports availability zones.
  • For cross-regional backup of data in specific tables, use the continuous export feature to send data to a geo-replicated storage account.
  • Monitor the health of your Log Analytics workspaces.

Configuration recommendations

Recommendation Benefit
To ensure the greatest degree of resilience, enable workspace replication. Cross-regional resilience for workspace data and service operations.

Workspace replication (preview) ensures high availability by creating a secondary instance of your workspace in another region and ingesting your logs to both workspaces.

When needed, switch to your secondary workspace until the issues impacting your primary workspace are resolved. You can continue ingesting logs, querying data, using dashboards, alerts, and Sentinel in your secondary workspace. You also have access to logs ingested before the region switch.

This is a paid feature, so consider whether you want to replicate all of your incoming logs, or only some data streams.
If possible, create your workspace in a region that supports Azure Monitor service resilience. In-region resilience of workspace data and service operations in the event of datacenter issues.

Availability zones that support service resilience also support data resilience. This means that even if an entire datacenter becomes unavailable, the redundancy between zones allows Azure Monitor service operations, like ingestion and querying, to continue to work, and your ingested logs to remain available.

Availability zones provide in-region protection, but don't protect against issues that impact the entire region.

For information about which regions support data resilience, see Enhance data and service resilience in Azure Monitor Logs with availability zones.
Create your workspace in a region that supports data resilience. In-region protection against loss of the logs in your workspace in the event of datacenter issues.

Creating your workspace in a region that supports data resilience means that even if the entire datacenter becomes unavailable, your ingested logs are safe.
If the service is unable to run queries, you can't view the logs until the issue is resolved.

For information about which regions support data resilience, see Enhance data and service resilience in Azure Monitor Logs with availability zones.
Configure data export from specific tables to a storage account that's replicated across regions. Maintain a backup copy of your log data in a different region.

The data export feature of Azure Monitor allows you to continuously export data sent to specific tables to Azure storage where it can be retained for extended periods. Use a geo-redundant storage (GRS) or geo-zone-redundant storage (GZRS) account to keep your data safe even if an entire region becomes unavailable. To make your data readable from the other regions, configure your storage account for read access to the secondary region. For more information, see Azure Storage redundancy on a secondary region and Azure Storage read access to data in the secondary region.

For tables that don't support continuous data export, you can use other methods of exporting data, including Logic Apps, to protect your data. This is primarily a solution to meet compliance requirements for data retention, because the exported data can be difficult to analyze and restore to the workspace.

Data export is susceptible to regional incidents because it relies on the stability of the Azure Monitor ingestion pipeline in your region. It doesn't provide resiliency against incidents impacting the regional ingestion pipeline.
Monitor the health of your Log Analytics workspaces. Use Log Analytics workspace insights to track failed queries, and create a health status alert to proactively notify you if a workspace becomes unavailable because of a datacenter or regional failure.

Compare Azure Monitor Logs resilience features

Feature: Workspace replication
  • Service resilience: Yes
  • Data backup: Yes
  • High availability: Yes
  • Scope of protection: Cross-region protection against region-wide incidents
  • Setup: Enable replication of the workspace and related data collection rules. Switch between regions as needed.
  • Cost: Based on the number of replicated GBs and region.

Feature: Availability zones
  • Service resilience: Yes, in supported regions
  • Data backup: Yes
  • High availability: Yes
  • Scope of protection: In-region protection against datacenter issues
  • Setup: Automatically enabled in supported regions.
  • Cost: No cost

Feature: Continuous data export
  • Service resilience: No
  • Data backup: Yes
  • High availability: No
  • Scope of protection: Protection from data loss because of a regional failure (see note 1)
  • Setup: Enable per table.
  • Cost: Cost of data export + Storage blob or Event Hubs

1 Data export provides cross-region protection if you export logs to a geo-replicated storage account. In the event of an incident, previously exported data is backed up and readily available; however, further export might fail, depending on the nature of the incident.

Alerts

Azure Monitor alerts offer a high degree of reliability without any design decisions. Conditions where a temporary loss of alert data may occur are often mitigated by features of other Azure Monitor components.

Design checklist

  • Configure service health alert rules.
  • Configure resource health alert rules.
  • Avoid service limits for alert rules that produce large scale notifications.

Configuration recommendations

Recommendation Benefit
Configure service health alert rules. Service health alerts send you notifications for outages, service disruptions, planned maintenance and security advisories. See Create or edit an alert rule.
Configure resource health alert rules. Resource Health alerts can notify you in near real-time when your resources have a change in their health status. See Create or edit an alert rule.
Avoid service limits for alert rules that produce large scale notifications. If you have alert rules that would send a large number of notifications, you might reach the limits of the service you use to send email or SMS notifications. Configure programmatic actions or choose an alternate notification method or provider to handle large scale notifications. See Service limits for notifications.
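One possible programmatic action is to aggregate alerts into a digest once a per-window budget is exhausted. The sketch below shows that idea only, and isn't a built-in Azure Monitor feature. The function name and the `max_messages` budget are hypothetical, so take real limits from Service limits for notifications.

```python
from typing import List


def plan_notifications(alerts: List[str], max_messages: int) -> List[str]:
    """Return the messages to send for one time window.

    Sends alerts individually while they fit in the budget. Otherwise the
    last slot is used for a digest that aggregates the remaining alerts.
    """
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    if len(alerts) <= max_messages:
        return list(alerts)
    head = alerts[: max_messages - 1]        # sent individually
    rest = alerts[max_messages - 1:]         # folded into a single digest
    digest = f"{len(rest)} alerts aggregated: " + ", ".join(rest)
    return head + [digest]
```

The output never exceeds `max_messages`, so a burst of alerts can't push the downstream email or SMS provider past its limit.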

Virtual machines

Design checklist

  • Create availability alert rules for Azure VMs.
  • Create an agent heartbeat alert rule to verify agent health.
  • Configure data collection and alerting for monitoring reliability of client workflows.

Configuration recommendations

Recommendation Description
Create availability alert rules for Azure VMs. Use the availability metric (preview) to track when an Azure VM is running. While you can quickly enable an availability alert rule for an individual machine using recommended alerts, a single alert rule targeting a resource group or subscription enables availability alerting for all VMs in that scope for a particular region. This is easier to manage than creating an alert rule for each VM and ensures that any new VMs created in the scope are automatically monitored. This alert rule doesn't require the Azure Monitor agent to be installed on the VM, but it isn't available for VMs outside of Azure.
Create an agent heartbeat alert rule to verify agent health. The Azure Monitor agent sends a heartbeat to the Log Analytics workspace every minute. Use a log search alert rule based on the agent heartbeat to be alerted when an agent stops sending heartbeats, which indicates that either the VM is down or the agent is unhealthy and client workloads aren't being monitored. This alert rule requires that the Azure Monitor agent is installed on the VM and applies to both Azure and non-Azure VMs.
Configure data collection and alerting for monitoring reliability of client workflows. Use the information at Monitor virtual machines with Azure Monitor: Collect data to configure collection of client events that indicate potential issues with your client workloads. Use the information at Monitor virtual machines with Azure Monitor: Alerts to create alert rules that proactively notify you of potential operational issues with your client workloads.
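The heartbeat check described above reduces to finding machines whose most recent heartbeat is older than some threshold. The sketch below includes a commonly used query shape against the `Heartbeat` table for reference, and the same logic in plain Python. The five-minute threshold and the `stale_computers` helper are assumptions for illustration, not values from this article.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Reference only: a typical query shape for a heartbeat log search alert.
# The 5m threshold is an assumed value; tune it to your tolerance for gaps.
HEARTBEAT_QUERY = """
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)
"""


def stale_computers(last_heartbeat: Dict[str, datetime],
                    now: datetime,
                    max_age: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return the computers whose last heartbeat is older than max_age."""
    return sorted(
        name for name, seen in last_heartbeat.items()
        if now - seen > max_age
    )
```

Because the agent sends a heartbeat every minute, a threshold of a few minutes tolerates a missed heartbeat or two without raising false alarms.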

Containers

Design checklist

  • Enable scraping of Prometheus metrics for your cluster.
  • Enable Container insights for collection of logs and performance data from your cluster.
  • Create diagnostic settings to collect control plane logs for AKS clusters.
  • Enable recommended Prometheus alerts.
  • Ensure the availability of the Log Analytics workspace supporting Container insights.

Configuration recommendations

Recommendation Benefit
Enable scraping of Prometheus metrics for your cluster. Enable Prometheus on your cluster with Azure Monitor managed service for Prometheus if you don't already have a Prometheus environment. Use Azure Managed Grafana to analyze the Prometheus data collected. See Customize scraping of Prometheus metrics in Azure Monitor managed service for Prometheus to collect additional metrics beyond the default configuration.
Enable Container insights for collection of logs and performance data from your cluster. Container insights collects stdout/stderr logs, performance metrics, and Kubernetes events from each node in your cluster. It provides dashboards and reports for analyzing this data, including the availability of your nodes and other components. Use Log Analytics to identify any availability errors in your collected logs.
Create diagnostic settings to collect control plane logs for AKS clusters. AKS implements control plane logs as resource logs in Azure Monitor. Create a diagnostic setting to send these logs to your Log Analytics workspace so you can use log queries to identify errors and issues affecting availability.
Enable recommended Prometheus alerts. Alerts in Azure Monitor proactively notify you when issues are detected. Start with a set of recommended Prometheus alert rules that detect the most common availability and performance issues with your cluster. Optionally, add log search alerts using data collected by Container insights.
Ensure the availability of the Log Analytics workspace supporting Container insights. Container insights relies on a Log Analytics workspace. See Best practices for Azure Monitor Logs for recommendations to ensure the reliability of the workspace.
