Operational excellence best practices in Azure Monitor
Operational excellence refers to the operations processes required to keep a service running reliably in production. Use the following information to minimize the operational requirements of your monitoring environment.
This article describes Operational excellence for Azure Monitor as part of the Azure Well-Architected Framework. The Azure Well-Architected Framework is a set of guiding tenets that can be used to improve the quality of a workload. The framework consists of five pillars of architectural excellence:
- Reliability
- Security
- Cost Optimization
- Operational Excellence
- Performance Efficiency
Azure Monitor Logs
Design checklist
- Design a workspace architecture with the minimal number of workspaces to meet your business requirements.
- Use Infrastructure as Code (IaC) when managing multiple workspaces.
- Use Log Analytics workspace insights to track the health and performance of your Log Analytics workspaces.
- Create alert rules to be proactively notified of operational issues in the workspace.
- Ensure that you have a well-defined operational process for data segregation.
Configuration recommendations
Recommendation | Benefit |
---|---|
Design a workspace strategy to meet your business requirements. | See Design a Log Analytics workspace architecture for guidance on designing a strategy for your Log Analytics workspaces including how many to create and where to place them. A single or at least minimal number of workspaces will maximize your operational efficiency since it limits the distribution of your operational and security data, increasing your visibility into potential issues, making patterns easier to identify, and minimizing your maintenance requirements. You might have requirements for multiple workspaces such as multiple tenants, or you might need workspaces in multiple regions to support your availability requirements. In these cases, ensure that you have appropriate processes in place to manage this increased complexity. |
Use Infrastructure as Code (IaC) when managing multiple workspaces. | Use Infrastructure as Code (IaC) to define the details of your workspaces in ARM templates, Bicep, or Terraform. This allows you to use your existing DevOps processes to deploy new workspaces and Azure Policy to enforce their configuration. |
Use Log Analytics workspace insights to track the health and performance of your Log Analytics workspaces. | Log Analytics workspace insights provides a unified view of the usage, performance, health, agents, queries, and change log for all your workspaces. Review this information on a regular basis to track the health and operation of each of your workspaces. |
Create alert rules to be proactively notified of operational issues in the workspace. | Each workspace has an operation table that logs important activities affecting the workspace. Create alert rules based on this table to be proactively notified when an operational issue occurs. You can use recommended alerts for the workspace to simplify the creation of the most critical alert rules. |
Ensure that you have a well-defined operational process for data segregation. | You may have different requirements for different types of data stored in your workspace. Make sure that you clearly understand such requirements as data retention and security when designing your workspace strategy and configuring settings such as permissions and long-term retention. You should also have a clearly defined process for occasionally purging data with personal information that's accidentally collected. |
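
As one way to picture the Infrastructure as Code recommendation above, the following minimal Python sketch generates an ARM template for several workspaces from a single list of definitions. The workspace names, regions, retention values, SKU, and API version are illustrative assumptions, not values taken from this article; adjust them to your own environment before deploying anything.

```python
import json

# Hypothetical workspace definitions; names, regions, and retention are illustrative only.
WORKSPACES = [
    {"name": "law-prod-eastus", "location": "eastus", "retention_days": 90},
    {"name": "law-prod-westeurope", "location": "westeurope", "retention_days": 30},
]


def workspace_resource(spec):
    """Build one ARM resource entry for a Log Analytics workspace."""
    return {
        "type": "Microsoft.OperationalInsights/workspaces",
        "apiVersion": "2022-10-01",  # assumed API version; check the current one
        "name": spec["name"],
        "location": spec["location"],
        "properties": {
            "sku": {"name": "PerGB2018"},  # assumed pricing tier
            "retentionInDays": spec["retention_days"],
        },
    }


def build_template(specs):
    """Return an ARM template (as a dict) containing one resource per workspace."""
    names = [s["name"] for s in specs]
    if len(names) != len(set(names)):
        raise ValueError("workspace names must be unique")
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [workspace_resource(s) for s in specs],
    }


if __name__ == "__main__":
    # Print the template so it can be reviewed or checked into source control.
    print(json.dumps(build_template(WORKSPACES), indent=2))
```

Keeping the definitions in one list means a new workspace is a one-line change that goes through the same review and deployment process as the others.
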
Alerts
Design checklist
- Use dynamic thresholds in metric alert rules where appropriate.
- Whenever possible, use one alert rule to monitor multiple resources.
- To control behavior at scale, use alert processing rules.
- Use custom properties to enhance diagnostics.
- Use Logic Apps to customize the notification workflow and integrate with various systems.
Configuration recommendations
Recommendation | Benefit |
---|---|
Use dynamic thresholds in metric alert rules where appropriate. | You may be unsure of the correct numbers to use as the thresholds for your alert rules. Dynamic thresholds use machine learning algorithms to determine the correct thresholds based on historical trends, so you don't need to know the correct predefined threshold in advance. Dynamic thresholds are also useful for rules that monitor multiple resources, where a single threshold can't be configured for all of the resources. See Dynamic thresholds in metric alerts. |
Whenever possible, use one alert rule to monitor multiple resources. | Using alert rules that monitor multiple resources reduces management overhead by allowing you to manage one rule that monitors a large number of resources. |
To control behavior at scale, use alert processing rules. | Alert processing rules can be used to reduce the number of alert rules you need to create and manage. |
Use custom properties to enhance diagnostics. | If the alert rule uses action groups, you can add your own properties to include in the alert notification payload. You can use these properties in the actions called by the action group, such as webhook, Azure function or logic app actions. |
Use Logic Apps to customize the notification workflow and integrate with various systems. | You can use Azure Logic Apps to build and customize workflows for integration. Use Logic Apps to customize your alert notifications: customize the alert email by using your own email subject and body format; customize the alert metadata by looking up tags for affected resources or fetching a log query search result; and integrate with external services by using existing connectors like Outlook, Microsoft Teams, Slack, and PagerDuty. You can also configure the logic app for your own services. |
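
To illustrate how custom properties might be consumed by an action such as a webhook or Azure function, the sketch below formats a short notification from an alert payload. It assumes the payload follows the common alert schema, with `essentials` and `customProperties` under `data`. The sample payload, its values, and the `summarize_alert` helper are hypothetical and exist only for illustration.

```python
# Hypothetical payload, shaped like the common alert schema (values are made up).
SAMPLE_PAYLOAD = {
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "high-cpu",
            "severity": "Sev2",
            "monitorCondition": "Fired",
            "alertTargetIDs": [
                "/subscriptions/<subscription-id>/resourceGroups/rg-demo"
                "/providers/Microsoft.Compute/virtualMachines/vm-demo"
            ],
        },
        # Custom properties added on the alert rule (assumed keys).
        "customProperties": {"team": "platform", "runbook": "RB-1234"},
    },
}


def summarize_alert(payload):
    """Build a plain-text summary that includes any custom properties."""
    data = payload.get("data", {})
    essentials = data.get("essentials", {})
    custom = data.get("customProperties") or {}

    lines = [
        f"[{essentials.get('severity', 'unknown')}] "
        f"{essentials.get('alertRule', 'unnamed rule')} "
        f"is {essentials.get('monitorCondition', 'unknown')}"
    ]
    for target in essentials.get("alertTargetIDs", []):
        lines.append(f"target: {target}")
    # Sort keys so the output is stable regardless of payload ordering.
    for key in sorted(custom):
        lines.append(f"{key}: {custom[key]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize_alert(SAMPLE_PAYLOAD))
```

Because the handler tolerates a missing `customProperties` section, the same code can serve alert rules that have not yet been given custom properties.
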
Virtual machines
Design checklist
- Migrate from legacy agents to Azure Monitor agent.
- Use Azure Arc to monitor your VMs outside of Azure.
- Use Azure Policy to deploy agents and assign data collection rules.
- Establish a strategy for the structure of data collection rules.
- Consider migrating System Center Operations Manager (SCOM) client management packs to Azure Monitor.
Configuration recommendations
Recommendation | Benefit |
---|---|
Migrate from legacy agents to Azure Monitor agent. | The Azure Monitor agent is simpler to manage than the legacy Log Analytics agent and allows more flexibility in your Log Analytics workspace design. Both the Windows and Linux agents allow multihoming, which means they can connect to multiple workspaces. Data collection rules allow you to manage your data collection settings at scale and define unique, scoped configurations for subsets of machines. See Migrate to Azure Monitor Agent from Log Analytics agent for considerations and migration methods. |
Use Azure Arc to monitor your VMs outside of Azure. | Azure Arc for servers allows you to manage physical servers and virtual machines hosted outside of Azure, on your corporate network, or with another cloud provider. With the Azure Connected Machine agent in place, you can deploy the Azure Monitor agent to these VMs using the same method that you use for your Azure VMs and then monitor your entire collection of VMs using the same Azure Monitor tools. |
Use Azure Policy to deploy agents and assign data collection rules. | Azure Policy allows you to have agents automatically deployed to sets of existing VMs and any new VMs that are created. This ensures that all VMs are monitored with minimal intervention by administrators. If you use VM insights, see Enable VM insights by using Azure Policy. If you want to manage Azure Monitor agent without VM insights, see Enable Azure Monitor Agent using Azure Policy. See [Manage data collection rule associations in Azure Monitor](../essentials/data-collection-rule-associations.md#create-new-association) for a template to create a data collection rule association. |
Establish a strategy for the structure of data collection rules. | Data collection rules (DCRs) define the data to collect from virtual machines with the Azure Monitor agent and where to send that data. Each DCR can include multiple collection scenarios and be associated with any number of VMs. Establish a strategy for configuring DCRs to collect only required data for different groups of VMs while minimizing the number of DCRs that you need to manage. |
Consider migrating SCOM client management packs to Azure Monitor. | If you have an existing SCOM environment for monitoring client workloads, you may be able to migrate enough of the management pack logic to Azure Monitor to allow you to retire your SCOM environment, or at least to retire certain management packs. See Migrate from System Center Operations Manager (SCOM) to Azure Monitor. |
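
One possible way to reason about a DCR strategy is to plan associations from a small set of rules, each selected by VM tags. The Python sketch below is a planning aid only and does not call any Azure API. The DCR names, tag keys, and VM inventory are hypothetical, and selecting by tag is an assumed convention rather than a built-in DCR feature.

```python
from collections import defaultdict

# Hypothetical DCRs, each with the tag values a VM must carry to be associated.
DCRS = {
    "dcr-baseline": {},                      # empty selector: applies to every VM
    "dcr-windows-events": {"os": "windows"},
    "dcr-sql-perf": {"role": "sql"},
}

# Hypothetical VM inventory.
VMS = [
    {"name": "vm-web-01", "tags": {"os": "linux", "role": "web"}},
    {"name": "vm-sql-01", "tags": {"os": "windows", "role": "sql"}},
    {"name": "vm-app-01", "tags": {"os": "windows", "role": "app"}},
]


def matches(selector, tags):
    """A VM matches when it carries every tag value the selector requires."""
    return all(tags.get(key) == value for key, value in selector.items())


def plan_associations(dcrs, vms):
    """Return {DCR name: [VM names]} for every DCR with at least one match."""
    plan = defaultdict(list)
    for vm in vms:
        for dcr_name, selector in dcrs.items():
            if matches(selector, vm["tags"]):
                plan[dcr_name].append(vm["name"])
    return dict(plan)


if __name__ == "__main__":
    for dcr_name, vm_names in plan_associations(DCRS, VMS).items():
        print(f"{dcr_name}: {', '.join(vm_names)}")
```

In this sketch three DCRs cover all three VMs: a shared baseline plus two narrowly scoped rules, which keeps the number of DCRs small while each VM collects only the data it needs.
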
Containers
Design checklist
- Review guidance for monitoring all layers of your Kubernetes environment.
- Use Azure Arc-enabled Kubernetes to monitor your clusters outside of Azure.
- Use Azure managed services for cloud native tools.
- Integrate AKS clusters into your existing monitoring tools.
- Use Azure Policy to enable data collection from your Kubernetes cluster.
Configuration recommendations
Recommendation | Benefit |
---|---|
Review guidance for monitoring all layers of your Kubernetes environment. | The article Monitor your Kubernetes cluster performance with Container insights includes guidance and best practices for monitoring your entire Kubernetes environment across the network, cluster, and application layers. |
Use Azure Arc-enabled Kubernetes to monitor your clusters outside of Azure. | Azure Arc-enabled Kubernetes allows your Kubernetes clusters running in other clouds to be monitored using the same tools as your AKS clusters, including Container insights and Azure Monitor managed service for Prometheus. |
Use Azure managed services for cloud native tools. | Azure Monitor managed service for Prometheus and Azure Managed Grafana support all the features of the cloud native tools Prometheus and Grafana without requiring you to operate their underlying infrastructure. You can quickly provision these tools and onboard your Kubernetes clusters with minimal overhead. These services allow you to access an extensive library of community rules and dashboards to monitor your Kubernetes environment. |
Integrate AKS clusters into your existing monitoring tools. | If you have an existing investment in Prometheus and Grafana, integrate your AKS clusters and Azure managed services into your existing environment using the guidance in Monitor Kubernetes clusters using Azure services and cloud native tools. |
Use Azure Policy to enable data collection from your Kubernetes cluster. | Use Azure Policy to enable collection of Prometheus metrics, Container insights, and diagnostic settings. This ensures that any new clusters are automatically monitored and enforces their monitoring configuration. |
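
The kind of check that a policy-driven approach automates can be pictured with a small audit sketch. The Python below is not Azure Policy itself and does not query Azure. It only compares a hypothetical cluster inventory against the three data collection features named above, using made-up cluster names and feature labels.

```python
# Monitoring features every cluster is expected to have enabled (illustrative labels).
REQUIRED_FEATURES = {"prometheus_metrics", "container_insights", "diagnostic_settings"}

# Hypothetical inventory; in practice this would come from your own tooling.
CLUSTERS = [
    {
        "name": "aks-prod-01",
        "enabled": {"prometheus_metrics", "container_insights", "diagnostic_settings"},
    },
    {"name": "aks-dev-01", "enabled": {"container_insights"}},
]


def find_gaps(clusters, required=REQUIRED_FEATURES):
    """Return {cluster name: sorted missing features} for non-compliant clusters."""
    gaps = {}
    for cluster in clusters:
        missing = required - cluster["enabled"]
        if missing:
            gaps[cluster["name"]] = sorted(missing)
    return gaps


if __name__ == "__main__":
    for name, missing in find_gaps(CLUSTERS).items():
        print(f"{name} is missing: {', '.join(missing)}")
```

A manual audit like this only reports gaps at the moment it runs, which is why enforcing the configuration through policy is the more sustainable option for new clusters.
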