Reliability recommendations
Azure Advisor helps you ensure and improve the continuity of your business-critical applications. You can get reliability recommendations on the Reliability tab on the Advisor dashboard.
Sign in to the Azure portal.
Search for and select Advisor from any page.
On the Advisor dashboard, select the Reliability tab.
AgFood Platform
Upgrade to the latest ADMA DotNet SDK version
We identified calls to an ADMA DotNet SDK version that is scheduled for deprecation. To ensure uninterrupted access to ADMA, latest features, and performance improvements, switch to the latest SDK version.
Potential benefits: Ensure uninterrupted access to ADMA
For More information, see What is Azure Data Manager for Agriculture?
Upgrade to the latest ADMA Java SDK version
We identified calls to an ADMA Java SDK version that is scheduled for deprecation. To ensure uninterrupted access to ADMA, latest features, and performance improvements, switch to the latest SDK version.
Potential benefits: Ensure uninterrupted access to ADMA
For More information, see What is Azure Data Manager for Agriculture?
Upgrade to the latest ADMA Python SDK version
We identified calls to an ADMA Python SDK version that is scheduled for deprecation. To ensure uninterrupted access to ADMA, latest features, and performance improvements, switch to the latest SDK version.
Potential benefits: Ensure uninterrupted access to ADMA
For More information, see What is Azure Data Manager for Agriculture?
Upgrade to the latest ADMA JavaScript SDK version
We identified calls to an ADMA JavaScript SDK version that is scheduled for deprecation. To ensure uninterrupted access to ADMA, latest features, and performance improvements, switch to the latest SDK version.
Potential benefits: Ensure uninterrupted access to ADMA
For More information, see What is Azure Data Manager for Agriculture?
API Management
Migrate API Management service to stv2 platform
Support for API Management instances hosted on the stv1 platform will be retired by 31 August 2024. Migrate to the stv2-based platform before that date to avoid service disruption.
Potential benefits: Improve service stability and leverage new platform features
For More information, see API Management stv1 platform retirement - Global Azure cloud (August 2024)
Hostname certificate rotation failed
The API Management service failing to refresh the hostname certificate from the Key Vault can lead to the service using a stale certificate and runtime API traffic being blocked. Ensure that the certificate exists in the Key Vault, and the API Management service identity is granted secret read access.
Potential benefits: Ensure service availability
For More information, see Configure a custom domain name for your Azure API Management instance
The legacy portal was deprecated 3 years ago and retired in October 2023. However, we are seeing active usage of the portal which may cause service disruption soon when we disable it.
We highly recommend that you migrate to the new developer portal as soon as possible to continue enjoying our services and take advantage of the new features and improvements.
Potential benefits: Ensure business continuity
For More information, see Migrate to the new developer portal
Dependency network status check failed
An Azure API Management service dependency isn't available. Check the virtual network configuration.
Potential benefits: Improve service stability
For More information, see Deploy your Azure API Management instance to a virtual network - external mode
SSL/TLS renegotiation blocked
SSL/TLS renegotiation attempt blocked; secure communication might fail. To support client certificate authentication scenarios, enable 'Negotiate client certificate' on listed hostnames. For browser-based clients, this option might result in a certificate prompt being presented to the client.
Potential benefits: Ensure service availability
For More information, see How to secure APIs using client certificate authentication in API Management
Deploy an Azure API Management instance to multiple Azure regions for increased service availability
Azure API Management supports multi-region deployment, which enables API publishers to add regional API gateways to an existing API Management instance. Multi-region deployment helps reduce request latency perceived by geographically distributed API consumers and improves service availability.
Potential benefits: Increased resilience against regional failures
For More information, see Deploy an Azure API Management instance to multiple Azure regions
Enable and configure autoscale for API Management instance on production workloads.
API Management instance in production service tiers can be scaled by adding and removing units. The autoscaling feature can dynamically adjust the units of an API Management instance to accommodate a change in load without manual intervention.
Potential benefits: Increase scalability and optimize cost.
For More information, see Automatically scale an Azure API Management instance
App Service
Scale out your App Service plan to avoid CPU exhaustion
High CPU utilization can lead to runtime issues with applications. Your application exceeded 90% CPU over the last couple of days. To reduce CPU usage and avoid runtime issues, scale out the application.
Potential benefits: Keep your app healthy
For More information, see Best practices for Azure App Service
Check your app's service health issues
We have a recommendation related to your app's service health. Open the Azure portal, go to the app, and select Diagnose and Solve to see more details.
Potential benefits: Keep your app healthy
For More information, see Best practices for Azure App Service
Fix the backup database settings of your App Service resource
When an application has an invalid database configuration, its backups fail. For details, see your application's backup history on your app management page.
Potential benefits: Ensure business continuity
For More information, see Best practices for Azure App Service
Fix the backup storage settings of your App Service resource
When an application has invalid storage settings, its backups fail. For details, see your application's backup history on your app management page.
Potential benefits: Ensure business continuity
For More information, see Best practices for Azure App Service
Scale up your App Service plan SKU to avoid memory problems
The App Service Plan containing your application exceeded 85% memory allocation. High memory consumption can lead to runtime issues with your applications. Find the problem application and scale it up to a higher plan with more memory resources.
Potential benefits: Keep your app healthy
For More information, see Best practices for Azure App Service
Scale out your App Service plan
Consider scaling out your App Service Plan to at least two instances to avoid cold start delays and service interruptions during routine maintenance.
Potential benefits: Optimize user experience and availability
For More information, see https://aka.ms/appsvcnuminstances
Fix application code, a worker process crashed due to an unhandled exception
A worker process in your application crashed due to an unhandled exception. To identify the root cause, collect memory dumps and call stack information at the time of the crash.
Potential benefits: Keep your app healthy and highly available
For More information, see https://aka.ms/appsvcproactivecrashmonitoring
Upgrade your App Service to a Standard plan to avoid request rejects
When an application is part of a shared App Service plan and meets its quota multiple times, incoming requests might be rejected. Your web application can’t accept incoming requests after meeting a quota. To remove the quota, upgrade to a Standard plan.
Potential benefits: Keep your app healthy
For More information, see Azure App Service plan overview
Move your App Service resource to Standard or higher and use deployment slots
When an application is deployed multiple times in a week, problems might occur. You deployed your application multiple times last week. To help you reduce deployment impact to your production web application, move your App Service resource to the Standard (or higher) plan, and use deployment slots.
Potential benefits: Keep your app healthy while updating
For More information, see Set up staging environments in Azure App Service
Consider upgrading the hosting plan of the Static Web App(s) in this subscription to Standard SKU.
The combined bandwidth used by all the Free SKU Static Web Apps in this subscription exceeds the monthly limit of 100 GB. Consider upgrading these applications to Standard SKU to avoid throttling.
Potential benefits: Higher availability for the apps by avoiding throttling.
For More information, see Pricing – Static Web Apps
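The condition behind this recommendation is a sum across apps compared with a single limit. The following is a minimal sketch, assuming you already have per-app monthly bandwidth figures; the app names, the `usage_gb` mapping, and the function name are hypothetical and aren't part of any Azure SDK.

```python
FREE_SKU_MONTHLY_LIMIT_GB = 100  # combined limit quoted in the recommendation


def exceeds_free_bandwidth(usage_gb: dict[str, float],
                           limit_gb: float = FREE_SKU_MONTHLY_LIMIT_GB) -> bool:
    """Return True when the combined monthly bandwidth of all apps is over the limit."""
    return sum(usage_gb.values()) > limit_gb


# Hypothetical usage figures (GB per month) for three Free SKU apps.
usage = {"docs-site": 42.5, "landing-page": 38.0, "blog": 25.0}
print(exceeds_free_bandwidth(usage))
```

No single app in the example is near the limit on its own; the combined total is what triggers the check.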
Use deployment slots for your App Service resource
When an application is deployed multiple times in a week, problems might occur. You deployed your application multiple times over the last week. To help you manage changes and help reduce deployment impact to your production web application, use deployment slots.
Potential benefits: Keep your app healthy while updating
For More information, see Set up staging environments in Azure App Service
Consider changing your application architecture to 64-bit
Your App Service is configured as 32-bit, and its memory consumption is approaching the limit of 2 GB. If your application supports it, consider recompiling your application and changing the App Service configuration to 64-bit instead.
Potential benefits: Improve your application reliability
For More information, see Application performance FAQs for Web Apps in Azure
CX Observer Personalized Recommendation
CX Observer Personalized Recommendation
Potential benefits: NA
App Service Certificates
Domain verification required to issue your App Service Certificate
You have an App Service Certificate that's currently in a Pending Issuance status and requires domain verification. Failure to validate domain ownership will result in an unsuccessful certificate issuance. Domain verification isn't automated for App Service Certificates and will require action. If you've recently verified domain ownership and have been issued a certificate, you may disregard this message.
Potential benefits: Ensure successful issuance of App Service Certificate.
For More information, see Add and manage TLS/SSL certificates in Azure App Service
Application Gateway
Upgrade your SKU or add more instances
Deploying two or more medium or large sized instances ensures business continuity (fault tolerance) during outages caused by planned or unplanned maintenance.
Potential benefits: Ensure business continuity through application gateway resilience
For More information, see Multi-region load balancing - Azure Reference Architectures
Avoid hostname override to ensure site integrity
Avoid overriding the hostname when configuring Application Gateway. A frontend domain on Application Gateway that differs from the one used to access the backend can lead to broken cookies or redirect URLs. Make sure the backend can handle the domain difference, or update the Application Gateway configuration so the hostname doesn't need to be overwritten towards the backend. When used with App Service, attach a custom domain name to the Web App and avoid use of the *.azurewebsites.net host name towards the backend. A different frontend domain isn't a problem in all situations, and certain categories of backends, like REST APIs, are less sensitive in general.
Potential benefits: Ensure site integrity and avoid broken cookies or redirect URLs through a resilient Application Gateway configuration.
For More information, see Troubleshoot App Service issues in Application Gateway
Implement ExpressRoute Monitor on Network Performance Monitor
When an ExpressRoute circuit isn't monitored by ExpressRoute Monitor on Network Performance, you miss notifications of loss, latency, and performance of on-premises to Azure resources, and Azure to on-premises resources. For end-to-end monitoring, implement ExpressRoute Monitor on Network Performance.
Potential benefits: Improve time-to-detect and time-to-mitigate issues in your network and provide insights on your network path via ExpressRoute
For More information, see Configure Network Performance Monitor for ExpressRoute (deprecated)
Implement multiple ExpressRoute circuits in your Virtual Network for cross premises resiliency
When an ExpressRoute gateway only has one ExpressRoute circuit associated to it, resiliency issues might occur. To ensure peering location redundancy and resiliency, connect one or more additional circuits to your gateway.
Potential benefits: Improve resiliency in case of ExpressRoute peering location failure
For More information, see Designing for high availability with ExpressRoute
Add at least one more endpoint to the profile, preferably in another Azure region
Profiles need more than one endpoint to ensure availability if one of the endpoints fails. We also recommend that endpoints be in different regions.
Potential benefits: Improve resiliency by allowing failover
For More information, see Traffic Manager endpoints
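The two conditions in this recommendation (more than one endpoint, and endpoints in different regions) can be expressed as a small check. This is a sketch only: the `Endpoint` type, its fields, and the finding messages are hypothetical and don't correspond to the Traffic Manager API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str    # hypothetical endpoint name
    region: str  # Azure region that hosts the endpoint


def profile_findings(endpoints: list[Endpoint]) -> list[str]:
    """Return the resiliency findings that apply to a profile's endpoints."""
    findings = []
    if len(endpoints) < 2:
        findings.append("add at least one more endpoint")
    if len({e.region for e in endpoints}) < 2:
        findings.append("place endpoints in more than one region")
    return findings


same_region = [Endpoint("web-a", "westeurope"), Endpoint("web-b", "westeurope")]
print(profile_findings(same_region))
```

A profile with two endpoints in the same region passes the first condition but still gets the regional finding, because a regional failure would take out both endpoints.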
Add an endpoint configured to "All (World)"
For geographic routing, traffic is routed to endpoints in defined regions. When a region fails, there is no predefined failover. Having an endpoint where the Regional Grouping is configured to "All (World)" for geographic profiles avoids traffic black holing and guarantees service availability.
Potential benefits: Improve resiliency by avoiding traffic black holes
For More information, see Add, disable, enable, delete, or move endpoints
Add or move one endpoint to another Azure region
All endpoints associated to this proximity profile are in the same region. Users from other regions may experience long latency when attempting to connect. Adding or moving an endpoint to another region will improve overall performance for proximity routing and provide better availability if all endpoints in one region fail.
Potential benefits: Improve resiliency by allowing failover to another region
For More information, see Configure the performance traffic routing method
Move to production gateway SKUs from Basic gateways
The Basic VPN SKU is for development or testing scenarios. If you're using the VPN gateway for production, move to a production SKU, which offers higher numbers of tunnels, Border Gateway Protocol (BGP), active-active configuration, custom IPsec/IKE policy, and increased stability and availability.
Potential benefits: Additional available features and higher stability and availability
For More information, see About VPN Gateway configuration settings
Enable Active-Active gateways for redundancy
In an active-active configuration, both instances of the VPN gateway establish site-to-site (S2S) VPN tunnels to your on-premises VPN device. When planned maintenance or an unplanned event affects one gateway instance, traffic automatically switches over to the other active IPsec tunnel.
Potential benefits: Ensure business continuity through connection resilience
For More information, see Design highly available gateway connectivity for cross-premises and VNet-to-VNet connections
Disable health probes when there is only one origin in an origin group
If you only have a single origin, Front Door always routes traffic to that origin even if its health probe reports an unhealthy status. The status of the health probe doesn't do anything to change Front Door's behavior. In this scenario, health probes don't provide a benefit.
Potential benefits: Ensure service availability by reducing unnecessary health probe traffic
For More information, see Best practices for Front Door
Use managed TLS certificates
When Front Door manages your TLS certificates, it reduces your operational costs, and helps you to avoid costly outages caused by forgetting to renew a certificate. Front Door automatically issues and rotates the managed TLS certificates.
Potential benefits: Ensure service availability by having Front Door manage and rotate your certificates
For More information, see Best practices for Front Door
Use NAT gateway for outbound connectivity
Prevent connectivity failures due to source network address translation (SNAT) port exhaustion by using NAT gateway for outbound traffic from your virtual networks. NAT gateway scales dynamically and provides secure connections for traffic headed to the internet.
Potential benefits: Prevent outbound connection failures with NAT gateway
For More information, see Use Source Network Address Translation (SNAT) for outbound connections
Deploy your Application Gateway across Availability Zones
Achieve zone redundancy by deploying Application Gateway across Availability Zones. Zone redundancy boosts resilience by enabling Application Gateway to survive various outages, which ensures continuity even if one zone is affected, and enhances overall reliability.
Potential benefits: Resiliency of Application Gateways is considerably increased when using Availability Zones.
For More information, see Scaling Application Gateway v2 and WAF v2
Update VNet permission of Application Gateway users
To improve security and provide a more consistent experience across Azure, all users must pass a permission check to create or update an Application Gateway in a Virtual Network. The minimum permission required for users or service principals is Microsoft.Network/virtualNetworks/subnets/join/action.
Potential benefits: Avoid disruptions in management of Application Gateway resource
For More information, see Application Gateway infrastructure configuration
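To see whether a role's allowed actions would cover the required permission, you can match the action string against the role's action patterns. The sketch below is a deliberately simplified illustration: it assumes `*` wildcards in action patterns and case-insensitive comparison, and it ignores `NotActions`, scope, and deny assignments, all of which Azure evaluates for the real permission check. The function name is hypothetical.

```python
from fnmatch import fnmatchcase

REQUIRED_ACTION = "Microsoft.Network/virtualNetworks/subnets/join/action"


def grants_action(allowed_actions: list[str], required: str = REQUIRED_ACTION) -> bool:
    """Return True if any action pattern matches the required action.

    Simplified: treats '*' as matching any run of characters and compares
    case-insensitively (an assumption of this sketch).
    """
    required = required.lower()
    return any(fnmatchcase(required, pattern.lower()) for pattern in allowed_actions)


print(grants_action(["Microsoft.Network/*"]))
print(grants_action(["Microsoft.Compute/*/read"]))
```

A result of `False` for every role assigned to a user or service principal suggests that the role assignments need to be updated before that identity creates or updates an Application Gateway.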
Use the same domain name on Front Door and your origin
When you rewrite the Host header, request cookies and URL redirections might break. When you use platforms like Azure App Service, features like session affinity and authentication and authorization might not work correctly. Make sure to validate whether your application is going to work correctly.
Potential benefits: Ensure application integrity by preserving original host name
For More information, see Best practices for Front Door
Implement Site Resiliency for ExpressRoute
To ensure maximum resiliency, Microsoft recommends that you connect to two ExpressRoute circuits in two peering locations. The goal of Maximum Resiliency is to enhance availability and ensure the highest level of resilience for critical workloads.
Potential benefits: Maximum Resiliency in ExpressRoute is designed to ensure there isn’t a single point of failure within the Microsoft network path. This is achieved by offering two circuits across two different locations for site diversity in ExpressRoute.
For More information, see Design and architect Azure ExpressRoute for resiliency
Implement Zone Redundant ExpressRoute Gateways
Implement zone-redundant Virtual Network Gateway in Azure Availability Zones. This brings resiliency, scalability, and higher availability to your Virtual Network Gateways.
Potential benefits: Provides zonal resiliency and redundancy for ExpressRoute
For More information, see Create a zone-redundant virtual network gateway in availability zones
Ensure autoscaling is used for increased performance and resiliency
When configuring the Application Gateway, it's recommended to provision autoscaling to scale in and out in response to changes in demand. This helps to minimize the effects of a single failing component.
Potential benefits: Increase performance and resiliency.
For More information, see Scaling Application Gateway v2 and WAF v2
ExpressRoute IP routes nearing specified limit
Your ExpressRoute circuit is close to reaching its IP route limits. Exceeding these limits disrupts connectivity, which is restored once routes are back within limits. Suggestions: Regularly monitor route counts, and explore Virtual WAN RouteMap to reduce advertised IP routes.
Potential benefits: Monitoring IP route counts prevents connectivity issues and ensures stability.
For More information, see Virtual WAN FAQ
Avoid placing Traffic Manager behind Front Door
Using Traffic Manager as one of the origins for Front Door isn't recommended, as this can lead to routing issues. If you need both services in a high availability architecture, always place Traffic Manager in front of Azure Front Door.
Potential benefits: Increase your workload resiliency
For More information, see Best practices for Front Door
Consider having at least two origins
Multiple origins support redundancy by distributing traffic across multiple instances of the application. If one instance is unavailable, then other backend origins can still receive traffic.
Potential benefits: Increase your workload resiliency
For More information, see Azure Well-Architected Framework perspective on Azure Front Door
Change subnet of V1 gateway named GatewaySubnet as it's reserved for VPN/ExpressRoute
Your Application Gateway is at risk of deletion after October 2024 due to a failed internal upgrade. The cause is a subnet named GatewaySubnet, which is reserved for VPN/ExpressRoute. To resolve the issue, change the subnet or migrate to V2. Allow a day for the message to disappear once the issue is fixed.
Potential benefits: Avoid disruption in management of Application Gateway V1 resource
For More information, see Frequently asked questions about Application Gateway
Change subnet of V1 gateway as the current subnet contains a NAT gateway
Your Application Gateway may be deleted after October 2024 due to a failed internal upgrade. This is because its subnet isn't dedicated and contains a NAT gateway. To resolve the issue, change the subnet, remove the NAT gateway, or migrate to V2. Allow a day for the message to disappear once the issue is fixed.
Potential benefits: Avoid disruption in management of Application Gateway V1 resource
For More information, see Frequently asked questions about Application Gateway
Reactivate the Subscription to unblock internal upgrade for V1 gateway
Your Application Gateway is at risk of deletion after October 2024 due to a failed internal upgrade. This is because the subscription is in a non-active state. To fix this, reactivate the subscription. Allow a day for this message to disappear once the issue is fixed.
Potential benefits: Avoid disruption in management of Application Gateway V1 resource
For More information, see Reactivate a disabled Azure subscription
Application Gateway for Containers
Migrate to supported version of AGC
This Application Gateway for Containers was provisioned with a preview version and isn't supported for production. Provision a new gateway using the latest API version.
Potential benefits: Ensure supportability and resiliency for production workloads
For More information, see What is Application Gateway for Containers?
Azure AI Search
Create a Standard search service (2GB)
When you exceed your storage quota, indexing operations stop working. You're close to exceeding your storage quota of 2GB. If you need more storage, create a Standard search service or add extra partitions.
Potential benefits: capability to handle more data
For More information, see https://aka.ms/azs/search-limits-quotas-capacity
Create a Standard search service (50MB)
When you exceed your storage quota, indexing operations stop working. You're close to exceeding your storage quota of 50MB. To maintain operations, create a Basic or Standard search service.
Potential benefits: capability to handle more data
For More information, see https://aka.ms/azs/search-limits-quotas-capacity
Avoid exceeding your available storage quota by adding more partitions
When you exceed your storage quota, you can still query, but indexing operations stop working. You're close to exceeding your available storage quota. If you need more storage, add extra partitions.
Potential benefits: Able to index additional data
For More information, see https://aka.ms/azs/search-limits-quotas-capacity
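The three storage recommendations above share one pattern: compare current index storage with the quota and act before indexing stops. The following is a minimal sketch of that comparison. The 90% warning ratio and the function name are hypothetical, and treating MB and GB as binary units is an assumption; the 50 MB and 2 GB quotas come from the recommendations above.

```python
MB = 1024 ** 2  # assumption: binary units
GB = 1024 ** 3


def storage_status(used_bytes: int, quota_bytes: int, warn_ratio: float = 0.9) -> str:
    """Classify storage usage against a quota.

    warn_ratio is a hypothetical threshold for "close to exceeding".
    """
    if used_bytes >= quota_bytes:
        return "exceeded"      # indexing operations stop working
    if used_bytes >= warn_ratio * quota_bytes:
        return "near-quota"    # time to add partitions or change tier
    return "ok"


print(storage_status(48 * MB, 50 * MB))
print(storage_status(1 * GB, 2 * GB))
```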
Azure Arc-enabled Kubernetes
Upgrade to the latest agent version of Azure Arc-enabled Kubernetes
For the best Azure Arc-enabled Kubernetes experience, improved stability, and new functionality, upgrade to the latest agent version.
Potential benefits: Arc-enabled K8s latest agent version
For More information, see Upgrade Azure Arc-enabled Kubernetes agents
Azure Arc-enabled Kubernetes Configuration
Upgrade Microsoft Flux extension to the newest major version
The Microsoft Flux extension has a major version release. Plan for a manual upgrade to the latest major version for Microsoft Flux for all Azure Arc-enabled Kubernetes and Azure Kubernetes Service (AKS) clusters within 6 months for continued support and new functionality.
Potential benefits: Continued support and new functionality
For More information, see Available extensions for Azure Arc-enabled Kubernetes clusters
Upcoming Breaking Changes for Microsoft Flux Extension
The Microsoft Flux extension frequently receives updates for security and stability. The upcoming update, in line with the OSS Flux Project, will modify the HelmRelease and HelmChart APIs by removing deprecated fields. To avoid disruption to your workloads, action is required.
Potential benefits: Improved stability, security, and new functionality
For More information, see Available extensions for Azure Arc-enabled Kubernetes clusters
Upgrade Microsoft Flux extension to a supported version
The current version of Microsoft Flux on one or more Azure Arc-enabled clusters and Azure Kubernetes clusters is out of support. To get security patches, bug fixes, and Microsoft support, upgrade to a supported version.
Potential benefits: Get security patches, bug fixes and Microsoft support
For More information, see Available extensions for Azure Arc-enabled Kubernetes clusters
Azure Arc-enabled servers
Upgrade to the latest version of the Azure Connected Machine agent
The Azure Connected Machine agent is updated regularly with bug fixes, stability enhancements, and new functionality. For the best Azure Arc experience, upgrade your agent to the latest version.
Potential benefits: Improved stability and new functionality
For More information, see Managing and maintaining the Connected Machine agent
Azure Cache for Redis
Increase fragmentation memory reservation
Fragmentation and memory pressure can cause availability incidents. To help reduce cache failures when running under high memory pressure, increase the reservation of memory for fragmentation through the maxfragmentationmemory-reserved setting available in the Advanced Settings options.
Potential benefits: Avoid availability incidents when your cache has high memory fragmentation
For More information, see How to configure Azure Cache for Redis
Configure geo-replication for Cache for Redis instances to increase durability of applications
Geo-Replication enables disaster recovery for cached data, even in the unlikely event of a widespread regional failure. This can be essential for mission-critical applications. We recommend that you configure passive geo-replication for Premium Azure Cache for Redis instances.
Potential benefits: Geo-Replication enables disaster recovery for cached data.
For More information, see Configure passive geo-replication for Premium Azure Cache for Redis instances
Azure Container Apps
Re-create your Container Apps environment to avoid DNS issues
There's a potential networking issue with your Container Apps environments that might cause DNS issues. We recommend that you create a new Container Apps environment, re-create your Container Apps in the new environment, and delete the old Container Apps environment.
Potential benefits: Avoid DNS failures in your Container Apps Environment.
For More information, see Quickstart: Deploy your first container app using the Azure portal
Renew custom domain certificate
The custom domain certificate you uploaded is near expiration. To prevent possible service downtime, renew your certificate and upload the new certificate for your container apps.
Potential benefits: Your service won't fail because of an expired certificate.
For More information, see Custom domain names and bring your own certificates in Azure Container Apps
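If you track certificate expiry dates yourself, a check like the following can flag certificates that are close to expiring. This is a sketch under stated assumptions: the 30-day warning window and both function names are hypothetical, and the expiry timestamp is assumed to come from your own certificate inventory.

```python
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(not_after: datetime, now: Optional[datetime] = None) -> int:
    """Whole days remaining before the certificate's expiry timestamp."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days


def needs_renewal(not_after: datetime, now: Optional[datetime] = None,
                  warn_days: int = 30) -> bool:
    """True when the certificate expires within warn_days (hypothetical window)."""
    return days_until_expiry(not_after, now) <= warn_days


# Fixed reference time so the example is reproducible.
reference = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(days_until_expiry(expires, reference), needs_renewal(expires, reference))
```

Both timestamps should be timezone-aware; subtracting a naive `datetime` from an aware one raises a `TypeError` in Python.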
An issue has been detected that is preventing the renewal of your Managed Certificate.
We detected that the managed certificate used by the container app failed to renew automatically. Follow the documentation link to make sure that the DNS settings of your custom domain are correct.
Potential benefits: Avoid downtime due to an expired certificate.
For More information, see Custom domain names and free managed certificates in Azure Container Apps
Increase the minimal replica count for your containerized application
The minimal replica count set for your Azure Container App containerized application might be too low, which can cause resilience, scalability, and load balancing issues. For better availability, consider increasing the minimal replica count.
Potential benefits: Better availability for your container app.
For More information, see Set scaling rules in Azure Container Apps
Azure Cosmos DB
Configure Azure Cosmos DB containers with a partition key
When Azure Cosmos DB nonpartitioned collections reach their provisioned storage quota, you lose the ability to add data. Your Cosmos DB nonpartitioned collections are approaching their provisioned storage quota. Migrate these collections to new collections with a partition key definition so they can automatically be scaled out by the service.
Potential benefits: Scale your containers seamlessly with increase in storage or request rates without running into any limits
For More information, see Partitioning and horizontal scaling in Azure Cosmos DB
Use static Cosmos DB client instances in your code and cache the names of databases and collections
A high number of metadata operations on an account can result in rate limiting. Metadata operations have a system-reserved request unit (RU) limit. Avoid rate limiting from metadata operations by using static Cosmos DB client instances in your code and caching the names of databases and collections.
Potential benefits: Optimize your RU usage and avoid rate limiting
For More information, see Performance tips for Azure Cosmos DB and .NET SDK v2
Check linked Azure Key Vault hosting your encryption key
When an Azure Cosmos DB account can't access its linked Azure Key Vault hosting the encryption key, data access and security issues might happen. Your Azure Key Vault's configuration is preventing your Cosmos DB account from contacting the key vault to access your managed encryption keys. If you recently performed a key rotation, ensure that the previous key, or key version, remains enabled and available until Cosmos DB completes the rotation. The previous key or key version can be disabled after 24 hours, or after the Azure Key Vault audit logs don't show any activity from Azure Cosmos DB on that key or key version.
Potential benefits: Update your configurations to continue using customer-managed keys and access your data
For More information, see Configure customer-managed keys for your Azure Cosmos DB account with Azure Key Vault
Configure consistent indexing mode on Azure Cosmos DB containers
Azure Cosmos containers configured with the Lazy indexing mode update asynchronously, which improves write performance, but can impact query freshness. Your container is configured with the Lazy indexing mode. If query freshness is critical, use Consistent Indexing Mode for immediate index updates.
Potential benefits: Improve query result consistency and reliability
For More information, see Manage indexing policies in Azure Cosmos DB
Hotfix - Upgrade to 2.6.14 version of the Async Java SDK v2 or to Java SDK v4
There's a critical bug in version 2.6.13 (and lower) of the Azure Cosmos DB Async Java SDK v2 that causes errors when a global logical sequence number (LSN) greater than the Max Integer value is reached. This happens within the service, transparently to you, after a large volume of transactions occurs in the lifetime of an Azure Cosmos DB container. Note: While this is a critical hotfix for the Async Java SDK v2, we still highly recommend that you migrate to the Java SDK v4.
Potential benefits: If action isn’t taken, all create, read, update, and delete operations may begin to fail with NumberFormatException
For More information, see Azure Cosmos DB Async Java SDK for API for NoSQL (legacy): Release notes and resources
Critical issue - Upgrade to the current recommended version of the Java SDK v4
There's a critical bug in version 4.15 and lower of the Azure Cosmos DB Java SDK v4 that causes errors when a global logical sequence number (LSN) greater than the maximum integer value is reached. This condition arises within the service, without any action on your part, after a large volume of transactions occurs over the lifetime of an Azure Cosmos DB container. Avoid this problem by upgrading to the current recommended version of the Java SDK v4.
Potential benefits: If action isn’t taken, all create, read, update, and delete operations may begin to fail with NumberFormatException
For More information, see Azure Cosmos DB Java SDK v4 for API for NoSQL: release notes and resources
Use the new 3.6+ endpoint to connect to your upgraded Azure Cosmos DB's API for MongoDB account
Some of your applications are connecting to your upgraded Azure Cosmos DB's API for MongoDB account using the legacy 3.2 endpoint - [accountname].documents.azure.com. Use the new endpoint - [accountname].mongo.cosmos.azure.com (or its equivalent in sovereign, government, or restricted clouds).
Potential benefits: Take advantage of the latest features in version 3.6+ of Azure Cosmos DB's API for MongoDB
For More information, see Azure Cosmos DB for MongoDB (4.0 server version): supported features and syntax
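The endpoint change above amounts to swapping the host suffix. Here is a minimal Python sketch, assuming the public-cloud suffixes named in the recommendation. The function name to_new_endpoint is hypothetical, and sovereign, government, or restricted clouds use different suffixes that this sketch doesn't cover.

```python
LEGACY_SUFFIX = ".documents.azure.com"
NEW_SUFFIX = ".mongo.cosmos.azure.com"


def to_new_endpoint(host: str) -> str:
    """Map a legacy 3.2 host name to the 3.6+ host name; leave other hosts unchanged."""
    if host.endswith(LEGACY_SUFFIX):
        return host[: -len(LEGACY_SUFFIX)] + NEW_SUFFIX
    return host
```

For example, to_new_endpoint("myaccount.documents.azure.com") yields "myaccount.mongo.cosmos.azure.com". A host that already uses the new suffix is returned as-is.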
Upgrade your Azure Cosmos DB API for MongoDB account to v4.2 to save on query/storage costs and utilize new features
Your Azure Cosmos DB API for MongoDB account is eligible to upgrade to version 4.2. Upgrading to v4.2 can reduce your storage costs by up to 55% and your query costs by up to 45% by leveraging a new storage format. Numerous additional features such as multi-document transactions are also included in v4.2.
Potential benefits: Improved reliability, query/storage efficiency, performance, and new feature capabilities
For More information, see Upgrade the API version of your Azure Cosmos DB for MongoDB account
Enable Server Side Retry (SSR) on your Azure Cosmos DB's API for MongoDB account
When an account is throwing a TooManyRequests error with the 16500 error code, enabling Server Side Retry (SSR) can help mitigate the issue.
Potential benefits: Prevent throttling and improve your query reliability and performance
Add a second region to your production workloads on Azure Cosmos DB
Production workloads on Azure Cosmos DB that run in a single region might have availability issues. This appears to be the case with some of your Cosmos DB accounts. Increase their availability by configuring them to span at least two Azure regions. NOTE: Additional regions incur additional costs.
Potential benefits: Improve the availability of your production workloads
For More information, see High availability (Reliability) in Azure Cosmos DB for NoSQL
Upgrade old Azure Cosmos DB SDK to the latest version
An Azure Cosmos DB account using an old version of the SDK lacks the latest fixes and improvements. Your Azure Cosmos DB account is using an old version of the SDK. For the latest fixes, performance improvements, and new feature capabilities, upgrade to the latest version.
Potential benefits: Improved reliability, performance, and new feature capabilities
For More information, see Azure Cosmos DB documentation
Upgrade outdated Azure Cosmos DB SDK to the latest version
An Azure Cosmos DB account using an old version of the SDK lacks the latest fixes and improvements. Your Azure Cosmos DB account is using an outdated version of the SDK. We recommend upgrading to the latest version for the latest fixes, performance improvements, and new feature capabilities.
Potential benefits: Improved reliability, performance, and new feature capabilities
For More information, see Azure Cosmos DB documentation
Enable service managed failover for Cosmos DB account
Enable service managed failover for Cosmos DB account to ensure high availability of the account. Service managed failover automatically switches the write region to the secondary region in case of a primary region outage. This ensures that the application continues to function without any downtime.
Potential benefits: Azure's Service-Managed Failover feature enhances system availability by automating failover processes, reducing downtime, and improving resilience.
For More information, see High availability (Reliability) in Azure Cosmos DB for NoSQL
Enable HA for your Production workload
Many clusters with consistent workloads don't have high availability (HA) enabled. We recommend activating HA from the Scale page in the Azure portal to prevent database downtime in case of unexpected node failures and to qualify for SLA guarantees.
Potential benefits: Activate HA to avoid database downtime in case of an unexpected node failure
For More information, see Scaling and configuring Your Azure Cosmos DB for MongoDB vCore cluster
Enable zone redundancy for multi-region Cosmos DB accounts
This recommendation suggests enabling zone redundancy for multi-region Cosmos DB accounts to improve high availability and reduce the risk of data loss in case of a regional outage.
Potential benefits: Improved high availability and reduced risk of data loss
For More information, see High availability (Reliability) in Azure Cosmos DB for NoSQL
Add at least one data center in another Azure region
Your Azure Managed Instance for Apache Cassandra cluster is designated as a production cluster but is currently deployed in a single Azure region. For production clusters, we recommend adding at least one more data center in another Azure region to guard against regional disasters.
Potential benefits: Ensure applications have another region in case of disaster recovery
For More information, see Best practices for high availability and disaster recovery
Avoid being rate limited for Control Plane operation
We found a high number of control plane operations on your account through the resource provider. Requests that exceed the documented limits at sustained levels over consecutive 5-minute periods might be throttled, and operations on Azure Cosmos DB resources might fail or remain incomplete.
Potential benefits: Optimize control plane operation and avoid operation failure due to rate limiting
For More information, see Azure Cosmos DB service quotas
Azure Data Explorer
Resolve virtual network issues
Service failed to install or resume due to virtual network (VNet) issues. To resolve this issue, follow the steps in the troubleshooting guide.
Potential benefits: Improve reliability, availability, performance, and new feature capabilities
For More information, see Troubleshoot access, ingestion, and operation of your Azure Data Explorer cluster in your virtual network
Add subnet delegation for 'Microsoft.Kusto/clusters'
If a subnet isn’t delegated, the associated Azure service won’t be able to operate within it. Your subnet doesn’t have the required delegation. Delegate your subnet for 'Microsoft.Kusto/clusters'.
Potential benefits: Improve reliability, availability, performance, and new feature capabilities
For More information, see What is subnet delegation?
Azure Database for MySQL
High Availability - Add primary key to the table that currently doesn't have one.
Our internal monitoring system has identified significant replication lag on the High Availability standby server. This lag is primarily caused by the standby server replaying relay logs on a table that lacks a primary key. To address this issue and adhere to best practices, it's recommended to add primary keys to all tables. Once this is done, proceed to disable and then re-enable High Availability to mitigate the problem.
Potential benefits: By implementing this approach, the standby server will be shielded from the adverse effects of high replication lag caused by the absence of a primary key on any table. This approach can contribute to reduced failover times, ultimately supporting the goal of maintaining business continuity.
For More information, see Troubleshoot replication latency in Azure Database for MySQL - Flexible Server
Replication - Add a primary key to the table that currently doesn't have one
Our internal monitoring observed significant replication lag on your replica server because the replica server is replaying relay logs on a table that lacks a primary key. To ensure that the replica server can effectively synchronize with the primary and keep up with changes, add primary keys to the tables in the primary server and then recreate the replica server.
Potential benefits: By implementing this approach, the replica server will achieve a state of close synchronization with the primary server.
For More information, see Troubleshoot replication latency in Azure Database for MySQL - Flexible Server
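Both MySQL recommendations above depend on finding the tables that lack a primary key. Here is a minimal sketch of one way to list them, assuming a standard Python DB-API cursor connected to the server. The function name and the excluded system schemas are assumptions made for illustration.

```python
# Lists base tables that have no PRIMARY KEY constraint, skipping system schemas.
TABLES_WITHOUT_PK_SQL = """
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
 AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL
  AND t.table_schema NOT IN
      ('mysql', 'information_schema', 'performance_schema', 'sys')
"""


def tables_without_primary_key(cursor):
    """Run the query on a DB-API cursor and return (schema, table) pairs."""
    cursor.execute(TABLES_WITHOUT_PK_SQL)
    return [(row[0], row[1]) for row in cursor.fetchall()]
```

Each returned pair is a candidate for an added primary key before you re-enable High Availability or recreate the replica.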
Azure Database for PostgreSQL
Remove inactive logical replication slots (important)
Inactive logical replication slots can result in degraded server performance and unavailability due to write ahead log (WAL) file retention and buildup of snapshot files. Your Azure Database for PostgreSQL flexible server might have inactive logical replication slots. THIS NEEDS IMMEDIATE ATTENTION. Either delete the inactive replication slots, or start consuming the changes from these slots, so that the slots' Log Sequence Number (LSN) advances and is close to the current LSN of the server.
Potential benefits: Improve PostgreSQL availability by removing inactive logical replication slots
For More information, see Logical replication and logical decoding in Azure Database for PostgreSQL - Flexible Server
Remove inactive logical replication slots
When an Azure Database for PostgreSQL flexible server has inactive logical replication slots, degraded server performance and unavailability due to write ahead log (WAL) file retention and buildup of snapshot files might occur. THIS NEEDS IMMEDIATE ATTENTION. Either delete the inactive replication slots, or start consuming the changes from these slots, so that the slots' Log Sequence Number (LSN) advances and is close to the current LSN of the server.
Potential benefits: Improve PostgreSQL availability by removing inactive logical replication slots
For More information, see Logical decoding
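The cleanup described in the two recommendations above can be sketched as follows. The query reads the pg_replication_slots system view, and the helper builds one pg_drop_replication_slot call per slot name. The helper name drop_statements is hypothetical. Review the generated statements before running them, because dropping a slot that a consumer still needs discards its position.

```python
# Finds logical replication slots that no consumer is currently attached to.
INACTIVE_SLOTS_SQL = (
    "SELECT slot_name FROM pg_replication_slots "
    "WHERE slot_type = 'logical' AND NOT active"
)


def drop_statements(slot_names):
    """Build one pg_drop_replication_slot call per inactive slot name."""
    statements = []
    for name in slot_names:
        escaped = name.replace("'", "''")  # escape single quotes for a SQL literal
        statements.append(f"SELECT pg_drop_replication_slot('{escaped}');")
    return statements
```

If a slot is still needed, restart its consumer so that the slot's LSN advances instead of dropping it.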
Configure geo redundant backup storage
Configure GRS to ensure that your database meets its availability and durability targets even in the face of failures or disasters.
Potential benefits: Ensures recovery from regional failure or disaster.
For More information, see Backup and restore in Azure Database for PostgreSQL - Flexible Server
Define custom maintenance windows to occur during low-peak hours
When specifying preferences for the maintenance schedule, you can pick a day of the week and a time window. If you don't specify, the system will pick times between 11pm and 7am in your server's region time. Pick a day and time where usage is low.
Potential benefits: Configuring a maintenance window lets you avoid maintenance during peak system usage.
For More information, see Scheduled maintenance in Azure Database for PostgreSQL - Flexible Server
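The default window above (11pm to 7am in the server's region time) wraps past midnight, which is easy to get wrong when you compare it with your own peak hours. Here is a small sketch of that check. The function name in_window and the choice to treat the end time as exclusive are assumptions.

```python
from datetime import time

DEFAULT_START = time(23, 0)  # 11pm, server-region time
DEFAULT_END = time(7, 0)     # 7am, server-region time


def in_window(t: time, start: time = DEFAULT_START, end: time = DEFAULT_END) -> bool:
    """Return True if t falls inside the window, including windows that cross midnight."""
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, so it covers late evening and early morning.
    return t >= start or t < end
```

If your workload peaks at a time for which in_window returns True, a custom maintenance window on a quieter day and hour is worth defining.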
Azure IoT Hub
Upgrade IoT Edge device runtime to a supported version for IoT Hub
When Edge devices use outdated versions, performance degradation might occur. We recommend you upgrade to the latest supported version of the Azure IoT Edge runtime.
Potential benefits: Ensure business continuity with latest supported version for your Edge devices
For More information, see Update IoT Edge
Upgrade device client SDK to a supported version for IoT Hub
When devices use an outdated SDK, performance degradation can occur. Some or all of your devices are using an outdated SDK. We recommend you upgrade to a supported SDK version.
Potential benefits: Ensure business continuity with supported SDK for your devices
For More information, see Azure IoT Hub SDKs
IoT Hub Potential Device Storm Detected
A device storm occurs when two or more devices try to connect to the IoT Hub using the same device ID credentials. When the second device (B) connects, it causes the first one (A) to become disconnected. Then (A) attempts to reconnect, which causes (B) to get disconnected.
Potential benefits: Improve connectivity of your devices
For More information, see Understand and resolve Azure IoT Hub errors
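The connect and disconnect loop described above shows up as one device ID connecting many times in a short period. The following sketch flags such IDs from a list of connect events. The event format, the 60-second window, and the threshold of four connects are all hypothetical choices for illustration. This isn't an IoT Hub API.

```python
from collections import defaultdict


def suspected_storm_devices(connect_events, window_seconds=60, threshold=4):
    """Return device IDs with at least `threshold` connects inside any time window.

    connect_events is an iterable of (timestamp_in_seconds, device_id) pairs.
    """
    by_device = defaultdict(list)
    for timestamp, device_id in connect_events:
        by_device[device_id].append(timestamp)

    suspects = set()
    for device_id, times in by_device.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Advance the window start until the span fits in window_seconds.
            while times[end] - times[start] > window_seconds:
                start += 1
            if end - start + 1 >= threshold:
                suspects.add(device_id)
                break
    return sorted(suspects)
```

A flagged ID is only a hint. Confirm whether two physical devices share the same credentials before you change anything.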
Upgrade Device Update for IoT Hub SDK to a supported version
When a Device Update for IoT Hub instance uses an outdated version of the SDK, it doesn't get the latest upgrades. For the latest fixes, performance improvements, and new feature capabilities, upgrade to the latest Device Update for IoT Hub SDK version.
Potential benefits: Ensure business continuity with supported SDK
For More information, see What is Device Update for IoT Hub?
Add IoT Hub units or increase SKU level
When an IoT Hub exceeds its daily message quota, operation and cost problems might occur. To ensure smooth operation in the future, add units or increase the SKU level.
Potential benefits: The IoT Hub can receive messages again.
For More information, see Understand and resolve Azure IoT Hub errors
Azure Kubernetes Service (AKS)
Enable Autoscaling for your system node pools
To ensure your system pods are scheduled even during times of high load, enable autoscaling on your system node pool.
Potential benefits: Enabling Autoscaler for system node pool ensures system pods are scheduled and cluster can function.
For More information, see Use the cluster autoscaler in Azure Kubernetes Service (AKS)
Have at least 2 nodes in your system node pool
Ensure your system node pools have at least 2 nodes for reliability of your system pods. With a single node, your cluster can fail in the event of a node or hardware failure.
Potential benefits: Having 2 nodes ensures resiliency against node failures.
For More information, see Manage system node pools in Azure Kubernetes Service (AKS)
Create a dedicated system node pool
A cluster without a dedicated system node pool is less reliable. We recommend you dedicate system node pools to only serve critical system pods, preventing resource starvation between system and competing user pods. Enforce this behavior with the CriticalAddonsOnly=true:NoSchedule taint on the pool.
Potential benefits: Ensures cluster reliability by preventing resource scarcity for core system pods
For More information, see Manage system node pools in Azure Kubernetes Service (AKS)
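To illustrate how the CriticalAddonsOnly=true:NoSchedule taint keeps user pods off a system node pool, here is a simplified sketch of taint parsing and toleration matching. It follows the general Kubernetes matching rules, but the function names and the dictionary-based toleration format are assumptions. It isn't the scheduler's implementation.

```python
from typing import NamedTuple


class Taint(NamedTuple):
    key: str
    value: str
    effect: str


def parse_taint(spec: str) -> Taint:
    """Parse a 'key=value:effect' taint string."""
    key_value, effect = spec.rsplit(":", 1)
    key, _, value = key_value.partition("=")
    return Taint(key, value, effect)


def tolerates(toleration: dict, taint: Taint) -> bool:
    """Return True if a single toleration matches the taint."""
    if toleration.get("effect") and toleration["effect"] != taint.effect:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return toleration.get("key") in (None, "", taint.key)
    return (
        toleration.get("key") == taint.key
        and toleration.get("value", "") == taint.value
    )


system_taint = parse_taint("CriticalAddonsOnly=true:NoSchedule")
```

A pod that carries a toleration such as {"key": "CriticalAddonsOnly", "operator": "Exists"} matches system_taint, while a user pod with no tolerations doesn't and so isn't scheduled onto the tainted pool.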
Ensure B-series virtual machines (VMs) aren't used in production environments
When a cluster has one or more node pools using a non-recommended burstable VM SKU, the full 100% vCPU capability isn't guaranteed. Ensure B-series VMs aren't used in production environments.
Potential benefits: Best practice for consistent performance
For More information, see Bv1 sizes series
Azure NetApp Files
Configure AD DS Site for Azure NetApp Files AD Connector
If Azure NetApp Files can't reach assigned AD DS site domain controllers, the domain controller discovery process queries all domain controllers. Unreachable domain controllers may be used, causing issues with volume creation, client queries, authentication, and AD connection modifications.
Potential benefits: Optimize DNS Connectivity with Azure NetApp Files
For More information, see Understand guidelines for Active Directory Domain Services site design and planning for Azure NetApp Files
Ensure Roles assigned to Microsoft.NetApp Delegated Subnet has Subnet Read Permissions
Roles that are required for the management of Azure NetApp Files resources must have "Microsoft.Network/virtualNetworks/subnets/read" permissions on the subnet that is delegated to Microsoft.NetApp. If the role, whether custom or built-in, doesn't have this permission, volume creation fails.
Potential benefits: Prevent volume creation failures by ensuring subnet/read permissions
Review SAP configuration for timeout values used with Azure NetApp Files
High availability of SAP while used with Azure NetApp Files relies on setting proper timeout values to prevent disruption to your application. Review the 'Learn more' link to ensure your configuration meets the timeout values as noted in the documentation.
Potential benefits: Improve resiliency of SAP Application on ANF
For More information, see Use Azure to host and run SAP workload scenarios
Implement disaster recovery strategies for your Azure NetApp Files resources
To avoid data or functionality loss during a regional or zonal disaster, implement common disaster recovery techniques such as cross region replication or cross zone replication for your Azure NetApp Files volumes.
Potential benefits: Manage disaster recovery easily with Azure NetApp Files replication features
For More information, see Understand data protection and disaster recovery options in Azure NetApp Files
Azure NetApp Files - Enable Continuous Availability for SMB Volumes
We recommend enabling Continuous Availability on Server Message Block (SMB) volumes in Azure NetApp Files.
Potential benefits: Prevent application disruptions by enabling Continuous Availability for SMB volumes
For More information, see Enable Continuous Availability on existing SMB volumes
Azure Site Recovery
Enable soft delete for your Recovery Services vaults
Soft delete helps you retain your backup data in the Recovery Services vault for an additional duration after deletion, giving you an opportunity to retrieve it before it's permanently deleted.
Potential benefits: Helps recovery of backup data in cases of accidental deletion
For More information, see Soft delete for Azure Backup
Enable Cross Region Restore for your Recovery Services vault
Cross Region Restore (CRR) allows you to restore Azure VMs in a secondary region (an Azure paired region), helping with disaster recovery.
Potential benefits: As one of the restore options, Cross Region Restore (CRR) allows you to restore Azure VMs in a secondary region, which is an Azure paired region.
For More information, see How to restore Azure VM data in Azure portal
Azure Spring Apps
Upgrade Application Configuration Service to Gen 2
You're still using Application Configuration Service Gen1, which reaches end of support by April 2024. Application Configuration Service Gen2 provides better performance than Gen1, and the upgrade from Gen1 to Gen2 has zero downtime, so we recommend that you upgrade as soon as possible.
Potential benefits: Higher stability and availability
For More information, see Use Application Configuration Service for Tanzu
Azure SQL Database
Enable cross region disaster recovery for SQL Database
Enable cross region disaster recovery for Azure SQL Database for business continuity in the event of regional outage.
Potential benefits: Enabling disaster recovery creates a continuously synchronized readable secondary database for a primary database.
For More information, see Overview of business continuity with Azure SQL Database
Enable zone redundancy for Azure SQL Database to achieve high availability and resiliency.
To achieve high availability and resiliency, enable zone redundancy for the SQL database or elastic pool to use availability zones and ensure the database or elastic pool is resilient to zonal failures.
Potential benefits: Enabling zone redundancy ensures Azure SQL Database is resilient to zonal hardware and software failures and the recovery is transparent to applications.
For More information, see Availability through redundancy - Azure SQL Database
Azure Stack HCI
Upgrade to the latest version of AKS enabled by Arc
Upgrade to the latest version of API/SDK of AKS enabled by Azure Arc for new functionality and improved stability.
Potential benefits: The latest version of AKS enabled by Azure Arc with new functionality and improved stability.
For More information, see https://azure.github.io/azure-sdk/releases/latest/index.html
Classic deployment model storage
Action required: Migrate classic storage accounts by 8/30/2024.
Migrate your classic storage accounts to Azure Resource Manager to ensure business continuity. Azure Resource Manager will provide all of the same functionality plus a consistent management layer, resource grouping, and access to new features and updates.
Potential benefits: Ensure the ability to manage your data by migrating your classic storage account(s)
Classic deployment model virtual machine
Migrate off Cloud Services (classic) before 31 August 2024
Cloud Services (classic) is retiring. To avoid any loss of data or business continuity, migrate off before 31 Aug 2024.
Potential benefits: Continuity of your service
For More information, see Migrate Azure Cloud Services (classic) to Azure Cloud Services (extended support)
Cognitive Services
Upgrade your application to use the latest API version from Azure OpenAI
An Azure OpenAI resource with an older API version lacks the latest features and functionalities. We recommend that you use the latest REST API version.
Potential benefits: Our new API versions contain the latest and greatest features and capabilities.
For More information, see Azure OpenAI Service REST API reference
Quota exceeded for this resource, wait or upgrade to unblock
If the quota for your resource is exceeded, your resource becomes blocked. You can wait for the quota to be replenished automatically, or, to use the resource again now, upgrade it to a paid SKU.
Potential benefits: If you upgrade to a paid SKU you can use the resource again today.
For More information, see Plan and manage costs for Azure AI Studio
Container Registry
Use Premium tier for critical production workloads
Premium registries provide the highest amount of included storage, concurrent operations and network bandwidth, enabling high-volume scenarios. The Premium tier also adds features such as geo-replication, availability zone support, content-trust, customer-managed keys and private endpoints.
Potential benefits: The Premium tier provides the highest amount of performance, scale and resiliency options
For More information, see Azure Container Registry service tiers
Ensure Geo-replication is enabled for resilience
Geo-replication enables workloads to use a single image, tag and registry name across regions, provides network-close registry access, reduced data transfer costs and regional Registry resilience if a regional outage occurs. This feature is only available in the Premium service tier.
Potential benefits: Improved resilience and pull performance, simplified registry management and reduced data transfer costs
For More information, see Geo-replication in Azure Container Registry
Content Delivery Network
Azure CDN From Edgio, Managed Certificate Renewal Unsuccessful. Additional Validation Required.
Azure CDN from Edgio employs CNAME delegation to renew certificates with DigiCert for managed certificate renewals. It's essential that Custom Domains resolve to an azureedge.net endpoint for the automatic renewal process with DigiCert to be successful. Ensure your Custom Domain's CNAME and CAA records are configured correctly. Should you require further assistance, please submit a support case to Azure to re-attempt the renewal request.
Potential benefits: Ensure service availability.
Renew the expired Azure Front Door customer certificate to avoid service disruption
When customer certificates for Azure Front Door Standard and Premium profiles expire, you might have service disruptions. To avoid service disruption, renew the certificate before it expires.
Potential benefits: Ensure service availability.
For More information, see Configure HTTPS on an Azure Front Door custom domain by using the Azure portal
Re-validate domain ownership for the Azure Front Door managed certificate renewal
Azure Front Door (AFD) can't automatically renew the managed certificate because the domain isn't CNAME mapped to an AFD endpoint. For the managed certificate to be automatically renewed, revalidate domain ownership.
For More information, see Configure a custom domain on Azure Front Door by using the Azure portal
Switch Secret version to 'Latest' for the Azure Front Door customer certificate
Configure the Azure Front Door (AFD) customer certificate secret to 'Latest' so that AFD refers to the latest secret version in Azure Key Vault and the secret can be rotated automatically.
Potential benefits: The 'Latest' version can be automatically rotated.
For More information, see Configure HTTPS on an Azure Front Door custom domain by using the Azure portal
Validate domain ownership by adding DNS TXT record to DNS provider
Validate domain ownership by adding the DNS TXT record to your DNS provider. Validating domain ownership through TXT records enhances security and ensures proper control over your domain.
Potential benefits: Ensure service availability.
For More information, see Configure a custom domain on Azure Front Door by using the Azure portal
Migrate away from Azure CDN from Edgio by January 15, 2025
Migrate from Azure CDN Standard/Premium by Edgio before 15 January 2025 when the Edgio platform is scheduled to shut down. It's recommended to move to Azure Front Door for compatibility. Alternatively, consider using Azure Traffic Manager or Akamai CDN available in the Azure Marketplace.
Potential benefits: Avoid downtime and ensure business continuity.
For More information, see Azure updates
Data Factory
Implement BCDR strategy for cross region redundancy in Azure Data Factory
Implementing a business continuity and disaster recovery (BCDR) strategy improves high availability and reduces the risk of data loss.
Potential benefits: Improves high availability and reduces the risk of data loss
For More information, see BCDR for Azure Data Factory and Azure Synapse Analytics pipelines - Azure Architecture Center
Enable auto upgrade on your SHIR
Auto-upgrade of the self-hosted integration runtime (SHIR) is disabled, so you aren't getting the latest changes and bug fixes for the self-hosted integration runtime. Review your settings and enable SHIR auto-upgrade.
Potential benefits: To get the latest changes and bug fixes on the Self-Hosted Integration runtime
For More information, see Self-hosted integration runtime autoupdate and expire notification
Fluid Relay
Azure Fluid Relay client library should be upgraded
If the Azure Fluid Relay service is invoked with an old client library, it might cause application problems. To ensure your application remains operational, upgrade your Azure Fluid Relay client library to the latest version. Upgrading provides the most up-to-date functionality and enhancements in performance and stability.
Potential benefits: Improved reliability
For More information, see Version compatibility with Fluid Framework releases
HDInsight
Apply critical updates by dropping and recreating your HDInsight clusters (certificate rotation round 2)
The HDInsight service attempted to apply a critical certificate update on your running clusters. However, due to some custom configuration changes, we're unable to apply the updates on all clusters. To prevent those clusters from becoming unhealthy and unusable, drop and recreate your clusters.
Potential benefits: Ensure cluster health and stability
For More information, see Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more
Non-ESP ABFS clusters [Cluster Permissions for World Readable]
We plan to introduce a change in non-ESP ABFS clusters that restricts non-Hadoop group users from running Hadoop commands for storage operations. This change improves the cluster security posture. Customers need to plan for the updates before September 30, 2023.
Potential benefits: This change is to improve cluster security posture
For More information, see Azure HDInsight release notes
Restart brokers on your Kafka Cluster Disks
When data disks used by Kafka brokers in HDInsight clusters are almost full, the Apache Kafka broker process can't start and fails. To mitigate, find the retention time for every topic, back up the files that are older, and restart the brokers.
Potential benefits: Avoid Kafka broker issues
For More information, see Scenario: Brokers are unhealthy or can't restart due to disk space full issue
Cluster Name length update
The max length of cluster name will be changed to 45 from 59 characters, to improve the security posture of clusters. This change will be implemented by September 30th, 2023.
Potential benefits: Security posture improvement for HDInsight
For More information, see Azure HDInsight release notes
Upgrade your cluster to the latest HDInsight image
A cluster created one year ago doesn't have the latest image upgrades. Your cluster was created 1 year ago. As part of the best practices, we recommend you use the latest HDInsight images for the best open source updates, Azure updates, and security fixes. The recommended maximum duration for cluster upgrades is less than six months.
Potential benefits: Get the latest fixes and features
For More information, see Consider the below points before starting to create a cluster.
Upgrade your HDInsight Cluster
A cluster not using the latest image doesn't have the latest upgrades. Your cluster isn't using the latest image. We recommend you use the latest versions of HDInsight images for the best of open source updates, Azure updates, and security fixes. HDInsight releases happen every 30 to 60 days.
Potential benefits: Get the latest fixes and features
For More information, see Azure HDInsight release notes
Gateway or virtual machine not reachable
We detected a network probe failure, which indicates an unreachable gateway or virtual machine. Verify the availability of all cluster hosts. Restart the virtual machine to recover. If you need further assistance, contact Azure support.
Potential benefits: Improved availability
VM agent is 9.9.9.9. Upgrade the cluster.
Our records indicate that one or more of your clusters are using images dated February 2022 or older (image versions 2202xxxxxx or older). There is a potential reliability issue on HDInsight clusters that use images dated February 2022 or older. Consider rebuilding your clusters with the latest image.
Potential benefits: Improved Reliability in Scaling and Network connectivity
Media Services
Increase Media Services quotas or limits
When a media account hits its quota limits, disruption of service might occur. To avoid any disruption of service, review current usage of assets, content key policies, and stream policies and increase quota limits for the entities that are close to hitting the limit. You can request quota limits be increased by opening a ticket and adding relevant details. TIP: Don't create additional Azure Media accounts in an attempt to obtain higher limits.
Potential benefits: Avoid any disruption to service due to customer exceeding quota limits.
For More information, see Azure Media Services quotas and limits
Service Bus
Use Service Bus premium tier for improved resilience
When running critical applications, the Service Bus premium tier offers better resource isolation at the CPU and memory level, enhancing availability. It also supports Geo-disaster recovery feature enabling easier recovery from regional disasters without having to change application configurations.
Potential benefits: Service Bus premium tier offers better resiliency with CPU and memory resource isolation as well as Geo-disaster recovery
For More information, see Service Bus premium messaging tier
Use Service Bus autoscaling feature in the premium tier for improved resilience
When running critical applications, enabling the autoscale feature allows you to have enough capacity to handle the load on your application. Having the right amount of resources running can reduce throttling and provide a better user experience.
Potential benefits: Enabling autoscale protects users from capacity constraints
For More information, see Automatically update messaging units of an Azure Service Bus namespace
SQL Server on Azure Virtual Machines
Enable Azure Backup for SQL on your virtual machines
For the benefits of zero-infrastructure backup, point-in-time restore, and central management with SQL AG integration, enable backups for SQL databases on your virtual machines using Azure Backup.
Potential benefits: SQL aware backups with no-infra for backup, centralized management, AG integration and point-in-time restore
For More information, see About SQL Server Backup in Azure VMs
Storage
Use Managed Disks for storage accounts reaching capacity limit
When Premium SSD unmanaged disks in storage accounts are about to reach their Premium Storage capacity limit, failures might occur. To avoid failures when this limit is reached, migrate to Managed Disks that don't have an account capacity limit. This migration can be done through the portal in less than 5 minutes.
Potential benefits: Avoid scale issues when account reaches capacity limit
For More information, see Scalability and performance targets for standard storage accounts
Configure blob backup
Azure blob backup helps protect data from accidental or malicious deletion. We recommend that you configure blob backup.
Potential benefits: Protect data from accidental or malicious deletion
For More information, see Overview of Azure Blob backup
Subscriptions
Turn on Azure Backup to get simple, reliable, and cost-effective protection for your data
Keep your information and applications safe with robust, one click backup from Azure. Activate Azure Backup to get cost-effective protection for a wide range of workloads including VMs, SQL databases, applications, and file shares.
Potential benefits: Ensure your business-critical applications stay protected
For More information, see Azure Backup Documentation - Azure Backup
Create an Azure Service Health alert
Azure Service Health alerts keep you informed about issues and advisories in four areas (Service issues, Planned maintenance, Security and Health advisories). These alerts are personalized to notify you about disruptions or potential impacts on your chosen Azure regions and services.
Potential benefits: Stay informed about issues and advisories across 4 areas (Service issues, Planned maintenance, Security advisories and Health advisories)
For More information, see Create activity log alerts on service notifications using the Azure portal
Virtual Machines
Improve data reliability by using Managed Disks
Virtual machines in an Availability Set with disks that share either storage accounts or storage scale units aren't resilient to single storage scale unit failures during outages. Migrate to Azure Managed Disks to ensure that the disks of different VMs in the Availability Set are sufficiently isolated to avoid a single point of failure.
Potential benefits: Ensure business continuity through data resilience
For More information, see https://aka.ms/aa_avset_manageddisk_learnmore
Enable virtual machine replication to protect your applications from regional outage
Virtual machines are resilient to regional outages when replication to another region is enabled. To reduce adverse business impact during an Azure region outage, we recommend enabling replication of all business-critical virtual machines.
Potential benefits: Ensure business continuity in case of any Azure region outage
For More information, see Quickstart: Set up disaster recovery to a secondary Azure region for an Azure VM
Update your outbound connectivity protocol to Service Tags for Azure Site Recovery
IP address-based allowlisting is a vulnerable way to control outbound connectivity for firewalls. Service Tags are a good alternative. We highly recommend the use of Service Tags to allow connectivity to Azure Site Recovery services for the machines.
Potential benefits: Ensures better security, stability and resiliency than hard coded IP Addresses
For More information, see About networking in Azure VM disaster recovery
Upgrade the standard disks attached to your premium-capable VM to premium disks
Using Standard SSD disks with premium VMs may lead to suboptimal performance and latency issues. We recommend that you consider upgrading the standard disks to premium disks. For any Single Instance Virtual Machine using premium storage for all Operating System Disks and Data Disks, we guarantee Virtual Machine Connectivity of at least 99.9%. When choosing to upgrade, there are two factors to consider. The first is that upgrading requires a VM reboot, which takes 3-5 minutes to complete. The second is cost: if the VMs in the list are mission-critical production VMs, evaluate the improved availability against the cost of premium disks.
Potential benefits: Improved availability with single VM SLA available only when all disks are premium
For More information, see Azure managed disk types
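To put the 99.9% connectivity figure in concrete terms, the following minimal sketch converts an availability percentage into a downtime budget. It assumes the figure is measured over a 30-day month, which the text above does not state; the function name `allowed_downtime_minutes` is hypothetical.

```python
from decimal import Decimal


def allowed_downtime_minutes(sla_percent: str, days: int) -> Decimal:
    """Return the downtime budget, in minutes, for an SLA over a period."""
    total_minutes = Decimal(days * 24 * 60)
    # Decimal arithmetic keeps 100 - 99.9 exact instead of a float approximation.
    unavailable_fraction = (Decimal(100) - Decimal(sla_percent)) / Decimal(100)
    return total_minutes * unavailable_fraction


# Assumption: a 30-day measurement period.
print(allowed_downtime_minutes("99.9", 30))
```

Under that assumption, 99.9% leaves a budget of roughly 43 minutes per month, which is the number to weigh against the 3-5 minute reboot and the cost of premium disks.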
Upgrade VM from Premium Unmanaged Disks to Managed Disks at no additional cost
Azure Managed Disks provide higher resiliency, simplified service management, higher scale target and more choices among several disk types. Your VM is using premium unmanaged disks that can be migrated to managed disks at no additional cost through the portal in less than 5 minutes.
Potential benefits: Leverage higher resiliency and other benefits of Managed Disks
For More information, see Introduction to Azure managed disks
Upgrade your deprecated Virtual Machine image to a newer image
Virtual Machines (VMs) in your subscription are running on images scheduled for deprecation. Once the image is deprecated, new VMs can't be created from the deprecated image. To prevent disruption to your workloads, upgrade to a newer image. (VMRunningDeprecatedImage)
Potential benefits: Minimize any potential disruptions to your VM workloads
For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines
Upgrade to a newer offer of Virtual Machine image
Virtual Machines (VMs) in your subscription are running on images scheduled for deprecation. Once the image is deprecated, new VMs can't be created from the deprecated image. To prevent disruption to your workloads, upgrade to a newer image. (VMRunningDeprecatedOfferLevelImage)
Potential benefits: Minimize any potential disruptions to your VM workloads
For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines
Upgrade to a newer SKU of Virtual Machine image
Virtual Machines (VMs) in your subscription are running on images scheduled for deprecation. Once the image is deprecated, new VMs can't be created from the deprecated image. To prevent disruption to your workloads, upgrade to a newer image.
Potential benefits: Minimize any potential disruptions to your VM workloads
For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines
Upgrade your Virtual Machine Scale Set to alternative image version
VMSS in your subscription are running on images that have been scheduled for deprecation. Once the image is deprecated, your Virtual Machine Scale Set workloads would no longer scale out. To prevent disruption to your workload, upgrade to a newer version of the image.
Potential benefits: Minimize any potential disruptions to your Virtual Machine Scale Set workloads
For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines
Upgrade your Virtual Machine Scale Set to alternative image offer
VMSS in your subscription are running on images that have been scheduled for deprecation. Once the image is deprecated, your Virtual Machine Scale Set workloads would no longer scale out. To prevent disruption to your workload, upgrade to a newer offer of the image.
Potential benefits: Minimize any potential disruptions to your Virtual Machine Scale Set workloads
For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines
Upgrade your Virtual Machine Scale Set to alternative image SKU
VMSS in your subscription are running on images that have been scheduled for deprecation. Once the image is deprecated, your Virtual Machine Scale Set workloads would no longer scale out. To prevent disruption to your workload, upgrade to a newer SKU of the image.
Potential benefits: Minimize any potential disruptions to your Virtual Machine Scale Set workloads
For More information, see Deprecated Azure Marketplace images - Azure Virtual Machines
Provide access to mandatory URLs missing for your Azure Virtual Desktop environment
For a session host to deploy and register to Azure Virtual Desktop (formerly Windows Virtual Desktop, WVD) properly, you need a set of URLs in the 'allowed list' in case your VM runs in a restricted environment. For specific URLs missing from your allowed list, search your application event log for event 3702.
Potential benefits: Ensure successful deployment and session host functionality when using the Azure Virtual Desktop service
For More information, see Required FQDNs and endpoints for Azure Virtual Desktop
Align location of resource and resource group
To reduce the impact of region outages, co-locate your resources with their resource group in the same region. This way, Azure Resource Manager stores metadata related to all resources within the group in one region. By co-locating, you reduce the chance of being affected by region unavailability.
Potential benefits: Reduce write failures due to region outages
For More information, see What is Azure Resource Manager?
Use Availability zones for better resiliency and availability
Availability Zones (AZ) in Azure help protect your applications and data from datacenter failures. Each AZ is made up of one or more datacenters equipped with independent power, cooling, and networking. By designing solutions to use zonal VMs, you can isolate your VMs from failure in any other zone.
Potential benefits: Usage of zonal VMs protects your apps from a zonal outage in any other zone.
For More information, see Move Azure single instance VMs from regional to zonal target availability zones
Enable Azure Virtual Machine Scale Set (VMSS) application health monitoring
Configuring Virtual Machine Scale Set application health monitoring using the Application Health extension or load balancer health probes enables the Azure platform to improve the resiliency of your application by responding to changes in application health.
Potential benefits: Increase resiliency by exposing application health to Azure
For More information, see Using Application Health extension with Virtual Machine Scale Sets
Enable Backups on your Virtual Machines
Secure your data by enabling backups for your virtual machines.
Potential benefits: Protection of your Virtual Machines
For More information, see What is the Azure Backup service?
Enable automatic repair policy on Azure Virtual Machine Scale Sets (VMSS)
Enabling automatic instance repairs helps achieve high availability by maintaining a set of healthy instances. If an unhealthy instance is found by the Application Health extension or load balancer health probe, automatic instance repairs attempt to recover the instance by triggering repair actions.
Potential benefits: Increase resiliency by automating repair of failed instances
For More information, see Automatic instance repairs for Azure Virtual Machine Scale Sets
Configure Virtual Machine Scale Set automated scaling by metrics
Optimize resource utilization, reduce costs, and enhance application performance with custom autoscale based on a metric. Automatically add Virtual Machine instances based on real-time metrics such as CPU, memory, and disk operations. Ensure high availability while maintaining cost-efficiency.
Potential benefits: Ensures high availability while maintaining cost-efficiency
For More information, see Overview of autoscale with Azure Virtual Machine Scale Sets
Use Azure Disks with Zone Redundant Storage (ZRS) for higher resiliency and availability
Azure Disks with ZRS provide synchronous replication of data across three Availability Zones in a region, making the disk tolerant to zonal failures without disruptions to applications. For higher resiliency and availability, migrate disks from LRS to ZRS.
Potential benefits: By designing your applications to use ZRS Disks, your data is replicated across 3 Availability Zones, making your disk resilient to a zonal outage
For More information, see Convert a disk from LRS to ZRS
DNS Servers should be configured at the Virtual Network level
Set the DNS Servers for the VM at the Virtual Network level to ensure consistency throughout the environment. In the configuration of the primary network interface, DNS Servers setting should be set to Inherit from virtual network.
Potential benefits: Ensures consistency and reliable name resolution
For More information, see Name resolution for resources in Azure virtual networks
Migrate to Virtual Machine Scale Sets Flex
Migrate workloads from virtual machines (VMs) to Virtual Machine Scale Sets Flex for deployment across zones or within the same zone across different fault domains. The platform plans to deprecate availability sets.
Potential benefits: Availability across zones or across different fault domains
For More information, see Migrate deployments and resources to Virtual Machine Scale Sets in Flexible orchestration
Workloads
Configure an Always On availability group for Multi-purpose SQL servers (MPSQL)
MPSQL servers with an Always On availability group have better availability. Your MPSQL servers aren't configured as part of an Always On availability group in the shared infrastructure in your Epic system. Always On availability groups improve database availability and resource use.
Potential benefits: Improved Database availability and resource use
For More information, see What is an Always On availability group?
Configure Local Host Cache on Citrix VDI servers to ensure seamless connection brokering operations
We have observed that your Citrix VDI servers aren't configured with Local Host Cache. Local Host Cache (LHC) is a feature in Citrix Virtual Apps and Desktops that allows connection brokering operations to continue when an outage occurs. LHC engages when the site database is inaccessible for 90 seconds.
Potential benefits: Seamless connection brokering operations
Deploy Hyperspace Web servers as part of a Virtual Machine Scale Set Flex configured for 3 zones
We have observed that your Hyperspace Web servers in the Virtual Machine Scale Set Flex setup aren't spread across 3 zones in the selected region. For services like Hyperspace Web in Epic systems that require high availability and large scale, it's recommended that servers are deployed as part of Virtual Machine Scale Set Flex and spread across 3 zones. With Flexible orchestration, Azure provides a unified experience across the Azure VM ecosystem.
Potential benefits: High availability and on-demand large scale for Hyperspace web servers in Epic DB
For More information, see Create a Virtual Machine Scale Set that uses Availability Zones
Set the Idle timeout in Azure Load Balancer to 30 minutes for ASCS HA setup in SAP workloads
To prevent load balancer timeout, make sure that all Azure Load Balancing Rules have: 'Idle timeout (minutes)' set to the maximum value of 30 minutes. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable the setting.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Enable Floating IP in the Azure Load balancer for ASCS HA setup in SAP workloads
For port reuse and better high availability, enable floating IP in the load balancing rules for the Azure Load Balancer for HA setup of the ASCS instance in SAP workloads. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Enable HA ports in the Azure Load Balancer for ASCS HA setup in SAP workloads
For port reuse and better high availability, enable HA ports in the load balancing rules for HA setup of the ASCS instance in SAP workloads. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
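The three load balancer recommendations above (idle timeout, floating IP, HA ports) can be checked together. The following is a minimal sketch, not an official tool. It assumes a rule is available as a dictionary shaped like the `properties` object of a load-balancing rule in an exported Azure Resource Manager template, with `idleTimeoutInMinutes`, `enableFloatingIP`, and HA ports expressed as protocol `All` with frontend and backend port `0`. The function name `check_lb_rule` and the sample rule are hypothetical.

```python
def check_lb_rule(props: dict) -> list[str]:
    """Return a list of findings for one load-balancing rule's properties."""
    findings = []
    if props.get("idleTimeoutInMinutes") != 30:
        findings.append("idle timeout is not 30 minutes")
    if props.get("enableFloatingIP") is not True:
        findings.append("floating IP is not enabled")
    # Assumption: HA ports appear as protocol "All" with both ports set to 0.
    ha_ports = (
        props.get("protocol") == "All"
        and props.get("frontendPort") == 0
        and props.get("backendPort") == 0
    )
    if not ha_ports:
        findings.append("HA ports are not enabled")
    return findings


# Hypothetical rule that misses all three recommended settings.
sample_rule = {
    "protocol": "Tcp",
    "frontendPort": 3200,
    "backendPort": 3200,
    "idleTimeoutInMinutes": 4,
    "enableFloatingIP": False,
}
for finding in check_lb_rule(sample_rule):
    print(finding)
```

An empty result means the rule matches all three recommendations. Verify the property names against your own exported template before relying on the check.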
Disable TCP timestamps on VMs placed behind Azure Load Balancer in ASCS HA setup in SAP workloads
Disable TCP timestamps on VMs placed behind Azure Load Balancer. Enabling TCP timestamps causes the health probes to fail due to TCP packets being dropped by the VM's guest OS TCP stack, causing the load balancer to mark the endpoint as down.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see https://launchpad.support.sap.com/#/notes/2382421
Set the Idle timeout in Azure Load Balancer to 30 minutes for HANA DB HA setup in SAP workloads
To prevent load balancer timeout, ensure that all Azure Load Balancing Rules 'Idle timeout (minutes)' parameter is set to the maximum value of 30 minutes. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable the recommended settings.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Enable Floating IP in the Azure Load balancer for HANA DB HA setup in SAP workloads
For more flexible routing, enable floating IP in the load balancing rules for the Azure Load Balancer for HA set up of HANA DB instance in SAP workloads. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable the recommended settings.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Enable HA ports in the Azure Load Balancer for HANA DB HA setup in SAP workloads
For enhanced scalability, enable HA ports in the Load balancing rules for HA set up of HANA DB instance in SAP workloads. Open the load balancer, select 'load balancing rules' and add or edit the rule to enable the recommended settings.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Disable TCP timestamps on VMs placed behind Azure Load Balancer in HANA DB HA setup in SAP workloads
Disable TCP timestamps on VMs placed behind Azure Load Balancer. Enabling TCP timestamps causes the health probes to fail due to TCP packets dropped by the VM's guest OS TCP stack causing the load balancer to mark the endpoint as down.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see Azure Load Balancer health probes
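A quick way to see whether a VM already follows the TCP timestamps recommendation is to read the kernel setting. This sketch assumes a Linux guest where the setting is exposed as the `net.ipv4.tcp_timestamps` sysctl under `/proc/sys` and where `0` means disabled. Neither detail is stated in the recommendation, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed location of the kernel setting on a Linux guest.
TCP_TIMESTAMPS_PATH = Path("/proc/sys/net/ipv4/tcp_timestamps")


def timestamps_disabled(raw_value: str) -> bool:
    """Return True when the sysctl value means TCP timestamps are off."""
    return raw_value.strip() == "0"


def main() -> None:
    if not TCP_TIMESTAMPS_PATH.exists():
        print("setting not found; this check only applies to Linux guests")
        return
    if timestamps_disabled(TCP_TIMESTAMPS_PATH.read_text()):
        print("TCP timestamps are disabled")
    else:
        print("TCP timestamps are enabled; consider disabling them")


if __name__ == "__main__":
    main()
```

The script only reports the current value. It doesn't change the setting or make it persistent across reboots.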
Ensure that stonith is enabled for the Pacemaker configuration in ASCS HA setup in SAP workloads
In a Pacemaker cluster, the implementation of node level fencing is done using a STONITH (Shoot The Other Node in the Head) resource. To help manage failed nodes, ensure that 'stonith-enabled' is set to 'true' in the HA cluster configuration.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Set the corosync token in Pacemaker cluster to 30000 for ASCS HA setup in SAP workloads (RHEL)
The corosync token setting determines the timeout that is used directly, or as a base, for real token timeout calculation in HA clusters. To allow memory-preserving maintenance, set the corosync token to 30000 for SAP on Azure.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Set the expected votes parameter to '2' in Pacemaker configuration in ASCS HA setup in SAP workloads (RHEL)
For a two node HA cluster, set the quorum 'expected-votes' parameter to '2' as recommended for SAP on Azure to ensure a proper quorum, resilience, and data consistency.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Enable the 'concurrent-fencing' parameter in Pacemaker configuration in ASCS HA setup in SAP workloads (ConcurrentFencingHAASCSRH)
Concurrent fencing enables the fencing operations to be performed in parallel, which enhances high availability (HA), prevents split-brain scenarios, and contributes to a robust SAP deployment. Set this parameter to 'true' in the Pacemaker cluster configuration for ASCS HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Ensure that stonith is enabled for the cluster configuration in ASCS HA setup in SAP workloads
In a Pacemaker cluster, the implementation of node level fencing is done using a STONITH (Shoot The Other Node in the Head) resource. To help manage failed nodes, ensure that 'stonith-enabled' is set to 'true' in the HA cluster configuration.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the stonith timeout to 144 for the cluster configuration in ASCS HA setup in SAP workloads
The 'stonith-timeout' specifies how long the cluster waits for a STONITH action to complete. Setting it to '144' seconds allows more time for fencing actions to complete. We recommend this setting for HA clusters for SAP on Azure.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the corosync token in Pacemaker cluster to 30000 for ASCS HA setup in SAP workloads (SUSE)
The corosync token setting determines the timeout that is used directly, or as a base, for real token timeout calculation in HA clusters. To allow memory-preserving maintenance, set the corosync token to '30000' for SAP on Azure.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set 'token_retransmits_before_loss_const' to 10 in Pacemaker cluster in ASCS HA setup in SAP workloads
The corosync token_retransmits_before_loss_const determines how many token retransmits are attempted before timeout in HA clusters. For stability and reliability, set the 'totem.token_retransmits_before_loss_const' to '10' for ASCS HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the 'corosync join' in Pacemaker cluster to '60' for ASCS HA setup in SAP workloads
The 'corosync join' timeout specifies in milliseconds how long to wait for join messages in the membership protocol so when a new node joins the cluster, it has time to synchronize its state with existing nodes. Set to '60' in Pacemaker cluster configuration for ASCS HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the 'corosync consensus' in Pacemaker cluster to '36000' for ASCS HA setup in SAP workloads
The corosync 'consensus' parameter specifies in milliseconds how long to wait for consensus before starting a round of membership in the cluster configuration. Set 'consensus' in the Pacemaker cluster configuration for ASCS HA setup to 1.2 times the corosync token for reliable failover behavior.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the 'corosync max_messages' in Pacemaker cluster to '20' for ASCS HA setup in SAP workloads
The corosync 'max_messages' constant specifies the maximum number of messages that one processor can send on receipt of the token. Set it to '20' in the Pacemaker cluster configuration to allow efficient communication without overwhelming the network.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set 'expected votes' to '2' in the cluster configuration in ASCS HA setup in SAP workloads (SUSE)
For a two node HA cluster, set the quorum 'expected_votes' parameter to 2 as recommended for SAP on Azure to ensure a proper quorum, resilience, and data consistency.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the two_node parameter to 1 in the cluster configuration in ASCS HA setup in SAP workloads
For a two node HA cluster, set the quorum parameter 'two_node' to 1 as recommended for SAP on Azure.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
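The corosync recommendations above for a two-node SUSE cluster can be compared against a configuration file in one pass. The following is a minimal sketch that assumes the usual `section { key: value }` layout of `corosync.conf` and uses the value '20' from the `max_messages` recommendation title. The helper names `parse_corosync_conf` and `find_deviations` and the sample file contents are hypothetical.

```python
import re

# Values taken from the recommendations above for a two-node SUSE cluster.
RECOMMENDED = {
    "totem.token": "30000",
    "totem.token_retransmits_before_loss_const": "10",
    "totem.join": "60",
    "totem.consensus": "36000",
    "totem.max_messages": "20",
    "quorum.expected_votes": "2",
    "quorum.two_node": "1",
}


def parse_corosync_conf(text: str) -> dict[str, str]:
    """Flatten 'section { key: value }' blocks into 'section.key' entries."""
    settings: dict[str, str] = {}
    sections: list[str] = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        opening = re.fullmatch(r"(\w+)\s*\{", line)
        if opening:
            sections.append(opening.group(1))
        elif line == "}":
            if sections:
                sections.pop()
        elif ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            settings[".".join(sections + [key])] = value
    return settings


def find_deviations(text: str) -> dict:
    """Map each deviating setting to (actual value or None, recommended value)."""
    actual = parse_corosync_conf(text)
    return {
        name: (actual.get(name), wanted)
        for name, wanted in RECOMMENDED.items()
        if actual.get(name) != wanted
    }


# Hypothetical file: consensus is too low and two_node is missing.
SAMPLE_CONF = """
totem {
    version: 2
    token: 30000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 30000
    max_messages: 20
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
}
"""

for name, (actual, wanted) in find_deviations(SAMPLE_CONF).items():
    print(f"{name}: found {actual}, recommended {wanted}")
```

The parser is deliberately simple and only reports differences. Apply changes through the procedure in the linked high availability guide for your distribution.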
Enable 'concurrent-fencing' in Pacemaker ASCS HA setup in SAP workloads (ConcurrentFencingHAASCSSLE)
Concurrent fencing enables the fencing operations to be performed in parallel, which enhances HA, prevents split-brain scenarios, and contributes to a robust SAP deployment. Set this parameter to 'true' in the Pacemaker cluster configuration for ASCS HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Ensure the number of 'fence_azure_arm' instances is one in Pacemaker in HA enabled SAP workloads
If you're using Azure fence agent for fencing with either managed identity or service principal, ensure that there's one instance of fence_azure_arm (an I/O fencing agent for Azure Resource Manager) in the Pacemaker configuration for ASCS HA setup for high availability.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set stonith-timeout to 900 in Pacemaker configuration with Azure fence agent for ASCS HA setup
For reliable Pacemaker function in the ASCS HA setup, set 'stonith-timeout' to 900. This setting is applicable if you're using the Azure fence agent for fencing with either managed identity or service principal.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
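The fencing-related recommendations above give two different 'stonith-timeout' values, so a check has to know which fencing mechanism is in use. This sketch uses the Pacemaker property name `stonith-enabled` and assumes that 144 applies whenever the Azure fence agent isn't used, which the recommendations imply but don't state outright. It also assumes the property values have already been read into a plain dictionary of strings. The function names and the sample values are hypothetical.

```python
def expected_cluster_properties(uses_azure_fence_agent: bool) -> dict[str, str]:
    """Return the property values named in the SUSE ASCS recommendations."""
    return {
        "stonith-enabled": "true",
        "concurrent-fencing": "true",
        # Assumption: 900 with the Azure fence agent, otherwise 144.
        "stonith-timeout": "900" if uses_azure_fence_agent else "144",
    }


def property_findings(actual: dict[str, str], uses_azure_fence_agent: bool) -> list[str]:
    """Describe every property that is unset or differs from the expected value."""
    findings = []
    expected = expected_cluster_properties(uses_azure_fence_agent)
    for name, wanted in expected.items():
        found = actual.get(name, "<unset>")
        if found != wanted:
            findings.append(f"{name}: found {found}, recommended {wanted}")
    return findings


# Hypothetical cluster that uses the Azure fence agent.
sample = {"stonith-enabled": "true", "stonith-timeout": "144"}
for finding in property_findings(sample, uses_azure_fence_agent=True):
    print(finding)
```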
Create the softdog config file in Pacemaker configuration for ASCS HA setup in SAP workloads
The softdog timer is loaded as a kernel module in Linux OS. This timer triggers a system reset if it detects that the system has hung. Ensure that the softdog configuration file is created in the Pacemaker cluster for ASCS HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Ensure the softdog module is loaded for Pacemaker in ASCS HA setup in SAP workloads
The softdog timer is loaded as a kernel module in Linux OS. This timer triggers a system reset if it detects that the system has hung. First ensure that you created the softdog configuration file, then load the softdog module in the Pacemaker configuration for ASCS HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
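Whether the softdog module is currently loaded can be confirmed from the kernel's module list. This is a minimal sketch that assumes a Linux host exposing loaded modules in `/proc/modules`, with the module name as the first field on each line. The function name `module_is_loaded` is hypothetical.

```python
from pathlib import Path


def module_is_loaded(proc_modules_text: str, module_name: str) -> bool:
    """Check a /proc/modules listing for a module name (first field per line)."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == module_name:
            return True
    return False


def main() -> None:
    proc_modules = Path("/proc/modules")
    if not proc_modules.exists():
        print("/proc/modules not found; this check only applies to Linux")
        return
    loaded = module_is_loaded(proc_modules.read_text(), "softdog")
    print("softdog loaded" if loaded else "softdog not loaded")


if __name__ == "__main__":
    main()
```

The check covers only the current boot. The configuration file from the previous recommendation is what makes the module load again after a restart.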
Set PREFER_SITE_TAKEOVER parameter to 'true' in the Pacemaker configuration for HANA DB HA setup
The PREFER_SITE_TAKEOVER parameter in SAP HANA defines whether the HANA system replication (SR) resource agent prefers to take over the secondary instance instead of restarting the failed primary locally. For reliable function of HANA DB high availability (HA) setup, set PREFER_SITE_TAKEOVER to 'true'.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Enable stonith in the cluster configuration in HA enabled SAP workloads for VMs with Red Hat OS
In a Pacemaker cluster, the implementation of node level fencing is done using a STONITH (Shoot The Other Node in the Head) resource. To help manage failed nodes, ensure that 'stonith-enabled' is set to 'true' in the HA cluster configuration of your SAP workload.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Set the corosync token in Pacemaker cluster to 30000 for HA enabled HANA DB for VM with RHEL OS
The corosync token setting determines the timeout that is used directly, or as a base, for real token timeout calculation in HA clusters. To allow memory-preserving maintenance, set the corosync token to 30000 for SAP on Azure with Red Hat OS.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Set the expected votes parameter to '2' in HA enabled SAP workloads (RHEL)
For a two node HA cluster, set the quorum votes to '2' as recommended for SAP on Azure to ensure a proper quorum, resilience, and data consistency.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Enable the 'concurrent-fencing' parameter in the Pacemaker configuration for HANA DB HA setup
Concurrent fencing enables the fencing operations to be performed in parallel, which enhances high availability (HA), prevents split-brain scenarios, and contributes to a robust SAP deployment. Set this parameter to 'true' in the Pacemaker cluster configuration for HANA DB HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability of SAP HANA on Azure VMs on Red Hat Enterprise Linux
Set parameter PREFER_SITE_TAKEOVER to 'true' in the cluster configuration in HA enabled SAP workloads
The PREFER_SITE_TAKEOVER parameter in SAP HANA topology defines whether the HANA SR resource agent prefers to take over the secondary instance instead of restarting the failed primary locally. For reliable function of HANA DB HA setup, set it to 'true'.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Enable stonith in the cluster configuration in HA enabled SAP workloads for VMs with SUSE OS
In a Pacemaker cluster, the implementation of node level fencing is done using a STONITH (Shoot The Other Node in the Head) resource. To help manage failed nodes, ensure that 'stonith-enabled' is set to 'true' in the HA cluster configuration.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the stonith timeout to 144 for the cluster configuration in HA enabled SAP workloads
The 'stonith-timeout' specifies how long the cluster waits for a STONITH action to complete. Setting it to '144' seconds allows more time for fencing actions to complete. We recommend this setting for HA clusters for SAP on Azure.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the corosync token in Pacemaker cluster to 30000 for HA enabled HANA DB for VM with SUSE OS
The corosync token setting determines the timeout that is used directly, or as a base, for real token timeout calculation in HA clusters. To allow memory-preserving maintenance, set the corosync token to 30000 for HA enabled HANA DB for VM with SUSE OS.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set 'token_retransmits_before_loss_const' to 10 in Pacemaker cluster in HA enabled SAP workloads
The corosync token_retransmits_before_loss_const determines how many token retransmits are attempted before timeout in HA clusters. Set the totem.token_retransmits_before_loss_const to 10 as recommended for HANA DB HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the 'corosync join' in Pacemaker cluster to 60 for HA enabled HANA DB in SAP workloads
The 'corosync join' timeout specifies, in milliseconds, how long to wait for join messages in the membership protocol, so that a new node joining the cluster has time to synchronize its state with existing nodes. Set it to '60' in the Pacemaker cluster configuration for HANA DB HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the 'corosync consensus' in Pacemaker cluster to 36000 for HA enabled HANA DB in SAP workloads
The corosync 'consensus' parameter specifies, in milliseconds, how long to wait for consensus before starting a new round of membership in the cluster. For reliable failover behavior, set 'consensus' in the Pacemaker cluster configuration for HANA DB HA setup to 1.2 times the corosync token, which is 36000 for a token of 30000.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the 'corosync max_messages' in Pacemaker cluster to 20 for HA enabled HANA DB in SAP workloads
The corosync 'max_messages' constant specifies the maximum number of messages that one processor can send on receipt of the token. To allow efficient communication without overwhelming the network, set it to 20 in the Pacemaker cluster configuration.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
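The corosync totem values recommended above (token, token_retransmits_before_loss_const, join, consensus, max_messages) can be checked together. The following is a minimal sketch, not an official tool: it assumes the settings live in the 'totem' section of a corosync.conf style file, and the names parse_section, check_totem, and SAMPLE are hypothetical helpers introduced only for illustration.

```python
# Recommended totem values, taken from the recommendations above.
RECOMMENDED_TOTEM = {
    "token": 30000,
    "token_retransmits_before_loss_const": 10,
    "join": 60,
    "max_messages": 20,
}


def parse_section(conf_text, section):
    """Return the top-level 'key: value' pairs of one named section."""
    values = {}
    depth = 0
    in_section = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            depth += 1
            if depth == 1:
                in_section = line[:-1].strip() == section
            continue
        if line == "}":
            depth -= 1
            if depth == 0:
                in_section = False
            continue
        # Ignore nested blocks such as 'interface { ... }'.
        if in_section and depth == 1 and ":" in line:
            key, value = line.split(":", 1)
            values[key.strip()] = value.strip()
    return values


def check_totem(conf_text):
    """List every totem setting that differs from the recommendation."""
    totem = parse_section(conf_text, "totem")
    expected = dict(RECOMMENDED_TOTEM)
    # consensus is 1.2 times the token; integer math avoids float rounding.
    expected["consensus"] = expected["token"] * 12 // 10
    findings = []
    for key, want in expected.items():
        have = totem.get(key)
        if have is None or not have.isdigit() or int(have) != want:
            findings.append(f"{key}: found {have!r}, recommended {want}")
    return findings


# Illustrative input only; a real file contains more settings.
SAMPLE = """
totem {
    version: 2
    token: 30000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 36000
    max_messages: 20
    interface {
        ringnumber: 0
    }
}
"""

print(check_totem(SAMPLE))
```

An empty list means that every checked value matches the recommendation. Deriving consensus from the token, rather than hard-coding 36000, keeps the two values consistent if the token recommendation changes.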
Set the expected votes parameter to 2 in HA enabled SAP workloads (SUSE)
Set the expected votes parameter to '2' in the cluster configuration in HA enabled SAP workloads to ensure a proper quorum, resilience, and data consistency.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set the two_node parameter to 1 in the cluster configuration in HA enabled SAP workloads
For a two-node HA cluster, set the quorum parameter 'two_node' to 1 as recommended for SAP on Azure.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
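The two quorum recommendations above (expected votes set to 2 and 'two_node' set to 1) can be verified with a short script. This sketch assumes the values appear as expected_votes and two_node in a flat 'quorum' section of a corosync.conf style file with no nested blocks. The function names and the sample text are hypothetical and exist only for illustration.

```python
import re

# Recommended quorum values, taken from the recommendations above.
RECOMMENDED_QUORUM = {"expected_votes": 2, "two_node": 1}


def quorum_settings(conf_text):
    """Extract 'key: value' pairs from a flat quorum { ... } block."""
    match = re.search(r"quorum\s*\{(.*?)\}", conf_text, re.DOTALL)
    if not match:
        return {}
    pairs = re.findall(
        r"^[ \t]*(\w+)[ \t]*:[ \t]*(\S+)[ \t]*$", match.group(1), re.MULTILINE
    )
    return dict(pairs)


def check_quorum(conf_text):
    """Map each deviating setting to the value that was found (or None)."""
    found = quorum_settings(conf_text)
    return {
        key: found.get(key)
        for key, want in RECOMMENDED_QUORUM.items()
        if found.get(key) != str(want)
    }


# Illustrative input only.
QUORUM_SAMPLE = """
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
"""

print(check_quorum(QUORUM_SAMPLE))
```

An empty dictionary means both values match. A missing setting is reported with the value None.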
Enable the 'concurrent-fencing' parameter in the cluster configuration in HA enabled SAP workloads
Concurrent fencing enables the fencing operations to be performed in parallel, which enhances HA, prevents split-brain scenarios, and contributes to a robust SAP deployment. Set this parameter to 'true' in HA enabled SAP workloads.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Ensure there is one instance of fence_azure_arm in the Pacemaker configuration for HANA DB HA setup
If you're using the Azure fence agent for fencing with either managed identity or service principal, ensure that one instance of fence_azure_arm (an I/O fencing agent for Azure Resource Manager) is present in the Pacemaker configuration for the HANA DB HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Set stonith-timeout to 900 in Pacemaker configuration with Azure fence agent for HANA DB HA setup
If you're using the Azure fence agent for fencing with either managed identity or service principal, set the 'stonith-timeout' to 900 to ensure reliable function of Pacemaker for the HANA DB HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
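The fencing-related recommendations in this section can be summarized as a small check. This is a sketch under stated assumptions, not part of Advisor. It assumes the cluster properties are already available as a dictionary (how they are exported is out of scope), and it assumes the STONITH switch uses the Pacemaker property name 'stonith-enabled'. It also assumes, based only on the recommendations above, that 'stonith-timeout' should be 900 when the Azure fence agent is used and 144 otherwise. The function names are hypothetical.

```python
def recommended_properties(uses_azure_fence_agent):
    """Return the recommended values for the fencing-related properties."""
    return {
        "stonith-enabled": "true",
        "concurrent-fencing": "true",
        # 900 with the Azure fence agent, 144 otherwise (see lead-in).
        "stonith-timeout": "900" if uses_azure_fence_agent else "144",
    }


def normalize(value):
    """Lowercase a value and drop a trailing 's' from values like '900s'.

    Treating '900s' and '900' as equal is an assumption of this sketch.
    """
    if value is None:
        return None
    text = str(value).strip().lower()
    if text.endswith("s") and text[:-1].isdigit():
        return text[:-1]
    return text


def check_properties(actual, uses_azure_fence_agent):
    """Map each deviating property to a (found, recommended) pair."""
    deviations = {}
    for name, want in recommended_properties(uses_azure_fence_agent).items():
        if normalize(actual.get(name)) != want:
            deviations[name] = (actual.get(name), want)
    return deviations


# Illustrative property sets only.
with_fence_agent = {
    "stonith-enabled": "true",
    "concurrent-fencing": "true",
    "stonith-timeout": "900s",
}
print(check_properties(with_fence_agent, uses_azure_fence_agent=True))
```

An empty result means the three properties match the recommendation for the chosen fencing method.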
Ensure that the softdog config file is in the Pacemaker configuration for HANA DB in SAP workloads
The softdog timer is loaded as a kernel module in Linux OS. This timer triggers a system reset if it detects that the system is hung. Ensure that the softdog configuration file is created in the Pacemaker cluster for HANA DB HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
Ensure the softdog module is loaded in Pacemaker in ASCS HA setup in SAP workloads
The softdog timer is loaded as a kernel module in Linux OS. This timer triggers a system reset if it detects that the system is hung. First ensure that you created the softdog configuration file, then load the softdog module in the Pacemaker configuration for HANA DB HA setup.
Potential benefits: Reliability of HA setup in SAP workloads
For More information, see High availability for SAP HANA on Azure VMs on SUSE Linux Enterprise Server
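The two softdog checks above (the configuration file exists, and the module is loaded) can be sketched as follows. The sketch assumes the configuration file follows the modules-load.d format, with one module name per line and comment lines starting with '#' or ';'. It also assumes the loaded modules are read from /proc/modules, where the module name is the first field on each line. Both functions take the file contents as text, and their names and the sample strings are hypothetical.

```python
def softdog_configured(conf_text):
    """True if a modules-load.d style file lists the softdog module."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blanks and comments
            continue
        if line == "softdog":
            return True
    return False


def softdog_loaded(proc_modules_text):
    """True if softdog appears as a module name in /proc/modules text."""
    for raw in proc_modules_text.splitlines():
        fields = raw.split()
        if fields and fields[0] == "softdog":
            return True
    return False


# Illustrative contents only; on a real node, read the configuration
# file and /proc/modules instead of using these strings.
CONF_SAMPLE = "# load the software watchdog at boot\nsoftdog\n"
PROC_SAMPLE = "softdog 16384 2 - Live 0x0000000000000000\n"

print(softdog_configured(CONF_SAMPLE), softdog_loaded(PROC_SAMPLE))
```

Checking both conditions separately distinguishes a node where the module is loaded only for the current boot from one where the load is also persisted in a configuration file.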
Next steps
Learn more about Reliability - Microsoft Azure Well Architected Framework