Design chaos experiments

Your mission-critical application needs to be resilient and ready to respond to failures. However, it's difficult to predict potential failure scenarios in the cloud. Chaos engineering lets you conduct failure experiments in a controlled environment to identify problems that are likely to arise during development and deployment. You deliberately inject real-world faults and observe how the system reacts.

In this unit, you use Azure Chaos Studio. This service helps you measure, understand, and improve your cloud application and service resiliency. It prepares you to respond quickly if a failure occurs under adverse conditions in production.

Conduct failure mode analysis

When you design a chaos experiment, the first step is to conduct a failure mode analysis (FMA) of the application components to identify potential failure scenarios:

  1. List all the components of a user flow that need to be available and functional. For example, the checkout user flow uses Azure App Service, Azure Functions, and an Azure Cosmos DB database.

  2. For each component, list possible failure cases, their impact, and any potential mitigation.

Let's look at the outcome of the FMA for the components of the Contoso Shoes checkout user flow example.

Azure App Service for hosting the front-end application

| Risk | Impact | Possible mitigation |
| --- | --- | --- |
| Availability zone outage | Instances in that zone might become unavailable. A full outage isn't expected, because zone redundancy is enabled on the App Service plan. | Allow for the extra load on the remaining instances and provide enough headroom for this scenario while still achieving the performance targets. |
| SNAT port exhaustion | Outbound connections can't be created. As a result, downstream calls, such as calls to the database, fail. | Use private endpoints for connecting to the downstream components. |
| Individual instance becoming unhealthy | User traffic routed to an unhealthy instance might see poor performance or even fail entirely. | Use the App Service health check feature so that unhealthy instances are automatically identified and replaced by new, healthy instances. |
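
Both mitigations in this table are configuration settings on the App Service plan and the web app rather than code changes. The following fragment is a minimal sketch of how they might look in an ARM template; the resource names, SKU, region, and health probe path are illustrative assumptions, not values taken from this scenario.

{
  "type": "Microsoft.Web/serverfarms",
  "apiVersion": "2022-09-01",
  "name": "contoso-shoes-plan",
  "location": "westus2",
  "sku": { "name": "P1v3", "tier": "PremiumV3", "capacity": 3 },
  "properties": { "zoneRedundant": true }
},
{
  "type": "Microsoft.Web/sites",
  "apiVersion": "2022-09-01",
  "name": "contoso-shoes-web",
  "location": "westus2",
  "properties": {
    "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', 'contoso-shoes-plan')]",
    "siteConfig": { "healthCheckPath": "/healthz" }
  }
}

Zone redundancy spreads the plan's instances across availability zones, which is why the capacity in this sketch is set to three. The health check path should point to an endpoint that exercises the app's critical dependencies.
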
Azure Functions for checkout logic

| Risk | Impact | Possible mitigation |
| --- | --- | --- |
| Slow (cold) start performance | Because the Azure Functions Consumption plan is used, new instances don't have performance guarantees. High demand on the service (from "noisy neighbors") might cause the checkout function to experience a long startup duration that affects performance targets. | Upgrade to the Azure Functions Premium plan. |
| Underlying storage outage | If the underlying storage account becomes unavailable, the function stops working. | Use Load balanced compute with regional storage or Load balanced compute with GRS shared storage. |
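
The Premium plan mitigation is a change to the hosting plan rather than to the function code. The following fragment is a minimal sketch of an Azure Functions Premium (Elastic Premium) plan in an ARM template; the plan name, region, and worker count are illustrative assumptions.

{
  "type": "Microsoft.Web/serverfarms",
  "apiVersion": "2022-09-01",
  "name": "contoso-shoes-functions-plan",
  "location": "westus2",
  "kind": "elastic",
  "sku": { "name": "EP1", "tier": "ElasticPremium" },
  "properties": { "maximumElasticWorkerCount": 20 }
}

Pre-warmed instances on the Premium plan remove the cold-start delay that the Consumption plan can't guarantee against.
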
Azure Cosmos DB database

| Risk | Impact | Possible mitigation |
| --- | --- | --- |
| Renaming a database or collection | Because of a mismatch in configuration, there might be data loss. The application can't access any data until the configuration is updated and its components are restarted. | Prevent this situation by using database-level and collection-level locks. |
| Write region outage | If the primary (write) region encounters an outage, the Azure Cosmos DB account automatically promotes a secondary region to be the new primary write region when automatic (service-managed) failover is configured on the account. The failover occurs to another region in the order of region priority you specified. | Configure the database account to use multiple regions and automatic failover. If there's a failure, the service automatically fails over and prevents any sustained problems in the application. |
| Extensive throttling due to lack of request units (RUs) | Certain stamps might run hot on Azure Cosmos DB utilization while others can still serve requests. | Use better load distribution to more stamps, or add more RUs. |
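
The multi-region mitigation for the write region outage is configured on the Azure Cosmos DB account itself. The following fragment is a minimal sketch of an account with two regions and service-managed failover enabled in an ARM template; the account name and regions are illustrative assumptions.

{
  "type": "Microsoft.DocumentDB/databaseAccounts",
  "apiVersion": "2023-04-15",
  "name": "contoso-shoes-cosmos",
  "location": "eastus2",
  "properties": {
    "databaseAccountOfferType": "Standard",
    "enableAutomaticFailover": true,
    "locations": [
      { "locationName": "East US 2", "failoverPriority": 0 },
      { "locationName": "West US 2", "failoverPriority": 1 }
    ]
  }
}

The failoverPriority values define the order in which Azure Cosmos DB promotes regions if the write region becomes unavailable. This is the behavior that the chaos experiment later in this unit validates.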

Design a chaos experiment

To design a chaos experiment, pick a few failure cases. The choice can be based on the likelihood that the failure occurs or on the possible impact.

The goal of the experiment is to validate resiliency measures that you implemented in your application. For an example hypothesis, suppose you run your application on App Service and enable zone redundancy. If all the underlying instances in a zone go down, you expect your application to still be running.

Use Chaos Studio to inject the faults into the relevant components. Chaos Studio offers a library of faults for you to choose from. However, the fault library doesn't cover everything, so you might need to adjust your scenario or find additional tools to help you inject the failure.

Important

Target only a non-production environment during your experiments. Injecting faults into your production environment can be risky and requires experience and planning.

Example: Azure Cosmos DB outage and failover

Suppose you pick the "write region outage" failure scenario of Azure Cosmos DB listed in the table. The hypothesis is: A service-initiated failover shouldn't result in any sustained impact on the application. If this hypothesis proves to be true, you've validated that your resiliency measure of replicating to multiple regions has the desired positive effect on application reliability.

To simulate this fault, use the Azure Cosmos DB failover fault from the Chaos Studio fault library.

This example is for an Azure Cosmos DB failover that runs for 10 minutes (PT10M) and uses West US 2 as the new write region. It assumes that West US 2 was already set up as one of the read replication regions.

{
  "name": "branchOne",
  "actions": [
    {
      "type": "continuous",
      "name": "urn:csci:microsoft:cosmosDB:failover/1.0",
      "parameters": [
        {
          "key": "readRegion",
          "value": "West US 2"
        }
      ],
      "duration": "PT10M",
      "selectorid": "myCosmosDbResource"
    }
  ]
}

After the experiment ends, Chaos Studio switches the write region back to its original value.
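
The snippet above is a single branch. A complete Chaos Studio experiment wraps branches in steps and maps the selectorid to the actual target resource through a selector. The following sketch shows one way the pieces might fit together inside the properties of a Microsoft.Chaos/experiments resource; the step name, placeholder IDs, and Microsoft-CosmosDB target name are assumptions to verify against the Chaos Studio documentation.

{
  "properties": {
    "selectors": [
      {
        "id": "myCosmosDbResource",
        "type": "List",
        "targets": [
          {
            "type": "ChaosTarget",
            "id": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DocumentDB/databaseAccounts/<account-name>/providers/Microsoft.Chaos/targets/Microsoft-CosmosDB"
          }
        ]
      }
    ],
    "steps": [
      {
        "name": "stepOne",
        "branches": [
          { "name": "branchOne", "actions": [ "…as shown in the preceding snippet…" ] }
        ]
      }
    ]
  }
}

The experiment resource also runs under a managed identity, which must have sufficient permissions on the target resource before Chaos Studio can execute the failover.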

Before you can inject a fault against an Azure resource, you must enable the corresponding targets and capabilities setting for that resource. This setting controls the faults that can run against the resources enabled for fault injection. When you use targets and capabilities together with other security measures, you can avoid accidental or malicious fault injection.
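
You can enable targets and capabilities in the Azure portal, but you can also onboard them declaratively. The following fragment is a minimal sketch of enabling the Cosmos DB target and its failover capability as extension resources in an ARM template; the API version, account name, and capability name are assumptions to check against the current Chaos Studio fault library reference.

{
  "type": "Microsoft.Chaos/targets",
  "apiVersion": "2023-11-01",
  "name": "Microsoft-CosmosDB",
  "scope": "[resourceId('Microsoft.DocumentDB/databaseAccounts', 'contoso-shoes-cosmos')]",
  "properties": {}
},
{
  "type": "Microsoft.Chaos/targets/capabilities",
  "apiVersion": "2023-11-01",
  "name": "Microsoft-CosmosDB/Failover-1.0",
  "scope": "[resourceId('Microsoft.DocumentDB/databaseAccounts', 'contoso-shoes-cosmos')]",
  "properties": {},
  "dependsOn": [
    "[extensionResourceId(resourceId('Microsoft.DocumentDB/databaseAccounts', 'contoso-shoes-cosmos'), 'Microsoft.Chaos/targets', 'Microsoft-CosmosDB')]"
  ]
}

Enabling only the capability that the experiment needs keeps the blast radius limited to that one fault.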

Now that you designed both load tests and chaos experiments, you need to automate them in your pipelines so that they run consistently and regularly. In the next unit, you learn about adding the tests to your CI/CD pipelines.

Knowledge check

1. What is the goal of a chaos experiment?

2. Which services does Azure Chaos Studio support?

3. Before you can run an experiment against an Azure service from Azure Chaos Studio, what settings do you need to enable?