Exercise - Define health state, metrics, and thresholds
In this exercise, we continue with the health model structure you created previously. Your task is to quantify health states of individual components for the example application.
In the health model structure, evaluate the layers from the top down: start with user flows, and then proceed to the lower layers.
User flow health state
So far, we identified two user flows: List catalog items and Add comment. To determine health states for each flow, ask questions such as:
- When is the user flow considered healthy?
- Can it operate in a degraded state?
Based on the implementation and functional requirements, identify the application components that participate in the user flow. The components are described in Example architecture components.
User flow | Components |
---|---|
List catalog items | Front-end internal web application, Catalog API |
Add comment | Front-end internal web application, Catalog API, Background processor |
If any of those components become unhealthy, the user flow is expected to become unhealthy.
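The roll-up rule can be sketched in Python. This is a minimal illustration, not part of the example application: the `HealthState` enum, the `USER_FLOWS` mapping, and the `user_flow_health` function are hypothetical names. The sketch also assumes that a user flow takes the worst state of its components, so a degraded component degrades the flow as well.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Hypothetical health states, ordered so that a higher value is worse."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


# Components that participate in each user flow, taken from the table above.
USER_FLOWS = {
    "List catalog items": [
        "Front-end internal web application",
        "Catalog API",
    ],
    "Add comment": [
        "Front-end internal web application",
        "Catalog API",
        "Background processor",
    ],
}


def user_flow_health(flow: str, component_states: dict) -> HealthState:
    """Return the worst state among the components of a user flow.

    Assumption: the flow inherits the worst component state.
    """
    return max(component_states[name] for name in USER_FLOWS[flow])


states = {
    "Front-end internal web application": HealthState.HEALTHY,
    "Catalog API": HealthState.HEALTHY,
    "Background processor": HealthState.UNHEALTHY,
}
for flow in USER_FLOWS:
    print(flow, "->", user_flow_health(flow, states).name)
```

With these sample states, only the Add comment flow is affected, because the background processor doesn't participate in List catalog items.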
Note
Some applications can operate in a special degraded mode. For example, if Contoso Shoes implements local browser caching, employees who use the web application can still create comments, but the comments aren't sent and the customer view isn't updated until the Catalog API becomes healthy again. The browser can check the API's health continuously to detect when that happens.
Application component health state
Determine metrics that contribute to the component's health state. For this step, you need to know the functionality of the component. Ask questions like:
- What processing time in the API is acceptable to maintain a good user experience?
- Are there any expected errors? What's the "normal" error rate?
- What's the "normal" processing time? What does it mean if processing time is higher than normal?
- What happens to write operations if Azure Cosmos DB is unreachable?
These questions should lead you to specific and measurable thresholds for key metrics. For example, you might consider these threshold values for the Catalog API component.
Metrics and threshold | Health state |
---|---|
Response time < 150 ms and failed request count < 10 | Healthy |
Response time < 300 ms and failed request count < 50 | Degraded |
Response time > 300 ms or failed request count > 50 | Unhealthy |
You can get the values from an application monitoring solution, such as Application Insights.
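As a rough sketch, the Catalog API thresholds from the table can be turned into a classification function. The function name `catalog_api_health` and its parameters are hypothetical, and the sketch assumes that the metric values were already retrieved from your monitoring solution. The table doesn't say how to treat values that sit exactly on a boundary, such as 300 ms, so this sketch assumes they count as unhealthy.

```python
def catalog_api_health(response_time_ms: float, failed_requests: int) -> str:
    """Classify Catalog API health from two metrics.

    Thresholds come from the example table. Assumption: a value exactly on
    the upper boundary (300 ms or 50 failed requests) counts as unhealthy.
    """
    if response_time_ms < 150 and failed_requests < 10:
        return "Healthy"
    if response_time_ms < 300 and failed_requests < 50:
        return "Degraded"
    return "Unhealthy"


# Sample readings (made-up values for illustration only).
for reading in [(120, 3), (200, 3), (120, 60)]:
    print(reading, "->", catalog_api_health(*reading))
```

Checking the healthy condition first and falling through to the worse states keeps each metric from masking the other: a fast response time doesn't hide a high failure count.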
Azure resource health state
Azure service health states are based on specific resources. For example, Azure Cosmos DB reports request unit (RU) consumption, and Azure App Service provides information about CPU utilization.
For information about metrics by resource type, see Supported metrics with Azure Monitor.
Health states and thresholds
After you evaluate all layers of the application, you should have a list of components and their health state definitions that look similar to this example.
Component | Indicator/metric | Healthy | Degraded | Unhealthy |
---|---|---|---|---|
List catalog items user flow | Underlying health state | Front end healthy and Catalog API healthy | ||
Add comment user flow | Underlying health state | Front end healthy, Catalog API healthy, and background processor healthy | ||
Front-end web application | # of non-20x HTTP responses/min | 0 | 1-10 | > 10 |
Catalog API | # of exceptions/sec | < 10 | 10-50 | > 50 |
Catalog API | Average processing time (ms) | < 150 | 150-500 | > 500 |
Background processor | Average time in queue (ms) | < 200 | 200-1,000 | > 1,000 |
Background processor | Average processing time (ms) | < 100 | 100-200 | > 200 |
Background processor | Failure count | < 3 | 3-10 | > 10 |
Azure Cosmos DB | Normalized RU consumption | < 70% | 70%-90% | > 90% |
Azure Key Vault | Failure count | < 3 | 3-10 | > 10 |
Azure Event Hubs | Processing backlog length (outgoing/incoming messages) | < 3 | 3-20 | > 20 |
Azure Blob Storage | Average latency (ms) | < 100 | 100-200 | > 200 |
In this example, the front-end web application and the Catalog API have different error tolerances. The difference reflects how each component is expected to behave. All front-end errors should be handled client-side, so the healthy threshold is zero. On the API layer, fewer than 10 exceptions per second is still considered healthy, to account for user-caused errors. For example, errors such as 404 - Not Found don't necessarily indicate a health issue.
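One way to keep definitions like these maintainable is to store the thresholds as data and evaluate every metric with the same logic. The following Python sketch uses the Background processor rows from the table. The `Threshold` class, the metric key names, and the `component_health` function are hypothetical, and the sketch assumes that a component takes the worst state reported by any of its metrics.

```python
from dataclasses import dataclass
from enum import IntEnum


class HealthState(IntEnum):
    """Hypothetical health states, ordered so that a higher value is worse."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


@dataclass(frozen=True)
class Threshold:
    """Boundaries for one metric: below the first is healthy, above the second is unhealthy."""
    healthy_below: float
    unhealthy_above: float

    def classify(self, value: float) -> HealthState:
        if value < self.healthy_below:
            return HealthState.HEALTHY
        if value > self.unhealthy_above:
            return HealthState.UNHEALTHY
        return HealthState.DEGRADED


# Background processor thresholds from the table (metric key names are illustrative).
BACKGROUND_PROCESSOR = {
    "avg_time_in_queue_ms": Threshold(200, 1000),
    "avg_processing_time_ms": Threshold(100, 200),
    "failure_count": Threshold(3, 10),
}


def component_health(thresholds: dict, metrics: dict) -> HealthState:
    """Assumption: the component takes the worst state across its metrics."""
    return max(thresholds[name].classify(value) for name, value in metrics.items())


# Made-up sample values for illustration only.
sample = {"avg_time_in_queue_ms": 150, "avg_processing_time_ms": 120, "failure_count": 0}
print(component_health(BACKGROUND_PROCESSOR, sample).name)
```

Because the thresholds live in a dictionary, adjusting a boundary or adding a metric doesn't require a change to the evaluation logic.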