Design review checklist for Operational Excellence
This checklist presents a set of recommendations to help you build a culture of operational excellence. Start with a DevOps approach to integrate specializations from multiple disciplines. This approach creates a rigorous design and development practice that leads to repeatable, reliable, and safe deployments of infrastructure and code.
Prioritize human intervention in areas that benefit from it, and incorporate automation in other areas. Observability serves operational excellence in two ways: it monitors health events, and it validates the current workload design and implementation to inform future product development.
If you don't consider tradeoffs and recommendations for operational excellence, your workload might be at risk. Carefully consider the points covered in the following checklist to instill confidence in your design's success.
Checklist
| | Code | Recommendation |
---|---|---|
☐ | OE:01 | Define your standard practices to develop and operate your workload. Foster a blameless culture that emphasizes continuous learning and prioritizes continuous improvement and optimization. |
☐ | OE:02 | Formalize the way you run routine, as-needed, and emergency operational tasks. Increase consistency and predictability by adopting industry-proven practices and approaches. |
☐ | OE:03 | Formalize software ideation and planning processes. Draw from established industry and organizational standards for team communication, requirements and design documentation, and software development processes. |
☐ | OE:04 | Enhance software development and quality assurance by implementing industry-standard practices. Ensure clear role definitions and consistent processes by standardizing tools, source control, design patterns, documentation, and style guides. |
☐ | OE:05 | Use a standardized infrastructure as code (IaC) approach to prepare resources and configurations. Use IaC to ensure consistent styles, modularization, and quality assurance. Prefer declarative over imperative approaches when practical. |
☐ | OE:06 | Build a workload supply chain that drives changes through predictable, automated pipelines. Ensure these pipelines test and promote changes across all environments and quality gates. Incorporate comprehensive testing. |
☐ | OE:07 | Design and implement a monitoring system to capture and expose telemetry, metrics, and logs from your infrastructure and code. Use this data to validate design choices and guide future design and business decisions. |
☐ | OE:08 | Establish a robust emergency operations practice. Create an incident response plan that clearly documents roles, responsibilities, and all emergency response processes and procedures. Capture learnings through postmortems and incident reports to continuously improve the plan and the workload. |
☐ | OE:09 | Automate tasks that are repetitive and procedural and that provide a clear return on investment. Prefer off-the-shelf automation tools over custom solutions. Apply the Well-Architected Framework pillars to the design and implementation of all automation efforts. |
☐ | OE:10 | Design and implement automation upfront for tasks like lifecycle management, bootstrapping, and governance. Avoid retrofitting automation later. Simplify your design by adopting platform-native automation functionality. |
☐ | OE:11 | Clearly define your workload's safe deployment practices. Focus on small, incremental releases with quality gates. Use modern deployment patterns and progressive exposure to manage risk. Plan for both routine and emergency deployments. |
☐ | OE:12 | Implement a deployment failure mitigation strategy to handle unexpected issues during rollout. Use approaches like rollback, feature disablement, or the native capabilities of your deployment pattern for quick recovery. |
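
To make OE:11 and OE:12 concrete, the following is a minimal sketch of progressive exposure with a quality gate at each stage and an automatic rollback when a gate fails. It is an illustration only, not a prescribed implementation. The names `progressive_rollout`, `RolloutResult`, `set_exposure`, and `health_check` are hypothetical, and the stage percentages are arbitrary assumptions. In a real workload, the two callbacks would wrap whatever traffic-shifting and monitoring capabilities your deployment platform provides.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    """Outcome of a staged rollout (hypothetical type for this sketch)."""
    succeeded: bool
    completed_stages: List[int] = field(default_factory=list)
    rolled_back_at: Optional[int] = None


def progressive_rollout(
    stages: List[int],
    set_exposure: Callable[[int], None],
    health_check: Callable[[int], bool],
) -> RolloutResult:
    """Expose a release to growing traffic percentages, gating each step.

    stages       -- exposure percentages to step through, smallest first
    set_exposure -- assumed callback that routes that share of traffic
                    to the new version
    health_check -- assumed callback that returns True when the quality
                    gate passes at the given exposure
    """
    result = RolloutResult(succeeded=True)
    for percent in stages:
        set_exposure(percent)
        if not health_check(percent):
            # Mitigation: withdraw the new version from all traffic.
            set_exposure(0)
            result.succeeded = False
            result.rolled_back_at = percent
            return result
        result.completed_stages.append(percent)
    return result


if __name__ == "__main__":
    exposure_log: List[int] = []
    outcome = progressive_rollout(
        stages=[1, 10, 50, 100],
        set_exposure=exposure_log.append,
        # Simulated gate that starts failing once half of traffic is exposed.
        health_check=lambda percent: percent < 50,
    )
    print(outcome)
    print(exposure_log)
```

In this simulated run, the gate fails at the 50 percent stage, so the rollout stops there and exposure is set back to 0. Keeping the stages small limits how many users see a faulty release before the gate catches it. The same structure could use feature disablement instead of a traffic rollback by swapping the mitigation step.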
Next steps
We recommend that you review the Operational Excellence tradeoffs to explore other concepts.