Safe deployment practices

Sometimes a release doesn't live up to expectations. Even when a team follows best practices and passes all quality gates, a production deployment occasionally causes unforeseen problems for users. To minimize and mitigate the impact of these issues, DevOps teams are encouraged to adopt a progressive exposure strategy that balances the exposure of a given release with its proven performance. As a release proves itself in production, it becomes available to tiers of broader audiences until everyone is using it. Teams can use safe deployment practices to maximize the quality and speed of releases in production.

Control exposure to customers

DevOps teams can employ various practices to control the exposure of updates to customers. Historically, A/B testing has been a popular choice for teams looking to see how different versions of a service or user interface perform against target goals. A/B testing is also relatively easy to use, since the changes are typically minor and the test often compares only different releases at the customer-facing edge of a service.

Safe deployment through rings

As platforms grow, infrastructure scale and audience needs tend to grow as well. This creates a demand for a deployment model that balances the risks associated with a new deployment with the benefits of the updates it promises. The general idea is that a given release should be first exposed only to a small group of users with the highest tolerance for risk. Then, if the release works as expected, it can be exposed to a broader group of users. If there are no problems, then the process can continue out through broader groups of users, or rings, until everyone is using the new release. With modern continuous delivery platforms like GitHub Actions and Azure Pipelines, building a deployment process with rings is accessible to DevOps teams of any size.

Feature flags

Certain functionality sometimes needs to be deployed as part of a release, but not initially exposed to users. In those cases, feature flags provide a solution where functionality may be enabled via configuration changes based on environment, ring, or any other specific deployment.
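As a minimal sketch of the idea, the following Python snippet decides whether a feature is on from configuration alone. The flag names, the `FLAG_CONFIG` layout, and the `is_enabled` helper are hypothetical and exist only for illustration; real feature flag services have their own APIs.

```python
# Hypothetical flag configuration: each flag lists the environments and/or
# rings where it is switched on. A missing key means "no restriction".
FLAG_CONFIG = {
    "new_dashboard": {"environments": {"production"}, "rings": {0, 1}},
    "beta_search": {"environments": {"test"}},
}


def is_enabled(flag, environment, ring, config=FLAG_CONFIG):
    """Return True if `flag` is enabled for the given environment and ring."""
    rule = config.get(flag)
    if rule is None:
        # Unknown flags default to off, so shipped code stays hidden.
        return False
    if "environments" in rule and environment not in rule["environments"]:
        return False
    if "rings" in rule and ring not in rule["rings"]:
        return False
    return True


if __name__ == "__main__":
    # The code for new_dashboard is deployed everywhere, but only
    # production deployments in rings 0 and 1 expose it.
    print(is_enabled("new_dashboard", "production", 0))
    print(is_enabled("new_dashboard", "production", 3))
```

Because the decision lives in configuration, exposing the feature to another ring is a configuration change rather than a new deployment.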

User opt-in

Similar to feature flags, user opt-in provides a way to limit exposure. In this model, a given feature is enabled in the release, but not activated for a user unless they specifically want it. The risk tolerance decision is offloaded to users so they can decide how quickly they want to adopt certain updates.

Multiple practices are commonly employed simultaneously. For example, a team may have an experimental feature intended for a very specific use case. Since it's risky, they'll deploy it to the first ring for internal users to try out. However, even though the feature is in the code, someone will need to set the feature flag for a specific deployment within the ring so that the feature is exposed via the user interface. Even then, the feature flag may only expose the option for a user to opt in to using the new feature. Anyone who isn't in the ring, isn't on that deployment, or hasn't opted in won't be exposed to the feature. While this is a fairly contrived example, it serves to illustrate the flexibility and practicality of progressive exposure.
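The layered example above can be sketched as a single check that requires all three conditions at once. The `User` type, its fields, and the `is_exposed` function are assumptions made for this illustration, not part of any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record used only for this sketch."""
    name: str
    ring: int          # ring the user belongs to
    deployment: str    # deployment that serves the user
    opted_in: bool     # whether the user chose to try the feature


def is_exposed(user, feature_ring, flagged_deployments):
    """Return True only if the ring, the feature flag, and the opt-in all agree."""
    in_ring = user.ring == feature_ring
    flag_set = user.deployment in flagged_deployments
    return in_ring and flag_set and user.opted_in


if __name__ == "__main__":
    flagged = {"ring0-west"}  # deployments where the flag has been set
    alice = User("alice", ring=0, deployment="ring0-west", opted_in=True)
    bob = User("bob", ring=0, deployment="ring0-west", opted_in=False)
    print(is_exposed(alice, 0, flagged))  # in the ring, flagged, opted in
    print(is_exposed(bob, 0, flagged))    # has not opted in
```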

Common issues teams face early on

As teams move toward a more Agile DevOps practice, they may run into problems consistent with others who've migrated away from traditional monolithic deliveries. Teams used to deploying once every few months have a mindset that buffers for stabilization. They expect that each deployment will introduce a substantial shift in their service, and that there will be unforeseen issues.

Payloads are too big

Services that are deployed every few months are usually filled with many changes. This increases the likelihood that there will be immediate issues, and also makes those issues difficult to troubleshoot because so much has changed at once. By moving to more frequent deliveries, the differences in what's deployed become smaller, which allows for more focused testing and easier debugging.

No service isolation

Monolithic systems are traditionally scaled by upgrading the hardware on which they're deployed. However, when something goes wrong with that single instance, it leads to problems for everyone. One simple solution is to add multiple instances so that you can load balance users. However, this can require significant architectural considerations, as many legacy systems aren't built to be multi-instance. Plus, significant duplicate resources may need to be allocated for functionality that may be better consolidated elsewhere.

As new features are added, explore whether a microservices architecture can help you operate and scale thanks to better service isolation.

Manual steps lead to mistakes

When a team only deploys a few times per year, automating deliveries may not seem worth the investment. As a result, many deployment processes are manually managed. This requires a significant amount of time and effort, and is prone to human error. Simply automating the most common build and deployment tasks can go a long way toward reducing lost time and unforced errors.

Teams can also make use of infrastructure as code to have better control over deployment environments. This removes the need for requests to the operations team to make manual changes as new features or dependencies are introduced to various deployment environments.

Only Ops can do deployments

Some organizations have policies that require all deployments to be initiated and managed by the operations staff. While there may have been good reasons for that in the past, an Agile DevOps process greatly benefits when the development team can initiate and control deployments. Modern continuous delivery platforms offer granular control over who can initiate which deployments, and who can access status logs and other diagnostic information, making sure the right people have the right information as quickly as possible.

Bad deployments proceed and can't be rolled back

Sometimes a deployment goes wrong and teams need to address it. However, when processes are manual and access to information is slow and limited, it can be difficult to roll back to a previous working deployment. Fortunately, there are various tools and practices for mitigating the risk of failed deployments.

Core principles

Teams looking to adopt safe deployment practices should set some core principles to underpin the effort.

Be consistent

The same tools used to deploy in production should be used in development and test environments. If there are issues, such as the ones that often arise from new versions of dependencies or tools, they should be caught well before the code is close to being released to production.

Care about quality signals

Too many teams fall into the common trap of not really caring about quality signals. Over time, they may find that they write tests or take on quality tasks simply to change a yellow warning to a green approval. Quality signals matter because they represent the pulse of a project. The quality signals used to approve deployments should be tracked every day.

Deployments should require zero downtime

While it's not critical for every service to always be available, teams should approach their DevOps delivery and operation stages with the mindset that they can and should deploy new versions without taking the service down for any time at all. Modern infrastructure and pipeline tools are advanced enough that it's feasible for virtually any team to target 100% uptime.

Deployments should happen during working hours

If a team works with the mindset that deployments require zero downtime, then it doesn't really matter when a deployment is pushed. Further, it becomes advantageous to push deployments during working hours, especially early in the day and early in the week. If something goes wrong, it can be traced early enough to control the blast radius. Plus, everyone will already be working and focused on getting issues fixed.

Ring-based deployment

Teams with mature DevOps release practices are in a position to take on ring-based deployment. In this model, new features are first rolled out to customers willing to accept the most risk. As the deployment is proven, the audience expands to include more users until everyone is using it.

An example ring model

A typical ring deployment model is designed to find issues as early as possible through the careful segmentation of users and infrastructure. The following example shows how rings are used by a major team at Microsoft.

| Ring | Purpose | Users | Data center |
|------|---------|-------|-------------|
| 0 | Finds most of the user-impacting bugs introduced by the deployment | Internal only, high tolerance for risk and bugs | US West Central |
| 1 | Areas the team doesn't test extensively | Customers using a breadth of the product | A small data center |
| 2 | Scale-related issues | Public accounts, ideally free ones using a diverse set of features | A medium or large data center |
| 3 | Scale issues in internal accounts and international-related issues | Large internal accounts and European customers | Internal data center and a European data center |
| 4 | Remaining scale units | Everyone else | All deployment targets |
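The ring order in the table above can be sketched as a loop that deploys to one ring at a time and stops expanding as soon as a ring looks unhealthy. The `roll_out` function and its `deploy` and `is_healthy` callbacks are hypothetical placeholders; in practice a continuous delivery pipeline would supply those steps, along with a waiting period between rings.

```python
RINGS = [0, 1, 2, 3, 4]  # ring numbers from the example table


def roll_out(rings, deploy, is_healthy):
    """Deploy ring by ring and return the rings that completed successfully.

    `deploy(ring)` pushes the release to a ring and `is_healthy(ring)`
    reports whether the release behaved as expected there. Both are
    caller-supplied callbacks assumed for this sketch.
    """
    completed = []
    for ring in rings:
        deploy(ring)
        if not is_healthy(ring):
            # Stop here so broader audiences never see the bad release.
            break
        completed.append(ring)
    return completed


if __name__ == "__main__":
    # Simulate a release that shows a scale-related problem in ring 2.
    done = roll_out(RINGS, deploy=lambda ring: None,
                    is_healthy=lambda ring: ring != 2)
    print(done)  # rings 3 and 4 are never deployed
```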

Allow bake time

The term bake time refers to the amount of time a deployment is allowed to run before expanding to the next ring. Some issues may take hours or longer to start showing symptoms, so the release should be in use for an appropriate amount of time before it's considered ready.

In general, 24 hours should be enough time for most scenarios to expose latent bugs. However, the bake time should include a period of peak usage, which means a full business day for services that peak during business hours.
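One way to express this rule is a check that requires both a minimum elapsed time and at least one complete peak-usage window since the deployment. This is a sketch under assumptions: the `bake_complete` function, the 24-hour default, and the way peak windows are passed in are illustrative choices, not a prescribed implementation.

```python
from datetime import datetime, timedelta

MIN_BAKE = timedelta(hours=24)  # default taken from the guidance above


def bake_complete(deployed_at, now, peak_windows, min_bake=MIN_BAKE):
    """Return True when the release has baked long enough to promote.

    `peak_windows` is a list of (start, end) datetime pairs describing
    periods of peak usage, such as business hours. The release must have
    run through at least one whole window.
    """
    if now - deployed_at < min_bake:
        return False
    return any(start >= deployed_at and end <= now
               for start, end in peak_windows)


if __name__ == "__main__":
    business_day = (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 17))
    deployed = datetime(2024, 1, 1, 8)
    # 24 hours later, one full business day has passed.
    print(bake_complete(deployed, deployed + timedelta(hours=24), [business_day]))
```

A release deployed just before a weekend would fail the second condition after 24 hours, because no peak window has passed yet.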

Expedite hotfixes

A live site incident (LSI) occurs when a bug has a serious impact in production. LSIs necessitate the creation of a hotfix, which is an out-of-band update designed to address a high-priority issue.

If a bug is Sev 0, the most severe type of bug, the hotfix may be deployed directly to the impacted scale unit as quickly as responsibly possible. While it's critical that the fix not make things worse, bugs of this severity are considered so disruptive that they must be addressed immediately.

Hotfixes for bugs rated Sev 1 must be deployed through ring 0, but can then be deployed to the affected scale units as soon as approved.

Hotfixes for bugs with lower severity must be deployed through all rings as planned.
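The three severity rules above can be summarized as a small routing function. The `hotfix_path` name, the string labels for each stage, and the example scale unit names are hypothetical; approvals and health checks between stages are left out of this sketch.

```python
def hotfix_path(severity, affected_units, all_rings=(0, 1, 2, 3, 4)):
    """Return the ordered deployment stages for a hotfix.

    severity 0: go straight to the impacted scale units.
    severity 1: go through ring 0 first, then the impacted scale units.
    lower severity (2 and up): go through every ring as planned.
    """
    if severity < 0:
        raise ValueError("severity must be 0 or greater")
    if severity == 0:
        return list(affected_units)
    if severity == 1:
        return ["ring 0", *affected_units]
    return [f"ring {ring}" for ring in all_rings]


if __name__ == "__main__":
    units = ["scale-unit-7"]  # hypothetical name of an impacted scale unit
    for sev in (0, 1, 2):
        print(sev, hotfix_path(sev, units))
```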

Key takeaways

Every team wants to deliver updates quickly and at the highest possible quality. With the right practices, delivery can be a productive and painless part of the DevOps cycle.

  • Deploy often.
  • Stay green throughout the sprint.
  • Use consistent deployment tooling in development, test, and production.
  • Use a continuous delivery platform that allows automation and authorization.
  • Follow safe deployment practices.

Next steps

Learn how feature flags help control the exposure of new features to users.