Reduce the Blast Radius of a Bad Deployment with Automated Canary Analysis

May 23, 2022 by Stephen Atwell

Software deployment processes differ across organizations, teams, and applications. The most basic, and perhaps the riskiest, is the “big bang deployment.” This strategy updates all nodes within the target environment simultaneously with the new software version.

This strategy introduces several risks, including downtime while the update is in progress. It also makes rollbacks challenging when problems arise in production.

The effects can be devastating. A bad release can raise issues in production or even take down the system, and because rollbacks aren't easy, those issues can turn into a prolonged, catastrophic outage.

Overall, a big bang deployment throttles the delivery pipeline, causing deployments, bug fixes, and rollbacks to take too long and be too complex.

Big bang deployment

Although bad deployments cause challenges, automated canary analysis helps. Let’s explore how to reduce the blast radius of a bad deployment using Armory — based on open source Spinnaker — and Kayenta for automated canary analysis.

Safe and Reliable Deployments with Armory

Releasing lousy software can have a long-lasting, damaging impact on any business. Armory helps you overcome this by empowering developers with tools for automated, intelligent software delivery. Intelligent software delivery prevents large-scale outages and minimizes the blast radius of bad deployments.

DevOps teams can define Armory pipelines using templates to encourage reuse and continuous delivery (CD). These pipelines enforce best practices from Netflix, Google, and the open-source software (OSS) community. They also reduce the risk of bad deployments and large-scale outages.

Armory minimizes the impact of bad deployments using a 1-click rollback when a new deployment causes errors. Our platform also uses the automated canary analysis technique to test new releases before deploying them to production.

With this technique, the new changes deploy to a small set of users before the full release. The software automatically promotes or fails the deployment based on predefined metrics.

What is Canary Deployment?

The idea of canary deployment is based on canaries in coal mines. Miners took these birds into the mine to measure the amount of toxic gas present. Canaries are more sensitive to dangerous gases than humans, so miners would know hazardous gases were likely present if a canary died inside the mine.

Software deployment uses a version of this strategy. Instead of birds, the canary is the new software version. DevOps teams roll out a new application version in stages, deploying it to a small subset of the production infrastructure and enabling the latest version for a small set of users who help identify issues.

When ready, DevOps teams deploy the software version to a larger subset of the infrastructure and a larger set of users, and so on, until the rollout is complete. This strategy dramatically reduces the risk associated with deploying a new software version into production.
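The staged rollout described above can be sketched as a simple loop. This is an illustrative sketch, not Armory's actual implementation: the stage percentages and the `health_check` callback are hypothetical stand-ins for real traffic shifting and monitoring.

```python
def staged_rollout(stages, health_check):
    """Shift traffic to the new version in increasing steps.

    `stages` lists the cumulative traffic percentages to try;
    `health_check` is a hypothetical callback that returns True when
    the canary's metrics look healthy at the given stage.
    Returns the percentage reached: 100 on full rollout, 0 on rollback.
    """
    for percent in stages:
        # Route `percent` of traffic to the new version (stubbed here).
        if not health_check(percent):
            # A failed check at any stage rolls everything back.
            return 0
    return 100

# Healthy canary: every stage passes, rollout completes.
print(staged_rollout([5, 25, 50, 100], lambda p: True))    # 100

# Canary that degrades under load: roll back at the 50% stage.
print(staged_rollout([5, 25, 50, 100], lambda p: p < 50))  # 0
```

The key property is that a failure at any stage stops the rollout while most traffic is still on the stable version.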

Canary Deployment

Manual Versus Automated Canary Analysis

When a DevOps team manually performs this deployment strategy, they deploy a new version of the application to a small subset of production servers and shift a small percentage of traffic to the latest version. The rest of the servers and traffic remain unchanged.

The team then looks at graphs and logs and monitors multiple metrics to determine the server health with the new version. If the results are within acceptable values, then the team deploys to a larger set of servers and traffic and repeats the process. Otherwise, they roll back the deployment and route all the traffic to the stable servers.

As we can see, manual canary analysis is tedious and time-consuming. It doesn't scale well when teams deploy to many servers several times a day or week, and it is prone to human error.

Automated canary analysis overcomes this by, well, automation. Automation makes fetching metrics and running statistical tests more accurate and less time-consuming.

Automated Canary Analysis using Armory Continuous Deployment-as-a-Service

In addition to Spinnaker, Armory offers a new lightweight SaaS product that supports Kubernetes canary deployments backed by automated canary analysis. This offering lets you easily trigger a canary deployment from your existing deployment tooling, and it provides a CLI that is a drop-in replacement for kubectl.

CD-as-a-Service allows you to specify a set of queries for your metric provider, and thresholds for each query. If a metric comes back outside of the threshold range, the analysis is considered a failure. This makes it simple to, for example, require that CPU usage is below X% and memory usage is below Y% during an analysis. CD-as-a-Service provides a prescriptive canary strategy that can deploy pods to the new version one at a time, and check all metrics over a given amount of time before continuing to deploy to additional pods. This allows you to check the health of each individual pod as each one is provisioned, and trigger a rollback if any problems are detected.
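The threshold check described above can be sketched in a few lines. The query names and limits here are hypothetical examples; in the real product they come from your metric provider configuration.

```python
# Hypothetical per-query thresholds: fail the analysis if a metric
# comes back above its limit.
thresholds = {
    "avg cpu usage": 80.0,     # require CPU below 80%
    "avg memory usage": 75.0,  # require memory below 75%
}

def analyze(samples):
    """samples maps each query name to its observed value.

    Returns (passed, failures): the analysis fails if any metric
    comes back outside its threshold range.
    """
    failures = [name for name, value in samples.items()
                if value > thresholds[name]]
    return (len(failures) == 0, failures)

print(analyze({"avg cpu usage": 62.0, "avg memory usage": 70.0}))
# (True, [])
print(analyze({"avg cpu usage": 91.5, "avg memory usage": 70.0}))
# (False, ['avg cpu usage'])
```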

In addition to automated canary analysis, Continuous Deployment-as-a-Service also supports running your existing custom automation during a canary deployment strategy. You can use this, for example, to query Elasticsearch and check your logs for errors. If errors occur, CD-as-a-Service rolls back the change; if there are no errors, it finishes scaling up traffic.
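As a sketch of such a custom automation hook, the decision logic might look like the following. The Elasticsearch query itself is stubbed out as an assumed input; only the rollback-or-continue decision is shown.

```python
def canary_log_gate(error_count, max_errors=0):
    """Return 'rollback' if the canary logged errors, else 'continue'.

    `error_count` would come from something like an Elasticsearch
    count query over the canary pods' logs during the analysis window.
    """
    return "rollback" if error_count > max_errors else "continue"

print(canary_log_gate(0))  # continue
print(canary_log_gate(7))  # rollback
```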

Armory is still seeking additional design partners to help shape the CD-as-a-Service project, but spots are limited and filling up fast, so sign up today.

Automated Canary Analysis with Spinnaker and Kayenta

The automated canary analysis (ACA) platform Kayenta integrates with Armory, which is based on the open-source multi-cloud continuous delivery platform Spinnaker. You can set up an automated canary analysis stage in a Spinnaker pipeline and use Kayenta to assess the canary’s risk by fetching user data, running statistical tests, running checks for degradation between the new and old version, and providing an aggregate score for the canary.

Based on the score, Kayenta automatically judges whether to promote or fail the canary or prompt human intervention. It does this in two phases: metric collection and judgment.

Armory implements canary analysis by running three clusters in parallel: production, canary, and baseline. Comparing the production cluster with the canary deployment wouldn’t produce reliable results because of long-running process effects. For this reason, Armory creates a baseline cluster where it deploys the application’s production version.

The platform then compares the canary cluster against the baseline cluster. The canary and baseline clusters each receive a small percentage of the traffic, while the rest goes to the production cluster. Armory then handles the lifecycle of the canary and baseline clusters.
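The traffic split described above can be expressed as a small helper. This is a simplified sketch under the assumption that the baseline cluster mirrors the canary's share and production keeps the remainder.

```python
def traffic_split(canary_percent):
    """Return (production, baseline, canary) traffic percentages."""
    baseline = canary_percent  # baseline mirrors the canary's share
    production = 100 - canary_percent - baseline
    return production, baseline, canary_percent

# With 5% on the canary, 5% goes to the baseline and 90% stays on production.
print(traffic_split(5))  # (90, 5, 5)
```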

Canary lifecycle

Metric Collection (Retrieval)

In Armory’s canary pipeline stage, DevOps teams can specify the metrics to check and their sources. Armory supports Stackdriver, Prometheus, Datadog, SignalFx, and New Relic. When Armory uses different sources, it combines the various metrics into a single analysis. However, it’s considered best practice to avoid adding too many metrics in one group.

Armory retrieves these metrics from the baseline and canary clusters, tags them (baseline or canary), and stores them in a time-series database. It then passes the results to the canary judge for analysis.

The judgment stage compares the baseline and canary results from the collection stage, individually evaluates each metric, and performs statistical tests. The output is an aggregate score ranging from 0 to 100. Based on configurable thresholds, this score falls into three categories: pass (promote the canary), marginal (flag for human review), and fail (roll back).

The judgment stage has four main steps: data validation, data cleaning, metric comparison, and score computation.

The final score is calculated as:

(pass metrics / total number of metrics) * 100

Armory then uses this score to decide whether to promote or roll back the deployment.
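As a worked example of the score formula, the sketch below computes the share of passing metrics scaled to 0 to 100 and maps it to an outcome. The promote and review thresholds here are hypothetical; in practice they are configured per pipeline.

```python
def canary_score(passed, total):
    """Aggregate score: passing metrics as a percentage of all metrics."""
    return passed * 100 / total

def verdict(score, promote_at=95, review_at=75):
    """Map a score to promote / manual review / fail (assumed thresholds)."""
    if score >= promote_at:
        return "promote"
    if score >= review_at:
        return "manual review"
    return "fail"

score = canary_score(passed=18, total=20)
print(score)           # 90.0
print(verdict(score))  # manual review
```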

Judgment flow

Next Steps

Contact Armory today for a complimentary assessment of your software delivery practices and learn more about how your organization can benefit from safe, reliable deployments.
