Reduce the Blast Radius of a Bad Deployment with Automated Canary Analysis

May 23, 2022 by Stephen Atwell

Software deployment processes differ across organizations, teams, and applications. The most basic, and perhaps the riskiest, is the “big bang deployment.” This strategy updates all nodes within the target environment simultaneously with the new software version.

This deployment strategy can cause downtime or degraded service while the update is in progress. It also makes rollbacks difficult when problems arise in production.

A big bang deployment can have devastating effects. For example, it can raise issues in production or even take down the system. Since rollbacks aren’t easy, production issues can cause a prolonged, catastrophic outage.

Overall, a big bang deployment throttles the delivery pipeline, causing deployments, bug fixes, and rollbacks to take too long and be too complex.

Big bang deployment

Although bad deployments cause challenges, automated canary analysis helps. Let’s explore how to reduce the blast radius of a bad deployment using Armory — based on open source Spinnaker — and Kayenta for automated canary analysis.

Safe and Reliable Deployments with Armory

Releasing lousy software can have a long-lasting, damaging impact on any business. Armory helps you overcome this by empowering developers with tools for automated, intelligent software delivery. Intelligent software delivery prevents large-scale outages and minimizes the blast radius of bad deployments.

DevOps teams can define Armory pipelines using templates to encourage reuse and continuous delivery (CD). These pipelines enforce best practices from Netflix, Google, and the open-source software (OSS) community. They also reduce the risk of bad deployments and large-scale outages.

Armory minimizes the impact of bad deployments using a 1-click rollback when a new deployment causes errors. Our platform also uses the automated canary analysis technique to test new releases before deploying them to production.

With this technique, the new changes deploy to a small set of users before the full release. The software automatically promotes or fails the deployment based on predefined metrics.

What is Canary Deployment?

The idea of canary deployment is based on canaries in coal mines. Miners took these birds into the mine to measure the amount of toxic gas present. Canaries are more sensitive to dangerous gases than humans, so miners would know hazardous gases were likely present if a canary died inside the mine.

Software deployment uses a version of this strategy. Instead of birds, the canary is the new software version. DevOps teams roll out a new application version in stages, deploying it to a small subset of the production infrastructure and enabling the latest version for a small set of users who help identify issues.

When ready, DevOps teams deploy the software version to a larger subset of the infrastructure and a larger set of users, and so on, until the rollout is complete. This strategy dramatically reduces the risk associated with deploying a new software version into production.

Canary Deployment

Manual Versus Automated Canary Analysis

When a DevOps team manually performs this deployment strategy, they deploy a new version of the application to a small subset of production servers and shift a small percentage of traffic to the latest version. The rest of the servers and traffic remain unchanged.

The team then looks at graphs and logs and monitors multiple metrics to determine the server health with the new version. If the results are within acceptable values, then the team deploys to a larger set of servers and traffic and repeats the process. Otherwise, they roll back the deployment and route all the traffic to the stable servers.
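
The loop described above can be sketched in a few lines of Python. This is only an illustration of the promote-or-roll-back logic, not how Armory or Spinnaker implements it; the stage percentages, the `is_healthy` callback, and the function name are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(stages: Sequence[int], is_healthy: Callable[[int], bool]) -> int:
    """Walk through traffic percentages, stopping at the first unhealthy stage.

    Returns the percentage of traffic left on the new version:
    the last stage reached if every check passed, or 0 after a rollback.
    """
    current = 0
    for percent in stages:
        # Shift `percent` of traffic to the new version, then check its health.
        if not is_healthy(percent):
            # Roll back: route all traffic to the stable servers.
            return 0
        current = percent
    return current


# Hypothetical health check: the new version misbehaves above 25% traffic.
result = staged_rollout([5, 25, 50, 100], lambda percent: percent <= 25)
print(result)
```

In a manual process, a person plays the role of `is_healthy` at every stage, which is where the time and the risk of human error come from.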

As we can see, manual canary analysis is tedious and time-consuming. Moreover, it doesn’t scale well when deploying to multiple servers several times a week or day and is prone to human error.

Automated canary analysis overcomes these problems through, well, automation. Fetching metrics and running statistical tests automatically is more accurate and less time-consuming than doing so by hand.

Automated Canary Analysis with Spinnaker and Kayenta

The automated canary analysis (ACA) platform Kayenta integrates with Armory, which is based on the open-source multi-cloud continuous delivery platform Spinnaker. You can set up an automated canary analysis stage in a Spinnaker pipeline and use Kayenta to assess the canary’s risk. Kayenta fetches user-configured metrics, runs statistical tests, checks for degradation between the new and old versions, and provides an aggregate score for the canary.

Based on the score, Kayenta automatically judges whether to promote or fail the canary or prompt human intervention. It does this in two phases: metric collection and judgment.

Armory implements canary analysis by running three clusters in parallel: production, canary, and baseline. Comparing the production cluster with the canary deployment wouldn’t produce reliable results because of long-running process effects. For this reason, Armory creates a baseline cluster where it deploys the application’s production version.

The platform then compares the canary cluster against the baseline cluster. The canary and baseline clusters each receive a small percentage of the traffic, while the rest goes to the production cluster. Armory then handles the lifecycle of the canary and baseline clusters.
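
To make the three-way traffic split concrete, here is a small Python simulation. The 90/5/5 weights, the cluster names as dictionary keys, and the `route` function are assumptions for illustration; the actual percentages depend on how you configure your pipeline.

```python
import random
from collections import Counter

# Hypothetical traffic weights (percent); the baseline and canary get equal,
# small shares here so that their metrics are comparable.
WEIGHTS = {"production": 90, "baseline": 5, "canary": 5}


def route(rng: random.Random) -> str:
    """Pick a cluster for one request according to WEIGHTS."""
    clusters = list(WEIGHTS)
    return rng.choices(clusters, weights=[WEIGHTS[c] for c in clusters], k=1)[0]


rng = random.Random(42)  # fixed seed so the example is repeatable
counts = Counter(route(rng) for _ in range(10_000))
print(counts)
```

Because the baseline cluster is freshly created and receives the same share of traffic as the canary in this sketch, differences between the two are more likely to come from the new version than from cluster age or load.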

Canary lifecycle

Metric Collection (Retrieval)

In Armory’s canary pipeline stage, DevOps teams can specify the metrics to check and their sources. Armory supports Stackdriver, Prometheus, Datadog, SignalFx, and New Relic. When Armory uses different sources, it combines the various metrics into a single analysis. However, it’s considered best practice to avoid adding too many metrics in one group.

Armory retrieves these metrics from the baseline and canary clusters, tags them (baseline or canary), and stores them in a time-series database. It then passes the results to the canary judge for analysis.

Judgment

The judgment stage compares the baseline and canary results from the collection stage, individually evaluates each metric, and performs statistical tests. The output is an aggregate score ranging from 0 to 100. This score falls into three categories:

  • Success: The judge promotes the canary to production.
  • Marginal: The judge needs human intervention to make a decision.
  • Failure: The judge recommends stopping the pipeline, performing a rollback, and directing traffic to production.
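
As a rough illustration, the mapping from score to category might look like the following Python sketch. The cutoff values (50 and 75) and the function name are hypothetical; the article does not state the cutoffs, and in practice you would choose them for your own pipeline.

```python
def classify_score(score: float, marginal: float = 50.0, success: float = 75.0) -> str:
    """Map an aggregate canary score (0 to 100) to one of three categories.

    `marginal` and `success` are hypothetical cutoffs chosen for this example.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= success:
        return "success"   # promote the canary
    if score >= marginal:
        return "marginal"  # ask a human to decide
    return "failure"       # stop the pipeline and roll back


for value in (90, 60, 10):
    print(value, classify_score(value))
```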

The judgment stage has four main steps:

  • Data validation: This ensures there is valid data for the baseline and canary metrics before analysis. If the required data is not available for the baseline, the canary, or both, Armory labels the metric NODATA and moves the analysis to the next metric.
  • Data cleaning: This prepares the raw data for comparison. This preparation includes handling and sanitizing missing values and removing outliers.
  • Metric comparison: This compares the canary and the baseline for each metric. It classifies each metric as pass, high, or low, indicating the difference between the canary and the baseline.
  • Score computation: This computes the final score based on the metric classifications. The score is the number of pass metrics divided by the total number of metrics, expressed as a percentage.

The calculation for the final score is:

(pass metrics / total number of metrics) * 100

Armory then uses this score to decide whether to promote or roll back the deployment.
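
The four judgment steps and the score formula above can be sketched in Python as follows. This is a simplified illustration, not Kayenta's judge: the comparison uses a relative difference of medians with a hypothetical 10% tolerance as a stand-in for a real statistical test, outlier removal is omitted, and NODATA metrics are counted in the total, which is an assumption. All function and metric names are made up for the example.

```python
import math
from statistics import median
from typing import Dict, List, Optional, Sequence

Series = Sequence[Optional[float]]


def clean(values: Series) -> List[float]:
    """Data cleaning: drop missing values (None or NaN)."""
    return [v for v in values if v is not None and not math.isnan(v)]


def compare(baseline: Series, canary: Series, tolerance: float = 0.10) -> str:
    """Classify one metric as pass, high, low, or NODATA.

    Stand-in comparison: relative difference of medians against a
    hypothetical tolerance, not a real statistical test.
    """
    base, can = clean(baseline), clean(canary)
    # Data validation: both sides need data before they can be compared.
    if not base or not can:
        return "NODATA"
    base_med, can_med = median(base), median(can)
    scale = abs(base_med) or 1.0  # avoid dividing by zero
    change = (can_med - base_med) / scale
    if change > tolerance:
        return "high"
    if change < -tolerance:
        return "low"
    return "pass"


def score(results: Dict[str, str]) -> float:
    """Score computation: (pass metrics / total number of metrics) * 100."""
    if not results:
        raise ValueError("no metrics to score")
    passed = sum(1 for r in results.values() if r == "pass")
    return passed / len(results) * 100


# Hypothetical samples: (baseline values, canary values) per metric.
metrics = {
    "latency_ms": ([100, 102, 98, None], [101, 99, 103]),
    "error_rate": ([0.01, 0.01, 0.02], [0.05, 0.06, 0.04]),
    "cpu": ([40, 42, 41], [41, 40, 43]),
    "queue_depth": ([], [3, 4]),
}
results = {name: compare(b, c) for name, (b, c) in metrics.items()}
print(results, score(results))
```

In this example two of the four metrics pass, so the score is 50.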

Judgement flow

Automated Canary Analysis using Armory Project Borealis

In addition to its Spinnaker-based platform, Armory has a new lightweight SaaS offering, Project Borealis, that supports canary deployments to Kubernetes with automated canary analysis. It lets you trigger a canary deployment from your existing deployment tooling and provides a CLI that is a drop-in replacement for kubectl.

Borealis allows you to specify a set of queries for your metric provider, and thresholds for each query. If a metric comes back outside of the threshold range, the analysis is considered a failure. This makes it simple to, for example, require that CPU usage is below X% and memory usage is below Y% during an analysis. Project Borealis provides a prescriptive canary strategy that can deploy pods to the new version one at a time, and check all metrics over a given amount of time before continuing to deploy to additional pods. This allows you to check the health of each individual pod as each one is provisioned, and trigger a rollback if any problems are detected.
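
The threshold idea can be illustrated with a short Python sketch. This is not Borealis configuration or its API; the `Threshold` class, the query names, the inclusive bounds, and the choice to treat missing data as a failure are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Threshold:
    query: str                     # name of a metric query (hypothetical)
    lower: Optional[float] = None  # inclusive lower bound, if any
    upper: Optional[float] = None  # inclusive upper bound, if any

    def allows(self, value: float) -> bool:
        if self.lower is not None and value < self.lower:
            return False
        if self.upper is not None and value > self.upper:
            return False
        return True


def analyze(thresholds: List[Threshold], samples: Dict[str, List[float]]) -> bool:
    """Return True only if every sample of every query stays within range."""
    for t in thresholds:
        values = samples.get(t.query, [])
        if not values:  # treat missing data as a failure (an assumption)
            return False
        if not all(t.allows(v) for v in values):
            return False
    return True


def roll_out_pods(pod_samples: List[Dict[str, List[float]]],
                  thresholds: List[Threshold]) -> str:
    """Check pods one at a time; report a rollback on the first failed analysis."""
    for index, samples in enumerate(pod_samples, start=1):
        if not analyze(thresholds, samples):
            return f"rollback after pod {index}"
    return "complete"


thresholds = [Threshold("cpu_percent", upper=80), Threshold("memory_percent", upper=70)]
pods = [
    {"cpu_percent": [35, 40], "memory_percent": [50, 55]},
    {"cpu_percent": [45, 92], "memory_percent": [52, 54]},  # CPU spikes past 80
]
status = roll_out_pods(pods, thresholds)
print(status)
```

Checking each pod before moving on means a problem is caught after a single pod is affected rather than after the whole deployment has been replaced.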

In addition to automated canary analysis, Project Borealis also supports running your existing custom automation during a canary deployment strategy. You can use this, for example, to issue a query against Elasticsearch to check your logs for errors. If errors occur, Borealis rolls back the change; if there are no errors, it finishes scaling up traffic.

Armory is still seeking additional design partners to help shape Project Borealis, but spots are limited and filling up fast, so sign up today.

Next Steps

Big bang deployments can cause production issues, especially for large projects. Fortunately, Kayenta and Armory’s version of Spinnaker overcome these issues using automated canary analysis. This integration enables developers to quickly deploy changes to production with minimal risk.

Integrate Kayenta into Armory to benefit from safe, intelligent canary deployment and increased productivity. If your new software version does cause an issue in production, the blast radius will be much smaller, and metrics and automation will help you roll back more quickly. A smaller blast radius means less mess and happier customers.

Contact Armory today for a complimentary assessment of your software delivery practices and learn more about how your organization can benefit from safe, reliable deployments.
