Feature Flags vs Canaries
May 2, 2017 by Armory
Feature flags and canaries have different objectives despite appearing similar: feature flags expose a specific feature to a sample set of users and then measure its impact, while canaries expose a specific version of the entire application to a sample set of users and then measure its impact.
So what’s the difference?
Here’s the high-level:
Feature Flags
- Generally used for user-visible features
- Great for well-defined user experiments that span longer periods of time
- Testing against cohorts of users (e.g., testing features by location, registration date, etc.)
- Becomes harder to test over time, as more feature flags mean more permutations of code paths
- Adds overhead for engineers, who must remember to add flags, turn them on, measure results, and then turn them off when needed
- Not well suited to code changes that affect large portions of the code base
Canaries
- Gives you information about every release that goes out
- Spans a single deployment which is typically shorter than a feature/user experiment
- Works well when you ship small diffs and are quickly changing production code
- Almost no overhead when the process is fully automated
Most companies use feature flags to understand the impact of a specific feature on user- or business-level metrics. For instance, how does this new auditing feature impact our user engagement, positively or negatively?
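As a rough sketch of what such a flag check might look like in application code (the flag store, rollout rules, and names below are hypothetical, not any particular product's API):

```python
import hashlib

# Hypothetical in-memory flag definitions; in practice these would come
# from a flag service or dynamic configuration store.
FLAGS = {
    "new_auditing_feature": {
        "enabled": True,
        "rollout_percent": 10,               # expose to 10% of users
        "allowed_countries": {"US", "CA"},   # cohort rule: location
    }
}

def bucket(user_id: str, flag_name: str) -> int:
    """Deterministically map a user to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag_name: str, user_id: str, country: str) -> bool:
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    if country not in flag["allowed_countries"]:
        return False
    return bucket(user_id, flag_name) < flag["rollout_percent"]

# Usage: branch in application code, and record the exposure so the
# impact on engagement metrics can be measured against the control group.
if is_enabled("new_auditing_feature", user_id="u123", country="US"):
    ...  # render the new auditing feature
```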
On the opposite end of the spectrum are canaries. When you execute a canary deployment, you’re seeking consistency and looking for metrics to remain within a standard deviation of the baseline. When you detect an anomaly, you typically roll back the change. You’re also looking at signals like system- and network-level metrics to determine whether the deployment was a success.
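A minimal sketch of that kind of check, assuming you already collect per-minute metric samples from both the baseline fleet and the canary (the metric names and threshold below are made up for illustration):

```python
import statistics

def canary_metric_ok(baseline_samples, canary_samples, num_stddevs=2.0):
    """Flag the canary as anomalous if its mean drifts more than
    `num_stddevs` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline_samples)
    stddev = statistics.stdev(baseline_samples)
    return abs(statistics.mean(canary_samples) - mean) <= num_stddevs * stddev

# Example: per-minute error rates from the baseline fleet vs. the canary.
baseline_error_rates = [0.8, 1.1, 0.9, 1.0, 1.2, 0.9]
canary_error_rates = [2.4, 2.9, 2.7]

if not canary_metric_ok(baseline_error_rates, canary_error_rates):
    print("Anomaly detected: roll back the canary")
```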
Canaries: Best Practices & Tools
In practice, canaries are most commonly used through manual processes: an instance is manually rolled forward to the new version while a team watches multiple metrics dashboards for the instance’s health. If all goes well, the team may choose to deploy to a few more servers, then repeat the process. The downside of this approach is that it is hard to determine which metrics matter and what to do in the event of failure.
During a deployment, hours are wasted just agreeing that there is an issue, and even more time is spent deciding what to do about it. The journey to automating this process is further complicated by the assumption that the underlying mechanism to deploy, and to roll back, is reproducible, safe, and consistent. This is exactly why Netflix moved away from this style of deployment to an automated method. Their automated systems look at over 1,000 metrics in real time when deploying the Netflix API!
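Netflix’s real canary analysis is far more sophisticated, but the rollup step can be sketched roughly like this: compare each metric on the canary against the baseline, then reduce the results to a single score (the metric names and threshold here are illustrative only):

```python
def canary_score(metric_results):
    """metric_results: mapping of metric name -> True (healthy) / False (anomalous).
    Returns the fraction of metrics that look healthy."""
    if not metric_results:
        return 0.0
    return sum(metric_results.values()) / len(metric_results)

# In a real system each entry would come from a statistical comparison
# of canary vs. baseline time series (latency, error rate, CPU, etc.).
results = {"p99_latency": True, "error_rate": False, "cpu_util": True}

score = canary_score(results)
if score < 0.95:  # the threshold is a policy decision, not a universal constant
    print(f"Score {score:.2f} below threshold: fail the canary and roll back")
```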
Consistent Deployments
The best way to achieve these goals is to use immutable infrastructure: it limits the risk of external resources being unavailable at run time, and it builds confidence in the immutable package itself. Of course, we believe Spinnaker is the best way to get a solid initial deployment. It becomes the basis of any deployment, since you can rely on deploying an AMI and keeping application configuration logic encapsulated away from deployment logic.
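As a rough illustration of that separation (a simplified stand-in, not Spinnaker’s actual pipeline schema), a pipeline built around an immutable image might be described like this:

```python
# Simplified, illustrative pipeline definition: bake an immutable image once,
# then every stage deploys that exact artifact. Field names are hypothetical
# and do not follow Spinnaker's real pipeline JSON schema.
pipeline = {
    "name": "deploy-my-service",
    "stages": [
        {"type": "bake",    # build an immutable AMI from the release artifact
         "package": "my-service-1.42.0"},
        {"type": "canary",  # deploy the baked image to a small canary cluster
         "cluster": "my-service-canary", "capacity": 1},
        {"type": "deploy",  # promote the same image to the full fleet
         "cluster": "my-service-prod", "strategy": "red/black"},
    ],
}
```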
Once you have a predictable deployment model, the next step is to instrument the applications and environments. To make a deployment intelligent, the system needs plenty of information to make good decisions. While unit tests and regression/integration tests provide a solid base of confidence in a deployment, other tools give you better ROI for gaining additional confidence.
More Focus on Instrumentation, Less on Integration Testing
When asked what makes a deployment “safe,” the easy answer is to add more integration tests. While you can theoretically reach 100% coverage with integration and regression tests, the cost of doing so is very high. After a core set of integration tests, each additional test provides diminishing value but costs just as much to create and maintain. In most cases, teams end up maintaining a whole application (the integration suite) just to test their applications, because their integration tests become so complex. Maintaining integration tests gets harder over time, and soon the tests fall behind as the application continues to evolve, leaving it with no effective coverage.
An alternative to this approach is to use canaries, but in most cases applications and systems don’t have enough instrumentation in place to signal an automated control system (or a human) that the deployment is heading for failure. Canary-ing is now a common deployment strategy within newer PaaS frameworks like Kubernetes; however, neither Kubernetes nor any other system knows what a “good deployment” is, because that is specific to your application and organization. This is why instrumentation is so critical to the success of your deployments.
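For example, even a small amount of instrumentation around request handling gives an automated control system (or a human) something concrete to judge a deployment by. The `emit_metric` helper below is hypothetical; in a real service it would forward to whatever metrics backend you use (statsd, Prometheus, Atlas, and so on):

```python
import time

def emit_metric(name, value, tags=None):
    """Hypothetical helper: stands in for a push to your metrics backend."""
    print(f"metric {name}={value} tags={tags or {}}")

def do_work(request):
    return "ok"  # stand-in for the actual business logic

def handle_request(request):
    start = time.monotonic()
    try:
        response = do_work(request)
        emit_metric("requests.success", 1)
        return response
    except Exception:
        emit_metric("requests.error", 1)
        raise
    finally:
        emit_metric("requests.latency_ms", (time.monotonic() - start) * 1000)

handle_request({"path": "/health"})
```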
Feature Flags: Best Practices & Tools
Feature flags are a popular method at companies like Facebook: a single version of Facebook can ship with hundreds of features, some of which are exposed only to a sample of the population and measured against a control group, so the product team can decide whether each feature should be exposed to the entire population.
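A rough sketch of the measurement step is below: a two-proportion z-test comparing a conversion metric between the control group and the flagged group. The numbers are purely illustrative:

```python
import math

def two_proportion_z_test(conversions_a, total_a, conversions_b, total_b):
    """Two-tailed z-test for the difference between two conversion rates."""
    p_a = conversions_a / total_a
    p_b = conversions_b / total_b
    p_pool = (conversions_a + conversions_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p-value
    return z, p_value

# Control group vs. users who saw the flagged feature (illustrative numbers).
z, p = two_proportion_z_test(480, 10_000, 545, 10_000)
print(f"z={z:.2f}, p={p:.3f}")  # e.g. p < 0.05 -> treat the lift as significant
```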
Within companies like Facebook and Netflix, entire teams work on services that handle statistical significance, bias, control groups, and the results of feature experiments. While much of this feature-flag intelligence has been kept proprietary at both companies, Netflix has open sourced a tool called Archaius, which provides dynamic configuration of applications and can be used to expose features at run time.
Archaius is analogous to Consul, which also allows for dynamic configuration. These tools aid in changing configuration, but they will not help you analyze those changes and make the necessary decisions.
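At its simplest, dynamic configuration means re-reading a property at run time instead of baking it into the deployment. A minimal sketch, with a hypothetical `fetch` callable standing in for an Archaius or Consul client, might look like:

```python
import time
import threading

class DynamicProperty:
    """Minimal stand-in for what Archaius or a Consul client provides:
    a value that is periodically refreshed from a configuration store."""

    def __init__(self, fetch, key, default, refresh_seconds=30):
        self._fetch = fetch          # hypothetical callable: key -> value or None
        self._key = key
        self._value = default
        self._refresh_seconds = refresh_seconds
        threading.Thread(target=self._poll, daemon=True).start()

    def _poll(self):
        while True:
            new_value = self._fetch(self._key)
            if new_value is not None:
                self._value = new_value
            time.sleep(self._refresh_seconds)

    def get(self):
        return self._value

# Usage: flip "feature.audit.enabled" in the config store and running
# instances pick it up without a redeploy.
audit_enabled = DynamicProperty(lambda key: None, "feature.audit.enabled", default=False)
if audit_enabled.get():
    pass  # expose the feature
```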
There are also off-the-shelf tools that help with feature flagging and measuring the results: tools like Optimizely and LaunchDarkly let developers hide or modify features, run experiments against a small population, and measure the results. This gives the product team the information to decide whether to move forward with 100% exposure of the feature.
Optimizely is well suited for web-based features and experiments, and it allows anybody to easily create new experiments through its UI. It also handles the analysis, calculating results in its dashboard.
LaunchDarkly is “feature flags as a service.” It helps developers release features independently of deployments: as long as the feature code is already deployed, it can be hidden from users until the product team is ready to release it and measure the results. This tool is typically integrated with your code at the API level, but it also provides an experiments framework.
The Right Tools for the Right Problem
Ultimately, feature flags and canary-ing techniques set out to solve two different problems. Both are part of a greater product and engineering strategy to achieve an amazing customer experience. Using each one where appropriate will allow your teams to deploy faster with greater levels of confidence. Here are some guidelines to contrast feature flags and canaries.