Jun 16, 2017 by DROdio
Andy Glover recently gave a Spinnaker Deep Dive presentation at an LA DevOps Meetup at Tinder HQ. Here are some of the global software delivery best practices Andy shared in his presentation:
Even though Spinnaker (the Continuous Delivery & Infrastructure platform open-sourced by Netflix) facilitates simultaneous regional deployments, they don’t recommend doing them.
In the event of a bad deployment — be it a software bug or infrastructure issue — keeping deployments focused to a region means your blast radius is limited (rather than it affecting the entire globe).
How to leverage Spinnaker to employ this best practice: The screenshot below shows a deployment pipeline in Spinnaker for a Netflix app called “api,” which is dedicated to deploying to US-East-1. What’s more, this particular pipeline also make use of an additional best practice (described below) by using a deployment window.
You can also see a separate pipeline for a different region, US-West-2.
NOTE: Spinnaker enables one pipeline to kick a second pipeline off — so in this example, an application developer could easily trigger the US-West-2 pipeline to run once the US-East-1 pipeline completes successfully.
Red/Black (also commonly referred to as “blue/green”) deployments take advantage of the key concept of immutable infrastructure and AWS elasticity, where the next version of a particular software package is stood up in a new Auto Scaling Group (ASG). Once health checks pass for this new ASG, traffic is routed to it and the previous ASG is disabled. The benefit of this style of deployment is that you can easily and rapidly rollback to the previous ASG in the event of a failure.
How to leverage Spinnaker to employ this best practice: In this screenshot, Spinnaker has Red/Blacked a new ASG into EU-West-1, and moved the previous to a disabled state.
In the event of an issue with an ASG, it accordingly becomes extremely easy with Spinnaker to rollback either via automation or even manually:
Netflix regional traffic is fairly cyclic. For the most part, people in any particular region tend to watch more Netflix during the evening hours. For instance, in the evenings when US-East is beginning to see peak traffic, US-West traffic is still largely in its trough (as most people on the west coast are at work). Accordingly, service teams at Netflix can take advantage of the trough in any particular region and deploy during that time.
How to leverage Spinnaker to employ this best practice: In this case, you can see that this particular pipeline in Spinnaker will only deploy to US-East from 10am to 2pm Pacific time.
Netflix has a sophisticated telemetry platform that allows them to compare two different versions of running software. They call this Automated Canary Analysis (ACA). With its ACA platform, Netflix can compare two different versions of software taking production traffic. It’s a great last line of defense to ensure things are working well before opening the flood gates.
How to leverage Spinnaker to employ this best practice: The open-source version of Spinnaker does not yet have ACA built in, but it’s a priority that the community is working to deliver. Spinnaker does, however, enable pipelines to be configured with canaries out of the box.
Below is a screenshot to illustrate how ACA works within Netflix, and the kind of functionality that will be coming to the open-source version of Spinnaker. The pipeline below shows a production push that includes an ACA step. In this case, the ACA for a newly deployed service was scored at 93. ACA scores are between 1 to 100, and Spinnaker allows the service owner to define the “go/no-go” threshold for the pipeline to continue running or stop automatically. If ACA reports a score below that threshold, that pipeline stage is considered a failure and the overall pipeline is halted. The service owner can also pre-define a failure path in Spinnaker to rollback a deployment, for instance.
Part of the value of having a “paved path” via Spinnaker is deep integration. In the case of ACA, the service owner can easily view a detailed ACA report should s/he want to get more information as to why a particular ACA was scored:
Invariably, there are one-off infrastructure management tasks that need to be done from time-to-time. These could be emergency fixes or even occasional updates to infrastructure, for example. It’s easy to overlook automating these tasks — and that’s bad, because manual tasks tend to create towers of knowledge. By codifying this automation in pipelines, anyone can run them with the benefit of consistency.
How to leverage Spinnaker to employ this best practice: With Spinnaker, service owners can create pipelines that can be used in an emergency situation to deploy to any region (via a parameter).
Here’s another pipeline in Spinnaker that can set a runtime environmental property, again based upon a parameter fed into this pipeline:
Email is a terrible way to get someone’s attention. This means that a company’s “code red” emails requiring urgent action can get inadvertently missed. When it comes to notifying people of important events or actions being required (such as deployments or manual stages) Netflix recommends using alternate channels, like Slack.
How to leverage Spinnaker to employ this best practice: In the screenshot below, a pipeline in Spinnaker uses email and Slack to notify relevant people when it starts, when it completes, and most importantly, if it fails. Notice in the event of a failure, this pipeline is specifically configured to notify a support channel in Slack.
Here’s a notification that requires a human to take action — In this case, to approve a deployment to a particular environment. Below that is a Slack notification that a deployment into production completed successfully.
The machines haven’t completely replaced all of us! Sometimes the judgement of a human is needed in a deployment pipeline.
How to leverage Spinnaker to employ this best practice: Pipelines in Spinnaker can leverage a specific stage called “Manual Judgement” where a human must manually initiate a positive or negative acknowledgment before the pipeline continues.
The screenshot below shows a Manual Judgement stage following an Automated Canary Analysis stage. In the manual judgement stage, there’s an actual button for either stopping the pipeline, or continuing it. This particular pipeline is using this stage to pause the pipeline and let a human review it before it proceeds to a production deployment.
Things change — entropy is a thing. Therefore, guard against it.
How to leverage Spinnaker to employ this best practice: The epitome of recognizing entropy and not assuming reliability is Netflix’s Chaos Monkey, a resiliency tool that helps applications tolerate random instance failures by randomly killing instances. Chaos Monkey is tightly integrated into Spinnaker but also into the ethos of Netflix’s culture that demands service reliability.
Another way to “not assume” is to only trigger pipelines when you know people will be in the office, ready to handle any issues. Don’t assume that by deploying at 9pm on Friday night, your team will be happy to get paged at 3am in the morning on Saturday to handle a failure scenario — because they most definitely won’t.
Spinnaker pipelines can be triggered using a cron expression — but note, in this case, the cron expression excludes the weekend. This particular pipeline is deploying into production, and this team wants to be sure someone is around in the event of a problem (and they want to enjoy their weekends!).
Here’s a pipeline taking advantage of a precondition stage within Spinnaker. The execution of a pipeline can take a long time, especially if it’s waiting on a deployment window. And in those scenarios, the underlying cloud infrastructure might have changed. For example, someone might have manually deployed a new ASG for that application. Consequently, in those cases, it’s prudent to use a precondition stage to ensure things are as they should be before taking some action.
Netflix believes strongly in the notion of immutable infrastructure. Consequently, when service owners create an AMI for deployment, they do it once and promote that same AMI through environments rather than creating one for each. This ensures consistency. Consequently, in Spinnaker you can use a “Find AMI” stage to pull an AMI from a test environment and push it forward into a production one.
Here’s another “Find AMI” before an ACA:
Even if you follow every best practice here, failure will eventually happen. So you might as well make it as easy as possible to reduce the time to fix the issue.
How to leverage Spinnaker to employ this best practice: Spinnaker tightly integrates with PagerDuty. In the event of an issue with a particular application, Spinnaker makes it it easy to page that application’s on-call person. In fact, a PagerDuty key can be required when defining an application.
Spinnaker’s tight integration with PageDuty makes it super easy to link your application with a PagerDuty key:
And since manually entering in a PagerDuty Service key is error prone, Spinnaker exposes the linkage between PagerDuty service keys and apps, allowing service owners to select the corresponding service name:
In summary, if you want to rapidly deliver software with confidence across multiple AWS regions:
Want to see Andy share these best-practices in person? Watch the video of Andy’s talk.
Spinnaker is the most powerful continuous delivery tool on the market. DevOps engineers and developers recognize this power and are looking to use Spinnaker as a foundational tool in their Continuous Integration and Continuous Delivery (CI/CD) process for hybrid and multi-cloud deployments. Such a powerful, expansive open source tool needs expertise within your organization to […]
Read more →
Today, Armory is excited to announce the availability of the GitHub Action for Armory Continuous Deployment-as-a-Service. GitHub is where developers shape the future of software. After a developer writes and tests their code in GitHub, it must be deployed. Armory’s GitHub Action for Continuous Deployment-as-a-Service extends the best-in-class deployment capabilities to Kubernetes. CD-as-a-Service enables declarative […]
Read more →
Call me Pollyanna, but what a great time to be a Platform or DevOps engineer. If you’re working in a public company, the S&P is off ~20% year over year, so the value of your RSUs has wilted. If you’re working in a private company, venture funding and M&A velocity are anemic, making expansion capital […]
Read more →