Best Practices for Continuous Delivery on GCP: A Deep Dive on how Google Leverages Spinnaker
Mar 11, 2017 by DROdio
I just got back from Steven Kim and Tom Feiner’s talk, “Best practices for Continuous Delivery on Google Cloud Platform.” What an incredible deep dive into Spinnaker. This is the single best talk I’ve seen to describe the pain points Spinnaker is solving, told from Tom’s perspective as a Systems Operations Engineer at Waze, and from Steven’s as a Software Engineering Manager at Google.
Steven dove into four First Principles, including a great explanation of immutable infrastructure. I also highly recommend Nir’s talk about how Waze uses Spinnaker for multi-cloud deployments between AWS and GCP.
Here’s the talk summary:
“Continuous Delivery (and Deploy) is a new look at how we should get our deployable artifacts into production. Google’s been doing this for quite some time with success, and we have opinions and corresponding best practices based on lessons we’ve learned. In this session, we’ll share how Google Cloud Platform users can best take advantage of our tools for continuous delivery to Kubernetes/GKE, GCE and App Engine runtimes, with examples of how Waze operates in this realm.”
The amount of interest at Next for Steven’s talk was incredible: there was a queue of attendees waiting to enter that wrapped around itself three times.
Here’s a transcript of the talk:
Tom: We’ve all been there. You’re on call, in charge of an entire production system such as this. It is peak traffic on the East Coast, but you’re on the other side of the world, sleeping like a baby. You have no idea what’s about to hit you. In my nightmares, here’s what I see: The [pager – 00:38] goes off saying something like this. You wipe the sleep from your eyes and grab a computer to investigate. You’re still half asleep and have no idea what’s really going on. Traffic jams are forming all around the area. Users have no idea where to go to avoid traffic. Time to start debugging. Millions of people are counting on you. You open the CPU graph for this server group. As you know, this service is CPU-bound. And it all looks normal. Auto-scaling did its thing. It was set to keep the CPU under 45%, and that’s exactly what it did. The number of instances doubled when the CPU started to climb. So why isn’t the service working? You feel a cold sweat forming. This is not going to end well. All instances in this server group are currently completely failing their health checks. You connect to a sample server and you see that it’s at 100% CPU. Why? [inaudible – 01:46] and nothing comes to you. You can’t see anything. So you restart the application, look at the logs just to see the startup sequence. See if you can find anything. But it all looks fine. The dependencies are loading. The application starts fine. Yet once it’s started, CPU jumps right back to 100%. A bad feeling creeps up your spine at this point.
You connect to another server, and this one is completely idle. How come? You look around and you can’t even find the application running there at all. What’s going on here? Was the CPU graph from earlier lying to me? CPU graphs usually don’t lie. You check it again and you realize it’s an average CPU graph. How did I miss that? It appears that some servers are running at 100% CPU, burning up, practically thrashing, not doing anything, while others are completely idle. So you ask yourself, “Am I in over my head?” [chuckles] Your heart tries to pound its way out of your chest. You realize you have no idea how to fix this. Then while looking at the logs, suddenly, the instance gets terminated, thrown out, and you can’t debug it anymore. In the UI, you see even more instances committing suicide, instances which were just started a few minutes ago. This doesn’t make any sense. The system scaled up only to kill itself in a loop. Come on. Is it time to admit failure? Wake up some more people from the team to help? You’re supposed to know what you’re doing here. You’re a professional. You think for a minute about what on earth could be causing this, and nothing comes to mind [chuckles].
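To make the averaging trap in Tom’s story concrete, here is a minimal sketch in Python. The CPU readings and the helper name summarize_cpu are invented for illustration; they are not from the talk or from any monitoring tool. The point is only that a fleet where some instances are pegged and the rest are idle can still report an average under a 45% target.

```python
from statistics import mean


def summarize_cpu(readings):
    """Return the fleet-wide average plus the min and max per-instance CPU (%)."""
    return {"avg": mean(readings), "min": min(readings), "max": max(readings)}


# Hypothetical fleet: four instances pegged at 100%, six completely idle.
fleet = [100, 100, 100, 100, 0, 0, 0, 0, 0, 0]
summary = summarize_cpu(fleet)

# The average alone looks healthy; the min/max spread shows the real picture.
print(summary)
```

Plotting the minimum and maximum (or a percentile) next to the average is one way to surface this kind of split earlier.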
Okay, okay, so you pick up the phone, about to call for help. You’re about to hit Send, and then it hits you. That’s auto healing. Auto healing is like the immune system. It’s supposed to replace unhealthy instances. But in this case, it’s all of them, so it’s doing much more harm than good. You dig through the console to try to find out where to shut down auto healing, wasting a few more precious minutes. You find another server which is still up and idle. Why is the application not running there? You keep combing the logs. You’re looking everywhere you can until you finally find this. Apparently, each server upon startup connects to a configuration server to download its configuration. Without it, it’s pretty much useless. So this configuration server apparently fell over and managed to do two things at the same time. One is taking down this entire production system with it, turning a perfectly nice auto-scaling group into a set of zombie servers which are committing suicide, and two is to give you a heart attack. Now that you understand the problem, you quickly fix the configuration server, scale up some more servers, and we’re back to normal.
Finally, you double-check just to make sure and go back to bed, patting yourself on the back for saving the day. You try to sleep but find it impossible. Something is still bothering you. Did we really save the day? We were down for 20 minutes, 20 minutes with millions of users frustrated. Then you realize your job is not to be a ninja, even though it’s fun sometimes. Your job is to prevent these kinds of failures in the first place. You remember hearing once about the immutable server pattern. And a light goes on in your head. If this fleet of servers was immutable and completely self-contained, all of this would’ve been avoided. Baking eliminates external dependencies such as this configuration server, improving the stability of the system, not to mention giving faster boot times.
So you wake up in the morning and run off to work to start baking immutable images.
Stephen: There’s an astronaut saying that in space, “There’s no problem so bad that you can’t make it worse.” And it [inaudible – 06:54] dealing with a level of complexity and pressure, like in this story, where you can quickly feel like you are in sinking sand. The complexity of rolling out software continues to increase: new architectures, micro services architecture, the scope of deployment, multi-regional deployments, and global deployments. And as a response, what we’ve done is we’ve introduced measures and processes that slow us down. And I think at this point, most of us don’t think that rolling out to production hourly, daily, weekly, monthly, quarterly is a good idea, let alone feasible.
Meanwhile, what’s on the table is early and frequent feedback that we’re doing the right things, frequent opportunities to pivot and change direction when we need to, and some side effects of doing more frequent rollouts, such as a smaller scope of change for each rollout, so there’s inherently less risk.
So there’s a lot of talk out there about continuous delivery. But we need to talk about really what’s important in terms of first principles behind continuous delivery and not just the tooling. And Google has been doing this for a long time. We have made our share of mistakes. We have lessons learned. And based on that, we have a set of opinions. So we want to share those lessons learned with you along with some tangible examples.
Continuous delivery can’t just be about better orchestration. It’s one of the things but that’s not it. All of you guys came and attended because you care about this. So my hope throughout this talk is to get us, the community, thinking about, talking about, and implementing the right first principles as we take on continuous delivery. My name is Stephen Kim. I’m an engineering manager out of New York City. And this is Tom Feiner. He’s a Systems Operations Engineer at Waze, if you didn’t guess. And we’re here to talk to you about continuous delivery and a tool called Spinnaker, which we want to walk you through.
Spinnaker is an open-source, multi-cloud, continuous delivery platform developed at Netflix. Google joined the project some three years ago. And along with other members of the community like Kenzan, we open-sourced in October, November of 2015. Google and Netflix, we found a kindred spirit in one another in the way we approach the kinds of problems that Tom described earlier.
What I want to do today is I want to share with you four first principles, and for each of these, walk through how they manifest at Waze to ground them a bit and give you some tangible examples. And at the end, hopefully, we’ll have enough time to walk through a simple demo.
The first principle is immutable infrastructure as Tom talked about. It means that once you deploy something, you don’t change it. When there’s change, you [inaudible – 10:37] it up from the host [app] or the container [app] and you tear down the old one altogether.
How many of you guys have not heard of immutable servers or immutable infrastructures? Okay, great, great. So it’s widely out there. Why do we want to do this? Why is immutable infrastructure important? We want our process of building, testing, deploying, and validating to be as deterministic as possible. We know that across environments, there is a lack of [permiticity – 11:10] where their configurations shift [inaudible – 11:13]. And as you go, deploy, promote, and move from environment to environment, that thing is changed. Just because you know that it worked in one environment – whether it be your developer local workstation or in an [inaudible – 11:25] staging or stress environment – there’s still a lot out there through which things can go wrong from that promotion to the last production stage.
So we want to go ahead and minimize changes slipping in as we go to move up the runway. And here’s the way it works. In practice, you bake. So every time you have a change, you bake an image using Packer. Or for those of you who are using configuration management software like Puppet and Chef, you run that as much as you can. And then you bake it into an image. In GCP, it’s GCE images. And then at deploy time, you go ahead and dole those out into GCE instances. So for those of you who are from the container world, this seems really familiar because it’s the same thing. You take your application binaries. You [inaudible – 12:17] run-time stacks. You define them in a Dockerfile, package [inaudible – 12:21] configuration file to a Docker image using GCR build, which we announced earlier last week, or your favorite service, which is Docker Hub [inaudible – 12:29]. And you go ahead and create Docker container images. And then you instantiate them using, for example, the Kubernetes scheduler into Pods or whatnot.
There are some other benefits as well to immutable infrastructure. If you remember, one of the early messages about the public cloud was elasticity. Elasticity has to do with elasticity over a long time window, in that you don’t have to go out and buy up front and eat the cost. But it also has to do with responsiveness to demand, spiky traffic, for example. Well, for those of us who might, for example, be instantiating a VM instance and running Chef or Puppet scripts on it, if that Puppet script takes 20 minutes or an hour and a half, you’re not taking advantage of elasticity. The demand has come and it’s gone. So there are some other side effects like that where you have faster startup times. You can respond to demand much quicker in traffic spikes. And also, it opens up opportunities to take advantage of things like deployment strategies that are very particular to the cloud. Tom gave the example of immutable infrastructure at Waze. It was somewhat rhetorical, but I think you guys get the idea.
The second practice is around the use of deployment strategies. There are a lot of tools out there that people can use to go ahead and deploy out to infrastructure and deploy changes to code. Jenkins is a popular one. But I started by saying it can’t be just about orchestration, and deployment strategies are an example of this. So we’re talking about deployment strategies that can specifically take advantage of some of what you get in the cloud, so let’s walk through a couple.
A simple first one is called Red/Black. Some people also call it Blue/Green. And it’s the notion that you have a production cluster V1, and then you have a V1.1. And you basically create a whole new stack running that. And after it’s been validated through health checks or whatever your criteria might be, you cut over traffic. The advantage here is that if something goes wrong, you can quickly cut back. There are some disadvantages, obviously, such as 2Xing the amount of infrastructure and resources required to go ahead and do this. But this is a simple example where if something goes wrong, you know you can go ahead and very quickly cut back.
An evolution of this is something that we’re calling Rolling Red/Black. And it’s the notion that you can go and define increments of traffic percentage cutovers with validation gates in between. We can simply say 1%, 5%, 25%, 50%, and 100%. And between each cutover to the next percentage increment, you can go ahead and set a validation gate. So it might be a pipeline that you run which might be very robust. Or it might be a simple smoke test or a functional probe. And it also continues to [take – 15:34] on the advantage of Red/Black, which is if something ever goes wrong, rolling back is immediate.
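The Rolling Red/Black idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Spinnaker implements it: the function name rolling_red_black and the gate callback are hypothetical, and the actual traffic shift is left as a comment because it depends on the load balancer in use.

```python
def rolling_red_black(steps, gate):
    """Shift traffic to the new version one increment at a time.

    steps: increasing traffic percentages for the new version.
    gate:  callable that receives the current percentage and returns True
           when validation passes (a smoke test, a probe, a whole pipeline).
    Returns the percentage of traffic left on the new version at the end.
    """
    for percent in steps:
        # A real system would adjust load balancer weights to `percent` here.
        if not gate(percent):
            # Validation failed. The old version is still running, so all
            # traffic can be sent back to it immediately.
            return 0
    return steps[-1] if steps else 0


# The increments mentioned in the talk.
STEPS = [1, 5, 25, 50, 100]

all_good = rolling_red_black(STEPS, lambda percent: True)
fails_at_25 = rolling_red_black(STEPS, lambda percent: percent < 25)
print(all_good, fails_at_25)
```

The gate is deliberately a plain callable so that a quick probe and a long-running validation pipeline can be swapped in without changing the rollout loop.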
Another evolution [inaudible – 15:45] is canary. And canary is a different way to do the validation gates in between a percentage rollout. But it takes a slightly different approach. Rather than using functional tests, probes, or smoke tests, we go ahead and look at a set of measurements which are much more common regardless of what your application is. So we look at underlying metrics, obviously, things like CPU and memory profiling but also latency, response codes, error rates. And you take all of these and you create a weighting by which you go ahead and say a certain canary score is acceptable. And that canary score also takes into account an acceptable deviation of that canary score or the performance characteristics between your previous version and your current version.
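A weighted canary score of the kind described here might look roughly like the following Python sketch. It is not the scoring used by Netflix, Google, or Spinnaker: the function canary_score, the metric names, the weights, and the 10% tolerance are all assumptions made up for the example, and it treats every metric as "lower is better."

```python
def canary_score(baseline, canary, weights, tolerance=0.10):
    """Return a 0-100 score: the weighted share of metrics for which the
    canary stays within `tolerance` (relative) of the baseline value."""
    total = sum(weights.values())
    passed = 0.0
    for name, weight in weights.items():
        # Highest canary value still counted as an acceptable deviation.
        allowed = baseline[name] * (1 + tolerance)
        if canary[name] <= allowed:
            passed += weight
    return 100.0 * passed / total


# Hypothetical measurements for the previous version and the canary.
baseline = {"cpu": 40.0, "latency_ms": 120.0, "error_rate": 0.010}
canary = {"cpu": 42.0, "latency_ms": 125.0, "error_rate": 0.020}
weights = {"cpu": 1, "latency_ms": 2, "error_rate": 2}

score = canary_score(baseline, canary, weights)
print(score)
```

In this made-up data the CPU and latency stay within tolerance while the error rate doubles, so the canary loses the error-rate weight. Whether the resulting score is acceptable is a threshold each team would pick for itself.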
So here’s what it means: test coverage is always a cat-and-mouse game. Test coverage is at 8% today. Every code commit will lower that test coverage. Goes in that direction. So with canary, you can actually say, “Well, no matter what we’re rolling out, we’re going to look at a different set of characteristics that are common to applications.”
So with that, let’s ask Tom to talk about what that looks like at Waze.
Tom: 10 years ago, the cloud did not exist, at least not as it does today. Back then, most companies ran their own infrastructure. And from what I’ve heard in this conference, some still do. What’s up with that? It wasn’t pretty then. It isn’t now. No matter how hard you tried, you could’ve sent API calls to your hosting provider 10 years ago, asking for [inaudible – 17:30] servers, and they would’ve just dismissed you as some lunatic. Luckily, these days, it’s possible. And as Stephen mentioned, it’s part of Spinnaker. So using deployment strategies is basically baked into Spinnaker, and we use that all the time at Waze. This is what it looks like. In a large system, as Stephen mentioned, it’s usually impossible to test everything in advance. So we also use a canary stage. It’s not fancy like the one Netflix uses, but it does the job. It looks like this. Basically, you have a production system running with 100% of traffic flowing to it. You then run version 2 in a canary and send a tiny bit of traffic to it. You then run a canary analysis on each and every API of that server. So for each API, we test for errors, latencies, and number of requests. If it all looks fine, we perform a relative comparison against the production group. If that also looks fine, we blow that up to a production scale and send 50% of traffic to the new group but still 50% of traffic to the old one. We then run the canary analysis again, this time expecting very similar behavior between the two groups. If there is any significant difference, we can easily roll back by just shifting traffic back to the old group, debugging, and then destroying the new one. But if everything is fine, then we just disable the old group and eventually destroy it.
Stephen: The third first principle is automation. Automation promotes consistency and repeatability, and consistency is important both in success and failure modes. When things are failing, you need a consistency of failure to be able to do root cause analysis. What really sucks is when it fails in different ways every single time. And a successful consistency promotes a maturity of process that you can go and build on top of. So it’s about identifying and putting in mitigating measures without losing velocity.
Suppose, for example, you take one of the number of tools out there that do orchestration as step one, step two, do this, and do a deploy, and something fails. You go, “Oh, that can happen again. I’m going to go ahead and write an immediate measure so I can very quickly roll back,” so you write something else. The way that this evolves, as I said at the beginning of the talk, is that it slows the process down. And it becomes very difficult to go ahead and improve if you’re slowing down constantly every time. So it’s difficult to improve if you haven’t instrumented. It’s difficult and less useful to instrument if you haven’t automated. And it’s hard to get to a sort of a consistency and a scale of information that gets interesting enough if you’re not automating.
Let me talk about what I mean by automation. Top-level pipelines define and execute for you your process for how you go ahead and release software. For example, find an image that you validated, looks good. Go ahead and deploy using Red/Black strategy. Do some rudimentary smoke tests. Wait a certain period of time. Go ahead and scale one down and scale up, all the different deployment strategies that we talked about before. So this is good. But the automation also has to happen in a more fine-grained form under each of these steps.
Let me show you. For that deploy to prod, Red/Black 1 stage, here’s what Spinnaker does: For each stage, there are a number of steps. So it goes and deploys in default. It shrinks the cluster. It disables it, scales it down. And then for each of those steps, there are a number of tasks that we go ahead and perform. So this is where it gets really tricky, determining the health of things. Do we need to go forward? Do we need to go back? Monitoring, doing necessary cache refreshes for us to be able to do our job. And then finally, for any one of these tasks, there are three cloud platform level operations that we have to go ahead and orchestrate, i.e. dealing with load balancers, association, disassociation, target pools, and removing instances. And for every single one of these minute actions, if something goes wrong, we have to be able to go ahead and auto-remediate. So this is the lot that you would have to actually go ahead and do to say we have [inaudible – 22:38] we have confidence. What allows you to go fast is confidence. And this is the amount of work that is done in Spinnaker to go ahead and promote that confidence. And you shouldn’t have to build all this yourself. You should get behind an open-source project like Spinnaker, which I’ll show you a little bit more later. So for automation, let’s go ahead and hear what Waze does in a little more detail.
Tom: By the way, Waze really didn’t want to go and build this stuff. We are really happy Spinnaker exists. We had actually been users of Asgard from a few years back, and we seamlessly transitioned from Asgard into Spinnaker, so it’s been great. And thanks to the Netflix guys for doing this and for open-sourcing it.
Once deployment pipelines are in place, some interesting things can start to happen. We basically can start using some fail-safe patterns to limit the blast radius of changes. For example, at Waze, we shard everything by geography, obviously. This allows pipelines to limit the blast radius of each upgrade. So for example, for a certain service, we can upgrade one part of the world completely using Red/Black deployments, then move to a few more parts of the world, and eventually worldwide. If you’re lucky and your usage patterns are predictable, it’s good practice to use time restrictions in deployment pipelines. This allows us to say, for example, that a certain pipeline can run only between 6 and 12 UTC, Mondays through Wednesdays. So this ensures that when an upgrade occurs, it affects as few users as possible. And if you’re really lucky, you can try to find a sweet spot between the off-peak hours and the office hours. Then you have staff caffeinated and ready to go to solve any issue that might come up during this upgrade.
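The time restriction Tom mentions comes down to a simple check. The Python sketch below is an illustration only: in_deploy_window is a hypothetical helper, not Spinnaker’s configuration format, and it assumes the window is 06:00 up to but not including 12:00 UTC, Monday through Wednesday.

```python
from datetime import datetime, timezone


def in_deploy_window(now, start_hour=6, end_hour=12, weekdays=(0, 1, 2)):
    """Return True if `now` (a timezone-aware datetime) falls inside the
    allowed window. Hours are UTC; weekdays follow datetime.weekday(),
    where Monday is 0, so (0, 1, 2) means Monday through Wednesday."""
    utc = now.astimezone(timezone.utc)
    return utc.weekday() in weekdays and start_hour <= utc.hour < end_hour


# 6 March 2017 was a Monday; 9 March 2017 was a Thursday.
monday_morning = datetime(2017, 3, 6, 7, 0, tzinfo=timezone.utc)
monday_noon = datetime(2017, 3, 6, 12, 0, tzinfo=timezone.utc)
thursday_morning = datetime(2017, 3, 9, 7, 0, tzinfo=timezone.utc)

print(in_deploy_window(monday_morning))
print(in_deploy_window(monday_noon))
print(in_deploy_window(thursday_morning))
```

Converting to UTC before comparing keeps the rule stable no matter which time zone the operator or the build server happens to be in.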
Stephen: The fourth and final one, which I don’t think gets talked about enough, is operational integration. And what operational integration speaks to is the notion that the process of how you go ahead and release software needs to work hand-in-hand with how you operationalize that software. And that goes in both directions.
There are two simple examples. One is that when you’re releasing (build, release, promote, and so forth), that process is gathering and generating a lot of information that is germane when you are doing root cause analysis or defect triaging. For example, while you’re trying to go ahead and figure out why something went out, it’s useful to understand, “Oh, when we built that software, what are the PRs, the commits that pertain to the change from the last deployment to the first? Here’s the blast radius of the code and config change. What kind of validation tests were run? What were the results and output from that test?” All this information that is generated at the time of doing your deployment becomes very, very useful and important when you’re doing operational things. It goes in the other direction as well. So you have systems that are running. And the information that is gathered as your systems run in production is very, very useful as you’re doing your process of doing deploys. A simple example, as we talked about, is canary analysis. Canary analysis goes and looks at how your systems are behaving. And it could do it at a point in time or statistically across time to go and determine what is the right validation gate to go ahead and say, “This system is behaving normally as we go ahead and promote from one environment to another.”
Finally, release process inevitably crosses multiple teams. I know there are a lot of good aspirations out there about whether it’s DevOps or a project methodology like Agile or whatever it might be. But your process inevitably has a lot of orchestration and coordination that’s needed across people and teams. So what that means is a Spinnaker pipeline, for example, can help you communicate, validate, and record handoffs that are necessary between teams as you do the release process. To give an example of that, back to Tom.
Tom: Thank you. We all know real life is messy and imperfect, especially when you’re moving fast. During a post-mortem meeting a few months back, the team recognized a certain critical micro-service which doesn’t have proper automated testing. It will eventually be written, but that’ll take time.
In the meantime, we decided that we want a person, someone from QA, to run a sanity after each and every deployment to production. Here’s the problem: deciding to do this is one thing, but actually doing this in a repeatable manner on every single deploy to production is easier said than done. People can simply forget. Tasks get shifted around. And before you know it, recency bias kicks in and we’re back to square one.
Here’s what it looks like: basically, the developer commits code. That code starts a pipeline in Spinnaker which initiates a manual judgment stage. That manual judgment stage sends a notification to QA, asking them to approve this upgrade because we need resources from them to be there to run the sanity and make a decision. They have a button to either approve or not approve, or they can just wait. Once they approve, Spinnaker takes over and runs a full Red/Black deployment, and once that’s done, another notification is sent to QA asking to run the sanity. Then they have an option to either pass the sanity and finish the automatic deployment or roll back. And Spinnaker takes care of all of this. Basically, going forward just means deleting the old version, and moving backwards just means enabling the old version and destroying the new one. This has been very helpful for us.
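The last step of that flow, going forward or moving backwards after QA runs the sanity, can be sketched as a tiny state change. This Python example is a simplified model, not Spinnaker code: finish_deployment and the state strings are hypothetical names chosen for the illustration.

```python
def finish_deployment(groups, sanity_passed):
    """Resolve a Red/Black deployment after the manual sanity check.

    groups: dict holding the state of the 'old' and 'new' server groups
            right after the Red/Black deploy (old disabled, new enabled).
    Returns a new dict with the resulting states.
    """
    result = dict(groups)
    if sanity_passed:
        # Going forward: the old version is no longer needed.
        result["old"] = "destroyed"
    else:
        # Moving backwards: re-enable the old version, destroy the new one.
        result["old"] = "enabled"
        result["new"] = "destroyed"
    return result


after_deploy = {"old": "disabled", "new": "enabled"}
approved = finish_deployment(after_deploy, sanity_passed=True)
rolled_back = finish_deployment(after_deploy, sanity_passed=False)
print(approved, rolled_back)
```

Because the old group is only disabled, not destroyed, until QA decides, both outcomes are cheap to carry out.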
Stephen: We talked about four important ideas. And I know it’s a lot to take in. And we’ve worked hard to make this sort of consumable and easy to adopt and create a ramp instead of a step. Let me walk you through a simple, basic demo that takes source code and, through Spinnaker, applying the key principles that we talked about, deploys and promotes something out to production. Can we have the demo, please? All right, a simple Go app. And really, it just goes and responds with Hello World, and the background color is yellow. So let’s make a change. Purple. I made it purple, right? Yeah.
I’m going to go ahead and initiate my release process by pushing to the release branch and source. What happens here is first of all, right now, what you’re looking at is Spinnaker. And we see a staging environment and a production environment. In the staging environment, if we go to what’s out there, it says Hello World in yellow. And also, in the production environment, it also says Hello World in yellow. So far so good. And you’ll also notice that things like the version of the container, the tag on it, match. It’s the same thing [inaudible – 30:49] promoted. We’ve gone through this once. When I committed that code, what I have set up is a build trigger. This is not in Spinnaker, but you can go ahead and build your containers in any way; Docker Hub and Quay both have build triggers. And I set up a trigger in GCR build so that when there is a push to the release branch, I go ahead and trigger a build. And I can use the Dockerfile that’s sitting in there, but I want to use a feature called GCR build steps which allows me to go ahead and build, among other things, lean containers. I just want to take a quick segue to go ahead and talk about this a little bit. The notion is that instead of… A typical Dockerfile looks something like: from this base image, pull this source, go ahead and build it, and then identify an entry point. And then that whole thing happens in a container. One of the challenges there is that everything that you needed to build it stays with you as you go ahead and run it. So the SDK, [dependent – 31:54] packages, all this you really don’t need. It’s a general notion that how you go ahead and build in a validation context is very different than how you might want to build for what you actually run in production. It’s not really controversial. So what you’d ideally like to do is do it in two steps. In one step, I want you to go ahead and build my binary. And then in the second step, containerize just that binary. So that’s what I’ve defined over here.
If you guys want to look at more, you should search for Google Container Builder. And what you’ll see is I have the build that happened just now. And if I look at my container registry for that service, you’ll notice that it looks like four megabytes for that Hello World container. Previously, I had built it using the standard method, and it is 241 megabytes. That’s 1 ½ orders of magnitude we’re talking about there. And it’s because the Go SDK is in that container as well as all the different packages that I downloaded. Okay, so [thanks to our – 33:02] normally scheduled program [chuckles].
In Spinnaker, I go ahead and manage the process to production in three pipelines: deploy to stage, validate, and then promote to prod. The first stage, first pipeline looks like this. The configuration is set such that every time there is a change to my Docker registry, a new tag appears. I want you to pull that down and trigger this pipeline. And it really has just one stage: deploy. I’ll look at the deployment in detail. Deploy sends it to a stack that I designated as the staging environment. It deploys the container that triggers the pipeline. And it uses the Red/Black strategy. What that means is once again, it goes and brings up a new… this is to Kubernetes, so it’s going to be a Replica Set. And then once it’s deemed to be healthy, it’ll cut over the load balancer and disable the other one. And you’ll notice that I checked on the box that says, “Scale down replaced server groups to zero instances.” This is a non-production environment. I don’t have to cut back in a hurry. Let’s save on the resources.
That’s running right now. And you can see, once again, as I described before, for that one deploy stage, it’s deploying, shrinking the cluster. Under each of those, you can go ahead and see the individual tasks that are being managed to do this.
The next pipeline is called Validate. The validate pipeline is set to trigger when the deploy to stage pipeline completes successfully, the first one that I just showed you. And it has two simple steps. The first one is a Validate stage. This is what we call a run job stage in Spinnaker. It basically allows you to point to an arbitrary container and run it. And whatever the exit status for that container is, it’ll go ahead and proceed accordingly. And for me, it’s actually just a simple curl command. It goes and curls the endpoint that I specify, which is my staging endpoint. And if it comes back with a 200, it’ll say, “It’s a 0 exit code.” It will continue.
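The demo uses a plain curl command for this check. As a rough equivalent, here is a minimal Python sketch of a job that exits 0 only when the endpoint answers with a 200. The function names exit_code_for and check are made up for this example, and any URL passed to check would be your own staging endpoint, which the talk does not spell out.

```python
import urllib.error
import urllib.request


def exit_code_for(status):
    """Map an HTTP status code to a process exit code: 0 only for 200."""
    return 0 if status == 200 else 1


def check(url, timeout=5):
    """Fetch `url` and return the exit code the validation job should use."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return exit_code_for(response.status)
    except (urllib.error.URLError, OSError):
        # HTTP error statuses, connection failures, and timeouts all count
        # as a failed validation.
        return 1
```

A container running this would finish with something like sys.exit(check(url)), so that the stage running it can proceed or stop based on the exit status.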
Then finally, at that point, I have a manual judgment stage, which notifies somebody. And then that person has to go and say, “Yup, everything looks good.” Yes or no. By the way, I’ll take a moment real quick on continuous delivery and continuous deployment. So continuous deployment is automated all the way to production. And continuous delivery typically entails somebody being a gate to go ahead and say, “Yeah, this is ready to go.”
Let me see if that came through. Here we go. At this point, you’ll notice that my staging environment… Please be purple. Please be purple. It’s purple, yeah. And you’ll notice that the container tags are also different. So it looks good. I’m going to go ahead and promote it. So when I hit Continue, my third and final pipeline will run. And here’s what that looks like. The configuration, as you might imagine, is set so that this pipeline will trigger when the validate pipeline completes successfully. It will do a couple of things. First I find the image from the staging environment. I know it was good. I tested it. I want to use that instead of rebuilding or locating that image otherwise. So I can give it coordinates to go to, say, from the staging environment, take the image from the newest group that’s out there and give me that image. I take that image and I deploy [it – 37:03] canary. And for that deployment, I go ahead and specify the canary environment. I don’t use any strategies because I don’t want to manipulate traffic in any way. I just want to bring something up and put it into the same load balancer. And I put up one instance. So now we have the old production as is at full capacity. And then we have one instance of the new version in a canary environment, both taking traffic from the same load balancers. So what will happen at this point is your users are visiting and 1/9th (or whatever that percentage comes out to) will get the new version. And you’ll start gathering data. And the idea is to start gathering… [inaudible – 37:47] around how well that’s going. If that looks good, after the canary is done, I want you to go ahead and notify somebody. They’ll look at the canary data and approve. And then at that point, they’ll do a Red/Black deployment to production. So looks good. Let’s go ahead and cut over the production environment to the new version using Red/Black, which will disable the old one.
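The share of traffic the canary receives follows from the instance counts. The Python sketch below assumes the load balancer spreads requests evenly across all instances, which the talk implies but does not state. The function name canary_share is hypothetical, and the demo’s actual number of production instances is not given.

```python
from fractions import Fraction


def canary_share(prod_instances, canary_instances=1):
    """Fraction of requests expected to reach the canary, assuming requests
    are spread evenly over every instance behind the load balancer."""
    total = prod_instances + canary_instances
    return Fraction(canary_instances, total)


# With 8 production instances, one canary takes 1/9 of the traffic;
# with 9 production instances, it takes 1/10.
print(canary_share(8))
print(canary_share(9))
```

Keeping the canary to a single instance keeps its share small without touching any traffic-splitting settings, which is why no deployment strategy is needed for that stage.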
But you’ll notice for [inaudible – 38:15] that I’ve deployed prod, I use Red/Black, but I don’t scale down the replaced server groups. So they’re sitting there, not taking traffic because Red/Black disabled them. But they’re sitting there. Why? Because if something goes wrong, we want to be able to cut over, cut back very, very quickly. And we have ad hoc operations where you can click a button and then it goes, rolls back, and sends traffic out to the old one again instantly. And then we tear down the canary. And once again, we use locators that say, “The newest server group in the canary environment, go ahead and tear that down, destroy server group stage.” And at this point, canary is dead. We have just old prod, new prod. Old prod is disabled. Let’s not be too hasty. Let’s force it [to R08 – 39:06] because at worst, all it’s doing is tying down resources. Then destroy that old prod. So at this point, if I look at where my pipelines are, my canary went through. It’s waiting for manual judgment. And if I go back to my clusters view, here’s my production, and here’s my canary. So I don’t know what to really expect because I’ve never done this with more than me.
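To make the Red/Black idea concrete, here is a toy model (not Spinnaker code, and not from the talk) of why keeping the old server group around makes rollback fast. The class names, method names, and the `prod-v001` style group names are all hypothetical; "enabled" simply stands for "receiving load balancer traffic".

```python
from dataclasses import dataclass, field


@dataclass
class ServerGroup:
    name: str
    enabled: bool = True  # enabled groups receive load balancer traffic


@dataclass
class Cluster:
    groups: list = field(default_factory=list)  # ordered oldest to newest

    def red_black_deploy(self, name: str) -> ServerGroup:
        """Bring up a new group, then disable (but keep) the older ones."""
        new_group = ServerGroup(name)
        self.groups.append(new_group)
        for group in self.groups[:-1]:
            group.enabled = False
        return new_group

    def rollback(self) -> None:
        """Send traffic back to the previous group and disable the newest."""
        if len(self.groups) < 2:
            raise RuntimeError("no previous server group to roll back to")
        self.groups[-2].enabled = True
        self.groups[-1].enabled = False

    def destroy_disabled(self) -> None:
        """Tear down groups that no longer take traffic."""
        self.groups = [g for g in self.groups if g.enabled]

    def serving(self) -> list:
        return [g.name for g in self.groups if g.enabled]


cluster = Cluster([ServerGroup("prod-v001")])
cluster.red_black_deploy("prod-v002")
print(cluster.serving())  # ['prod-v002']
cluster.rollback()        # old group still exists, so this is just a flag flip
print(cluster.serving())  # ['prod-v001']
```

In this sketch rollback only flips two flags because the old group was disabled rather than destroyed. Once `destroy_disabled` has run, that quick path is gone, which is the reason for not destroying old prod too hastily.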
Everybody pull out your phones. Can we go back to the slides, please? That is the production end point. So if you please navigate to mdservice.spinnaker-test.net, what should happen [chuckles] is most of you should get a yellow page. And about 1 out of every 10 of you should be getting a purple page. So once you have it loaded up, can you guys raise your phones and point them toward me? Oh, yeah, it worked. Oh my God!
Tom: Well, you can’t see it, but [crosstalk – 40:27].
Steven: All right, I’ll go ahead and point that over to the… So at this point, it’s going to look good. Canary looked good. It showed up purple. I’m going to hit continue, and it’ll proceed on to go ahead and destroy the old prod. Is that pretty straightforward to you guys about how that would work? Now a couple of things are important that happened there. It was all automated. There was every opportunity where we recorded every step that was taken. It followed a process that we defined. And really, at the point where we get comfortable with things that can go wrong, we can start cutting out those manual approval stages. And we can get much smarter than this. This is a demo scenario. If you ask what Waze is doing with Spinnaker or what Netflix is doing with Spinnaker or a number of other organizations, they’re doing far more sophisticated things. For example, Waze is using Spinnaker to go ahead and do fleet upgrades of base OS based on pipelines that handle all this for them.
So as I mentioned, we open-sourced in 2015. And we open-sourced with GCE and EC2 providers. I said Spinnaker is multi-cloud. And a Cloud Foundry provider came out there in beta at the time, and we had like a Red/Black strategy and some basic support. In 2016, we were heads down in development, and we introduced the Kubernetes provider, the Azure provider, joined in beta, Microsoft joined. OpenStack provider. And really, those people from the community who are building these things, some of them are here in the room right now, so thank you very much. We added authentication and authorization support through [OAuth 2 – 42:04], Google Groups, [inaudible – 42:05], LDAP. We added much more robust support for load balancers. They are L3 and 4. They are L7 HTTP, internal SSL support. We integrated Stackdriver logging. So we’re trying to make this operationally ready for you guys to go ahead, take on, and support for things like auto-scaling, auto-healing.
And also, we’re beginning to reach out into other popular systems in the ecosystem like Consul for doing things like discovery and health readiness.
And in 2017, what we’re working on is we’re going to be adding an App Engine provider. So that’s actually pretty much done. We’re ready to start taking alpha testers. I’ll tell you how to reach out to us later. We’re beginning on formal support for canary strategy that’s available in the OSS realm. So Google and Netflix are working very closely on this. And we expect to be at a pretty good point in the latter half of this year with that. We’re going to be introducing a rolling Red/Black strategy. That’s also at a very good place already.
Another major effort is Declarative CD. You can look at it as config as code, but it’s really more than that. It’s more of a holistic look at when we go ahead and say, “Here’s an application,” so far we think about binary and config. But also, we want to go ahead and make that [bundle be – 43:22] defined as, “And here’s how you should deploy me, manage me, upgrade me, and so forth.” So what that means is that as you have Spinnaker in your organization, you can have application development teams who should be self-sufficient, define all of their artifacts as code, check it into a repository, and have Spinnaker enact that on their behalf. Spinnaker is a complex system: many microservices, lots of configuration. We recognize that there is a high friction to adoption. So in about a month, we’ll be introducing formally versioned releases. So we will run the full synthetic transaction integration test for you to go ahead and say, “Here is the combination of different services that we have validated and work for these providers. So we can look and know what you’re getting.” Also, we’ll be done with the Spinnaker configuration management tool that allows you to install, upgrade, configure, and validate your Spinnaker environment as well. There’s a lot of documentation out there right now currently, but we think that this tool is going to be a necessity for people to really adopt and be successful with this.
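To give a feel for what "pipeline as code" could look like, here is a small sketch that writes the promote pipeline from the demo as plain data and derives a valid execution order from it. The stage ids, field names, and structure are invented for illustration and are not Spinnaker's actual pipeline schema; the stage sequence is only loosely modeled on the demo.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Hypothetical declarative description of the promote pipeline.
# Each stage lists the stages it must run after.
PROMOTE_PIPELINE = [
    {"id": "find_image_from_staging", "after": []},
    {"id": "deploy_canary", "after": ["find_image_from_staging"]},
    {"id": "manual_judgment", "after": ["deploy_canary"]},
    {"id": "deploy_prod_red_black", "after": ["manual_judgment"]},
    {"id": "destroy_canary", "after": ["deploy_prod_red_black"]},
    {"id": "destroy_old_prod", "after": ["destroy_canary"]},
]


def execution_order(stages: list) -> list:
    """Return stage ids in an order that respects every 'after' dependency."""
    known = {stage["id"] for stage in stages}
    for stage in stages:
        missing = set(stage["after"]) - known
        if missing:
            raise ValueError(f"{stage['id']} depends on unknown stages: {sorted(missing)}")
    graph = {stage["id"]: set(stage["after"]) for stage in stages}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError(f"pipeline has a dependency cycle: {err.args[1]}") from err


for stage_id in execution_order(PROMOTE_PIPELINE):
    print(stage_id)
```

Because the definition is just data, it can be checked into a repository, reviewed, and validated (for unknown stages or cycles, as here) before anything is deployed.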
Finally, we’re going to be filling in operational monitoring for the Spinnaker instance itself so that when deploying your software, if you might have performance issues or something goes down, then somebody can be notified of that and much more. Actually, there’s some really exciting stuff that you’ll hear from us. It’s not little things. It’s pretty cool things.
Spinnaker.io is our website. And we develop out in the open. So if you go to the Spinnaker GitHub organization, you can see everything that we’re doing. The best way to get engaged is our Slack channel. We are a community that’s 2000-strong, heavy participation. People are helping each other out. It’s a great place to come, get introduced into the community, ask for any level of help. People are extremely friendly and helpful over there.
And also, I’ll mention as a shortcut. If you want to quickly get an instance of Spinnaker out for yourself to play around with today, go to cloud.google.com/launcher, search for Spinnaker, and you can in one click get an instance of Spinnaker up that you can start working with. There is also a set of code labs on Spinnaker.io that you can look at. That’s how you can get started.