Spinnaker is the Secret Sauce to Waze Success
Calling Spinnaker “the secret sauce behind our multi-cloud production system,” Tom Feiner, Infrastructure Team Lead at Google, told the story at Google Next 2018 of how Waze came to use Spinnaker and what it does for them.
“[Spinnaker is] a sweet spot between just the amount of abstraction on top of each technology, but it’s still able to control each stack precisely,” he said. “It’s a bit contradictory, but somehow Spinnaker manages to pull it off.”
He was joined on stage in San Francisco at Google Cloud Next by Andy Glover, Director of Netflix’s Delivery Engineering team, and Matt Duftler, Software Engineer at Google, who works full time on Spinnaker.
This team of heavyweights in Google’s CI/CD space gave a joint presentation to show how Spinnaker makes scaling CI/CD easier and let us in on what’s new and what’s next with the Spinnaker open source project.
Waze: The Early Days
The Waze app, as most commuters know, is a navigation tool that takes crowd-sourced data and updates routes and traffic in real time across the globe. You can set the app to send you the route with either the shortest distance or fastest time. Choosing to be routed by the fastest way means Waze will reroute you when an accident happens, moments after it happens. You can also find the best time to leave, set a desired arrival time and see how typical traffic will affect your drive. For example, I can see how my trip from the East Bay into Armory HQ in San Mateo on a weekday morning will vary from 55 to 31 minutes, depending on whether I want to arrive at the office at 8:45 or 10:00.
All of this takes data. Lots of data. Feiner said Waze generates a mind-boggling amount of data streaming through the app, which now has 100 million monthly active users. Accordingly, they manage hundreds of server groups, data that is shared across the groups, and thousands of instances.
Founded in 2008 with the motto “help users outsmart traffic together,” the company started with a few developers on physical servers. They soon outgrew that model and moved to the cloud, where things got complicated. The more they grew, the harder it became to deliver software quickly.
In order to deploy one backend service into production, developers needed to create an instance, create images, and deploy them to multiple, geographically sharded auto-scaling groups. Rollbacks were extremely painful.
This process was moved to the Site Reliability Engineering (SRE) team, which is in charge of Waze production.
As things scaled, creating a new instance also created a bottleneck and jammed the pipeline. This bottleneck led to developers batching releases together, said Feiner, which just complicated everything further. When one piece broke the system, troubleshooting became a nightmare because they couldn’t tell which release in the batch was causing the problem.
Waze & Spinnaker
Netflix was gracious enough to share Spinnaker with them, said Feiner. It was the key to evolving technology stacks in the company. It was a shortcut to what they needed: fast feedback for their users.
Feiner was attracted by Spinnaker’s fast, safe and repeatable software delivery with easy rollbacks. Especially the easy rollbacks. It allowed the team of engineers to move to continuous delivery in a multi-cloud environment.
Deployments are now a single-click operation, and it’s all self-service. Developers can ship, get feedback, monitor the system through a single pane of glass for the infrastructure control plane, and have complete autonomy. The SRE team now gets to focus on system stability and scalability.
The best part, said Feiner, is that users get features and bug fixes faster, along with a more stable system.
Spinnaker’s power, he said, comes from its abstractions and powerful pipeline system. This allows them to bridge between multiple technology stacks. At Waze, Spinnaker provides a bridge between Kubernetes and GCP/AWS compute virtual machines.
Everyone across the company accesses Spinnaker through a single pane of glass. The point is that it feels the same no matter what stack you’re on (Kubernetes, GCP, or any cloud provider).
But as all developers know, he said, once one problem is solved, two new problems crop up. Now they found things were moving too fast. This sounds like a good thing, but it introduces risk.
Now they had 1,000 production deployment pipelines, all managed through the Spinnaker UI. They found that most were duplicates of the same pipeline with just a few changes. Also, each team had its own set of pipelines, and any good ideas remained local to that team.
What they needed, he said, was a paved road that 80% of the developers could be happy using out of the box, and the rest of them could change it and adapt it for their needs but still use the basic roadway.
They started their pipeline ‘paved road’ with a station to bake, a smoke test, then deploy to production.
“This raised the company speed limit from 60 miles an hour to 600 miles an hour,” he said. And when you do that, new risks appear. So they needed more safeguards.
They installed a canary system to make deployment less risky. And they built a bypass so developers could opt out of a piece of the pipeline but still be on the road. Now any changes and upgrades to the paved road were reflected across the organization.
A critical aspect to the success of Spinnaker at Waze, said Feiner, is that all of this is done in code, with any changes to the paved road system sent to teammates for review. So tracking and rolling back is now pretty easy.
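To make the paved-road idea concrete, here is a minimal sketch of a shared pipeline template kept in code, with a bypass for optional stages. It is an illustration only, not Waze's implementation or Spinnaker's API: the stage names, the `PipelineRequest` class, and the rule that only the smoke test and canary can be skipped are all assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical stage names for illustration; not Spinnaker's actual stage types.
PAVED_ROAD = ["bake", "smoke_test", "canary", "deploy_prod"]
# Assumption: teams may bypass these stages; bake and deploy stay mandatory.
OPTIONAL_STAGES = {"smoke_test", "canary"}


@dataclass
class PipelineRequest:
    app: str
    skip: set = field(default_factory=set)  # stages this team opts out of


def render_pipeline(request, template=PAVED_ROAD):
    """Build one app's pipeline from the shared template, honoring opt-outs."""
    not_allowed = request.skip - OPTIONAL_STAGES
    if not_allowed:
        raise ValueError(f"cannot skip mandatory stages: {sorted(not_allowed)}")
    stages = [stage for stage in template if stage not in request.skip]
    return {"application": request.app, "stages": stages}


# Most teams take the road as-is; one team bypasses the canary stage.
default = render_pipeline(PipelineRequest("routing"))
custom = render_pipeline(PipelineRequest("maps-data", skip={"canary"}))
print(default["stages"])
print(custom["stages"])
```

Because every pipeline is derived from `PAVED_ROAD`, a reviewed change to that one list reaches every application the next time its pipeline is rendered, which is the property the paved road is meant to provide.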
They have a few pipeline roads, he noted, for example, one for microservices and one for data.
Waze just joined the Spinnaker open source community and is excited to be contributing to the project, said Feiner. They will be adding SRE concepts like Service Level Objectives and error budgets. He believes that once Spinnaker is aware of an error budget for an application, some interesting things can start to happen.
“When a service is almost out of error budget, the canary system could automatically run longer to gain more confidence in the new version. Or perhaps something that was fully automatic now has a manual stage because of the error budget.”
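The idea in that quote can be sketched in a few lines. This is a hypothetical illustration, not an existing Spinnaker feature: the function names, the 25% and 50% thresholds, and the canary durations are all assumptions chosen for the example.

```python
def remaining_budget(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


def plan_rollout(budget_left, base_canary_minutes=30):
    """Pick canary length and approval mode from the remaining budget.

    Thresholds are illustrative assumptions, not recommended values.
    """
    if budget_left < 0.25:
        # Almost out of budget: run the canary much longer and add a manual gate.
        return {"canary_minutes": base_canary_minutes * 4, "manual_approval": True}
    if budget_left < 0.5:
        # Budget is getting low: lengthen the canary but stay automatic.
        return {"canary_minutes": base_canary_minutes * 2, "manual_approval": False}
    return {"canary_minutes": base_canary_minutes, "manual_approval": False}


# A 99.9% SLO over 1,000,000 requests allows about 1,000 failures; 900 are spent.
budget = remaining_budget(0.999, 1_000_000, 900)
plan = plan_rollout(budget)
print(plan)
```

With roughly 10% of the budget left, the plan falls into the most cautious branch: a longer canary plus a manual approval step, matching the two behaviors Feiner describes.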