How and Why Netflix Uses Spinnaker To Create Competitive Advantages And Win In The Marketplace
Last night, Armory sponsored an Advanced AWS meetup focused on Spinnaker, which included:
- Andy Glover, Delivery Engineering Manager, Netflix, sharing how, and why, Netflix uses Spinnaker to create competitive advantages and win in the marketplace, as well as a peek into its roadmap
- Diana Tkachenko, Data Platform Engineer at Stitch Fix, sharing how she is implementing Spinnaker
- Andrew Leung, Senior Software Engineer, Netflix, covering how Netflix is using Spinnaker with Titus, their home-grown container management and scheduling system.
Thanks to Electronic Arts for hosting the event, and to ThoughtWorks for also sponsoring.
Here’s the Transcript:
Daniel: Hi, guys. I am Daniel. I am the CEO of Armory. And please eat to your heart’s content. I think we ordered plenty of food. So [… – 00:26] extra. So we are a new startup. And we really believe that software is permeating everything around us. Over the next decade, software will be in everything. There’s no concept today of your clothes not working [for the – 00:42] walls in your house or your office not working. But we believe that software will be in those things. And the Global 2000 companies are going to have to figure out [… – 00:54] software […] competency. And as they have to do that, we looked at the market and we thought, “What’s the best tool out there to allow companies to deliver software in ways that don’t break customer trust?” And we very quickly realized that it was Spinnaker. And actually [… – 01:12] for doing the startup 15 years ago on [the East Coast – 01:14] and was very excited when he told us about Spinnaker. And it just felt like the world is aligning. We’re passionate about software in the way that it’s literally changing the world. And there’s this amazing tool out there that has [… – 01:28] framework […], all of these things that we just really believe to be the right thing.
So we are doubling down on Spinnaker. We’re doubling down on trying to bring it to the world. We are just in a very early customer validation mode. So if you work for a company that’s evaluating Spinnaker or thinking about doing so or running it in production, we’d love to talk to you. We’d love to understand what it is about Spinnaker that isn’t working for you today so that we can work with the community to try to make it better. And we’re looking forward to getting to know all of you as we build the future of software delivery. So enjoy the food and we’d love to get to know you guys. Just come up and say hi afterwards. Thank you.
Male 1: Thank you very much. And then we have a premium [… – 02:16] ThoughtWorks. If you don’t know about ThoughtWorks, they have been around for a while. If you’ve read the book Continuous Delivery, they were some of the folks that wrote that book. They have two products, Snap CI and GoCD, also around doing continuous deployment, that sort of thing. So please go and check those out as well. [… – 02:40] why we do this […]. Anyone? It’s [… – 02:44] community. […] content […] fantastic […] content […]. The reason we do […] is to build community, to learn from each other, to have practitioners come and talk about how they’re using AWS. So [… – 03:04]. I’m always looking for more speakers. If you have a story to tell, maybe something catastrophic happened in AWS for you. And here’s how you found [… – 03:16]. Here’s how you fixed it. And you learned something from it. Come and tell us about it. I’m easy to find. Please sign up for the [… – 03:26]. So always looking for more speakers.
And as part of building a community, one of the other things I like to do is open up for announcements. So if anyone here has something to announce, now is your time. That can be, “We’re hiring,” “I’m unemployed, looking for work,” “I have created this open-source software,” “I discovered this open-source software.” It’s up to you. So any announcements?
Male 3: Thank you very much. I work for AWS. And once again, like he said, I’m hiring. I work for [… – 04:14] manager and […] strategic accounts segments [… – 04:21] clients, big ones. Netflix falls into that, woohoo! So thank you very much.
Male 4: [… – 04:30]?
Male 3: Oh, I apologize. My business cards are [… – 04:35] with […].
Male 1: Yes. In case anyone hasn’t noticed, there’s [… – 04:40] that sort of thing, just opposite the [… – 04:44]. Anyone else?
Male 5: [… – 04:55] systems integrator called conversion one. And one of the products that I’ve sold to a lot of customers – I started out locally here in California – is Interactive Intelligence. And Interactive has released PureCloud on AWS. I’ve got two events the next week. If you want to, there’s going to be [… – 05:15] for [… – 05:16] strategies, comparing cloud architectures. And there’s going to be a demo of PureCloud that runs on AWS. I’m here the 18th and San Jose at Polycom, the 20th. And San Francisco and 555 California Street is where those two events are midday. So if you’re interested in finding more about, it’s the first [… – 05:48] platform that runs on AWS.
Josh: My name is Josh. I work for Dell. We have an internal database cloud that I take care of globally. We have about 80 systems supporting about 3000 applications. I’m new to the area, and I’m looking to make a switch from my existing [… – 06:02]. Thank you.
Male 1: I don’t know if anyone saw the AWS guide that came out this morning. Basically, Joshua, the guy that created this and got the ball rolling, said, “There’s a lot of information about how to go and use AWS. So it’s kind of all over the place. Let’s put together a list that is not your typical getting started guide, but it’s more of here’s [the gotchas – 06:35] in AWS.” So I definitely recommend going and checking that out as well. And then without further ado, please welcome Andy, who’s going to give us an intro to Spinnaker.
Andy: Thanks to everyone who showed up tonight. Big thanks to [EA – 07:36] for the wonderful facilities. [… – 07:39] the […]. My kids are big Battlefront fans, and [… – 07:46] to see the Battlefront [… – 07:49] pretty cool. Thanks to Armory for the food. Anyone from ThoughtWorks here? Thanks to ThoughtWorks for not showing up. No, thank you [… – 08:00].
Tonight I want to talk about how Netflix facilitates continuous binging. My name is Andy [… – 08:13], and I work for Netflix. And we are hiring as well, so [… – 08:19] if you want to talk about that. How many Netflix members are here? Thank you very much. Stranger Things, everyone has seen it? Pretty good?
Netflix is a global company. And we believe strongly that in order to stay ahead in the market, to beat our competition, we have to out-innovate everyone out there. And how we do that is we believe really strongly in the value of continuous delivery. Specifically, continuous delivery gives us a competitive advantage. We can stay ahead of the competition, like I said. In fact, [CE – 09:06] provides a number of benefits. And the first one is obviously accelerating time to market. We are continuously innovating the Netflix experience. When you sign in tonight, it might be slightly different than what you saw last night. In fact, when two of you sign in, one of you may have a different experience than the person next to you. We are always innovating, constantly running A/B tests, so that we can figure out what is the best experience for you so that you come back night after night and watch Netflix instead of going [… – 09:35]. And because we’re constantly putting stuff out in front of you, we’re constantly focused on your experience, we’re trying to make the best experience for you [… – 09:49], so that you come back time and time again and watch amazing content. And we’re able to do that because we can make changes on it. Flip the switch and we have new code out there that can fix a bug, put a new feature in front. We can experiment. We can do all that and ultimately keep you happy. And when you’re unhappy, we can also fix that quickly as well, hopefully.
The last thing we see from continuous delivery is that obviously, we’ve seen [market – 10:19] improvement in developer productivity and efficiency. So we believe so strongly in the value of continuous delivery that we’ve invested in a platform called Spinnaker, which is obviously the topic of discussion tonight.
Spinnaker is a continuous delivery platform that we wrote from the ground up. We didn’t just decide to write a continuous delivery platform because we thought it was a cool thing to do. We have been practicing continuous delivery [at scale – 10:47] in the cloud for over six years, as our distinguished guest Adrian can attest to. Spinnaker is the evolution of multiple continuous-delivery products, tools, platforms, scripts, and whatnot at Netflix. It’s a very [opinionated – 11:06] system that ultimately takes the lessons learned from all those different systems so that we can innovate at a breakneck speed. And for those of you that might be familiar with [… – 11:21] because I know Peter is [… – 11:23]. One thing we did differently with Spinnaker is we wanted to make a system that was friendly to the community, that you could plug into it. In fact, we’ll talk about that tonight with respect to Titus. We didn’t want people to necessarily fork it and build something else on another cloud. We wanted to be a partnership. So Spinnaker has a very strong community. It’s a very strong partnership with Google, Microsoft, Target, Pivotal, Veritas. And these partnerships reflect deep integration not only within AWS but also GCE, Kubernetes, Cloud Foundry, OpenStack, and Azure. And we released Spinnaker to the open-source community last November. So coming [… – 12:14]. And we have a very, very healthy and large open-source community, in my opinion. [We get – 12:20] companies like Stitch Fix, Twitch, Adobe, [… – 12:25] Target, Veritas, [… – 12:27], Coursera, AOL, Symantec, Airbnb, ESPN, Under Armour, Reuters, [… – 12:36]. So I think it’s a very healthy community, and it continues to get more and more healthy. Obviously, the addition of the Armory folks is a [valid sign – 12:46] as well in terms of the excitement and the richness of that community.
And you mentioned ThoughtWorks. Jez Humble wrote a book on continuous delivery called Continuous Delivery. And there’s a great quote. He said, “Continuous delivery is about making software delivery boring.” And I’m here to tell you that with Spinnaker, software delivery is very, very boring. In fact, it’s so boring that we do 4000 deployments a day. And that’s a twofold increase from when we first released Spinnaker internally in Netflix when we were about 2000 deployments a day. Now we’re doing 4000. I think that’s a [testimony – 14:33] in terms of how boring it is. It’s so damn easy. You just essentially check in [… – 14:38] email […] for you to play with it. And I wasn’t going to demo, so I just had some screenshots.
Again, Spinnaker is a continuous delivery platform. At the heart of Spinnaker is this notion of a pipeline. It’s an orchestrated workflow. So you can say before I go into this environment, I want to do x, y, and z. So in this case, what you have here is a bake stage. In this case, they’re baking an AMI. So that’s our bread-and-butter kind of deployment asset at Netflix for the time being. Increasingly so, it will turn out to be a Docker image. And then you can do things in parallel. So after the bake, we’re verifying it or running some integration tests and we’re running some unit tests.
And then we’re deploying a canary. I’ll talk a little bit more about canary in a minute. But suffice it to say canary is the way to test your application in production using [… – 15:41] production traffic.
After the canary, then there’s a manual judgment. So again, this pipeline… And you can fashion it any way you want. In fact, that’s [something key – 15:49] in Netflix, which is that my team builds Spinnaker, but we do not go out to other teams and build their pipelines. It’s up to each team to come up with their own pipelines. There are no DevOps teams in Netflix. DevOps is just kind of built into the [core mantra of – 16:03] any team. So each team owns their own destiny, can come up with their own pipelines.
After this manual judgment, then there’s the [… – 16:12]. Then they wait a couple of hours, and they may clean up the old server. So this is a fairly typical pipeline that’s probably run close to 4000 times a day.
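The pipeline walked through above can be sketched as a small dependency graph. This is only an illustration, not Spinnaker's actual pipeline format: the stage names are hypothetical labels for what the talk describes, and `post_judgment_stage` is a placeholder for the stage the transcript does not name. The sketch shows how stages that share a predecessor (the two test stages after the bake) fall into the same wave and can run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names mirroring the pipeline described in the talk.
# Each stage maps to the set of stages that must finish before it can start.
PIPELINE = {
    "bake": set(),
    "integration_tests": {"bake"},
    "unit_tests": {"bake"},
    "canary": {"integration_tests", "unit_tests"},
    "manual_judgment": {"canary"},
    "post_judgment_stage": {"manual_judgment"},  # not named in the transcript
    "wait": {"post_judgment_stage"},
    "cleanup_old_servers": {"wait"},
}


def execution_waves(pipeline):
    """Group stages into waves; stages within one wave can run in parallel."""
    sorter = TopologicalSorter(pipeline)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(number, wave)
```

Modeling stages as a graph rather than a flat list is what lets the two test stages sit side by side while everything after the canary stays strictly sequential.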
What this slide is showing you is that we provide a rich amount of information in the UI. So in this case you can see that I’ve clicked on the canary ACA and [… – 16:39] you can see here that they’ve run this ACA in three different regions of AWS. And it reported back a score of about 88. In this case, [… – 16:49] 1, 2, or 0 to 100, 100 being perfect, 0 being very, very bad. Consequently, they ran it for four hours. And after it ran, it went to the next stage [… – 17:03] ran previously. You can see that it ran to completion. And actually, it took a 10-hour pipeline.
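As a rough illustration of how a 0-to-100 canary score could gate a pipeline, here is a minimal sketch. The pass threshold, the rule that every region must pass, and the region names are all assumptions made for the example; the talk does not say how Netflix's ACA turns scores into a decision.

```python
def canary_decision(region_scores, pass_threshold=80.0):
    """Return "go" only if every region's score meets the threshold.

    region_scores maps a region name to a score from 0 (very bad) to 100
    (perfect). The threshold and the all-regions rule are assumptions.
    """
    if not region_scores:
        raise ValueError("at least one regional score is required")
    for region, score in region_scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"score for {region} is out of range: {score}")
    worst = min(region_scores.values())
    return "go" if worst >= pass_threshold else "no-go"


if __name__ == "__main__":
    # Hypothetical region names and scores.
    print(canary_decision({"region-a": 88, "region-b": 91, "region-c": 95}))
```

Using the worst regional score is the conservative choice; a real system might weigh regions or metrics differently.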
Here’s another very interesting pipeline that I’ve got to call your attention to. Anyone watched Netflix on your computer? One person. Okay, come on [laughter]. This is the HTML5 client. A fair amount of people watch Netflix on their computers. It’s called [Cambian – 17:38] inside Netflix. And this is one of their pipelines to push it into production. This is run frequently. So the next time you fire up Netflix on your computer, you can thank Spinnaker when it works.
The core thing with Spinnaker is [… – 17:58] workflow. And that workflow is made up of stages. And those stages have many different types. And this is a drop-down showing the different types of stages there are in Spinnaker. I forget the last count, how many different types we have. But in this case, there is ACA, [bake – 18:13], canary, deploy, run a job, you name it. There’s a myriad of stages. Do you have the number off the top of your head [… – 18:22]?
Male 6: 29.
Andy: 29, okay. One thing that’s key about Spinnaker is it is a continuous-delivery platform. But it’s also a [control ping – 18:34] for AWS and other clouds. So in this case what we have here is a production [… – 18:42] this is great. Talking to [… – 18:44] audience [… – 18:46] terms [… – 18:47] what is an ASG? So what you have here are a couple of ASGs in some regions. And the key thing about Spinnaker that it borrowed from Asgard is [… – 19:00] Netflix… This is what I believe [… – 19:03] this is this notion of an application, [that an – 19:07] application is made up of clusters, and then clusters are made up of ASGs. As you all know, ASGs are regional. So a cluster stands essentially in many regions (the globe in this case). So what we have here is a cluster called API production. In this case, the picture only shows two regions [… – 19:25].
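The hierarchy described here (an application is made up of clusters, a cluster is made up of ASGs, and each ASG lives in one region) can be sketched with a few data classes. The class names, field names, and example values are hypothetical and are not Spinnaker's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ServerGroup:
    """One ASG; an ASG belongs to exactly one region."""
    name: str
    region: str
    instance_count: int


@dataclass
class Cluster:
    """A named set of ASGs that may span several regions."""
    name: str
    server_groups: list = field(default_factory=list)

    def regions(self):
        return sorted({group.region for group in self.server_groups})


@dataclass
class Application:
    """An application is a collection of clusters."""
    name: str
    clusters: dict = field(default_factory=dict)

    def add_server_group(self, cluster_name, group):
        cluster = self.clusters.setdefault(cluster_name, Cluster(cluster_name))
        cluster.server_groups.append(group)
        return cluster


if __name__ == "__main__":
    app = Application("api")
    # Hypothetical cluster, ASG, and region names.
    app.add_server_group("api-production", ServerGroup("api-production-v001", "region-a", 10))
    app.add_server_group("api-production", ServerGroup("api-production-v001", "region-b", 8))
    print(app.clusters["api-production"].regions())
```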
And you can manage your assets once they’re in the cloud. So in this case, I’ve clicked on an ASG and I can roll it back, I can clone it, I can destroy it, I can resize it. I can do everything [it would want – 19:37] to do.
Another key thing about Spinnaker is it’s giving me real-time data. So in this case, what we have here is this is a deployment to EU West, and it’s probably either scaling up due to traffic because you can see that these blue ones are coming into service. So it’s a very, very rich platform, not only for you to deploy things but also to manage those assets once you’re in the cloud.
Spinnaker is open-source, as already mentioned. There’s a website, Spinnaker.io. I highly recommend you check it out. You can download it when you go there. And I’ll show you where you can get help as well. Just really quickly, we’re going to spend some time going over the Netflix roadmap so you have an idea where we’re going. As you’ll hear tonight, we are working hard to integrate Spinnaker with a container cloud called Titus. Spinnaker also works with Kubernetes. So we’re spending a lot of time working on that Titus work [… – 20:37] talk about Titus coming up. But if you do have questions about [… – 20:43] Spinnaker team here as well.
We’re also making a sizeable investment in this notion of declarative continuous delivery, the idea here being that you can codify your, let’s say, [nth – 20:55] state and how you want to get there in a format outside of Spinnaker. And this is not a new term. And we are making headway on that now and hope to roll that out probably in the next quarter or two.
And then I want to spend a little time going over where we’re going with Spinnaker long-term. And this encompasses all the community, not just what Netflix sees. This reflects the entire community.
And the thing about Spinnaker right now is flexibility is the first-class citizen in the platform. There are knobs and dials there to practically change and do anything you want to in the cloud. If you are an AWS person, all the concepts you know and love about AWS are exposed through Spinnaker, and people love it. It turns out that as I said earlier, the value of continuous delivery is about moving fast. And in this case, speed [… – 21:46] flexibility. So [… – 21:48] it’s really helpful that I can fine-tune, I can set my [… – 21:53]. I can pick my instance types. At the end of the day, [we want it to move – 21:57] fast. And it turns out that Spinnaker itself, through metadata like declarative continuous delivery and other things, we can make a lot of intelligent defaults. We can make the system smarter and do a lot of things for you so that you don’t have to spend a lot of time configuring the pipelines, i.e. you can [tell it’s a – 22:14] metadata like [I’m a] tier-0 app. And we can infer a lot of things from that. Or even more, you can [… – 22:19] tier-0 app and need a lot of memory. And we can start to make intelligent decisions for you and not ask you to say… We don’t need to ask you for your instance type. And you can see that already in Spinnaker, that we try to abstract it out. Again, this is an AWS cloud, so we all [love our M4s – 22:36], right? [… – 22:38]. Some time soon. Top secret. That’s right. You guys [don’t – 22:43] talk about those things. [… – 22:46] are coming out at some point, right? So what we’re trying to do is we’re trying to [abstract out – 22:53] a way from people so they don’t have to worry about that. At the end of the day, they just [… – 22:57] or they need a lot of disk space. So they need something else. And it turns out the way Netflix works is we buy a whole lot of reservations because as we all know, reservations are cheaper than [… – 23:07], right?
But as you all know, when you buy a reservation, you buy it for a certain type [… – 23:13], correct? So if we know some things about [… – 23:19], you need a lot of memory. And you’re in tier-0 [… – 23:23]. Then we can make some guesses for you. We’re going to put you on M4, and we’re going to [in all the – 23:27] regions. [… – 23:29] figure out. We don’t want you to have to figure out what M4 is because it turns out that maybe tomorrow, they come out with M5’s and we get a better deal on M5’s and we [… – 23:35] M5’s. We just want to hide that from you so that you can start to just move fast and we’ll make the decision for you. You’ve kind of seen this already if you download Spinnaker and then we have this screen here that kind of guides you towards the specific type. But as we talk about [… – 23:50] comes up here is we support different [COM – 23:52] providers. And in Netflix we are 100 percent AWS. We love AWS. We use everything in AWS. Is there some service we’re not using? But a big portion of what we use in AWS is [… – 24:02]. We use Titus as well, which goes on top of AWS. But right now what we do is we make you pick at the app level whether you’re going to go to Titus or whether you’re going to go to AWS. And there’s probably a world at some point in the future where we don’t need to ask you whether you’re going to Titus or you’re going to an AMI or AWS. We’ll figure it out for you at either deployment time or at application definition time. So we’re going to start to see these things kind of [… – 24:27] does appear that always be there as dials that you can [go into – 24:30]. [… – 24:31]. But given a certain amount of data, we’ll make the decision for you [… – 24:35] 80 percent of [… – 24:37] Netflix […] throw you out of Titus, and we’re going to size you appropriately, and we’re going to give you everything you need. And then for the 20 percent that actually need to know these things, they can go in here and [take – 24:45] these things.
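The "intelligent defaults" idea can be illustrated with a short sketch: an app declares metadata such as its tier and whether it needs a lot of memory, and the platform resolves the instance type and regions from a central catalog. Everything here is an assumption made for illustration (the field names, the catalog contents, the region names, and the rule that tier-0 and tier-1 apps go everywhere); it is not how Spinnaker or Netflix actually implements this.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical platform-wide settings owned by a central team.
ALL_REGIONS = ("region-a", "region-b", "region-c")
DEFAULT_CATALOG = {"general": "m4.large", "memory": "m4.4xlarge"}


@dataclass(frozen=True)
class AppMetadata:
    tier: int
    needs_lots_of_memory: bool = False
    # The "dial" for the minority of teams that must choose for themselves.
    instance_type_override: Optional[str] = None
    requested_regions: Sequence[str] = ()


def resolve_instance_type(meta, catalog=DEFAULT_CATALOG):
    """Pick an instance type from metadata unless the team overrides it."""
    if meta.instance_type_override is not None:
        return meta.instance_type_override
    return catalog["memory" if meta.needs_lots_of_memory else "general"]


def resolve_regions(meta, all_regions=ALL_REGIONS):
    """Tier-0 and tier-1 apps go to every region; others keep their choice."""
    if meta.tier <= 1:
        return tuple(all_regions)
    return tuple(meta.requested_regions)


if __name__ == "__main__":
    api = AppMetadata(tier=0, needs_lots_of_memory=True)
    print(resolve_instance_type(api), resolve_regions(api))
```

Because apps only declare intent, swapping the catalog (for example when a newer instance family becomes the better deal) changes what every app gets without any team editing its configuration.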
Netflix runs in more than one region. [But I can’t say – 24:58] how many regions we run in. Suffice it to say it’s more than one [but less than – 25:01] all of them. And we’re a global company. So right now, if you [move – 25:09] deploy [… – 25:10] Netflix, you have to [… – 25:11] regions you go into. But if you’re a tier-1 app or a tier-0 app, you facilitate streaming. You’re going to be in all the regions. So why should we make you pick what regions you’re going to? We should just do it for you. So there’s this notion of global. And again, this is where this intelligent decision-making is going, so that you don’t have to worry about the regions. We’ll put you in all the regions that we’re in. [… – 25:34] that we’re doing. And you’ll see this with the pipelines, is that every team can create their own pipeline. But we’ve been doing this now for over six years, and Spinnaker has been around for over two years now. We have established best practices. We know good patterns, and we know bad patterns. So why not at an abstract level create something kind of like a base class? These aren’t base classes, but the notion of the base class is that people can plug into it. For example, if you are a tier-0 app and you think that x, y, and z is the way you should do deployments, you can define at the top level that everyone who’s a tier-0 app can inherit that behavior.
And then where it gets really interesting, I showed you that ACA task. And that’s where we need to deploy something [… – 26:20] a small amount of traffic to go to. And then [… – 26:23] if it’s a [… – 26:25], if it’s a no-go, we can back it out. Think of Spinnaker now as a paved road. It’s a paved path to production. If you plug into Spinnaker, you get all these things for free. ACA is a primary example of [one – 26:37] integrated, automated [… – 26:39] Spinnaker. Everyone in Netflix has to take advantage of it. But it was opt-in. You had to go in your pipeline and say, “Hey, I want to [… – 26:47] and add ACA, and I want to configure it.” And the configuration was kind of messy. You have to be an expert in ACA. But again, intelligent [… – 26:56]. If you have a base class, what we can say is everyone in Netflix [did – 27:01] ACA. And here’s your default configuration. So the next time you go to your pipeline, you have ACA. So instead of an opt-in, it’s an opt-out. And we can do this with other [… – 27:12] that standard path, whether it be [… – 27:15] testing, any number of tools that other centralized teams at Netflix are writing. Even Titus, there could be some things in Titus. We’ve got a standard Titus pipeline. People plug in. Everyone gets [… – 27:27]. So we keep innovating. [… – 27:34] Spinnaker. [… – 27:35] in the open-source world.
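The "base class" idea, where ACA becomes opt-out rather than opt-in, can be sketched as a shared template that teams derive their pipelines from. The template layout, stage names, and the default ACA duration are hypothetical; the talk describes the concept but not a concrete format.

```python
import copy

# Hypothetical template a central team might publish for tier-0 apps.
TIER0_TEMPLATE = {
    "stages": ["bake", "aca", "deploy"],
    "config": {"aca": {"duration_hours": 4}},  # assumed default configuration
}


def build_pipeline(template, overrides=None, opt_out=()):
    """Derive a team's pipeline from a shared template.

    Every template stage is included unless listed in opt_out, and
    per-stage defaults are merged with the team's overrides.
    """
    overrides = overrides or {}
    stages = [stage for stage in template["stages"] if stage not in opt_out]
    config = {}
    for stage in stages:
        merged = copy.deepcopy(template["config"].get(stage, {}))
        merged.update(overrides.get(stage, {}))
        config[stage] = merged
    return {"stages": stages, "config": config}


if __name__ == "__main__":
    # A team that does nothing still gets ACA with the default settings.
    print(build_pipeline(TIER0_TEMPLATE))
    # A team that has to skip it opts out explicitly.
    print(build_pipeline(TIER0_TEMPLATE, opt_out=("aca",)))
```

The design point is that the default path includes the safety check, so skipping it takes a deliberate action instead of adding it taking expertise.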
What I want to leave you with is this: join us on Slack, spinnakerteam.slack.com. We’re all there. Netflix is there. Stitch Fix is there. [… – 27:47] today. The Armory people are there. We’re all there, and we’re looking forward to innovating with you. If you have any questions, you can reach me on Twitter [… – 27:57] email. We’ll hang around afterwards. And I do want to thank everyone for coming [… – 28:01]. And I’ll turn it over to [… – 28:05].
Male 1: Any questions?
Andy: Oh, yeah.
Male 7: [… – 00:28:16] I assume that’s something that’s Netflix’s [… – 00:28:19]?
Andy: It is an internal Netflix tool that is not open-source; however, there are alternatives.
Male 8: There’s a probable [… – 00:28:30]. So could you talk a little bit about what is it [… – 00:28:35]?
Andy: Do you want to hear about [… – 00:28:40]?
Male 9: [… – 00:28:41].
Andy: Sweet. Yes. Very cool. Did you want to talk about it?
Male 10: Yeah. So, it looks like system matrix, service matrix, and also if you want it to be more sophisticated, you [… – 00:28:55]. You can look at it [… – 00:28:58].
Male 11: Makes sense. Is that something you guys are open-sourcing or that’s…?
Male 10: We’re thinking about it. We’re very early stage so we are looking at pilots to do and probably open source a part of it and close source another part but we’re open [source]. Thank you.
Andy: Any other questions?
Male 1: Please welcome Diana.
Diana: Thank you, guys. My name is Diana [… – 00:29:42]. I work on the data platform team over at Stitch Fix and today I’m going to be talking to you about how we made Spinnaker work in our infrastructure. So, originally, when I got asked to do this talk, I was like, yay, I’m going to show my pipelines and show how to use Spinnaker, kind of similar to Andy’s talk, have these slides of the pipelines that we have and how we use it, but it’s not [… – 00:30:21] yet. Sometimes things just don’t go according to plan. So, instead, I’m going to talk to you about how to install and set up Spinnaker for you. So if you have different infrastructure [than] Netflix uses, which was the case for us, it’s just a little bit more difficult to set up, and I will go into the details here. So, I’m going to start with talking about our infrastructure, how we deploy our apps, then I’m going to go into the really gory details about Spinnaker, and then I’ll finish with configuring authentication which was [… – 00:31:01].
Okay. So our infrastructure, pre-Spinnaker, everything is on AWS, so 100%, same as Netflix. We’re much smaller though and we have three peered VPCs. We have a prod VPC, a test, and then infra, pretty standard. All of your applications that you want to test out will go to test, production apps when they’re ready will go in there, and then infra is a special VPC where you can put stuff that both prod and test need, so things like Jenkins [... – 00:31:39] houses our RPMs [... – 00:31:42] on there, so everything’s on there and then Spinnaker would also go in there because it would need to deploy to both test and prod, and things like [Flotilla] which is our execution service that we’re plugging into Jenkins.
Okay, so our deployment pipeline, we stole it from Netflix, immutable server pattern, or immutable infrastructure. Again, the AMI is our immutable server. We package the code into RPMs. We’re on Amazon Linux, not Ubuntu. That’s a bit of a change that made getting Spinnaker working a little bit more difficult for us and I will go into that in detail later. So you bake the AMI from the RPM and then the deployment into ASG, set up a launch config from the AMI, create the autoscaling groups of the [... – 00:32:37] and put the Route53 entry on top of that. So, again, cool that I don’t have to explain the acronyms but do let me know if anybody wants an explanation.
Okay, so the process overview. I like to think of it as kind of a definition of an application and a repeatable deployment process. So if you want to create an application, you first create your spec, and the spec is just a recipe for packaging your code into an RPM. You also create your ELB for your application and you create a nice Route53 [... – 00:33:20] so that it’s not super ugly like the ELB [... – 00:33:25] normally are. And so this is your definition of the application, the ELB and the Route53. We call this scaffolding, the AWS scaffolding for your app.
So once you create that, you use this spec to build your RPM. You push it to your RPM [... – 00:33:42]. For us, it’s [... – 00:33:42]. You make the AMI, you launch the ASG, and you attach your ASG to the [... – 00:33:50] and voila, there you go, your application is live.
So, step one is building the RPM. We’re not really giving our users a very friendly interface for this. That’s probably going to change soon but we basically just give them a [... – 00:34:11] and we’re like, here you go. We template it a little bit but basically it looks like this. It’s a bit scary for people who’ve never worked with it before. We try to make it a little bit easier by kind of templating. Most of our applications are just [Python Flask apps] and so we kind of make those [intelligent] decisions for people and put it in a template. So you would create your spec file from the template, customize it a little bit, and then we just have a Jenkins Job to build the RPM, and push it to [... – 00:34:44]. So, again, the spec file seems pretty complex and it can get complex sometimes and so it’s scary for the user but it does make deployment really much easier down the line.
Okay, so, step two, you bake your AMI. So right now, pre-Spinnaker, we use aminator which is also a Netflix OSS. We use that to create the AMIs and I just created a very simple Jenkins Job. People just select the RPM that they want to bake into an AMI and then they click the button and there we go. So how does the AMI get baked? Aminator in particular will use volumes to bake the AMI. You can also just launch an instance and then install your RPM on the instance and then take a snapshot. That is one of the possibilities that Spinnaker actually gives you in addition to this volume [... – 00:35:41] but if you were going to do it with aminator, you just create the volume from the base AMI, you attach it, you [... – 00:35:48] into it, you install the RPM, and then you take a snapshot, convert it to an AMI and there you go.
So, pretty simple. We love this tool. And step three, this is the deploy. Again, you have your RPM. The RPM is baked into an AMI and then you create your launch configuration from that AMI, which basically says, okay, launch all of the EC2 machines from this image. That is your immutable server right there. Both of these are used to create the autoscaling group and usually we have at least two EC2 instances [... – 00:36:25]. If one goes down, you still have the other one and you have the ELB on the top. So the ASG is connected to the ELB. The ELB routes traffic to the EC2 instances inside and then you have the Route53 entry, something like spinnaker dot stitchfix dot com and then internet traffic comes in through there.
Okay, so, why did we choose Spinnaker? So, at Stitch Fix, we have eighty data scientists, and we’re still growing, and we have ten platform engineers. So when I started about a year and a half ago, we had twenty total and I think five engineers of the twenty. And so we grew rapidly and for us, it’s very important to have self-service applications. Otherwise, you just turn into tech support. So, what we did is we looked at Spinnaker and we were like, this is great. It has such a great UI. It can give you so much information. Like Andy presented earlier, we want our data scientists and engineers to be able to look at their applications, see what is deployed, and do it without our help, and so we were like, this is perfect. So we decided to go for it.
Okay, so now I’m going to go into setting up Spinnaker. This is going to be pretty technical. If you haven’t actually worked with Spinnaker, you might be confused but I kind of thought I’ll just give this talk and then you guys can have the slides for later. If you wanted to set it up, you can do so.
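Before moving on to the setup, the deploy step described earlier (a launch configuration created from the baked AMI, then an autoscaling group of at least two instances attached to the ELB) can be sketched as the two requests it boils down to. This is only an illustration and makes no AWS calls: the dictionary keys are modeled on the parameters of boto3's `create_launch_configuration` and `create_auto_scaling_group`, while the function name, naming scheme, default sizes, instance type, and all identifiers are hypothetical.

```python
def deployment_requests(app, version, ami_id, elb_name, subnet_ids,
                        instance_type="m4.large", min_size=2, max_size=4):
    """Build the launch-configuration and autoscaling-group requests.

    Nothing is sent to AWS; the returned dictionaries could be passed as
    keyword arguments to the corresponding boto3 autoscaling client calls.
    """
    if min_size < 1 or max_size < min_size:
        raise ValueError("sizes must satisfy 1 <= min_size <= max_size")
    name = f"{app}-{version}"  # hypothetical naming scheme
    launch_config = {
        "LaunchConfigurationName": name,
        "ImageId": ami_id,  # the baked AMI is the immutable server
        "InstanceType": instance_type,
    }
    auto_scaling_group = {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": name,
        "MinSize": min_size,  # two instances so one can fail safely
        "MaxSize": max_size,
        "LoadBalancerNames": [elb_name],
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }
    return launch_config, auto_scaling_group


if __name__ == "__main__":
    # All identifiers below are placeholders.
    lc, asg = deployment_requests(
        "demo", "v001", "ami-00000000", "demo-elb", ["subnet-a", "subnet-b"]
    )
    print(lc)
    print(asg)
```

The Route53 entry sits in front of the ELB and is part of the one-time scaffolding described in the talk, so it is left out of the per-deploy requests here.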
Okay, so I structured my talk more around the key differences that we have at Stitch Fix, the differences between our setup and the Netflix setup, and what that meant for us getting Spinnaker to work in our infrastructure. So, first, one big difference is we are on Amazon Linux instead of Ubuntu. So that was a big difference. We had to add RPM support to Gradle [… – 00:38:45]. I don’t think they were there at the time. They were almost there but they are there now. So, there’s that. We also use Nginx instead of Apache. We use Redis on AWS. I think this is the same as Netflix but there was like this one little config change that was kind of messing things up and I’ll go through that and then also we don’t really have a Cassandra laying around so we actually don’t use Cassandra and I know that Netflix, it has [… – 00:39:24] but when I tried implementing it at first, it didn’t [at that time].
Okay, so, again, we’re on Amazon Linux. That’s too bad for us but, funny story, we actually picked Amazon Linux because Netflix was on Amazon Linux earlier and we’re like, this is fantastic. We’re just going to use all their tools, right. So we did. We built a bunch of our own tools and we’re like, oh look, Spinnaker, fantastic. This is going to be on Amazon Linux and we’re like, oh, they’re on Ubuntu. They switched. So, yeah. Sucks for us.
Okay, so first thing is we have to add RPM support to Gradle [… – 00:40:00]. So I was not familiar with Gradle so I looked at it and I was like, woah, but actually it looks much more friendly than my spec file. That looks insane. So I was like, this is great. They pretty much had everything ready for RPM support. Just had to put in that [… – 00:40:32] and just put in all of your [requires] in there and basically what I did is I just built every single Spinnaker component because Spinnaker is like ten services. Every Spinnaker component created an RPM and I pushed it to our Artifactory and then this is the main [… – 00:40:57] file for Spinnaker that combines all of those, so it gets pulled in when you build your RPM. And then there’s this line at the bottom [… – 00:41:06]. You need to have that there. I believe now you might not because it got fixed and so now it’s super nice but at the time that I was doing that, it would build the RPM but then it would not install it. So I had to put that in and, yeah, this was super nice. This was pretty easy to get working. It is actually much nicer than, like, our way of doing spec files.
Okay, so, we use different startup systems. So we use System V. I just found out what it’s called but, anyway, to start up a service, you do service nginx start. You have your startup scripts [… – 00:41:55]. They just look like normal [… – 00:41:57] scripts with a start and a stop, pretty standard, and then you have the [… – 00:42:02] utility for starting [… – 00:42:04] and then you specify what level and what order. So that’s what we used at Stitch Fix. I noticed that Spinnaker uses Upstart which is newer and to start an application with that, [… – 00:42:21] Spinnaker and then you have [these conf files]. They’re not actual [… – 00:42:26] scripts. These are conf files and they live in [… – 00:42:28] and that’s an example of a conf file right there. So it’s kind of speaking in [… – 00:42:35] language and I was like, okay, no problem. Let me see if Amazon Linux has Upstart, and it did, and, you know, that was great and then I found out that it was five years old, and I was like, why? But yeah, it’s super old and I was like, okay, I’m going to try to make it work with that version and if I can’t, I’ll upgrade it and there are these two lines here. I’m sure they’re [needed], the [set group ID] and then the [set user ID] to Spinnaker. I saw that they were added kind of later on after I already started implementing it so I just kind of thought, okay, well, let me try removing them because the old version doesn’t support that and I did and nothing bad happened, I think. So you can try doing that.
Okay, so, again, we really love Nginx and so we use that everywhere and I was like, okay, well, let’s just keep using that for Spinnaker. So, this is what the Nginx config looks like, pretty standard. It’s very similar to the Apache config and we have the [… – 00:43:55] up there, [… – 00:43:56]. That points to the [… – 00:43:59] and we just have [… – 00:44:01] on the EC2 instance. You have Nginx that is sitting on [… – 00:44:09] there, and then this part got a little bit tricky, the name spacing, because Gate, which is the [API] portion of Spinnaker, that one needs its own name space there and so does Rosco because those two get called from the UI, I believe, and so this is how the name spacing works and all of the Spinnaker services are jammed onto one EC2 machine right now and so that’s part of the reason why you need this name spacing to see they’re running on different ports and so if you’re using Nginx and you want to have Spinnaker running on just one machine, this is what you would do.
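Editor's sketch: the namespacing idea above, where one front end on a single machine forwards requests to different local ports based on a path prefix, can be illustrated with a small routing function. This is not the Nginx config from the slides. Port 8084 for Gate is the one mentioned later in the talk; the Rosco and default ports, and the choice to strip the prefix before forwarding, are assumptions for the example.

```python
from typing import Dict, Tuple

# Path prefix -> local port of the service behind it.
# 8084 (Gate) comes from the talk; 8087 and 9000 are assumed values.
ROUTES: Dict[str, int] = {
    "/gate/": 8084,
    "/rosco/": 8087,
}
DEFAULT_PORT = 9000  # assumed port for everything else (the UI)


def route(path: str) -> Tuple[int, str]:
    """Return (port, upstream_path) for an incoming request path.

    The matched prefix is stripped so the service sees the path it expects.
    """
    for prefix, port in ROUTES.items():
        if path.startswith(prefix):
            return port, "/" + path[len(prefix):]
    return DEFAULT_PORT, path
```

For example, `route("/gate/applications")` sends the request to port 8084 with the path `/applications`, while a path without a known prefix falls through to the default port unchanged.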
Okay. So, for Redis, so at first when I was testing, I just had Redis just on the EC2 instance and I was like redeploying Spinnaker and I was losing everything [… – 00:45:15] I’m going to put it on AWS ElastiCache which was the plan all along but then I found out that Spinnaker was just spewing all of these exceptions. I was like, what’s happening?
Okay. So, AWS ElastiCache, Redis, it will not let you issue config commands which is apparently what Spinnaker was trying to do. I think it’s something that got added again. It wasn’t there immediately when I started implementing. So, you have to have Redis [… – 00:45:54] 2.8.0 but also because Spinnaker keeps trying to issue these config commands, you have to calm it down and tell it to not do that because the Redis that it’s using is secured. So this little portion in red configuration [… – 00:46:12] that made it stop but then that means that you have to configure it yourself. So I went onto AWS console and I had to create a whole new parameter group and then in there I put [… – 00:46:25] had a lot of help from the [Slack channel] there. I put that there and everything worked after that. So, yeah.
Okay, so, again, this might not be as relevant now because there is [… – 00:46:45] support which is something that we would have used if I did this…started implementing later but, at the time, there was no [… – 00:46:53] support and we didn’t have a Cassandra so I thought I’m going to [… – 00:46:57] it together real quick.
So, at first, I was like, okay, I’m going to have a cluster and I’ve never worked with Cassandra so I was like, okay, this might take some time, but instead, I was like, I’m going to set it up really quickly. I’m just going to have a single node for now and I just had a single Cassandra that was backed by EBS. So all of the data was being written onto the EBS. And [what] I was worried about is that, you know, if something happens to your EC2, if it dies, the EBS just detaches and you don’t lose your data. You do have down time though so that might be a problem but not really a problem for us right now because we’re pretty small.
And then this is the overview [… – 00:47:50] whole party over there on the EC2 instance. Everybody’s on there [… – 00:47:53], everyone’s in there. And again, we have the [… – 00:48:03] at the top, [… – 00:48:06] down into that ASG. Right now, it’s just one instance, one EC2 instance. Of course, later, I would scale it out. I would put all these services on their own machines and then they would probably have at least two EC2s inside each ASG but for now, to get it working, this is what it looks like and they’re all running on different ports there and then we have the Cassandra and Redis that they are accessing outside.
Okay. Yeah. This is the hardest part, authentication on Spinnaker. So, we didn’t really put authentication on Spinnaker to secure it per se. We wanted to have more, how do you call it, responsible usage. That would mean, I know your name and your name appears everywhere and so if you delete someone’s application, I know it’s you. It’s something that we’ve been really wanting, not to point the blame finger or anything but you do something in Jenkins, no one knows. So yeah.
Okay, so, first, we had to enable SSL and then get to the auth parts. So for authentication, we decided to use Google OAuth and client certificate, what’s it called, client certification. I know that sounds wrong. The [X509] certificates that you authenticate the clients with and that is for scripts that are trying to…our scripts were like automating the creation of pipelines and so is that something we would want to do and all the scripts need to get authenticated if you have auth turned on.
So there are a couple of dilemmas that I ran into. So first is where do I terminate SSL? I have so many choices, right? I have the [… – 00:50:17] which is what we do for most of our applications because they’re in the VPC and we have this legit [… – 00:50:27] that we put on the [… – 00:50:28]. So I can terminate it there, possibly, or I could just terminate it at Nginx level or I could terminate it down to the actual application. And so, I was a little bit confused at first and I was terminating at random places and things were not working but the thing that I learned was, well, the choice that I had to make more was that if you’re using a legit cert, I would terminate everything at the ELB level except for Gate. Then I would [… – 00:51:06] SSL because Gate is responsible for authenticating people and so I was thinking whether to use a real cert or a self signed cert and I ended up using self signed certs and there’s a really great tutorial on Spinnaker website, how to get that going, but then of course you get some problems with websites saying that, oh, your cert is not real and do you want to proceed? So, there’s things like that there but I got through it.
So I decided I’m going to use self signed certs. I’m going to put the self signed certs into my Nginx config because it needs them now. So you can see there in the red now Nginx is listening [on 443], SSL is turned on, and yeah I added a password file. It felt a little funky but that’s what the Nginx documentation said to do so I was like, alright, that’s okay, so the password for the certs, because without the password, it doesn’t start up automatically which kind of breaks everything because that means it doesn’t start up on [bootup] and I need it to start on [bootup]. And so you can see everything is quite different here. The Rosco name spacing is still the same. The server is listening here on [443 instead of 80]. I have an optional little redirect [80 to 443] over there and then I have a [… – 00:52:41] there. Before, I was just checking on  and it was just getting some static page and it was two hundred. That’s fine but now I have it on five thousand with a beautiful message. Yeah. So, notice how the Gate rewrite is gone. That has to do with OAuth redirects and the actual client authentication that [… – 00:53:10].
So, again, for Gate, yeah, there’s a lot of trial and error here but for Gate, I reconfigured the [… – 00:53:21] now is just passing it straight through. It’s not even going through Nginx. You can go through Nginx. There’s a special thing where it tells Nginx not to decrypt and just to pass through directly but I found out later. You might want to do that but, for me, I just said, okay, [… – 00:53:39] go in TCP8084. 8084 is the port that Gate is on so it’s just going directly to Gate and Gate was terminating and that was working. And another thing that I forgot to mention, for the Nginx, if Nginx is terminating [… – 00:53:58] for you, make sure that the [… – 00:54:01] is TCP443 to TCP443 because I [… – 00:54:05] at first and it was [… – 00:54:09] so then there was unencrypted traffic to Nginx, which didn’t work.
Yeah, so, self signed certs, spend some time on this. You’re going to be working a lot with the Java TrustStores because, in Spinnaker, Gate is really the service that talks to everything and so Gate has its own little portion in its config where it says, here’s my TrustStore and I trust non-legitimate certs that were made with this certificate authority and so if you put that there, then…we’ll backtrack a little bit. The certificate authority I created myself and then all of the certs that I was creating, I was creating with that certificate authority. So for the certs for Gate and the certs for my scripts that were calling Gate, that certificate authority is the same and as long as that certificate authority is inside the Java TrustStore, everything is good but things don’t work out like that always.
So, again, this is the server SSL portion down there. This is for Gate. This is how you configure it, the key store and the trust store. For us, it’s the same thing. I was still getting some exceptions and I was away on vacation recently and I kind of forgot a lot of things about Spinnaker and I came back and I was like, why was this not working? I don’t remember, but a useful hack if things are still not working for you, it’s still saying that the certificate authority is not valid, your Tomcats are not trusting it, then you can just go in and put in your CA into the default Java CA file but don’t overwrite it because that’s what we did first and then every single site that Gate went to, including AWS, was not legit anymore. So yeah, don’t do that. Just add it in there.
Okay, so the [next] dilemma is Google [auth]. The redirects will trample all over your carefully setup Nginx rewrites. So, to deal with that, I removed Namespacing for Gate and I bypassed Nginx. That’s why I’m not going through Nginx anymore. Earlier, remember I had the [... – 00:56:55] going directly through to 8084 directly through to Gate and the reason that was not working is because the redirect [... – 00:57:05], it had slash gate slash login in the redirect and the redirect only looks up the host and so the [slash gate] part was being dropped and it kept going to just Spinnaker slash login and so I just solved it by taking Gate out of Nginx and so there is no extra path there for it to lose when it does the redirect.
Yeah, so the last problem I had was I was like, great, everything is working and Google [OAuth] is working and now all I need to do is to have my script pass in the certs when it talks to Gate and I was like, cool, this is going to work, but, no, Tomcat completely ignored them. So it was like, I don’t know what you’re talking about. So, there is another setting here. This is actually…I don’t think this is in any documentation. I Googled for it for a while but adding that [... – 00:58:12] will make Tomcat actually look at the cert because, before, it was like, I don’t see a cert. I had to turn on all kinds of [... – 00:58:22] to see that and it was like I don’t see a cert even though the cert was being passed in. So, yeah, we have to add that in and then Tomcat is like okay cool and then it passes it through and everything gets authenticated. The authentication system that’s going to [... – 00:58:44] is layered so when you access Spinnaker, it will look like do you have a client cert that I can look at and if you don’t, then it will send you to Google [OAuth] if you’re using that. So, yeah, after I added that, everything was quite peachy and my scripts can actually create pipelines now and everything is nicely automated.
Okay, so, the takeaways and what we learned. So Spinnaker is actually pretty complex. It’s a system of many components and if there are differences that you have in your infrastructure then they will take a little bit of time to iron out. So there’s something just to think about it. It’s not like you’re going to get it running like overnight if you have a different infrastructure. Yeah, and then I also learned a lot about SSL and OAuth and Client Authentication and, yeah, I never [... – 00:59:52] that stuff before really but it was pretty fun just to get it working. So I had a lot of help from this [... – 01:00:01] Spinnaker people. Yeah. That’s all. We’re super excited at Stitch Fix to finally get Spinnaker into [prod]. That’s going to be very soon. It’s pretty much ready [... – 01:00:17] put on hold. Yeah, and then I’m in Spinnaker Slack and if you can spell my last name, it’s pretty hard, but anyway, that’s me. So, feel free to reach out. Yeah. That’s all. Thank you. Oh yeah. Oh yeah. Questions? Yes.
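Editor's sketch: the redirect problem Diana describes, where the slash gate part was dropped and the browser ended up at just slash login, can be reproduced with standard URL resolution. A redirect whose target is an absolute path is resolved against the host only, so any path prefix added by a proxy in front is lost. The host name below is a placeholder, and the exact redirect Gate issued is an assumption for illustration.

```python
from urllib.parse import urljoin


def follow_redirect(request_url: str, location: str) -> str:
    """Resolve a redirect target the way a browser resolves a Location value."""
    return urljoin(request_url, location)


# Gate reached through a proxy that adds a /gate/ prefix: the service itself
# does not know about the prefix, so it redirects to "/login".
behind_prefix = follow_redirect(
    "https://spinnaker.example.com/gate/applications", "/login"
)

# Gate reached directly on its own port (8084 in the talk): there is no
# prefix to lose, so the same redirect lands where the service expects.
direct = follow_redirect(
    "https://spinnaker.example.com:8084/applications", "/login"
)

print(behind_prefix)
print(direct)
```

The first URL resolves to the host plus `/login` with no `/gate`, which matches the behaviour described in the talk; taking Gate out from behind the prefix, as Diana did, removes the extra path that the redirect would otherwise lose.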
Female 1: Now that you have Spinnaker running in production, what is the connection to the business value of [… – 01:00:51] enabling your data scientists and platform engineers to do? I’m just trying to connect the dots.
Diana: Yeah, so, the question was what is the business value now that we have [it in production] for us? And so for us, that would be [greater] visibility into your applications. Right now, we have a homemade thing that’s just [… – 01:01:16] because we suck at building UI. So [… – 01:01:20] tools and it’s very hard to really see what’s going on, like, it’ll say, yeah, your application is deployed and we just have people go to the AWS console which doesn’t actually group things by application like Spinnaker does. So on AWS to look at your application, you have to go to like [… – 01:01:39] go to like instances. You have to go to all these different things and they’re not [… – 01:01:46] and it’s pretty difficult to do to really get a picture of what your app is doing. So, does that answer your question?
Female 1: Yeah. [… – 01:01:55].
Diana: Okay. Any other questions? Yes.
Male 12: I was just curious about the password [… – 01:02:00] understand what that password was.
Diana: You mean [poop]?
Male 12: [… – 01:02:09].
Diana: Oh. For the certs.
Male 12: Spinnaker [… – 01:02:17].
Diana: Yeah, yeah. It’s just a file with one word which is a password for the certs.
Male 12: Okay. [… – 01:02:25].
Diana: Yeah. Any other question? Okay, cool. Thanks.
Male 1: Thank you very much. And then continuing on from that question, one of the things that I’ve done in a previous life is change the SSL certificate of the [… – 01:02:52] remove the password so that [… – 01:02:56] Nginx will start up without the password [… – 01:02:59] if you have a pass phrase. So that sort of thing. Right. So, next up we have Andrew from Spinnaker telling us about how [… – 01:03:11] working with Titus. So please welcome Andrew.
Andrew: Hey, guys. My name’s Andrew. I’m an engineer on the Titus team at Netflix and I’m going to talk to you about how we use Spinnaker with the Netflix container cloud. So, as you might have inferred from Andy’s talk earlier, Netflix has a pretty robust cloud native, immutable microservice architecture that is built on top of AWS VMs and being an early AWS adopter, we’ve been investing in this for quite a while. What you see here is actually a picture from yesterday of the Netflix microservice in action [… – 01:04:05] one of our open source tools called Vizceral that gives you kind of a visual snapshot of the microservice architecture [… – 01:04:13] there’s a lot going on. It’s a complex system but we have [… – 01:04:18] VMs.
So why are we investing in containers? There’s kind of three key drivers for us. One is that it provides simpler packaging for both microservices and batch jobs. One of the others is it provides a very scalable, easy-to-use interface to a giant cloud resource [… – 01:04:43] resource dimensions of my application and go, and the other is it provides a consistent local developer experience, what I build and run locally is the same thing that’s running in the cloud and an interesting one thing that you’ll kind of see here [that you don’t see here] that’s in most container [talks] is about consolidation and cost savings and I think Andy had alluded to earlier, the key value for containers for us is moving fast. So it helps us move fast and that’s why it was worth investing in.
So, in the VM world today, you’re probably going to try and deploy a service. I’m going to pick some base AMI and Netflix has predominantly been a Java [... – 01:05:30] so that base AMI is very Java-centric. Supposing I want to make modifications to this thing, I’m making changes at the OS level and the OS customisation and things like that, I’m then basically rebaking an AMI and then if I want to go out and deploy this thing, everything I do from deployment to scaling and management [... – 01:05:30] via [AWS VM instance starts].
In the container world, [... – 01:05:58] faster and a little more flexible. What I start with is just a basic Docker or container base image and I can start really with whatever language I would like to have. So it works really well in polyglot scenarios at Netflix as we’ve grown, as we’ve taken on additional use cases, such as focusing on originals and studio production and those kinds of things, and as we continue investing in batch processing, we’re actually not really just a [... – 01:06:27] anymore. There are actually quite a few other languages. When I want to make changes to this image, I’m really just copying a few things into my image that I need and when I want to run this, I’m just saying what are the dimensions of my application. I don’t necessarily care about what the latest and greatest [EC2 type instance is] and all that kind of stuff, and then when I want to scale and manage this thing, it really just happens at the speed of process start. So it’s actually looking at a lot of others who are coming into this space now why am I going right into a container kind of environment because it makes sense.
But Netflix has this large existing ecosystem with VMs. So how do we make this transition from using VMs to using containers? It’s really not easy. The microservices built on top of Netflix are already built for the Netflix cloud platform. A lot of the other open source tools such as [... – 01:07:34] are all part of that existing Netflix cloud infrastructure. Additionally, the microservices are often pretty much built to be on top of AWS. So, core AWS concepts like [... – 01:07:52] security groups [... – 01:07:54] all these kinds of things are baked into the microservices themselves and having invested in this architecture for years, it is highly scalable, highly reliable, and has a lot of functionality. So, for a container cloud, it makes it difficult to just take kind of an off the shelf solution and plug it into the specific Netflix use case. So things like Kubernetes [... – 01:08:19] are rightfully so a little more geared toward building the whole platform for you. [You’re agreeing to fill the] application. You’re coming [on to] use containers. You’re going to need to have a way to solve networking and service discovery and all of those kinds of things whereas, for us, we have an infrastructure already. What we’re looking for is really just the run time.
So, what we did was we built a Netflix specific container cloud called Titus. It is actually built on top of Apache Mesos which acts as our cluster manager but also gives us the ability to control scheduling and container execution how we see fit. So, a couple of the areas we’ve invested in for scheduling are job management that allows us to co-locate service jobs, microservices [... – 01:09:12] batch processing jobs. Traditionally, those things are largely segregated. Additionally, the scheduling is built around being elastic, knowing it’s on [an elastic resource pool like] AWS. What that allows us to do is to know when we should autoscale, when we should scale back down and helps with scheduling decisions.
And then on the container execution side, there’s a large amount of investment in how we get the container to integrate with AWS and also how we get the containers to integrate with the rest of the Netflix ecosystem so that anybody running on this gets a similar experience to what they’re used to when they’re running in a VM.
So here’s a quick snapshot of where Titus is at today from a load perspective. What we’re showing on the top here is the Netflix [test] account. On the right side is number of containers that ran last week in our VPC and on the left side is the number of containers that ran last week in our classic. If you can’t tell from this, we’re in the midst of a migration from AWS classic to a VPC and these numbers probably roughly correlate to the percentages [complete in the migration – ... 01:10:32]. So we ran a little bit over a hundred thousand containers last week. At peak, we’re at about six hundred total AWS instances to run these containers. They are largely r3.8xls which are pretty big [beefy boxes]. I think they’re [... – 01:10:52] of memory and the workload that comes in is generally [... – 01:10:59]. As you can kind of see from the workload graph next to the numbers, we’re generally running at around [six hundred-ish] containers coming in at any one point in time [recursing] up to around five thousand at peak. And what we see here is a mix of batch and microservices and the batch services were larger, [the ones that come] onto the container platform first [... – 01:11:31] strategically onboarding microservices over time.
So here is kind of an overview of the Titus architecture. It follows a pretty standard replicated master model with a bunch of worker agents that have been running containers. So, the Titus [master] is kind of broken up into two key pieces from a scheduling standpoint. One is job managers which are responsible for the life cycle of a particular type of job. So for a microservice job and for a batch job, those two things [... – 01:12:15] different kind of life cycles and how they’re managed over time and [... – 01:12:20] job managers, one for each type of job allows us to make the [scheduler] able to work with each of them. And then we have a common resource management layer called Fenzo which is able to take all of the jobs that need to be scheduled, basically to look at the available resources out in the cluster and do a fit to place those things. Once we schedule a job to be run on a worker, it’s going to get passed down to what’s called the Titus Agent and Titus Agent is the VM that’s actually running containers for you. So it’s made up of a few key pieces. There’s some off the shelf bits, Docker for container execution. We’re using ZFS which is providing us with local storage [... – 01:13:07] and reservation and then the Mesos agent which is what’s doing the host resource management. The Titus executor is the kind of key piece that implements the logic to integrate the container run time with AWS and with the rest of the Netflix ecosystem. And it leverages a whole bunch of different other services running on the [... – 01:13:35] log management, setup networking, and provides some of the other things and we’ll go a little deeper into those later.
So that’s great. We have a Netflix and AWS integrated container run time and we have a robust well-vetted, scalable VM run time, so that’s good, right, but it’s actually kind of difficult for users when they come and they say, I want to deploy my application, what’s the right fit for them? Do I use containers? Do I use VMs? [When I] pick one or the other, they have different ways to interact with them to manage them. So it’s actually kind of a challenge when you’re living in a world where you have both a live VM run time and a live container run time. And so that’s kind of where Spinnaker comes in for us is it really acts as a common control plane across all of these run times. So, I think Andy earlier had referred to it as a pane of glass.
It’s the Netflix kind of single pane of glass where you generally go to manage your applications. So, for a container run time, you kind of need to be able to fit into that world. So, we’ll talk a little bit about how we’ve worked to get Titus to integrate with Spinnaker. So some of the things that it provides are it gives users the same common interface they all know and love. When I go and configure my application, when I go and manage it, it’s all the same. So, they have a common application configuration. I think some of those little screenshots [... – 01:15:26] earlier configuring application, the same deployment strategies, [... – 01:15:31], all those kinds of things exist in now both the VM world and the Titus world. The ability to understand the health of your services is made common because they’re integrated with Spinnaker. So I can see if my application is healthy whether it’s running in the VM or whether it’s running in a container [... – 01:15:56] provide a common way to configure security for these things. So in addition to being useful for the users as a way to sort of understand in one way how to deploy and manage your applications, it actually also serves as a great way to have best practices on the new container platform for applications that were getting deployed through Spinnaker. They’re getting all of the best practices and [good] defaults that Andy was describing earlier. So, we’ll go through a quick example of setting up a Titus container running through Spinnaker. So Titus is implemented in Spinnaker as a cloud driver. So it was pointed [out] earlier that Spinnaker has a pluggable model [... – 01:16:44], so it also works with Titus. So when you come and you want to deploy something internally at Netflix, you can choose do I want run it directly on EC2 or do I want to run it on Titus, which is a little transitive because it’s also running on EC2. 
So I can [select] my cloud provider and now I’m going to want to go and configure my application. So I’m going to say, hey, my application’s image might be sitting in this [registry]. So the example we’re going to run through here is with an application called OSSTracker. It’s actually a real Netflix app that we use internally to track the health of our open source projects. So it runs and it looks at the number of [PRs] to apply [... – 01:17:40] how many open [... – 01:17:42] issues there are to it, when was the last time somebody committed to it, things like that to be able to score and give you a health of your open source community. So this thing runs on Titus and we’re going to run through an example of deploying it here. So my image is running…is sitting in a [... – 01:18:00] registry. My organization is called Netflix OSS and I’m going to give it my image name and I’m even going to configure a trigger which is going to tell Spinnaker next time I push a new version of this image, please kick off a deployment pipeline for me.
So, when I configure this deployment pipeline, I’m going to specify where I want it to run. In this case, it’s going to run in Titus in our test VPC [… – 01:18:32] and a lot like I would with a VM, I’m going to configure [… – 01:18:39] some details to how to run and then I’m going to give it a deployment strategy. Here I’m selecting do a [… – 01:18:45] deployment for me and rather than now going and selecting an AMI and setting up a bake and picking an instance type to run, I’m just going to say what resources my application needs and that it’s going to need two CPUs, four gig of memory, four gig of disk, and allocate it a unique IP address. I’m going to say run three of them and then down at the bottom I can do some more advanced configuration in setting up IAM roles, security groups, that kind of stuff.
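Editor's sketch: the resource request described above (two CPUs, four gig of memory, four gig of disk, a unique IP address, three copies) is a declaration of what the application needs rather than a choice of instance type. A minimal model of such a request might look like the following; the class and field names are hypothetical and are not the Titus or Spinnaker API, and the image name is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerResources:
    """What one container needs; hypothetical field names."""
    cpus: int
    memory_gb: int
    disk_gb: int
    allocate_ip: bool  # ask for a unique routable IP per container


@dataclass(frozen=True)
class JobRequest:
    """A request to run several identical containers of one image."""
    image: str
    resources: ContainerResources
    instances: int

    def total_cpus(self) -> int:
        return self.resources.cpus * self.instances

    def total_memory_gb(self) -> int:
        return self.resources.memory_gb * self.instances


# The values from the talk: 2 CPUs, 4 GB memory, 4 GB disk, unique IP, 3 copies.
request = JobRequest(
    image="example-org/example-app:latest",  # placeholder image name
    resources=ContainerResources(cpus=2, memory_gb=4, disk_gb=4, allocate_ip=True),
    instances=3,
)
```

With these numbers the whole job asks the cluster for six CPUs and twelve gig of memory in total, and the scheduler decides which hosts that lands on.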
So, Spinnaker will go and kick off my pipeline, deploy my application, and now it’s running and, here, I can actually see a view where Spinnaker is showing me multiple server groups, each of [... – 01:19:42] with multiple instances of my application running and I can actually see each [... – 01:19:48] is green. It’s telling me it’s healthy and that’s because Spinnaker is integrated with the Netflix health check infrastructure and service discovery infrastructure. So it has a way of already knowing [... – 01:19:58] my application’s healthy even though it’s running in a container and then it’s exposing some amount of information [... – 01:20:08] a task. So we can see here there’s a Titus task ID which you can kind of think of as akin to your AWS [... – 01:20:17] instance ID. It’s unique [in] describing your application running and here we’re showing the host IP address. We’re showing the actual IP of the VM your container is running on and the container IP is the actual unique routable IP address that has been assigned to your container. And then, I don’t know, for anyone who’s already familiar with it, Netflix has a feature called Chaos Monkey which is basically a way to in production test the reliability of your application. So, you can basically set Chaos Monkey to come in and to just terminate instances of your application running in [... – 01:21:03] running in prod, and basically through real exercise make sure you’re available. So even within Titus, you can go to Spinnaker, you can configure for your application how I want Chaos Monkey to be working with my application. So, for a lot of users, it really provides a very natural way to go from this same workflow in a VM to a workflow in a container. So, there’s a bunch of work that had to get done on the Titus side to make it a more natural integration with Spinnaker. One of the big ones was how Titus organizes its own job and task constructs. 
So Spinnaker has this notion of server groups and instances and in the EC2 world, this would map to an ASG and an instance within your ASG. In Titus, that maps to a job and a task. So a job contains the definition of what your container should be doing and then a task is a running instance of that job. So, in addition, we support the notion of stacks and details and versions for your server group that are able to naturally fit into the model that Spinnaker’s going to want to display. After I do a [... – 01:22:31], I’m going to have a new Titus job. That new Titus job is going to be in the same stack as my previous and it’s going to have a new incremented version. Additionally, we added support for labeling and [... – 01:22:47] of jobs where you can attach arbitrary key [... – 01:22:50] to your job and part of the driver for this is from that big picture Titus overview earlier. [... – 01:23:00] so Spinnaker isn’t the only entry point into Titus. So [... – 01:23:06] running in Titus that Spinnaker is not directly managing. So with this, Spinnaker can basically find out of all the services running, which are the ones that are controlled by me and then which are the ones that I should be displaying and managing and showing to users. And in a way, there is a common job scaling API. So when somebody comes from EC2 and they’re used to ASG [... – 01:23:30] configure my application that way, Titus allows each job to be controlled in a similar fashion so I can set my new [... – 01:23:42] for those [as well]. So, I mentioned earlier the microservices that are doing your running on Titus are built with the notion of integrating with the Netflix platform. Additionally, Spinnaker integrates with the Netflix platform. So there is a bunch of work that we had to do on the Titus side to make this a smooth integration. 
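The stack, detail, and incrementing version that Andrew describes can be sketched as a small naming helper. This is only an illustration: it assumes the `app-stack-detail-vNNN` naming layout commonly used for Spinnaker server groups, which the talk does not spell out for Titus jobs, and the function names are hypothetical. It does not handle what happens after version 999.

```python
import re
from typing import NamedTuple, Optional


class ServerGroupName(NamedTuple):
    app: str
    stack: Optional[str]
    detail: Optional[str]
    version: int


# Assumed layout: <app>[-<stack>[-<detail>]]-vNNN (not confirmed by the talk).
_NAME = re.compile(r"^(?P<cluster>.+)-v(?P<version>\d{3})$")


def parse_server_group(name: str) -> ServerGroupName:
    """Split a server group (or Titus job) name into its parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a versioned server group name: {name!r}")
    parts = match.group("cluster").split("-", 2)
    app = parts[0]
    stack = parts[1] if len(parts) > 1 and parts[1] else None
    detail = parts[2] if len(parts) > 2 else None
    return ServerGroupName(app, stack, detail, int(match.group("version")))


def next_server_group(name: str) -> str:
    """Name for the next deployment: same app, stack, detail; version + 1."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a versioned server group name: {name!r}")
    return f"{match.group('cluster')}-v{int(match.group('version')) + 1:03d}"
```

Under these assumptions, a new deployment of `osstracker-test-v009` would land in the same stack as `osstracker-test-v010`.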
One of the key ones is when a container runs through Titus, we’re actually setting up an environmental context for that container that looks a lot like what we have in the VM world. So we’re going to be setting information like what your instance ID is, we’re going to be setting up what region and environment you’re running in, those kinds of things, and all the applications that are running expect to see those things and having those there basically [makes for most applications a] very seamless transition to being able to run and work in a container. Additionally, we had to make changes on some of the platform services to make them what I could call container Titus aware. So, for example, we’re running an application in a container and it has its own IP address. It’s not the IP address of the instance it’s running on. So when our [service discovery client] is going to go and register and say, hey, I have [... – 01:25:10] IP address for this microservice, it needs to be able to know I’m registering the IP address for a container with the service discovery and the service discovery is to be able to then [... – 01:25:23] that IP address rather than the IP address of the instance you’re on. Similarly, our [health check] infrastructure had been very easy to build. So there were assumptions baked in that somewhere I could resolve your IP address to an [... – 01:25:40] ID somewhere. So things were changed there to make that also container aware. We have a system called [... – 01:25:52] which basically provides a kind of a [... – 01:25:56] cloud infrastructure, the inventory of what you have currently running and deployed and that is similarly made Titus aware so that when I go to my Spinnaker control page and I can see my application, it’s going to give me links to my [... – 01:26:14] information. So I can just have that right there [... – 01:26:16] still showing me what’s container level information. And another big one was Atlas which is our open source [... 
– 01:26:28] system. It is providing both application level insights as well as insights onto the machine that you’re running on. So in a VM, that means you’re getting information about your application and getting information about your VM. When you’re running through Titus, you’re getting information about your application but you’re actually getting information about the container that you’re running in, not the whole host itself. So if I’m running on [... – 01:27:03] microservice and I want to see [... – 01:27:08], I don’t really want to see [... – 01:27:10] that’s just [pinned] on CPU. So, [I’ll show here with] that same Netflix OSSTracker, what that looks like. So, the application on the far side here is being selected as OSSTracker and then I’m actually able to come down and see basically per container information. So I’m able to look at processing time for just that application’s container. If you recall from the Spinnaker configuration [... – 01:27:46] we were deploying three of these things. So what you see here is three Titus containers all running, [... – 01:27:54] this is all relative to that container. We’re using Intel’s [... – 01:28:00] system to do this and so now we’re pretty close to having parity on system [metrics] for both the application and the container. So, just basically running a container on EC2 is very much [not] the same as running directly on EC2. So there was a whole bunch of things that were done on our end to get this integration to be a bit more seamless. I mentioned earlier applications running on Netflix are generally pretty integrated with AWS. So, that [level of support] has to be there for a large [... – 01:28:47] of applications. So, we’re actually able to basically take your container and give its own VPC IP address, do network isolation for bandwidth around that, configure common AWS concepts like security groups and IAM roles [... 
– 01:29:07] task or container level, and then similarly take the AWS [metadata] information and then proxy that to make that container specific. We also have storage isolation. I mentioned earlier we’re using zfs that provides capacity quotas and reservations, and we have integration on for volumes to allow…you can attach zfs volumes to your container and mount in appropriate [... – 01:29:43] drivers and things like that when you’re [running – ... 01:29:45] jobs. So, I’m going to dig a little bit deeper into kind of some of the under the covers of how Titus is doing some of this. So, we’ll take a quick step into what is getting done to set up the VPC networking. So in this example here, we have an EC2 VM and it’s running three Titus tasks. Four Titus tasks, sorry. So, take a look at Titus task three. It’s running in this container which is [... – 01:30:21] and it’s requesting to be using security group Y and security group Z. So how we get this done is we have what’s called a pod root and it’s kind of akin to Kubernetes’ [... – 01:30:35] if anyone’s familiar with that and basically what it’s doing is it’s setting up a network environment for you and then we’re sharing network names basically with your application in this pod root container. So, it’s basically setting up when you’re requesting a routable IP address. We’re actually attaching an ENI to that AWS instance and then we’re setting up routing rules which are basically routing all of your containers’ traffic through that ENI. We’re then applying the security groups that you need to that ENI so that all traffic coming in and out [... – 01:31:10] security group’s policies apply to it. We actually get to leverage a feature of the ENIs for scenarios like considering task 1 and task 2 [... – 01:31:25] both applications wanting to use an IP address and wanting to sit behind security group X. 
So, because both of these applications are trying to communicate through the same security group, we can actually create one ENI for both of these tasks then actually request secondary IP addresses off of that ENI and then basically give each of them their own unique IP address without having to attach two ENIs. Similarly, when you have an application, say a batch job, that is going to be loading data from [... – 01:31:58], it might not need a routable IP address but I still want to make sure it’s not going out and just, you know, reading the entire Netflix upcoming movie catalogue or something like that, right? So, we still want to have security groups around this. So we actually still will create an ENI for those tasks and apply security groups to them but then we’ll basically set up routing rules that the IP address for that ENI is not routable to that container. Similarly, if you look on the far side here, we [... – 01:32:35] what we call the Titus EC2 [metadata] proxy. It’s basically a proxy service that sits on the VM and what it’s doing is it’s intercepting all of the requests you’re normally making to the AWS [metadata service] and it’s actually taking that and then turning that information into per container or per Titus task information. So when you ask about your identity, you’re getting back instance identity that tells you your Titus task. When I ask about, hey, I want to use this particular IAM role without granting or getting access to the host IAM role or whatever IAM roles might be attached to the host at the time, the [metadata] proxy is actually able to [... – 01:33:18] into the specific [metadata…cert, sorry, IAM role] that you need and then give you back just the keys for your application. So, in addition to doing a bunch of work to make the integration with Spinnaker better, Titus actually gets to leverage Spinnaker a little bit. 
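The ENI sharing described above, where tasks behind the same security groups share one ENI and each gets its own secondary IP address, can be sketched as a simple grouping step. This is a minimal illustration, not Titus code: the `Task` and `EniPlan` types and the `plan_enis` function are hypothetical, and the sketch leaves out IP allocation, routing rules, and the non-routable case.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Task:
    task_id: str
    security_groups: FrozenSet[str]


@dataclass
class EniPlan:
    # One ENI carries one set of security groups; every task listed here
    # would get its own secondary IP address on that ENI.
    security_groups: FrozenSet[str]
    task_ids: List[str] = field(default_factory=list)


def plan_enis(tasks: List[Task]) -> List[EniPlan]:
    """Group tasks so that identical security-group sets share a single ENI."""
    by_groups: Dict[FrozenSet[str], EniPlan] = {}
    for task in tasks:
        plan = by_groups.setdefault(
            task.security_groups, EniPlan(task.security_groups)
        )
        plan.task_ids.append(task.task_id)
    return list(by_groups.values())


tasks = [
    Task("task-1", frozenset({"X"})),
    Task("task-2", frozenset({"X"})),
    Task("task-3", frozenset({"Y", "Z"})),
]
plans = plan_enis(tasks)
```

With the three tasks from the example in the talk, tasks 1 and 2 end up on one ENI for security group X and task 3 gets its own ENI for security groups Y and Z.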
So, one of the things we’ve used it for is actually solving a challenge of how can we safely decommission old Titus hosts? So, hopefully this shouldn’t be a surprise to you by now. We actually use Spinnaker to deploy and manage Titus and here we’re showing two [server] groups of Titus agents which are the VMs that are running the containers with two different [... – 01:34:11] versions. So we have this server [... – 01:34:14] has a really great feature or fixes a bug in [... – 01:34:27] latest and greatest. Well, we had a bunch of microservices that were running and they were running on [... – 01:34:35] and they’re still there. They run forever. So, how are we going to be able to eventually decommission this [old] thing? We could basically just wait until maybe they all hit some sort of [... – 01:34:48] but actually what we get to do is leverage Spinnaker for some of this. So, originally, Titus had a kind of one size fits all approach to how it’s going to solve this. Well, we had services running on these old decommissioned clusters. Let’s just kind of kill them and let them [restart in] a new cluster and we don’t really have a lot of insight into what this application is, how it’s configured. We don’t really have any insight into did this restart actually work? The container is running. You don’t know if your application is healthy, is it taking traffic, is it passing health checks, all that kind of stuff. So, actually, we went with an approach that was, well, Spinnaker actually is the one who knows all of this. It knows about the applications, it knows about how they should be deployed, and it knows about how they’re healthy and when they’re healthy. So, actually what happens is now when we’re running a migration and we want to decommission an old cluster, we get to actually say, hey Spinnaker, we have applications X, Y, and Z that are on this old cluster. Can you please execute a migration pipeline to move these things onto our new Titus agents? 
So this gives users a way to say, okay, rather than just arbitrarily killing my tasks however you wanted to, I can configure a pipeline that makes sense for my application. I get to reuse a whole bunch of the existing built-in Spinnaker pipelines that everyone’s used to and understands and then when we actually start to do this migration, we can actually see the health of the applications as we’re doing the deploy and say, okay, you know, [... – 01:36:38]. We can maybe stop the migration and go back and fix that. So, we’re still in early days of where we are with our Spinnaker integration. I think Andy talked a lot about their roadmap and where they’re going to go and [... – 01:36:58] their own roadmap but there’s a lot of that roadmap that is kind of coupled with Spinnaker and making sure that integration continues to work and that Spinnaker still becomes kind of the main driving force for people being able to use Titus. So, a couple of the things that we’re working on that are impactful for Spinnaker are we’re doing a large kind of API rewrite, basically bringing our own API up to a more maintainable and [forceable API]. We’re also moving towards implementing a couple of key features on our end, things like capacity guarantees so that when an application, say, a microservice is getting configured [... – 01:37:38] be able to say I just need to be able to scale to this many instances at peak so [we can] actually guarantee that capacity for them. And similarly, when, today, [... – 01:37:50] Titus has a way that I can [... – 01:37:55] and scale my instance. It’s all [... – 01:37:58]. So we’re building in [all of those] scaling capabilities that we’re hoping you’re going to get to be able to leverage and look a lot like how applications are autoscaling today in VMs with Spinnaker. Another use case beyond just microservices and batch processing that we have growing at Netflix is stream processing. 
So, a lot of these jobs are coming on, they can run in containers but they have a kind of a different organization to them. They tend to be [... – 01:38:29]. They tend to have a worker master model, things like that, so it changes a bit how Titus is [... – 01:38:36] those jobs and if we want them to be managed well through Spinnaker, that’s going to have some integration and [touch points] there. And then kind of one final but important piece is Spinnaker is a great platform [... – 01:38:53] Netflix for running microservices but we still have a lot of batch frameworks that are submitting batch jobs at Titus or [... – 01:39:02] other things but they’re all kind of splintered and fragmented and there’s no kind of common control plane for where to run these things and they all kind of have to reinvent the wheel of are things healthy, how do I [deal with] what’s running, all the things that kind of Spinnaker has logically already solved, so, looking at how we can have Spinnaker help out in areas with batch processing. So, with that, I’m happy to take questions. And there’s [... – 01:39:40] if you want.
Male 13: When are you going to open source Titus?
Male 14: There’s [… – 01:39:47] if you want.
Andrew: Yeah. We’re focused on our Spinnaker integration right now.
Male 15: So what type of [… – 01:39:58]?
Andrew: Right now, it’s Docker.
Male 15: Docker.
Male 15: [… – 01:40:07].
Andrew: Yeah. We have other frameworks that are Mesos-based and [… – 01:40:12].
[Male 15]: And how do you handle the security and [development issues with] Titus?
Andrew: So I talked about it a little bit. We have basically mechanisms for giving VPC IP addresses to containers and then being able to leverage back to [… – 01:40:36] security groups and IAM roles, so basically take the AWS security [… – 01:40:40] and then apply that at a per container level.
Male 16: So, on the [… – 01:40:48] question, are you [… – 01:40:50]?
Andrew: It’s not intercepting all calls. It’s intercepting all calls that would go to the [metadata] proxy.
Male 16: Okay.
Andrew: And then for those, it’s looking at the ones that are requesting security credentials and then looking at basically [… – 01:41:16] information about what IAM roles this job should be accessing and then [… – 01:41:24] into that and then giving your container [back] credentials just for that job.
Male 16: Okay.
Andrew: But just for that IAM role rather.
Male 16: Also, I had a question [… – 01:41:36]. I was wondering why you found it necessary to use zfs address for [… – 01:41:39] and which brand of [… – 01:41:42] are you using?
Andrew: So it’s running today on Ubuntu [… – 01:41:49] and we’re in a migration to [… – 01:41:51] where it’ll become part of the platform.
Male 16: Okay. [… – 01:41:56] zfs.
Andrew: Yeah. So it’s [… – 01:41:58] zfs on Linux and basically it provides a pretty simple way to set up both reservations and quotas so that we can actually say for this application you’re going to get a bounded amount of, say, a hundred gigabytes of disk storage, you’re going to get five hundred, and then we have a fairly simple way to kind of [set it up and enforce it].
Male 16: [… – 01:42:20].
Andrew: Yeah. Those are options we could have gone with too. I mean, it’s not the only one that does [quotas] so…
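As a rough illustration of the reservations and quotas mentioned in this exchange, the sketch below builds (but does not run) a `zfs create` command that sets the standard `quota` and `reservation` properties on a new dataset. The dataset name and the helper function are hypothetical, and the talk does not say how Titus actually invokes zfs.

```python
from typing import List


def zfs_create_command(dataset: str, quota_gb: int, reservation_gb: int) -> List[str]:
    """Build the argument list for creating a dataset with bounded storage.

    quota caps how much space the dataset may use; reservation is the
    amount of space guaranteed to it.
    """
    if reservation_gb > quota_gb:
        raise ValueError("reservation should not exceed the quota")
    return [
        "zfs", "create",
        "-o", f"quota={quota_gb}G",
        "-o", f"reservation={reservation_gb}G",
        dataset,
    ]


# Hypothetical dataset name for a single container's storage.
command = zfs_create_command("titus/task-1234", quota_gb=100, reservation_gb=100)
```

The resulting list could be passed to `subprocess.run` on a host where zfs is installed.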
Male 17: A couple of questions. One is [… 01:42:37 – Fenzo] and then Mesos. What is Fenzo really doing [… – 01:42:44]?
Andrew: So Mesos is for us like a cluster manager. It’s basically the one that knows what all the resources in the cluster are and is presenting them to the Mesos framework which is Titus [… – 01:42:59] resources are which then presents the offers. So what [Fenzo] is doing is it’s basically the piece that you can kind of think [… – 01:43:06] or resource placement where we have job managers that are deciding what should be running [and then] submitting these things to [Fenzo] to say I want to run this. [Fenzo] is then taking the information from Mesos that says here are all the available things in the cluster and basically matching them. So you can think of the input basically being offers comma jobs to run and the output being a matching set that says this portion here running X, Y, Z, this should be running A, B, and C, and then we can pass that information down to Mesos to have it actually execute on that and pass that information down to the Titus agents.
Male 18: [… – 01:43:48] that will also make the callbacks to us to do the autoscaling so it also helps with the elasticity of the underlying resource pool as well.
Andrew: Yeah. It has built-in concepts that are [… – 01:44:01] that kind of thing. So if it knows I can autoscale and here’s the list of jobs coming in, I should [… – 01:44:10] make that decision. It can say [I’ve been] idle and I see a whole bunch of idle hosts. I should scale down, things like that.
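The matching step Andrew attributes to Fenzo, with offers and jobs as input and a placement as output, can be sketched with a deliberately simple first-fit loop. This is an assumption for illustration only: the talk does not describe Fenzo's actual placement logic, and the `Offer`, `Job`, and `match` names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Offer:
    host: str
    cpus: float
    memory_mb: int


@dataclass
class Job:
    name: str
    cpus: float
    memory_mb: int


def match(offers: List[Offer], jobs: List[Job]) -> Tuple[Dict[str, List[str]], List[str]]:
    """First-fit: place each job on the first host with enough free resources."""
    free = {o.host: [o.cpus, o.memory_mb] for o in offers}
    placement: Dict[str, List[str]] = {o.host: [] for o in offers}
    unplaced: List[str] = []
    for job in jobs:
        for host, (cpus, mem) in free.items():
            if cpus >= job.cpus and mem >= job.memory_mb:
                free[host] = [cpus - job.cpus, mem - job.memory_mb]
                placement[host].append(job.name)
                break
        else:
            # No offer can hold this job; a real scheduler might use this
            # signal to ask for the cluster to scale up.
            unplaced.append(job.name)
    return placement, unplaced
```

The jobs left in `unplaced` are the kind of signal that, per the exchange above, could feed an autoscaling decision.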
Male 19: [… – 01:44:18] you guys are using the service discovery pattern, is there any future plan to support like a [… – 01:44:26]?
Andrew: Well, you know, the [… – 01:44:37] AWS ELB was able to actually [… – 01:44:39] so one challenge we actually had today is if you want to use an ELB, the ELB expects to be given AWS [… – 01:44:51] instance [IDs]. When you’re running a container in Titus, you’re just one of many things running on this instance and so what we actually want the ELB to do is to say, here’s an IP address that should be your [… – 01:45:03]. So we’ve had some discussion with AWS around that kind of functionality.
Female 2: [… – 01:45:12]?
Andrew: Yeah, it’s the same thing [… – 01:45:18].
Male 20: If we move to a point of scheduling ports as opposed to scheduling IP addresses, we probably could use the [… – 01:45:26] but we’ve decided that, for the heterogeneous nature of the applications we have at Netflix, giving them a full IP stack is much better than scheduling ports and asking them to do mapping. So I second that. It would be awesome if the [ELB] would be able to [do it], not only heterogeneous [ports] but also heterogeneous IP addresses [… – 01:45:45].
Male 21: [… – 01:45:48].
Male 20: Thanks, man.
[Male 21]: Can we have it tomorrow?
Andrew: Yeah [… – 01:45:52].
Male 22: [Doesn’t the new application – … 01:45:55]?
Male 23: Right but when you’re [… – 01:46:02] you have to provide [… – 01:46:05].
Female 3: [… – 01:46:12].
Male 20: Right. If you go back to Andrew’s diagram of the networking, you can register to all of the ENIs.
Male 20: Not one of the ENIs and one of the IP addresses.
Andrew: Cool. Thanks, guys.
Male 1: Alright. Thanks, everyone, for coming. There’s still plenty of pizza and drinks [… – 01:46:39] don’t have to leave until half an hour or so. Another round of applause to all the sponsors and [… – 01:46:47] also want to thank John. We did this as a sort of a joint meet-up group, so, a special round of applause for John. That’s it. Thanks, folks.