Oct 20, 2021 by Jason McIntosh
One of the questions that comes up a lot is how you monitor Spinnaker itself. Not the apps Spinnaker is deploying, but Spinnaker itself and how it’s performing. This question has a lot of different answers. There are a few guidelines, but many of the answers are the same as for monitoring ANY application: you monitor metrics, logs, and performance indicators, trace transaction flows, and try to capture all the data you can about “events.” Beyond that, though, it’s about understanding the applications that make up Spinnaker. There are different things to monitor for each of the services that make up the Spinnaker application, and that’s the real answer to how you monitor Spinnaker – monitor the services in Spinnaker. Let’s dive into some of the basics…
Note: for more detailed thoughts on monitoring concepts, read blogs at Honeycomb.io regarding Observability, numerous other articles on “monitoring” vs. “observability”, and the history of monitoring including Nagios, Grafana, Prometheus, and various platforms.
As mentioned, Spinnaker is not a monolithic application; it is multiple services that combine to make Spinnaker. These are Spring Boot applications (aside from several Armory apps that use GoLang). Java has had multiple monitoring solutions over the years in various forms, and some of these still work well. The first “real” monitoring was with JMX (Java Management Extensions) via RMI. There were numerous agents that would query remote JMX beans for data. This still works today on most Java services, as many monitoring configurations also concurrently register JMX beans. (For example, Tomcat registers a lot of JMX beans as part of its bootstrap, and most newer libraries support JMX natively.) This provides detailed monitoring of the JVM as well as remote access and manipulation of the running environment. The BAD part of JMX is that you have to run your JMX client locally or through RMI, in addition to the overhead and complexity that comes with JMX.
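To make the JMX approach concrete, here is a minimal sketch using only the JDK that reads heap usage from the same platform MBean a remote RMI-based agent would query:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;

// Reads the JVM's heap usage through the platform MBean server -- the same
// "java.lang:type=Memory" bean a remote JMX agent would query over RMI.
public class JmxPeek {
    public static long heapUsedBytes() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName memory = new ObjectName("java.lang:type=Memory");
        CompositeData heap =
            (CompositeData) server.getAttribute(memory, "HeapMemoryUsage");
        return (Long) heap.get("used");
    }

    public static void main(String[] args) throws Exception {
        // A remote client would attach via a JMXServiceURL over RMI instead,
        // which is where JMX's operational overhead comes from.
        System.out.println("heap used bytes: " + heapUsedBytes());
    }
}
```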
Spinnaker started with a different library called Spectator (pioneered by Netflix) to bypass a lot of the JMX headaches. Spectator is still in use today and is the default metrics solution for Spinnaker services. It operates very similarly to Prometheus, where a remote system scrapes your metric data for collection into a time-series database. The problem here is that unless you are Netflix and running their tech stack, this isn’t the most useful solution. Instead, a sidecar called the monitoring daemon was implemented to pull the Spectator metrics and feed them to any system that needed the data. However, there were some problems with this implementation (which many of our customers know in intimate detail). The sidecar was written in Python, unlike every other service, causing extra maintenance overhead. Here’s an overview of how it works:
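The daemon’s extract-and-transform step can be sketched roughly as below. The payload shape here is a simplified stand-in for the real Spectator response, not its actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch of the transform step the monitoring daemon performs:
// flatten a Spectator-style payload (metric name -> tag set -> value) into
// line-oriented strings a push-based backend could accept. Every 30 seconds
// this runs over the full metric set, which is where the cost comes from.
public class MetricEtl {
    public static List<String> transform(Map<String, Map<String, Double>> metrics) {
        List<String> out = new ArrayList<>();
        for (var metric : metrics.entrySet()) {
            for (var tagged : metric.getValue().entrySet()) {
                out.add(metric.getKey() + "{" + tagged.getKey() + "} " + tagged.getValue());
            }
        }
        return out;
    }
}
```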
This kind of Extract, Transform, and Load (ETL) operation is expensive at the best of times. Because Spinnaker generates large amounts of metric data, the daemon handling 100MB of metric data (all in memory) every 30 seconds is not unheard of. That load could bring the whole ETL process to a grinding halt and actually take down your Spinnaker services: not because they failed, but because a failure of the monitoring daemon brings down the pods.
Armory released the Observability Plugin to address these issues. The Observability Plugin uses the native implementation in Micrometer (which Spectator makes use of) and provides a direct in-memory mapping of metric data to any given provider solution. This drastically reduces the load and performance impact of monitoring Spinnaker. Micrometer is a Spring default solution for providing metric data and supports integrations with various platforms without needing RMI access like JMX. It supports feeding metric data directly to New Relic, Dynatrace, Datadog, and others, as well as providing an OpenMetrics/Prometheus endpoint. With the Armory Observability Plugin, you can feed metric data directly to New Relic and provide the OpenMetrics/Prometheus endpoints without needing a sidecar. We’ve seen this reduce CPU and memory impact because it uses what is already in memory in Spinnaker. Take a look below, and you can see the impact of removing the sidecar on CPU:
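As a rough illustration, enabling the plugin’s Prometheus endpoint is a small profile change along these lines. The key names below are illustrative and may differ between plugin versions, so verify them against the plugin’s README:

```yaml
# Illustrative only -- check the Observability Plugin docs for the exact schema.
spinnaker:
  extensibility:
    plugins:
      Armory.ObservabilityPlugin:
        enabled: true
        config:
          metrics:
            prometheus:
              enabled: true
```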
23:00 is when the migration from the sidecar over to the plugin happened. Note the scale is 0-40% CPU usage, and it dropped by roughly 25% or more in most of the Spinnaker services after the migration.
We have our metrics, but which metrics should you look at? The first metrics to look at are common to any application: CPU and memory usage. Those are the minimum for any app, and a great place to start. On top of that, Spinnaker adds default metrics for all the services (often renamed from “standard” metric names). Here are some details:
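As a starting point, the baseline checks a dashboard alert typically encodes for those two metrics look something like the sketch below. The thresholds are illustrative assumptions, not recommendations from the Spinnaker docs:

```java
// Sketch of baseline alert conditions for CPU and heap usage.
// The 85% and 80% thresholds are illustrative; tune them to your environment.
public class BaselineAlerts {
    // True when heap usage crosses the pressure threshold.
    public static boolean heapPressure(long usedBytes, long maxBytes) {
        return maxBytes > 0 && (double) usedBytes / maxBytes > 0.85;
    }

    // True when sustained CPU usage (as a fraction of one core) warrants a look.
    public static boolean cpuPressure(double cpuFraction) {
        return cpuFraction > 0.80;
    }
}
```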
All of these are good starting metrics, but it goes deeper for each of the services that do different things. You need to understand the applications to monitor them effectively. Let’s take a look at one of the Spinnaker services for some examples.
Clouddriver is responsible for communications with the various cloud providers. It is probably the most important service in Spinnaker to monitor, while also being fairly large and complex. Orca does more from a scheduling/workflow perspective, but Clouddriver is the piece that does most of the work. Clouddriver has a lot of metric data unique to how it operates that is worth monitoring. The following are some basic examples of metrics to monitor for Clouddriver, but they are not the end-all of how you monitor its performance:
Many of these metrics are aggregated or used in combination to provide meaningful data. For example, a key monitoring metric calculated from them is cache latency, a combination of execution time and count: roughly how long it took to cache data for any given account. High cache latency is often a side effect of not having enough caching agents, an issue with a specific account, or simply the size of the account.
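That derivation is simple enough to sketch. The inputs below stand in for the underlying execution-time and count metrics; the names are illustrative, not exact Spinnaker metric names:

```java
// Sketch: derive per-account cache latency from the two underlying metrics --
// total caching-agent execution time and the number of runs in the window.
public class CacheLatency {
    // Average seconds per caching run over the sample window.
    public static double latencySeconds(double totalExecutionSeconds, long runCount) {
        if (runCount == 0) {
            return 0.0; // no runs in the window, nothing to average
        }
        return totalExecutionSeconds / runCount;
    }
}
```

If this average climbs for one account while others stay flat, suspect that account; if it climbs everywhere, suspect the caching-agent count.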
You can find a lot of good, basic dashboards for Spinnaker in the Uneeq spinnaker-mixin project, which works with our plugin to provide dashboards for Spinnaker. These are not the last word on Spinnaker dashboards, as there are many more metrics that could be relevant to your use case and some that may not be. Instead, like all monitoring, these are starting points to give you an idea. It’s up to you to take them where you need them to be.
We’re just scratching the surface of monitoring Spinnaker, so stay tuned. Next, we’ll go from metrics to other aspects of monitoring and other ways to monitor Spinnaker!