Monitoring Spinnaker: Part 1

Oct 20, 2021 by Jason McIntosh


One of the questions that comes up a lot is how to monitor Spinnaker itself. Not the apps Spinnaker is deploying, but Spinnaker and how it's performing. This question has a lot of different answers. There are a few guidelines, but many of the answers are the same as for monitoring ANY application: you watch metrics, logs, and performance indicators, trace transaction flows, and try to capture all the data you can about "events." Beyond all of that, though, it's about understanding the applications that make up Spinnaker. Each of the services that makes up the Spinnaker application has different things to monitor, and that's the real answer for how to monitor Spinnaker – monitor the services in Spinnaker. Let's dive into some of the basics…

Note: for more detailed thoughts on monitoring concepts, read our blogs on Observability, the numerous other articles on "monitoring" vs. "observability," and the history of monitoring tools including Nagios, Grafana, Prometheus, and various other platforms.


As mentioned, Spinnaker is not a monolithic application but multiple services that combine to make Spinnaker. These are Spring Boot applications (minus several Armory apps written in Go). Java has had multiple monitoring solutions over the years in various forms, and some of these still work well. The first "real" monitoring was with JMX (Java Management Extensions) via RMI, and there were numerous agents that would query remote JMX beans for data. This still works today on most Java services, as many monitoring configurations also register JMX beans concurrently (for example, Tomcat registers a lot of JMX beans as part of its bootstrap, and most newer libraries support JMX natively). JMX provides detailed monitoring of the JVM as well as remote access to, and manipulation of, the running environment. The BAD part of JMX is that you have to run your JMX client locally or through RMI, on top of the overhead and complexity that comes with JMX.

Spinnaker started with a different library called Spectator (pioneered by Netflix) to bypass a lot of the JMX headaches. Spectator is still in use today and is the default metrics solution for Spinnaker services. It operates very similarly to Prometheus, where a remote system scrapes your metric data for collection into a time-series database. The problem is that unless you are Netflix and running their tech stack, this isn't the most useful of solutions. Instead, a sidecar called the monitoring daemon was implemented to pull the Spectator metrics and feed them to any system that needed the data. However, there were some problems with this implementation (which many of our customers know in intimate detail). The sidecar was written in Python, unlike every other service, causing extra maintenance overhead. Here's an overview of how it works:

  1. On a set interval, it hits the Spectator endpoints.
  2. It loads all of the metric data into memory.
  3. It parses the metrics and transforms them into whatever format the target provider needs.

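As a rough illustration (not the daemon's actual code), the transform step above boils down to something like the following. The payload shape here is an assumption for the sketch: a JSON document with a `metrics` list of named entries carrying tags and a value, converted into Prometheus exposition-format lines.

```python
import json

def transform_to_prometheus(spectator_payload: str) -> str:
    """Transform a Spectator-style JSON metrics payload (assumed shape)
    into Prometheus exposition-format lines."""
    data = json.loads(spectator_payload)
    lines = []
    for metric in data.get("metrics", []):
        # Prometheus metric names use underscores, not dots
        name = metric["name"].replace(".", "_")
        tags = ",".join(
            f'{k}="{v}"' for k, v in sorted(metric.get("tags", {}).items())
        )
        label_part = f"{{{tags}}}" if tags else ""
        lines.append(f"{name}{label_part} {metric['value']}")
    return "\n".join(lines)

# Hypothetical metric for illustration only
payload = json.dumps({
    "metrics": [
        {"name": "controller.invocations",
         "tags": {"service": "clouddriver"}, "value": 42}
    ]
})
print(transform_to_prometheus(payload))
# controller_invocations{service="clouddriver"} 42
```

The real daemon also had to fetch each service's endpoint over HTTP on every interval and hold the whole payload in memory while transforming it, which is where the cost described below comes from.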
This kind of Extract, Transform, and Load (ETL) operation is expensive at the best of times. Because Spinnaker generates large amounts of metric data, dealing with 100MB of metric data (all in memory) every 30 seconds is not unheard of. That load could bring the whole ETL process to a grinding halt and actually take down your Spinnaker services, not because they failed, but because a failure of the monitoring daemon brings down the pods.

Armory released the Observability Plugin to address these issues. The Observability Plugin uses the native implementation in Micrometer (which Spectator makes use of) and provides a direct in-memory mapping of metric data to any given provider, drastically reducing the load and performance impact of monitoring Spinnaker. Micrometer is Spring's default solution for providing metric data and supports integrations with various platforms without needing RMI access like JMX. It can feed metric data directly to New Relic, Dynatrace, Datadog, and others, as well as provide an OpenMetrics/Prometheus endpoint. With the Armory Observability Plugin, you can feed metric data directly to New Relic and expose the OpenMetrics/Prometheus endpoints without needing a sidecar. We've seen this reduce CPU and memory impact because it uses what is already in memory in Spinnaker. Take a look below, and you can see the impact of removing the sidecar on CPU:
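Once the plugin exposes an OpenMetrics/Prometheus endpoint, anything that can read the exposition format can consume the metrics directly. As a hedged sketch (the metric names below are illustrative, not Spinnaker's exact names), a minimal consumer of that text format looks like this:

```python
def parse_exposition(text: str) -> dict:
    """Parse simple Prometheus exposition-format lines into a
    {metric-with-labels: value} dict. Skips comments (# HELP / # TYPE)
    and blank lines; handles only the common 'name{labels} value' and
    'name value' forms, not timestamps or exemplars."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        metric, value = line.rsplit(" ", 1)
        samples[metric] = float(value)
    return samples

# Example scrape body (illustrative metric names)
scrape = """\
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap"} 1.2e+08
process_cpu_usage 0.25
"""
metrics = parse_exposition(scrape)
print(metrics["process_cpu_usage"])  # 0.25
```

In practice you would point Prometheus (or another scraper) at the endpoint rather than parsing it yourself; the point is that the data is already in a standard, widely supported format.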

The migration from the sidecar over to the plugin happened at 23:00. Note the scale is 0–40% CPU usage, and it dropped by roughly 25% or more in most of the Spinnaker services after the migration.

What to look at

We have our metrics, but which metrics should you look at? The first metrics to look at are common to any application: CPU and memory usage. Those are the minimum for any app, and a great place to start. On top of that, Spinnaker adds default metrics for all of its services (often renamed from "standard" metric names). Here are some details:

All of these are some good starting metrics, but it gets deeper for each of the services that do different things. You need to understand the applications to monitor them effectively. Let’s take a look at one of the Spinnaker services for some examples.

Clouddriver metrics…

Clouddriver is responsible for communicating with the various cloud providers. It is probably the most important Spinnaker service to monitor, while also being fairly large and complex. Orca does more from a scheduling/workflow perspective, but Clouddriver is the piece that does most of the work. Clouddriver exposes a lot of metric data, unique to how it operates, that is worth monitoring. The following are some basic examples of metrics to monitor for Clouddriver, but not the be-all and end-all of how you monitor its performance:

Many of these metrics are aggregated or combined to provide meaningful data. For example, a key monitoring metric calculated from these is cache latency, a combination of execution time and count, which is roughly how long it took to cache data for any given account. High cache latency is often a side effect of not enough caching agents, an issue with a targeted account, or simply the size of the account.
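As a sketch of that calculation (the inputs here stand in for whatever time-sum and count metrics your setup exposes; they are not Clouddriver's exact metric names), mean cache latency per account is just total execution time divided by execution count:

```python
def mean_cache_latency(time_total_seconds: float, execution_count: int) -> float:
    """Mean caching-agent latency for an account:
    total agent execution time divided by the number of executions.
    Returns 0.0 when no executions have been recorded yet."""
    if execution_count == 0:
        return 0.0
    return time_total_seconds / execution_count

# e.g. 120s of total agent execution time across 48 caching cycles
print(mean_cache_latency(120.0, 48))  # 2.5 seconds per cycle
```

Most dashboarding tools let you express this as a ratio of two series (for example, a `rate(sum)/rate(count)` style query in Prometheus), sliced by account, so a single misbehaving account stands out.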

Continuing on…

You can find a lot of good, basic dashboards for Spinnaker in the Uneeq spinnaker-mixin project, which works with our plugin to provide dashboards for Spinnaker. These are not the last word on Spinnaker dashboards; there are many more metrics that could be relevant to your use case, and some of these may not be. Instead, like all monitoring, they are starting places to give you an idea. It's up to you to take them where you need them to be.

We're just scratching the surface of monitoring Spinnaker, so stay tuned. Next we'll go from metrics to other aspects of monitoring and other ways to monitor Spinnaker!
