One of the questions that comes up a lot is how to monitor Spinnaker itself: not the apps Spinnaker is deploying, but Spinnaker itself and how it's performing. This question has a lot of different answers. There are a few guidelines, but many of the answers are the same as for monitoring ANY application: you watch metrics, logs, and performance indicators, trace transaction flows, and try to capture all the data you can about "events." Beyond all of that, though, it's about understanding the applications that make up Spinnaker. Each of the services that makes up Spinnaker has different things to monitor, and that's the real answer to how you monitor Spinnaker: monitor the services in Spinnaker. Let's dive into some of the basics…
Note: for more detailed thoughts on monitoring concepts, read blogs at Honeycomb.io regarding Observability, numerous other articles on “monitoring” vs. “observability”, and the history of monitoring including Nagios, Grafana, Prometheus, and various platforms.
As mentioned, Spinnaker is made up of multiple services that combine to form Spinnaker, so it's not a monolithic application. These are Spring Boot applications (aside from several Armory apps written in Go). Java has had multiple monitoring solutions over the years in various forms, and some of them still work well. The first "real" monitoring was with JMX (Java Management Extensions) via RMI. There were numerous agents that would query remote JMX beans for data. This still works today on most Java services, as many of the monitoring configurations also concurrently register JMX beans. (For example, Tomcat registers a lot of JMX beans as part of its bootstrap, and most newer libraries support JMX natively.) This provides detailed monitoring of the JVM as well as remote access to and manipulation of the running environment. The BAD part of JMX is that you have to run your JMX client locally or through RMI, on top of the overhead and complexity that comes with JMX itself.
Spinnaker started with a different library called Spectator (pioneered by Netflix) to bypass a lot of the JMX headaches. Spectator is still in use today and is the default metrics solution for Spinnaker services. It operates much like Prometheus does: a remote system scrapes your metric data for collection into a time-series database. The problem is that unless you are Netflix running their tech stack, this isn't a particularly useful solution on its own. Instead, a sidecar called the monitoring daemon was implemented to pull the Spectator metrics and feed them to any system that needed the data. However, there were some problems with this implementation (which many of our customers know in intimate detail). The sidecar was written in Python, unlike every other service, which added maintenance overhead. Here's an overview of how it works:
- Execute on a set interval, hitting each service's Spectator endpoints.
- Load all of the metric data into memory.
- Parse the metrics and transform them into whatever provider format is needed.
This kind of Extract, Transform, and Load (ETL) operation is expensive at the best of times, and Spinnaker generates large amounts of metric data: handling 100MB of metrics (all in memory) every 30 seconds is not unheard of. That load could bring the whole ETL process to a grinding halt and actually take down your Spinnaker services, not because they failed, but because a failure of the monitoring daemon brings down their pods.
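The daemon's scrape-transform cycle can be sketched roughly like this. This is a simplified illustration, not the daemon's actual code: the endpoint path and the payload shape are stand-ins for the real Spectator JSON schema.

```python
import json
import urllib.request

SPECTATOR_PATH = "/spectator/metrics"  # assumed endpoint path


def fetch_metrics(base_url):
    """Scrape the raw Spectator JSON from one Spinnaker service."""
    with urllib.request.urlopen(base_url + SPECTATOR_PATH) as resp:
        return json.load(resp)


def transform(payload):
    """Flatten a simplified Spectator-style payload into (name, tags, value)
    rows, ready to be pushed in whatever format a provider needs.

    NOTE: the payload shape here is a simplified stand-in, not the exact
    Spectator schema."""
    rows = []
    for name, metric in payload.get("metrics", {}).items():
        for point in metric.get("values", []):
            # Take the most recent sample for each tag set.
            rows.append((name, point.get("tags", {}), point["values"][-1]["v"]))
    return rows
```

Every cycle repeats this for every service, which is why a large payload held entirely in memory makes the whole loop fragile.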
Armory released the Observability Plugin to address these issues. The Observability Plugin uses the native implementation in Micrometer (which Spectator makes use of) and provides a direct in-memory mapping of metric data to any given provider. This drastically reduces the load and performance impact of monitoring Spinnaker. Micrometer supports integrations with various platforms without needing RMI access like JMX, and it is the Spring default solution for providing metric data. It supports feeding metric data directly to New Relic, Dynatrace, Datadog, and others, as well as providing an OpenMetrics/Prometheus endpoint. With the Armory Observability Plugin, you can feed metric data directly to New Relic and expose the OpenMetrics/Prometheus endpoints without needing a sidecar. We've seen this reduce CPU and memory usage because it works with what is already in memory in Spinnaker. Take a look below, and you can see the impact of removing the sidecar on CPU:
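Because the plugin exposes a standard OpenMetrics/Prometheus endpoint, any scraper can consume it. As a rough illustration, here is a minimal Python parser for that text exposition format. The sample lines are hypothetical, and this naive parser is a sketch: real scrapers handle escaping and label values containing commas, which this one does not.

```python
import re

# Hypothetical sample of the kind of output a Prometheus endpoint emits.
SAMPLE = """\
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space"} 1.234567E8
jvm_memory_max_bytes{area="heap",id="G1 Eden Space"} 5.36870912E8
"""

# name, optional {labels}, value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)


def parse_exposition(text):
    """Parse metric sample lines (comments skipped) into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group("labels"):
            for pair in m.group("labels").split(","):  # breaks on commas in values
                k, v = pair.split("=", 1)
                labels[k] = v.strip('"')
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples
```

In practice you would point Prometheus (or any OpenMetrics-compatible collector) at the endpoint rather than parsing it yourself; the point is that the format is plain text and trivially consumable.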
23:00 is when the migration from the sidecar to the plugin happened. Note the scale is 0-40% CPU usage, and it dropped by roughly 25% or more in most of the Spinnaker services after the migration.
What to look at
We have our metrics, but which metrics do you look at? The first metrics to look at are common to any application: CPU and memory usage. Those are the minimum for any app and a great place to start. On top of that, Spinnaker adds default metrics for all the services (often renamed from the "standard" metric names). Here are some details:
- Memory: jvm.memory.used and jvm.memory.max. Your heap memory usage will, as a rule, fluctuate a bit, so watch these. Watch jvm.gc.pause.* in particular; frequent or long GC pauses are a sign you don't have enough memory and need more. Spinnaker services tend to load a lot of data into memory for translation and thus use a lot of memory. Kubernetes providers in Clouddriver are particularly bad about this because of the size of the documents loaded.
- CPU: process.cpu.usage measures the usage of the JVM process itself. A word of caution, though: there's a lot more going on in the container than just the JVM process. If this metric is consistently high, you don't have enough CPU allocated.
- jdbc.connections.active & jdbc.connections.max: these tell you how well utilized your JDBC connection pools are. They're point-in-time observations, so spikes can slip between samples, but they are still useful to track.
- Controller metrics: these are roughly equivalent to http.server.* type metrics. They're not a 1:1 translation, but, as a rule, they are what you monitor in place of the http metrics for these services.
- okhttp.requests.* metrics: these measure failures talking to external services, such as Orca failing to talk to Clouddriver.
- resilience4j.*: these are "circuit breaker" metrics that measure how things retry or fail over intelligently. Tracking when these occur is a good sign something is wrong with a call. From these, you can work on figuring out what failed and why, and start fixing it.
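Several of the metrics above only become meaningful as ratios or derived signals (used vs. max heap, active vs. max connections). Here is a hedged Python sketch of turning point-in-time samples into coarse alert flags. The flat dict of metric name to value and the thresholds are assumptions for illustration; exact metric names vary by Spinnaker version and metrics backend.

```python
def health_signals(samples):
    """Derive a few coarse alert signals from point-in-time metric samples.

    `samples` maps metric name -> latest value, using the metric names
    discussed above (names are illustrative; adjust for your registry).
    Thresholds are example values, not recommendations.
    """
    heap_pct = samples["jvm.memory.used"] / samples["jvm.memory.max"]
    pool_pct = samples["jdbc.connections.active"] / samples["jdbc.connections.max"]
    return {
        "heap_utilization": heap_pct,
        "heap_pressure": heap_pct > 0.9,        # near max heap: expect GC pauses
        "jdbc_pool_utilization": pool_pct,
        "jdbc_pool_saturated": pool_pct > 0.8,  # spikes can still slip between samples
        "cpu_high": samples["process.cpu.usage"] > 0.85,
    }
```

In a real setup, this logic lives in your alerting rules (Prometheus alert expressions, Datadog monitors, etc.) rather than in code, but the ratios are the same.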
These are all good starting metrics, but it goes deeper for each of the services, since they all do different things. You need to understand the applications to monitor them effectively. Let's take a look at one of the Spinnaker services for some examples.
Clouddriver is responsible for communications with the various cloud providers. It is probably the most important service in Spinnaker to monitor, while also being fairly large and complex. Orca does more from a scheduling/workflow perspective, but Clouddriver is the piece that does most of the work. Clouddriver has a lot of metric data unique to how it operates that is worth monitoring. The following are some basic examples of metrics to monitor for Clouddriver, though they are not the be-all and end-all of monitoring its performance:
- cats.sqlCache.* – these metrics are tied to cache events, such as what's put in, merged, or deleted. They give an idea of how cache operations are performing. This doesn't tell you that your cloud accounts are working, but it provides a lot of detail on how cache data is processed.
- aws.* – these metrics are for tracking communications with AWS. The throttling metric in particular is one to watch for, as it's a sign that AWS API rate limits are being hit.
- kubernetes.* – these metrics detail the performance of the Kubernetes cache data.
- fiat.permissionsCache – watch these metrics specifically for permission failures. A failure isn't necessarily a sign that something is broken, but it can be a sign of something going wrong.
Many of these metrics are aggregated or used in combination to provide meaningful data. For example, a key monitoring metric calculated from them is cache latency, a combination of execution time and count: roughly how long it took to cache the data for any given account. High cache latency is often a side effect of not having enough caching agents, an issue with a specific account, or simply the size of the account.
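As a sketch of that aggregation, here is how you might compute a per-account average cache latency from execution-time and count samples. The sample shape and field ordering here are hypothetical; in practice this calculation happens as a query against your time-series backend, not in application code.

```python
from collections import defaultdict


def cache_latency_by_account(agent_samples):
    """Estimate average caching latency per account.

    Each sample is (account, total_execution_time_seconds, execution_count),
    roughly the shape of the execution-time and count metrics mentioned
    above (field names are illustrative, not exact Spinnaker metric names).
    Latency = total execution time / total executions, per account.
    """
    time_sum = defaultdict(float)
    count_sum = defaultdict(int)
    for account, exec_time, count in agent_samples:
        time_sum[account] += exec_time
        count_sum[account] += count
    return {acct: time_sum[acct] / count_sum[acct]
            for acct in time_sum if count_sum[acct] > 0}
```

A sudden rise in this number for one account points at that account (size, permissions, API trouble); a rise across all accounts points at Clouddriver itself, often too few caching agents.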
You can find a lot of good, basic dashboards for Spinnaker in the Uneeq spinnaker mixin project, which works with our plugin to provide dashboards for Spinnaker. These aren't the last word in Spinnaker dashboards: there are many more metrics that could be relevant to your use case, and some included ones may not be. Instead, like all monitoring, these are starting places to give you an idea. It's up to you to take them where you need them to be.
We're just scratching the surface of monitoring Spinnaker, so stay tuned. Next, we'll go from metrics to other aspects of monitoring and other ways to monitor Spinnaker!