It’s six in the evening, and just as you’re about to sit down for dinner (wait for it) you get paged for another production outage. Third time this week! You immediately SSH into the server to find that the JVM has been OOM killed. Great. As you inspect the endless stream of logs, tears slowly drip into your freshly poured glass of whiskey. Salt ruins whiskey. Think of all the spoiled whiskey! You have no idea why the system was OOM killed or what state the JVM was in. You restart the JVM, only for it to be killed again 15 minutes later. No clues are jumping off the terminal. With a glimmer of hope, you scour thousands of log lines as they stream down your terminal every millisecond. A few minutes pass and still no evidence in the logs. Seconds turn into minutes, then into hours. Hope turns into frustration, then into desperation. Your food is cold now too.
We’ve all been there at some point in our careers; maybe you’ve found a better way. If production outages consume your peaceful inner Buddha, then this blog post is for you. We’ll explore the observability domain and how it can help prevent any more spoiled whiskey.
A Brief History of Observability
Observability is a relatively new term in the software industry. In the past, engineers would spend countless hours SSH’ing (or telnet’ing, for those who remember those insecure days) into machines, tailing logs, and interpreting `top` output, then start the cycle over again in hopes of finding a clue. With the introduction of JMX for Java in the early 2000s, Application Performance Management (APM) systems such as New Relic and AppDynamics began to appear on the scene. These services provided more insight into what was happening inside the application but came at a high cost, both financially and technically, and they still left a lot to be desired for system metrics and custom metrics. Then Splunk burst onto the scene like the Kool-Aid Man. They made it clear that all the good stuff is in your application logs. No more `tail -f` needed! No more `grep` needed! What freedom! As the internet gained steam, architectures became more complex, and we needed more sophisticated tools. We now see tools like Datadog and Prometheus entering the landscape and reshaping our expectations of what observability should look like in a high-volume, high-availability world.
Why is it important now, more than ever?
The complexity of building online services has exploded over the last decade, with new infrastructure innovations ranging from serverless solutions like AWS Lambda to container orchestration solutions like Kubernetes. These abstractions move you further away from the underlying compute, making it more difficult to debug if you don’t have the right information. The application tier has also gained complexity through microservices and API integrations with third-party vendors for security, credit card processing, and record-keeping, to name a few. The reason your application is failing might be buried layers deep in an architecture that is out of your control. What used to be SSH’ing into three machines has turned into thousands of machines.
The 3 Legs of Observability
Over the past five years, we’ve seen an explosion in SaaS and OSS products that observe and report the state of our applications and services. These solutions generally fall into three categories that form the legs of the observability stool: metrics, logs, and distributed tracing.
Metrics
The easiest way to get started is by gathering aggregate metrics from your systems. With services like Datadog, Dynatrace, and SignalFx, you can begin without even instrumenting your code; in many cases, these metrics come out of the box. Typical metrics include:
- Latency – How quickly your service responds to inbound requests.
- Inbound/Outbound Traffic – How much data is going in and out of your system or nodes.
- CPU/Memory/Disk Utilization – How much of your compute resources your service is consuming.
- Error rate – The percentage of responses that are not successful, i.e., outside the 2xx or 3xx status ranges.
These are great for system-level metrics, but they don’t tell the whole story of your application’s health. For example, if CPU usage goes up by 30%, is that a good or bad event? It could be due to a spike in orders in your system, which is a good thing! Or it could be a nagging performance issue, which is terrible. Application- or business-level metrics become critical in determining the overall health of your service. An application metric might be the number of items in a queue or the aggregate count of a particular request type. For example, Netflix might use “number of plays per second” to understand whether their overall system is in a good state.
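As a rough illustration, here’s what a couple of application-level metrics might look like with Micrometer, the Java metrics library mentioned later in this post. The metric names, the encode queue, and the playback counter are all hypothetical, chosen only to mirror the examples above:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PlaybackMetrics {
    public static void main(String[] args) {
        // In a real service this registry would be backed by Prometheus, Datadog, etc.
        MeterRegistry registry = new SimpleMeterRegistry();

        // Business-level counter: incremented every time a play starts.
        Counter plays = registry.counter("playback.starts");

        // Application-level gauge: reports the current depth of a work queue.
        Queue<String> encodeQueue = registry.gauge(
                "encode.queue.size", new ConcurrentLinkedQueue<>(), Queue::size);

        encodeQueue.add("title-42");
        plays.increment();

        System.out.println("plays so far: " + plays.count());
    }
}
```

Because these values are aggregated inside the process before they’re exported, they stay cheap to collect even at very high request volumes.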
Logging
This type of observability allows developers to output any text or debug information to a file or a centralized system that indexes it like a search engine. Developers, operators, and security teams can all search that data to look for anomalies. The distinction between logging and metrics is about cardinality, i.e., the uniqueness of the data. For example, request IDs or remote IP addresses might be good to log so that you can track requests through your microservices or identify attackers. Keeping track of the number of requests, or logging highly redundant messages for a given endpoint, might not be the best choice: it spikes storage costs, and using what is essentially a search engine to do metric aggregations is very compute-intensive.
Proper log levels and logging hygiene become important because what you might need in your development and staging environments is not what you want to be logging in production, given the difference in scale. You’ll also want to consider making your log levels adjustable at runtime so you can increase verbosity while systems are in a degraded state.
As you move more services into the cloud, the systems become increasingly ephemeral: VMs and containers can be terminated at unexpected times, and when those systems go away, so does the data associated with them. Having your logging system keep redundant copies of those logs allows you to search for anomalies on hosts that no longer exist.
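Here’s a minimal sketch of what that hygiene might look like in Java, assuming SLF4J with a Logback backend. The `OrderHandler` class, the `requestId`/`remoteIp` fields, and the runtime level switch are illustrative, not a prescribed setup:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import ch.qos.logback.classic.Level;

public class OrderHandler {
    private static final Logger log = LoggerFactory.getLogger(OrderHandler.class);

    public void handle(String requestId, String remoteIp) {
        // High-cardinality context (request ID, caller IP) goes into the MDC so every
        // log line emitted while handling this request carries it automatically.
        MDC.put("requestId", requestId);
        MDC.put("remoteIp", remoteIp);
        try {
            log.info("processing order");
            log.debug("inventory lookup details hidden unless DEBUG is enabled");
        } finally {
            MDC.clear();
        }
    }

    // Raise verbosity at runtime while the system is degraded (Logback-specific cast).
    public static void enableDebugLogging() {
        ch.qos.logback.classic.Logger root =
                (ch.qos.logback.classic.Logger) LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
        root.setLevel(Level.DEBUG);
    }
}
```

With the MDC in place, a JSON log encoder can emit `requestId` and `remoteIp` as structured fields, which is what makes them searchable once the logs land in a centralized system.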
Distributed Tracing
This technique allows you to monitor individual requests across an ecosystem of microservices by tagging metrics and log messages with identifiable structured data. As the industry continues to trend towards building hundreds or thousands of tiny services, it becomes increasingly important to get a holistic view of the ecosystem when tracking down errors. If you’re a major online retailer, for example, and your architecture includes a dedicated service for each of:
- Edge routing
- Checkout
- Inventory
- Payments
How would you begin to debug user-level issues across the spectrum of these services? With distributed tracing, you can thread the needle of a particular interaction within your system and understand why your users might be seeing errors.
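As a rough sketch of how that threading works, here’s what instrumenting the checkout path might look like with the OpenTelemetry Java API (covered later in this post). The service name, span names, and attributes are hypothetical:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutService {
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

    public void placeOrder(String orderId) {
        // Each hop gets its own span; the shared trace ID ties the edge, checkout,
        // inventory, and payment services together into a single end-to-end view.
        Span span = tracer.spanBuilder("placeOrder").startSpan();
        span.setAttribute("order.id", orderId);
        try (Scope ignored = span.makeCurrent()) {
            chargePayment(orderId); // downstream work shows up as child spans
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }

    private void chargePayment(String orderId) {
        // Placeholder for a call to the payments service; with instrumented HTTP
        // clients, the trace context is propagated in request headers automatically.
    }
}
```

When the payments team pulls up the same trace ID, they see their slice of the request next to yours, which is exactly the holistic view described above.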
Where do you go from here?
The first step in gaining a deeper understanding of your application runtime is to instrument your code. In many cases, the tool you choose will automatically surface additional insights with little to no extra effort. Depending on your language of choice, you might consider Micrometer, a vendor-agnostic metrics library for Java, or Go-Metrics, a metrics library for Go. These libraries allow you to set up aggregate metrics and export them to metric stores such as Prometheus, New Relic, or Datadog.
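As one hedged example, a minimal Micrometer setup that times requests and renders them in a format Prometheus can scrape might look like the following. It assumes the micrometer-registry-prometheus module is on the classpath, and the metric and tag names are purely illustrative:

```java
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class MetricsExportExample {
    public static void main(String[] args) {
        // Registry that renders metrics in the Prometheus text exposition format.
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Timer tagged by endpoint so latency can be broken down per route.
        Timer latency = registry.timer("http.server.requests", "endpoint", "/checkout");

        // Record how long a (simulated) request handler takes.
        latency.record(() -> {
            try {
                Thread.sleep(25);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // In a real service this string would be served from a /metrics endpoint
        // that Prometheus scrapes on an interval; here we just print it.
        System.out.println(registry.scrape());
    }
}
```

The appeal of a facade like this is that swapping the registry implementation changes the backend without touching the instrumentation code.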
New communities are popping up, such as the OpenTelemetry project, which helps foster OSS standards for tracing. Using these libraries, you’ll save yourself time if you need to change vendors in the future, as there will be no switching cost for your application code.
Lastly, for logging, you won’t have to instrument your code at all, as most languages easily let you write to a log file. The instrumentation happens at the system or machine level, either through an agent, a sidecar, or a DaemonSet in Kubernetes. This instrumentation automatically forwards data to the centralized system, which indexes it and allows for fast searching.
Enjoying Your Tear-Free Whiskey
It’s 6 pm. You’re enjoying your perfectly cooked, medium-rare steak dinner when you hear your phone buzz against the dinner table. Only this time, you don’t get as frustrated. Sprinting to your backpack, you pull out your laptop, littered with vendor stickers, and flip it open. Opening the browser, you go to your carefully curated metrics dashboard, which shows that response latency is up 50%. As you comb through the logs, they corroborate the finding: the slow response times are coming from inside your application. You dash to your tracing system and observe that only calls to a particular microservice are affected. After sharing your findings with that team, you realize they deployed that service earlier that day. You both agree to roll back, and the metrics return to normal. Disaster averted. All of this happens in less than 20 minutes. Pacing back to the dinner table, you’re surprised that your steak is still warm. You pick up your whiskey and enjoy the taste of the smoky-sweet liquor, this time without the salty tears.