It’s six in the evening, and just as you are about to sit down for dinner (wait for it) you get paged for another production outage. Third time this week! You immediately SSH into the server and see that the JVM has been OOM killed. Great. As you inspect the endless stream of logs, tears slowly drip into your freshly poured glass of whiskey. Salt ruins whiskey. Think of all the spoiled whiskey! You have no idea why the process was OOM killed or what state the JVM was in. You restart the JVM, only for it to be killed again 15 minutes later. No clues are jumping off the terminal. With a glimmer of hope, you scour thousands of log lines as they stream down your terminal every millisecond. A few minutes pass and still no evidence in the logs. Minutes turn into hours. Hope turns into frustration, then into desperation. Your food is cold now too.
We’ve all been there at some point in our careers; maybe you’ve found a better way. If production outages disturb your peaceful inner Buddha, then this blog post is for you. We’ll explore the observability domain and how it can help prevent further spoiled whiskey.
A Brief History of Observability
Why is it important now, more than ever?
The 3 Legs of Observability
- Latency – How quickly your service responds to inbound requests.
- Inbound/Outbound Traffic – How much data is going in and out of your system or nodes.
- CPU/Memory/Disk Utilization – How much of its compute resources your service is consuming.
- Error rate – Percentage of responses that are not successful, that is, anything other than a 2xx or 3xx status.
- Edge routing
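To make the latency and error-rate metrics above a little more concrete, here is a minimal Python sketch that computes both from a batch of request records. The `Request` type, its field names, and the sample values are hypothetical, made up purely for illustration; in a real system a metrics library or agent would typically gather these numbers for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request (names are assumptions)."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to respond, in milliseconds


def error_rate(requests):
    """Percentage of responses whose status is not 2xx or 3xx."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if not 200 <= r.status < 400)
    return 100.0 * errors / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of response latency, in milliseconds."""
    values = sorted(r.latency_ms for r in requests)
    if not values:
        raise ValueError("no requests to measure")
    # Nearest-rank method: take the smallest value that covers pct% of samples.
    rank = max(1, math.ceil(pct / 100 * len(values)))
    return values[rank - 1]


# Made-up sample data: two healthy responses and two failures.
sample = [
    Request(200, 12.0),
    Request(301, 20.0),
    Request(500, 250.0),
    Request(404, 8.0),
]

print(f"error rate: {error_rate(sample):.1f}%")
print(f"p50 latency: {latency_percentile(sample, 50)} ms")
print(f"p99 latency: {latency_percentile(sample, 99)} ms")
```

Percentiles are used here instead of an average because a single slow request (the 250 ms one above) barely moves the median but dominates the tail, and the tail is usually what pages you at dinner time.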