“Observability.” A term that, in recent years, has worked its way into conference keynotes, vendor sales pitches, industry blogs, and IT-group meetings around the globe. But what is it really?
Wikipedia says observability is part of the larger field of Control Theory and “is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” More broadly, Control Theory is a field of mathematics focused on creating dynamic models of control for systems where the control action is optimized for the task.
But wait, the theme this year is ELI5, right? OK, let’s start over.
“Control Theory” is the idea that we can create methods of control customized to their target system. “Observability” is the idea of seeing the health of a system from its output. In our world (IT), this means we should be able to answer both the known and unknown questions about a system simply by watching how it acts from the outside.
With monitoring, we measure our systems using known thresholds based on contextual, and sometimes tribal, knowledge. In the observability framework, we gain the ability to ask questions we didn’t know to ask, in near-real time, to understand exactly what our system is doing at any time.
One of the pillars of the observability framework is that it is event-driven, with the primary event being a request from your customer. We’re moving away from caring about things like availability, uptime, CPU load, memory utilization, and a whole host of other metrics we’ve come to know and love. As one of the biggest evangelists of observability, Charity Majors, says: “nines don’t matter if users aren’t happy.” Our users don’t care in the slightest that we’ve built a 64-core database server with 624GB of RAM and a RAID1+0 SSD array if it takes what they perceive as a long time to load their webpage. The user’s experience is the primary measurement of success in our industry now.
Take, for instance, the recent launch of the highly anticipated Disney+ service. In the first few hours, users around the globe received numerous errors we now know were due to unforeseen issues in the application’s code base. Initial rumors placed blame on the underlying infrastructure in AWS, which was reported as being fully available the entire time. But this underlying availability didn’t alleviate any of the negative user experience, which almost assuredly couldn’t have been predicted in a way that would have let us build a proactive monitor around it to quickly and efficiently isolate the root cause.
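To make the contrast a little more concrete, here is a minimal Python sketch. It assumes each customer request is recorded as one wide event; the field names (`route`, `region`, `app_version`, `duration_ms`), the sample data, and both helper functions are invented for illustration and don’t come from any particular tool. The first function is a classic monitor with a known threshold. The second lets us slice the same request events by whatever field we happen to be curious about, including questions we didn’t plan for.

```python
from collections import defaultdict

# Hypothetical per-request events. Every field name here is an assumption
# made for this example, not the schema of any real observability product.
EVENTS = [
    {"route": "/home",  "region": "us-east", "app_version": "1.4.2", "duration_ms": 120},
    {"route": "/home",  "region": "eu-west", "app_version": "1.4.2", "duration_ms": 140},
    {"route": "/login", "region": "us-east", "app_version": "1.5.0", "duration_ms": 2300},
    {"route": "/login", "region": "eu-west", "app_version": "1.5.0", "duration_ms": 1900},
    {"route": "/home",  "region": "us-east", "app_version": "1.5.0", "duration_ms": 2100},
    {"route": "/login", "region": "us-east", "app_version": "1.4.2", "duration_ms": 130},
]


def cpu_monitor(cpu_percent, threshold=90.0):
    """Monitoring: a known question answered with a predefined threshold."""
    return cpu_percent > threshold


def slow_share_by(events, field, slow_ms=1000):
    """Observability-style question: for each value of `field`,
    what fraction of requests felt slow to the user?"""
    totals = defaultdict(int)
    slow = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        if event["duration_ms"] >= slow_ms:
            slow[key] += 1
    return {key: slow[key] / totals[key] for key in totals}
```

With this sample data, `cpu_monitor(35.0)` returns `False`, so the dashboard stays green. Grouping by region is no help either: `slow_share_by(EVENTS, "region")` gives 0.5 for both regions. Only when we ask `slow_share_by(EVENTS, "app_version")` does the picture snap into focus: every request on the made-up version 1.5.0 is slow and none on 1.4.2 are. Nobody had to predict that question ahead of time; the events already carried enough context to answer it.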
Enter Observability, stage left.
With proper observability tooling, not only can we post-mortem these issues with high cardinality, but we can potentially take near-real-time, or even proactive, stances by incorporating machine learning models into our tools that are capable of learning what’s “normal” and letting us know when things land a little off-kilter. And while this can improve our users’ experience, ultimately the real value comes in giving us more time to search the depths of the internet to find gems like this to share with our colleagues: