The world may run on coffee, but it’s the alarm clock that gets us out of bed. It operates on a simple threshold. You set the time that’s important to you and receive an alert when that variable is true.
Like your alarm clock, today’s tooling for web service alerting often operates on simple thresholds, but unlike with your clock, there is a wide variety of metrics and it’s not as clear which should trigger an alert. Until we have something better than thresholds, engineers have to carefully weigh which metrics are actionable, how they are being measured, and what thresholds correspond to real-world problems.
Measure the Thing You Care About
In practice, this arguably simple process of reasoning about what you are monitoring, and how you are monitoring it, is rarely undertaken. More often, our metric choices and threshold values are guided by our preexisting tools. Hence, if our tools cannot measure latency, we do not alert on latency. This practice of letting our tools guide our telemetry content is an anti-pattern which results in unreliable problem detection and alerting.
Effective alerting requires metrics that are reliably actionable. You have to start by reasoning about the application and/or infrastructure you want to monitor. Only then can you choose and implement collection tools that get you the metrics you’re actually interested in, like queue size, DB roundtrip times, and inter-service latency.
One Reliable Signal
Effective alerting requires a singular, reliable telemetry signal, to which every collector can contribute. Developing and ensuring a reliable signal can be difficult, but the orders of magnitude are simpler than building out multiple disparate monitoring systems and trying to make them agree with each other – in the way many shops, for example, alert from one system like Nagios, and troubleshoot from another like Ganglia.
It’s arguably impossible to make multiple, fallible systems agree with each other in every case. They may usually agree, but every false positive or false negative undermines the credibility of both systems. Further, multiple systems rarely improve because it’s usually impossible to know which system was at fault when they disagree. Did the alerting system send a bogus alert or is there a problem with the data in the visualization system? If false positives arise from a single telemetry system, you simply iterate and improve that system.
Alert Recipient == Alert Creator
Crafting effective alerts involves knowing how your systems work. Each alert should trigger in the mind of its recipient an actionable cognitive model that describes how the production environment is being threatened. How does the individual piece of infrastructure that fired this alert affect the application? Why is this alert a problem?
Only engineers who understand the systems and applications we care about have the requisite knowledge to craft alerts that describe actionable threats to those systems and applications. Therefore effective alerting requires that the recipients of alerts be able to craft those alerts.
Push Notifications as Last Resort
Emergencies force context switches. They interrupt workflow and destroy productivity. Many alerts are necessary, but very few of them should be considered emergencies. At AppOptics, the preponderance of our alerts are delivered to group chat. We find that this is a timely notification medium which doesn’t interrupt productivity. Further, group chat allows everyone to react to the alert together, in a group context, rather than individually from an email-box or pager. This helps us avoid redundant troubleshooting effort, and keeps everyone synchronized on problem resolution.
Effective alerting requires an escalation system that can communicate problems in a way that is not interrupt-driven. There are myriad examples in other industries like healthcare and security systems, where, when every alert is interrupt-driven, human beings quickly begin to ignore the alerts. Push notifications should be a last resort.
Alerting is Hard
Effective alerting is a deceptively hard problem, which represents one of the biggest challenges facing modern operations engineers. A careful balance needs to be struck between the needs of the systems and the needs of the humans tending those systems.