Open for Voting

Enhanced Notifications (Building Upon Enhanced Node Status)

Enhanced node status is fantastic and it's a great move in the right direction! It gives us great control over "HOW" we define alert conditions. Now comes the human engagement part:

  • What is the priority or urgency of the alert?
  • Who are we going to alert? 
  • How do they want to receive the alert?
  • Are there timing considerations that should be evaluated?
  • What do we do if no one acknowledges or responds to the alert?

Note: This is one of those complex things like 'Enhanced Node Status' and should be able to be toggled on and off, like 'Enhanced Node Status' can be. If anyone here is using PagerDuty, you'll be familiar with the concepts I'm thinking about.

What is the priority or urgency of the alert

Let's start with some examples:

  • On this production application server: CPU Load critical threshold = greater than 90% for 4 consecutive polling intervals | Memory Utilization critical threshold = 95% for 4 consecutive polling intervals
  • On this non-production database server: CPU Load critical threshold = greater than 95% for 6 consecutive polling intervals | Memory Utilization critical threshold = 95% for 10 consecutive polling intervals
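To make the "N consecutive polling intervals" idea concrete, here is a minimal Python sketch. The `Threshold` class and the sample values are my own illustration (assumptions, not anything that exists in the product); it only shows how a "greater than X% for N consecutive polls" condition could be evaluated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Threshold:
    """Hypothetical critical threshold that must hold for N consecutive polls."""
    limit_percent: float
    consecutive_polls: int

    def is_breached(self, samples: list[float]) -> bool:
        # Only the most recent `consecutive_polls` samples matter.
        recent = samples[-self.consecutive_polls:]
        if len(recent) < self.consecutive_polls:
            return False  # not enough history yet
        return all(value > self.limit_percent for value in recent)


# Values taken from the two examples above.
prod_app_cpu = Threshold(limit_percent=90, consecutive_polls=4)
nonprod_db_cpu = Threshold(limit_percent=95, consecutive_polls=6)

cpu_samples = [70, 92, 93, 96, 97]  # made-up polling results, oldest first
print(prod_app_cpu.is_breached(cpu_samples))
print(nonprod_db_cpu.is_breached(cpu_samples))
```

With these made-up samples, the production rule trips (the last four polls are all above 90%), while the non-production rule does not because it needs six polls of history.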

A non-production database server is generally something we don't want to be woken up about. It's low urgency. But a production application server might be something we do want to be woken up about. By that same token, I may care more about CPU on the database server because we're oversubscribed on CPU in that data center and they're doing site-acceptance-testing this week (i.e. high urgency - but only on this node and just temporarily during site-acceptance-testing), while memory is something we can handle at our convenience (low urgency).

The priority of the "reach-ability of the node" and of the "child object contributors" should be definable separately. I may care more if the node is down (high urgency) than if CPU is at 100% (low urgency).

Where could this be specified?

  • Global level - node reach-ability and child object contributors inherit from this (the 'node status contributors' page might be a good place for this - Settings > All Settings > Node Child Status Participation)
    • Example: I might 'enable' CPU Load Threshold as a node status contributor, then see warning/critical threshold drop-downs appear next to it where I can specify that critical is high urgency and warning is low urgency, per child object contributor.
  • Node level - child object contributors inherit from this
    • Example: On most nodes, I might define the default values of all child object contributors as low urgency and node reach-ability as high urgency, but on this non-production database server that's doing site-acceptance-testing this week, I'm changing the child object contributors to high-urgency, as we'll have teams running tests 24-hours/day for the next week.
  • Child object contributor level
    • Example: Generally speaking, my child object contributor "Response Time" is a low urgency notification, but on this new node at a remote location, we've been having issues and a VP is onsite this week, so we need to know about that right away (high urgency), but the CPU and others can remain low urgency.
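The three levels above boil down to "the most specific setting wins". A rough Python sketch of that lookup follows; `NodeSettings`, `resolve_urgency`, and the contributor keys are names I made up for illustration, and the sketch ignores the warning/critical split to stay short.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Global level (hypothetical defaults): everything inherits from here.
GLOBAL_URGENCY = {"reachability": "high", "cpu_load": "low", "response_time": "low"}


@dataclass
class NodeSettings:
    """Hypothetical per-node settings; None means 'inherit from the level above'."""
    reachability: Optional[str] = None
    contributor_default: Optional[str] = None
    contributor_overrides: dict[str, str] = field(default_factory=dict)


def resolve_urgency(node: NodeSettings, item: str) -> str:
    if item == "reachability":
        return node.reachability or GLOBAL_URGENCY["reachability"]
    if item in node.contributor_overrides:      # child object contributor level
        return node.contributor_overrides[item]
    if node.contributor_default is not None:    # node level
        return node.contributor_default
    return GLOBAL_URGENCY[item]                 # global level


typical_node = NodeSettings()
sat_db_node = NodeSettings(contributor_default="high")  # site-acceptance-testing week
remote_node = NodeSettings(contributor_overrides={"response_time": "high"})

print(resolve_urgency(sat_db_node, "cpu_load"))
print(resolve_urgency(remote_node, "response_time"))
print(resolve_urgency(remote_node, "cpu_load"))
```

The site-acceptance-testing node raises every contributor to high urgency, while the remote node raises only Response Time and leaves CPU at the inherited low urgency.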

Who are we going to alert

Create "teams" or "notification groups" that represent either alert responders or notification subscribers and give them the following child-objects or properties:

  • For a high urgency notification:
    • Use this schedule (I'll elaborate on schedules in the 'are there timing considerations that should be evaluated' section)
    • Use this notification method(s) (I'll elaborate on notification methods in the 'how do they want to receive the alert' section)
  • For a low urgency notification:
    • Use this schedule
    • Use this notification method(s)
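As a sketch of what such a notification group object might hold, here is one possible shape in Python. The class names, the group name "Field Engineers", and the method names are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class UrgencyProfile:
    """Hypothetical pairing of a schedule name with notification methods."""
    schedule: str
    methods: list[str]


@dataclass
class NotificationGroup:
    name: str
    high: UrgencyProfile
    low: UrgencyProfile

    def profile_for(self, urgency: str) -> UrgencyProfile:
        if urgency not in ("high", "low"):
            raise ValueError(f"unknown urgency: {urgency!r}")
        return self.high if urgency == "high" else self.low


field_team = NotificationGroup(
    name="Field Engineers",  # made-up group name
    high=UrgencyProfile(schedule="24 x 7", methods=["email", "sms", "phone"]),
    low=UrgencyProfile(schedule="My Team Calendar", methods=["email"]),
)

print(field_team.profile_for("high").methods)
print(field_team.profile_for("low").schedule)
```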

In the 'What is the priority or urgency of the alert' section above, I talked about the global level, node level, and child object contributor level. The same levels can apply here: next to the warning/critical threshold drop-downs that let us select high urgency or low urgency, we can add a selector that lets us define responders and subscribers from our list of notification group objects.

Example: Let's say we recently had an issue with chillers in our data center and had to shut the entire data center down until the chillers were repaired. I might say, I want to include Hardware Sensor as a child object contributor, critical threshold defaults to high urgency, warning threshold defaults to low urgency. Our 'Network Operations Center' notification group is the default 'responder' of the warning threshold alerts, and our 'Infrastructure Systems' notification group is the default responder of the critical threshold alerts. Our Infrastructure management team has been wanting to stay in the loop on any temperature sensor alerts in the data center because of some C-level attention to the recent chiller issues, so they'll be added to the critical threshold alert as a notification subscriber (not a responder - i.e. we're not asking them to acknowledge and resolve the alert).
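The chiller example could be expressed as a small routing table. This is only a sketch under my own assumptions: the `ROUTING` layout and the `recipients` helper are hypothetical, and the only point is the difference between responders (who must acknowledge) and subscribers (who are just informed).

```python
# Hypothetical routing table keyed by (child object contributor, severity).
ROUTING = {
    ("Hardware Sensor", "critical"): {
        "urgency": "high",
        "responders": ["Infrastructure Systems"],
        "subscribers": ["Infrastructure Management"],
    },
    ("Hardware Sensor", "warning"): {
        "urgency": "low",
        "responders": ["Network Operations Center"],
        "subscribers": [],
    },
}


def recipients(contributor, severity):
    """Return (group, role, must_acknowledge) tuples for a triggered alert."""
    rule = ROUTING[(contributor, severity)]
    result = [(group, "responder", True) for group in rule["responders"]]
    result += [(group, "subscriber", False) for group in rule["subscribers"]]
    return result


for entry in recipients("Hardware Sensor", "critical"):
    print(entry)
```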

How do they want to receive the alert

Different teams need different things. Some of our teams are 24/7 teams and an email notification to a distribution group is just fine. Some of our teams are in the field and they need an SMS. Some of our teams want to use PagerDuty or SendWordNow. Some teams want only an email for low urgency stuff and they'll get to it at their earliest convenience, but for high urgency stuff, they want to get an email, an SMS, and a phone call. (For the SMS/phone call, we might use something like 'Run an external program' and have a utility pass the message to a separate application that sends the SMS or makes the call.)
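A sketch of per-urgency method selection, in Python. The sender functions here are stand-ins that only record what they would have sent; real delivery (SMTP, an external program for SMS or voice, a PagerDuty integration) is deliberately left out, and every name is an assumption.

```python
sent = []  # records (method, target, message) in place of real delivery


def send_email(target, message):
    sent.append(("email", target, message))


def send_sms(target, message):
    # In practice this might hand off to an external program.
    sent.append(("sms", target, message))


def send_phone(target, message):
    sent.append(("phone", target, message))


SENDERS = {"email": send_email, "sms": send_sms, "phone": send_phone}

# One team's preferences, per the last example above.
PREFERENCES = {"low": ["email"], "high": ["email", "sms", "phone"]}


def notify(target, urgency, message):
    """Send through every method configured for this urgency; return methods used."""
    used = []
    for method in PREFERENCES[urgency]:
        SENDERS[method](target, message)
        used.append(method)
    return used


print(notify("on-call engineer", "high", "Node is down"))
print(notify("team inbox", "low", "Memory at 95%"))
```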

Are there timing considerations that should be evaluated

A schedule represents the hours in which an alert can trigger for a given team (attaching schedules to nodes has also been suggested - great idea!). Schedules could consist of calendars, and the calendars could be cumulative using a "last write wins" concept (similar to Group Policy).

Example: My team might work M-F 0700 - 1600, but we're off on company holidays. We have two calendars - the 'My Team Calendar' and the 'Company Holiday Schedule' calendar. We attach them to our notification group's low urgency schedule with the team calendar first and the company calendar last, so the company calendar would black out holidays that fall within the M-F 0700 - 1600 time frame. For high urgency stuff, we're always on call, so we attach the '24 x 7' calendar to our notification group's high urgency schedule.

The same concept could be used for nodes, and nodes would "come after" notification groups in the "last write wins" structure, so they would supersede notification group preferences/schedules (maybe even make which one supersedes a global configuration preference).
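Here is one way the "last write wins" calendar stacking could work, as a Python sketch. I'm assuming each calendar answers allow, block, or "no opinion" for a given time, and that the last calendar with an opinion decides; the function names and the holiday date are made up.

```python
from datetime import date, datetime

HOLIDAYS = {date(2024, 12, 25)}  # hypothetical company holiday list


def team_calendar(when):
    # 'My Team Calendar': M-F 0700 - 1600 is allowed; otherwise no opinion.
    if when.weekday() < 5 and 7 <= when.hour < 16:
        return True
    return None


def holiday_calendar(when):
    # 'Company Holiday Schedule': blocks holidays; otherwise no opinion.
    return False if when.date() in HOLIDAYS else None


def always_on(when):
    # '24 x 7' calendar.
    return True


def can_notify(calendars, when):
    """Evaluate calendars in order; the last one with an opinion wins."""
    decision = False  # assumption: nothing is allowed unless a calendar says so
    for calendar in calendars:
        verdict = calendar(when)
        if verdict is not None:
            decision = verdict
    return decision


low_urgency = [team_calendar, holiday_calendar]
high_urgency = [always_on]

print(can_notify(low_urgency, datetime(2024, 12, 24, 9, 0)))   # ordinary Tuesday morning
print(can_notify(low_urgency, datetime(2024, 12, 25, 9, 0)))   # holiday inside work hours
print(can_notify(high_urgency, datetime(2024, 12, 25, 3, 0)))  # always on call
```

Node-level calendars could simply be appended after the group's list, which is what makes them supersede the group under this ordering.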

What do we do if no one acknowledges or responds to the alert

Give notification groups the ability to add an 'escalation policy' for high urgency and low urgency alerts sent to their notification group. If no one acknowledges the alert, send it to the next notification group after X amount of time. Then follow that notification group's escalation policy if they don't acknowledge either. Some organizations may already use PagerDuty or similar services for this, so make this optional per notification group.
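A chained escalation policy might look like the Python sketch below. The chain, the wait times, and the `groups_notified` helper are all assumptions for illustration; each group names the next group and how long to wait for an acknowledgement before escalating.

```python
# Hypothetical policies: group -> (next group, minutes to wait), or None to stop.
ESCALATION = {
    "Network Operations Center": ("Infrastructure Systems", 15),
    "Infrastructure Systems": ("Infrastructure Management", 30),
    "Infrastructure Management": None,
}


def groups_notified(first_group, minutes_unacknowledged):
    """List the groups that would have been notified after this much silence."""
    notified = [first_group]
    current = first_group
    remaining = minutes_unacknowledged
    while True:
        step = ESCALATION.get(current)
        if step is None:
            break  # no escalation policy for this group
        next_group, wait_minutes = step
        if remaining < wait_minutes or next_group in notified:
            break  # still waiting, or the chain loops back on itself
        remaining -= wait_minutes
        notified.append(next_group)
        current = next_group
    return notified


print(groups_notified("Network Operations Center", 10))
print(groups_notified("Network Operations Center", 20))
print(groups_notified("Network Operations Center", 45))
```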

Wrapping up

None of this needs to be evaluated until an alert is triggered, and then it only needs to be evaluated for the triggering object, so that should minimize performance impact. When global settings or node settings change (i.e. something that might change objects that inherit from them), users should be prompted to choose whether the objects that inherit should pick up the new settings, or whether those objects should "break inheritance" and keep their existing settings as their own going forward. Also, give users the option to download a JSON file of all existing settings pre-change, and maybe an easy import tool as an "undo" button.
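To illustrate the "break inheritance" prompt and the JSON "undo", here is a small Python sketch. The settings layout and function names are hypothetical; a node with no entry for a contributor is assumed to inherit the global value.

```python
import copy
import json


def export_settings(settings):
    """Snapshot all settings as JSON text (the pre-change download)."""
    return json.dumps(settings, indent=2, sort_keys=True)


def import_settings(snapshot):
    """Restore settings from a snapshot (the 'undo' button)."""
    return json.loads(snapshot)


def change_global(settings, contributor, new_urgency, break_inheritance):
    updated = copy.deepcopy(settings)
    old_urgency = updated["global"][contributor]
    updated["global"][contributor] = new_urgency
    if break_inheritance:
        # Pin the old value on nodes that were inheriting it.
        for node in updated["nodes"].values():
            node.setdefault(contributor, old_urgency)
    return updated


settings = {
    "global": {"cpu_load": "low"},
    "nodes": {"db01": {}, "app01": {"cpu_load": "high"}},  # made-up node names
}

snapshot = export_settings(settings)
changed = change_global(settings, "cpu_load", "high", break_inheritance=True)
print(changed["nodes"]["db01"])              # keeps the old value as its own
print(import_settings(snapshot) == settings)  # snapshot restores the original
```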

On the 'node status contributors' page (Settings > All Settings > Node Child Status Participation), there should be a toggle for 'advanced', which would duplicate each contributor section and allow us to define the conditions under which the contributors would roll up. Essentially, an 'if, if, if, else' kind of evaluation. Example: if (Vendor = "Windows" AND Environment = "Production") then use these child contributors, if (Vendor = "Windows" AND Environment = "Non-Production") then use these child contributors, if (Vendor = "Net-Snmp" AND Environment = "Production") then use these child contributors, else - use these child contributors. That being said, it might be easier to just treat the 'child contributors' as an object. When someone enables enhanced status, they receive the object 'Default - Child Contributors' in the 'else' block.
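The 'if, if, if, else' evaluation could be a first-match rule list, sketched here in Python. Only 'Default - Child Contributors' comes from the idea above; the other object names and the `select_contributors` helper are made up for illustration.

```python
# Ordered rules: (conditions on node properties, child contributors object to use).
RULES = [
    ({"Vendor": "Windows", "Environment": "Production"},
     "Windows Production - Child Contributors"),
    ({"Vendor": "Windows", "Environment": "Non-Production"},
     "Windows Non-Production - Child Contributors"),
    ({"Vendor": "Net-Snmp", "Environment": "Production"},
     "Net-Snmp Production - Child Contributors"),
]
DEFAULT = "Default - Child Contributors"  # the 'else' block


def select_contributors(node_properties):
    """Return the first contributors object whose conditions all match."""
    for conditions, contributors in RULES:
        if all(node_properties.get(key) == value for key, value in conditions.items()):
            return contributors
    return DEFAULT


print(select_contributors({"Vendor": "Windows", "Environment": "Production"}))
print(select_contributors({"Vendor": "Cisco", "Environment": "Production"}))
```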

Let me know what your thoughts are (aside from the fact that this was a terribly long book of an idea - I apologize in advance).