The Four Questions part 4

In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked. You can read that introduction here along with information on the first question (Why did I get this alert). You can get the low-down on the second question (Why DIDN'T I get an alert) here. And the third question (What is monitored on my system) is here.

My goal in this post is to give you the tools you need to answer the fourth question: Which of the existing alerts will potentially trigger for my system?

Reader's Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Riddle Me This, Batman...

It's 3:00pm. You can't quite see the end of the day over the horizon, but you know it's there. You throw a handful of trail mix into your face to try to avoid the onset of mid-afternoon nap-attack syndrome and hope to slide through the next two hours unmolested.

Which, of course, is why you are pulled into a team meeting. Not your team meeting, mind you. It's the Linux server team. On the one hand, you're flattered. They typically don't invite anyone who can't speak fluent Perl or quote every XKCD comic in chronological order. On the other...well, team meeting.

The manager wrote:

            kill `ps -ef | grep -i talking | awk '{print $2}'`

on the board, eliciting a chorus of laughter from everyone but me. My silence gave the manager the perfect opportunity to focus the conversation on me.

“We have this non-trivial issue, and are hoping you can grep out the solution for us,” he begins. “We're responsible for roughly 4,000 systems...”

Unable to contain herself, a staff member followed by stating, “4,732 systems. Of which 200 are physical and the remainder are virtualized...”

Unimpressed, her manager said, “Ms. Deal, unless I'm off by an order of magnitude, there's no need to correct.”

She replied, “Sorry boss.”

“As I was saying,” he continued. “We have a...significant number of systems. Now how many alerts currently exist in the monitoring system which could generate a ticket?”

“436, with 6 currently in active development,” I respond, eager to show that I'm just as on top of my systems as they are of theirs.

“So how many of those affect our systems?” the manager asked.

Now I'm in my element. I answer, “Well, if you aren't getting tickets, then none. I mean, if nothing has a spiked CPU or RAM or whatever, then it's safe to say all of your systems are stable. You can look at each node's detail page for specifics, although with 4,000 I can see where you would want a summary. We can put something together to show the current statistics, or the average over time, or...”

“You misunderstand,” he cuts me off. “I'm fully cognizant of the fact that our systems are stable. That's not my question. My question is…should one of my systems become unstable, how many of your... what was the number? Oh, right: How many of your 436-soon-to-be-442 alerts WOULD trigger for my systems?”

“As I understand it, your alert logic does two things: it identifies the devices which could trigger the alert (all Windows systems in the 10.199.1 subnet, for example) and at the same time specifies the conditions under which an alert is triggered (say, when the CPU goes over 80% for more than 15 minutes).”

“So what I mean,” he concluded, “is this: can you create a report that shows me the devices which are included in the scope of an alert's logic, irrespective of the trigger condition?”

Your Mission, Should You Choose to Accept it...

As with the other questions we've discussed in this series, the specifics of HOW to answer this question are less critical than knowing you will be asked it.

In this case, it's also important to understand that this question is actually two questions masquerading as one:

  1. For each alert, tell me which machines could potentially trigger it
  2. For each machine, tell me which alerts could potentially be triggered
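To make the two views concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the device inventory, the alert names, and the scope rules are invented for illustration, and no real monitoring tool's API is used). The point is only that once you have the scope of each alert, the second view is just the first one turned inside out.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

# Hypothetical inventory, invented for this example.
devices = [
    {"name": "web01", "os": "Windows", "ip": "10.199.1.15"},
    {"name": "db01", "os": "Linux", "ip": "10.199.1.20"},
    {"name": "app01", "os": "Windows", "ip": "10.200.4.7"},
]

# Hypothetical alert scopes: the "which devices" half of each alert,
# with the trigger condition (CPU > 80%, etc.) deliberately left out.
alert_scopes = {
    "High CPU (Windows, 10.199.1.x)": lambda d: d["os"] == "Windows"
    and ip_address(d["ip"]) in ip_network("10.199.1.0/24"),
    "Disk nearly full (all Linux)": lambda d: d["os"] == "Linux",
    "Node down (everything)": lambda d: True,
}


def devices_by_alert(devices, scopes):
    """Question 1: for each alert, which machines are in scope?"""
    return {
        alert: sorted(d["name"] for d in devices if in_scope(d))
        for alert, in_scope in scopes.items()
    }


def alerts_by_device(by_alert):
    """Question 2: invert the first answer to get alerts per machine."""
    inverted = defaultdict(list)
    for alert, names in by_alert.items():
        for name in names:
            inverted[name].append(alert)
    return {name: sorted(alerts) for name, alerts in inverted.items()}


by_alert = devices_by_alert(devices, alert_scopes)
by_device = alerts_by_device(by_alert)
print(by_alert)
print(by_device)
```

In a real environment the hard part is getting the scope rules out of the tool, which is what the rest of this post is about.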

Why is this such an important question, perhaps the most important of the Four Questions in this series? Because it determines the scale of the potential notifications monitoring may generate. It's one thing if 5 alerts apply to 30 machines. It's entirely another when 30 alerts apply to 4,000 machines.

The answer to this question has implications for staffing, shift allocation, pager rotation, and even the number of alerts a particular team may approve for production.

The way you go about building this information is going to depend heavily on the monitoring solution you are using.

In general, agent-based solutions are better at this because trigger logic, in the form of an alert name, is usually pushed down to the agent on each device, and thus can be queried in both directions (“Hey, node, what alerts are on you?” and “Hey, alert, which nodes have you been pushed to?”).

That's not to say that agentless monitoring solutions are intrinsically unable to get the job done. The more full-featured monitoring tools have options built-in.

Reports that look like this:

part5_2.png

Or even resources on the device details page that look like this:

part5_1.png

Houston, We Have a Problem...

What if your tool doesn't, though? What if you have pored over the documentation, opened a ticket with the vendor, visited the online forums and asked the greatest gurus up on the mountain, and come back with a big fat goose egg? What then?

Your choices at this point still depend largely on the specific software, but generally speaking there are 3 options:

  • Reverse-engineer the alert trigger and remove the actual trigger part


Many monitoring solutions use a database back-end for the bulk of their metrics, and alerts are simply a query against this data. The alert trigger queries may exist in the database itself, or in a configuration file. Once you have found them, you will need to go through each one, removing the parts which comprise the actual trigger (e.g., CPU_Utilization > 80%). This will likely require you to learn the back-end query language for your tool. Difficult? Probably, yes. Will it increase your street cred with the other users of the tool? Undoubtedly. But once you've done it, running a report for each alert becomes extremely simple.
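As a rough illustration of what "removing the trigger part" means, here is a small Python sketch using an in-memory SQLite database. The table, the column names, and the sample rows are all assumptions made up for this example. Your tool's schema and query language will be different, and splitting a real alert query into scope and trigger is something you would do by hand.

```python
import sqlite3

# Hypothetical schema and data; a real monitoring database will not look like this.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (name TEXT, os TEXT, ip TEXT, cpu_utilization REAL);
    INSERT INTO nodes VALUES
        ('web01', 'Windows', '10.199.1.15', 12.0),
        ('web02', 'Windows', '10.199.1.16', 93.0),
        ('db01',  'Linux',   '10.199.1.20', 40.0),
        ('app01', 'Windows', '10.200.4.7',  55.0);
""")

# The original alert condition, split by hand into its two halves.
scope = "os = 'Windows' AND ip LIKE '10.199.1.%'"  # which devices the alert covers
trigger = "cpu_utilization > 80"                    # when the alert fires


def node_names(where_clause):
    """Return the names of nodes matching a WHERE clause, sorted by name."""
    rows = conn.execute(f"SELECT name FROM nodes WHERE {where_clause} ORDER BY name")
    return [name for (name,) in rows]


currently_alerting = node_names(f"{scope} AND {trigger}")  # what the alert sees today
in_scope = node_names(scope)                               # what the alert COULD see

print("Alerting now:", currently_alerting)
print("In scope:", in_scope)
```

The first list answers "why did I get this alert." The second list, with the trigger stripped out, is the one the Linux team's manager was asking for.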

  • Create duplicate alerts with no trigger

If you can't export the alert triggers, another option is to create a duplicate of each alert that has the “scope” portion, but not the trigger elements (so the “Windows machines in the 10.199.1.x subnet” part but not the “CPU_Utilization > 80%” part). The only recipient of that alert will be you, and the alert action should be something like writing to a logfile with a very simple string (“Alert x has triggered for Device y”). Every so often (every month or quarter), fire off those alerts and then tally up the results into a report that recipient groups can slice and dice.
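The tallying step could be as simple as the Python sketch below. It assumes the logfile contains one line per match in exactly the “Alert x has triggered for Device y” form described above. The alert and device names in the sample are made up, and in practice you would read the lines from your own logfile instead of a list.

```python
import re
from collections import defaultdict

# Matches the simple message format the scope-only alerts write out.
LOG_LINE = re.compile(r"^Alert (?P<alert>.+) has triggered for Device (?P<device>.+)$")


def tally_scope_log(lines):
    """Group the distinct devices seen in the log under each alert name."""
    devices_per_alert = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:  # skip anything that is not one of our scope-only messages
            devices_per_alert[match["alert"]].add(match["device"])
    return {alert: sorted(devices) for alert, devices in devices_per_alert.items()}


# Hypothetical log content, standing in for the real logfile.
sample_log = [
    "Alert High CPU has triggered for Device web01",
    "Alert High CPU has triggered for Device web02",
    "Alert Disk Full has triggered for Device db01",
    "Alert High CPU has triggered for Device web01",  # repeat from a later run
    "some unrelated line",
]

scope_report = tally_scope_log(sample_log)
for alert, devices in sorted(scope_report.items()):
    print(f"{alert}: {len(devices)} device(s): {', '.join(devices)}")
```

Using a set per alert means a device that shows up in several runs is only counted once.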

  • Do it by hand


If all else fails (and the inability to answer this very essential question doesn't cause you to re-evaluate your choice of monitoring tool), you can start documenting by hand. If you know up-front that you are in this situation, then it's simply part of the ongoing documentation process. But most times it's going to be a slog through the existing alerts, writing down the trigger information. Hopefully you can take that trigger info and turn it into an automated query against your existing devices. If not, then I would seriously recommend looking at another tool. Because in any decent-sized environment, this is NOT the kind of thing you want to spend your life documenting, and it's also not something you want to live without.
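If you do end up documenting by hand, it helps to write the scope down as data rather than prose, so it can be checked against a device list later. Here is one possible shape for that in Python, assuming a spreadsheet-style CSV with an OS column and a subnet column. The column names, alert names, and inventory are all invented for this sketch, and real alert scopes will often need more fields than these two.

```python
import csv
import io
from ipaddress import ip_address, ip_network

# Hand-documented alert scopes, as they might be exported from a spreadsheet.
# The columns (alert, os, subnet) are an assumption for this example.
documented_scopes_csv = """alert,os,subnet
High CPU,Windows,10.199.1.0/24
Disk nearly full,Linux,0.0.0.0/0
"""

# Hypothetical device inventory: (name, os, ip).
inventory = [
    ("web01", "Windows", "10.199.1.15"),
    ("db01", "Linux", "10.199.1.20"),
    ("app01", "Windows", "10.200.4.7"),
]


def devices_in_scope(scope_row, inventory):
    """Return the devices whose OS and IP address fall inside one documented scope."""
    subnet = ip_network(scope_row["subnet"])
    return [
        name
        for name, os_name, ip in inventory
        if os_name == scope_row["os"] and ip_address(ip) in subnet
    ]


report = {
    row["alert"]: devices_in_scope(row, inventory)
    for row in csv.DictReader(io.StringIO(documented_scopes_csv))
}
print(report)
```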

What Time Is It? Beer:o’clock

After that last meeting (not to mention the whole day), you are ready to pack it in. You successfully navigated the four impossible questions that every monitoring expert is asked (on more or less a daily basis): Why did I get that alert? Why didn't I get that alert? What is being monitored on my systems? And what alerts might trigger on my systems? Honestly, if you can do that, there's not much more that life can throw at you.

Of course, the CIO walks up to you on your way to the elevator. “I'm glad I caught up to you,” he says, “I just have a quick question...”

Stay tuned for the bonus question!


