This is a Disaster! Knowing When to Call It

Level 9

Disasters come in many forms. When my daughters were younger, I'd walk in on them doing craft projects in their bedroom and say, "This is a disaster!" When it comes to serious events, though, most people think of natural disasters, like floods or earthquakes. But a disaster can also be defined as any event that has a serious impact on your infrastructure or business operations. It could be any of the following:

  • Security-related (you may have suffered a major intrusion or breach)
  • Operator error (I’ve seen a DC go dark during generator testing because someone forgot to check the fuel levels)
  • Software faults (there are many horror stories of firmware updates taking out core platforms)

So how can SNMP help? SNMP traps, when captured in the right way, can be like a distress signal for your systems. If you’ve spent a bit of time setting up your infrastructure, you’ll hopefully be able to quickly recognise that something has gone wrong in your data centre and begin to assess whether you are indeed in the midst of a disaster. That’s right, you need to take a moment, look at the evidence in front of you, and then decide whether invoking your disaster recovery plan is the right thing to do.

Your infrastructure might be sending out a bunch of SNMP traps for a variety of reasons: someone in your operations team has deployed some new kit, or a configuration change is under way on a piece of key infrastructure. It's important to be able to correlate the information in those SNMP traps with what's been identified as planned maintenance.
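
To make that correlation step a little more concrete, here's a minimal Python sketch. It isn't tied to any particular trap receiver; the MAINTENANCE_WINDOWS calendar and the trap record format are assumptions standing in for whatever your change-management and monitoring tools actually give you.

```python
from datetime import datetime

# Hypothetical maintenance calendar: trap source IP -> list of (start, end) windows.
MAINTENANCE_WINDOWS = {
    "10.1.1.10": [(datetime(2024, 6, 1, 22, 0), datetime(2024, 6, 2, 2, 0))],
}

def is_planned(trap_source, received_at):
    """True if the trap arrived inside a planned maintenance window for its source."""
    for start, end in MAINTENANCE_WINDOWS.get(trap_source, []):
        if start <= received_at <= end:
            return True
    return False

# A trap record as it might come out of your trap receiver (the format is an assumption).
trap = {
    "source": "10.1.1.10",
    "oid": "1.3.6.1.6.3.1.1.5.3",  # standard linkDown notification
    "received_at": datetime(2024, 6, 1, 23, 15),
}

if is_planned(trap["source"], trap["received_at"]):
    print("Trap falls inside planned maintenance - log it, don't page anyone.")
else:
    print("Unexpected trap - start assessing whether you have a real problem.")
```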

Chances are, if you're seeing a lot of errors from devices (or perhaps lots of red lights, depending on your monitoring tools), your DC is having some dramas. The last traps received by your monitoring system will also prove useful in identifying which systems were having issues and where you should start troubleshooting. There are a number of different scenarios that play out when disaster strikes, but it's fair to say that if everything in one DC is complaining that it can't talk to anything in your other DC, then you have some kind of disaster on your hands.

What about syslog? I like syslog because it’s a great way to capture messages from a variety of networked devices and store them in a central location for further analysis. The great thing about this facility is that, when disaster strikes, you’ll (hopefully) have a record of what was happening in your DC when the event occurred. The problem, of course, is that if you only have one DC, and only have your syslog messages going to that DC, it might be tricky to get to that information if your DC becomes a hole in the ground. Like every other system you put into your DC, it’s worth evaluating how important it is and what it will cost you if the system is unavailable.
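
To show how little is involved in catching those messages, here's a bare-bones sketch of a central syslog collector in Python. In practice you'd use something like rsyslog, syslog-ng, or your monitoring suite rather than rolling your own, and as noted above you'd want the repository to live somewhere other than the one DC it's watching. The port and output filename are arbitrary choices for the example.

```python
import socket

# Bare-bones central syslog collector: listen for UDP syslog datagrams and append
# the raw messages to a local file. (No RFC 3164/5424 parsing is attempted here.)
# Port 514 normally needs elevated privileges, so 5514 is used as a stand-in.
LISTEN_ADDR = ("0.0.0.0", 5514)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(LISTEN_ADDR)

with open("central-syslog.log", "a", buffering=1) as logfile:
    while True:
        data, (src_ip, _src_port) = sock.recvfrom(8192)
        # Record the sender alongside the raw message so events can be reconstructed later.
        logfile.write(f"{src_ip} {data.decode(errors='replace').rstrip()}\n")
```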

SNMP traps and syslog messages can be of tremendous use in determining whether a serious event has occurred in your DC, and in understanding what events (if any) led up to it. If you're on the fence about whether to invest time and resources in deploying SNMP infrastructure and configuring a syslog repository, I heartily recommend that you do. These tools will likely come in extremely handy, and not just when disaster strikes.

10 Comments
Level 13

Another good article

Level 16

Thanks for the write-up. SNMP traps and syslog don't get the attention they deserve.

MVP

Nice write-up

Level 13

I concur!!!!

I'm a supporter of a belt-and-suspenders approach to keeping from being embarrassed. One does not want their actions to become the focus of irate customers, nor their company to show up in the newspapers, on TV or radio, or in blogs as having a problem that should have been prevented.

That means setting up your network equipment to send syslog messages to an appropriate destination, and to have a SIEM reviewing those messages, sorting through the chaff, and forwarding appropriate alerts to the right people.
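
As a very rough sketch of that "sort through the chaff, forward the rest" step (and not how any particular SIEM actually implements it), the Python snippet below scans a file of collected syslog lines for severity keywords and flags the ones a human should see. The keyword list, filename, and routing are placeholders you'd replace with your own policy.

```python
# Placeholder severity keywords; a real policy would be built from your devices' actual
# message formats (e.g. Cisco %FACILITY-SEVERITY-MNEMONIC strings) and your own priorities.
ALERT_KEYWORDS = ("EMERG", "ALERT", "CRIT", "%LINK-3", "POWER_SUPPLY")

def needs_human(line):
    """True if this syslog line contains something worth waking someone up for."""
    return any(keyword in line for keyword in ALERT_KEYWORDS)

with open("central-syslog.log") as logfile:  # filename is a placeholder
    for line in logfile:
        if needs_human(line):
            # In a real deployment this would page on-call or raise a ticket.
            print("FORWARD:", line.rstrip())
```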

It also means doing the same regarding traps.  And that's problematic, given that manufacturers like Cisco provide MANY traps that might be enabled.  So many, in fact, that the device itself could be negatively impacted by being configured to send them all to NPM.  That opens the question of which traps are actually necessary, which have some value, and which are NOT valuable.  Understanding that is a topic worth its own Geek Speak page.  Does one enable ALL traps?  If not, which traps should be enabled, which should NOT, and why is there a difference?  I've run into both ends of the spectrum of philosophies for traps (a small receiving-side sketch follows the list below): 

  • Enable ALL traps so you don't miss anything important.  After all, the vendor wouldn't have provided them if they weren't valuable.
  • Enable SOME traps through a network design decision that results in a policy about the correct standard configuration for your environment.  This might be done through a Compliance Report and Remediation in NCM.
  • Enable NO traps.  If there's a risk of them overwhelming a switch or a SIEM, that's a problem.  And if they were truly necessary, they'd be enabled by default by the manufacturer, right?
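
Whichever philosophy you land on, the policy also has to exist on the receiving side. Here's a small Python sketch of the "enable SOME traps" approach expressed as an allow-list of trap OIDs. The linkDown/linkUp/coldStart OIDs shown are the standard SNMPv2 notification OIDs, but the contents of the allow-list are a policy decision your team has to make; nothing here can make it for you.

```python
# Receiving-side version of the "enable SOME traps" policy: alert only on trap OIDs
# that appear on an agreed allow-list, and archive everything else for later review.
ALLOWED_TRAP_OIDS = {
    "1.3.6.1.6.3.1.1.5.1",  # coldStart
    "1.3.6.1.6.3.1.1.5.3",  # linkDown
    "1.3.6.1.6.3.1.1.5.4",  # linkUp
}

def should_alert(trap_oid):
    """True if this trap type is on the agreed allow-list."""
    return trap_oid in ALLOWED_TRAP_OIDS

# The second OID is a made-up vendor-specific trap, just to show the other branch.
for oid in ("1.3.6.1.6.3.1.1.5.3", "1.3.6.1.4.1.99999.0.1"):
    print(oid, "-> alert" if should_alert(oid) else "-> archive only")
```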

And finally, it means using both syslog and traps while also having NPM doing the appropriate amount of polling of devices to see that they are working and configured as expected.

Getting back to knowing when to call an event a "disaster", that topic should be examined in advance by your teams and leadership.  Defining the criteria for when to call for "All hands on deck!" is important so that you can get the right:

  • People on the job of fixing the issue quickly
  • People knowing there's a problem so they don't flood the Help Desk
  • Upper-level management knowing about the issue so they aren't blindsided by customers or peers.  Management must be made aware of the problem, its scope, and what resources have been deployed to address it.  Management can be a huge help by focusing attention on escalating problems to the right levels at service providers, notifying internal resources, and notifying the media and/or the public.
  • People knowing about the issue so they can implement Down Time Procedures

It's important to have the right tools and right configurations in your network equipment to enable you to know the extent of an issue quickly.  That puts you in the driver's seat and gives you a lot of what you need to know so you can make the right call.  Is it a Disaster from enough different points of view?  Are there any very-important-devices or clients or customers impacted?  NPM can tell you this--if you have it and your network devices configured correctly.

Level 20

Traps have been tricky to get right in my experience.  Configuring them correctly at the source is an important part of it.

MVP

Good article.

A disaster is often defined after it happens. How many times have you had someone say, "Something went wrong yesterday with that server; can you turn logging on in case it happens again?" A big part of our job is planning for what might happen. We hope it won't happen, but in case it does, we have the information we need to catch, triage, mitigate, and resolve it. Getting everyone involved in planning the alerting and capturing is very helpful. Don't just involve yourself, your team, or your team and its leaders; get the customers and end users involved too. What they want to see is invaluable in deciding what to consider up or down.

So you have a network rack. In the rack are (at least) two intelligent APC PDUs. There is a Cisco network switch, two Dell servers, an EMC disk array, and a Brocade fiber switch. There is also an AVTECH environmental sensor. All of this equipment is configured to send its SNMP traps to your SIEM.

So! What do you do with all of these disparate traps? Server traps are different from PDU traps, switch traps are different from environmental traps, and so on. Heaven forbid you ever move your SIEM! (Hint: configure your nodes to send traps to a DNS name, not an IP.) How do you know which traps to alert on? Which traps are important and which are just noise? And what about reporting? Reporting is very important.

And oh yeah!!! You repeat the same steps for syslog messages.
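
One way to tame that mix is to normalize traps from different kinds of gear into broad categories before deciding how to alert. Here's a Python sketch keyed on the sender's enterprise OID prefix. The APC, Cisco, and Dell prefixes come from the IANA enterprise-numbers registry (check them against your own vendors' MIBs), and the category names are assumptions you'd tune for your environment; the same idea applies to syslog once you've normalized the message formats.

```python
# Map enterprise OID prefixes to broad alert categories. Prefixes are from the IANA
# enterprise-numbers registry; verify against your vendors' MIBs before relying on them.
VENDOR_PREFIXES = {
    "1.3.6.1.4.1.318": "power",    # APC
    "1.3.6.1.4.1.9":   "network",  # Cisco
    "1.3.6.1.4.1.674": "server",   # Dell
}

def categorize(trap_oid):
    """Return a broad category for a trap based on who sent it."""
    for prefix, category in VENDOR_PREFIXES.items():
        if trap_oid.startswith(prefix + "."):
            return category
    return "uncategorized"

print(categorize("1.3.6.1.4.1.318.0.5"))       # -> power (illustrative trap OID)
print(categorize("1.3.6.1.4.1.9.9.41.2.0.1"))  # -> network (illustrative trap OID)
print(categorize("1.3.6.1.4.1.99999.0.1"))     # -> uncategorized
```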

Level 9

That's a great point. Companies rarely do much in the way of disaster recovery planning until they've been neck-deep in a serious event. And it's very important to have everyone in the business engaged and buying into the process. Otherwise it's just seen as another unnecessary cost.

The solution is (apparently) a very well-thought-out and well-designed SIEM that can recognize all traps and syslog messages, winnow through the chaff, deduplicate the data, analyze the information, and make an appropriate alerting decision that includes recommendations about what should be done with the event(s) that caused the trap(s).
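
For a miniature of just the deduplication piece (and not how Splunk, LEM, or any other SIEM actually implements it), here's a Python sketch that suppresses repeats of the same source/event pair seen within a short window. The five-minute window is an arbitrary assumption.

```python
from datetime import datetime, timedelta

# Suppress repeats of the same (source, event) pair within a short window so a
# flapping link doesn't become a thousand alerts. The window length is arbitrary.
WINDOW = timedelta(minutes=5)
last_seen = {}

def is_duplicate(source, event, when):
    """True if this source/event pair was already seen within the window."""
    key = (source, event)
    previous = last_seen.get(key)
    last_seen[key] = when
    return previous is not None and when - previous < WINDOW

now = datetime(2024, 6, 1, 3, 0)
print(is_duplicate("switch-01", "linkDown", now))                          # False - first sighting
print(is_duplicate("switch-01", "linkDown", now + timedelta(minutes=2)))   # True  - suppressed
print(is_duplicate("switch-01", "linkDown", now + timedelta(minutes=20)))  # False - window expired
```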

I'd have liked LEM to be our solution so it could be part of the single pane of glass that is Orion, but it's not robust enough for my environment.  We went with Splunk, and it does a good job of meeting the above requirements.  I'll bet there are other products that are also good in larger environments; hopefully they cost less than the $750K we spent on our SIEM a few years ago.