Disasters come in many forms. I’ve walked in on my daughters when they were younger and doing craft things in their bedroom and said “this is a disaster!” When it comes to serious events though, most people think of natural disasters, like floods or earthquakes. But a disaster can also be defined as an event that has a serious impact on your infrastructure or business operations. It could be any of the following events:

  • Security-related (you may have suffered a major intrusion or breach)
  • Operator error (I’ve seen a DC go dark during generator testing because someone forgot to check the fuel levels)
  • Software faults (there are many horror stories of firmware updates taking out core platforms)

 

So how can SNMP help? SNMP traps, when captured in the right way, can be like a distress signal for your systems. If you’ve spent a bit of time setting up your infrastructure, you’ll hopefully be able to quickly recognise that something has gone wrong in your data centre and begin to assess whether you are indeed in the midst of a disaster. That’s right, you need to take a moment, look at the evidence in front of you, and then decide whether invoking your disaster recovery plan is the right thing to do.

 

Your infrastructure might be sending out a bunch of SNMP traps for a variety of reasons. This could be happening because someone in your operations team has deployed some new kit, or a configuration change is happening on some piece of key infrastructure. It’s important to be able to correlate the information in those SNMP traps with what’s been identified as planned maintenance.

 

Chances are,  if you’re seeing a lot of errors from devices (or perhaps lots of red lights, depending on your monitoring tools), your DC is having some dramas. Those last traps received by your monitoring system are also going to prove useful in identifying what systems were having issues and where you should start looking to troubleshoot. There are a number of different scenarios that play out when disaster strikes, but it’s fair to say that if everything in one DC is complaining that it can’t talk to anything in your other DC, then you have some kind of disaster on your hands.

 

What about syslog? I like syslog because it’s a great way to capture messages from a variety of networked devices and store them in a central location for further analysis. The great thing about this facility is that, when disaster strikes, you’ll (hopefully) have a record of what was happening in your DC when the event occurred. The problem, of course, is that if you only have one DC, and only have your syslog messages going to that DC, it might be tricky to get to that information if your DC becomes a hole in the ground. Like every other system you put into your DC, it’s worth evaluating how important it is and what it will cost you if the system is unavailable.

 

SNMP traps and syslog messages can be of tremendous use in determining whether a serious event has occurred in your DC, and understanding what events (if any) lead up to that event occurring. If you’re on the fence about whether to invest time and resources in deploying SNMP infrastructure and configuring a syslog repository, I heartily recommend you look to leverage these tools in your DC. They’ll likely come in extremely handy, and not just when disaster strikes.