Disaster Recovery - How Logging Can Help Ensure You'll Get There

Disaster recovery can be a complicated beast to tame. If you’re attempting to recover your IT infrastructure from some kind of disaster, you’re already faced with a number of challenges. First, there’s been an event that’s serious enough to warrant disaster recovery. Whether it’s a flood, a plague of locusts, or someone cutting the wrong fibre cables, things aren’t going swimmingly. As a result, people are going to be a little stressed out. Invariably, there will be a lot to contend with, not just from an infrastructure perspective, but also from a people and process point of view. If there’s been a disaster, people will be worried about their families, and possibly their homes. Some staff may not be able to make it to the recovery site (if you have one) to help with the recovery.

You’ll also likely be faced with the reality that your data centre is a complicated environment with a lot of moving parts. Some of that technical debt that you hoped wouldn’t be a problem any time soon has come back to bite you. And when was the last time you tested your recovery process? Did you fail over a few non-production workloads during the day and run a few ping tests? That’s probably not enough to see how things really behave in a disaster.

I’m a big fan of trying to keep things simple. But data centre operations aren’t always a simple thing. And the recovery of a data centre is usually less so. You can help yourself by leveraging event monitoring tools to understand your progress at any given time, and how far you've still got to go. It seems odd that something as simple as syslog could be useful in the event of a disaster. But keep in mind you have a whole lot of moving pieces, slightly stressed out staff members, and various upset business units to deal with. A tool like syslog can provide insights into what has happened previously as well as what point you’re up to in the recovery process. It’s one thing to follow a checklist or run sheet during a recovery. It’s another thing to be able to validate your progress during the recovery by looking at actual log files generated by hosts and applications coming back online. In my opinion, this is why leveraging tools such as syslog and SNMP is so critical to achieving a level of sanity when managing and operating data centre infrastructure.
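To make the idea of validating a run sheet against logs a little more concrete, here is a minimal Python sketch. It assumes traditional BSD-style syslog lines ("Mon DD HH:MM:SS host message") collected in one place, and a run sheet that maps each host to a message fragment signalling that its service is back. The host names, the message fragments, and the recovery_progress helper are all hypothetical; real messages depend entirely on your applications and syslog configuration.

```python
import re

# Hypothetical run sheet: host name -> message fragment that we treat as
# evidence the service on that host has come back. These strings are
# invented for illustration only.
RUN_SHEET = {
    "db01": "orders-db ready",
    "web01": "frontend listening",
    "app01": "billing-app startup complete",
}

# Matches a traditional BSD-style syslog line:
#   "Mon DD HH:MM:SS host message"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+(?P<msg>.*)$"
)

# Invented sample lines standing in for what a central syslog server
# might have collected during a recovery.
SAMPLE_LINES = [
    "Mar  3 02:14:07 db01 orders-db[812]: orders-db ready",
    "Mar  3 02:15:41 web01 frontend[77]: frontend listening on port 443",
    "Mar  3 02:16:02 app01 billing-app[90]: still waiting for database",
    "this line is not syslog and is skipped",
]


def recovery_progress(log_lines, run_sheet):
    """Return (recovered_hosts, pending_hosts), both sorted by name."""
    recovered = set()
    for line in log_lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything that does not look like syslog
        expected = run_sheet.get(match.group("host"))
        if expected is not None and expected in match.group("msg"):
            recovered.add(match.group("host"))
    pending = set(run_sheet) - recovered
    return sorted(recovered), sorted(pending)


if __name__ == "__main__":
    done, todo = recovery_progress(SAMPLE_LINES, RUN_SHEET)
    print("recovered:", done)
    print("pending:", todo)
```

A check like this does not replace the run sheet. It simply gives you evidence from the hosts themselves that a step really completed, and a quick view of what is still outstanding.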

Beyond validation, tools like these give you a way to prove to concerned business units, other infrastructure staff, and (potentially) vendors what happened with the infrastructure. This is particularly useful when a recovery activity has gone awry and people are left scratching their heads as to why that’s the case. Grabbing the current bundle of logs after the machine has come up is one thing, but if you can go back through the events of the last few hours, there may be some additional insights that can be had.

Disaster recovery is no fun at the best of times, and people rightfully try to avoid it if they can. But you can make things at least a little easier on yourself and your business by investing time and effort in tools that can give you the right information when you need it most. Sure, it’s a bad thing that your primary data centre is under 3 feet of water now, but at least you’ll have some clarity about what happened when your applications came up for air in your secondary data centre.

  • I don't like logging but it's a fact of life now.

  • Sometimes I feel like I should hold the title of "Captain Paranoid" when it comes to network downtime, and Murphy sits on my shoulder whispering into my ear all the tricks he might play on my company.

    This time he asked me "What will you do when the next disaster takes away your ability to quickly access your logging solution?  Heh, heh, heh . . ."

  • Sometimes it isn't until things have gone wrong that people can understand the value. Similar to house insurance and good data protection I guess.

  • I covered the technical debt you describe, and the uncertainty around DR planning, in my blog post, "The Devil Is In the Details". "Going to DR..." scares every DR professional.

    Your notes on syslog are very interesting. Make sure your SIEM isn't hosted in the facility that got destroyed... ;-) My advice: configure your syslog destination as an FQDN, not an IP address. That way you can move your SIEM on the fly for business continuity purposes by updating a DNS record, rather than reconfiguring every host.

  • Plenty of other good reasons for logging as well.  Having worked in environments with very mature monitoring deployments and then gone to others where it was an afterthought or was non-existent, it felt like my hands were tied.  One of the environments had no real monitoring solution either, so they were really flying blind.