Disaster recovery can be a complicated beast to tame. If you’re attempting to recover your IT infrastructure from some kind of disaster, you’re already faced with a number of challenges. First, there’s been an event that’s serious enough to warrant disaster recovery. Whether it’s a flood, a plague of locusts, or someone cutting the wrong fibre cables, things aren’t going swimmingly. As a result, people are going to be a little stressed out. Invariably, there will be a lot to contend with, not just from an infrastructure perspective, but also from a people and process point of view. If there’s been a disaster, people will be worried about their families, possibly their homes. Some staff may not be able to make it to the recovery site (if you have one) to help with the recovery.
You’ll also likely be faced with the reality that your data centre is a complicated environment with a lot of moving parts. Some of that technical debt that you hoped wouldn’t be a problem any time soon has come back to bite you. And when was the last time you tested your recovery process? Did you fail over a few non-production workloads during the day and run a few ping tests? That’s probably not enough to see how things really behave in a disaster.
I’m a big fan of trying to keep things simple. But data centre operations aren’t always a simple thing. And the recovery of a data centre is usually less so. You can help yourself by leveraging event monitoring tools to understand your progress at any given time, and how far you've still got to go. It seems odd that something as simple as syslog could be useful in the event of a disaster. But keep in mind you have a whole lot of moving pieces, slightly stressed out staff members, and various upset business units to deal with. A tool like syslog can provide insights into what has happened previously as well as what point you’re up to in the recovery process. It’s one thing to follow a checklist or run sheet during a recovery. It’s another thing to be able to validate your progress during the recovery by looking at actual log files generated by hosts and applications coming back online. In my opinion, this is why leveraging tools such as syslog and SNMP is so critical to achieving a level of sanity when managing and operating data centre infrastructure.
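To make that validation idea concrete, here’s a minimal sketch of checking a run sheet against syslog output. Everything here is hypothetical: the host names, the run-sheet order, and the “startup complete” message are stand-ins for whatever your own hosts actually log when they come back online.

```python
import re

# Hypothetical run sheet: hosts expected to come back online, in order.
RUN_SHEET = ["db01", "app01", "web01"]

# Minimal RFC 3164-style pattern: "<PRI>TIMESTAMP HOSTNAME MESSAGE"
SYSLOG_RE = re.compile(
    r"^<\d+>\w{3}\s+\d+ \d{2}:\d{2}:\d{2} (?P<host>\S+) (?P<msg>.+)$"
)

def recovery_progress(lines, run_sheet=RUN_SHEET):
    """Return the hosts from the run sheet that have logged a
    startup message, preserving run-sheet order."""
    seen = set()
    for line in lines:
        m = SYSLOG_RE.match(line)
        if m and "startup complete" in m.group("msg").lower():
            seen.add(m.group("host"))
    return [h for h in run_sheet if h in seen]

# Illustrative log lines, not from any real system.
sample = [
    "<14>Mar  3 04:12:01 db01 kernel: Startup complete",
    "<14>Mar  3 04:15:44 web01 appsvc: startup complete",
    "<14>Mar  3 04:16:02 app01 appsvc: replaying journal",
]
print(recovery_progress(sample))  # db01 and web01 are up, app01 isn't yet
```

The point isn’t this particular script; it’s that the logs give you something machine-checkable to hold up against the run sheet, rather than relying on someone’s memory of which boxes came up.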
Beyond validation, tools like these give you a way to prove to concerned business units, other infrastructure staff, and (potentially) vendors what happened with the infrastructure. This is particularly useful when a recovery activity has gone awry and people are left scratching their heads as to why that’s the case. Grabbing the current bundle of logs after the machine has come up is one thing, but if you can go back through the events of the last few hours, there may be some additional insights that can be had.
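As a sketch of that “go back through the events” idea, assuming you have a centralised log store you can pull parsed entries from, extracting the window leading up to an incident might look like this. The entries and timestamps below are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical parsed log entries: (timestamp, host, message) tuples,
# as you might export from a centralised syslog collector.
LOGS = [
    (datetime(2024, 3, 3, 1, 5), "san01", "path redundancy lost"),
    (datetime(2024, 3, 3, 3, 50), "db01", "filesystem errors detected"),
    (datetime(2024, 3, 3, 4, 12), "db01", "startup complete"),
]

def events_before(logs, incident_time, hours=3):
    """Return events from the N hours leading up to the incident,
    oldest first - the window you'd hand to a vendor or a post-mortem."""
    cutoff = incident_time - timedelta(hours=hours)
    return sorted(
        (e for e in logs if cutoff <= e[0] <= incident_time),
        key=lambda e: e[0],
    )

window = events_before(LOGS, datetime(2024, 3, 3, 4, 0))
for ts, host, msg in window:
    print(ts, host, msg)
```

A slice like this is exactly the sort of thing that turns “we’re not sure why the recovery stalled” into “the SAN lost path redundancy three hours before anyone noticed”.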
Disaster recovery is no fun at the best of times, and people rightfully try to avoid it if they can. But you can make things at least a little easier on yourself and your business by investing time and effort in tools that can give you the right information when you need it most. Sure, it’s a bad thing that your primary data centre is under 3 feet of water now, but at least you’ll have some clarity about what happened when your applications came up for air in your secondary data centre.