Disaster Recovery - The Postmortem

Level 9

So you’ve made it through a disaster recovery event. Well done! Whether it was a simulation or an actual recovery, it was possibly a tense and trying time for you and your operations team. The business is no doubt happy to know that everything is back up and running from an infrastructure perspective, and they’re likely scrambling to test their applications to make sure everything’s come up intact.

How do you know that your infrastructure is back up and running though? There are probably a lot more green lights flashing in your DC than there are amber or red ones. That’s a good sign. And you probably have visibility into monitoring systems for the various infrastructure elements that go together to keep your business running. You might even be feeling pretty good about getting everything back without any major problems.

But would you know if anything did go wrong? Some of your applications might not be working. Or worse, they could be working, but with corrupt or stale data. Your business users are hopefully going to know if something’s up, but that’s going to take time. It could be something as simple as a server that’s come up with the wrong disk attached, or a snapshot that’s no longer accessible.

Syslog provides a snapshot of what happened during the recovery, and it's just as useful as a validation tool while you're still in the midst of one. When all of those machines come back on after a DC power failure, for example, a flood of messages will be sent to your (hopefully centralized) syslog targets. You can then go back through the logs to confirm that hosts came back in the correct order. More importantly, if an application is still broken after the first pass at recovery, syslog can help you pinpoint where the problem lies. Rather than manually checking every piece of infrastructure and code that makes up the application stack, you can narrow your focus and, hopefully, resolve the problem far more quickly than if you were just told "there's something wrong with our key customer database."
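As a rough illustration of the "did hosts come back in the right order" check, here's a minimal Python sketch. It walks a centralized syslog file, records the first message seen from each host, and compares the order hosts reported in against the order your runbook expects. The file path, the traditional "Mmm dd hh:mm:ss hostname ..." line layout, and the expected host list are all assumptions you'd adapt to your own environment.

```python
#!/usr/bin/env python3
"""Sketch: check the order hosts came back in after a recovery.

Assumes a centralized syslog file in the traditional BSD format
("Mmm dd hh:mm:ss hostname message ...") and that the file covers only
the recovery window (or has been pre-filtered to it). The path and the
expected recovery order are placeholders.
"""
from datetime import datetime

SYSLOG_FILE = "/var/log/remote/all.log"               # hypothetical aggregate log
EXPECTED_ORDER = ["san01", "esx01", "db01", "app01"]  # example runbook order

first_seen = {}  # hostname -> timestamp of first message seen

with open(SYSLOG_FILE) as fh:
    for line in fh:
        parts = line.split(maxsplit=4)
        if len(parts) < 4:
            continue  # skip malformed lines
        try:
            # No year in traditional syslog timestamps; ordering only.
            ts = datetime.strptime(" ".join(parts[:3]), "%b %d %H:%M:%S")
        except ValueError:
            continue
        host = parts[3]
        if host not in first_seen:
            first_seen[host] = ts

actual_order = sorted(first_seen, key=first_seen.get)
print("Order hosts reported in:", " -> ".join(actual_order))

# Compare against the runbook, ignoring hosts we have no messages for.
expected_seen = [h for h in EXPECTED_ORDER if h in first_seen]
actual_expected = [h for h in actual_order if h in EXPECTED_ORDER]
if expected_seen != actual_expected:
    print("WARNING: recovery order differs from the runbook")

missing = [h for h in EXPECTED_ORDER if h not in first_seen]
if missing:
    print("No messages seen from:", ", ".join(missing))
```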

Beyond troubleshooting though, I think syslog is a great tool to use when you need to provide some kind of proof back to the business that their application is either functional or having problems because of an issue outside of the infrastructure layer. You've probably heard someone say that "it's always the network" when the problem has nothing to do with the network. But proving that to an unhappy end user or developer whose key application has stopped working can be tricky. Having logs available to show them will at least give them some comfort that the problem isn't with the network, or the storage, or whatever.

Syslog also gives you a way of validating your recovery process and providing the business, or the operations manager, with evidence that you've done the right thing along the way. It's not about burying the business with thousands of pages of logs, but rather demonstrating that you know what you're doing, you have a handle on what's happening at any given time, and you can pinpoint issues quickly. Next time there is an issue, the business is going to have a lot more confidence in your ability to resolve the crisis. This can only be a good thing when it comes to improving the relationship between business users and their IT teams.

15 Comments
Level 14

If stuff doesn't come back up properly here, we soon know about it. The 'helldesk' phones start ringing off the hook. If my SAM RAG screen is all green then I can feel fairly confident that my servers are up (and services and other monitored stuff). Then it is over to the application support peeps to check their bits. We can see if there are any network issues from another monitoring tool used by the network team. We also have application and business owners defined, so we contact them and get them to check their apps. That covers most of it. Anything left is usually a team effort to investigate (helpfully expedited by several managers standing over us asking idiotic questions and, heaven forbid, offering 'advice'). We do collect some syslogs and have an old version of Kiwi, but that project kinda died before my time. After I finish sorting out SAM I hope to get LEM installed and working.

Of course, the alternative is turning the phone off and going to the pub (BOFH).

Relying solely, or heavily, on syslog isn't great.  In fact, I think that's a bad practice.

But syslog CAN be a useful tool in determining root causes and affected systems.  And once you have a list of affected systems from syslog, and the reactions and symptoms of outages that syslog can also provide, you have a nice set of things to investigate.
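To illustrate that, here's a small, hypothetical sketch of turning raw syslog into an investigation list: it counts error-sounding messages per reporting host so the noisiest systems float to the top. The log path, keyword list, and line format are assumptions, not anything specific to a particular syslog product.

```python
#!/usr/bin/env python3
"""Sketch: build a 'systems to investigate' list from syslog.

Purely illustrative -- the path, keywords, and parsing all assume the
traditional "Mmm dd hh:mm:ss hostname message" layout.
"""
from collections import Counter

SYSLOG_FILE = "/var/log/remote/all.log"   # hypothetical aggregate log
KEYWORDS = ("error", "fail", "down", "timeout", "refused")

hits = Counter()
with open(SYSLOG_FILE) as fh:
    for line in fh:
        lower = line.lower()
        if not any(k in lower for k in KEYWORDS):
            continue
        parts = line.split(maxsplit=4)
        if len(parts) >= 4:
            hits[parts[3]] += 1   # field 4 is the reporting host

# Hosts with the most error-ish messages go to the top of the list.
for host, count in hits.most_common(20):
    print(f"{host:20} {count:6} suspect messages")
```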

Syslog can't tell you what parts of a complicated system are still down after a recovery.  You'll still need awesome monitoring and actual human input from teams who specialize in supporting satellite applications that tie into those systems which were down.

If our Electronic Medical Healthcare Record application/system were to go down, recovering it wouldn't be as simple as restarting its server or failing over to its hot standby system. The dozens of ancillary apps that rely on that EMHR may need to be shut down gracefully and restarted--IN THE RIGHT ORDER--before each will function properly again.

And syslog isn't a help there.  Only understanding what must occur, and when it must occur, to recover after an environment-down situation, will get you back fully up and running.  Knowing the apps, what they depend on, and who's responsible for their management and troubleshooting, will be more helpful than syslog in getting mission critical systems back up.
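To make the "in the right order" point concrete, here's a tiny, hypothetical sketch: given a documented dependency map (every name below is invented), a topological sort spits out a safe restart order. That knowledge lives in your documentation and with your app owners, not in syslog.

```python
#!/usr/bin/env python3
"""Sketch: derive a restart order from documented app dependencies.

The application names and dependency map are entirely made up -- the
point is that a documented dependency graph is what tells you the
order to bring things back in.
"""
from graphlib import TopologicalSorter

# "app": {set of things it depends on}
DEPENDS_ON = {
    "emhr_db":       set(),
    "emhr_app":      {"emhr_db"},
    "lab_interface": {"emhr_app"},
    "pharmacy_feed": {"emhr_app"},
    "reporting":     {"emhr_db", "emhr_app"},
}

# static_order() yields dependencies before the apps that need them,
# which is exactly the restart order we want.
restart_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print("Restart in this order:", " -> ".join(restart_order))
```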

On the other hand, syslog IS useful for the P.I.R. and the R.C.A.  It can show you what failed first, the order in which things failed, and what happened prior to those failures.  This information is GREAT for discovering the cause of the outages.  But syslog is insufficient for ensuring everything is back up and running during the restoration procedures.

Don't rely on syslog solely.  But don't throw it out--it has great use in the right situation.

Level 20

Logging and SIEM seems to me to still have a long way to go before it's really "intelligent."

MVP

Nice write up

I would tend to agree with both rschroeder and penguinpunk on the view of syslog. Unfortunately, our syslog is being fed into Splunk, which is run by another group in another state, so getting access to the data records may be an issue. I used to feed syslog into my SolarWinds, but that also caused issues with the database filling up with errors from poorly configured switches around the globe complaining about bad Cisco ISE authentication. I want to set up a Kiwi filter, but another SolarWinds server is a bit much for my organization to handle right now. I am still trying to pick up the pieces from a recovery/upgrade last week.

Level 14

I tend to agree with all of you who pointed out that syslog is only part of the story. The more complex the environment, the more important documentation becomes, along with processes staged to bring things up/online only when prerequisites are achieved. It's a bit like conducting an orchestra in that timing is everything. That said, your comment about "always the network" hits home... but sometimes perception is reality, and fighting it can be, at the very least, challenging!

Great topic... Thanks for the write up! penguinpunk

Level 9

I agree, there are usually plenty of people ready to tell you something's wrong. And, of course, I'm a big fan of the alternative approach too.

Level 9

I agree. I don't think there's any one tool you can rely on to get the job done, and there's an awful lot about disaster recovery that relies heavily on having good documentation and staff who work well under pressure.

Syslog is your supporting data, much like any other logging that occurred during a disaster recovery event. It will be your processes that come under the most scrutiny. (Shameless plug: my blog on disaster recovery, "The Devil Is In the Details")

Dashboards, UX monitoring, Service Desk call volume, and so on will also be key indicators of how you're making out after a disaster recovery.

I haven't utilized syslog very often in the past. I tend to see it as: yes, we have the information, but it's buried in the noise of other syslogs.

Filtering and searching are the key, I guess. I'm hoping for the new Orion Log Manager. I didn't like the "old" Syslog Viewer in Orion. LEM has also been somewhat frustrating with certain tasks....

Level 13

Good Article

MVP

Nice write up! I agree ... the proof is in the pudding! I like your reasoning to use syslog as another mechanism to validate. We will be setting up a test run in the very near future. We just set up a DR site, replicating, with a 15 min RPO/RTO. Even though it will be scheduled, it will be most stressful. I am taking your advice ... check the syslog for additional verification.

Thanks for taking the time to share!

Level 9

Glad you liked the article, and good luck with your testing!

MVP

This article motivated me ... I have the Kiwi Syslog Server set up ... the last time I really used it was during the migration to this site about a year ago ... I got down and dirty last night and basically re-deployed, gathering data and fine-tuning my filters. Thanks for the push!!! I am highly motivated to play this infinite game; you all at THWACK have given me that push on days when I really need it ... something that I could never get in this government installation!!! Thanks penguinpunk

I've been in your same "information overload" / "death by syslog-trap" situation. 

A helpful option for me was installing Kiwi Syslog and pointing only certain key systems to it, which enables Kiwi to have high-impact information show up and not be buried.  My systems that report both to Kiwi AND to NPM are Core and Distribution switches/routers, certain data center equipment, and WAN routers in 7x24 critical care centers.  I keep a dedicated monitor RDP'd into the Kiwi server's interface, and I can instantly react when I see lines scrolling or appearing on that view.  Obviously, alerts are fired off to others from Kiwi, depending on the importance of the event.
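For anyone wanting to experiment with that "only the key systems" idea before standing up a Kiwi instance, a toy version is easy to sketch: a tiny UDP listener that drops everything except messages from a short whitelist of critical source addresses. The addresses and port below are made up, and this only illustrates the filtering concept; it isn't how Kiwi or NPM actually implement it.

```python
#!/usr/bin/env python3
"""Sketch of the 'dedicated view for key devices' idea.

A minimal UDP syslog listener that only shows messages from a short
list of critical source addresses, so high-impact lines aren't buried
in noise. Addresses and port are placeholders; production setups would
use Kiwi/rsyslog/Splunk rather than a script like this.
"""
import socket
from datetime import datetime

KEY_SOURCES = {"10.0.0.1", "10.0.0.2"}   # hypothetical core switches / WAN routers
LISTEN_PORT = 5514                        # unprivileged port for testing

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", LISTEN_PORT))

while True:
    data, (src_ip, _) = sock.recvfrom(8192)
    if src_ip not in KEY_SOURCES:
        continue                          # ignore the noisy majority
    msg = data.decode("utf-8", errors="replace").strip()
    print(f"{datetime.now():%H:%M:%S} {src_ip} {msg}")
```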

For bigger power across the enterprise of our 100 hospitals, clinics, and business service centers, all syslogs go to Splunk, while lower-output-devices' syslogs also go to NPM.

Splunk made it easy to winnow through the syslog chaff, separating the garbage from the gold, and offering us very useful recommendations for what to do about recognized patterns that may indicate malware or malicious intent.

My environment generates too many lines of syslog data for NPM or LEM to work with, and Splunk was purchased and sized correctly ($750K) for the need.  It properly handles the ridiculous number of syslog entries per second generated by our 100 ASA's and multiple 8540 WLC's, as well as the much smaller number of syslog messages generated by 800 switches and routers.  When we added ISE to our access switches, it generated orders of magnitude more information for the syslog server, and Splunk was the only option for receiving it all, and for generating useful advice about what any given syslog message or pattern indicates.