Disaster Recovery - The Postmortem

Level 9

So you’ve made it through a disaster recovery event. Well done! Whether it was a simulation or an actual recovery, it was possibly a tense and trying time for you and your operations team. The business is no doubt happy to know that everything is back up and running from an infrastructure perspective, and they’re likely scrambling to test their applications to make sure everything’s come up intact.

How do you know that your infrastructure is back up and running though? There are probably a lot more green lights flashing in your DC than there are amber or red ones. That’s a good sign. And you probably have visibility into monitoring systems for the various infrastructure elements that go together to keep your business running. You might even be feeling pretty good about getting everything back without any major problems.

But would you know if anything did go wrong? Some of your applications might not be working. Or worse, they could be working, but with corrupt or stale data. Your business users are hopefully going to know if something’s up, but that’s going to take time. It could be something as simple as a server that’s come up with the wrong disk attached, or a snapshot that’s no longer accessible.

Syslog provides a snapshot of what happened during the recovery, and it's just as useful as a validation tool while you're still in the midst of one. When all of those machines come back on after a DC power failure, for example, a flood of messages will be sent to your (hopefully centralized) syslog targets. You can then go back through the logs to confirm that hosts came back in the correct order. More importantly, if an application is still broken after the first pass at recovery, syslog can help you pinpoint where the problem lies. Rather than manually checking every piece of infrastructure and code that makes up the application stack, you can narrow your focus and, hopefully, resolve the problem far more quickly than if you were just told "there's something wrong with our key customer database."
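As a rough illustration of the "did hosts come back in the right order" check, here's a minimal Python sketch. It walks a centralized syslog file, records the first message seen from each host, and compares the order hosts reported in against the order your runbook expects. The file path, the traditional "Mmm dd hh:mm:ss hostname ..." line layout, and the expected host list are all assumptions you'd adapt to your own environment.

```python
#!/usr/bin/env python3
"""Sketch: check the order hosts came back in after a recovery.

Assumes a centralized syslog file in the traditional BSD format
("Mmm dd hh:mm:ss hostname message ...") and that the file covers only
the recovery window (or has been pre-filtered to it). The path and the
expected recovery order are placeholders.
"""
from datetime import datetime

SYSLOG_FILE = "/var/log/remote/all.log"               # hypothetical aggregate log
EXPECTED_ORDER = ["san01", "esx01", "db01", "app01"]  # example runbook order

first_seen = {}  # hostname -> timestamp of first message seen

with open(SYSLOG_FILE) as fh:
    for line in fh:
        parts = line.split(maxsplit=4)
        if len(parts) < 4:
            continue  # skip malformed lines
        try:
            # No year in traditional syslog timestamps; ordering only.
            ts = datetime.strptime(" ".join(parts[:3]), "%b %d %H:%M:%S")
        except ValueError:
            continue
        host = parts[3]
        if host not in first_seen:
            first_seen[host] = ts

actual_order = sorted(first_seen, key=first_seen.get)
print("Order hosts reported in:", " -> ".join(actual_order))

# Compare against the runbook, ignoring hosts we have no messages for.
expected_seen = [h for h in EXPECTED_ORDER if h in first_seen]
actual_expected = [h for h in actual_order if h in EXPECTED_ORDER]
if expected_seen != actual_expected:
    print("WARNING: recovery order differs from the runbook")

missing = [h for h in EXPECTED_ORDER if h not in first_seen]
if missing:
    print("No messages seen from:", ", ".join(missing))
```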

Beyond troubleshooting though, I think syslog is a great tool to use when you need to provide some kind of proof back to the business that their application is either functional or having problems because of an issue outside of the infrastructure layer. You've probably heard someone say that "it's always the network" when the problem has nothing to do with the network. But proving that to an unhappy end user or developer whose key application has stopped working can be tricky. Having logs available to show them will at least give them some comfort that the problem isn't with the network, or the storage, or whatever.

Syslog also gives you a way of validating your recovery process and providing the business, or the operations manager, with evidence that you've done the right thing along the way. It's not about burying the business with thousands of pages of logs, but rather demonstrating that you know what you're doing, you have a handle on what's happening at any given time, and you can pinpoint issues quickly. Next time there is an issue, the business is going to have a lot more confidence in your ability to resolve the crisis. This can only be a good thing when it comes to improving the relationship between business users and their IT teams.

15 Comments
Level 14

If stuff doesn't come back up properly here, we soon know about it. The 'helldesk' phones start ringing off the hook. If my SAM RAG screen is all green then I can feel fairly confident that my servers are up (and services and other monitored stuff). Then it is over to the application support peeps to check their bits. We can see if there are any network issues from another monitoring tool used by the network team. We also have application and business owners defined, so we contact them and get them to check their apps. That covers most of it. Anything left is usually a team effort to investigate (helpfully expedited by several managers standing over us asking idiotic questions and, heaven forbid, offering 'advice'). We do collect some syslogs and have an old version of Kiwi, but that project kinda died before my time. After I finish sorting out SAM I hope to get LEM installed and working.

Of course, the alternative is turning the phone off and going to the pub (BOFH).

Relying solely, or heavily, on syslog isn't great.  In fact, I think that's a bad practice.

But syslog CAN be a useful tool in determining root causes and affected systems.  And once you have a list of affected systems from syslog, and the reactions and symptoms of outages that syslog can also provide, you have a nice set of things to investigate.
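To illustrate that, here's a small, hypothetical sketch of turning raw syslog into an investigation list: it counts error-sounding messages per reporting host so the noisiest systems float to the top. The log path, keyword list, and line format are assumptions, not anything specific to a particular syslog product.

```python
#!/usr/bin/env python3
"""Sketch: build a 'systems to investigate' list from syslog.

Purely illustrative -- the path, keywords, and parsing all assume the
traditional "Mmm dd hh:mm:ss hostname message" layout.
"""
from collections import Counter

SYSLOG_FILE = "/var/log/remote/all.log"   # hypothetical aggregate log
KEYWORDS = ("error", "fail", "down", "timeout", "refused")

hits = Counter()
with open(SYSLOG_FILE) as fh:
    for line in fh:
        lower = line.lower()
        if not any(k in lower for k in KEYWORDS):
            continue
        parts = line.split(maxsplit=4)
        if len(parts) >= 4:
            hits[parts[3]] += 1   # field 4 is the reporting host

# Hosts with the most error-ish messages go to the top of the list.
for host, count in hits.most_common(20):
    print(f"{host:20} {count:6} suspect messages")
```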

Syslog can't tell you what parts of a complicated system are still down after a recovery.  You'll still need awesome monitoring and actual human input from teams who specialize in supporting satellite applications that tie into those systems which were down.

If our Electronic Medical Healthcare Record application/system were to go down, recovering it wouldn't be as simple as restarting its server or failing over to its hot standby system. The dozens of ancillary apps that rely on that EMHR may need to be shut down gracefully and restarted--IN THE RIGHT ORDER--before each will function properly again.

And syslog isn't a help there.  Only understanding what must occur, and when it must occur, to recover after an environment-down situation, will get you back fully up and running.  Knowing the apps, what they depend on, and who's responsible for their management and troubleshooting, will be more helpful than syslog in getting mission critical systems back up.
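To make the "in the right order" point concrete, here's a tiny, hypothetical sketch: given a documented dependency map (every name below is invented), a topological sort spits out a safe restart order. That knowledge lives in your documentation and with your app owners, not in syslog.

```python
#!/usr/bin/env python3
"""Sketch: derive a restart order from documented app dependencies.

The application names and dependency map are entirely made up -- the
point is that a documented dependency graph is what tells you the
order to bring things back in.
"""
from graphlib import TopologicalSorter

# "app": {set of things it depends on}
DEPENDS_ON = {
    "emhr_db":       set(),
    "emhr_app":      {"emhr_db"},
    "lab_interface": {"emhr_app"},
    "pharmacy_feed": {"emhr_app"},
    "reporting":     {"emhr_db", "emhr_app"},
}

# static_order() yields dependencies before the apps that need them,
# which is exactly the restart order we want.
restart_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print("Restart in this order:", " -> ".join(restart_order))
```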

On the other hand, syslog IS useful for the P.I.R. and the R.C.A.  It can show you what failed first, the order in which things failed, and what happened prior to those failures.  This information is GREAT for discovering the cause of the outages.  But syslog is insufficient for ensuring everything is back up and running during the restoration procedures.

Don't rely on syslog solely.  But don't throw it out--it has great use in the right situation.

Level 20

Logging and SIEM seems to me to still have a long way to go before it's really "intelligent."

MVP

Nice write up

I would tend to agree with both rschroeder and penguinpunk on the view of syslog. Unfortunately, our syslog is being fed into Splunk, which is run by another group in another state, so getting access to the data records may be an issue. I used to feed syslog into my SolarWinds, but that also caused issues with the database filling up with errors from poorly configured switches around the globe complaining about bad Cisco ISE authentication. I want to set up a Kiwi filter, but another SolarWinds server is a bit much for my organization to handle right now. I am still trying to pick up the pieces from a recovery/upgrade last week.

Level 14

I tend to agree with all of you who pointed out that syslog is only part of the story. The more complex the environment, the more important documentation becomes, along with processes staged to bring things up/online only when prerequisites are achieved. It's a bit like conducting an orchestra in that timing is everything. That said, your comment about "always the network" hits home... but sometimes perception is reality, and fighting it can be, at the very least, challenging!

Great topic... Thanks for the write up! penguinpunk

Level 9

I agree, there are usually plenty of people ready to tell you something's wrong. And, of course, I'm a big fan of the alternative approach too.

Level 9

I agree. I don't think there's any one tool you can rely on to get the job done, and there's an awful lot about disaster recovery that relies heavily on having good documentation and staff who work well under pressure.

Syslog is your supporting data, much like any other logging that occurred during a disaster recovery event. It will be your processes that come under the most scrutiny. (Shameless plug: my blog on disaster recovery, "The Devil Is In the Details")

Dashboards, UX monitoring, Service Desk call volume, and so on will also be key indicators of how you're making out after a disaster recovery.

I haven't utilized syslog very often in the past. I tend to see it as: yes, we have the information, but it's buried in the noise of other syslogs.

Filtering and searching are the key, I guess. I'm hoping for the new Orion Log Manager. I didn't like the "old" Syslog Viewer in Orion. LEM has also been somewhat frustrating with certain tasks....

Level 13

Good Article

MVP

Nice write up! I agree ... the proof is in the pudding! I like your reasoning to use syslog as another mechanism to validate. We will be setting up a test run in the very near future. We just set up a DR site, replicating, with a 15 min RPO/RTO. Even though it will be scheduled, it will be most stressful. I am taking your advice ... check the syslog for additional verification.

Thanks for taking the time to share!

Level 9

Glad you liked the article, and good luck with your testing!

MVP

This article motivated me ... I have the Kiwi Syslog Server set up ... the last time I really used it was during the migration to this site about a year ago ... I got down and dirty last night and basically re-deployed, gathering data and fine-tuning my filters. Thanks for the push!!! I am highly motivated to play this infinite game; you all at THWACK have given me that push on days when I really need it ... something that I could never get in this government installation!!! Thanks penguinpunk

I've been in your same "information overload" / "death by syslog-trap" situation. 

A helpful option for me was installing Kiwi Syslog and pointing only certain key systems to it, which enables Kiwi to have high-impact information show up and not be buried.  My systems that report both to Kiwi AND to NPM are Core and Distribution switches/routers, certain data center equipment, and WAN routers in 7x24 critical care centers.  I keep a dedicated monitor RDP'd into the Kiwi server's interface, and I can instantly react when I see lines scrolling or appearing on that view.  Obviously, alerts are fired off to others from Kiwi, depending on the importance of the event.
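For anyone wanting to experiment with that "only the key systems" idea before standing up a Kiwi instance, a toy version is easy to sketch: a tiny UDP listener that drops everything except messages from a short whitelist of critical source addresses. The addresses and port below are made up, and this only illustrates the filtering concept; it isn't how Kiwi or NPM actually implement it.

```python
#!/usr/bin/env python3
"""Sketch of the 'dedicated view for key devices' idea.

A minimal UDP syslog listener that only shows messages from a short
list of critical source addresses, so high-impact lines aren't buried
in noise. Addresses and port are placeholders; production setups would
use Kiwi/rsyslog/Splunk rather than a script like this.
"""
import socket
from datetime import datetime

KEY_SOURCES = {"10.0.0.1", "10.0.0.2"}   # hypothetical core switches / WAN routers
LISTEN_PORT = 5514                        # unprivileged port for testing

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", LISTEN_PORT))

while True:
    data, (src_ip, _) = sock.recvfrom(8192)
    if src_ip not in KEY_SOURCES:
        continue                          # ignore the noisy majority
    msg = data.decode("utf-8", errors="replace").strip()
    print(f"{datetime.now():%H:%M:%S} {src_ip} {msg}")
```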

For bigger power across the enterprise of our 100 hospitals, clinics, and business service centers, all syslogs go to Splunk, while lower-output-devices' syslogs also go to NPM.

Splunk made it easy to winnow through the syslog chaff, separating the garbage from the gold, and offering us very useful recommendations for what to do about recognized patterns that may indicate malware or malicious intent.

My environment generates too many lines of syslog data for NPM or LEM to work with, and Splunk was purchased and sized correctly ($750K) for the need.  It properly handles the ridiculous number of syslog entries per second generated by our 100 ASA's and multiple 8540 WLC's, as well as the much smaller number of syslog messages generated by 800 switches and routers.  When we added ISE to our access switches, it generated orders of magnitude more information for the syslog server, and Splunk was the only option for receiving it all, and for generating useful advice about what any given syslog message or pattern indicates.