In a previous post I outlined the escalation process that your network monitoring system should be able to provide as part of efficient triage of multiple simultaneous events. With the right automated communications workflow, a small team of three could easily begin work on three different events within minutes.
Actually resolving the issues depends on their scope and the required fixes; reducing the time to resolution depends on what happens before problems occur.
Establishing a Point of Truth for Device Configurations
Let’s take one of the most common issues in IT management: an erroneous configuration is pushed to multiple devices, generating many connectivity and access alerts.
Let’s even assume that astute IT engineers quickly infer the cause of the problem. Better still, assume the monitoring software automatically downloads the running configuration from each affected device as part of the alerting workflow.
Since the running config is presumably the one with the problem, downloading it is useful only if you also have a back-up of the previous configuration. If you scheduled nightly back-ups for all impacted devices, comparing the running config with the last nightly back-up should quickly show you what changed and where the problem lies within those changes. It should also tell you whether resolving the problem is as simple as pushing the backed-up configuration to the relevant devices, or whether you need to edit the back-up to include selected non-erroneous lines from the running config.
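The comparison described above can be sketched in a few lines of Python using the standard library's difflib module. This is only an illustration under stated assumptions: both configs are available as plain text, the function name config_changes is made up for this example, and the sample config lines are invented rather than taken from any real device or monitoring product.

```python
import difflib

def config_changes(backup: str, running: str) -> dict:
    """Return the lines removed from, and added to, the backup config.

    Hypothetical helper: compares two configs line by line and ignores
    trailing whitespace. Leading whitespace is kept because indentation
    often carries meaning in device configs.
    """
    backup_lines = [line.rstrip() for line in backup.splitlines()]
    running_lines = [line.rstrip() for line in running.splitlines()]
    matcher = difflib.SequenceMatcher(a=backup_lines, b=running_lines, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(backup_lines[i1:i2])   # present only in the backup
        if tag in ("replace", "insert"):
            added.extend(running_lines[j1:j2])    # present only in the running config
    return {"added": added, "removed": removed}

# Invented sample data for illustration only.
SAMPLE_BACKUP = """hostname edge-sw-01
interface Vlan10
 ip address 10.0.10.1 255.255.255.0
 no shutdown
"""

SAMPLE_RUNNING = """hostname edge-sw-01
interface Vlan10
 ip address 10.0.10.1 255.255.255.0
 shutdown
"""

if __name__ == "__main__":
    changes = config_changes(SAMPLE_BACKUP, SAMPLE_RUNNING)
    for line in changes["removed"]:
        print("- " + line)
    for line in changes["added"]:
        print("+ " + line)
```

If every line in the "added" list is part of the erroneous push, restoring the back-up as-is is probably enough. If some added lines are legitimate changes made since the last nightly back-up, those are the lines you would carry over into the back-up before pushing it.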
If you have both the back-up and the running config downloaded at the time of the alert, the time to resolve the problem on all affected devices drops dramatically, to a matter of minutes (possibly as few as 10). Essentially, the two comparable configs give you a way of determining the point of truth for the config files in play.
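When the same erroneous change has been pushed to many devices, one way to keep resolution time short is to group devices by the change each one received, so that the fix is verified once per group rather than once per device. The sketch below assumes you already hold a back-up and a running config for each device as text. The names change_set and group_by_change, the device names, and the sample config lines are all hypothetical.

```python
import difflib
from collections import defaultdict

def change_set(backup: str, running: str) -> tuple:
    """Return (removed_lines, added_lines) as tuples so the result can be a dict key."""
    a = [line.rstrip() for line in backup.splitlines()]
    b = [line.rstrip() for line in running.splitlines()]
    removed, added = [], []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(a[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(b[j1:j2])
    return (tuple(removed), tuple(added))

def group_by_change(configs: dict) -> dict:
    """Map each distinct change set to the list of devices that show it.

    `configs` maps a device name to a (backup_text, running_text) pair.
    """
    groups = defaultdict(list)
    for device, (backup, running) in configs.items():
        groups[change_set(backup, running)].append(device)
    return dict(groups)

# Invented sample data: two devices received the same bad route, one is unchanged.
BAD_ROUTE = "ip route 0.0.0.0 0.0.0.0 192.0.2.254"
SAMPLE_CONFIGS = {
    "sw1": ("hostname sw1\n", "hostname sw1\n" + BAD_ROUTE + "\n"),
    "sw2": ("hostname sw2\n", "hostname sw2\n" + BAD_ROUTE + "\n"),
    "sw3": ("hostname sw3\n", "hostname sw3\n"),
}

if __name__ == "__main__":
    for (removed, added), devices in group_by_change(SAMPLE_CONFIGS).items():
        print(devices, "removed:", list(removed), "added:", list(added))
```

Devices that land in the group with an empty change set match their back-ups and can likely be ruled out as a source of the alerts.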
If you do not have a recent back-up for each device, or you cannot compare the back-up to the erroneous config causing the current problem, then you are going to need considerably more time to verify which config will fully resolve the current crisis. As a start, you might reboot the devices and let them come up with their start-up configs, then work from the start-up configs to piece together the additions that restore the devices to the functionality they had before the crisis.