
You Can't Always Prevent a Disaster, But Preparing for One is the Next Best Thing!

Level 12

Network outages caused by configuration errors are common. How quickly the network recovers from them, however, depends on the administrator's readiness.

When an outage occurs, the critical need is to identify, locate, and correct configuration issues in a matter of minutes. This is why the integration of SolarWinds Network Configuration Manager (NCM) with SolarWinds Network Performance Monitor (NPM) is so compelling for disaster recovery preparedness.


NPM & NCM Duo Takes You From Problem to Resolution in Just 7 Clicks!

For example, consider the following scenario: NPM alerts you that a critical router interface is pegged at 90% utilization.

  1. Jump into NCM, where you instantly spot a config change alert ('Config Change Notification' must be pre-configured in NCM).
  2. The 'Main Node Details' page reveals the exact problem: a config change was made recently.
  3. Use 'Compare Configs' to check for differences between the 'new' and the 'working' config. Changes are highlighted in a block within the config code (they could be in the interface speed, a QoS policy, or anything else).
  4. Select the device from the 'Configuration Management' page to view configuration and node details.
  5. Click the 'Upload' button and select the right configuration file to revert the changes made earlier.
  6. Click 'Upload' to push the last-known-good config.
  7. Click 'Transfer Status' to view the status of the upload. Problem solved!
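The heart of steps 3 through 6 is comparing the new config against the last-known-good copy and pushing the good one back. That idea can be sketched outside of NCM with plain Python's `difflib`; the config snippets below are hypothetical examples, not anything NCM produces:

```python
import difflib

# Hypothetical configs; in practice NCM stores and versions these for you.
last_known_good = """interface Gi0/1
 speed 1000
 service-policy output QOS-WAN
""".splitlines(keepends=True)

running = """interface Gi0/1
 speed 100
 service-policy output QOS-WAN
""".splitlines(keepends=True)

def diff_configs(old, new):
    """Return a unified diff showing exactly what changed between two configs."""
    return list(difflib.unified_diff(old, new,
                                     fromfile="last-known-good",
                                     tofile="running"))

changes = diff_configs(last_known_good, running)
for line in changes:
    print(line, end="")
# If the diff shows an unwanted change (here, the interface speed dropped
# from 1000 to 100), 'reverting' is simply pushing last_known_good back.
```

The point is the same one the seven clicks make: once the last-known-good copy exists and the diff is in front of you, the fix is mechanical.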

Having all of your configurations stored, catalogued, and backed up allows you to recover from a hardware failure in minutes instead of hours of grueling work.  With NCM, it actually takes longer to physically rack and wire replacement network gear than it takes to get your device back to its pre-disaster status!

An NCM customer related the following story to us.

When a switch (Cisco Catalyst 4507R) failed in one of our larger offices, it took down 240 ports on our network. This included phones, computers, wireless devices, printers, etc. on an entire floor. No wireless, no phones, no printing, no computer connections – nothing for an entire floor! It actually took us longer to remove and replace the physical hardware than it did to recover from our last known good configuration. Putting it down to numbers, it looks something like this:

    • Time to drive replacement gear to office: 2 hrs
    • Time to remove and re-rack the replacement hardware: 1 hr
    • Time to get the switch back online with the proper configuration: 5 min
    • Time for the local team to plug all the network cables back into the switch: Not My Problem!

Integrated solutions are hard to build, but monitoring and recording configuration changes is extremely useful for determining whether a configuration change is the cause of a network outage. So, instead of just collecting a device configuration backup at a specific time each day, you can configure NCM to alert you every time there is a config change. This means you can immediately determine whether a reported issue coincides with a configuration change, and if it does, push the previous configuration to replace the new one and Bob's your uncle – the network is back!
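The change-detection idea behind alerting on every config change (rather than only backing up on a schedule) is simple: fingerprint each collected config and flag when the fingerprint differs. A minimal sketch, with names and logic of my own invention rather than NCM's internals:

```python
import hashlib

def fingerprint(config_text: str) -> str:
    """Hash a config so changes can be detected cheaply on every poll."""
    return hashlib.sha256(config_text.encode()).hexdigest()

# Hypothetical configs captured at two polling intervals.
yesterday = "interface Gi0/1\n speed 1000\n"
today     = "interface Gi0/1\n speed 100\n"

if fingerprint(today) != fingerprint(yesterday):
    # This is where an alert would fire; both copies are kept so the
    # admin can diff them and decide whether to roll back.
    print("Config change detected")
```

A scheduled daily backup only tells you that *something* changed since yesterday; alerting per change ties each alert to one specific edit, which is what makes the "did the outage coincide with a change?" question answerable in seconds.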

In sum, you can't always prevent network outages from happening, but you can be proactive by following best practices and tackle incidents with the least amount of network downtime. Take the necessary steps for disaster recovery preparedness, and when there is a network incident, you're ready to roll!

What similar NPM/NCM stories do you have?  Please share them with us here!

Level 7

You've kinda mixed two very different types of issues.

Hardware failures are very rare these days, and outages are mostly (over 70%) caused by human error – usually wrong or out-of-date configs.

Using a config diff when some basic alert is raised will rarely solve the problem on its own. The network admin will still need to do the root cause analysis manually, often under pressure (there's an outage, after all).

Network operations is moving toward proactive, preemptive maintenance: fixing config mistakes before there's an outage. That's the way to go in my mind.

Level 15

Thanks for the post!