In part one I talked about the first goal of a NOC, network management visibility. As I said in that post, if you can't see it, you can't manage it. Once you have achieved both breadth and depth of network management visibility, you need to look at how you approach management. This leads us to our second goal.
Goal 2 - Network Fault Management
Some might argue that security management trumps fault management, but remember that we are talking about NOC operations. Typically, security management is handled outside the NOC. Fault management wins out over performance management for a simple reason: it does not matter how fast your network devices are if they are unavailable. In my opinion, and in that of other folks who have designed NOC processes, fault management is a two-pronged effort. One effort is purely reactive: find and fix outages. The other effort is proactive fault prevention. Fault prevention saves the time and money required to mitigate a failed system. Let's consider a real-world fault and how prevention would have saved a lot of time and money.
The fault I am using in this example first showed up as a performance issue. We noticed on the NOC WAN monitoring screen that several T1 links saturated in the middle of the night. User calls about poor application performance flooded the call center as soon as the business day began. Keep in mind that this was before the days of NetFlow, so isolating the traffic culprit meant going to the data center with a Y cable and a sniffer. When we were able to analyze the WAN traffic using the sniffer, we saw that a database replication application was making sync requests every few milliseconds to several remote database servers. When we looked into the server, we found that one of the hard drives essential to the application had failed, causing the application to spew sync requests endlessly. All of this came from a rogue system, placed on the network without being added to the NOC systems.
This entire episode cost four hours of time from three network engineers and a half day of lost productivity for about 400 network users. An automated alert and service ticket for the failed drive would have completely avoided the issue.
The lesson here is to use as much automation in fault management as you can. While it is true that detecting the drive fault is reactive, detecting it quickly would have proactively prevented a much larger network issue.
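To make the "automated alert and service ticket" idea concrete, here is a minimal sketch in Python. It is not how our NOC tooling worked. The FaultMonitor and Ticket names, the host name db-replica-1, and the component name disk0 are all made up for illustration, and a real deployment would feed report() from whatever health source you have (SNMP traps, syslog, an agent) and hand tickets to your actual ticketing system. The one design point it shows is opening exactly one ticket per fault, so a drive that stays failed does not flood the queue.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Ticket:
    """A hypothetical service ticket; a real one would live in a ticketing system."""
    host: str
    component: str
    summary: str


@dataclass
class FaultMonitor:
    """Turns component health reports into tickets, one per open fault."""
    tickets: List[Ticket] = field(default_factory=list)
    _open: Set[Tuple[str, str]] = field(default_factory=set)

    def report(self, host: str, component: str, healthy: bool) -> None:
        key = (host, component)
        if healthy:
            # The component recovered, so a later failure counts as a new fault.
            self._open.discard(key)
            return
        if key in self._open:
            # A ticket is already open for this fault; do not raise a duplicate.
            return
        self._open.add(key)
        self.tickets.append(
            Ticket(host, component, f"{component} failed on {host}")
        )


# Usage: the same failure reported twice produces a single ticket.
monitor = FaultMonitor()
monitor.report("db-replica-1", "disk0", healthy=True)
monitor.report("db-replica-1", "disk0", healthy=False)
monitor.report("db-replica-1", "disk0", healthy=False)
```

Of course, none of this helps with a system the NOC does not know about, which is why visibility came first in part one.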
Stay tuned for Part Three - Performance Management