If you're running a network performance monitoring system, I'll bet you think you have visibility into your network.
I say you don't – or at least that your vision may be a bit more blurry than you realized.
There are three kinds of lies: lies, d***ed lies, and statistics.
In reality there's nothing wrong with the statistics presented by a network performance management system, so long as you understand the implications of the figures you're looking at and don't take them as literal truth. For example, when I set a utilization alarm threshold at 90% on a set of interfaces, what does 90% actually represent? If an application sends a burst of data and the interface is used at 95% of its capacity for 3 seconds, should that trigger an alarm? Of course not. How about if it's at 95% utilization for 3 minutes while a file transfer takes place; should that trigger an alarm? Maybe rather than triggering alarms on specific short-term utilization peaks I should be alerting on an hourly average utilization instead; that would even out the short-term peaks while still tracking the overall load on the link.
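To make the trade-off concrete, here's a quick sketch using made-up 5-minute samples over one hour (the values and the 90% threshold are purely illustrative): a peak-based alarm fires on every short spike, while the hourly average stays well under the threshold.

```python
from statistics import mean

# Hypothetical 5-minute utilization samples (%) over one hour:
samples = [40, 45, 95, 42, 38, 96, 44, 41, 39, 95, 43, 40]

threshold = 90

# Alarming on each short-term peak fires once per spike...
peak_alarms = sum(1 for s in samples if s > threshold)

# ...while the hourly average smooths the spikes away entirely.
hourly_avg = mean(samples)

print(peak_alarms)           # 3 alarms in the hour
print(round(hourly_avg, 1))  # 54.8 -- nowhere near the threshold
```

Neither behaviour is "correct"; it depends on whether those three spikes are something you need to know about.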
And thus we touch on the murky world of network performance management and the statistical analysis that takes place in order to present the user with a single number to worry about. Each product developer will make their own decisions about how to process the numbers which means that given the same underlying data, each product on the market will likely generate slightly different results.
Garbage In, Garbage Out
Before any statistical analysis can take place, data must be gathered from the devices. "GIGO" implies that if the data are bad, the outputs will be bad, so what data are we gathering, and how good are they?
Monitoring systems will typically grab interface statistics every 5 minutes, and a standard MIB-II implementation can grab information such as:
- ifSpeed (or ifHighSpeed); the speed of the interface
- ifInOctets / ifHCInOctets (received octets, or bytes)
- ifOutOctets / ifHCOutOctets (sent octets, or bytes)
Since there is no "current utilization" MIB entry, two polls are required to determine interface utilization. The first sets a baseline for the current in/out counters, and the second can be used to determine the delta (change) in those values. Multiply the deltas by 8 (converting bytes to bits), divide by the polling interval in seconds, and I have bits-per-second values which I can use in conjunction with the interface speed to determine the utilization of the interface. Or rather, I can determine the mean utilization for that time period. If the polling interval is five minutes, do I really know what happened on the network in that time?
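That calculation can be sketched in a few lines. This is a simplified illustration with invented counter values, not any vendor's implementation; note the extra wrinkle that the 32-bit ifInOctets/ifOutOctets counters wrap, which is exactly why the 64-bit ifHC* counters exist.

```python
def utilization_percent(octets_t0, octets_t1, interval_secs, if_speed_bps,
                        counter_max=2**32):
    """Mean utilization (%) over a polling interval, from two octet counter reads.

    Tolerates a single counter wrap between polls. The 32-bit
    ifInOctets/ifOutOctets counters wrap at 2**32; pass counter_max=2**64
    for the high-capacity ifHCInOctets/ifHCOutOctets counters.
    """
    delta = octets_t1 - octets_t0
    if delta < 0:                        # counter wrapped between polls
        delta += counter_max
    bits_per_second = (delta * 8) / interval_secs
    return 100.0 * bits_per_second / if_speed_bps

# Example: 112,500,000 octets in a 300-second poll on a 100 Mbps interface
# (ifSpeed = 100,000,000) works out to 3,000,000 bps, i.e. 3% utilization.
print(utilization_percent(1_000_000_000, 1_112_500_000, 300, 100_000_000))  # 3.0
```

On a fast link a 32-bit counter can wrap more than once within a 5-minute poll, at which point the delta is silently wrong; another reason the numbers deserve some skepticism.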
The charts below represent network interface utilization measured every 10 seconds over a five minute period:
All four charts have a mean utilization of 50% over that five minutes, so that's what a 5-minute poll would report. Do you still think you have visibility into your network?
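The effect is easy to reproduce with synthetic data. The four shapes below are my own approximations of the charts, not the original data: a flat 50%, a 10-second on/off square wave, a saturated first half, and a steady ramp; all four have an identical 5-minute mean.

```python
from statistics import mean

# Utilization (%) sampled every 10 seconds over 5 minutes = 30 samples.
shapes = {
    "steady":      [50.0] * 30,                        # flat at 50%
    "alternating": [100.0, 0.0] * 15,                  # flapping full/idle
    "half_burst":  [100.0] * 15 + [0.0] * 15,          # saturated, then silent
    "ramp":        [100.0 * i / 29 for i in range(30)],# steadily climbing
}

for name, samples in shapes.items():
    print(f"{name:12s} mean={mean(samples):5.1f}%  peak={max(samples):5.1f}%")
```

Every row prints a mean of 50.0%, yet three of the four interfaces spent time completely saturated; a single 5-minute data point can't tell these apart.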
Network performance management is one big, bad set of compromises, and here are a few of the issues that make it challenging to get right:
- Polling more often means more data, at a better resolution
- More data means more storage
- More data means more processing is required to "roll up" data for trending, wide date-range views, and so on.
- How long should historical data be kept?
- Is it ok to roll up data over a certain age and reduce the resolution? e.g. after 24 hours, take 1-minute polls and average them into 5-minute data points, and after a week average those into 15-minute data points, to reduce storage and processing?
- Is the network management system able to cope with performing more frequent polling?
- Can the end device cope with more frequent polling?
- Can I temporarily get real-time high-resolution data for an interface when I'm troubleshooting?
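The roll-up question deserves particular suspicion, because averaging averages compounds the loss of visibility. A minimal sketch (with invented utilization values) of the scheme described above, averaging groups of consecutive data points into coarser ones:

```python
from statistics import mean

def roll_up(points, factor):
    """Average each consecutive group of `factor` data points into one point."""
    return [mean(points[i:i + factor]) for i in range(0, len(points), factor)]

# One hour of 1-minute utilization samples (%): quiet, then a 5-minute
# spike to 95% at the end of the hour. (Hypothetical values.)
one_minute = [30.0] * 55 + [95.0] * 5

five_minute  = roll_up(one_minute, 5)    # 12 points: the spike survives as 95.0
fifteen_min  = roll_up(one_minute, 15)   # 4 points: the spike is diluted to ~51.7

print(max(five_minute))   # 95.0
print(max(fifteen_min))   # ~51.7 -- the 95% spike has vanished from the chart
```

A 5-minute roll-up still shows the spike; at 15-minute resolution the same event looks like mildly elevated load. That's the cost of the storage saving.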
What Do You Do?
There is no correct solution here, but what do you do? If you've tried 1-minute poll intervals, how did that work out in terms of the load on the polling system and on the devices being polled? Have storage requirements been a problem? Do you have any horror stories where the utilization on a chart masked a problem on the network? (I do, for sure.) How do you feel about losing precision on data older than a day (for example), or should data be left untouched? Do you have a better way to track utilization than SNMP polling? I'm also curious whether you simply hadn't thought about this before; are you thinking about it now?
I'd love to hear what decisions (compromises?) you've made or what solutions you've deployed when monitoring your network, and why. Maybe I can learn a trick or two!