In my last post on the subject of what I called "monitoring-spam" I talked about what can happen when you install a solution like NPM and turn on all alerts--or at least leave on the out-of-the-box alerts. I also asked how you all deal with those problems and promised that I'd cover how my team mitigates these issues.
Several of you use strategies similar to the ones I use today, including:
* Don't monitor access ports, printers, or other devices that either tend to "flap" or aren't as critical to know about immediately. Consider that a lot of users, for instance, undock and transition from wired to wireless during meetings--do you really need to know each time that happens?
* Separate logging from monitoring. I assume you log events somehow, but that's very different from active monitoring or alerting. We log every event that happens on our network, generating hundreds of gigabytes daily (every device with power and the ability to "talk" on the network sends all logs to a central location), but that is a very different thing from actively monitoring and possibly alerting on trouble areas.
* Establish reasonable baselines for alerting. If a server uses all of its memory for a short period, that may not be as critical as a server constantly pegged. Then again, even that may not be an issue as certain servers (SQL, Exchange) will grab everything they can get.
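To make the baseline idea concrete, here is a minimal Python sketch of a "sustained" threshold: it stays quiet on a brief spike and only fires once every one of the last few polls is over the limit. The class name, the 90% threshold, and the three-poll window are assumptions chosen for illustration, not settings from NPM or any other product.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when every one of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        # The deque drops the oldest sample automatically once it is full.
        self.samples = deque(maxlen=window)

    def observe(self, value):
        """Record one poll result; return True if the alert condition holds."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)


def first_alert_index(values, threshold, window):
    """Return the index of the first poll that triggers the alert, or None."""
    alert = SustainedThresholdAlert(threshold, window)
    for i, value in enumerate(values):
        if alert.observe(value):
            return i
    return None


# Hypothetical memory-utilization polls, in percent.
short_spike = [50, 99, 50, 50, 50, 50]
pegged = [50, 95, 96, 97, 98]

print(first_alert_index(short_spike, threshold=90, window=3))
print(first_alert_index(pegged, threshold=90, window=3))
```

A server that legitimately grabs all available memory (SQL, Exchange) would still trip this, so it probably needs a different threshold or no memory alert at all.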
What I'm interested in now, however, is a little more granularity in the discussion, specifically around links and routing:
* Do you monitor link saturation?
* Do you monitor unicast routing tables?
* Do you monitor multicast route tables?
Sounds similar to how we approach things here. Interestingly enough, I don't see a lot of folks monitoring unicast routing tables outside of very large scale or multi-tenant environments. The small enterprise and SMB seem to not watch route changes much. Then again, most in that market probably aren't multi-homed and also shouldn't see a lot of route flapping of any kind. Down is down for them, especially with limited redundancy.
"Don't monitor access ports" - yes, so long as they have end users attached. Access ports in a data center with servers attached is another issue, and I strongly recommend monitoring those as well as capturing utilization data (it's amazing how many problems have been identified in the past by looking at the pattern of a server's port utilization).
I've worked in a few places that monitor unicast routing tables, and it can be an incredibly useful tool to have when some application has suddenly failed and you're wondering "what changed?" I forget what the tool was as another team ran it, but I've certainly seen OSPF and BGP monitoring in place, with every route change being logged and timestamped with the details of the change (e.g. next hop changed from X to Y).
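I can't reproduce that tool, but the basic idea can be sketched in a few lines of Python: compare two snapshots of the routing table and log each difference with a timestamp. The snapshot format (a dict of prefix to next hop) and the function name are assumptions for illustration; how you collect the snapshots (SNMP, CLI scrape, a routing protocol listener) is left out entirely.

```python
from datetime import datetime, timezone


def diff_routes(before, after, when=None):
    """Compare two {prefix: next_hop} snapshots.

    Returns a list of (timestamp, prefix, detail) tuples, one per change.
    """
    when = when or datetime.now(timezone.utc)
    stamp = when.isoformat()
    changes = []
    for prefix in sorted(set(before) | set(after)):
        old, new = before.get(prefix), after.get(prefix)
        if old == new:
            continue  # route unchanged
        if old is None:
            detail = f"route added via {new}"
        elif new is None:
            detail = f"route withdrawn (was via {old})"
        else:
            detail = f"next hop changed from {old} to {new}"
        changes.append((stamp, prefix, detail))
    return changes


# Hypothetical snapshots using documentation address ranges.
before = {"10.0.0.0/24": "192.0.2.1", "10.0.1.0/24": "192.0.2.1"}
after = {"10.0.0.0/24": "192.0.2.2", "10.0.2.0/24": "192.0.2.1"}

for stamp, prefix, detail in diff_routes(before, after):
    print(stamp, prefix, detail)
```

Keeping the output as plain timestamped records makes it easy to line up against an application failure when someone asks "what changed?"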
Link saturation - assuming you mean to monitor utilization, then yes, yes, and thrice yes. It's very important to know what's going on on your links, but it does lead to a need for some statistical considerations about the information you are seeing.
- Peak utilization can be meaningless as a measure of utilization on some links, as bursty traffic will often saturate a link; it doesn't necessarily mean the link is overloaded. 95th percentile can at least start to remove some of the extremes.
- Average utilization over 24 hours... uh, yeah. 100% for 12 hours a day, 0% for 12 hours a day = 50% average utilization, yet you're dropping packets for half the day. Some monitoring systems allow you to monitor values based on your work hours, or similar, and that can be very revealing.
- Granularity of your data. If you poll once every 15 minutes, the data can get smoothed over very easily and you can miss serious performance problems. On the other hand, poll every minute and your monitoring system might be unhappy (and it's a lot of data to store and process).
- Setting alarm thresholds. Depending on the level of resiliency, use of Ethernet aggregation, and willingness to be oversubscribed during a failover, setting the right alarm threshold for a given link can be complex.
- Trending. It's really useful to see where your data is going, whether you process this manually or the NMS does it for you. It's good to know 3-6 months ahead of time that you're likely to hit your alarm threshold and be able to install more capacity before that happens.
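To put numbers on the peak, 95th percentile, and 24-hour average points above, here is a small Python sketch. The helper names, the nearest-rank percentile method, and the 08:00 to 20:00 work window are assumptions for illustration; your NMS or your provider may calculate these differently.

```python
def percentile_95(samples):
    """Nearest-rank 95th percentile: sort the samples and ignore the top 5%."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def mean(samples):
    """Plain arithmetic mean."""
    return sum(samples) / len(samples)


def business_hours_mean(hourly, start=8, end=20):
    """Average only the hours in [start, end); `hourly` is indexed by hour of day."""
    return mean(hourly[start:end])


# Link pegged at 100% from 08:00 to 20:00 and idle otherwise (percent utilization).
hourly = [0] * 8 + [100] * 12 + [0] * 4
print(mean(hourly))                 # 24-hour average hides the problem
print(business_hours_mean(hourly))  # work-hours average exposes it

# Mostly quiet link with a single burst: the peak looks alarming, the 95th percentile does not.
bursty = [10] * 99 + [100]
print(max(bursty), percentile_95(bursty))
```

The same samples give three very different answers depending on which statistic you choose, which is why it is worth deciding what question you are asking before setting the alarm threshold.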
And so much more...
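On the trending point, a crude version of what an NMS might do for you is a straight-line fit over daily utilization figures, extrapolated to the alarm threshold. This is only a sketch under the assumption that growth is roughly linear; the function name and sample numbers are made up for illustration.

```python
def days_until_threshold(daily_util, threshold):
    """Fit a least-squares line to daily utilization and extrapolate to `threshold`.

    Returns the number of days after the last sample at which the fitted line
    reaches the threshold, 0.0 if it is already there, or None if there are too
    few samples or the trend is flat or falling.
    """
    n = len(daily_util)
    if n < 2:
        return None
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_util) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_util))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


# Hypothetical daily 95th-percentile utilization, growing one point per day.
history = [40, 41, 42, 43, 44]
print(days_until_threshold(history, threshold=80))
```

Real traffic is rarely this tidy, so treat the result as a prompt to go and look at the graph rather than as a forecast.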
Monitoring server access ports:
There was some debate here at first regarding whether to do so or not.
All the pros of doing so could equally be applied to any resource anywhere in the environment if a department/group admin wished to do so.
Ultimately we took the stance that the network group will monitor the network hardware and the ports that are required for network connectivity. Server admins can worry about their NICs; we (the network admin group) are not really concerned at that level of granularity. If there's an issue with a switchport, it'll probably show up in the myriad of logical adapters (usually all of them) that the server owner chooses to monitor at the node level.
Great information, John!
We monitor unicast routes, but that's largely my own personal paranoia and the business we're in. 95th percentile monitoring on links is standard here too, if for no other reason than to match how we're billed by our various providers. We don't monitor multicast at this point.