What kind of devices are in the network? Can they be configured to send traps?
You should have two forms of monitoring going:
- Your NPM should be configured to poll your elements every X seconds
- Your devices should be configured to send syslog AND trap information back to NPM for each event/condition.
NPM can be set to its minimum polling frequency (10 seconds? I don't recall the exact value). But setting that frequency as the default for all your nodes could have a negative impact on your network, nodes, WAN, and NPM solution, depending on how large your network is, how many nodes/elements are being polled, how big your WAN pipes are, and how robust your NPM environment is.
The trap and syslog information could also negatively impact your world if you have extreme logging (severity 7, debugging) enabled. If you're using debug logging on your devices, you might overwhelm your WAN or NPM solution.
A good solution might be to pick a polling frequency shorter than NPM's default 120 seconds for a subset of your devices, perhaps your most important routers, switches, and servers. Then set them up to send traps and syslog to NPM, but at a severity less chatty than "debug".
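As a sketch, on a Cisco IOS device that setup might look like the following. The collector address (10.0.0.50) and community string are placeholders; adjust for your environment:

```
! Send syslog to the NPM server at informational (severity 6), not debugging (7)
logging host 10.0.0.50
logging trap informational

! Send SNMP traps to the same collector
snmp-server host 10.0.0.50 version 2c MyCommunity
snmp-server enable traps snmp linkdown linkup coldstart warmstart
```

Keeping the trap level at informational or lower is what prevents the debug-flood problem described above.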
You must consider your WAN's limitations before changing polling/reporting settings. Sending a lot of polling or syslog info over a T1 or fractional frame relay will impact other critical traffic over those links.
If you need more frequent information than the polling/reporting above offers, consider some one-off polling with a different tool, like the Engineer's Toolset's Bandwidth Gauge, which you could point at a few high-importance routers and set to poll them every eight seconds or so. For every gauge you build, you can have the Toolset display its historical traffic as a graph, which comes pretty close to a real-time display of bandwidth utilization. I do that for core, distribution, Internet, and VPN routers, and have a PC with four monitors dedicated to that information. It's very useful when troubleshooting congestion or flow issues, and it looks pretty impressive to passersby.
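The math behind any such gauge is the same: take two readings of an interface's octet counter, divide the delta by the sample interval, and compare against link speed. A minimal sketch (the counter values here are made up; a real poller would fetch `ifInOctets` or `ifHCInOctets` via SNMP and handle counter wrap):

```python
def utilization_pct(octets_then, octets_now, interval_s, link_bps):
    """Percent link utilization from two ifInOctets samples interval_s apart.

    Real code must also handle 32-bit counter wraparound (or use the
    64-bit ifHCInOctets counters on fast links).
    """
    bits = (octets_now - octets_then) * 8          # octets -> bits transferred
    return 100.0 * bits / (interval_s * link_bps)  # share of link capacity

# Example: two samples 8 seconds apart on a 100 Mbps link
print(utilization_pct(1_000_000, 11_000_000, 8, 100_000_000))  # -> 10.0
```

The eight-second interval mentioned above just makes the deltas small and the graph responsive; the calculation itself doesn't change.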
SNMP polling won't get you anywhere near real time; centralized SNMP polling software was never built for that. If the element supports syslog or SNMP traps, you can get much closer to real time.
For extremely critical items you can always use Traps and Syslogs ... those are just about as real time as you can get.
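The reason traps and syslog feel near real time is that the device pushes the event the moment it happens, instead of waiting for the next poll. A minimal stdlib-only sketch of the receiving side (port 5514 is a placeholder; real syslog uses UDP 514, which needs elevated privileges):

```python
import socket

def run_syslog_listener(host="0.0.0.0", port=5514, max_msgs=None):
    """Minimal UDP syslog receiver: print each datagram as it arrives.

    A real collector would parse the <PRI> header, filter by severity,
    and trigger alerts instead of just printing.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    seen = []
    while max_msgs is None or len(seen) < max_msgs:
        data, addr = sock.recvfrom(4096)                 # one syslog datagram
        msg = data.decode("utf-8", errors="replace")
        print(f"{addr[0]}: {msg}")                       # alert/act here
        seen.append(msg)
    sock.close()
    return seen
```

The latency here is just network transit time, which is why event-driven reporting beats any polling interval for critical conditions.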
For Cisco module changes, *cefcmodulestatus* is the flag to use, I think; just watch those conditions and how they send the logs, then hone in on them.
But yes, you can get more detail if your DB/storage can handle more frequent polls, though I would not go below 30 seconds for status polling or 4 minutes for statistics. With an alert check every 30 seconds, the alert fires no more than 59 seconds after the condition hits. That should be faster than someone can call the help desk and log a ticket that then gets routed your way, so at least you're looking before the ticket hits or the desk calls over to escalate.

The problem with checking your alerts TOO often is that each check has to roll all the way through the DB, and you don't want a new status check to start before the previous one reaches the end. Watch your logs for long queues or other signs that the SQL DB isn't being fully checked before the cycle rolls back to the top. The 4-minute statistics interval allows you to graph every 5 minutes.
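One way to sanity-check the timing claim above: the worst-case detection delay is bounded by the polling interval plus the alert-check interval, since the condition can occur just after a poll and the alert engine may have just finished a pass. A quick sketch of that arithmetic:

```python
def worst_case_alert_latency(poll_s, check_s):
    """Upper bound on time from condition to alert.

    Worst case: the condition lands right after a poll (wait up to poll_s
    for the next poll to see it), and the alert checker just ran (wait up
    to check_s more before the alert fires).
    """
    return poll_s + check_s

# 30 s status polling with a 30 s alert check
print(worst_case_alert_latency(30, 30))  # -> 60, i.e. just under a minute,
                                         # in line with the ~59 s figure above
```

Dropping either interval further buys little unless the DB can complete each check pass before the next one starts, which is exactly the queue-length symptom described above.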