Level 11

How to Monitor Effectively

In my last post on the subject of what I called "monitoring-spam" I talked about what can happen when you install a solution like NPM and turn on all alerts--or at least leave on the out-of-the-box alerts.  I also asked how you all deal with those problems and promised that I'd cover how my team mitigates these issues.

Several of you have similar strategies to ones I use today, including:

* Don't monitor access ports, printers, or other devices that either tend to "flap" or aren't as critical to know about immediately.  Consider that a lot of users, for instance, undock and transition from wired to wireless during meetings--do you really need to know each time that happens?

* Separate logging from monitoring.  I assume you log events somehow, but that's very different from active monitoring or alerting.  We log every event that happens on our network, generating hundreds of gigabytes daily (every device with power and the ability to "talk" on the network sends all logs to a central location), but that is a very different thing from actively monitoring and possibly alerting on trouble areas.

* Establish reasonable baselines for alerting.  If a server uses all of its memory for a short period, that may not be as critical as a server constantly pegged.  Then again, even that may not be an issue as certain servers (SQL, Exchange) will grab everything they can get.
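One way to express the baseline idea above is to alert only when a metric stays over a threshold for several polls in a row, not on a single spike. This is a minimal Python sketch, not an NPM feature: the function name `sustained_breach`, the 90% threshold, and the sample values are all hypothetical.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if `samples` contains at least `min_consecutive`
    back-to-back readings above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise start over.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical memory utilization (%) polled at a fixed interval.
short_spike = [40, 45, 98, 97, 50, 42, 41, 44]
pegged = [95, 96, 97, 99, 98, 97, 96, 99]

print(sustained_breach(short_spike, 90, 5))  # brief spike only
print(sustained_breach(pegged, 90, 5))       # constantly pegged
```

Servers that are expected to grab all available memory (the SQL and Exchange case) would presumably need their own threshold or be left out of a rule like this.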

What I'm interested in now, however, is a little more granularity in the discussion--specifically in one area, links and routing:

* Do you monitor link saturation?

* Do you monitor unicast routing tables?

* Do you monitor multicast route tables?

24 Replies
Level 13

Re: How to Monitor Effectively

"Don't monitor access ports" - yes, so long as they have end users attached. Access ports in a data center with servers attached are another issue, and I strongly recommend monitoring those as well as capturing utilization data (it's amazing how many problems have been identified in the past by looking at the pattern of a server's port utilization).

I've worked in a few places that monitor unicast routing tables, and it can be an incredibly useful tool to have when some application has suddenly failed and you're wondering "what changed?" I forget what the tool was as another team ran it, but I've certainly seen OSPF and BGP monitoring in place, with every route change being logged and timestamped with the details of the change (e.g. next hop changed from X to Y).
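The tool isn't named above, so the following is only a rough Python sketch of the idea: compare two snapshots of a routing table and log each change with a timestamp and the before/after next hop. How the snapshots get collected (SNMP, CLI scrape, a routing-protocol listener) is outside the sketch, and `diff_routes`, the record fields, and the sample prefixes are all hypothetical.

```python
from datetime import datetime, timezone


def diff_routes(old, new, now=None):
    """Compare two {prefix: next_hop} snapshots and return a list of
    timestamped change records, ordered by prefix string."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    changes = []
    for prefix in sorted(set(old) | set(new)):
        before, after = old.get(prefix), new.get(prefix)
        if before == after:
            continue  # route unchanged, nothing to log
        if before is None:
            kind = "added"
        elif after is None:
            kind = "withdrawn"
        else:
            kind = "next_hop_changed"
        changes.append({
            "time": stamp,
            "prefix": prefix,
            "change": kind,
            "old_next_hop": before,
            "new_next_hop": after,
        })
    return changes


# Hypothetical snapshots taken on two consecutive polls.
previous = {"10.0.0.0/8": "192.0.2.1", "172.16.0.0/12": "192.0.2.1"}
current = {"10.0.0.0/8": "192.0.2.2", "198.51.100.0/24": "192.0.2.1"}

changes = diff_routes(previous, current)
for record in changes:
    print(record)
```

Keeping the records as plain dictionaries makes it easy to ship them to whatever central log store is already in place, which is what lets you answer "what changed?" after the fact.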

Link saturation - assuming you mean to monitor utilization, then yes, yes, and thrice yes. It's very important to know what's going on on your links, but it does lead to a need for some statistical considerations about the information you are seeing.

- Peak utilization can be meaningless as a measure of utilization on some links, as bursty traffic will often saturate a link; it doesn't necessarily mean the link is overloaded. 95th percentile can at least start to remove some of the extremes. 

- Average utilization over 24 hours... uh, yeah. 100% for 12 hours a day, 0% for 12 hours a day = 50% average utilization, yet you're dropping packets for half the day. Some monitoring systems allow you to monitor values based on your work hours, or similar, and that can be very revealing.

- Granularity of your data. If you poll once every 15 minutes, the data can get smoothed over very easily and you can miss serious performance problems. On the other hand, poll every minute and your monitoring system might be unhappy (and it's a lot of data to store and process).

- Setting alarm thresholds. Depending on the level of resiliency, use of Ethernet aggregation, and willingness to be oversubscribed during a failover, setting the right alarm threshold for a given link can be complex.

- Trending. It's really useful to see where your data is going, whether you process this manually or the NMS does it for you. It's good to know 3-6 months ahead of time that you're likely to hit your alarm threshold and be able to install more capacity before that happens.

And so much more...
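To make the peak / average / 95th percentile points above concrete, here is a small Python sketch using the same 100%-for-12-hours example. It assumes hourly samples and a nearest-rank percentile; the function names and the 08:00-20:00 work-hours window are hypothetical choices for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """samples: list of (hour_of_day, utilization_percent) tuples."""
    values = [util for _, util in samples]
    return {
        "peak": max(values),
        "average": sum(values) / len(values),
        "p95": percentile(values, 95),
    }


# Hypothetical day of hourly samples: saturated 08:00-19:59, idle otherwise.
day = [(hour, 100.0 if 8 <= hour < 20 else 0.0) for hour in range(24)]

all_day = summarize(day)
work_hours = summarize([s for s in day if 8 <= s[0] < 20])

print(all_day)
print(work_hours)
```

The 24-hour average comes out at 50% while the work-hours average is 100%, which is exactly the gap described above. Real data would normally use much finer samples than one per hour.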

Level 11

Re: How to Monitor Effectively

Great information, John!

We monitor unicast routes, but that's largely my own personal paranoia and the business we're in.  95th percentile monitoring on links is standard here too, if for no other reason than to match how we're billed by our various providers.  We don't monitor multicast at this point.

Level 13

Re: How to Monitor Effectively

Monitoring server access ports:

There was some debate here at first regarding whether to do so or not.

All the pros of doing so could equally be applied to any resource anywhere in the environment if a department/group admin wished to do so.

Ultimately we took the stance that the network group will monitor the network hardware and the ports that are required for network connectivity.  Server admins can worry about their NICs - we (the network admin group) are not really concerned at that level of granularity.  If there's an issue with a switchport, it'll probably show up in the myriad logical adapters (usually all of them) the server owner chooses to monitor at the node level.

Level 17

Re: How to Monitor Effectively

Do you monitor link saturation? Yes, yes, yes

Do you monitor unicast routing tables? No, not yet.

Do you monitor multicast route tables? Yes; but not 100% utilized

The 15-minute statistics on interfaces are a good interval for data; pollers and traps on a 2-minute or instant trigger are appropriate for more critical items.

(That is, if you are alerting on usage thresholds.)

I like a group of alerts for interface errors. One checks every 59 minutes for errors within the last hour; if the condition has reset by the next check (meaning no new errors), an "errors subsided" message is sent. But if I get an interface error message on each 59-minute check, then I need to send a tech to clean/check the fiber. We use this mainly on DC connections and core/distribution links, as this is not an alert you want triggering on the access layer.
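As a rough sketch of that alert group in Python: walk successive readings of a cumulative interface error counter, report new errors, report when they subside, and escalate when errors show up on several checks in a row. The `dispatch_after` cutoff of three checks is an assumption made up for the example (no number is given above), and the function name and messages are hypothetical.

```python
def evaluate_checks(error_counts, dispatch_after=3):
    """Walk successive readings of a cumulative interface error counter
    and return the message (or None) produced at each check.

    A counter that goes backwards (for example after a reload) is simply
    treated as "no new errors" in this sketch.
    """
    messages = []
    alerting = False
    consecutive = 0
    for previous, current in zip(error_counts, error_counts[1:]):
        new_errors = current - previous
        if new_errors > 0:
            consecutive += 1
            alerting = True
            if consecutive >= dispatch_after:
                messages.append("dispatch tech")
            else:
                messages.append("interface errors")
        elif alerting:
            # No new errors since the last check: the condition has reset.
            alerting = False
            consecutive = 0
            messages.append("errors subsided")
        else:
            messages.append(None)
    return messages


# Hypothetical counter readings, one per 59-minute check.
readings = [0, 0, 5, 5, 7, 9, 12, 12]
results = evaluate_checks(readings)
for message in results:
    print(message)
```

With these readings, a one-off burst produces an error message followed by "errors subsided", while three checks in a row with new errors ends in "dispatch tech".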

Thresholds are great to use on trap or syslog alerts, as they prevent you from getting a million emails and can also give you a window of response before the next log entry is translated into an alert.
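A minimal Python sketch of that kind of threshold on trap or syslog alerts: send at most one alert per source and message within a quiet window, and suppress the repeats. The class name `AlertThrottle`, the 300-second window, and the sample events are hypothetical; a real NMS would expose this as an alert setting rather than code.

```python
class AlertThrottle:
    """Allow at most one alert per key within `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.last_sent = {}  # key -> timestamp of the last alert sent

    def should_alert(self, key, timestamp):
        last = self.last_sent.get(key)
        if last is not None and timestamp - last < self.window:
            return False  # still inside the quiet window; suppress
        self.last_sent[key] = timestamp
        return True


# Hypothetical syslog events as (device and message, seconds since start).
events = [
    ("sw1 LINK-3-UPDOWN", 0),
    ("sw1 LINK-3-UPDOWN", 10),
    ("sw2 LINK-3-UPDOWN", 20),
    ("sw1 LINK-3-UPDOWN", 400),
]

throttle = AlertThrottle(window_seconds=300)
decisions = [throttle.should_alert(key, ts) for key, ts in events]
print(decisions)
```

The window is measured from the last alert that was actually sent, so a flapping port generates one email per window instead of one per log entry.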

Level 11

Re: How to Monitor Effectively

Sounds similar to how we approach things here.  Interestingly enough, I don't see a lot of folks monitoring unicast routing tables outside of very large scale or multi-tenant environments.  The small enterprise and SMB seem to not watch route changes much.  Then again, most in that market probably aren't multi-homed and also shouldn't see a lot of route flapping of any kind.  Down is down for them, especially with limited redundancy.

Level 15

Re: How to Monitor Effectively

SomeClown wrote:

* Do you monitor link saturation?

* Do you monitor unicast routing tables?

* Do you monitor multicast route tables?

Absolutely

Not yet.

Not yet.

With the scale of our environments (tens of thousands of elements globally) it is a daunting task to have the amount of data we do. However, there is always a drive for better data points and metrics for our customers, so we are always investigating new and more effective ways to predict and respond to incidents. We have some very precise templates used when network gear (switches, routers, load balancers, firewalls, etc.) is built out, so it is all monitored the same way. However, we are always looking for better solutions and more information for support, sales, and customers to review. We are going to be looking very strongly at route monitoring in the near future, and seeing how others perform that function will be interesting this week.

Level 11

Re: How to Monitor Effectively

I'm curious how many people are monitoring for departments outside of traditional IT.  For example, monitoring sales applications and reporting out to key stakeholders in the sales organization.  We haven't seen that sort of thing here, aside from a few corner cases, but I hear stories... 

Level 15

Re: How to Monitor Effectively

We specifically monitor data centers. We do have a team that does some more in-depth monitoring for server applications, URLs, and such, but my little team is NPM only.

Level 17

Re: How to Monitor Effectively

Basically anything I monitor outside our own equipment has alerts that go to the team that manages that item.

If it's a server or app, these would be related to a specific department.  So something from an Rx server/service would go to our informatics team, or basically Rx IT (IT dedicated to the Rx group and its apps).

I have some boxes that facilities likes to know the status of (they hold keys and allow codes for access and retrieval of the keys, cataloging everything to a server), so they want to know up/down.

And our Telehealth group wants to monitor room controls and projectors, alerting on lamp life or use, and on breaks or disconnects within their system that I am able to get visibility into using SNMP.  So far the telehealth monitoring only has some examples set up for them, as the full picture is a work in progress.

Usually I will set myself up on these alerts as well, for one to make sure I am aware in case they call, and also to make sure the alerts work properly.
