23 Replies Latest reply on Jan 30, 2014 1:30 PM by that1guy15

    How to Monitor Effectively

    SomeClown

      In my last post on the subject of what I called "monitoring-spam" I talked about what can happen when you install a solution like NPM and turn on all alerts--or at least leave on the out-of-the-box alerts.  I also asked how you all deal with those problems and promised that I'd cover how my team mitigates these issues.

      Several of you have similar strategies to ones I use today, including:

       

      * Don't monitor access ports, printers, or other devices that either tend to "flap" or aren't as critical to know about immediately.  Consider that a lot of users, for instance, undock and transition from wired to wireless during meetings--do you really need to know each time that happens?

       

      * Separate logging from monitoring.  I assume you log events somehow, but that's very different from active monitoring or alerting.  We log every event that happens on our network, generating hundreds of gigabytes daily (every device with power and the ability to "talk" on the network sends all logs to a central location), but that is a very different thing from actively monitoring and possibly alerting on trouble areas.

       

      * Establish reasonable baselines for alerting.  If a server uses all of its memory for a short period, that may not be as critical as a server constantly pegged.  Then again, even that may not be an issue as certain servers (SQL, Exchange) will grab everything they can get.

       

      What I'm interested in now, however, is a little more granularity in the discussion--specifically in one area: links and routing:

       

      * Do you monitor link saturation?

      * Do you monitor unicast routing tables?

      * Do you monitor multicast route tables?

        • Re: How to Monitor Effectively
          jgherbert

          "Don't monitor access ports" - yes, so long as they have end users attached. Access ports in a data center with servers attached is another issue, and I strongly recommend monitoring those as well as capturing utilization data (it's amazing how many problems have been identified in the past by looking at the pattern of a server's port utilization).

           

          I've worked in a few places that monitor unicast routing tables, and it can be an incredibly useful tool to have when some application has suddenly failed and you're wondering "what changed?" I forget what the tool was as another team ran it, but I've certainly seen OSPF and BGP monitoring in place, with every route change being logged and timestamped with the details of the change (e.g. next hop changed from X to Y).
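
A minimal sketch of that kind of route-change logging, assuming the routing table can be exported as a simple prefix-to-next-hop mapping at each poll. The snapshot format, the function names, and the addresses below are hypothetical and are not taken from the tool mentioned above.

```python
from datetime import datetime, timezone

def diff_routes(before, after):
    """Compare two {prefix: next_hop} snapshots and describe each change."""
    changes = []
    for prefix in sorted(before.keys() | after.keys()):
        old, new = before.get(prefix), after.get(prefix)
        if old == new:
            continue
        if old is None:
            changes.append(f"{prefix}: route added via {new}")
        elif new is None:
            changes.append(f"{prefix}: route withdrawn (was via {old})")
        else:
            changes.append(f"{prefix}: next hop changed from {old} to {new}")
    return changes

def log_changes(changes, now=None):
    """Prefix every change with a UTC timestamp so it can answer 'what changed?'."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return [f"{stamp} {change}" for change in changes]

# Hypothetical snapshots taken at two consecutive polls.
before = {
    "10.1.0.0/16": "192.0.2.1",
    "10.2.0.0/16": "192.0.2.1",
    "10.3.0.0/16": "192.0.2.9",
}
after = {
    "10.1.0.0/16": "192.0.2.5",
    "10.2.0.0/16": "192.0.2.1",
    "10.4.0.0/16": "192.0.2.9",
}

route_changes = diff_routes(before, after)
for line in log_changes(route_changes):
    print(line)
```

Keeping the diff separate from the timestamping makes it easy to feed the same changes to a log file, a syslog sender, or an alert.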

           

          Link saturation - assuming you mean to monitor utilization, then yes, yes, and thrice yes. It's very important to know what's going on on your links, but it does lead to a need for some statistical considerations about the information you are seeing.

           

          - Peak utilization can be meaningless as a measure of utilization on some links, as bursty traffic will often saturate a link; it doesn't necessarily mean the link is overloaded. 95th percentile can at least start to remove some of the extremes. 

          - Average utilization over 24 hours... uh, yeah. 100% for 12 hours a day, 0% for 12 hours a day = 50% average utilization, yet you're dropping packets for half the day. Some monitoring systems allow you to monitor values based on your work hours, or similar, and that can be very revealing.

          - Granularity of your data. If you poll once every 15 minutes, the data can get smoothed over very easily and you can miss serious performance problems. On the other hand, poll every minute and your monitoring system might be unhappy (and it's a lot of data to store and process).

          - Setting alarm thresholds. Depending on the level of resiliency, use of Ethernet aggregation, and willingness to be oversubscribed during a failover, setting the right alarm threshold for a given link can be complex.

          - Trending. It's really useful to see where your data is going, whether you process this manually or the NMS does it for you. It's good to know 3-6 months ahead of time that you're likely to hit your alarm threshold and be able to install more capacity before that happens.

           

          And so much more...
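
The statistical points in the list above can be illustrated with a short sketch. It assumes utilization samples are already available as plain percentages and uses the nearest-rank definition of a percentile; providers and monitoring systems may compute the 95th percentile differently. The sample data and function names are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(samples):
    """Peak, plain average, and 95th percentile of a list of utilization percentages."""
    return {
        "peak": max(samples),
        "average": sum(samples) / len(samples),
        "p95": percentile(samples, 95),
    }

def average_in_hours(hourly, start_hour, end_hour):
    """Average of hourly samples restricted to [start_hour, end_hour)."""
    window = hourly[start_hour:end_hour]
    return sum(window) / len(window)

# One sample per hour: idle overnight, saturated from 08:00 to 20:00.
hourly = [0.0] * 8 + [100.0] * 12 + [0.0] * 4
day_stats = summarize(hourly)
work_avg = average_in_hours(hourly, 8, 20)
print(day_stats["average"], work_avg)  # the 24-hour average hides the saturated half of the day

# Bursty link: 288 five-minute samples, mostly 20% with ten short bursts to 100%.
bursty = [20.0] * 278 + [100.0] * 10
bursty_stats = summarize(bursty)
print(bursty_stats["peak"], bursty_stats["p95"])  # the peak overstates the load, the p95 does not
```

The first data set is the "100% for 12 hours, 0% for 12 hours" case: the daily average looks comfortable while the work-hours average shows a saturated link. The second shows a few bursts driving the peak to 100% while the 95th percentile stays at the typical level.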

            • Re: How to Monitor Effectively
              SomeClown

              Great information, John!

               

              We monitor unicast routes, but that's largely my own personal paranoia and the business we're in.  95th percentile monitoring on links is standard here too, if for no other reason than to match how we're billed by our various providers.  We don't monitor multicast at this point.

              • Re: How to Monitor Effectively
                wbrown

                Monitoring server access ports:

                There was some debate here at first regarding whether to do so or not.

                All the pros of doing so could equally be applied to any resource anywhere in the environment if a department/group admin wished to do so.

                Ultimately we took the stance that the network group will monitor the network hardware and the ports that are required for network connectivity.  Server admins can worry about their NICs - we (the network admin group) are not really concerned at that level of granularity.  If there's an issue with a switchport it'll probably show up in the myriad of logical adapters (usually all) the server owner chooses for monitoring at the node level.

              • Re: How to Monitor Effectively
                cahunt

                Do you monitor link saturation? Yes, yes, yes

                Do you monitor unicast routing tables? No not yet.

                Do you monitor multicast route tables? Yes; but not 100% utilized

                 

                The 15 minute statistics on interfaces is a good threshold for data; use of pollers and traps on a 2 minute or instant trigger is proper for more critical items.

                (That is if you are alerting on Usage thresholds)

                 

                I like a group of alerts for interface errors: one that checks every 59 minutes for errors within the last hour. If the condition has reset by the next check (meaning no more new errors), an "errors subsided" message is sent. But if I get an interface error message on every 59-minute check, then I need to send a tech to clean/check the fiber. We use this mainly on DC connections and Core/Distribution links, as this is not an alert you want triggering on the access layer.
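
The check-and-reset logic described above could be sketched roughly as follows. This is not the actual NPM alert definition; it assumes each check yields a count of new interface errors, and `dispatch_after` (the number of consecutive bad checks before sending a tech) is a hypothetical parameter, since the post does not give a number.

```python
def evaluate_checks(error_deltas, dispatch_after=3):
    """Return one action per periodic check.

    error_deltas: new interface errors counted at each check (e.g. every 59 minutes).
    dispatch_after: assumed number of consecutive bad checks before escalating.
    """
    actions = []
    consecutive_bad = 0
    for delta in error_deltas:
        if delta > 0:
            consecutive_bad += 1
            if consecutive_bad >= dispatch_after:
                actions.append("escalate: dispatch tech")
            else:
                actions.append("alert: interface errors")
        else:
            # A clean check after a bad one resets the condition.
            actions.append("errors subsided" if consecutive_bad > 0 else "ok")
            consecutive_bad = 0
    return actions

# Hypothetical error counts from seven consecutive checks on one uplink.
actions = evaluate_checks([0, 12, 0, 5, 7, 9, 0])
for step, action in enumerate(actions, start=1):
    print(step, action)
```

A single bad check followed by a clean one produces an alert and then a "subsided" message; only an unbroken run of bad checks escalates.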

                 

                Thresholds are great to use on trap or syslog alerts; as they prevent you from getting a million emails, and also can give you a window of response before the next log entry will be translated into an alert.
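
One way to picture that kind of threshold on trap or syslog alerts is a per-source suppression window. This is a sketch under the assumption that each event can be reduced to a key (device plus message type) and a timestamp in seconds. The class name, the 15-minute window, and the device names are illustrative only.

```python
class AlertThrottle:
    """Allow at most one alert per key within a suppression window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}  # key -> timestamp of the last alert that was sent

    def should_alert(self, key, timestamp):
        last = self._last_sent.get(key)
        if last is not None and timestamp - last < self.window_seconds:
            return False  # still inside the window: keep the log entry, skip the email
        self._last_sent[key] = timestamp
        return True

# Hypothetical stream of (seconds, key) events from syslog.
events = [
    (0, "core-sw1 LINK-3-UPDOWN"),
    (30, "core-sw1 LINK-3-UPDOWN"),
    (60, "dist-sw2 LINK-3-UPDOWN"),
    (899, "core-sw1 LINK-3-UPDOWN"),
    (900, "core-sw1 LINK-3-UPDOWN"),
]

throttle = AlertThrottle(window_seconds=900)
sent = [(ts, key) for ts, key in events if throttle.should_alert(key, ts)]
print(sent)
```

Every event would still be logged; the throttle only decides which log entries become alerts.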

                  • Re: How to Monitor Effectively
                    SomeClown

                    Sounds similar to how we approach things here.  Interestingly enough, I don't see a lot of folks monitoring unicast routing tables outside of very large scale or multi-tenant environments.  The small enterprise and SMB seem to not watch route changes much.  Then again, most in that market probably aren't multi-homed and also shouldn't see a lot of route flapping of any kind.  Down is down for them, especially with limited redundancy.

                  • Re: How to Monitor Effectively
                    zackm

                    SomeClown wrote:

                    * Do you monitor link saturation?

                    * Do you monitor unicast routing tables?

                    * Do you monitor multicast route tables?

                    Absolutely

                    Not yet.

                    Not yet.

                     

                    With the scale of our environments (tens of thousands of elements globally) it is a daunting task to have the amount of data we do. However, there is always a drive for better data points and metrics for our customers, so we are always investigating new and more effective ways to predict and respond to incidents. We have some very precise templates used when network gear (switches, routers, load balancers, firewalls, etc) are built out, so they are all monitored the same way. However, we are always looking for better solutions and more information for support, sales, and customers to review. We are going to be looking very strongly at route monitoring in the near future and seeing how others perform that function will be interesting this week.

                      • Re: How to Monitor Effectively
                        SomeClown

                        I'm curious how many people are monitoring for departments outside of traditional IT.  For example, monitoring sales applications and reporting out to key stakeholders in the sales organization.  We haven't seen that sort of thing here, aside from a few corner cases, but I hear stories... 

                          • Re: How to Monitor Effectively
                            zackm

                            We specifically monitor Data Centers. We do have a team that does some more in depth monitoring for server applications and URLs and such, but my little team is NPM only

                            • Re: How to Monitor Effectively
                              cahunt

                              Basically anything I monitor outside our own equipment has alerts that go to the team that manages that item.

                              If it's a server, or app - these would be related to a specific department.  So something from a Rx Server/Service would go to our informatics team, or basically Rx IT(Special IT to the RX group and Apps)

                              I have some boxes that facilities likes to know status on (they hold keys and allow for codes for access and retrieval of the keys, cataloging everything to a server); so they want to know up/down.

                              And our Telehealth group wants to monitor Room Controls and Projectors, alerting on lamp life or use, and on breaks or disconnects within their system that I am able to get visibility into using SNMP.  So far this telehealth stuff only has some examples set up for them, as the full picture is a work in progress.

                               

                              Usually I will set myself up on these alerts as well, for one to make sure I am aware in case they call, and also to make sure the alerts work properly.

                          • Re: How to Monitor Effectively
                            michael stump

                            I'm in the habit of watching the Top 10 % Utilization very closely throughout the day. I care more about Data Center than WAN, so I'm looking at interface utilization for hosts more than other elements. Many server-side problems are the result of overburdened interfaces, so I'm always on the lookout for saturated internal links.

                            • Re: How to Monitor Effectively
                              byrona

                              Do you monitor link saturation?

                                   Yes

                              Do you monitor unicast routing tables?

                                   No

                              Do you monitor multicast route tables?

                                   No

                               

                              While we monitor link saturation, we don't alert on it.  We have found that at peak times it's common for some links to be saturated; it's known, accepted, and not considered an incident.  This goes back to only alerting on true incidents, which this generally isn't.  Unfortunately monitoring is rarely black and white when it comes to determining what is and isn't an incident. It often requires situational awareness and context, and this is where a NOC Tech comes into play.

                              • Re: How to Monitor Effectively
                                RandyBrown

                                Yes on link saturation monitoring ... but we don't alert on it currently.

                                Right now we do not monitor unicast/multicast route tables.

                                • Re: How to Monitor Effectively
                                  Scott Sadlocha

                                  Do you monitor link saturation?

                                       In both environments I have worked with, the answer is yes. With my previous environment, we monitored and had quite a few alerts set up for different scenarios, including discards and errors along with utilization. At my current place, we are monitoring, but not really fully utilizing alerting for this yet.

                                  Do you monitor unicast routing tables?

                                       No

                                  Do you monitor multicast route tables?

                                       No

                                  • Re: How to Monitor Effectively
                                    rharland2012

                                    Link saturation - absolutely. MPLS links especially.

                                    URT - we don't monitor, per se - but we are collecting the info and can view it. I'm the only one on our team who has even a passing interest at this point. If we changed our routing model, I feel like some interest would be elicited.

                                    MRT- never.

                                    • Re: How to Monitor Effectively
                                      Alen Geopfarth

                                      We monitor link saturation as well. I have a page created that gives me 24-hour, 1-week, and 1-month polling for WAN circuits. I created that one before we updated our NPM version and they changed to the new graphs. With the new graphs in place I created another web page that loads the last week's worth of internet circuit activity, which allows us to drill down to specific times. These pages are useful in giving me a sense over time of what is normal and what isn't. I then tweak the alerting to ignore "normal" and log "abnormal" and alert on "wrong".
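
The ignore/log/alert split could be expressed as a comparison against a baseline. The sketch below assumes "normal" means within two standard deviations of recent history and "wrong" means more than three; those cutoffs, the function name, and the sample history are assumptions for illustration, not how the pages above were built.

```python
from statistics import mean, pstdev

def classify(sample, history, log_sigma=2.0, alert_sigma=3.0):
    """Decide what to do with a new utilization sample given recent history.

    Returns "ignore" for normal, "log" for abnormal, "alert" for wrong.
    The sigma cutoffs are illustrative defaults, not recommended values.
    """
    baseline = mean(history)
    spread = pstdev(history)
    deviation = abs(sample - baseline)
    if deviation <= log_sigma * spread:
        return "ignore"
    if deviation <= alert_sigma * spread:
        return "log"
    return "alert"

# Hypothetical week of same-hour utilization samples for one circuit (percent).
history = [40, 42, 38, 41, 39, 40, 43, 37]

for sample in (42, 44.5, 60):
    print(sample, classify(sample, history))
```

In practice the history would be taken from the same hour and day type as the new sample, so a busy Monday morning is compared with other Monday mornings.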

                                      • Re: How to Monitor Effectively
                                        wbrown

                                        Link saturation:  Not really as such.  We'll monitor uplinks on physical uplink ports so we can have something to possibly point to when users complain of slow response times.  After we get such complaints we can make the suggestion that they pay for a larger connection to their area.

                                         

                                        Unicast routing tables:  not yet but definitely wish to do so after we upgrade from 10.3 to 10.5 (or higher depending on timing).

                                         

                                        Multicast routing tables:  not yet, probably need to do so when available, but not really a priority.

                                        • Re: How to Monitor Effectively
                                          jeffnorton

                                          So in reference to monitoring data center ports, SolarWinds provides a wonderful feature that lets you monitor but not monitor the data center access ports.  Confusing?  It's called the unplugged feature.  What this does is allow NPM to capture stats when it's up and wait to capture stats when it's down.   But wait, you say, isn't that normal operations?  The unplugged feature squelches the alarms that would normally trigger on an interface down.  So you can monitor for patterns of usage and not be bombarded by monitoring spam for devices that might be taken down for maintenance or the dreaded Microsoft OS update.  So if you are a network nerd and you just care about network related alarms and not server or user related alarms but still want to provide performance analysis for those elements, then unplugged is your solution.

                                          • Re: How to Monitor Effectively
                                            802jr

                                            We also monitor Link Saturation. When you have partial T-1s you need to do this; otherwise we have issues. Most of the time, something that is not supposed to be generating traffic is. We find the culprit and the issues get resolved.

                                             

                                            Great information for everyone.

                                            • Re: How to Monitor Effectively
                                              superfly99

                                              SomeClown wrote:

                                               

                                              * Do you monitor link saturation?

                                              * Do you monitor unicast routing tables?

                                              * Do you monitor multicast route tables?

                                               

                                              Yes. We get alerted if a remote site's link sits at or over 90% continuously for 10 mins. The email contains a link to the NetFlow page for that device, so it's easy to see where the traffic is coming from and whether it is legit traffic.

                                              No

                                              No

                                               

                                              I'll have to look into the monitoring of route tables. I do monitor for BGP changes on links. It's the only way to tell if a link goes down in most cases as the ports are still in up/up.
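
The "at or over 90% continuously for 10 mins" rule above can be sketched as a sustained-threshold check. It assumes samples arrive as (timestamp in seconds, utilization percent) in time order and that the link stays above the threshold between two consecutive high polls. The 2-minute poll interval in the example is an assumption.

```python
def sustained_breach(samples, threshold=90.0, duration=600):
    """Return the timestamp at which utilization has stayed at or above
    `threshold` for `duration` seconds without a break, or None."""
    breach_start = None
    for timestamp, utilization in samples:
        if utilization >= threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= duration:
                return timestamp
        else:
            breach_start = None  # any dip below the threshold restarts the clock
    return None

# Hypothetical 2-minute polls: two short bursts, never 10 minutes in a row.
bursts = [(0, 95), (120, 97), (240, 50), (360, 92), (480, 91)]

# Hypothetical 2-minute polls: high from t=120 onward.
pegged = [(0, 50), (120, 91), (240, 95), (360, 93), (480, 96), (600, 92), (720, 94)]

print(sustained_breach(bursts))
print(sustained_breach(pegged))
```

Resetting the start time on any dip is what separates a sustained condition from bursty traffic that briefly touches the threshold.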

                                              • How to Monitor Effectively
                                                storn

                                                We do not monitor the access layer, with one exception, the uplinks to the distribution layer.

                                                We also do not monitor unicast or multicast routes. However we have been discussing our options for unicast monitoring.

                                                 

                                                We have established baselines for all nodes and have created exception rules for nodes like SQL and Linux servers.  Nodes with exceptions are grouped by type for reporting and alerting purposes.  Makes it much easier for both.

                                                 

                                                We do monitor link saturation but on specific links not on all. We also have a couple links that we monitor and alert on  when the  link utilization is NOT above 15% for 5mins during certain times of the day.  We perform 1000s of FTP transfers several times a day.  If the outgoing links are not being utilized during the transfers, something is not right.
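
The inverted alert described above (fire when the link is NOT busy during scheduled transfers) might look roughly like this. The transfer windows, the one-minute sample interval, and the function names are hypothetical; only the 15% floor and 5-minute duration come from the post.

```python
from datetime import datetime, time, timedelta

# Hypothetical daily transfer windows (local time).
TRANSFER_WINDOWS = [(time(2, 0), time(3, 0)), (time(14, 0), time(15, 0))]

def in_window(moment, windows=TRANSFER_WINDOWS):
    """True if the datetime falls inside one of the daily transfer windows."""
    return any(start <= moment.time() < end for start, end in windows)

def idle_during_transfers(samples, floor=15.0, duration=timedelta(minutes=5),
                          windows=TRANSFER_WINDOWS):
    """Return the first datetime at which utilization has stayed at or below
    `floor` for `duration` inside a transfer window, or None."""
    idle_since = None
    for moment, utilization in samples:
        if in_window(moment, windows) and utilization <= floor:
            if idle_since is None:
                idle_since = moment
            if moment - idle_since >= duration:
                return moment
        else:
            idle_since = None  # outside the window or busy enough: reset
    return None

# One-minute samples starting just before the 02:00 window.
start = datetime(2014, 1, 30, 1, 58)
quiet_night = [(start + timedelta(minutes=i), 3.0) for i in range(12)]
busy_night = [(start + timedelta(minutes=i), 60.0) for i in range(12)]

# The same low utilization at 10:00 is outside any window and is ignored.
mid_morning = datetime(2014, 1, 30, 10, 0)
quiet_morning = [(mid_morning + timedelta(minutes=i), 3.0) for i in range(12)]

print(idle_during_transfers(quiet_night))
print(idle_during_transfers(busy_night))
print(idle_during_transfers(quiet_morning))
```

Tying the low-utilization check to the schedule is what keeps an idle link at 10:00 from raising the same alert as an idle link during a transfer run.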

                                                • Re: How to Monitor Effectively
                                                  that1guy15

                                                  My current network is a work in progress.

                                                   

                                                  I'm pretty much in line with what everyone else is saying. We only monitor uplinks and core interfaces. I monitor both physical and logical interfaces such as Port-channel and VLAN interfaces. We also have centralized syslog and trap collection in multiple locations. NetFlow is in place as well in all the key areas of routing.

                                                   

                                                  Next step for me is to better monitor my routing tables and BGP peers and also fine tune our alerts.

                                                   

                                                  I'm also still in the phase of getting everyone on our team familiar with our SW NMS and NetFlow appliance. My end goal is for the whole team to use SW as their first response to any issue instead of SSHing into the affected device. Slow and steady!