2 Replies Latest reply on Apr 12, 2017 7:43 AM by mesverrum

    Best practices for monitoring and alerting

    sonic9t9

      As far as networking goes, we are mostly a Cisco shop with layer 2 switches, ASRs for the WAN, and Nexus in the data center. Is there any documentation describing the fundamental way NPM should be used? I understand NPM is polling centric. What reason would you have to turn away from all the functionality it provides via hardware sensors and all the other polling, and rely more heavily on SNMP traps? Aren't you taking away from the point of the product? Is there any documentation to back this up on proper use of the system? We do own several other modules, including NCM and SAM, just to name a few. Some people don't want to use the UnDP (Universal Device Poller), claim it's too taxing, and would prefer to turn to traps. That seems like a lot of legwork to me. What is your input? Any documentation would be very helpful.

       

      Thank you

        • Re: Best practices for monitoring and alerting
          michael malendoski

          I don't believe SolarWinds created NPM and its other products to be used in one specific way. They made the products customizable enough that the customer can choose the best way to monitor their own environment. My employer is still pushing us to find the best ways to monitor network devices and servers, which includes a combination of active and passive monitoring techniques.

           

          The Universal Device Poller (UnDP) is a great way to start building out your own customized monitoring resource, although not all switches contain the same MIB sets. Just yesterday I was trying to gather port-channel member port statuses via an SNMP OID, and unfortunately the OID didn't exist on that device, so keep that in mind. SNMPwalk.exe comes with NCM (I believe). This tool can be used to discover all the OIDs exposed by the device you are scanning. From the results you can then determine which OIDs you would like to include in a UnDP.
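To illustrate the "walk the device, then pick your OIDs" step, here is a minimal Python sketch. It assumes the walk was saved as text in the numeric format that net-snmp's `snmpwalk -On` prints (`.1.3.6... = TYPE: value`); the output of SolarWinds' SNMPwalk.exe may look different, and the sample data and helper names (`parse_walk`, `oids_under`) are made up for this example.

```python
import re

# Matches numeric-OID lines as printed by net-snmp's `snmpwalk -On`, e.g.
#   .1.3.6.1.2.1.2.2.1.8.1 = INTEGER: 1
LINE_RE = re.compile(
    r"^\.?(?P<oid>\d+(?:\.\d+)+)\s*=\s*(?P<type>[\w-]+):\s*(?P<value>.*)$"
)


def parse_walk(text):
    """Return a list of (oid, type, value) tuples from saved walk output."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the assumed format are skipped
            rows.append((match["oid"], match["type"], match["value"]))
    return rows


def oids_under(rows, prefix):
    """Keep only rows whose OID sits under the given subtree prefix."""
    prefix = prefix.strip(".")
    return [r for r in rows if r[0] == prefix or r[0].startswith(prefix + ".")]


# Hypothetical sample output, not taken from a real device.
SAMPLE = """\
.1.3.6.1.2.1.1.5.0 = STRING: core-sw-01
.1.3.6.1.2.1.2.2.1.8.1 = INTEGER: 1
.1.3.6.1.2.1.2.2.1.8.2 = INTEGER: 2
"""

IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"  # IF-MIB::ifOperStatus

rows = parse_walk(SAMPLE)
for oid, _type, value in oids_under(rows, IF_OPER_STATUS):
    print(oid, value)
```

If the subtree you care about returns nothing, as happened with my port-channel OID, you know before building the poller that the device does not expose it.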

           

          My caveat to passive monitoring (traps, syslogs, etc.) is that some traps may never be received if a section of your network, or a remote WAN site, goes down. The SolarWinds server wouldn't receive any traps from those devices, because there may be no network path between the server and the remote devices unless a failover circuit exists. The same goes for syslog messages. This is a reason to prefer the "polling centric" aspect of the application(s): you'd be notified when a node, or set of nodes, isn't responding.

           

          However, an advantage of SNMP trap and syslog monitoring is that you can see events in real time. For example, a port-channel could be in a degraded state because one member port is running at the wrong speed. Maybe the device on the other side of the port was changed, which caused the port to autonegotiate to a slower speed. It would be best to use the Syslog Viewer and its associated alerting feature to capture these types of issues. The same goes for many other events that can occur on network devices.

           

          I have no documentation to help guide you in the right direction. It's really all about understanding your network environment and knowing what events, fields, etc. need to be monitored. I hope this helps you in some way. Thanks!

           

          Michael

            • Re: Best practices for monitoring and alerting
              mesverrum

              Piggybacking off this comment, I would also say that when you are weighing traps against polling you have to consider the purpose of the information you are hoping to collect. A significant amount of the polling most people do with NPM is for the purpose of gauging utilization rates for baselines and historical documentation, as in "how much bandwidth does this interface typically use, and is it using an unusually high amount today?" You can't collect that kind of information via traps, and it would be near impossible to make a useful chart of the values if you could. If you are only doing event-based monitoring, you are permanently reactive rather than able to proactively analyze your capacity and plan for the future. If you aren't using polling data and only rely on inbound traps, then NPM is probably overkill and you should probably just use Kiwi Syslog or some similar program.
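To make the baseline point concrete, here is a rough Python sketch of how two polled octet counters turn into a utilization figure that can be compared with history. This is the generic counter-delta calculation, not a description of NPM's internals; the 600-second interval, the 1.5x threshold, and the function names are assumptions chosen for illustration.

```python
from statistics import mean


def utilization_percent(octets_before, octets_after, seconds,
                        if_speed_bps, counter_bits=64):
    """Average utilization over one polling interval, as a percentage."""
    if seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")
    delta = octets_after - octets_before
    if delta < 0:
        # The counter wrapped past its maximum between the two polls.
        delta += 2 ** counter_bits
    # Octets -> bits, then divide by what the link could have carried.
    return delta * 8 * 100 / (seconds * if_speed_bps)


def is_unusual(history, today, factor=1.5):
    """True when today's value exceeds the historical mean by `factor`."""
    return today > mean(history) * factor


# Two polls of a 1 Gbps interface taken 600 seconds apart (made-up values).
today = utilization_percent(1_000_000_000, 8_500_000_000, 600, 1_000_000_000)
print(round(today, 1))
print(is_unusual([9.0, 11.0, 10.0], today))
```

A trap only tells you that something happened at one instant. The calculation above needs two samples a known time apart, which is exactly what polling gives you.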

               

              Getting to the "polling is too taxing" side of the question: unless you are running hardware from the 90's across dial-up links, SNMP should not be anywhere near too taxing for your Cisco gear. Based on the NetFlow data I have looked at on my client sites, I have never seen SNMP traffic reach even 1% of total network traffic, even in networks with tens of thousands of monitored nodes. Consider that in a default SolarWinds configuration you get a single ping to the node every 2 minutes, CPU/memory type metrics are gathered once every 10 minutes, and hostname/machine type/IOS version type values are rediscovered once every half hour. SolarWinds has this KB indicating that they would expect to see about 2.5 MBps of traffic in total when monitoring an environment with 20,000 elements (elements are interfaces + nodes + disk volumes).

              How much bandwidth does SolarWinds NPM require for monitoring? - SolarWinds Worldwide, LLC. Help and Support
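For anyone who wants to put numbers in front of the "too taxing" crowd, here is a back-of-the-envelope Python sketch using only the figures quoted above. It reads "MBps" as megabytes per second (if the KB means megabits, divide the results by 8), and the 10 Gbps link speed is just an example value, not something from the KB.

```python
def per_element_bytes_per_sec(total_bytes_per_sec, elements):
    """Average polling traffic attributable to one monitored element."""
    return total_bytes_per_sec / elements


def link_share_percent(total_bytes_per_sec, link_bps):
    """Polling traffic as a percentage of one link's capacity."""
    return total_bytes_per_sec * 8 * 100 / link_bps


def polls_per_day(interval_minutes):
    """How many times a poll with this interval fires in 24 hours."""
    return 24 * 60 // interval_minutes


TOTAL = 2_500_000   # ~2.5 MBps, read as bytes per second (assumption)
ELEMENTS = 20_000   # interfaces + nodes + disk volumes

print(per_element_bytes_per_sec(TOTAL, ELEMENTS))   # bytes/sec per element
print(link_share_percent(TOTAL, 10_000_000_000))    # % of a 10 Gbps link

# Default intervals mentioned above: ping, CPU/memory, rediscovery.
for name, minutes in [("ping", 2), ("cpu/mem", 10), ("rediscovery", 30)]:
    print(name, polls_per_day(minutes))
```

Keep in mind that the total is spread across every path between the poller and the monitored devices, so any single WAN link only carries the share for the elements behind it.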

               

              You also generally don't need to monitor every port, only the major choke points and uplinks to other critical infrastructure. In many cases you can get all the useful info you need from just a handful of ports. SolarWinds lets you adjust the polling frequencies, so if SNMP did somehow represent a significant load on your network you could simply poll less often.

               

              I will say that in the past I have seen cases where SNMP polling did measurably tax the CPU of certain specific devices. One case I can think of was a cluster of heavily used F5 load balancers. They had so many tiny sites behind them that you could see their CPU maxing out every 10 minutes when we queried them, but in that case the balancers were already running at high utilization before being monitored. So if you are already taxing your hardware and think polling might push it over the edge, you have grounds to be cautious. Without polling-based monitoring, though, you probably wouldn't even know you had hardware running that hot.

               

              -Marc Netterfield

                  Loop1 Systems: SolarWinds Training and Professional Services