
Does NPM actually monitor stack members?

The scenario:

  1. We have quite a few Cisco 3850 stacks.
  2. We do not monitor stack ports or individual member ports, because the traffic is not useful to us, and it increases the load on the poller.
  3. We are doing manual firmware upgrades.
  4. Alerts for the stack are muted during the upgrade.
  5. One member fails to come up after the firmware upgrade.
  6. All is green in Orion.

We expected to see some sort of color change for the device on the summary or node details page, based on Hardware Health monitoring of the stack itself. No change was observed; the member simply disappeared from Orion. We are now down 48 ports and didn't get a clue from Orion.

Is this the expected behavior?

  • That's normal behavior when you choose to not monitor stack ports.

    I recommend using a VM environment for your pollers, and purchasing an unlimited license.  The license might seem expensive, I know, but it enables you to monitor every physical port, along with stack ports.  The benefits? 

    • VMs make it easy to expand quickly and cheaply to as many Additional Polling Engines (APEs) as needed to manage all your switch ports.  I used seven pollers to monitor 55,000 switch ports, and the environment remained fast and reliable--without needing the Enterprise Operations Console.
    • Knowing which ports have had problems and being able to address them before your users complain about slowness. 
    • Knowing which switch in a stack has failed without having to wait for Help Desk tickets and complaints to start piling up.

    At a minimum, to do right by your clients, monitor the stack ports.  Ensure you have alerting enabled for them, and that your filters allow those alerts through to your e-mail or to whatever alerting solution you use (e.g., sending the stack-port or switch-stack traps to Splunk, then setting up Splunk to alert the right people at the site who can inspect, troubleshoot, reboot, or replace the failed switch within a stack).
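
    In case it's useful, here's a rough sketch of how you could double-check which stack-port interfaces are actually under monitoring, using the Python orionsdk module against the SWIS API.  The server name, credentials, and the 'Stack%' name pattern are assumptions; verify what your 3850 stack ports are actually named before relying on it.

    ```python
    # Sketch: list monitored interfaces whose names look like stack ports.
    # Assumes "pip install orionsdk" and that stack ports show up with
    # names beginning with "Stack"; confirm that on your own 3850s.
    from orionsdk import SwisClient

    swis = SwisClient("your-orion-server", "your-user", "your-password")

    results = swis.query("""
        SELECT n.Caption AS Switch, i.Name AS Interface, i.Status
        FROM Orion.NPM.Interfaces i
        JOIN Orion.Nodes n ON n.NodeID = i.NodeID
        WHERE i.Name LIKE 'Stack%'
    """)

    for row in results["results"]:
        print(row["Switch"], row["Interface"], "Status:", row["Status"])
    ```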

    Hopefully you're using at least two ports as uplinks from each stack of switches, with each uplink port coming from a different stack member.  That helps ensure uptime no matter which switch fails: if the member providing one uplink in the port channel goes down, the stack cables reroute traffic to a member with an active uplink.

    Swift packets!

    Rick Schroeder

  • If you have Switch Stack monitoring enabled against the switch, you should be able to see the Switch Stack sub-view on the Node Details view.

    This view will give you a list of the switches in the stack (model, serial, master, etc.) as well as a visual of the stack data and power rings.

    If a stack cable is unplugged or a switch member goes down, the visual will change to show where the fault is, and there are also some out-of-the-box alerts for switch stacks (search for "switch stack" in Manage Alerts).

    https://oriondemo.solarwinds.com/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N%3a98&ViewID=58

    [Screenshot: stackexample.png]
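
    If anyone wants to pull the same stack-member data outside the web console, a rough sketch with the Python orionsdk module is below.  Treat the SWQL entity and column names as assumptions from memory; browse SWQL Studio on your own install to confirm what your NPM version exposes for switch stacks.

    ```python
    # Sketch: query switch-stack membership for one node via SWIS.
    # NOTE: "Orion.NPM.SwitchStack.StackMember" and its columns are
    # assumed names; confirm them in SWQL Studio for your NPM version.
    from orionsdk import SwisClient

    swis = SwisClient("your-orion-server", "your-user", "your-password")

    rows = swis.query("""
        SELECT m.NodeID, m.SwitchNumber, m.Role, m.SerialNumber, m.MachineType
        FROM Orion.NPM.SwitchStack.StackMember m
        WHERE m.NodeID = @node
    """, node=98)

    for member in rows["results"]:
        print(member["SwitchNumber"], member["Role"], member["SerialNumber"])
    ```

    (NodeID 98 matches the demo node in the link above; substitute your own stack's NodeID.)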

  • Hi Rick,

    Today I only monitor stack ports and uplink ports.  I am considering monitoring access ports as well.  How do you handle alerting from all those ports?  For example, we have a Problems page with the Down Interfaces resource.  I've added all ports on one switch stack, and as expected the down interfaces from that switch appear.  I would like to keep using the Down Interfaces resource just for important ports (router interfaces, trunks...) and not see the access ports.  I've read of a user who filters alerts by interface caption (for example, adding "access", "L2", or "L3" to the caption).  I was curious how you handle that.
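
    Something like the sketch below is what I was picturing, done with the Python orionsdk module; the "L3" and "trunk" caption tags are just examples of a convention we would have to agree on first.

    ```python
    # Sketch: the caption-tag idea.  Only report down interfaces whose
    # caption was hand-tagged with "L3" or "trunk"; the tags, server,
    # and credentials are placeholders.  Orion interface Status 2 = Down.
    from orionsdk import SwisClient

    swis = SwisClient("your-orion-server", "your-user", "your-password")

    down_important = swis.query("""
        SELECT n.Caption AS Switch, i.Caption AS Interface
        FROM Orion.NPM.Interfaces i
        JOIN Orion.Nodes n ON n.NodeID = i.NodeID
        WHERE i.Status = 2
          AND (i.Caption LIKE '%L3%' OR i.Caption LIKE '%trunk%')
    """)

    for row in down_important["results"]:
        print("Down:", row["Switch"], row["Interface"])
    ```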


    Thanks!

  • You asked how to handle down ports when monitoring all ports. 

    First off, I don't worry about ports on an Access Switch (ASW) being down unless they go to a Distribution Switch (DSW) or a router.  If they connect an ASW to a DSW, I alert on them using a filter built to alert on up/down changes for members of port-channels (it's a best practice to use at least two uplinks from an ASW to a DSW, and they'd go into a port-channel).  This is nice--you can quickly see whether your full resiliency is present, and that maps directly to the bandwidth you planned for communications between the ASW and DSW.
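
    A rough sketch of that kind of check, using the Python orionsdk module, is below.  It assumes a port-description convention where ASW-to-DSW port-channel members carry "UPLINK" in the interface description; substitute whatever convention you actually use, since NPM only knows what the descriptions tell it.

    ```python
    # Sketch: check uplink resiliency per access switch.  Assumes the
    # ASW-to-DSW port-channel members carry "UPLINK" in their port
    # description (InterfaceAlias); that convention is a placeholder.
    from collections import Counter
    from orionsdk import SwisClient

    swis = SwisClient("your-orion-server", "your-user", "your-password")

    rows = swis.query("""
        SELECT n.Caption AS Switch, i.Status
        FROM Orion.NPM.Interfaces i
        JOIN Orion.Nodes n ON n.NodeID = i.NodeID
        WHERE i.InterfaceAlias LIKE '%UPLINK%'
    """)["results"]

    up_uplinks = Counter()
    for row in rows:
        up_uplinks.setdefault(row["Switch"], 0)
        if row["Status"] == 1:      # Orion status 1 = Up
            up_uplinks[row["Switch"]] += 1

    for switch, count in sorted(up_uplinks.items()):
        if count < 2:               # the plan calls for two uplinks per ASW
            print(f"{switch}: only {count} uplink member(s) up")
    ```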

    Next, I DO care about ASW ports if they're in a data center, so I use a different set of alerting criteria there, filtering data center access ports differently than office ASW ports.  An office port can be up or down depending on whether that person and their laptop are in the office or at home.  But a data center access port is either planned to be up or planned to be down, and it gets monitored that way each time a server is added to or removed from the network.  If link is present, it needs to be monitored.  If it goes away without being manually unmonitored or put into maintenance mode in NPM, it needs to generate an alert.  I'll admit it took some thinking, planning, coordination, and training to get everyone in the data centers on board with the idea of manually adding or removing port monitoring in NPM every time a server was shut down or added.  But they quickly understood the rationale, and they immediately saw the benefits of actionable and useful alerts.
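
    A sketch of how that split could look is below, using the Python orionsdk module.  It assumes an interface custom property named Port_Role (something you would have to create and populate yourself) that marks data-center access ports; an alert with the same conditions would do the same job natively in NPM.

    ```python
    # Sketch: data-center access ports that are down but were never
    # unmanaged or put into maintenance in NPM.  Port_Role is an assumed
    # interface custom property; double-check the UnManaged column name
    # in SWQL Studio for your version.
    from orionsdk import SwisClient

    swis = SwisClient("your-orion-server", "your-user", "your-password")

    rows = swis.query("""
        SELECT n.Caption AS Switch, i.Caption AS Interface
        FROM Orion.NPM.Interfaces i
        JOIN Orion.Nodes n ON n.NodeID = i.NodeID
        WHERE i.CustomProperties.Port_Role = 'DC-Access'
          AND i.Status = 2
          AND i.UnManaged = false
    """)

    for row in rows["results"]:
        print("Unexpected down DC port:", row["Switch"], row["Interface"])
    ```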

    Back in the day I started out with an embarrassingly primitive solution for this:  I did it manually.  It worked, and I was able to keep up with adding new ports (or NOT adding them) to my special monitoring groups as time went by and growth occurred.  If I had to do it today from scratch, I'd automate it; that would be a bit more of a project to plan and implement initially, but it would be less hassle and consume less time in the long run than doing it manually, given the current state of servers and the data center.  Virtualization has made a big difference there.

    My initial concern, when I first started monitoring switches and their ports with SolarWinds (back in 2004), was finding out which routers, switches, and firewalls were up, and alerting only on those that were down.

    The next concern was tracking the uplink/downlink ports between routers, Core Switches (CSWs), DSWs, and ASWs.  Every one of those was critical to resilience and throughput, so I put those interfaces into a group I named "Critical Interfaces".  Today you can write some filtering logic, or use NPM's built-in GUI filter feature, and have NPM automatically add those kinds of ports to your new Critical Interfaces monitoring group.  I set that group's resource to display only down ports, labeled the Resource/Widget "Critical Interfaces" in bold, and placed it near the top of NPM's front page so it stands out.  I had it alert only my Network Team.
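
    For the "have NPM automatically add those kinds of ports" part, here's a sketch using the SWIS CreateContainer verb via the Python orionsdk module, following the pattern in the Orion SDK's group-creation sample.  The argument order and the dynamic-filter string are worth confirming against the SDK samples for your version, and the "UPLINK" description convention is an assumption.

    ```python
    # Sketch: create a dynamic "Critical Interfaces" group whose members
    # are picked up automatically by a filter.  The verb arguments follow
    # the Orion SDK group-creation sample; the filter definition below is
    # untested and the "UPLINK" convention is an assumption to adapt.
    from orionsdk import SwisClient

    swis = SwisClient("your-orion-server", "your-user", "your-password")

    members = [{
        "Name": "Uplink / downlink ports",
        "Definition": "filter:/Orion.NPM.Interfaces[Contains(InterfaceAlias,'UPLINK')]",
    }]

    group_id = swis.invoke(
        "Orion.Container", "CreateContainer",
        "Critical Interfaces",                 # group name
        "Core",                                # owner, per the SDK sample
        60,                                    # refresh frequency in seconds
        0,                                     # status rollup: 0 = mixed
        "Uplinks between ASWs, DSWs, and CSWs",  # description
        True,                                  # polling enabled
        members)

    print("Created group", group_id)
    ```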

    I stopped alerting for any down interfaces that weren't in the Critical Interfaces group because end user device nodes go down at random for any number of normal reasons.  And who needs the unnecessary alerts?

    Then I started adding ALL data center ports into that Critical Interfaces group, and I learned that was a mistake because it generated unnecessary alerts.  Data center access ports go down normally, just like business office ASW ports, and a single server's interface going down matters less than an entire switch's uplinks being down.  Data center access ports may temporarily be down for server maintenance, replacements, renewals, etc., so I learned to put only the port-channel members/uplinks/downlinks between ASWs, DSWs, and CSWs in the Critical Interfaces group.

    To combat alert fatigue for data center access ports, I built a similar, but separate, group for them, and sent those alerts only to the Data Center Operators group.  That team is responsible for ensuring every server has at least two active port-channeled uplinks to two different switches.  They're the ones who want to know if one port of a two-port port-channel is down.  I made that resource available to the System Admins so they could easily see any issues with throughput or resilience that could be associated with one or more links between a server and its switches being down.
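
    The check that team cares about can be sketched the same way as the uplink check earlier in this post, just keyed to servers.  It assumes a port-description convention of "SRV:<servername>" on server-facing ports, which is purely illustrative.

    ```python
    # Sketch: servers with fewer than two active uplinks.  Assumes
    # server-facing ports are described as "SRV:<servername>"; that
    # convention, the server name, and credentials are placeholders.
    from collections import Counter
    from orionsdk import SwisClient

    swis = SwisClient("your-orion-server", "your-user", "your-password")

    rows = swis.query("""
        SELECT i.InterfaceAlias, i.Status
        FROM Orion.NPM.Interfaces i
        WHERE i.InterfaceAlias LIKE 'SRV:%'
    """)["results"]

    up_links = Counter()
    for row in rows:
        server = row["InterfaceAlias"].split(":", 1)[1]
        up_links.setdefault(server, 0)
        if row["Status"] == 1:          # Orion status 1 = Up
            up_links[server] += 1

    for server, count in sorted(up_links.items()):
        if count < 2:
            print(f"{server}: only {count} active uplink(s)")
    ```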

    Later, when we started using ACI and moved nearly every server into the VM world on Cisco UCS chassis, the System Admins began doing their own monitoring of the host connections inside a server chassis, and that took a good chunk of monitoring responsibility off my shoulders.  They wanted their own monitoring solution, and that was fine with me.  I still monitor the multiple 40-Gig port-channel connections between the UCS chassis and the ACI fabric, and the SAs keep track of the hosts' connections internal to the chassis.  Those rarely go down unexpectedly; either they're correctly configured from the start or they're not, and inactive ports get diagnosed and corrected before the server they connect is put into production.

    There are still one-offs to the rule about every server having at least two active uplinks to two different switches, but thankfully those stand-alone servers are few and far between.

    Here's hoping this gives you some ideas about helpful filtering of down ports.  Tracking down ports HAS had some benefits for me in the past: we had a site that was burglarized, and SolarWinds was the only thing that could tell the police when the access ports lost link to the laptops and computers.  It was a critically useful tool for the insurance report and for the police, who were able to correlate the syslogs showing ports going down with video cameras in the neighborhood to capture the burglar's vehicles in action.

    But the nice thing was that my team didn't have to receive all the down-port alerts all the time, yet the traps and syslog entries were there for us to look up in syslog or Splunk.