    NPM 12 Switch Stack Alerts false positives


      Has anybody else enabled the built-in alerts for Switch Stacks in NPM 12?  We're getting false positive 'stack member number changed' and 'stack master changed' alerts on switch stacks at remote sites with poor WAN connectivity for monitoring.  My hunch is that the stack member number and stack master values don't stay consistent when a polling cycle is missed or incomplete.  Most of the poorly connected sites (India and China) aren't even stacks; they're just stackable switches in a stack of '1', so I know the member number and master status aren't actually changing.  I've checked the switch logs as well and don't see any interruptions or reboots, merely a WAN link lossy enough that polling failed.  One real 3-member stack at a better-connected site in Italy had a brief WAN interruption that caused the same alerts.  The stack itself shows no indication that any of the 3 members changed state, and all 3 members report the same uptime of 3 months.  All 3 members also have configured member numbers and priorities.  The change alerts always come in pairs within the same minute, so by the time I can look, the values in NPM are already back to what they were before the alerts triggered.


      I suspect the SolarWinds database might be recording a null or some other placeholder when a stack member value can't be polled, and that this is what triggers the false alert.  It happens so quickly, though, that I can't catch what value goes into the database when the alert triggers.
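To illustrate the suspicion above, here is a minimal sketch of how a placeholder written for a failed poll would produce alerts in pairs. It is not SolarWinds code: the function name, the placeholder value of 0, and the poll sequence are all assumptions made up for the illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical placeholder that a poller might store when a poll fails.
FAILED_POLL_PLACEHOLDER = 0


def naive_change_events(polls: List[Optional[int]]) -> List[Tuple[int, int, int]]:
    """Return (poll_index, old_value, new_value) for every detected change.

    A failed poll is represented as None and is stored as the placeholder,
    which is the behaviour being hypothesised here.
    """
    events = []
    previous = None
    for index, value in enumerate(polls):
        stored = FAILED_POLL_PLACEHOLDER if value is None else value
        if previous is not None and stored != previous:
            events.append((index, previous, stored))
        previous = stored
    return events


# Master number is 1 throughout; the third poll is lost on the WAN.
events = naive_change_events([1, 1, None, 1])
print(events)  # [(2, 1, 0), (3, 0, 1)]
```

One lost poll yields two change events back to back (1 to 0, then 0 to 1), which would match alerts arriving in pairs with the value already restored by the time anyone looks.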


      Here's the sequence of events as seen on one switch:

          I ended up creating a support case on this.  They confirmed it's a problem that will be addressed in a future version (possibly 12.0.1?).

            I just enabled this alert, and the only hit so far was a false positive from a WS-C2960S-24PS-L that is not stacked.


            I checked the release notes for NPM 12.1 as of March 14, 2017, and there's no mention of the bug or of slebbon's case.



                I can confirm that upgrading to the latest March 2017 releases hasn't resolved this issue yet.


                I think I noticed an issue with a similar root cause in the NetFlow CBWFQ polling today as well, but I don't have time right now to open a support case and go through all the motions to report it.  Basically, for the times when the poller failed to capture data, the CBWFQ graph shows the last successfully recorded value.  I have a bunch of NetFlow graphs with a 'hole' in the data where all the values are 0 for an interval of a few minutes.  (I didn't look for the root cause, but I assume the NetFlow collector or the NetFlow data was lost for this interval, as the actual traffic on the circuit certainly wasn't 0.)  During that same interval, the CBWFQ graphs flat-line at whatever each section's last value was until polling data resumes.  Basically the inverse of the switch stack polling issue.


                Both of these seem to come from the same type of mistake.  The switch stack poller stores a '0' for a failed poll when it should keep the last value, and the CBWFQ poller stores the last polled value again for the next interval when it should store a '0'.
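The two behaviours argued for here can be sketched as follows. This is only an illustration of the idea, not how NPM's pollers are implemented; the function names and sample values are made up, and a failed poll is represented as None.

```python
from typing import List, Optional


def fill_state(polls: List[Optional[int]]) -> List[Optional[int]]:
    """State-like values (member number, master number):
    carry the last known value across failed polls."""
    filled = []
    last = None
    for value in polls:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fill_rate(polls: List[Optional[int]], gap: int = 0) -> List[int]:
    """Rate-like values (traffic per class): record a gap value for a
    failed poll instead of repeating the previous sample."""
    return [gap if value is None else value for value in polls]


def changes(values: List[Optional[int]]) -> List[tuple]:
    """List (index, old, new) wherever two consecutive known values differ."""
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(values, values[1:]), start=1)
        if old is not None and new is not None and old != new
    ]


# Master number with two lost polls: no change is reported.
print(changes(fill_state([1, None, None, 1])))  # []

# A real change is still reported once a poll succeeds.
print(changes(fill_state([1, None, 2])))  # [(2, 1, 2)]

# Traffic samples with two lost polls: the graph shows a hole, not a flat line.
print(fill_rate([500, None, None, 480]))  # [500, 0, 0, 480]
```

The point is that the right treatment of a missed poll depends on what kind of value is being stored: an identity or state value should not change unless a poll says so, while a rate should not be assumed to have continued unchanged.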

                Back in the day in Cacti/rrdtool, this was a mistake of selecting the wrong data type (http://oss.oetiker.ch/rrdtool/tut/rrd-beginners.en.html).  In SolarWinds, it seems to be an exercise in trying to explain to support why the developers chose the wrong data type, while they make excuses that the development team probably knows better than I do what values should be recorded for my switch, which lost some polling replies due to horribly unreliable Chinese Internet access.

                This just started happening for us on a new set of 3850s. However, of all the 3850s we have deployed, it has only started happening on 1 of our stacks. Same IOS, same config... any updates from anyone on this? We're on 12.1.

                Switch Stack master number changed from 1 to 0 --- we see this over and over.