
NPM 12 Switch Stack Alerts false positives

Has anybody else enabled the built-in alerts for Switch Stacks in NPM 12?  We're getting false positive 'stack member number changed' and 'stack master changed' alerts on switch stacks located in remote sites with poor WAN connectivity for monitoring.  My hunch is that the 'stack member number changed' and 'stack master changed' values are not staying consistent when a polling cycle is missed or incomplete.  Most of the poorly connected sites (India and China) aren't even stacks, just stackable switches in a stack of '1', so I know the member number or master status isn't actually changing.  I've checked the switch logs as well and don't see any interruptions or reboots, merely the WAN link being lossy enough that polling failed.

I did have one actual 3-member stack in a better-connected Italy site have a brief WAN interruption that caused the same alerts.  The stack itself has no indication that any of the 3 members changed state, and all 3 members show the same uptime of 3 months.  All 3 members also have configured member numbers and priorities.  The change alerts always come in pairs within the same minute, so by the time I can look, the values in NPM are already back to what they were before the alerts triggered.

I suspect the SolarWinds database might be recording a null or something similar when a stack member value can't be polled, which then triggers the false alert.  It happens so quickly, though, that I can't catch what value is going into the database when the alert triggers.

Here's the sequence of events as seen on one switch:

[Screenshot: pastedImage_2.png]

  • I ended up creating a support case on this.  They confirmed it's a problem that will be addressed in a future version (possibly 12.0.1 ?)

  • Could you provide the case number?

  • I have a similar issue with polling false positives on random switches.  It continued after I upgraded to 12.0.1.

  • We don't see this often.  We'd love to take a look at your environment.  Would you be able to create a ticket so we can investigate?

  • I think I might be seeing the same behavior as the original poster. A switch that is stack-capable but not in stack mode is sending false alerts that its stack status has changed. Running NPM 12.0.

  • This is a bug we're tracking in our case: NPM-3131.

  • Sorry, didn't see this reply earlier.  It's Case # 1018114.  I still haven't gotten a fix.  I believe it's a design issue, as the poller writes a '0' into the database when the SNMP query times out or is lost mid-transfer.  For some values this makes sense, like a traffic counter or temperature (so a graph over time shows missed values as 'empty'), but for others (like 'uptime' or this stack member count) it doesn't.  This occurs frequently in our environment only for our China offices, which have very poor connectivity, with periods every day of high latency and packet loss on the VPN links traversing the regular Internet in and out of China.  The poller is in the UK, and when connectivity is right on the border of unusable, we get very frequent occurrences of these stack change alerts.

    This was not fixed in 12.0.1.

    I have yet to see a fix, besides silencing the alerting, or increasing the alert trigger duration, so the value changes back in the next polling cycle before triggering the alarm.  The ticket was closed with status of "This appears to be a known bug with the product. Our Engineering team is working on the fix for this issue though no ETA has been specified yet."
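To illustrate why the alerts arrive in pairs and why a longer trigger duration hides them, here is a minimal Python sketch. It assumes, as guessed above, that a timed-out poll is stored as 0. The function names and the `hold_polls` parameter are hypothetical and are not part of NPM; this models the behavior, not the product's actual alert engine.

```python
from typing import List, Optional


def stored_values(polls: List[Optional[int]]) -> List[int]:
    """Assumed poller behavior: a timed-out poll (None) is stored as 0."""
    return [0 if p is None else p for p in polls]


def naive_change_alerts(values: List[int]) -> List[int]:
    """Indexes of polls whose stored value differs from the previous poll."""
    return [i for i in range(1, len(values)) if values[i] != values[i - 1]]


def debounced_change_alerts(values: List[int], hold_polls: int) -> List[int]:
    """Alert only once a new value has persisted for hold_polls consecutive polls."""
    alerts = []
    baseline = values[0]          # last value accepted as the real state
    run_value, run_length = baseline, 0
    for i in range(1, len(values)):
        v = values[i]
        if v == baseline:
            run_length = 0        # value reverted, so discard the pending change
            continue
        if run_length > 0 and v == run_value:
            run_length += 1
        else:
            run_value, run_length = v, 1
        if run_length >= hold_polls:
            alerts.append(i)
            baseline = v
            run_length = 0
    return alerts


# One lost poll on a single-member "stack" whose member number is always 1.
polls = [1, 1, None, 1, 1]
values = stored_values(polls)
print(naive_change_alerts(values))          # change to 0, then change back
print(debounced_change_alerts(values, 2))   # nothing persists long enough
```

With a single lost poll the naive check fires twice (1 to 0, then 0 to 1), which matches the paired alerts in the same minute. Requiring the change to survive two polls suppresses both, at the cost of delaying genuine change alerts by one polling cycle.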

  • I just enabled this alert, and the only hit so far was a false positive from a WS-C2960S-24PS-L that is not stacked.

    I checked the release notes for NPM 12.1 as of March 14, 2017, and there's no mention of the bug or of slebbon's case.

    =Foon=

  • I can confirm that upgrading to the latest March 2017 releases hasn't resolved this issue yet.

    I think I noticed an issue with a similar root cause in the NetFlow CBWFQ polling today as well, but I don't have time to open a support case and go through all the motions to report it right now.  Basically, the CBWFQ graph repeats the last successfully recorded value for times when the poller failed to capture data.  I have a bunch of NetFlow graphs with a 'hole' in the data where all the values are 0 for an interval of a few minutes.  (I didn't look for the root cause, but I assume the NetFlow collector or NetFlow data was lost for this interval, as the actual traffic on the circuit certainly wasn't 0.)  During the same interval, the CBWFQ graphs show a completely flat line at whatever each section's last value was, until polling data resumes.  Basically the inverse of the switch stack polling issue.

    The root causes of both seem to be the same type of mistake.  The switch stack poller stores a '0' for a failed poll, when it should keep the 'last' value, and the CBWFQ poller stores the last poll value again in the next interval, when it should actually store a '0'.

    Back in the day in Cacti/rrdtool this was a mistake of selecting the wrong data type (http://oss.oetiker.ch/rrdtool/tut/rrd-beginners.en.html).  In SolarWinds, it seems to be an exercise in trying to explain to support why the developers chose the wrong data type, while they make excuses that the development team probably knows better than I do what values should be recorded for my switch, which lost some polling data replies due to horribly unreliable Chinese Internet access.
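The distinction above can be sketched as a per-metric gap policy: state-like values keep the last known value on a failed poll, while rate-like values record a gap. This is only an illustration. The `GapPolicy` enum and `apply_gap_policy` function are hypothetical names, not anything in NPM, and the sketch marks a gap with `None` rather than the literal '0' suggested above, so that a missed poll cannot be mistaken for a real zero reading.

```python
from enum import Enum
from typing import List, Optional


class GapPolicy(Enum):
    KEEP_LAST = "keep_last"   # state-like values: stack member number, master role
    MARK_GAP = "mark_gap"     # rate-like values: traffic or CBWFQ class rates


def apply_gap_policy(polls: List[Optional[int]], policy: GapPolicy) -> List[Optional[int]]:
    """Return what would be stored for each poll; None in the input is a failed poll."""
    stored: List[Optional[int]] = []
    last: Optional[int] = None
    for p in polls:
        if p is not None:
            last = p
            stored.append(p)
        elif policy is GapPolicy.KEEP_LAST:
            stored.append(last)   # repeat the last known state, so no change is seen
        else:
            stored.append(None)   # leave a visible hole instead of a flat line
    return stored


# Stack member number with one lost poll: the stored state never changes.
print(apply_gap_policy([1, None, 1], GapPolicy.KEEP_LAST))
# Class rate with two lost polls: the gap stays visible in the data.
print(apply_gap_policy([500, None, None, 480], GapPolicy.MARK_GAP))
```

Under this assumption a lost poll on a stack metric never looks like a state change, and a lost poll on a rate metric never looks like a flat line of real traffic.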

  • Any updates on this? I sent roughly 1000 emails to myself. The alert manager said that 0 devices would trigger.