I have alerts set up to notify me if a routing neighbor goes down on any router or switch in our network. Some devices have BGP neighbors that are shut down for various reasons, and occasionally SolarWinds alerts on them because they appear as Idle. Normally this would be fine, but they have been Idle for years, so I'm not sure why the alerts are re-triggering. The alert settings seem fine, but for some reason SolarWinds is seeing status changes on the routing neighbors when in fact there aren't any. Example:
None of the times match up between the router and SolarWinds. Is this normal behavior, and is there any way to fix it? This is happening on multiple routers, so the alerts always require additional research to make sure it's not just a false alert.
What sounds like is happening here: there is a setting in the Settings table called NPM_Settings_RoutingNeighbor_Retain_Days, which defaults to 30 days. After 30 days, when the maintenance job runs, it deletes the historical routing data. If an alert was triggered by that historical data, it gets reset. The alert query runs again, and since the conditions to trigger the alert are still met, it fires again and the customer gets the emails. I would suggest increasing the retention for the routing neighbors; the maximum is 365 days.
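To illustrate the mechanism described above, here is a minimal sketch of how purging old history on the retention cycle can let the same alert fire again even though the neighbor never changed state. The table and function names are illustrative only, not the real Orion schema:

```python
# Hypothetical model of the retention-driven re-trigger (names are
# illustrative, NOT the actual Orion database schema or alert engine).
from datetime import datetime, timedelta

RETAIN_DAYS = 30  # default of NPM_Settings_RoutingNeighbor_Retain_Days

def run_maintenance(history, now):
    """Drop history rows older than the retention window."""
    cutoff = now - timedelta(days=RETAIN_DAYS)
    return [row for row in history if row["seen"] >= cutoff]

def alert_should_fire(history, neighbor_state):
    """Fire when the neighbor is down and no prior trigger row exists."""
    already_triggered = any(row["triggered"] for row in history)
    return neighbor_state == "idle" and not already_triggered

now = datetime(2019, 1, 1)
# One trigger row from 45 days ago for a neighbor that is still idle.
history = [{"seen": now - timedelta(days=45), "triggered": True}]

# Before maintenance: the old trigger row suppresses a duplicate alert.
fire_before = alert_should_fire(history, "idle")   # False
# Maintenance purges the 45-day-old row, so the alert fires again
# even though the neighbor's state never actually changed.
history = run_maintenance(history, now)
fire_after = alert_should_fire(history, "idle")    # True
print(fire_before, fire_after)
```

Raising the retention to 365 days only lengthens the cycle in this model; it doesn't eliminate the re-trigger for a neighbor that stays idle longer than the retention window.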
We have issues with this as well. One problem is that the Idle state that fires off the alert, when examined from the device's CLI, shows a nugget of additional information: Idle (admin).
This Idle (admin) is similar to an interface port that has been admin-downed, but when viewed via SNMP (poll or trap) this distinction is lost.
Though, to be fair, it is the vendor's implementation of the BGP MIB that is at fault here.
I am currently trying to tie incoming SNMP traps to syslog messages, but with little success.
Once that neighbor interface is administratively shut down, though, the state shouldn't change at all and cause a re-trigger, correct? It should remain Idle, which is the problem I'm having. Neighbors that have been Idle are showing some form of state change once a month, which re-trips the alert even though a previous down alert already existed.
This was happening to us as well on that 30 day cycle.
I don't recall which retention setting this is, but there is one that clears the state and clears old routes out of the database. Then the alerts are free to fire again, especially if those routes are still in the routing tables on the router.
The other thing we did was fix a logic error in the out-of-the-box alert to prevent some deleted routes from firing alerts. This significantly reduced the number of bogus alerts we were getting.
What's odd is that after upgrading from v12.1 to v12.3, my Orion Protocol Status field only allows numeric values. That broke the alert rule I had configured when I upgraded a couple of months ago.
Well, we also had our router admins clear out obsolete routes so those wouldn't fire anymore, which helped as well. But the 30-day cycle is still there IF the route is still in the routing tables, or more precisely, if the SNMP OID that I think SolarWinds polls for the routing table, bgpPeerState (1.3.6.1.2.1.15.3.1.2), still has the route in it. That is the OID I was using to track down this issue and verify that when the alerts did fire, the route was still in that table on the router. Talking with support confirmed that routes are cleared from the database on a cycle, which we determined was that 30-day cycle.
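For anyone else walking that table: bgpPeerState is indexed by the peer's IP address, and the integer values come from the BGP4-MIB (RFC 1657): 1=idle through 6=established. A small sketch of decoding a walk result (the peer addresses shown are documentation examples, not real peers):

```python
# Decode bgpPeerState walk results from BGP4-MIB (RFC 1657).
# The OID index is the peer's dotted IP; values 1-6 are the FSM states.
BGP_PEER_STATE = {
    1: "idle", 2: "connect", 3: "active",
    4: "opensent", 5: "openconfirm", 6: "established",
}
BASE_OID = "1.3.6.1.2.1.15.3.1.2"  # bgpPeerState

def decode(oid, value):
    """Split a bgpPeerState OID into (peer IP, state name)."""
    assert oid.startswith(BASE_OID + "."), "not a bgpPeerState instance"
    peer_ip = oid[len(BASE_OID) + 1:]   # the index is the dotted peer address
    return peer_ip, BGP_PEER_STATE[value]

# 192.0.2.x addresses below are placeholders (TEST-NET-1), not real peers.
print(decode("1.3.6.1.2.1.15.3.1.2.192.0.2.1", 1))   # ('192.0.2.1', 'idle')
print(decode("1.3.6.1.2.1.15.3.1.2.192.0.2.2", 6))   # ('192.0.2.2', 'established')
```

Note that nothing in the MIB value distinguishes a plain Idle from the CLI's Idle (admin), which is exactly the gap mentioned earlier in the thread.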
We do have some intentionally idle routes that I have been told are there for failover/fallback routes. Those still fire every 30 days, but at least we are aware of what is happening now, and for the most part we only see the routing alert fire now along with a circuit or interface down.
From what I can tell the only options for routing status are up and down so I'd think you would see the issue as well if you were affected. I don't have any criteria set to specify BGP or EIGRP but the only false alerts we get are for BGP so restricting the alert to those protocols shouldn't change the behavior we're seeing.
My support case is being escalated to the developers. If you don't have an open case, I encourage you to open one, of course. The more they are aware of it, the more attention it gets, and hopefully some resolution or valid workaround will come.
I'm having the exact same issues, and it is causing my network engineers to doubt NPM. I'm on NPM v12.3 and have an active ticket with support, but so far I haven't been able to escape level 1 to get better help. I too see the state changes on interfaces that have been shut off, per my engineers. There is clearly something going on here. I don't think I saw this problem before upgrading from v12.1 to v12.3 a few months back.
Worth noting: when I look at the 'Last Change' timestamps in NPM, they are changing every 30 days or so, almost on the money. Which is where I'm at odds with my network folks, as they claim the neighbor is shut down and shouldn't be flapping at all; it should consistently be Idle.
Having run a NOC, one of the questions I'd ask is 'why is this still configured if it is not being used?'
But no, there's no easy way to suppress these.
The alternative is to have nodes for each of the BGP peers, external nodes for the ones you don't want to actively monitor.
You would then use alerts on those nodes if the BGP peering with them is down (this may require a custom SQL alert).
I totally agree with the "why is this still configured" part, but I guess my main question is why SolarWinds seemingly fires off the alerts at random when the status hasn't changed on the down neighbor.
I've not looked deeply into this because I alert for BGP issues off the BGP Backward Transition traps. SolarWinds/SNMP polling doesn't scale the right way to manage a large (hundreds of peers and millions of routes) BGP network.
Note: with BGP the state progression is Idle -> Connect -> Established (with Active, OpenSent, and OpenConfirm in between), and I'm not sure which of these are considered 'down'. Certainly Established is 'up',
but Connect is an intermediate state that indicates the node is trying to come up;
Idle is what a peer reverts to if the connect fails, OR it may mean we're in passive-open mode, i.e., waiting for the remote side to bring up BGP.
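One reasonable way to classify those states for alerting purposes, sketched below (this is one possible policy, not how NPM actually classifies them): treat only Established as up, the intermediate states as transitional, and Idle as down, remembering that SNMP cannot tell an intentional shutdown (the CLI's Idle (admin)) apart from a failed peer:

```python
# Hypothetical alerting policy for BGP FSM states; NPM's actual
# up/down mapping may differ.
TRANSITIONAL = {"connect", "active", "opensent", "openconfirm"}

def classify(state):
    """Map a BGP FSM state name to an alerting category."""
    if state == "established":
        return "up"
    if state in TRANSITIONAL:
        return "transitioning"
    return "down"   # idle: connect failed, passive-open, or admin shutdown

print(classify("idle"))          # down
print(classify("connect"))       # transitioning
print(classify("established"))   # up
```

A poller using a policy like this would still false-alarm on intentionally shut-down neighbors, which is the core complaint in this thread.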
BGP peerings ought to be always up. BGP preferences should be used to preferentially pick the best route; there are 13 knobs that can be twisted to make this work.