
False Routing Neighbor Down Alerts

I have alerts set up to notify me if a routing neighbor goes down on any router or switch in our network. Some devices have BGP neighbors that are shut down for various reasons, and occasionally SolarWinds sends alerts on them because they appear as Idle. Normally this would be fine, but they have been Idle for years, so I'm not sure why the alerts are re-triggering. The alert settings seem fine, but for some reason SolarWinds is seeing status changes on the routing neighbors when in fact there aren't any. Example:

SolarWinds:

SW Neighbors.png

Router:

Router Neighbors.png

Alert trigger:

Trigger.png

None of the times match up between the router and SolarWinds. Is this normal behavior, and is there any way to fix it? This is happening on multiple routers, so the alerts always require additional research to make sure each one isn't just a false alert.

0 Kudos
25 Replies
Level 9

It sounds like this is caused by a setting in the Settings table called NPM_Settings_RoutingNeighbor_Retain_Days, which defaults to 30 days. After 30 days, when database maintenance runs, it deletes the historical routing data. If an alert was triggered by that historical data, the alert gets reset. The alert query then runs again, and since the trigger conditions are still met, the alert fires again and the customer gets the emails. I would suggest they either increase the retention for the routing neighbors (the maximum is 365 days),
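To make the mechanism described above concrete, here is a minimal toy model in Python. It is not SolarWinds code: the class `ToyAlertEngine`, its fields, and the neighbor address are all hypothetical, and the only thing taken from the post is the idea that a maintenance job purges rows older than the retention period and thereby resets the alert.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAlertEngine:
    """Hypothetical model: alert state is tied to a retained history row."""
    retain_days: int = 30
    first_seen: dict = field(default_factory=dict)  # neighbor -> day the "down" row was stored
    active: set = field(default_factory=set)        # neighbors with an already-fired alert
    fired_on: list = field(default_factory=list)    # days on which a notification went out

    def maintenance(self, day):
        # Purge rows older than the retention period; the alert tied to them resets.
        for neighbor, seen in list(self.first_seen.items()):
            if day - seen >= self.retain_days:
                del self.first_seen[neighbor]
                self.active.discard(neighbor)

    def poll(self, day, neighbor, state):
        if state == "Established":
            self.first_seen.pop(neighbor, None)
            self.active.discard(neighbor)
            return
        # Neighbor is not up: store a row if none exists, fire if no alert is active.
        self.first_seen.setdefault(neighbor, day)
        if neighbor not in self.active:
            self.active.add(neighbor)
            self.fired_on.append(day)


def simulate(retain_days, days=91):
    """Poll one permanently Idle neighbor daily and return the days an alert fired."""
    engine = ToyAlertEngine(retain_days=retain_days)
    for day in range(days):
        engine.maintenance(day)
        engine.poll(day, "10.0.0.1", "Idle")
    return engine.fired_on


print(simulate(30))   # [0, 30, 60, 90]
print(simulate(365))  # [0]
```

In this model the neighbor never changes state, yet a 30 day retention produces a fresh alert every 30 days, while a 365 day retention only postpones the repeat rather than removing it.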

0 Kudos
Level 10

Are your logs getting cleared and then NPM seeing the down status as a "new" event?

0 Kudos
Level 14

We also have issues with this. One problem is that the Idle state that fires off the alert, when examined from the device's CLI, shows a nugget of additional information: Idle(admin).

This Idle(admin) is similar to an administratively down interface port, but when viewed through SNMP (poll or trap) this detail is lost.

Though it is the vendor's implementation of the BGP MIB that is at fault here.
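For reference, the standard BGP4-MIB does define a separate column for this, bgpPeerAdminStatus (1.3.6.1.2.1.15.3.1.3, with values stop(1) and start(2)), alongside bgpPeerState. The sketch below shows how the two values could be combined to tell an administratively stopped peer from one that is Idle for other reasons. It assumes the device actually populates bgpPeerAdminStatus, which, as noted above, may not hold for every vendor. The function names are hypothetical.

```python
# Enumerations as defined in the standard BGP4-MIB.
PEER_STATES = {
    1: "idle",
    2: "connect",
    3: "active",
    4: "opensent",
    5: "openconfirm",
    6: "established",
}
ADMIN_STATUS = {1: "stop", 2: "start"}


def describe_peer(peer_state, admin_status):
    """Return a CLI-like label, e.g. 'idle(admin)' for a stopped peer."""
    state = PEER_STATES.get(peer_state, f"unknown({peer_state})")
    if state == "idle" and ADMIN_STATUS.get(admin_status) == "stop":
        return "idle(admin)"
    return state


def should_alert(peer_state, admin_status):
    """Hypothetical policy: alert only on peers that are meant to be up but are not."""
    if ADMIN_STATUS.get(admin_status) == "stop":
        return False
    return PEER_STATES.get(peer_state) != "established"


print(describe_peer(1, 1), should_alert(1, 1))  # idle(admin) False
print(describe_peer(1, 2), should_alert(1, 2))  # idle True
```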

I am currently trying to tie incoming SNMP traps to syslog messages, but with little success.

0 Kudos

Once that neighbor is administratively shut down, the state shouldn't change at all, so nothing should cause a re-trigger, correct? It should remain 'idle', but that is not what I'm seeing. Neighbors that have been idle show some form of state change about once a month, which re-trips the alert even though a previous down alert already exists.

0 Kudos

This was happening to us as well on that 30 day cycle.

I don't recall which retention setting this is, but there is one that clears the state, and clears old routes out of the database.  Then they are free to fire the alert again, especially if those routes are still in the routing tables on the router.

The other thing we did was fix a logic error in the out of the box alert to prevent some deleted routes from firing alerts.  This significantly lessened the number of bogus alerts we were getting.

Original:

pastedImage_1.png

Corrected:

pastedImage_0.png

John Handberg
0 Kudos

What's odd is that after upgrading from v12.1 to v12.3, my Orion Protocol Status only allows me to use a numeric value. This broke the alert rule I had configured when I upgraded a couple of months ago.

pastedImage_0.png

pastedImage_4.png

0 Kudos

I tried this method, but it generated false alerts even for neighbors that were up as well (it became worse), so I reverted to my original settings.

0 Kudos

Interesting. Looks like mine are every 30 days as well. Were you able to resolve that issue or is it something that you still deal with?

0 Kudos

Well, we did also have our router admins clear out obsolete routes so those wouldn't fire anymore. That helped as well. But the 30 day cycle is still there if the route is still in the routing tables, or more precisely, if the SNMP OID that I think SolarWinds polls for the routing table still has the route in it: bgpPeerState, 1.3.6.1.2.1.15.3.1.2. This is the OID I used to track down this issue for us and to verify that when the alerts did fire, the route was still in this table on the router. Talking with support confirmed that routes are cleared from the database on a cycle, which we determined was that 30 day cycle.
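As a rough illustration of that check, the Python sketch below parses text in the style of a numeric SNMP walk of bgpPeerState and lists which peers are still present in the table and in what state. The sample lines are made up for illustration, not captured from a real router, and the exact output format of your SNMP tool may differ; the helper name `parse_bgp_peer_states` is hypothetical.

```python
import re

# bgpPeerState values from the standard BGP4-MIB.
PEER_STATES = {
    1: "idle", 2: "connect", 3: "active",
    4: "opensent", 5: "openconfirm", 6: "established",
}

# Matches e.g. ".1.3.6.1.2.1.15.3.1.2.10.0.0.1 = INTEGER: 1"
# and also the enum form "... = INTEGER: established(6)".
LINE_RE = re.compile(
    r"^\.?1\.3\.6\.1\.2\.1\.15\.3\.1\.2\.(\d+\.\d+\.\d+\.\d+)"
    r"\s*=\s*INTEGER:\s*(?:[A-Za-z]+\()?(\d+)\)?\s*$"
)


def parse_bgp_peer_states(walk_output):
    """Return {peer_ip: state_name} for every bgpPeerState row in the text."""
    peers = {}
    for line in walk_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            peer, value = match.group(1), int(match.group(2))
            peers[peer] = PEER_STATES.get(value, f"unknown({value})")
    return peers


# Illustrative sample only; addresses and values are invented.
sample = """
.1.3.6.1.2.1.15.3.1.2.10.0.0.1 = INTEGER: 1
.1.3.6.1.2.1.15.3.1.2.192.0.2.7 = INTEGER: established(6)
"""

for peer, state in parse_bgp_peer_states(sample).items():
    print(peer, state)
```

Any peer that shows up here as idle is still in the router's table, so by the reasoning above it would be eligible to fire again after the next database cleanup.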

We do have some intentionally idle routes that I have been told are there for failover/fallback routes.  Those still fire every 30 days, but at least we are aware of what is happening now, and for the most part we only see the routing alert fire now along with a circuit or interface down.

John Handberg
0 Kudos

I alert on link status and Neighbor status changes in EIGRP and BGP.  Seems to work well.

0 Kudos

From what I can tell the only options for routing status are up and down so I'd think you would see the issue as well if you were affected. I don't have any criteria set to specify BGP or EIGRP but the only false alerts we get are for BGP so restricting the alert to those protocols shouldn't change the behavior we're seeing.

pastedImage_0.png

0 Kudos

My support case is being escalated to the developers. If you don't have an open case, I encourage you to open one. The more cases they are aware of, the more attention this gets, and hopefully some resolution or valid workaround will come.

0 Kudos
Level 11

I'm having the exact same issues, and it is causing my network engineers to doubt NPM. I'm on NPM v12.3 and have an active ticket with support, but so far I've not been able to escape level 1 to get better help. I too see the state changes on interfaces that, per my engineers, have been shut off. There is clearly something going on here. I don't think I saw this problem prior to upgrading from v12.1 to v12.3 a few months back.

Worth noting: when I look at the 'Last Change' timestamps in NPM, they are changing roughly every 30 days, almost on the money. That is where I'm at odds with my network folks, as they claim the neighbor is shut down and shouldn't be flapping at all. It should consistently be in idle.
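One quick way to back up the "every 30 days" observation is to compute the gaps between successive 'Last Change' timestamps. The Python sketch below does that for a list of timestamps you would copy out by hand. The sample values are invented for illustration, and the function names and tolerance are assumptions, not anything NPM provides.

```python
from datetime import datetime


def intervals_in_days(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return the gaps, in days, between consecutive timestamps (sorted first)."""
    parsed = sorted(datetime.strptime(t, fmt) for t in timestamps)
    return [(b - a).total_seconds() / 86400 for a, b in zip(parsed, parsed[1:])]


def looks_periodic(timestamps, period_days=30, tolerance_days=1.5):
    """True if every gap is within tolerance_days of period_days."""
    gaps = intervals_in_days(timestamps)
    return bool(gaps) and all(abs(g - period_days) <= tolerance_days for g in gaps)


# Invented sample values, in the style of 'Last Change' entries.
last_changes = [
    "2018-06-01 02:15",
    "2018-07-01 02:17",
    "2018-07-31 02:16",
    "2018-08-30 02:15",
]

print([round(g, 2) for g in intervals_in_days(last_changes)])
print(looks_periodic(last_changes))
```

Gaps that cluster tightly around the retention period point toward the database cleanup cycle rather than real flapping on the router, which would normally be irregular.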

0 Kudos

Heh yeah it doesn't exactly instill confidence when it does this. For what it's worth this has been happening to me since well before 12.x

0 Kudos
Level 16

Having run a NOC, one of the questions is 'why is this still configured if it is not being used'

but, no, there's no easy way to do this.

The alternative is to have nodes for each of the BGP peers, external nodes for the ones you don't want to actively monitor.

you would then use alerts on the nodes if the BGP peering with them is down (may require a custom SQL alert)

0 Kudos

I totally agree with the "why is this still configured" part, but I guess my main question was: why does SolarWinds seemingly at random fire off the alerts when the status hasn't changed on the down neighbor?

0 Kudos

I've not looked deeply into this because I alerted for BGP issues off the BGP Backward Transition traps. SolarWinds/SNMP polling does not scale the right way to manage a large BGP network (hundreds of peers and millions of routes(*)).

note: with BGP the state is Idle -> Connect -> Established and I'm not sure which of these are considered 'down' -- certainly 'Established' is 'up'

but 'Connect' is an intermediate state that indicates the node is trying to come up;

idle is what a node reverts to if the connect fails, OR it may mean that we're in passive-open mode, i.e. waiting for the remote side to bring up BGP.

BGP peerings ought to be always up. BGP preferences should be used to preferentially pick the best route -- there are 13 knobs that can be twisted to make this work.

Level 12

If the BGP link has been shut down, I would remove that interface from monitoring.  That should prevent NPM from alerting and free up a bit of resources as well.

Is it possible to have a view/webpage that shows this versus alerting on it?

0 Kudos

I created a Custom Table Resource to show all the BGP routes with issues.  Maybe this will help you.

pastedImage_1.png

pastedImage_0.png

John Handberg
0 Kudos