Sounds like your poller is overloaded, off the top of my head.
Go to the Admin page -> Details section -> Polling Engines. What is the polling rate and polling completion %?
Also, keep in mind that on large switch devices SNMP can take a while to complete its interface walk - only so many results can be returned per packet. My guess is still the overload, though.
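If you want to see that per-packet behavior yourself, you can reproduce the walk with net-snmp. Just a sketch - the hostname and community string are placeholders - but the -Cr option caps how many rows come back per GetBulk response:

    # walk the interface table the way a poller would; ~25 rows per response packet
    snmpbulkwalk -v2c -c public -Cr25 nexus5k.example.com IF-MIB::ifDescr

Timing that against a switch with hundreds of interfaces makes the per-packet limit obvious.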
It looks like the polling completion is at 99.99%. And like I tried explaining before: when I first add all of the interfaces they show up fine (it even polled every interface once for statistical data), but afterwards they slowly all drop to unknown. I am only experiencing this issue with this single node. We are opening a support ticket, but I was hoping someone here has encountered this before.
I managed to resolve this, but it didn't really make any sense. During an outage on our main connection I had to redirect all my NPM traffic over a backup circuit. As soon as I did, all the polling on the node went back to normal, and when I switched the NPM traffic back to the main connection where I was having issues before, it continued to work fine. No firewall or ACL changes, nothing of that sort; the only things that changed were some static routes on my SolarWinds server itself. I have other nodes in the same subnet as this one that didn't have any problems, though, so I can't see that being the cause.
Looks like I have the same issue - again with Nexus 5k switches.
At random intervals interfaces on these nodes go unknown (and then back to active after a few minutes).
It impacts all of our 5k devices after last night's upgrade to 10.3 - no other device type appears to have this issue, and we have 6000+ others in our deployment.
Polling completion is 99.85%+ on all pollers - I'm fairly sure it's not related to an Orion capacity issue.
I'm more inclined to think it's a bug in the Nexus NX-OS than in Orion.
The question, however, is: did the polling logic change between 10.2.2 and 10.3 in a way that might have triggered this? (Devices were fine under 10.2.2.)
I can add that this also started happening for me on a Nexus 5K after going from 10.2.2 to 10.3.
Looks like it's a common issue.
We think it's more likely to be a Cisco bug than something wrong with the SNMP GetBulk read that Orion is doing (given it seems to be specific to one device type).
We will log a call with Cisco - it would be good if others did the same.
It would still be nice to hear from the Product Manager what changed in the polling between 10.2.2 and 10.3 - it might give us a lead on sorting out this issue (or even a workaround until it can be fixed).
These 5k devices are key customer data-center equipment - I wouldn't think anyone wants this issue to hang around for long.
I'm running NPM 10.2.1. All Cisco Nexus 5020 interfaces are in unknown state, and I haven't seen them switch to active at all.
What NXOS are you all running?
We are running n5000-uk9 5.1(3)N2(1a) and are not experiencing this problem.
We are running 5.1(3)N2(1).
I might take a quick look at what was introduced in the way of fixes for 1a.
Thanks for that!
There don't seem to be many bugs listed against 5.1(3)N2(1) that could cause this impact.
One of interest, however, is CSCso74872.
It's described as: "When two SNMP walks are started simultaneously, one of them may fail with the following error - 'OID not increasing'."
A little strange, though, that I can't get further detail in the Cisco bug tool (or even confirmation that it was resolved in 1a).
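As a side note, if you want to check whether a device is returning non-increasing OIDs, net-snmp's walk can be told to tolerate them instead of aborting (hostname and community are placeholders here):

    # -Cc disables the "OID not increasing" check, so the walk continues past the bad OID
    snmpwalk -v2c -c public -Cc nexus5k.example.com IF-MIB::ifTable

If the walk only completes with -Cc, the device is misbehaving in exactly the way that bug describes.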
We are running 5.0(3)N2(1).
I don't think it's isolated to just Nexus devices. I'm having the same issue with interfaces randomly going into unknown states, but on a Cat6509 running VSS. No problems prior to upgrading to 10.3.
I have the same problem after upgrading my 4 Nexus 5000 switches to v5.2(1)N1(1).
They completely swamp my Event Log now.
Maybe a support case should be opened with SolarWinds?
Have you opened a support ticket for this?
First, note that we now poll topology in v10.x. This can consume a lot of resources on a device if it pulls back a lot of info.
By default it polls every 30 minutes, and over time that may strain the device.
v10.3.1 added some improvements so you can change these settings manually; I'd suggest using the new settings we added in v10.3.1 if you haven't already.
Go to the NPM Settings page and choose Polling Settings.
Under Topology Polling, set the interval to 600 and use Apply to All.
A few of the settings there can be mass-applied to all nodes.
You could also simply edit that one Nexus switch: go to its Node Details page, choose Edit Node,
and make the single change there so topology polls every 600 minutes instead of every 30.
You can also turn off topology polling for a node entirely: use List Resources and untick Topology for that Nexus 5000.
Topology polling might be hitting the CPU on that switch if it is returning thousands of records from the ARP table and Bridge MIBs; you can gauge that load by hand, as sketched below.
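To get a feel for how heavy that Bridge MIB walk can be, you can run it manually with net-snmp (hostname and community are placeholders; the OID is dot1dTpFdbTable, the forwarding table the topology poller reads):

    # walk the bridge forwarding table - on a busy data-center switch this can be thousands of rows
    snmpbulkwalk -v2c -c public nexus5k.example.com 1.3.6.1.2.1.17.4.3

If that walk alone takes a long time or drives up SNMP CPU on the switch, the 600-minute interval (or disabling topology polling) is the right call.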
I found the bug below.
I can't vouch for it since it's third-party information, but I thought it might be relevant.
It is a bug in the code that I’m running on the N5000s (I’m running 5.0(3)N2(1)).
When I entered the command Cisco suggested to see if it was indeed the memory leak bug, SNMP started working again.
From Cisco TAC:
There is a known issue in the code version you are running, which is fixed in 5.0(3)N2(2b).
Could you try the work-around below to see if this resolves the issue? This is non-disruptive.
Copy/paste the command below, as this is a hidden command.
no snmp-server load-mib dot1dbridgesnmp
We can also confirm this by looking at:
show proc cpu sort | in snmp
show system internal ipqos mem-stats detail
snmp memory leak associated with libcmd()
Symptom: The libcmd process has a memory leak and is increasing in memory. Timeouts can be seen when the Nexus 5000 is SNMP-polled from a network management server.
Conditions: Nexus 5000 switch enabled for SNMP, with the BRIDGE-MIB being polled.
Workaround: Unload the BRIDGE-MIB from the switch using snmp-server commands. If the switch is reloaded you will have to redo step 1, as this command is not persistent. For detailed information on the workaround please contact Cisco TAC.
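Putting the TAC guidance together, the non-disruptive workaround and the follow-up check look like this (a sketch recombining the commands quoted above; the prompt is a placeholder):

    ! hidden command from TAC - unloads the BRIDGE-MIB without disruption
    nexus5k(config)# no snmp-server load-mib dot1dbridgesnmp
    ! then confirm the SNMP process CPU settles back down
    nexus5k# show proc cpu sort | in snmp

Remember it does not survive a reload, so it has to be reapplied after every reboot.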
I'm on NPM 10.6 and my Nexus 5K is on 5.2(1)N1(5).
We are also having the same problem, where interfaces go unknown.
I was on version 5.2(1)N1(4) and saw a bug related to SolarWinds and SNMP, so I upgraded, but I'm still having the same problem!
The interfaces go into warning / up / unknown states when we do not get the response to the interface status poll in time; this can be seen in a Wireshark trace by following the request ID of each interface poll.
I have seen this before when topology polling overloads the device. I would be more interested in removing topology polling from the node and then checking whether interface polling recovers.
Solution / isolation steps:
We can remove the topology pollers to reduce the SNMP load on the device.
Simply go to the Node Details page > "List Resources".
Untick "Topology Layer 2" > Submit. Now wait another 18 minutes or so and check the interface status.
This should help. If not, we can also remove "Topology Layer 3" to reduce the SNMP load further.
If you still have the same issue even after removing both topology pollers, I will need Wireshark traces and diagnostics from the server to verify the response time between the polling packets sent from Orion; a quick manual check is sketched below.
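For a quick manual check of that response time without a full trace, you can time a single status poll with net-snmp (hostname, community, and the ifIndex value are placeholders; -r 0 disables retries so the timing is honest):

    # time one ifOperStatus poll; anything near the poller's SNMP timeout explains the "unknown" flaps
    time snmpget -v2c -c public -t 10 -r 0 nexus5k.example.com IF-MIB::ifOperStatus.436207616

If this intermittently takes several seconds against the Nexus but not against other devices, the device side is the bottleneck.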
Thanks Malik for your post.
I have removed both Layer 2 and Layer 3, but I'm still getting the same issues.
Have you opened a support ticket? We need to check the diagnostics / Wireshark trace and determine the response time from the node.
I had UDT monitoring the switches as well; I have turned it off. It was running better, but the problem is still happening. I will log a ticket.
I have discovered that when this happens to me, unmanaging and then remanaging the device brings everything back to normal.
There could be multiple reasons depending on the environment. I would strongly advise creating a support ticket so we can address each issue separately.
We do need diagnostics from the server and a Wireshark trace to determine the root cause of the issue.
I have this issue. I have had a case open for ages and can reproduce it at will. The only fix is to unmanage your Nexus before you reboot it and then remanage it afterwards - then there is no issue, assuming it doesn't crash. If you forget, you can hack the database to re-enter the interface OIDs, because they get stored with the value -1 during interface remapping after a reboot (which to me is the problem... why does NPM do that?). Ugly. Or you can re-add the interfaces and end up with orphaned interfaces holding your historical data. Really ugly.
This doesn't affect port channels or the management interface, only 'Eth' interfaces. Apparently it's the Nexus at fault, which I find difficult to believe, and apparently I should contact the vendor, who should be able to fix the issue. Problem is... what is the issue? Why can I correct the Orion data and make it work, and why does unmanage/remanage work? I'm a bit fed up with trying to get an answer or be taken seriously, and possibly being stuck in the inter-vendor blame gap now doesn't help.
surely others are having this issue?
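For anyone attempting the same database fix, the kind of check involved looks roughly like this - a sketch only, with the server/database names as placeholders and the Interfaces table/column names assumed from the NPM schema; take a database backup before touching anything:

    # hypothetical: list interfaces whose SNMP index was reset to -1 after the reboot remap
    sqlcmd -S OrionSqlServer -d SolarWindsOrion -Q "SELECT InterfaceID, NodeID, Caption, InterfaceIndex FROM Interfaces WHERE InterfaceIndex = -1"

The actual fix is then updating InterfaceIndex back to the value shown on the switch, which appears to be what unmanage/remanage does for you.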
Shadow, did that fix you? We extended our timeout in a dll config (Solarwinds.NPM.BussinessLayer.dll.config), increasing the key = "SNMPTimeout" value from "3" to 10. That worked for a couple of hours. I just now unmanaged/remanaged our 3 Nexus appliances and am waiting to see if the alerts come back...
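For reference, that change is just an appSettings edit inside the .config file mentioned above - roughly like this, a sketch based only on the key/value quoted in the post (the real file contains other entries too):

    <appSettings>
      <!-- SNMP timeout in seconds; raised from 3 to 10 -->
      <add key="SNMPTimeout" value="10" />
    </appSettings>

A service restart is typically needed for a .config change like this to take effect.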
I have since upgraded to a newer version and it still does the same thing. At this point, sometimes unmanaging the devices works, and at other times I have to delete the device and re-add it. I try to avoid deleting and re-adding, since the historical data is lost.