We are having the same issue with Orion v9.1 SP5. We have an Avaya Cajun P330 stack where some of the monitored interfaces go to an unknown state, and the condition will not clear unless we restart the "Solarwinds Network Performance Monitor" service. I can run "List Resources" and the ports show green. If I try to Rediscover, the progress bar flashes up very briefly and then disappears. It's annoying, to say the least. Any help is appreciated.
I see this problem with a few of our Unix-based billing servers: suddenly all the interfaces become unknown, which causes panic. I have to restart all the services before it becomes OK again.
Please solve this redundancy issue.
I have a similar problem with certain Cisco devices behind an ASA. I pass SNMP through to each device by mapping a range of high ports on the ASA's interface to the device behind it. Out of eight 851 routers, two show no interfaces at all, yet I can discover them.
This seems to be a recurring issue with a number of customers. Here is another recent post with the same symptoms. Random Nodes 'seem' to stop polling
We have seen similar issues in releases prior to 9.x, but they occurred very seldom and weren't nearly as prevalent as with the new 9.1 SP5 instance that we are now running.
In our particular case, we are still building out our new Orion instance, and are currently monitoring more elements than I would like on the polling engine that has exhibited this issue, so I'm about to move some of them off to see if it's potentially load related. I've also noticed that the problem seems to crop up about the same time that I'm adding or deleting elements, and I have yet to see it happen during periods where there are no changes being made, though I don't believe enough time has passed to definitively link the two events.
Has anyone noticed a pattern of this happening around the same time that you are adding or deleting elements, or perhaps simply modifying them in System Manager or from the web management page? Or... Can anyone verify that this occurred in their environment at times when you KNOW that nothing has changed?
I'll keep watching to see if I can find a pattern.
We had this issue pop up with another device today, so we did a little digging, and here's what we found.
First off, our new Orion instance is behind a firewall cluster, so we have to traverse it to reach all the devices we monitor. This is a departure from our previous configuration where the Orion servers resided in the same security domain as 98% of the devices we monitored. BTW, the firewall cluster is a pair of Checkpoint firewalls running on Nokia HW.
Like others have mentioned, this was a case where we noticed that we were no longer collecting SNMP data for a particular device, but ICMP was still working. We saw no CPU usage, interface statistics, or interface status (as it's gathered via SNMP as well). So, all the interfaces on the device appeared to have a status of "Unknown". If we went into System Manager, the interfaces also showed up as "?", but a "List Resources" would always properly display all the interfaces, which indicates that we are able to talk to it via SNMP.
The first thing I wanted to do was see what the firewall was showing. With that, we looked into the firewall logs to make sure there wasn't anything being dropped, and also to verify that it was seeing the requests from the Orion polling engine. Interestingly enough, we didn't see any SNMP traffic in the firewall logs from the polling engine to the router in question as a result of the normal polling and statistics collection process. However, when we did a "List Resources", we would see this traffic in the firewall logs. Hmmmm...???
To take it a step further, we used "tcpdump" on the inbound firewall interface, and here we did see the SNMP traffic, but we noticed that the normal polling and statistics collection process was using one source UDP port, and the ad hoc "List Resources" was using a different source port. Keep in mind that the "List Resources" works, and the normal process doesn't.
With that in mind, we turned around and did a tcpdump on the outbound interface, and we didn't see the normal polling and statistics collection traffic hitting this interface. This tells us that for some reason the firewall stopped forwarding this traffic, but it is still forwarding traffic for all the other nearly 17,000 elements we're monitoring. Why it would stop forwarding for this particular device, using this particular UDP port number is the question.
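For anyone who wants to repeat the inbound/outbound comparison, here is a minimal Python sketch of the idea. It assumes you saved the text output of `tcpdump -n` from each interface, that the firewall does not NAT the traffic (so the addresses and ports match on both sides), and that the lines look like the usual IPv4 format. The addresses, ports, and the function names `snmp_flows` and `dropped_flows` are made up for illustration; they are not taken from our actual captures.

```python
import re

# Matches the "IP src.port > dst.port:" part of a numeric (-n) tcpdump line.
FLOW_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def snmp_flows(capture_text):
    """Return the set of (src_ip, src_port, dst_ip) seen going to UDP port 161."""
    flows = set()
    for line in capture_text.splitlines():
        match = FLOW_RE.search(line)
        if match and match.group(4) == "161":
            flows.add((match.group(1), int(match.group(2)), match.group(3)))
    return flows


def dropped_flows(inbound_text, outbound_text):
    """Flows that reach the inbound interface but never leave the outbound one."""
    return snmp_flows(inbound_text) - snmp_flows(outbound_text)


# Hypothetical sample data: source port 2001 stands in for the regular poller,
# source port 2345 for an ad hoc "List Resources" request.
INBOUND = """\
10:00:01.000000 IP 192.0.2.10.2001 > 198.51.100.7.161: UDP, length 43
10:00:02.000000 IP 192.0.2.10.2345 > 198.51.100.7.161: UDP, length 43
"""
OUTBOUND = """\
10:00:02.000100 IP 192.0.2.10.2345 > 198.51.100.7.161: UDP, length 43
"""

print(sorted(dropped_flows(INBOUND, OUTBOUND)))
```

Any flow this prints was seen arriving at the firewall but not leaving it, which is the pattern described above.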
The best we can figure, this traffic is not showing up in the firewall logs because it's viewed as an open connection and thus not re-inspected and logged. Of course, it can't escape a tcpdump though. Still, the question is why it's not being forwarded.
With the notion in mind that it was possibly related to this particular open connection, we looked to see what the SNMP connection timeout value was on the firewall and discovered it to be 200 seconds. Our statistics collection intervals for nodes and interfaces are set to 5 minutes, and the status polling interval for interfaces is set to 2 minutes. This configuration would not allow the connection to "close" (from the firewall's perspective), as it was never idle for more than 120 seconds, which is less than the 200-second timeout. With that, I unmanaged this router for just over 200 seconds, then re-managed it again. Upon re-managing the device, all the interfaces went back green, and SNMP statistics began to show up again.
I don't really want to change our collection intervals and such, so I'm now wondering if we might want to set the SNMP connection timeout on the firewall to something below 120 seconds. Of course, that still doesn't explain why the firewall suddenly stopped forwarding this traffic, but if we can force a new connection, we should be able to avoid this issue in the future... Maybe... :)
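The timeout arithmetic can be written down as a rough Python sketch. It assumes the firewall resets its idle timer on every packet of the flow and that the poller keeps reusing the same source port; both are guesses about behavior, not something we have confirmed. The function name `flow_can_idle_out` is just for illustration.

```python
def flow_can_idle_out(idle_timeout_s, polling_intervals_s):
    """True if the gap between polls is long enough for the firewall entry to expire.

    The shortest polling interval bounds how long the flow is ever idle.
    """
    return min(polling_intervals_s) > idle_timeout_s


FIREWALL_TIMEOUT_S = 200   # SNMP connection timeout found on the firewall
INTERVALS_S = [300, 120]   # 5-minute statistics collection, 2-minute status polling

# Current setup: idle for at most 120 s, so the 200 s timeout never fires.
print(flow_can_idle_out(FIREWALL_TIMEOUT_S, INTERVALS_S))  # False

# A timeout below 120 s (100 s here, as an example) would let the entry expire.
print(flow_can_idle_out(100, INTERVALS_S))  # True
```

Under these assumptions, any firewall timeout below the shortest polling interval would force a new connection on each poll.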
Not sure if this helps anyone, but thought I would mention it... :)
We have a FlexiHybrid (Nokia), a microwave transport system. We noticed that we are no longer collecting SNMP data for this particular device. The interfaces show an "Unknown" state.
We tried rediscovering the device and restarting the SolarWinds server, but with no success. The SolarWinds Orion version is 8.5.1, and it polls every 60 seconds.
The only way to return the object to normal is to restart the FlexiHybrid. Our Orion server is behind some Enterasys switches (N1, N3).
Is this a bug? Does it depend on the version? How can we solve this?
If you do a "List Resources" in System Manager, is it able to get the list of resources? Also, do you have some other app you can use to try to talk to it via SNMP (Engineer's Toolset maybe) so you can determine if it's just Orion that can't talk to it?
We have some issues from time to time where a particular device will simply stop responding to SNMP from Orion or any other app, and the only way to resolve it is to reboot the device. For instance, we have seen this several times with one of our Cisco CSS's in particular. In these cases, it won't respond to SNMP queries from any application or source address, not just Orion. If you can't talk SNMP to it from any application, then it sounds like it may be something on the device itself.
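If you don't have another SNMP tool handy, one way to test from a separate machine is a small Python wrapper around Net-SNMP's `snmpget`. This is only a sketch: it assumes Net-SNMP is installed, that the device answers SNMP v2c (use `-v 1` if it only speaks v1), and that sysUpTime.0 is readable with your community string. The helper names are made up for illustration.

```python
import shutil
import subprocess

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0 from MIB-II


def build_snmpget_command(host, community, timeout_s=2, retries=1):
    """Build a Net-SNMP snmpget command line for a single sysUpTime query."""
    return [
        "snmpget", "-v", "2c", "-c", community,
        "-t", str(timeout_s), "-r", str(retries),
        host, SYS_UPTIME_OID,
    ]


def device_answers_snmp(host, community):
    """Return True if the device replies to the query, False on timeout or error."""
    if shutil.which("snmpget") is None:
        raise RuntimeError("snmpget (Net-SNMP) is not installed on this machine")
    result = subprocess.run(
        build_snmpget_command(host, community),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `device_answers_snmp("<device-ip>", "<community>")` from a host other than the Orion server helps separate an Orion-side problem from a device that has stopped answering SNMP altogether.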
No, it is not able to get the list of resources. A timeout message appears (no answer with those SNMP credentials).
With Engineer's Toolset the message is: "Communication Problem: <IP-address> is currently down, unreachable, or 'mahu76' is not a valid SNMP community string"
This issue happens quite often, and it is not normal to have to reset the device this frequently.