Background information: Orion 8.5.1 SP3 with APM SP3 on a Win 2003 Server EE running SQL 2005 EE. All service packs and patches are up to date.
375 nodes, 1300 interfaces, 300 volumes.
Polling timing is 1 min for ping, 2 min for interface stats, and 15 min for volume stats.
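For a rough sense of the polling load these numbers imply, here is a back-of-the-envelope sketch in Python. It assumes one poll operation per element per interval, which is a simplification and not necessarily how Orion schedules or batches its requests. The names are only illustrative.

```python
# Rough polling-load estimate from the counts and intervals listed above.
# Assumption: one poll operation per element per interval (a simplification).
POLLED_ELEMENTS = {
    # name: (element count, polling interval in seconds)
    "ping (nodes)": (375, 60),
    "interface stats": (1300, 120),
    "volume stats": (300, 900),
}


def polls_per_second(elements):
    """Return the average poll rate per element type, in polls per second."""
    return {name: count / interval for name, (count, interval) in elements.items()}


rates = polls_per_second(POLLED_ELEMENTS)
total_rate = sum(rates.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} polls/s")
print(f"total: {total_rate:.2f} polls/s")
```

Under that assumption the average works out to roughly 17 poll operations per second across all three types.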
Have done more research on this, and with some sniffer captures we can see that the devices are responding to the SNMP get requests. The responses are properly formatted and look the same while the device is in the SNMP failure state and after the SNMP stop/start.
The problem seems to be that ORION just stops listening to the devices. The device will report UP because the ping is being answered, but no information sent back via SNMP is accepted or recorded in the database. Sometimes no devices are affected like this, and at other times it can be one or more. Have seen as many as 15 devices out at one time.
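One way to spot affected devices without waiting for user reports would be to flag any node that answers ping but has had no SNMP statistic recorded for several polling intervals. The sketch below is Python over hypothetical in-memory records. The field names (`status`, `last_snmp_stat`), the node names, and the three-missed-polls threshold are assumptions for illustration, not ORION's database schema.

```python
from datetime import datetime, timedelta

# Assumption: 2-minute SNMP interface polling, flagged after 3 missed polls.
SNMP_INTERVAL = timedelta(minutes=2)
MISSED_POLLS_ALLOWED = 3


def find_silent_nodes(nodes, now):
    """Return names of nodes that are UP by ping but whose SNMP data is stale."""
    cutoff = now - SNMP_INTERVAL * MISSED_POLLS_ALLOWED
    return [
        node["name"]
        for node in nodes
        if node["status"] == "UP" and node["last_snmp_stat"] < cutoff
    ]


# Hypothetical sample records with an arbitrary reference time.
now = datetime(2000, 1, 1, 12, 0, 0)
sample_nodes = [
    {"name": "srv-a", "status": "UP", "last_snmp_stat": now - timedelta(minutes=1)},
    {"name": "srv-b", "status": "UP", "last_snmp_stat": now - timedelta(minutes=30)},
    {"name": "srv-c", "status": "DOWN", "last_snmp_stat": now - timedelta(hours=2)},
]

silent = find_silent_nodes(sample_nodes, now)
print(silent)
```

Only `srv-b` is flagged here: it answers ping but its last recorded SNMP statistic is older than the cutoff. `srv-c` is skipped because it is already reported as down.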
Of the few hundred requests/responses in the capture, very few took over 10 thousandths of a second (10 ms) to be answered, and those that did were still under 30 thousandths of a second (30 ms). Each response carries the OID corresponding to its request. We compared the sequence of OID requests and responses from when it was not working with the sequence after the stop/start of SNMP, when ORION saw the device again. The gets and responses are identical, tracking from any of the ping requests onward.
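The capture comparison described above could be sketched in Python roughly as follows. It assumes the capture has been exported to simple records of (timestamp in seconds, kind, request id, OID). That record layout and the function names are hypothetical, and the sample values are made up for illustration rather than taken from the actual captures.

```python
def match_responses(packets):
    """Pair each SNMP get with its response by request id.

    Returns (matched, mismatched, unanswered):
      matched    - (oid, latency_seconds) where the response OID equals the request OID
      mismatched - (oid, latency_seconds) where the response OID differs
      unanswered - request ids that never received a response
    """
    pending = {}
    matched, mismatched = [], []
    for ts, kind, req_id, oid in sorted(packets):
        if kind == "get":
            pending[req_id] = (ts, oid)
        elif kind == "response" and req_id in pending:
            sent_ts, sent_oid = pending.pop(req_id)
            target = matched if oid == sent_oid else mismatched
            target.append((oid, ts - sent_ts))
    return matched, mismatched, sorted(pending)


def oid_sequence(packets):
    """Return the ordered (kind, oid) sequence, ignoring timestamps and ids."""
    return [(kind, oid) for _ts, kind, _req_id, oid in sorted(packets)]


# Illustrative data only: sysUpTime.0 and ifInOctets.1 polled in both captures.
failing = [
    (0.000, "get", 1, "1.3.6.1.2.1.1.3.0"),
    (0.004, "response", 1, "1.3.6.1.2.1.1.3.0"),
    (0.100, "get", 2, "1.3.6.1.2.1.2.2.1.10.1"),
    (0.125, "response", 2, "1.3.6.1.2.1.2.2.1.10.1"),
]
working = [
    (5.000, "get", 7, "1.3.6.1.2.1.1.3.0"),
    (5.003, "response", 7, "1.3.6.1.2.1.1.3.0"),
    (5.100, "get", 8, "1.3.6.1.2.1.2.2.1.10.1"),
    (5.106, "response", 8, "1.3.6.1.2.1.2.2.1.10.1"),
]

matched, mismatched, unanswered = match_responses(failing)
slow = [oid for oid, latency in matched if latency > 0.010]  # over 10 ms
same_sequence = oid_sequence(failing) == oid_sequence(working)
print(len(matched), mismatched, unanswered, slow, same_sequence)
```

If every get is matched, nothing is unanswered, and the OID sequence is the same in both captures, the wire traffic gives no reason for the data to be dropped, which is the point being made here.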
The device SNMP information is accepted again once the SNMP service on it has been stopped and started.
This puts it back in SolarWinds' area to determine what is causing this problem, because from the packet captures we can see that the servers are processing the SNMP requests and responding to them properly.
Multiple users have reported having this issue and it needs to be addressed, as this is not a satisfactory state of operation. I can only see this issue getting worse as we load more servers on ORION coupled with APM.
There is no specific time or any pattern that we can see and the problem affects multiple servers at different times.