We have a pretty extensive infrastructure with many different types of devices, OS versions, and so on. We are at around 50k elements and counting as our infra continues to change and grow. Along the way I've noticed a weird error that I cannot for the life of me get others to accept as a valid problem. They just want to point the finger at SolarWinds and say the software is broken, when in reality it's a configuration error somewhere in the infra causing this problem.
I have an old SolarWinds environment and a new environment that was built separately. We have many network devices that work on one environment and fail on the other, and vice versa. Every couple of days a device switches which environment it works on, and this is causing gaps in data. Has anyone ever experienced such a thing? The devices are currently being monitored by both environments, and I think this could cause an issue in SNMP monitoring. To be clear, by "working" I mean the SNMP monitoring. If you run SNMP tests, you'll see that one day it fails, a few days later it works, then a few days later it fails again. And when it doesn't work on one environment, it works on the other. It is a really strange issue.
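To get hard evidence of the pattern, I'm thinking of running something like the rough sketch below from each polling server on a schedule and comparing the logs. It assumes the Net-SNMP `snmpget` command is installed on the box and that the devices answer SNMP v2c; the function names, the `poller` label, and the CSV line layout are all made up for illustration and have nothing to do with SolarWinds itself.

```python
import subprocess
from datetime import datetime, timezone

# sysUpTime.0, a standard OID that nearly every SNMP agent answers.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"


def build_snmpget_cmd(host, community, timeout_s=2, retries=1):
    """Build a Net-SNMP snmpget command line (assumes SNMP v2c)."""
    return [
        "snmpget",
        "-v", "2c",
        "-c", community,
        "-t", str(timeout_s),   # per-request timeout in seconds
        "-r", str(retries),     # number of retries
        host,
        SYS_UPTIME_OID,
    ]


def probe(host, community):
    """Return True if the device answered the SNMP GET, False otherwise."""
    cmd = build_snmpget_cmd(host, community)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def log_line(poller, host, ok, now=None):
    """Format one CSV line: timestamp, poller label, host, result."""
    now = now or datetime.now(timezone.utc)
    status = "ok" if ok else "fail"
    return f"{now.isoformat()},{poller},{host},{status}"
```

Calling `log_line("old", host, probe(host, community))` every few minutes on the old server, and the same with `"new"` on the new one, would give two timelines for the same device that can be lined up against each other.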
My thought is that there is filtering of some kind happening. There may be a firewall, a Riverbed device, or a Palo Alto that's interfering with the packets and how things are being monitored. When it sees requests coming from two different sources, it is probably flagging one source as suspect and blocking that traffic, and which source gets blocked keeps flipping around on me. I say this because when it fails on either SolarWinds environment, a pcap shows nothing coming back from the device; the request just times out. And when it works, everything works.
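If the "one source gets blocked at a time" theory is right, the two environments should almost never succeed or fail together for the same device. Here is a minimal sketch of how I would check that, assuming I already have paired test results per device from both environments; the tuple layout and the state names are my own invention.

```python
def summarize(samples):
    """Count poller states and list the points where the state flips.

    samples: list of (timestamp, old_ok, new_ok) tuples, sorted by time,
    where old_ok / new_ok say whether SNMP worked from each environment.
    """
    counts = {"both_ok": 0, "only_old": 0, "only_new": 0, "both_fail": 0}
    flips = []
    prev = None
    for ts, old_ok, new_ok in samples:
        if old_ok and new_ok:
            state = "both_ok"
        elif old_ok:
            state = "only_old"
        elif new_ok:
            state = "only_new"
        else:
            state = "both_fail"
        counts[state] += 1
        # Record every change of state together with when it happened.
        if prev is not None and state != prev:
            flips.append((ts, prev, state))
        prev = state
    return counts, flips
```

Mostly `only_old` and `only_new` with very few `both_ok` samples would support the idea that something in the path is allowing only one source at a time, and the flip timestamps could then be compared against logs on the firewalls and other inline devices.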
There is a mix of TippingPoint devices, Palo Alto devices, F5 devices, and firewalls within the communication paths to these devices. It's not a straight shot from the SolarWinds server to the device.
We are trying to bring the new deployment live but can't because of these instabilities, which affect both environments.
Any help at all is appreciated.