So I have several pain points, and I'm curious whether anyone has seen worse. If anyone has tips, tricks, advice, or help to offer, that would be much appreciated as well!
1. The case of the orphaned phantom service. Four of our additional pollers decided that they no longer consider one of the business layer processes valid. The process does run and is started on every reboot. Force kill doesn't work; it says the PID is invalid. Killing it through Task Manager returns access denied. It's evident that it's running because it's locking files: on these four pollers, the Configuration Wizard will not run. Permissions are correct, as confirmed by the permissions checker and verified manually. Head-scratcher, isn't it?
2. Primary poller randomly opening 12k-plus WMI connections to whois.iana.org. Weird. Why do it through WMI? Why is whois needed? Why not reach local DNS, etc.? Lots of whys, and no answers.
3. Agent communication randomly dropping out. A poller will show a server as up and show data, but the data is stale. The agent says connected and is green. The agent restart command works, and it's still green. Yet List Resources hangs forever, and Poll Now won't return new data. A poller reboot suddenly syncs the agents back up, and we are flooded with servers showing reboot alerts and properly showing data.
4. Site rendering weirdness. On the login page we see a blank space at the top of the page, and randomly that space will show a button with something regarding widget setup???? Ummmm, I'm confused. Then different sections of the site will randomly show errors. The Configuration Wizard runs clean on the primary, and the site has been rebuilt at least 30 times already with the same effect.
5. When database maintenance runs, we get a lot of connection timeouts to the database during the process. We can see the server and can ping the server, but can't connect to SQL; it says the connection timed out waiting for a response. Because of this we have frequent poller failures, along with issues with data consistency and accuracy. The DB was built with all the SolarWinds recommendations and then some. It has a 10 Gb network pipe and the fastest storage we have available in our infrastructure.
6. IPAM is a complete mess. Unusable.
7. Alerts work weirdly. You clear one alert, and the system automatically clears everything after it and retriggers everything. Alerts fail too.
8. Reports provide inaccurate information.
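
For item 2, one way to get a clearer picture of which process owns those connections is to tally the established TCP connections per remote endpoint from a netstat dump taken on the primary poller. The sketch below is only an illustration, not anything SolarWinds provides: it assumes Windows `netstat -ano` style columns (Proto, Local Address, Foreign Address, State, PID), and the function name and sample addresses are made up. Since `-n` prints numeric addresses, the address that whois.iana.org resolves to would have to be looked up separately and matched by hand.

```python
from collections import Counter


def count_remote_endpoints(netstat_text, state="ESTABLISHED"):
    """Count TCP connections per (remote host, remote port, PID).

    Assumes `netstat -ano` style columns:
    Proto, Local Address, Foreign Address, State, PID.
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        parts = line.split()
        # Skip headers, blank lines, and UDP rows (UDP rows have no State column).
        if len(parts) < 5 or parts[0].upper() != "TCP":
            continue
        if parts[3].upper() != state:
            continue
        # rpartition splits on the last colon, so IPv6 such as [::1]:135 stays intact.
        host, _, port = parts[2].rpartition(":")
        counts[(host, port, parts[4])] += 1
    return counts


# Hypothetical sample input; the addresses are documentation-range placeholders.
SAMPLE = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    10.0.0.5:49812         192.0.2.10:135         ESTABLISHED     4120
  TCP    10.0.0.5:49813         192.0.2.10:135         ESTABLISHED     4120
  TCP    10.0.0.5:49814         192.0.2.20:1433        ESTABLISHED     2288
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       900
"""

counts = count_remote_endpoints(SAMPLE)
for (host, port, pid), n in counts.most_common(10):
    print(f"{n:6d}  {host}:{port}  pid={pid}")
```

Saving the output of `netstat -ano` to a file and passing its contents to the function would give the busiest remote endpoints along with the owning PID, which can then be matched to a service name in Task Manager.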
And I could go on and on and write until I reach the character limit. The issues all seem to compound each other, in that you fix one and create another 20 in the process. I'm almost convinced the best thing is to drop everything and start fresh. But then the question becomes: how can we preserve the current historical data?
So what do you think? Could it get worse? And if not, is there hope for a brighter day? Hahaha.