A couple of times over the past 6 months, an event (a DB failover or a network blip) has severed the connection between our Orion platform and its database, which run on separate servers, and I didn't find out until after the business day had started. Reviewing the event logs on the Orion platform server, I find multiple event 4001 entries showing that data wasn't flowing:
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The remote computer refused the network connection.)
It's even more evident when I log into the Orion dashboard and it's lit up with red X's and error blocks. So far, my only solution has been to reboot the server.
My question is: how can I alert myself when this happens? Can SolarWinds monitor itself and alert me if something is wrong internally? Or do I need to set up something on another server to check for that error?
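To make the "check from another server" idea concrete, here is a rough sketch of the kind of external check I have in mind. It is only an illustration under assumptions I haven't verified for our setup: that the database listens on the default SQL Server port 1433, and that a plain TCP connection attempt is a good enough stand-in for the "remote computer refused the network connection" condition in the error above. The function name and the host name in the usage comment are hypothetical placeholders.

```python
import socket


def db_port_reachable(host: str, port: int = 1433, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Assumption: 1433 is the default SQL Server port; a named instance or a
    custom port would need a different value.
    """
    try:
        # create_connection resolves the host name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage (host name is a placeholder, not our real server):
#     if not db_port_reachable("orion-db.example.local"):
#         print("ALERT: SQL Server port is unreachable or refusing connections")
```

Run on a schedule from the Orion server itself, this would exercise the same network path Orion uses. It would only tell me that the database port is reachable, though, not whether Orion has actually reconnected after the blip, so it would cover only part of what I'm asking about.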
Secondly, given that error, what is the best way to re-establish the connection? I wasn't sure whether a reboot is the only way or whether restarting all of the Orion services would do the trick. That would be nice, since it could probably be scripted.
I did contact support, but so far their response has been 'yup, you won't be able to poll stuff when that error happens'. Great, but I need to get ahead of that error and know what to do about it. Since being the SW admin is only part of my job, I'm not as well versed in the platform as I should be and would appreciate any tips.