Just need my maintenance sorted (today hopefully) and I can get the next RC downloaded to take a look.
Was there anything in particular that was addressed that you can share details on? It would be good to understand what the cause was and whether our environment was a contributing factor.
It's a little too in the details to explain, and quite honestly without a lot of Orion development context, probably even more difficult to understand. If you're uncomfortable upgrading without validation that the issue you're experiencing is the same as the ones addressed in Orion Platform 2019.2, then I recommend opening a case with support. They will be able to investigate the issue and work with engineering to determine if the issue you're encountering is related.
Not a problem, to be honest hearing the issue should be resolved is enough.
If it still occurs then will raise a case to see if our issue is a different one.
Upgraded to 2019.2 HF2 a couple of weeks back and so far so good, haven't had a single issue with HA.
Email alerts have been a little bit noisier on the main poller but I can live with that, thanks for the work on this.
Update on our case. We need to run validation for a few more days, but I think we found the root cause. We are running SQL HA across 2 separate sites, and our DBAs had mistakenly set the replication to synchronous. Since the database VMs are in separate sites, we need asynchronous replication. Since configuring asynchronous replication in SQL, we have no longer seen the HA nodes change to a critical state or the HA table stop updating. We also set up the configuration manager to point to the FQDN of the SQL listener. I think the replication configuration may be the culprit. Will update the case once we are sure.
The key error we saw, and that support pointed out, was in the SolarWinds application event log:
Log Name: SolarWinds.Net
Date: 4/23/2019 3:51:49 AM
Event ID: 4001
Task Category: None
Service was unable to open new database connection when requested.
SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The remote computer refused the network connection.)
Connection string - Data Source=OUR_SQL_Instance ;Initial Catalog=SolarWindsOrion;Persist Security Info=False;User ID=SQL-dummy-user ;X;Max Pool Size=1000;Connect Timeout=20;Load Balance Timeout=120;Packet Size=4096;Application Name="Orion Syslog Service@SyslogService.exe";Workstation ID=Ourhost;MultiSubnetFailover=True
We did finally close our case. I have seen our HA environment stay in a healthy state for over 10 days now. So yes, if your HA pool goes to a critical state, look for any SQL errors in the SolarWinds event log and make sure SQL connectivity is not the root cause.
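For anyone who wants to script that connectivity check, here is a rough Python sketch, not anything from SolarWinds or support. It pulls the server name and timeout out of a connection string like the one in the event log and tries a plain TCP connection to it. The function names are made up for illustration, and port 1433 is an assumption (the SQL Server default); your listener may use a different port.

```python
import socket


def parse_connection_string(conn_str):
    """Split a 'key=value;key=value' connection string into a dict.

    Keys are lower-cased. Fragments without '=' (for example a redacted
    password placeholder) are skipped.
    """
    settings = {}
    for part in conn_str.split(";"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def can_reach_sql(host, port=1433, timeout=20):
    """Return True if a TCP connection to host:port opens within timeout seconds.

    Port 1433 is only the SQL Server default, assumed here for illustration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

You would call it along the lines of `can_reach_sql(parse_connection_string(conn_str)["data source"])` from each HA pool member. It only proves the port answers, not that the login works, and it does not handle a Data Source written as `host\instance` or `host,port`, or quoted values that contain a semicolon.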