Overview
We have a unique scenario where a number of SolarWinds Agents go into an unknown state. While we are still investigating the root cause, we noticed that if the Polling Engine assigned to those agents fails over using HA, the agents eventually reconnect.
This alert uses a fairly extensive SWQL query to check for specific conditions before it triggers. You will need to update three items in the query to adapt it to your deployment.
Update the following items
- Percentage of Agents in unknown state (Yellow): In the screenshot this is set to 80%, which is the share of all agents assigned to that APE (Additional Polling Engine) that are in an unknown state.
- Minimum count of Agents per APE (Blue): The number of agents assigned to a single APE can vary widely. For use cases where the agent count is low, this lets you set a minimum agent count per APE. This is important because the previous number (yellow) could be greatly skewed in environments with low agent-per-APE ratios.
- Minimum amount of time after a failover event before the alert can re-trigger (Green): This is the number of minutes since the last failover for the respective HA pool. 720 minutes = 12 hours. It might take some time for all agents to reconnect, and we did not want to trigger another HA failover prematurely.
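To make the interaction of these three values easier to reason about, here is a minimal Python sketch of the decision they describe. It is not the SWQL query from the alert. The names ApeSnapshot and should_trigger_failover are hypothetical, the minimum agent count of 10 is only an illustrative value, and the use of "greater than or equal" comparisons is an assumption, so check the actual query for the exact operators.

```python
from dataclasses import dataclass

# The three values you would adjust (colors match the screenshot callouts).
UNKNOWN_PERCENT_THRESHOLD = 80    # yellow: percent of agents in unknown state
MIN_AGENTS_PER_APE = 10           # blue: illustrative value, not from the original query
FAILOVER_COOLDOWN_MINUTES = 720   # green: 720 minutes = 12 hours


@dataclass
class ApeSnapshot:
    """Hypothetical summary of one polling engine at evaluation time."""
    total_agents: int
    unknown_agents: int
    minutes_since_last_failover: float


def should_trigger_failover(ape: ApeSnapshot) -> bool:
    """Return True only when all three conditions hold (assumed >= comparisons)."""
    # Too few agents: the percentage would be skewed, so never trigger.
    if ape.total_agents < MIN_AGENTS_PER_APE:
        return False
    # A failover happened recently: give agents time to reconnect.
    if ape.minutes_since_last_failover < FAILOVER_COOLDOWN_MINUTES:
        return False
    unknown_percent = 100 * ape.unknown_agents / ape.total_agents
    return unknown_percent >= UNKNOWN_PERCENT_THRESHOLD


if __name__ == "__main__":
    # 85 of 100 agents unknown, last failover 1000 minutes ago.
    print(should_trigger_failover(ApeSnapshot(100, 85, 1000)))
    # 4 of 4 agents unknown is 100%, but the APE is below the minimum count.
    print(should_trigger_failover(ApeSnapshot(4, 4, 1000)))
```

The second case shows why the minimum count matters: on an APE with only a handful of agents, a few unknown agents push the percentage past the threshold even though the absolute number is small.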
Details of Alert Actions
- 1st Alert Action: NetPerfMon Event. This logs the conditions that triggered the alert to the Events/Message Center, giving you a reference point for historical purposes.
- 2nd Alert Action: This is a PowerShell script that calls the SolarWinds API to force a failover. The script can be downloaded from the link below and saved to your main SolarWinds server. If you run HA on your main server, download and save it to both the main and the main standby servers.
https://thwack.solarwinds.com/content-exchange/the-orion-platform/m/scripts/3767
Misc
As an additional source of monitoring, I created this SAM Application Template so I can track unknown agent status in my environment.