I have an issue affecting my node down alerts, which trigger when the node status has been DOWN for 5 minutes or more. I noticed today that one node, which is agent monitored, is switching between a status of DOWN and Critical. This happens within the 5-minute window that the down condition must persist for, so it prevents my down node alert from triggering. I could probably just lower the 5 minutes to capture the alert, but I really need to understand why the node goes to a Critical status at all when the server is completely offline and packet loss is 100%.
Can anyone explain this? I fear this issue may be more widespread than the node I'm looking at.
Here are some additional details that may be relevant about the node in question:
- It's a domain controller and we have the AppInsight for AD template applied. Its status is "Unknown", as expected.
- The node is monitored by the Agent
- The node status is monitored by ICMP Ping (not agent)
- The alert triggers off of a "DOWN" status that exists for more than 5 minutes, nothing more.
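
To illustrate what I think is happening, here is a minimal Python sketch of a "condition must exist for 5 minutes" check. It is only a model of my understanding, not the product's actual alert engine: the function name `first_trigger_time`, the 60-second polling interval, and the sample status sequences are all made up for illustration, and I am assuming the timer resets whenever the status leaves DOWN.

```python
from typing import Iterable, Optional, Tuple


def first_trigger_time(
    samples: Iterable[Tuple[int, str]],
    hold_seconds: int = 300,
    target: str = "Down",
) -> Optional[int]:
    """Return the first timestamp at which `target` has held continuously
    for at least `hold_seconds`, or None if that never happens.

    Assumption: any sample with a different status resets the timer.
    """
    since: Optional[int] = None  # when the current unbroken run of `target` began
    for timestamp, status in samples:
        if status == target:
            if since is None:
                since = timestamp
            if timestamp - since >= hold_seconds:
                return timestamp
        else:
            since = None  # status changed, so the run is broken
    return None


# Hypothetical status samples, one every 60 seconds: (seconds, status).
flapping = [
    (0, "Down"), (60, "Down"), (120, "Critical"),
    (180, "Down"), (240, "Down"), (300, "Critical"),
    (360, "Down"), (420, "Down"),
]
steady = [(t, "Down") for t in range(0, 421, 60)]

print(first_trigger_time(flapping))  # None: the run never reaches 300 seconds
print(first_trigger_time(steady))    # 300
```

If the alert really works like this, the node never stays DOWN long enough for the alert to fire while it keeps flipping to Critical, even though it is unreachable the entire time.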
Any help would be greatly appreciated!