Hey everyone!
Now that our DBAs are starting to dig in and really pay attention to our alerting and monitoring in Orion, we're finding some places where I feel we haven't set things up properly. With this in mind, I wanted to reach out to the community to see whether what I've done lines up with what others have done.
I'm going to use an example of one of our DEV environments.
2 nodes: 03a-d and 03b-d, both with agents installed.
2 database instances: EMAS03Dev and IAI03Dev.
The nodes use Windows Server Failover Clustering in an active/passive configuration.
The database instances have unique listener IPs/VIPs, separate from the underlying nodes.
2 listener nodes: We have the listener IPs configured using WMI (agentless) instrumentation, one for EMAS03Dev and one for IAI03Dev. Each one detects the underlying node as 03a-d or 03b-d, depending on which is primary.
On the listener nodes, we have AppInsight for SQL set up (we needed WMI in order to choose the AppInsight for SQL template).
On the listener nodes, we also have Windows Server 2012-2016 Failover Cluster monitoring configured.
We do not have any application monitors configured on the individual nodes as of yet.
What we are looking for:
Our #1 priority is the health of the entire cluster. I think we've got that pretty well covered -- the AppInsight for SQL being assigned to the cluster (without using an agent) follows the primary failover server, so it looks like it is up at all times in our testing so far.
However, our DBAs also want to be informed if any issues arise with the underlying cluster health. Specifically, they want to know if an individual SQL instance goes down, or if the failover cluster can't form a quorum or has errors.
Ideally I'd like to set up alerts that send off updates on the underlying cluster health, because the DBAs don't want to live in Orion to know if something has happened.
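To make the quorum part concrete, here is a minimal sketch in Python of the check I'd want an alert to be based on. It is not tied to any Orion API. It only encodes the basic majority rule (more than half of the configured votes must be online) and ignores dynamic quorum adjustments. The function name and the vote counts in the usage lines are hypothetical; I'm assuming two nodes plus a witness purely for illustration.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Return True if a strict majority of the configured votes is online.

    Simplified majority rule only; dynamic quorum and dynamic witness
    behaviour are deliberately not modelled here.
    """
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    if not 0 <= votes_online <= total_votes:
        raise ValueError("votes_online must be between 0 and total_votes")
    # A strict majority means more than half of all configured votes.
    return votes_online > total_votes // 2


# Hypothetical example: 2 nodes + 1 witness = 3 votes (an assumption).
print(has_quorum(votes_online=2, total_votes=3))
print(has_quorum(votes_online=1, total_votes=3))
```

With three votes, losing one still leaves a majority, while losing two does not, which is the condition the DBAs would want to be alerted on.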
What we are seeing:
As mentioned, the AppInsight for SQL seems fine. However, two things are lacking:
1. I am not sure how to monitor the individual nodes. I would think we could turn on AppInsight at the node level, but then one node might show as unknown or offline while it is the passive one.
2. It seems like the Failover Cluster application is only really tracking things when the cluster listener is pointed at one server, and not the other. In the below example, we know that the primary/secondary failover has occurred at specific times, and I think the unknown matches up to that.
Suggestions?
Honestly, I am mostly just looking for how to set this up. I thought putting everything on the cluster node was working, but now that we see the Failover Cluster data coming through as "unknown" and we aren't really getting reports of state changes, I'm thinking we need to have the failover cluster monitoring set up on each OS node (and maybe even on the cluster listener node too). I'm also not sure whether I should be setting up AppInsight for SQL on both the individual servers and the cluster, but I have a strong feeling that would just duplicate data for no reason and add to the confusion. Ideally we'd just have small monitors that tell us whether the SQL services are running, which node is active and which is passive, and whether an individual instance is running.
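Here is a rough sketch in Python of the logic I'd want those small monitors to implement. It is only an illustration of the decision rules, not an Orion or SAM API. The class, field, and function names are hypothetical, and it assumes that a stopped SQL service on the passive node is expected and should not raise an alert.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeObservation:
    """One polling result for one physical node (hypothetical structure)."""
    node: str
    sql_service_running: bool
    owns_instance: bool  # True if the cluster reports this node as active


def classify_instance(obs: List[NodeObservation]) -> Tuple[str, Optional[str]]:
    """Return (status, active_node) for a single clustered SQL instance."""
    owners = [o for o in obs if o.owns_instance]
    if not owners:
        # No node owns the instance: it is offline everywhere.
        return ("DOWN", None)
    if len(owners) > 1:
        # More than one owner should never happen; flag it loudly.
        return ("CRITICAL", None)
    owner = owners[0]
    if not owner.sql_service_running:
        # The active node is known but its SQL service is not running.
        return ("DOWN", owner.node)
    # Passive nodes with a stopped service are treated as normal (assumption).
    return ("UP", owner.node)


# Example using the node names from this post; the states are made up.
status, active = classify_instance([
    NodeObservation("03a-d", sql_service_running=True, owns_instance=True),
    NodeObservation("03b-d", sql_service_running=False, owns_instance=False),
])
print(status, active)
```

Comparing the returned active node between polls would also give the failover notification: if it changes from 03a-d to 03b-d, a failover happened, even though the status stayed "UP" on the listener.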