cancel
Showing results for 
Search instead for 
Did you mean: 
Create Post
Level 10

Best practices for Microsoft SQL / AppInsight for SQL?

Hey everyone!

Now that our DBAs are starting to dig in and really pay attention to our alerting and monitoring in Orion, we're finding some places where I feel we haven't set up things properly. With this in mind, I wanted to reach out to the community, to see if what I've done lines up with what others have done?

I'm going to use an example of one of our DEV environments.

2 Nodes: 03a-d and 03b-d, agents installed

ahbrook_0-1591123383796.png

ahbrook_1-1591123441108.png

 

2 database instances - EMAS03Dev and IAI03Dev.

Nodes are using windows server failover cluster tech for active/passive configuration.

The database instances have unique listener IPs/VIPs, separate from the underlying nodes.

2 Listerner Nodes: We have the listener IPS configured using WMI (agentless) instrumentation. One is for EMAS03Dev and one with IAIA03Dev.  It detects the underlying node as 03a-d or 03b-d depending on which is primary.

ahbrook_2-1591123543195.png

ahbrook_3-1591123761360.png

 

On the listener Nodes, we have the AppInsight for SQL set up (we needed WMI to choose the AppInsight for SQL template).

Also on the listener Nodes, we also have the Windows Server 2012-2016 failover cluster configured for monitoring.

We do not have any application monitors configured on the individual nodes as of yet.

 

ahbrook_5-1591124939939.png

 

What we are looking for:

Our #1 priority is the health of the entire cluster. I think we've got that pretty well covered -- the AppInsight for SQL being assigned to the cluster (without using an agent) follows the primary failover server, so it looks like it is up at all times in our testing so far.

However, our DBAs also want to be informed if any issues arise with the underlying cluster health. Specifically they want to know if an individual SQL instance goes down, or if the failover cluster can't form a quorum or has errors.

Ideally I'd like to set up alerts that send off updates on the underlying cluster health, because the DBAs don't want to live in Orion to know if something has happened.

What we are seeing:

As mentioned, the AppInsight for SQL seems fine. However, there are 2 lacking things:

1. I am not sure how to monitor the individual nodes. I would think we could turn on the AppInsight at the nodes level, but then one might show as unknown or offline when it is in a passive configuration.

2. It seems like the Failover Cluster application is only really tracking things when the cluster listener is pointed at one server, and not the other. In the below example, we know that the primary/secondary failover has occurred at specific times, and I think the unknown matches up to that.

ahbrook_4-1591124196967.png

Suggestions?

Honestly, I am mostly just looking for how to set this up. I thought putting everything on the cluster node was working, but now that we see the Failover Cluster data is coming through as "unknown" and we aren't really getting reports of state changes, I'm thinking we need to have the failover cluster setup for each OS node (and maybe even the cluster listener node too). I also am not sure if I should be setting up AppInsight for SQL on both individual servers and the cluster, but I have a strong feeling that will just duplicate data for no reason and add to confusion. Ideally we'd just have small monitors that would tell us if the SQL services were running, if they were in active/passive, and if an individual instance was running.

 

Tags (3)
3 Replies

I do exactly what you describe at the end, appinsight at the listener level, lightweight service monitors at the node level. For tracking things like losing individual cluster members and such I just have some event log monitors and sql error log monitors. If I was working with clients who wanted to get fancy I'd cook up swql widgets that would look at the node I'm on, figure out which others nodes also had instances of the same database and link them all up so I could quickly see which nodes had active copies of what.
- Marc Netterfield, Github

Interesting finding. I applied the failover cluster monitor to the individual nodes, and quite a few went red. Turns out, in an active/passive failover situation, the secondary/failover node doesn't actually report its resources as online. So I've copied the primary failover cluster template and made a new one with that threshold removed. It's probably more heavy handed than I need, but it should give proper visibility.

(also I thought this would go below my existing post.)

0 Kudos

I'm glad I'm on the right path! 🙂 I just need to work with my DBAs to get those service/event monitors built up.

0 Kudos