Open for Voting

Capture and chart Always On replica synchronization "drift"

I recently ran into an issue with using a readable secondary in a SQL Server Always On availability group: the secondary can fall behind in applying log records from the primary. In normal operation that delay is usually a second or less, but under certain load conditions it can grow and cause problems for users of the secondary replica.

The current "AG Status" page shows an "Estimated Recovery Time" metric. Capturing that metric over time and charting it would provide insight into events that today are visible only if you happen to be watching the "AG Status" dashboard in real time.
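To illustrate what "capturing that metric over time" could look like, here is a minimal Python sketch. It assumes the estimate is derived the way SQL Server's own dashboard describes it, as redo queue size divided by redo rate, using the `redo_queue_size` and `redo_rate` columns of `sys.dm_hadr_database_replica_states`. The names `DriftSample`, `estimate_recovery_seconds`, and `to_samples` are hypothetical, and I don't know how the product computes its own value internally.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple

# T-SQL a collector could run on a schedule against the primary replica.
# redo_queue_size is reported in KB and redo_rate in KB/second.
REPLICA_STATE_QUERY = """
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.redo_queue_size,
       drs.redo_rate
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0;
"""


@dataclass
class DriftSample:
    """One timestamped data point for a chart (hypothetical structure)."""
    captured_at: datetime
    replica: str
    database: str
    estimated_recovery_seconds: Optional[float]


def estimate_recovery_seconds(redo_queue_kb: Optional[float],
                              redo_rate_kb_per_sec: Optional[float]) -> Optional[float]:
    """Approximate catch-up time as redo queue size divided by redo rate."""
    if redo_queue_kb is None:
        return None          # replica state not reported
    if redo_queue_kb == 0:
        return 0.0           # nothing left to redo
    if not redo_rate_kb_per_sec:
        return None          # rate unknown or zero: no meaningful estimate
    return redo_queue_kb / redo_rate_kb_per_sec


def to_samples(rows: Iterable[Tuple[str, str, Optional[float], Optional[float]]],
               captured_at: Optional[datetime] = None) -> List[DriftSample]:
    """Turn query rows (replica, database, queue KB, rate KB/s) into samples."""
    stamp = captured_at or datetime.now(timezone.utc)
    return [
        DriftSample(stamp, replica, database,
                    estimate_recovery_seconds(queue_kb, rate_kb))
        for replica, database, queue_kb, rate_kb in rows
    ]
```

Running the query every few seconds with whatever database driver you already use and appending the result of `to_samples` to a time-series store would give the history needed for a chart.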

Surfacing a chart of the sync delay on the "AG Status" tab and other tabs would make it possible to correlate activity on the primary or secondary with sync delay.

Sync delay would also be a reasonable metric to alert on. A delay that grows beyond a few seconds should raise a warning.
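As a rough illustration of the alert rule, the Python sketch below classifies a series of sync-delay readings. The threshold values (5 and 30 seconds), the requirement of three consecutive breaching samples, and the function name `classify_drift` are all assumptions chosen for the example, not anything the product defines. Requiring a sustained breach is one way to avoid alerting on a single momentary spike.

```python
from typing import Sequence


def classify_drift(delays_seconds: Sequence[float],
                   warn_after: float = 5.0,       # assumed warning threshold
                   critical_after: float = 30.0,  # assumed critical threshold
                   sustained: int = 3) -> str:
    """Return 'ok', 'warning', or 'critical' for the most recent readings.

    A level is reported only if each of the last `sustained` readings
    meets or exceeds that level's threshold.
    """
    recent = list(delays_seconds)[-sustained:]
    if len(recent) < sustained:
        return "ok"  # not enough history to judge yet
    lowest = min(recent)  # every recent reading is at least this large
    if lowest >= critical_after:
        return "critical"
    if lowest >= warn_after:
        return "warning"
    return "ok"
```

For example, readings of 6, 7, and 9 seconds would be classified as a warning, while a single 12-second spike after two sub-second readings would not.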

I've created a metric in another monitoring tool to track sync delay per replica in our AG, and it has already helped identify job clashes and other loads on the secondary that were causing drift.