Validating Network Blips Using SQL Sentry Data

A customer recently asked me how they could detect any trends in network availability, since they knew that when we lose connectivity to a server, we have a client alert that indicates that - at least as far as we're concerned - the server is offline. They were trying to correlate these network blips with real service outages to show their networking team any patterns that might exist to help troubleshoot the issue, since they knew that multiple alerts had fired without any evidence of the SQL Server services being unavailable to users or applications.

In a recent update to SQL Sentry (8.1), a new table was added: dbo.EventSourceConnectionStatusHistory. This is where the Server Health Status uptime bar in our cloud service gets its data from:

Server Health Status
Server Health Status

We can use this data to help the customer visualize their network blips. First, let's understand how many instances of downtime or connectivity loss we've observed over the past, say, 72 hours, for a named instance:

SELECT h.NormalizedStartTimeUtc
     , h.NormalizedEndTimeUtc
     , duration = DATEDIFF(SECOND, h.NormalizedStartTimeUtc, h.NormalizedEndTimeUtc)
FROM dbo.EventSourceConnectionStatusHistory AS h
INNER JOIN dbo.EventSourceConnection AS c
     ON h.EventSourceConnectionID = c.ID
WHERE c.ObjectName = N'Server\Instance'
     AND h.ServiceConnectionStatus = 2 -- offline
     AND h.ServiceConnectionType = 0 -- primary
     AND h.ServiceConnectionException LIKE N'A network-related%'
     AND h.NormalizedStartTimeUtc >= DATEADD(HOUR, - 72, SYSUTCDATETIME());

Now, there might just be a few, so that should be enough to line up with any real outages or known service restarts and spot any that are out of place. But if this returns a lot of data, it may be harder to grasp it by looking at a grid result in Management Studio. So, we can visualize it better by joining against a time sequence, and then plotting the outages in Excel or Tableau. Borrowing techniques from one of my posts on generating sets, and assuming we want a marker for every 1-minute period over the last three days:

DECLARE @linkedServer SYSNAME = N'Server\Instance'
      , @days INT             = 3
      , @today DATE           = SYSUTCDATETIME();
DECLARE @start DATETIME2      = DATEADD(DAY, - @days, @today);

;WITH t (marker)
AS (
     SELECT DATEADD(MINUTE, n - 1, @start)
     FROM (
          SELECT TOP (DATEDIFF(MINUTE, @start, @today) + 1) ROW_NUMBER() OVER (
                    ORDER BY s1.Number
                    )
          FROM master.dbo.spt_values AS s1
          CROSS JOIN master.dbo.spt_values AS s2
          ) AS x(n)
     )
SELECT t.marker
     , outage = CASE 
          WHEN d.start IS NOT NULL
               THEN 1
          ELSE 0
          END
FROM t
LEFT JOIN (
     SELECT [start] = h.NormalizedStartTimeUtc
          , finish = COALESCE(h.NormalizedEndTimeUtc, SYSUTCDATETIME())
     FROM dbo.EventSourceConnectionStatusHistory AS h
     INNER JOIN dbo.EventSourceConnection AS c
          ON h.EventSourceConnectionID = c.ID
     WHERE h.ServiceConnectionStatus = 2
          AND h.ServiceConnectionType = 0
          AND h.ServiceConnectionException LIKE N'A network-related%'
          AND c.ObjectName = N'Server\Instance'
          AND h.NormalizedStartTimeUtc >= @start
     ) AS d
     ON t.marker >= d.start
          AND t.marker < d.finish
ORDER BY t.marker;

We could also dump the data from the target instance's error logs - at least through the last six service starts, or more if you've customized the number of error logs you keep - and determine if any of these blips roughly correlate to an actual downtime. Assuming that the server where the SQL Sentry database is located has a linked server to the monitored instance, and that it is configured to use the same name as the ObjectName we store, you could do this to see the downtime caused by up to the last 7 starts/restarts (hopefully there aren't that many in recent history):

DECLARE @linkedServer SYSNAME = N'Server\Instance'
      , @days INT             = 2
      , @today DATE           = SYSUTCDATETIME();
DECLARE @start DATE           = DATEADD(DAY, - @days, @today)
      , @logNumber INT        = 0
      , @delta INT
      , @rpc NVARCHAR(MAX)    = QUOTENAME(@linkedServer) + N'.master.sys.sp_executesql';

EXEC @rpc N'SELECT @delta = DATEDIFF(MINUTE, SYSDATETIME(), SYSUTCDATETIME());'
     , N'@delta INT OUTPUT'
     , @delta OUTPUT;

SET @rpc = @linkedServer + N'.master.sys.xp_readerrorlog';

CREATE TABLE #errlog (
     LogDate DATETIME
     , ProcessInfo SYSNAME
     , [Text] NVARCHAR(MAX)
     );

WHILE @logNumber <= 6
BEGIN
     INSERT #errlog (
          LogDate
          , ProcessInfo
          , [Text]
          )
     EXEC @rpc @LogNumber = @logNumber
          , @LogType = 1
          , @SearchTerm1 = N'SQL Server is now ready for client connections.'
          , @SearchTerm2 = NULL
          , @StartDate = @start;

     INSERT #errlog (
          LogDate
          , ProcessInfo
          , [Text]
          )
     EXEC @rpc @LogNumber = @logNumber
          , @LogType = 1
          , @SearchTerm1 = N'SQL Server is terminating'
          , @SearchTerm2 = NULL
          , @StartDate = @start;

     SET @logNumber += 1;
END

;WITH d
AS (
     SELECT LogDate = DATEADD(MINUTE, @delta, LogDate)
          , s = CASE 
               WHEN [Text] LIKE N'%terminating%'
                    THEN 'd'
               ELSE 'u'
               END
          , rn = ROW_NUMBER() OVER (
               PARTITION BY LEFT([Text], 17) ORDER BY LogDate
               )
     FROM #errlog
     )
SELECT went_down = log1.LogDate
     , back_up = log2.LogDate
FROM d AS log1
INNER JOIN d AS log2
     ON log1.rn = log2.rn - 1
          AND log1.s = 'd'
          AND log2.s = 'u'
ORDER BY log1.LogDate;

DROP TABLE #errlog;

Now, we can get a little more complex and chart these two things together with a single script and see if all of our network blips are "explained" by actual service restarts.

DECLARE @linkedServer SYSNAME = N'Server\Instance'
      , @days INT             = 3
      , @logNumber INT        = 0
      , @today DATE           = DATEADD(DAY, - 11, CONVERT(SMALLDATETIME, SYSUTCDATETIME()));
DECLARE @start DATE           = DATEADD(DAY, - @days, @today)
      , @delta INT
      , @cmd NVARCHAR(MAX)    = QUOTENAME(@linkedServer) + N'.master.sys.sp_executesql';

EXEC @cmd N'SELECT @delta = DATEDIFF(MINUTE, SYSDATETIME(), SYSUTCDATETIME());'
     , N'@delta INT OUTPUT'
     , @delta OUTPUT;

SET @cmd = @linkedServer + N'.master.sys.xp_readerrorlog';

CREATE TABLE #errlog (
     LogDate DATETIME
     , ProcessInfo SYSNAME
     , [Text] NVARCHAR(MAX)
     );

WHILE @logNumber <= 6
BEGIN
     INSERT #errlog (
          LogDate
          , ProcessInfo
          , [Text]
          )
     EXEC @cmd @LogNumber = @logNumber
          , @LogType = 1
          , @SearchTerm1 = N'SQL Server is now ready for client connections.'
          , @SearchTerm2 = NULL
          , @StartDate = @start;

     INSERT #errlog (
          LogDate
          , ProcessInfo
          , [Text]
          )
     EXEC @cmd @LogNumber = @logNumber
          , @LogType = 1
          , @SearchTerm1 = N'SQL Server is terminating'
          , @SearchTerm2 = NULL
          , @StartDate = @start;

     SET @logNumber += 1;
END

;WITH d
AS (
     SELECT LogDate = DATEADD(MINUTE, @delta, LogDate)
          , s = CASE 
               WHEN [Text] LIKE N'%terminating%'
                    THEN 'd'
               ELSE 'u'
               END
          , rn = ROW_NUMBER() OVER (
               PARTITION BY LEFT([Text], 17) ORDER BY LogDate
               )
     FROM #errlog
     )
     , ExplainedOutages (
     went_down
     , back_up
     )
AS (
     SELECT went_down = log1.LogDate
          , back_up = log2.LogDate
     FROM d AS log1
     INNER JOIN d AS log2
          ON log1.rn = log2.rn - 1
               AND log1.s = 'd'
               AND log2.s = 'u'
     )
     , t (marker)
AS (
     SELECT DATEADD(MINUTE, n - 1, CONVERT(DATETIME, @start))
     FROM (
          SELECT TOP (DATEDIFF(MINUTE, @start, @today) + 1) ROW_NUMBER() OVER (
                    ORDER BY s1.Number
                    )
          FROM master.dbo.spt_values AS s1
          CROSS JOIN master.dbo.spt_values AS s2
          ) AS x(n)
     )
SELECT t.marker
     , ObservedBlip = CASE 
          WHEN d.start IS NOT NULL
               THEN 0.5
          ELSE 1
          END
     , KnownOutage = CASE 
          WHEN o.went_down IS NOT NULL
               THEN 0.5
          ELSE 0
          END
FROM t
LEFT JOIN (
     SELECT [start] = h.NormalizedStartTimeUtc
          , finish = COALESCE(h.NormalizedEndTimeUtc, SYSUTCDATETIME())
     FROM dbo.EventSourceConnectionStatusHistory AS h
     INNER JOIN dbo.EventSourceConnection AS c
          ON h.EventSourceConnectionID = c.ID
     WHERE h.ServiceConnectionStatus = 2
          AND h.ServiceConnectionType = 0
          AND h.ServiceConnectionException LIKE N'A network-related%'
          AND c.ObjectName = @linkedServer
          AND h.NormalizedStartTimeUtc >= @start
     ) AS d
     ON t.marker >= d.start
          AND t.marker <= d.finish
LEFT JOIN ExplainedOutages AS o
     ON t.marker >= o.went_down
          AND t.marker <= o.back_up
ORDER BY t.marker;

DROP TABLE #errlog;

The output of that query is largely unusable as is, since it will create 1,440 rows per day, but this is why we have software other than Management Studio for consuming results. I basically copied the output, pasted into Excel, and was able to make this simple line chart in a couple of minutes (then a couple more to annotate in Photoshop). I'm sure you could do something prettier with tools like Google Charts, Tableau, etc. But for a quick eyeball this was a pretty efficient and productive approach - it immediately shows where we have blips that can be explained away by service restarts, and a blip that cannot be explained by known service disruptions:

Correlating Observed Blips with Known Outages

You can see that at around 2 PM on the 11th and 10 AM on the 12th, we had service restarts coupled with unsuccessful attempts to reach the instance. But on the 12th at around 2 AM, we had difficulty reaching the instance, but this cannot be explained by a service outage, so the reason must lie elsewhere. This is the data you can take to your network personnel.

There's a little bit of messiness here, because I haven't dealt with Daylight Saving Time (and this is not a simple topic - see my set of tips: Part 1, Part 2, and Part 3), but you can get a clear enough picture as long as you are querying a short span of recent history and not immediately after a time change (and even in that case, you can manually adjust any anomalies by shifting them an hour). Personally, I prefer to have all of my servers in UTC to avoid precisely this kind of problem. I also didn't deal with the case where the target instance is down right now (after all, if the service is down, you can't exactly do things like query its error log).

Thwack - Symbolize TM, R, and C