SQL Query monitoring in Orion SAM – best practice

Usage

As a SolarWinds THWACK® MVP and a MS SQL DBA for many years, combining SolarWinds Orion with SQL is close to hand. Especially SolarWinds® Server & Application Monitor (SAM) and the “SQL User Experience Monitor Component” has helped me in several cases. With this component, you can query a monitored SQL Instance with almost any query you want, as long as you format the output from the query in the correct way.

The question, the query, must be formulated so that the answer, the output, is a number.

For example:

  • How many errors have there been in the last 5 minutes?
  • How many backups have failed in the last 12 hours?
  • How many databases have NOT had a valid backup in the last 12 hours?
  • How long is the queue?
  • How old is the oldest post in the queue?
  • How many new orders have we received in the last 10 min?

As you see, the questions can vary a lot. Not just looking at how many errors or failures we have, but also how long a queue is or how many new posts we have in the last x minutes. By knowing what is expected, we can alert if something is not normal, meaning something in the data flow chain is not working as expected.

Configure the SQL User Experience Component

When you add the SQL User Experience Monitor Component to a SAM Template, you are presented with the below form. To get info on all fields, please see https://documentation.solarwinds.com/en/success_center/sam/content/sam-sql-server-user-experience-monitor-sw3308.htm, but I will go through a few of them.

Description – I like to add a good description of what this component does and what to do when an alert is triggered here. I do this on all components. In 6 months, you will have forgotten, and the people receiving the alert will most likely not know what this is all about, so I send this text in the alert email. It could also be a link to a KB.

Credentials for Monitoring – An account that has read-permission on the database

Port and SQL Server Instance – You don't have to do anything if you use a SQL default instance. If you have a named Instance and SQL Browser service available, just fill in the Instance. If that does not work and the SQL Instance listens on a non-standard port, put in the Port Number by changing Port Type from “Use Default Port” to “Use Static Port” and add the port number:

Initial Catalog – What database the query should be running against. You cannot use “USE [DBName];” in the query.

Use Windows Authentication first, then SQL authentication – Check the box if the credentials you are using are a Windows/active directory account. It will still work, but you will get clutter in the log files. It’s just wrong.

Orion Requirements on the SQL Query

The output from the query must follow some rules for Orion to accept it.

  • the output must be a number in the first cell
  • the second cell can be a string
  • no lines beside the first line will be used

Like this:

With above output we give a “0” to Orion as the value we trigger alerts on and create graphs on. The Message “Optional Message” is as it says optional, but I like to use it to provide extra info on what is wrong etc.

The number value MUST be provided. This gives queries that sometimes return nothing - null, maybe if no rows are found, will not be valid and must be handled.

 

Demo SQL Table

In the examples below, I will use the Orion database for the queries.

Simple example

First example, “how many nodes are down”?

 

SELECT

   COUNT(N.Nodeid) AS [Statistic]

FROM Nodes AS N

WHERE N.[Status]=2

 

Another simple example, “how many ‘application down events’ have we had in the last hour”

SELECT

   COUNT(E.Eventid) AS [Statistic]

FROM Events AS E

WHERE

   E.EventTime>Dateadd(HOUR,-1,Getdate())  -- Events last hour

   AND E.EventType=505 – Event type Application down

 

We can also add a simple [Message] to the query to utilize that extra field we have:

SELECT

  COUNT(E.Eventid) AS [Statistic]

  ,CAST(COUNT(E.Eventid) AS varchar(2)) + ' down events' AS [Message]

FROM Events AS E

WHERE

  E.EventTime>Dateadd(HOUR,-1,Getdate())  -- Events last hour

  AND E.EventType=505 – Event type Application down

 

 

If we want to know what application that was down, can we just add the [E].[Message] column as Message? We then need to GROUP BY [E].[Message] but then we will get one row per message and above we said we can only have one row. Now it’s starting to get complicated – next chapter!

 

 

Advanced example

In my experience, there are two big difficulties when writing a bit more advanced queries for the SQL component:

  • Getting the message from several rows into one row
  • Getting a 0, zero, when there are no issues

 

To combine the message column from several rows into one single cell I use “FOR XML PATH”:

SELECT

  N.Caption + ', '

FROM Nodes AS N

WHERE N.[Status]=2

FOR XML PATH ('')

 

To combine it all in a bit more complex query but with a reusable structure, we are going to look at a common example. “How many SQL agent jobs whose name start with “xxx” has failed in the last “yy” hours”?

To get order and structure in the query I use a CTE – Common Table Expression. CTE is a way of using several “subqueries” to get out parts of the result in several steps, where subsequent steps can get data from previous steps. You will see how they are used soon or read more here for example: https://learnsql.com/blog/what-is-common-table-expression/

 

First, we take out a list of what SQL agent jobs that have failed:

-- Variable Declarations

Declare @TimeCheck AS Datetime

Declare @SearchFor AS nvarchar(20)

 

-- Initialize Variables

set @TimeCheck = CONVERT(datetime,DATEADD(HOUR,-24,GETDATE())) -- How long back to search for failed jobs

Set @SearchFor='OffLoad_' --Job Name string to search for

 

SELECT

  Count(h.instance_id) AS S1

  ,j.[name] AS [JobName]

FROM dbo.sysjobhistory AS h

INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id

WHERE

    CAST

    (   CONVERT(VARCHAR, h.run_date)

        + ' ' + STUFF(STUFF(RIGHT('000000'

        + CONVERT(VARCHAR,h.run_time),6),5,0,':'),3,0,':')

        AS DATETIME

    )> @TimeCheck AND      

       h.run_status=0 AND --0=Failed

       h.step_id=0 AND  -- 0=Whole Job outcome

       j.[name] LIKE @SearchFor + '%'

GROUP BY J.[name]

The result looks like this:

Then we want to get all the message text into one row instead of several, using XML Path again:

SELECT

  '"' + TheSource.JobName +'"' + ' failed ' + CAST(TheSource.[S1] AS varchar(2)) + ' times ' + ', ' 

FROM TheSource

  FOR XML PATH('')

“FROM TheSource”?? That is the name of the first CTE you will see when you get the whole query. But now we have the message part as one field with the help of a CTE.

Next, I will count how many times jobs have failed. Maybe an unnecessary step, but it gives an inner peace of mind with an excellent, structured order.

SELECT

  SUM(TheSource.S1)

FROM TheSource

 

Finally, we combine all the data in the answer, the output:

SELECT

  ISNULL((SELECT COUNT1 FROM Counting),0) AS Statistic

  ,TheMessage.MessageText AS Message

FROM TheMessage

The ISNULL-part is to send a 0, zero, if no issues are found. Remember, we always have to send a number. Otherwise, SAM thinks the query is broken.

 

Combining all the parts we get this query that will give you how many SQL agents named “OffLoad_*” that has failed last 24 hours:

 

-- Variable Declarations

DECLARE @Count INT

Declare @TimeCheck AS Datetime

Declare @SearchFor AS nvarchar(20)

 

-- Initialize Variables

SET @Count = 0

set @TimeCheck = CONVERT(datetime,DATEADD(HOUR,-24,GETDATE())) -- How long back to search for failed jobs

Set @SearchFor='OffLoad_' --Job Name string to search for

 

 

;WITH TheSource AS -- The source query from here ....

(

SELECT

    Count(h.instance_id) AS S1

    ,j.[name] AS [JobName]

FROM dbo.sysjobhistory AS h

INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id

WHERE

    CAST

    (   CONVERT(VARCHAR, h.run_date)

        + ' ' + STUFF(STUFF(RIGHT('000000'

        + CONVERT(VARCHAR,h.run_time),6),5,0,':'),3,0,':')

        AS DATETIME

    )> @TimeCheck    AND   

    h.run_status=0 AND --0=Failed

    h.step_id=0 AND  -- 0=Whole Job outcome

    j.[name] LIKE @SearchFor + '%'

GROUP BY J.[name]    -- ... to here

),

TheMessage (MessageText) AS

(

SELECT

    '"' + TheSource.JobName +'"' + ' failed ' + CAST(TheSource.[S1] AS varchar(2)) + ' times ' + ', '   -- Format the output message

FROM TheSource

    FOR XML PATH('')

),

Counting (count1) AS

(

SELECT

    SUM(TheSource.S1)

FROM TheSource

)

SELECT

    ISNULL((SELECT COUNT1 FROM Counting),0) AS Statistic

    ,TheMessage.MessageText AS Message

FROM TheMessage

 

 

Stripping away the specifics for this use case, we get a structure we can use in many situations. Just change “TheSource” query and format the output as you want it:

 

-- Variable Declarations

DECLARE @Count INT

-- Initialize Variables

SET @Count = 0

 

;WITH TheSource AS -- The source query from here ....

(

SELECT

   Count(t.id) AS S1

   ,t.[name] AS [M1]

FROM table as T

WHERE Something='something'

GROUP BY T.[name]    -- ... to here

),

TheMessage (MessageText) AS

(

SELECT

   '"' + TheSource.M1 +'"' + ' failed ' + ', '   -- Format the output message

FROM TheSource

   FOR XML PATH('')

),

Counting (count1) AS

(

SELECT

   SUM(TheSource.S1)

FROM TheSource

)

SELECT

   ISNULL((SELECT COUNT1 FROM Counting),0) AS Statistic

   ,TheMessage.MessageText AS Message

FROM TheMessage

 

 

I hope you enjoyed the reading and are inspired to test these methods on your own use cases. Good luck!

THWACK - Symbolize TM, R, and C