SQL Query monitoring in Orion SAM – best practice
Usage
As a SolarWinds THWACK® MVP and a MS SQL DBA for many years, combining SolarWinds Orion with SQL is close to hand. Especially SolarWinds® Server & Application Monitor (SAM) and the “SQL User Experience Monitor Component” has helped me in several cases. With this component, you can query a monitored SQL Instance with almost any query you want, as long as you format the output from the query in the correct way.
The question, the query, must be formulated so that the answer, the output, is a number.
For example:
- How many errors have there been in the last 5 minutes?
- How many backups have failed in the last 12 hours?
- How many databases have NOT had a valid backup in the last 12 hours?
- How long is the queue?
- How old is the oldest post in the queue?
- How many new orders have we received in the last 10 min?
As you see, the questions can vary a lot. Not just looking at how many errors or failures we have, but also how long a queue is or how many new posts we have in the last x minutes. By knowing what is expected, we can alert if something is not normal, meaning something in the data flow chain is not working as expected.
Configure the SQL User Experience Component
When you add the SQL User Experience Monitor Component to a SAM Template, you are presented with the below form. To get info on all fields, please see https://documentation.solarwinds.com/en/success_center/sam/content/sam-sql-server-user-experience-monitor-sw3308.htm, but I will go through a few of them.
Description – I like to add a good description of what this component does and what to do when an alert is triggered here. I do this on all components. In 6 months, you will have forgotten, and the people receiving the alert will most likely not know what this is all about, so I send this text in the alert email. It could also be a link to a KB.
Credentials for Monitoring – An account that has read-permission on the database
Port and SQL Server Instance – You don't have to do anything if you use a SQL default instance. If you have a named Instance and SQL Browser service available, just fill in the Instance. If that does not work and the SQL Instance listens on a non-standard port, put in the Port Number by changing Port Type from “Use Default Port” to “Use Static Port” and add the port number:
Initial Catalog – What database the query should be running against. You cannot use “USE [DBName];” in the query.
Use Windows Authentication first, then SQL authentication – Check the box if the credentials you are using are a Windows/active directory account. It will still work, but you will get clutter in the log files. It’s just wrong.
Orion Requirements on the SQL Query
The output from the query must follow some rules for Orion to accept it.
- the output must be a number in the first cell
- the second cell can be a string
- no lines beside the first line will be used
Like this:
With above output we give a “0” to Orion as the value we trigger alerts on and create graphs on. The Message “Optional Message” is as it says optional, but I like to use it to provide extra info on what is wrong etc.
The number value MUST be provided. This gives queries that sometimes return nothing - null, maybe if no rows are found, will not be valid and must be handled.
Demo SQL Table
In the examples below, I will use the Orion database for the queries.
Simple example
First example, “how many nodes are down”?
SELECT
COUNT(N.Nodeid) AS [Statistic]
FROM Nodes AS N
WHERE N.[Status]=2
Another simple example, “how many ‘application down events’ have we had in the last hour”
SELECT
COUNT(E.Eventid) AS [Statistic]
FROM Events AS E
WHERE
E.EventTime>Dateadd(HOUR,-1,Getdate()) -- Events last hour
AND E.EventType=505 – Event type Application down
We can also add a simple [Message] to the query to utilize that extra field we have:
SELECT
COUNT(E.Eventid) AS [Statistic]
,CAST(COUNT(E.Eventid) AS varchar(2)) + ' down events' AS [Message]
FROM Events AS E
WHERE
E.EventTime>Dateadd(HOUR,-1,Getdate()) -- Events last hour
AND E.EventType=505 – Event type Application down
If we want to know what application that was down, can we just add the [E].[Message] column as Message? We then need to GROUP BY [E].[Message] but then we will get one row per message and above we said we can only have one row. Now it’s starting to get complicated – next chapter!
Advanced example
In my experience, there are two big difficulties when writing a bit more advanced queries for the SQL component:
- Getting the message from several rows into one row
- Getting a 0, zero, when there are no issues
To combine the message column from several rows into one single cell I use “FOR XML PATH”:
SELECT
N.Caption + ', '
FROM Nodes AS N
WHERE N.[Status]=2
FOR XML PATH ('')
To combine it all in a bit more complex query but with a reusable structure, we are going to look at a common example. “How many SQL agent jobs whose name start with “xxx” has failed in the last “yy” hours”?
To get order and structure in the query I use a CTE – Common Table Expression. CTE is a way of using several “subqueries” to get out parts of the result in several steps, where subsequent steps can get data from previous steps. You will see how they are used soon or read more here for example: https://learnsql.com/blog/what-is-common-table-expression/
First, we take out a list of what SQL agent jobs that have failed:
-- Variable Declarations
Declare @TimeCheck AS Datetime
Declare @SearchFor AS nvarchar(20)
-- Initialize Variables
set @TimeCheck = CONVERT(datetime,DATEADD(HOUR,-24,GETDATE())) -- How long back to search for failed jobs
Set @SearchFor='OffLoad_' --Job Name string to search for
SELECT
Count(h.instance_id) AS S1
,j.[name] AS [JobName]
FROM dbo.sysjobhistory AS h
INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id
WHERE
CAST
( CONVERT(VARCHAR, h.run_date)
+ ' ' + STUFF(STUFF(RIGHT('000000'
+ CONVERT(VARCHAR,h.run_time),6),5,0,':'),3,0,':')
AS DATETIME
)> @TimeCheck AND
h.run_status=0 AND --0=Failed
h.step_id=0 AND -- 0=Whole Job outcome
j.[name] LIKE @SearchFor + '%'
GROUP BY J.[name]
The result looks like this:
Then we want to get all the message text into one row instead of several, using XML Path again:
SELECT
'"' + TheSource.JobName +'"' + ' failed ' + CAST(TheSource.[S1] AS varchar(2)) + ' times ' + ', '
FROM TheSource
FOR XML PATH('')
“FROM TheSource”?? That is the name of the first CTE you will see when you get the whole query. But now we have the message part as one field with the help of a CTE.
Next, I will count how many times jobs have failed. Maybe an unnecessary step, but it gives an inner peace of mind with an excellent, structured order.
SELECT
SUM(TheSource.S1)
FROM TheSource
Finally, we combine all the data in the answer, the output:
SELECT
ISNULL((SELECT COUNT1 FROM Counting),0) AS Statistic
,TheMessage.MessageText AS Message
FROM TheMessage
The ISNULL-part is to send a 0, zero, if no issues are found. Remember, we always have to send a number. Otherwise, SAM thinks the query is broken.
Combining all the parts we get this query that will give you how many SQL agents named “OffLoad_*” that has failed last 24 hours:
-- Variable Declarations
DECLARE @Count INT
Declare @TimeCheck AS Datetime
Declare @SearchFor AS nvarchar(20)
-- Initialize Variables
SET @Count = 0
set @TimeCheck = CONVERT(datetime,DATEADD(HOUR,-24,GETDATE())) -- How long back to search for failed jobs
Set @SearchFor='OffLoad_' --Job Name string to search for
;WITH TheSource AS -- The source query from here ....
(
SELECT
Count(h.instance_id) AS S1
,j.[name] AS [JobName]
FROM dbo.sysjobhistory AS h
INNER JOIN msdb.dbo.sysjobs j ON h.job_id = j.job_id
WHERE
CAST
( CONVERT(VARCHAR, h.run_date)
+ ' ' + STUFF(STUFF(RIGHT('000000'
+ CONVERT(VARCHAR,h.run_time),6),5,0,':'),3,0,':')
AS DATETIME
)> @TimeCheck AND
h.run_status=0 AND --0=Failed
h.step_id=0 AND -- 0=Whole Job outcome
j.[name] LIKE @SearchFor + '%'
GROUP BY J.[name] -- ... to here
),
TheMessage (MessageText) AS
(
SELECT
'"' + TheSource.JobName +'"' + ' failed ' + CAST(TheSource.[S1] AS varchar(2)) + ' times ' + ', ' -- Format the output message
FROM TheSource
FOR XML PATH('')
),
Counting (count1) AS
(
SELECT
SUM(TheSource.S1)
FROM TheSource
)
SELECT
ISNULL((SELECT COUNT1 FROM Counting),0) AS Statistic
,TheMessage.MessageText AS Message
FROM TheMessage
Stripping away the specifics for this use case, we get a structure we can use in many situations. Just change “TheSource” query and format the output as you want it:
-- Variable Declarations
DECLARE @Count INT
-- Initialize Variables
SET @Count = 0
;WITH TheSource AS -- The source query from here ....
(
SELECT
Count(t.id) AS S1
,t.[name] AS [M1]
FROM table as T
WHERE Something='something'
GROUP BY T.[name] -- ... to here
),
TheMessage (MessageText) AS
(
SELECT
'"' + TheSource.M1 +'"' + ' failed ' + ', ' -- Format the output message
FROM TheSource
FOR XML PATH('')
),
Counting (count1) AS
(
SELECT
SUM(TheSource.S1)
FROM TheSource
)
SELECT
ISNULL((SELECT COUNT1 FROM Counting),0) AS Statistic
,TheMessage.MessageText AS Message
FROM TheMessage
I hope you enjoyed the reading and are inspired to test these methods on your own use cases. Good luck!