cancel
Showing results for 
Search instead for 
Did you mean: 
Create Post

Node Downtime with Duration and Minimum Length Filtering

**REQUIRES ORION PLATFORM 2018.2 OR ABOVE**

I had assembled this based on a much older SQL report, and then updated it to SWQL, then added some more intelligence to it so you can filter it based on the duration of the outage, search by the device names, and it has a method of letting you know when nodes have been down so long they aged out of the events table.

Based on popular requests I figured it was time to put it out here to make it easier for the Thwackers to find and use.  This is intended to be used inside the Custom Query Resource


pastedImage_1.png

select n.caption as [Device]
-- shows the current status icon
, '/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Device]
-- makes a clickable link to the node details
, n.DetailsUrl as [_linkfor_Device]
-- shows the timestamp of the down event, if there is no timestamp then is says the event was greater than the number of days in your event retention settings
, isnull(tostring(t2.[Down Event]),concat('Greater than ',(SELECT CurrentValue FROM Orion.Settings where settingid='SWNetPerfMon-Settings-Retain Events'),' days ago')) as [Down Event]
-- shows the timestamp of the up event, unless the object is still down
, isnull(tostring(t2.[Up Event]),'Still Down') as [Up Event]
-- figures out the minutes between the down and up events, if the object is still down it counts from the down event to now, displays 99999 if we cannot accurately determine the original downtime, and
, isnull(MINUTEDIFF(t2.[Down Event], isnull(t2.[Up Event],GETDATE())),99999) as Minutes


from orion.nodes n
left join (SELECT   
-- Device nodeid used for our join  
StartTime.Nodes.NodeID    

-- Down Event time stamp in local time zone   
,ToLocal(StartTime.EventTime) AS [Down Event]   
 
-- Up Event time stamp in local time zone   
,(SELECT TOP 1   
ToLocal(EventTime) AS [EventTime]   
FROM Orion.Events AS [EndTime]   
-- picks the first up event that is newer than the down event for this node
WHERE EndTime.EventTime >= StartTime.EventTime  
-- EventType 5 is a node up
AND EndTime.EventType = 5   
AND EndTime.NetObjectID = StartTime.NetObjectID   
AND EventTime IS NOT NULL   
ORDER BY EndTime.EventTime   
) AS [Up Event]   
 
-- This is the table we are querying   
FROM Orion.Events StartTime   
 
-- EventType 1 is a node down
WHERE StartTime.EventType = 1   
   
) t2 on n.NodeID = t2.nodeid


-- this is how I catch nodes that are down but have aged out of the events table
where (n.status = 2 or t2.nodeid is not null)


-- If you want to filter the results to only show outages of a minimum duration uncomment the below line
--and MINUTEDIFF(isnull(t2.[Down Event],(GETUTCDATE()-30)), isnull(t2.[Up Event],GETUTCDATE())) >  60


-- if you want to use this query in a search box of the Custom Query resource uncomment the below line
--and n.Caption like '%${SEARCH_STRING}%'


order by t2.[down event] desc

-Marc Netterfield

    Loop1 Systems: SolarWinds Training and Professional Services

Labels (3)
Comments

WAAHOO! This is terrific! You are my new hero. However, since human's only have an 8 second attention span, you should enjoy that while it lasts.

Seriously, MANY THANKS!

D

I am getting an error when I try to import this into one of our dashboards.  Does it need to be on a specific version?

I'm running the latest release, can't see anything that jumps out to me as requiring a specific version, which one are you running?

We are on 12.1.

Do you have SWQL studio installed?  Try it there and you tend to get much more detailed error messages.  Troubleshooting it through the web UI is rarely productive.

https://github.com/solarwinds/OrionSDK/releases

Its not working on my current version i guess. Im using 11.5.2

-_-

The query im using is below:

====working but not all node down is showing==

SELECT

N.StatusLED as Status,N.Caption AS Device,MAX(E.EventTime) AS DownTime,

Cast(DateDiff(day,MAX(E.EventTime),getdate()) as varchar) + ' Day(s) ' + convert(char(8),dateadd(second,DateDiff(second,MAX(E.EventTime),getdate()),0),14) as Duration

FROM

Nodes N

INNER JOIN Events E ON E.NetworkNode = N.NodeID

where N.status = 2 and eventtype = 1

GROUP BY

     N.StatusLED,

     N.Caption

====working and all down nodes are showing, issue or requirement is the led status and the duration format i preferred is the previous query==

Select Caption, DATEDIFF(Mi, DATEADD(hh,-5, LastSystemUpTimePollUtc), GetDate()) as [Time Down In Minutes], Convert(varchar, DATEADD(hh,-5, LastSystemUpTimePollUtc),110) As [Last Date Up], Convert(varchar, DATEADD(hh,-5, LastSystemUpTimePollUtc),108) As [Last Time Up]

From Nodes

Where Status = 2

Order by [Time Down In Minutes] Desc

I happened to set this up with my client this week and it won't run for them using NPM 12.2, so its possible that this would require 12.3, I'll pin down the version and update the post.

mesverrum​,

I'm running 12.2, and the UP Time is failing. For some reason it doesn't like the "StartTime" on lines 20, 27, and 30. What info do you need to assist in helping?

D

From what I've seen it looks like this query only works in the 2018 and newer versions of the core, older versions of SWQL don't like my extra layers of nested subqueries

Agreed, that's what I'm finding. By the way, for anyone watching this post, these references are AMAZING!!!

Intro to API, SDK & SWQL

D

WOW, very useful!

I'm not really a SWQL wizard,  could this be edited to get the Neighbor(BGP) down time?

Thanks.

I believe it could as long as there is an eventid for bgp neighbors going up and down, but I don't have any neighbors on any of my lab devices so I don't have anything to work with to figure it out right now.

There's no eventid for that.   It uses event type 5000/5001 ( Alert triggered and Alert reset ).  And message contains %Neighbor% , %Down% /%UP%

I am trying to mix your above query with the one found here: Need to create Event report for Neighbor down

But still haven't gotten it to work.

Thanks.

mesverrum awesome work dude, I've been trying to work on something like this but time (and knowledge) has alluded me. I too am unable to get your query working in 12.2 I suspect due to the way the JOIN is created, it doesn't seem to like the different FROM statements, I suspect you might be able to use a UNION to get around this? I'm still new to SQL/SWQL.

For now, I've dumbed it down a little to get it to work for clients running 12.2 although I think once I upgrade to 12.3 I will be moving back to your version.

-- Nodes Down Duration

SELECT

n.Caption AS [Device]

,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Device]

,n.DetailsUrl AS [_LinkFor_Device]

,CONCAT(SUBSTRING(tostring(MAX(e.EVENTTIME)),1,4),SUBSTRING(tostring(MAX(e.EVENTTIME)),5,2),

        SUBSTRING(tostring(tolocal(MAX(e.EVENTTIME))),12,8)) as Downtime,

  CONCAT(HOURDIFF(tolocal(max(e.eventtime)),getdate())/24,' Day(s) ',

        HOURDIFF(tolocal(max(e.eventtime)),getdate())-(HOURDIFF(tolocal(max(e.eventtime)),getdate())/24)*24,'h ',

        MINUTEDIFF(tolocal(max(e.eventtime)),getdate())   -   (MINUTEDIFF(tolocal(max(e.eventtime)),getdate())/60)*60,'m') AS Duration

FROM Orion.Nodes n

INNER JOIN Orion.Events e ON n.NodeID = e.NetworkNode

WHERE STATUS = 2 and E.Eventtype=1 --and nodes.customproperties.SystemsGrouping Like '%CPE%'

GROUP BY NodeName, StatusIcon, DetailsUrl

ORDER BY MINUTEDIFF(tolocal(MAX(E.EventTime)),getdate())  desc

pastedImage_0.png

Hope it helps people until they can upgrade

I'm testing this query with a node that I manually took down 12 minutes ago, but the report shows the Duration as 312 minutes. Ideas?

If the device is still down it should be using getutcdate(), maybe try just getdate().  Are you in the utc-5 time zone?  Another factor to keep an eye would be if you have multiple pollers make sure they are all using the same timezone because of all the documented weirdness that Orion gets when your pollers are not in the same one.

Just tried it on my newly upgraded 2018.2 HF4 box, but it is showing all of my events there are 10454 entries I have copied and pasted from the original post, has anything changed since the 12.3 release?

I did fond that the events that have both a trigger and reset timestamp show the accurate time. The query is already using the getdate(), so I am not sure what changes might work. I am in the UTC-6 zone (CST).

UPDATE: i did change the date variable to "getdate()" for nodes that are "still down", and the values are now correct.

Another update. I am seeing duplicate entries for the same node down event. See image please.outage duration report.jpg

To be accurate you are seeing multiple down events correlating to a single up.  There are some glitchy scenarios in Orion where it records downs over and over despite the fact that the node never came up.  Haven't come up with a good workaround for that yet.

thanks buddy its very useful..

Can you create like the same query for down/unmanage/mute interface

I'm using this on current version with HF3... and it works.

I need to filter on node from the below mention query kindly help me in this.

SELECT

'Unmanaged' as [Status]

,n.Caption AS [Node]

 

,tostring(tolocal(n.UnManageFrom)) AS [From]

,case when n.UnManageUntil is null or n.UnManageUntil = '9999-01-01 00:00:00' then 'Not set'

else tostring(tolocal(n.UnManageUntil)) end AS [Until]

,case when n.UnManageUntil is null or n.UnManageUntil = '9999-01-01 00:00:00' then '-'

else tostring(daydiff(getutcdate(), n.unmanageuntil)) end as [Days Left]

,n.DetailsURL AS [_LinkFor_Node]

,'/Orion/images/StatusIcons/Small-' + n.StatusLED AS [_IconFor_Node]

,CASE

WHEN ae.accountID IS NULL THEN 'Audit Log Not Found'

ELSE ae.AccountID

END AS [Account]

FROM

Orion.Nodes n

JOIN (

    SELECT rec.NetObjectID, max(rec.timeloggedutc) as recent

    FROM Orion.AuditingEvents rec

    WHERE rec.auditingactiontype.actiontype = 'Orion.NodeUnmanaged'

    group BY rec.NetObjectID) mostrecent ON mostrecent.NetObjectID = n.NodeID

JOIN (

    SELECT ae.NetObjectID, ae.AccountID, ae.timeloggedutc

    FROM Orion.AuditingEvents ae

    WHERE ae.auditingactiontype.actiontype = 'Orion.NodeUnmanaged') ae ON ae.NetObjectID = n.NodeID and ae.timeloggedutc=mostrecent.recent

WHERE n.Status = 9

union all

(SELECT

'Muted' as [Status]

,n.caption

,tostring(tolocal(SuppressFrom)) as [From]

,case when SuppressUntil is null or SuppressUntil = '9999-01-01 00:00:00' then 'Not set'

else tostring(tolocal(SuppressUntil )) end AS [Until]

,case when SuppressUntil is null or SuppressUntil = '9999-01-01 00:00:00' then '-'

else tostring(daydiff(getutcdate(), asup.SuppressUntil)) end as [Days Left]

,n.DetailsURL AS [_LinkFor_Node]

,'/Orion/images/StatusIcons/Small-' + n.StatusLED AS [_IconFor_Node]

, ae.AccountID AS [Account]

FROM Orion.AlertSuppression asup

join orion.nodes n on asup.entityuri=n.uri

join (

    SELECT ae.NetObjectID, max(ae.timeloggedutc) as recent

    FROM Orion.AuditingEvents ae

    WHERE ae.auditingactiontype.actiontype in ('Orion.AlertSuppressionChanged','Orion.AlertSuppressionAdded')

    group BY ae.netobjectid) mostrecent ON mostrecent.NetObjectID = n.NodeID

join (

    SELECT ae.NetObjectID, ae.AccountID, ae.timeloggedutc

    FROM Orion.AuditingEvents ae

    WHERE ae.auditingactiontype.actiontype in ('Orion.AlertSuppressionChanged','Orion.AlertSuppressionAdded')

    Order BY ae.TimeLoggedUtc desc) ae ON ae.NetObjectID = n.NodeID and ae.timeloggedutc=mostrecent.recent

)

ORDER BY [node] asc, [status] desc

How would one limit this report to a certain time period - for example the last calendar month? Great work by the way. Thanks.

have you looked at NetObjectDowntime table?

In every client environment I've been the contents of that table were wildly inaccurate, so I had to roll my own data.

Times can be a bit annoying depending what exactly you want

For something simple like last 30 days you could try adding to the where condition near the end

and t2.[down event] > addday(-30,getutcdate())

Getting into previous calendar month is a bit more complex, this is how I'd probably go about that

and t2.[down event] > datetrunc('month',ADDMONTH(-1,getutcdate())) --subtract 1 calendar month before today then truncate the timestamp to the start of that month

and t2.[down event] < datetrunc('month',getutcdate())

Probably best to look at the following lines:

-- If you want to filter the results to only show outages of a minimum duration uncomment the below line 

--and MINUTEDIFF(isnull(t2.[Down Event],(GETUTCDATE()-30)), isnull(t2.[Up Event],GETUTCDATE())) >  60

This report sounds exactly like what I need.

Forgive me, I'm a complete SW report writing n00b but I can't get the provided code to work correctly.

When I create a report and select "Advanced Database Query (SQL, SWQL)" in the report writer, add the query and click Preview Results I receive a "Query is not valid" error.

The query posted by David Smith works fine for me.  I'm running 2018.2

Can an XML file be provide so I can just import it?

Any help would be greatly appreciated.

Thanks!

This query is not meant for the report writer, it doesn't support some of the things I have it doing. Build a report and instead of using the custom table resource look for the custom query resource and paste it into that one instead,  then it should work for you.

Hi pinch04 firstly make sure you are using the full code originally posted at the top of this post, some of the other replies contain scripts that are not working (people looking for assistance). I have tested both my own and mesverrum original script and they produce results.

Secondly, if you're seeing the "Advanced Database Query" then you are in the Dynamic builder window which is only presented in a Custom Table. This SWQL Query is designed to be used in a Custom Query. You can, of course, use a Custom Table, but that means some of the elements will not work (Such as the Link and Image etc) you would be better going to Add Content  and search for Custom Query, that will then present you an empty text window which you can copy and paste the original code in.

Custom Query!  That worked perfect now.

Thanks for the help.

Hi mesverrum ! this is very helpful and much appreciated.

Is it possible to convert or change the formatting for "Minutes" to days-hours-min, like what you tweaked for the clients running 12.2?

  1. -- figures out the minutes between the down and up events, if the object is still down it counts from the down event to now, displays 99999 if we cannot accurately determine the original downtime, and  
  2. , isnull(MINUTEDIFF(t2.[Down Event], isnull(t2.[Up Event],GETUTCDATE())),99999) as Minutes 

I'm a newbie in SolarWinds, and not really into programming stuff to be honest.

Hoping for your response or from anyone who could help!

Thanks in advanced!

Yes possible, it's kind of a pain but here is how I do it with another query I wrote

,CASE

when minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())>1440 then (tostring(round(minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())/1440.0,1)) + ' Days')

when minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())>60 then (tostring(round(minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())/60.0,1)) + ' Hours')

else (tostring(minutediff(o.AlertActive.TriggeredDateTime,GETUTCDATE())) + ' Minutes')

end as [Time Active]

Now's a good time for you to get some sql practice in to retrofit that block of code into this query

thanks for the prompt response! I'll start familiarizing myself to SQL, so I can integrate (hopefully) what you gave me to the main query you've posted.

thanks much mesverrum

I've built a custom report based on a Advanced Database Query using SWQL as the Query Type.  All works as expected except the current icon status and clickable link.  They just show text in the reports.  Any ideas?

The linkfor and iconfor syntax only works with the custom query resource, nothing else will acknowledge them.   You can embed the custom query resource in Web based reports, on the screen where it asks if you want to do a table or chart just search for query and it pops up.

Got it.  Thx, works as expected.  I did try to uncomment the custom query search code:

-- if you want to use this query in a search box of the Custom Query resource uncomment the below line

and n.Caption like '%${SEARCH_STRING}%'

but it returned nothing when run.  Any ideas?

I am trying to resolve how this query retrieves its data. I ran the query for a single device on my network and it displayed 8 records for the device. The report lists the up/down dates and times with the total minutes for each. When I query the Orion.Event table looking for same one device with the event type of 1 or 5 I get the following results. 4 of the records I could match to the report, one record on the report listed a 11/26 date while the query of the event table listed 11/27, another record on the report listed 12/6 for a date while the event query listed 12/7, the last two records where not found in the event table.

Management wants to use this report to track the equipment that vendors are responsible for so I need to be able to verify the data is correct. Here are a couple of questions.

1. Is there some other table that is being used? I have been looking to find what table "StartTime.EventType" or "EndTime.EventType" are defined in.

2. I have created individual groups that contain just the devices specific vendors are responsible for. How do I modify the query to select a group name for a specific date range and have just those devices appear in the report?

3. I see a reference to T2.xxx, am I correct in assuming that this is some type of temporary table?

Thank you for your help.

So this thing is a bit complex so here's how it works.

The initial query starts out by checking the events table for event type 1, which is a node down, and aliases this as starttime.

Then it uses a subselection that I'm calling endtime, to say "for each starttime you see, check the for the first up event (event type 5) that happens after the node went down, if there is no up event then use the timestamp for right now"

Then it tries to address the cases where a node is currently down, but has been down so long that the start time is no longer on the events table by joining to the nodes table and any node without a starttime should display a message showing that the down event happened more than x days ago (x is 30 by default but the query looks up the value for your system)

So it shouldn't have anything on it except for nodes that are currently down, or nodes that have recorded down events in the events table.

As far as the timestamps being different, the querying the events table directly shows up in UTC, i'm using the tolocal() function to convert them to server time, I'm fairly certain that will explain the issue for your 2 events.

T2 is just the alias I made for the whole down and up events table mashup, so yes it is a temporary table constructed from the events table.

The most common issue I have run into with this report and variations on it is that sometimes there are bugs in orion where it will add some garbage to the events table, like putting down events in over and over for a node that has not actually had any status changes.  Usually these kinds of things happen when the orion services have had to restart or have had some other problems.

As far as getting it to filter based on a group you would need to join from the nodes table (currently aliased as N) to the table that has containermembers on it, I think IIRC its orion.containermemberdefinitions where the n.nodeid is equal to whatever identified they use there, probably instanceid and entitytype like nodes,  and from there join to the containers table on containerid.

Another easy way to handle that without having to even build the join would be to build a new summary view with this and any other resources you want, then apply a view limitation to the group you need, it will automatically filter all the resources on the view based on that group.

Thank you very much for the explanation, it was very helpful. After I posted my note I realized what the ToLocal function was doing, I modified my query of the Event table and the two dates changed to what I was expecting but I still have two missing records. For my testing purposes, that only modification that I made to your query was to restrict it to a single device. I then wrote a separate query looking for the same device and looking for just event types of 1 or 5. The query that you wrote found two more records than what I am finding in the event table.

See the table below, the first three columns are from your query while the last column are from the query I ran on the events table. As you can see there is a record for 12/1 and 11/22 but I cannot find the corresponding events for those dates in the event table. Any ideas?

Here are the results.

  

From Query ExecutionFound in Event Table
Down EventUp EventMinutes
Dec 16
  2018  6:46PM
Dec 16 2018  6:58PM12Yes
Dec
   6 2018 11:18PM
Dec  6 2018 11:28PM10Yes
Dec
   1 2018  9:41PM
Dec  1 2018  9:51PM10Missing
Nov 28
  2018  3:41PM
Nov 28 2018  3:51PM10Yes
Nov 28
  2018 12:57PM
Nov 28 2018  1:09PM12Yes
Nov 26
  2018 11:50PM
Nov 26 2018 11:56PM6Yes
Nov 22
  2018  9:11PM
Nov 22 2018  9:19PM8Missing

Nov 20
  2018  4:31AM

Nov 20 2018  4:41AM10Yes

Found my missing records. After reviewing the query again I changed my event table query to search based on the NetObjectID for that device. Initially I had it display all records for that device. Looking at the results I noticed that the device name had changed. My original event table query was based on searching for the device name. There is a flaw in the device firmware where periodically the device name is lost. Once I changed my query to used the NetObjectID then I was able to match all of the records to your original query. I have been trying to verify the results for a while now and trying to figure out the Solarwinds database has been a challenge. Once again, thank you very much for your help.

I'm trying to alter this to only show where DeviceRole = Firewall.. Where would I insert this into this function?

Figured it out if anyone finds it helpful to them as well.

SELECT

N.StatusLED as Status,N.Caption AS Device,MAX(E.EventTime) AS DownTime,

Cast(DateDiff(day,MAX(E.EventTime),getdate()) as varchar) + ' Day(s) ' + convert(char(8),dateadd(second,DateDiff(second,MAX(E.EventTime),getdate()),0),14) as Duration

FROM

Nodes N

INNER JOIN Events E ON E.NetworkNode = N.NodeID

where N.status = 2 and eventtype = 1 and N.DeviceRole = 'Firewall' and N.Department = 'Office'

GROUP BY

     N.StatusLED,

     N.Caption,

     N.DeviceRole,

     N.Department

Select Caption, DATEDIFF(Mi, DATEADD(hh,-5, LastSystemUpTimePollUtc), GetDate()) as [Time Down In Minutes], Convert(varchar, DATEADD(hh,-5, LastSystemUpTimePollUtc),110) As [Last Date Up], Convert(varchar, DATEADD(hh,-5, LastSystemUpTimePollUtc),108) As [Last Time Up]

From Nodes

Where Status = 2

Order by [Time Down In Minutes] Desc

I have been able to modify the query so that I can enter a group name into the query and all of the nodes within that group are listed on the report. The one thing that I have been struggling with is how to display a subtotal of the downtime minutes for each device.

To do things like subtotals you would need to union basically two separate versions of the query together, one how I have it, and one with sum of the downtime figures and a bunch of placeholders, since a union requires the same number of columns, but on a summary scenario many of these columns wouldn't be appropriate/necessary.

I was thinking that that may be what you would suggest, I am still learning how to build reports in Solarwinds. I built this same report using Crystal Reports and was able to add subtotals and was hoping to be able to do something similar with Solarwinds. I will try your suggestion since Solarwinds will allow me to schedule when to run the report, I don't have that option with Crystal Reports.

Thank you again for your help.

Version history
Revision #:
1 of 1
Last update:
‎05-29-2018 05:42 PM
Updated by: