Node Outage Duration

I'd like to propose that a Node Outage Duration metric be calculated and retained, with the option to mask it using a Service Hours profile.

We currently use a SQL query to calculate Node Outage Durations from the elapsed time between a Node Down event and the corresponding Node Up event. This requires us to retain a huge Event Log to get 90 days of visibility.
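The pairing described above can be sketched outside the database as well. This is a minimal illustration in Python, not how Orion does it: the function name `pair_outages`, the tuple layout, and the event type codes (1 for Node Down, 5 for Node Up) are assumptions made for the example. It matches each node's earliest unmatched Node Down with the next Node Up for the same node, and leaves the up time empty for nodes that are still down.

```python
from datetime import datetime

# Assumed event type codes for this sketch: 1 = Node Down, 5 = Node Up.
NODE_DOWN, NODE_UP = 1, 5

def pair_outages(events):
    """Pair Node Down events with the next Node Up event for the same node.

    `events` is an iterable of (node_id, event_type, timestamp) tuples.
    Returns a list of (node_id, down_time, up_time, duration) tuples.
    up_time and duration are None when the node has not come back up.
    """
    open_down = {}   # node_id -> timestamp of the earliest unmatched Node Down
    outages = []
    for node_id, event_type, ts in sorted(events, key=lambda e: e[2]):
        if event_type == NODE_DOWN:
            # Ignore repeated Node Down events while an outage is already open.
            open_down.setdefault(node_id, ts)
        elif event_type == NODE_UP and node_id in open_down:
            down = open_down.pop(node_id)
            outages.append((node_id, down, ts, ts - down))
    # Nodes that are still down have no up time or duration yet.
    for node_id, down in open_down.items():
        outages.append((node_id, down, None, None))
    return outages

if __name__ == "__main__":
    sample = [
        (7, NODE_DOWN, datetime(2013, 4, 3, 5, 39, 0)),
        (7, NODE_UP, datetime(2013, 4, 3, 5, 45, 30)),
        (9, NODE_DOWN, datetime(2013, 4, 3, 5, 47, 0)),
    ]
    for row in pair_outages(sample):
        print(row)
```

A table holding just these four columns per outage is much smaller than the raw Event Log, which is the point of the request.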

Our minimum requirement would be a table containing Node ID, Node Down Event Timestamp, Node Up Event Timestamp and Outage Duration, retained for 90 days. We've tried using the Hourly/Weekly/Daily availability stats to calculate crude outage durations, but we can't mask out non-service hours without a timestamp.

Better still, if a Node (or parent Group) had a defined 'service hours' profile then different outage durations and availabilities could be calculated for SLA reporting purposes.
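To show why the timestamps matter, here is a rough Python sketch of masking an outage with a service hours profile. Everything in it is an assumption for illustration: the function name `service_hours_outage`, the default profile (08:00 to 17:00, Monday to Friday), and the use of naive local datetimes. It ignores time zones, daylight saving changes, and overnight windows where the end time is earlier than the start time.

```python
from datetime import datetime, time, timedelta

def service_hours_outage(down, up, start=time(8, 0), end=time(17, 0),
                         weekdays=(0, 1, 2, 3, 4)):
    """Return the part of the outage [down, up) that falls inside service hours.

    `weekdays` uses datetime.weekday() numbering (Monday = 0).
    The default profile is a hypothetical Monday-Friday 08:00-17:00 window.
    """
    total = timedelta(0)
    day = down.date()
    while day <= up.date():
        if day.weekday() in weekdays:
            window_start = datetime.combine(day, start)
            window_end = datetime.combine(day, end)
            # Clip the outage to this day's service window.
            overlap_start = max(down, window_start)
            overlap_end = min(up, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total

if __name__ == "__main__":
    # An outage from Tuesday 07:33 to Wednesday 05:37:39.
    down = datetime(2013, 4, 2, 7, 33, 0)
    up = datetime(2013, 4, 3, 5, 37, 39)
    print("total outage:        ", up - down)
    print("within service hours:", service_hours_outage(down, up))
```

With a Node Down and Node Up timestamp per outage, both the total and the in-service figure can be derived from the same row, which the aggregated availability stats cannot provide.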

When a Node Detail view is displayed, a current or previous Service Hours vs. Total Outage Duration could be displayed.

In anticipation.

22 Comments
pyro13g
Level 13

I would like to suggest that this also be expanded to cover duration over threshold. Example: link A to B was above the utilization threshold of 70% for DD:HH:MM. I chose to use Splunk to calculate this based on the trigger and reset events.

mitul_darji
Level 8

Can anyone please help with this query?

I have this requirement too, but I need the node outage duration for a week, i.e. 7 days.

Here is the query I am running:

SELECT DATEDIFF(d, T1.DownTime, CURRENT_TIMESTAMP) AS DaysDown,
       Nodes.Caption, Nodes.IP_Address, T1.DownTime, Nodes.Site, Nodes.Location
FROM (
    SELECT MAX(EventTime) AS DownTime, NetObjectID, NetworkNode
    FROM Events
    WHERE (EventType = 1) AND (NetObjectType = 'N')
    GROUP BY NetObjectID, NetworkNode
) AS T1
INNER JOIN Nodes ON T1.NetworkNode = Nodes.NodeID
WHERE (Nodes.Status = '2' OR Nodes.Status = '0')
ORDER BY DaysDown DESC, Nodes.Caption ASC, T1.DownTime DESC

Note: You may need to remove Nodes.Site because that is a custom property.

Here is a sample of the report:

Days Down  Node                             IP_Address  DownTime               Site          Location
28         LXK13E8C6N                                   5/13/2011 3:04:53 PM   Tower
19         xxx.xx.x.xxx                                 5/22/2011 9:11:38 AM   Tower
19         KYKEWESPR2-VIP                               5/22/2011 8:18:25 AM   COT
16         DTSFKSD01                                    5/25/2011 10:51:40 AM  33R
15         NPI43E1F2                                    5/26/2011 10:04:36 AM
10         HPD110FG0461                                 5/31/2011 4:05:08 PM
10         HPO110D90461                                 5/31/2011 4:00:56 PM
10         NPI2E4651                                    5/31/2011 4:29:02 PM   Tower         HP Color LaserJet CP2025dn
8          xxx.xxx.xx.xxx                               6/2/2011 9:50:49 PM    Harlan DVR
8          Aficio MP 8000                               6/2/2011 2:01:09 PM
8          COR/PP & CFC GARRAD CO. 50QLXX5              6/2/2011 9:50:26 PM    Danville DVR  152 PLEASANT RETREAT PLAZA LANCASTER, KY

mitul_darji
Level 8

I have found a solution for this.

Try this SQL query:

SELECT
    StartTime.EventTime,
    -- First Node Up event (EventType 5) after this Node Down event on the same node
    (SELECT TOP 1 EventTime
     FROM Events AS EndTime
     WHERE EndTime.EventTime > StartTime.EventTime
       AND EndTime.EventType = 5
       AND EndTime.NetObjectType = 'N'
       AND EndTime.NetworkNode = StartTime.NetworkNode
     ORDER BY EndTime.EventTime) AS UpEventTime,
    Nodes.Caption,
    StartTime.Message,
    -- Minutes between the Node Down event and that Node Up event (NULL while still down)
    DATEDIFF(Mi, StartTime.EventTime,
        (SELECT TOP 1 EventTime
         FROM Events AS EndTime
         WHERE EndTime.EventTime > StartTime.EventTime
           AND EndTime.EventType = 5
           AND EndTime.NetObjectType = 'N'
           AND EndTime.NetworkNode = StartTime.NetworkNode
         ORDER BY EndTime.EventTime)) AS OutageDurationInMinutes
FROM Events StartTime
INNER JOIN Nodes ON StartTime.NetworkNode = Nodes.NodeID
-- EventType 1 is the Node Down event
WHERE (StartTime.EventType = 1)
ORDER BY EventTime DESC

Here is the report I got:

Event Time          UpEventTime            Message                              OutageDurationInMinutes
03-Apr-13 05:47 AM                         172.30.20.99 has stopped responding
03-Apr-13 05:39 AM  4/3/2013 5:45:30 AM    172.30.20.20 has stopped responding  6
03-Apr-13 05:27 AM  4/3/2013 5:31:29 AM    172.30.20.20 has stopped responding  4
03-Apr-13 04:37 AM  4/3/2013 4:51:30 AM    172.30.20.20 has stopped responding  14
02-Apr-13 07:33 AM  4/3/2013 5:37:39 AM    172.30.20.99 has stopped responding  1324
29-Mar-13 11:46 PM  3/30/2013 12:17:10 AM  SWITCH_NSG has stopped responding    31
29-Mar-13 11:46 PM  3/30/2013 12:17:09 AM  NSG-Firewall has stopped responding  31
29-Mar-13 11:45 PM  3/30/2013 12:18:08 AM  propams has stopped responding       33
29-Mar-13 01:14 AM  3/29/2013 1:18:44 AM   propams has stopped responding       4

This suggestion is fantastic.  The result is exactly what my business is demanding from me, and I can find no way in the product to do it.  I can generate outage durations for all of my nodes, but then I have to comb through them to see if any of the downtime crosses business hours at our locations.  It's a massive pain.  If my DBA was looking for things to do I could probably have him write a query, but Scheduled Maintenance Windows seem like they would be a pretty obvious piece of the Orion solution.  Setting those windows on Groups of objects makes absolute sense.  Then having a canned report that showed all of the *service interrupting* outage events in a scheduled report... how hard could that be for SolarWinds to implement in their package?

What management would really like from me is a method to send a weekly, monthly, and annual report on overall downtime.  If I could do that for different groups of objects and across the different pieces of Orion (e.g., report WAN outages from NPM and Application Availability from SAM), that would make... well, it would make me look like some kind of awesome wizard.  I'm loving Orion overall, but there are some simple features (like daily maintenance windows) that are mysteriously absent.

jsbajada
Level 9

Could this be expanded to take into account the amount of Unmanaged time a node has?  This would expose any machines that have been set to Unmanaged in order to manipulate availability statistics.

moshera
Level 9

Thank you very much, this was very helpful. Could you help me add a column for the node IP, and select only the nodes of a specific group?

marc.coxall
Level 9

I like this idea, but I think it would also be useful to have an option to report against business hours only.

pserwe
Level 12

This is one of the best feature requests I've seen, and I would think would be pretty easy to get nailed down.  Supersized +1.

nickzourdos
Level 14

I agree. In order to pull business hour information I have to include a timestamp in the report, export it to Excel, and use a formula to only display outages from 8-5.

nickzourdos
Level 14

I think this would be a good addition to this feature:

[Image: pastedImage_2.png]

Being able to quickly see how long a node has been offline would be super neat! I'm sure there's a way to add this to NPM by messing with a few style sheets or resource files. Maybe I'll put some time into it and post it in the content exchange.