Manage Node's Response Time Thresholds Better By Accounting For WAN Latencies

Version 4

    < BACK TO TO THE MAGIC OF SQL SCRIPTS SERIES LIST

     

    Challenge:

     

    How do you deal with nodes response times across multiple worldwide sites?

     

    011.jpg

     

    OK, I know, I know... NPM platform does give you a way of overriding global thresholds per node through "Edit Node" properties.

     

    012.JPG

     

    However, think about this:

     

    - what if you have a lot of devices worldwide? 100s and 1000s of them?

    - how do you keep track of 100s and 1000s of overrides that you have set in the past?

    - what sort of overhead it creates to manage these overrides on a global scale?

    - what if you need to reduce/increase latency threshold for all devices at the particulate location only?

    - how do you sleep at night knowing that there might be a device (or two...) which was incorrectly overridden (not by you of course) with much higher latency and you potentially have issues without even knowing about it?

    - if for any reason you had a massive latency spike and upon the next poll it has settled - your average is still going to be skewed anyway. By default most of the alerts and resources are based on this average. So, you might be getting some false-positives.

     

    Solution:

     

    We will create a table in auxiliary database which will hold information about Latency Offsets for each site. Then we will use this table to calculate Latency for each device based on location that this device was assigned to. Keep in mind that this is exactly the same custom property which we have used in the Automatic Nodes Placement on the Worldmap solution previously. So, with just a few clicks (setting SiteID custom property) you can eventually control not only your nodes on map but also your response time worldwide - two rabbits with one shot. What is really beautiful about all this is that all your latency offset values and thresholds are stored centrally, in one table.

     

    End result:

     

    Here is the only thing which is needed to be done by Engineering Team - set Site ID custom property. That's it:

     

    SnipImage.JPG

     

    Here is what you will get:


    Main resource for your Summary NOC page:

     

    If everything is OK - it will show nothing

    017.JPG

     

    However, as long as there are some issue - they will show up here:

    014.JPG

     

    Bonus


    Few additional informational resources for the nodes's summary page so that you always have these location details on hand when you land on any node:

    016.JPG

     

    Step-By-Step:


    (1)

    Create few tables in auxiliary database

    0124.PNG

    0126.PNG

    * Please consult DBA for advise on column types and any other constraints. I am not DBA and I have done this to the best of my knowledge. This will need to be improved later. For now - it works as it is


    (2)

    Populate tables with your site details

     

    SITES TABLE

    0123.PNG

     

    LATENCY OFFSET TABLE

    0125.PNG


    * Notice that LocationID will be joining link between these two tables.


    (3)

    Here is a golden SQL for your reporting - remember to change database name

     

    SELECT
      Nodes.NodeID
    ,'Warning.gif' AS 'ICON-STAT'
    ,Nodes.Caption AS 'NODE' --/Orion/View.aspx?View=NodeDetails&NetObject=N:${NodeID}
    ,Nodes.IP_Address AS 'IP' --/Orion/View.aspx?View=NodeDetails&NetObject=N:${NodeID}
    ,CONVERT(NVARCHAR(50), r1.AvgResponseTime) + ' ms' AS 'PING'
    ,CONVERT(NVARCHAR(50), r1.PercentLoss) + ' %' AS 'LOSS'
    FROM Solarwinds.dbo.Nodes
    
    ---------------------------------------------
    
    --join global thresholds
    INNER JOIN Solarwinds.dbo.NodesThresholds nt_resp WITH (NOLOCK) ON (Nodes.NodeID = nt_resp.InstanceId AND nt_resp.Name = 'Nodes.Stats.ResponseTime')
    INNER JOIN Solarwinds.dbo.NodesThresholds nt_loss WITH (NOLOCK) ON (Nodes.NodeID = nt_loss.InstanceId AND nt_loss.Name = 'Nodes.Stats.PercentLoss')
    
    --join last recorded response/loss per node
    LEFT JOIN (
      SELECT NodeID, MAX([DateTime]) AS max_datetime
      FROM [Solarwinds].[dbo].[ResponseTime_Detail]
      GROUP BY NodeID) MaxDates ON MaxDates.NodeID = Nodes.NodeID
    LEFT JOIN [Solarwinds].[dbo].[ResponseTime_Detail] r1 ON (r1.[DateTime] = MaxDates.max_datetime AND r1.NodeID = MaxDates.NodeID)
    
    --join site and latency offset
    LEFT JOIN SolarWindsAux.dbo.Sites s WITH (NOLOCK) ON s.SiteID = Nodes.n_site_id
    LEFT JOIN SolarWindsAux.dbo.LatencyOffset lo WITH (NOLOCK) ON (Nodes.EngineID = lo.EngineID AND s.LocationID = lo.LocationID)
    
    ---------------------------------------------
    
    WHERE
    
      Nodes.UnManaged = 0 AND --node is not unmanaged
      Nodes.Status <> '2' AND --node is not down
      (
        (
          nt_loss.CurrentValue > nt_loss.Level1Value AND --node has high AVG packet loss value
          ISNULL(r1.PercentLoss,0) > nt_loss.Level1Value --node has high LAST packet loss value
        ) OR
        (
          nt_resp.CurrentValue - ISNULL(lo.LatencyOffset,0) > nt_resp.Level1Value AND --node has high AVG responce time (minus Latency offset, based on location, if specified)
          ISNULL(r1.AvgResponseTime,0) - ISNULL(lo.LatencyOffset,0) > nt_resp.Level1Value --node has high LAST responce time (minus Latency offset, based on location, if specified)
        )
      )
    
    
    
    
    

     

    * Notice that we not only check the average here but also check the last recorded value and only trigger an alert if last value exceed threshold as well

     

    (4)

    Download the above SQL as a ready-made report and import into Orion Report-Writer.


    Add this report as an on page resource on your NOC page

    report.JPG


    (5)

    Bonus. Now, let's create few more reports for the node summary page:


    SQL source for Location

     

    SELECT
      Sites.SiteID AS 'SITE'
      ,Sites.Address AS 'ADDRESS'
      ,Sites.Description AS 'DESCRIPTION'
    FROM SolarwindsAux.dbo.Sites
    
    INNER JOIN Solarwinds.dbo.Nodes n ON n.n_site_id = Sites.SiteID
    
    WHERE n.Caption = '${Caption}'
    
    
    
    
    

     

    SQL source for Latency Offset


    SELECT
      Nodes.EngineID AS 'Poller ID'
      ,e.ServerName AS 'Poller Name'
      ,s.LocationID AS 'Node Location'
      ,lo.LatencyOffset AS 'Latency Offset'
    FROM Solarwinds.dbo.Nodes
    
    LEFT JOIN SolarWindsAux.dbo.Sites s ON Nodes.n_site_id = s.SiteID
    LEFT JOIN SolarWindsAux.dbo.LatencyOffset lo ON lo.LocationID = s.LocationID
    INNER JOIN Solarwinds.dbo.Engines e ON Nodes.EngineID = e.EngineID
    
    WHERE Nodes.Caption = '${Caption}'
    
    
    
    
    



    * you will find both reports attached as well.


    Same as above - just add them as a resources on your node summary page


    (*)

    That's it! Well done!

    Further Tips & Advice:


    * This solution fits very well with Automatic Nodes Placement on the Worldmap as both will use same custom property to define location - SiteID

    * You can have your locations based on countries, or you can go more tight and identify locations based on areas, towns, states, etc


    To Your Monitoring Success,

    Alex Soul

    www.pixace.com


    --