Manage Node's Response Time Thresholds Better By Accounting For WAN Latencies

Version 4





    How do you deal with nodes response times across multiple worldwide sites?




    OK, I know, I know... NPM platform does give you a way of overriding global thresholds per node through "Edit Node" properties.




    However, think about this:


    - what if you have a lot of devices worldwide? 100s and 1000s of them?

    - how do you keep track of 100s and 1000s of overrides that you have set in the past?

    - what sort of overhead it creates to manage these overrides on a global scale?

    - what if you need to reduce/increase latency threshold for all devices at the particulate location only?

    - how do you sleep at night knowing that there might be a device (or two...) which was incorrectly overridden (not by you of course) with much higher latency and you potentially have issues without even knowing about it?

    - if for any reason you had a massive latency spike and upon the next poll it has settled - your average is still going to be skewed anyway. By default most of the alerts and resources are based on this average. So, you might be getting some false-positives.




    We will create a table in auxiliary database which will hold information about Latency Offsets for each site. Then we will use this table to calculate Latency for each device based on location that this device was assigned to. Keep in mind that this is exactly the same custom property which we have used in the Automatic Nodes Placement on the Worldmap solution previously. So, with just a few clicks (setting SiteID custom property) you can eventually control not only your nodes on map but also your response time worldwide - two rabbits with one shot. What is really beautiful about all this is that all your latency offset values and thresholds are stored centrally, in one table.


    End result:


    Here is the only thing which is needed to be done by Engineering Team - set Site ID custom property. That's it:




    Here is what you will get:

    Main resource for your Summary NOC page:


    If everything is OK - it will show nothing



    However, as long as there are some issue - they will show up here:




    Few additional informational resources for the nodes's summary page so that you always have these location details on hand when you land on any node:





    Create few tables in auxiliary database



    * Please consult DBA for advise on column types and any other constraints. I am not DBA and I have done this to the best of my knowledge. This will need to be improved later. For now - it works as it is


    Populate tables with your site details







    * Notice that LocationID will be joining link between these two tables.


    Here is a golden SQL for your reporting - remember to change database name


    ,'Warning.gif' AS 'ICON-STAT'
    ,Nodes.Caption AS 'NODE' --/Orion/View.aspx?View=NodeDetails&NetObject=N:${NodeID}
    ,Nodes.IP_Address AS 'IP' --/Orion/View.aspx?View=NodeDetails&NetObject=N:${NodeID}
    ,CONVERT(NVARCHAR(50), r1.AvgResponseTime) + ' ms' AS 'PING'
    ,CONVERT(NVARCHAR(50), r1.PercentLoss) + ' %' AS 'LOSS'
    FROM Solarwinds.dbo.Nodes
    --join global thresholds
    INNER JOIN Solarwinds.dbo.NodesThresholds nt_resp WITH (NOLOCK) ON (Nodes.NodeID = nt_resp.InstanceId AND nt_resp.Name = 'Nodes.Stats.ResponseTime')
    INNER JOIN Solarwinds.dbo.NodesThresholds nt_loss WITH (NOLOCK) ON (Nodes.NodeID = nt_loss.InstanceId AND nt_loss.Name = 'Nodes.Stats.PercentLoss')
    --join last recorded response/loss per node
      SELECT NodeID, MAX([DateTime]) AS max_datetime
      FROM [Solarwinds].[dbo].[ResponseTime_Detail]
      GROUP BY NodeID) MaxDates ON MaxDates.NodeID = Nodes.NodeID
    LEFT JOIN [Solarwinds].[dbo].[ResponseTime_Detail] r1 ON (r1.[DateTime] = MaxDates.max_datetime AND r1.NodeID = MaxDates.NodeID)
    --join site and latency offset
    LEFT JOIN SolarWindsAux.dbo.Sites s WITH (NOLOCK) ON s.SiteID = Nodes.n_site_id
    LEFT JOIN SolarWindsAux.dbo.LatencyOffset lo WITH (NOLOCK) ON (Nodes.EngineID = lo.EngineID AND s.LocationID = lo.LocationID)
      Nodes.UnManaged = 0 AND --node is not unmanaged
      Nodes.Status <> '2' AND --node is not down
          nt_loss.CurrentValue > nt_loss.Level1Value AND --node has high AVG packet loss value
          ISNULL(r1.PercentLoss,0) > nt_loss.Level1Value --node has high LAST packet loss value
        ) OR
          nt_resp.CurrentValue - ISNULL(lo.LatencyOffset,0) > nt_resp.Level1Value AND --node has high AVG responce time (minus Latency offset, based on location, if specified)
          ISNULL(r1.AvgResponseTime,0) - ISNULL(lo.LatencyOffset,0) > nt_resp.Level1Value --node has high LAST responce time (minus Latency offset, based on location, if specified)


    * Notice that we not only check the average here but also check the last recorded value and only trigger an alert if last value exceed threshold as well



    Download the above SQL as a ready-made report and import into Orion Report-Writer.

    Add this report as an on page resource on your NOC page



    Bonus. Now, let's create few more reports for the node summary page:

    SQL source for Location


      Sites.SiteID AS 'SITE'
      ,Sites.Address AS 'ADDRESS'
      ,Sites.Description AS 'DESCRIPTION'
    FROM SolarwindsAux.dbo.Sites
    INNER JOIN Solarwinds.dbo.Nodes n ON n.n_site_id = Sites.SiteID
    WHERE n.Caption = '${Caption}'


    SQL source for Latency Offset

      Nodes.EngineID AS 'Poller ID'
      ,e.ServerName AS 'Poller Name'
      ,s.LocationID AS 'Node Location'
      ,lo.LatencyOffset AS 'Latency Offset'
    FROM Solarwinds.dbo.Nodes
    LEFT JOIN SolarWindsAux.dbo.Sites s ON Nodes.n_site_id = s.SiteID
    LEFT JOIN SolarWindsAux.dbo.LatencyOffset lo ON lo.LocationID = s.LocationID
    INNER JOIN Solarwinds.dbo.Engines e ON Nodes.EngineID = e.EngineID
    WHERE Nodes.Caption = '${Caption}'

    * you will find both reports attached as well.

    Same as above - just add them as a resources on your node summary page


    That's it! Well done!

    Further Tips & Advice:

    * This solution fits very well with Automatic Nodes Placement on the Worldmap as both will use same custom property to define location - SiteID

    * You can have your locations based on countries, or you can go more tight and identify locations based on areas, towns, states, etc

    To Your Monitoring Success,

    Alex Soul