Closed

Closed due to inactivity. Received 29 votes with last vote on 29 Nov 2019.

Percentage Interface Errors and Discards @ polling interval

Calculating Percentage Interface Errors and Discards @ polling interval is very critical to a business. Currently there is no way to see this on Solarwinds without having to do complex SQL reporting from the backend on the real time data.

Monitoring just a number of packets for interface errors and discards does not provide a complete view of the state of the interface and needs to be in percentage - dependant on the total number of packets being sent during the same time period. Usually we monitor a Interface discard > 1% as an actionable item when Operations start investigations - duplex mismatches, etc.

Attached video should help - Percent In/Out Discards and Errors | SevOne Network Monitoring Training - YouTube https://www.youtube.com/watch?v=mLtzlPk9QY4

  • You are correct.  Our standard is 7 minutes emoticons_wink.png  As adatole​ is know to say, salt to taste.

    Thanks for the cool query too!

  • Nice query, but just a little nitpick, the default polling interval for interfaces is 9 minutes for out of the box environments.  you can validate that and check if it has been modified for your environment with this SWQL

    SELECT SettingID, Name, Description, Units, Minimum, Maximum, CurrentValue, DefaultValue, Hint

    FROM Orion.Settings

    where settingid='SWNetPerfMon-Settings-Default Interface Stat Poll Interval'

  • What you're looking for is this line

    FROM [dbo].[InterfaceErrors] ie

    ie is just an alias for [dbo].[interfaceerrors].  Instead of typing

    SELECT

            [InterfaceErrors].[InterfaceID]

    You would just type

    SELECT

            ie.[InterfaceID]

    Shortcuts are great for writing queries, horrendous when you are trying to reverse engineer queries, especially when you are new to SQL.

  • I'm trying to create a SQL alert very similar to the one you created for interface errors, but being a noob to SQL alerts, I am having a hard time figuring out some of the syntax in this alert.

    jbiggley

    SELECT

            ie.[InterfaceID]

            ,CASE WHEN

    ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0)=0

    THEN 0

    ELSE (([In_Errors])/(ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0))) * 100

    END ReceivePercentErrors

            ,CASE WHEN

    ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0)=0

    THEN 0

    Specifically, in the above snipet, I am unable to figure out the "ie." and "it." in "ie.[In_Errors]" and "it.[IN_TotalPkts]". What does the ie. and it. mean/do?

  • I did!  It is a custom SQL alert for interfaces that will look something like this.  If you don't intend on filtering your nodes via custom properties on this alert then you can remove the JOIN NodesCustomProperties line as well as the ncp.OwnerGroup = 'NETWORK' line.

    Also note that the DATEADD(ss,-420,GETDATE()) matches our 7 minute polling interval for interfaces. You should change the time to match your interface polling interval.

    SELECT Interfaces.FullName, Interfaces.InterfaceID FROM Interfaces

    WITH (NOLOCK)

    JOIN

    (

    SELECT

            ie.[InterfaceID]

            ,CASE WHEN

    ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0)=0

    THEN 0

    ELSE (([In_Errors])/(ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0))) * 100

    END ReceivePercentErrors

            ,CASE WHEN

    ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0)=0

    THEN 0

    ELSE (([Out_Errors])/(ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0))) * 100

    END TransmitPercentErrors

            FROM [dbo].[InterfaceErrors_Detail] ie WITH (NOLOCK)

            LEFT JOIN [dbo].[InterfaceTraffic_Detail] it WITH (NOLOCK) ON ie.InterfaceID=it.InterfaceID AND ie.[DateTime]=it.[DateTime]

    WHERE

    (

    In_Errors > 0

    OR Out_Errors > 0

    )

    -- Events from the Last 7 Minutes to match interface polling interval

    AND ie.DateTime >= DATEADD(SS, -420, GETDATE())

    ) AS T1 ON T1.InterfaceID = interfaces.InterfaceID

    --Not required unless filtering on a node custom property

    JOIN NodesData nd WITH (NOLOCK) ON interfaces.NodeID = nd.NodeID

    JOIN NodesCustomProperties ncp WITH (NOLOCK) ON nd.nodeid = ncp.nodeid

    WHERE

    ([T1].[TransmitPercentErrors] >= 1

    OR [T1].[ReceivePercentErrors] >= 1)

    AND ncp.OwnerGroup = 'NETWORK'

    AND (FullName NOT LIKE '%UPS%')

  • Hi Joshua,

    Thanks for pointing out those PercentErrors fields in SWQL. I had seen them in the perf stack and was wondering where the data was being pulled from.

    I was looking to create an alert using the stats but I am running into a few issues when running queries against those fields. In SWQL Studio any time I try and include the PercentErrors field in the where statement it times out. If I use the standard 'Error' field there are no problems.

    I'm thinking it is potentially because those fields have 13 decimal points and therefore the search is struggling and timing out?

    Did you ever get an alert for those values?

    Tony

  • This reminds me I need to check the sql query for the top 10 errors and discards via hubble.

  • The data actually exists but it is hidden in SWQL.  If you install and run the SWQL Studio you will find a table called Orion.NPM.InterfaceErrors that contains PercentDiscards, PercentErrors, ReceivePercentErrors and TransmitPercentErrors.  Both queries are checking the data at the collection interval in your environment, a default of 7 minutes.

    SELECT TOP 10 [NodeID], [InterfaceID], [PercentDiscards], [PercentErrors], [ReceivePercentErrors], [TransmitPercentErrors], [DATETIME]

    FROM [Orion.NPM].[InterfaceErrors]

    WHERE (([PercentDiscards] <> 0) OR ([PercentErrors] <> 0))

    Like every good SWQL query there is a SQL query behind it.  (Because SWQL is an obfuscation layer for SQL, amirite?!?)  The SQL for this query looks like the following although I expanded it a little and added in the Node caption and interface caption.  Hope it helps!  designerfx is right, we are working on some SQL logic to alert on this data as it takes a long time to run in a large environment the way it is written.  We're testing the logic now and will post it when it is ready.

    SELECT

    TOP (10)

    --[T1].[NodeID]

    nd.caption AS 'Node Name'

    --, [T1].[InterfaceID]

    ,i.caption AS 'Interface Name'

    , [T1].[PercentDiscards]

    , [T1].[PercentErrors]

    , [T1].[ReceivePercentErrors]

    , [T1].[TransmitPercentErrors]

    , DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T1].[DateTime])

    FROM

    (

    SELECT ie.[NodeID]

            ,ie.[InterfaceID]

            ,ie.[DateTime]

            ,ie.[In_Discards]

            ,ie.[In_Errors]

            ,ie.[Out_Discards]

            ,ie.[Out_Errors]

            ,ie.[Archive]

            ,CASE WHEN

      ie.[In_Errors]+ie.[Out_Errors]+ie.[In_Discards]+ie.[Out_Discards]+ISNULL(it.[In_TotalPkts],0)+ISNULL(it.[Out_TotalPkts],0)=0

      THEN 0

      ELSE (([In_Discards]+[Out_Discards])/(ie.[In_Errors]+ie.[Out_Errors]+ie.[In_Discards]+ie.[Out_Discards]+ISNULL(it.[In_TotalPkts],0)+ISNULL(it.[Out_TotalPkts],0))) * 100

      END PercentDiscards

            ,CASE WHEN

      ie.[In_Errors]+ie.[Out_Errors]+ie.[In_Discards]+ie.[Out_Discards]+ISNULL(it.[In_TotalPkts],0)+ISNULL(it.[Out_TotalPkts],0)=0

      THEN 0

      ELSE (([In_Errors]+[Out_Errors])/(ie.[In_Errors]+ie.[Out_Errors]+ie.[In_Discards]+ie.[Out_Discards]+ISNULL(it.[In_TotalPkts],0)+ISNULL(it.[Out_TotalPkts],0))) * 100

      END PercentErrors

            ,CASE WHEN

      ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0)=0

      THEN 0

      ELSE (([In_Errors])/(ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0))) * 100

      END ReceivePercentErrors

            ,CASE WHEN

      ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0)=0

      THEN 0

      ELSE (([Out_Errors])/(ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0))) * 100

      END TransmitPercentErrors

            FROM [dbo].[InterfaceErrors] ie

            LEFT JOIN [dbo].[InterfaceTraffic] it ON ie.InterfaceID=it.InterfaceID AND ie.[DateTime]=it.[DateTime]

      WHERE

      DATEDIFF(ss,ie.DateTime,GETUTCDATE())<=420 -- Events from the Last 7 Minutes to match interface polling interval

    ) AS T1

    JOIN NodesData nd WITH (NOLOCK) ON T1.NodeID = nd.NodeID

    JOIN Interfaces i WITH (NOLOCK) ON T1.InterfaceID = i.InterfaceID

    WHERE

    [T1].[PercentDiscards] <> 0

    OR [T1].[PercentErrors] <> 0

  • I'm aware of the fact that the lowest granularity possibly via reports is an hour, and that the values are an aggregate for the whole hour. So:

    a) Not enough when what you need is real time monitoring - in this case it is not aligned to the polling interval

    b) Granularity is lost, and the data is massaged - say you have an intermittent failure for 30 minutes and it might still not be picked up because of the 30 minutes being spread across 2 "hourly aggregate windows", so 15 minutes in one and 15 in another - resultant %ages not enough to get picked up in the hourly calculations.

    When you have SLA's which require measuring and reporting on critical network components, single points of failures, high availability (network performance for resilient links) you need to compare data at polling intervals for the resilient pair - there is no way around.

  • I used this for reference on interface error rate, how to calculate it, what is acceptable including the reference to the 802.3ab standard.  This fits almost exactly to what our network team was asking for as far as interface error alerting.  It does involve calculating errors as a percentage of total traffic at that time.  Interface type (1G, 10G, etc) and received traffic on that link will have an effect on the error rate.

    Understanding Interface Errors and TCP Performance - NetCraftsmen