Percentage Interface Errors and Discards @ polling interval

Calculating Percentage Interface Errors and Discards @ polling interval is very critical to a business. Currently there is no way to see this on Solarwinds without having to do complex SQL reporting from the backend on the real time data.

Monitoring just a number of packets for interface errors and discards does not provide a complete view of the state of the interface and needs to be in percentage - dependant on the total number of packets being sent during the same time period. Usually we monitor a Interface discard > 1% as an actionable item when Operations start investigations - duplex mismatches, etc.

Attached video should help - Percent In/Out Discards and Errors | SevOne Network Monitoring Training - YouTube https://www.youtube.com/watch?v=mLtzlPk9QY4

kaushal.maru

Top Comments

jbiggley over 6 years ago in reply to mesverrum

You are correct. Our standard is 7 minutes As adatole is know to say, salt to taste.
Thanks for the cool query too!
- Cancel
- Vote Up +1 Vote Down
- More
- Cancel
mesverrum over 6 years ago in reply to jbiggley

Nice query, but just a little nitpick, the default polling interval for interfaces is 9 minutes for out of the box environments. you can validate that and check if it has been modified for your environment with this SWQL
SELECT SettingID, Name, Description, Units, Minimum, Maximum, CurrentValue, DefaultValue, Hint
FROM Orion.Settings
where settingid='SWNetPerfMon-Settings-Default Interface Stat Poll Interval'
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
jbiggley over 6 years ago in reply to bradical987

What you're looking for is this line
FROM [dbo].[InterfaceErrors] ie
ie is just an alias for [dbo].[interfaceerrors]. Instead of typing
SELECT
[InterfaceErrors].[InterfaceID]
You would just type
SELECT
ie.[InterfaceID]
Shortcuts are great for writing queries, horrendous when you are trying to reverse engineer queries, especially when you are new to SQL.
- Cancel
- Vote Up +1 Vote Down
- More
- Cancel
bradical987 over 6 years ago in reply to jbiggley

I'm trying to create a SQL alert very similar to the one you created for interface errors, but being a noob to SQL alerts, I am having a hard time figuring out some of the syntax in this alert.
jbiggley
SELECT
        ie.[InterfaceID]
        ,CASE WHEN
ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0)=0
THEN 0
ELSE (([In_Errors])/(ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0))) * 100
END ReceivePercentErrors
        ,CASE WHEN
ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0)=0
THEN 0
Specifically, in the above snipet, I am unable to figure out the "ie." and "it." in "ie.[In_Errors]" and "it.[IN_TotalPkts]". What does the ie. and it. mean/do?
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
jbiggley over 6 years ago in reply to tonyreel

I did! It is a custom SQL alert for interfaces that will look something like this. If you don't intend on filtering your nodes via custom properties on this alert then you can remove the JOIN NodesCustomProperties line as well as the ncp.OwnerGroup = 'NETWORK' line.
Also note that the DATEADD(ss,-420,GETDATE()) matches our 7 minute polling interval for interfaces. You should change the time to match your interface polling interval.
SELECT Interfaces.FullName, Interfaces.InterfaceID FROM Interfaces
WITH (NOLOCK)
JOIN
(
SELECT
        ie.[InterfaceID]
        ,CASE WHEN
ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0)=0
THEN 0
ELSE (([In_Errors])/(ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0))) * 100
END ReceivePercentErrors
        ,CASE WHEN
ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0)=0
THEN 0
ELSE (([Out_Errors])/(ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0))) * 100
END TransmitPercentErrors
        FROM [dbo].[InterfaceErrors_Detail] ie WITH (NOLOCK)
        LEFT JOIN [dbo].[InterfaceTraffic_Detail] it WITH (NOLOCK) ON ie.InterfaceID=it.InterfaceID AND ie.[DateTime]=it.[DateTime]
WHERE
(
In_Errors > 0
OR Out_Errors > 0
)
-- Events from the Last 7 Minutes to match interface polling interval
AND ie.DateTime >= DATEADD(SS, -420, GETDATE())
) AS T1 ON T1.InterfaceID = interfaces.InterfaceID
--Not required unless filtering on a node custom property
JOIN NodesData nd WITH (NOLOCK) ON interfaces.NodeID = nd.NodeID
JOIN NodesCustomProperties ncp WITH (NOLOCK) ON nd.nodeid = ncp.nodeid
WHERE
([T1].[TransmitPercentErrors] >= 1
OR [T1].[ReceivePercentErrors] >= 1)
AND ncp.OwnerGroup = 'NETWORK'
AND (FullName NOT LIKE '%UPS%')
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
tonyreel over 6 years ago in reply to jbiggley

Hi Joshua,
Thanks for pointing out those PercentErrors fields in SWQL. I had seen them in the perf stack and was wondering where the data was being pulled from.
I was looking to create an alert using the stats but I am running into a few issues when running queries against those fields. In SWQL Studio any time I try and include the PercentErrors field in the where statement it times out. If I use the standard 'Error' field there are no problems.
I'm thinking it is potentially because those fields have 13 decimal points and therefore the search is struggling and timing out?
Did you ever get an alert for those values?
Tony
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
designerfx over 7 years ago in reply to jbiggley

This reminds me I need to check the sql query for the top 10 errors and discards via hubble.
- Cancel
- Vote Up +1 Vote Down
- More
- Cancel
jbiggley over 7 years ago in reply to kaushal.maru

The data actually exists but it is hidden in SWQL. If you install and run the SWQL Studio you will find a table called Orion.NPM.InterfaceErrors that contains PercentDiscards, PercentErrors, ReceivePercentErrors and TransmitPercentErrors. Both queries are checking the data at the collection interval in your environment, a default of 7 minutes.
SELECT TOP 10 [NodeID], [InterfaceID], [PercentDiscards], [PercentErrors], [ReceivePercentErrors], [TransmitPercentErrors], [DATETIME]
FROM [Orion.NPM].[InterfaceErrors]
WHERE (([PercentDiscards] <> 0) OR ([PercentErrors] <> 0))
Like every good SWQL query there is a SQL query behind it. (Because SWQL is an obfuscation layer for SQL, amirite?!?) The SQL for this query looks like the following although I expanded it a little and added in the Node caption and interface caption. Hope it helps! designerfx is right, we are working on some SQL logic to alert on this data as it takes a long time to run in a large environment the way it is written. We're testing the logic now and will post it when it is ready.
SELECT
TOP (10)
--[T1].[NodeID]
nd.caption AS 'Node Name'
--, [T1].[InterfaceID]
,i.caption AS 'Interface Name'
, [T1].[PercentDiscards]
, [T1].[PercentErrors]
, [T1].[ReceivePercentErrors]
, [T1].[TransmitPercentErrors]
, DateAdd(second,DateDiff(second,GetDate(),GetUtcDate()),[T1].[DateTime])
FROM
(
SELECT ie.[NodeID]
        ,ie.[InterfaceID]
        ,ie.[DateTime]
        ,ie.[In_Discards]
        ,ie.[In_Errors]
        ,ie.[Out_Discards]
        ,ie.[Out_Errors]
        ,ie.[Archive]
        ,CASE WHEN
ie.[In_Errors]+ie.[Out_Errors]+ie.[In_Discards]+ie.[Out_Discards]+ISNULL(it.[In_TotalPkts],0)+ISNULL(it.[Out_TotalPkts],0)=0
THEN 0
ELSE (([In_Discards]+[Out_Discards])/(ie.[In_Errors]+ie.[Out_Errors]+ie.[In_Discards]+ie.[Out_Discards]+ISNULL(it.[In_TotalPkts],0)+ISNULL(it.[Out_TotalPkts],0))) * 100
END PercentDiscards
        ,CASE WHEN
ie.[In_Errors]+ie.[Out_Errors]+ie.[In_Discards]+ie.[Out_Discards]+ISNULL(it.[In_TotalPkts],0)+ISNULL(it.[Out_TotalPkts],0)=0
THEN 0
ELSE (([In_Errors]+[Out_Errors])/(ie.[In_Errors]+ie.[Out_Errors]+ie.[In_Discards]+ie.[Out_Discards]+ISNULL(it.[In_TotalPkts],0)+ISNULL(it.[Out_TotalPkts],0))) * 100
END PercentErrors
        ,CASE WHEN
ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0)=0
THEN 0
ELSE (([In_Errors])/(ie.[In_Errors]+ISNULL(it.[In_TotalPkts],0))) * 100
END ReceivePercentErrors
        ,CASE WHEN
ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0)=0
THEN 0
ELSE (([Out_Errors])/(ie.[Out_Errors]+ISNULL(it.[Out_TotalPkts],0))) * 100
END TransmitPercentErrors
        FROM [dbo].[InterfaceErrors] ie
        LEFT JOIN [dbo].[InterfaceTraffic] it ON ie.InterfaceID=it.InterfaceID AND ie.[DateTime]=it.[DateTime]
WHERE
DATEDIFF(ss,ie.DateTime,GETUTCDATE())<=420 -- Events from the Last 7 Minutes to match interface polling interval
) AS T1
JOIN NodesData nd WITH (NOLOCK) ON T1.NodeID = nd.NodeID
JOIN Interfaces i WITH (NOLOCK) ON T1.InterfaceID = i.InterfaceID
WHERE
[T1].[PercentDiscards] <> 0
OR [T1].[PercentErrors] <> 0
- Cancel
- Vote Up +1 Vote Down
- More
- Cancel
kaushal.maru over 7 years ago in reply to designerfx

I'm aware of the fact that the lowest granularity possibly via reports is an hour, and that the values are an aggregate for the whole hour. So:
a) Not enough when what you need is real time monitoring - in this case it is not aligned to the polling interval
b) Granularity is lost, and the data is massaged - say you have an intermittent failure for 30 minutes and it might still not be picked up because of the 30 minutes being spread across 2 "hourly aggregate windows", so 15 minutes in one and 15 in another - resultant %ages not enough to get picked up in the hourly calculations.
When you have SLA's which require measuring and reporting on critical network components, single points of failures, high availability (network performance for resilient links) you need to compare data at polling intervals for the resilient pair - there is no way around.
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
jhandberg over 7 years ago in reply to jhandberg

I used this for reference on interface error rate, how to calculate it, what is acceptable including the reference to the 802.3ab standard. This fits almost exactly to what our network team was asking for as far as interface error alerting. It does involve calculating errors as a percentage of total traffic at that time. Interface type (1G, 10G, etc) and received traffic on that link will have an effect on the error rate.
Understanding Interface Errors and TCP Performance - NetCraftsmen
- Cancel
- Vote Up +1 Vote Down
- More
- Cancel