I was working with a customer the other day, analyzing some of the data that Orion NPM collects from his core routers. On some of his gigabit interfaces we noticed that every few hours we got a couple of hundred discards (all at once, not spread through the hours). That prompted us to investigate the root cause and got us talking about errors and discards, and the more I thought about it, the more I figured some of this might be useful to other people.

First off, when you see errors or discards within your network management system, you need to ask yourself two questions:

a) Do you trust the NMS?

b) Are you seeing any issues on those interfaces?

I mention trusting your NMS first as I've definitely seen cases where the network management software misreported these stats. If your software is from us here at SolarWinds, then you can skip this part: in the 10+ years I've been helping to create and use these applications, I've never seen them misbehave in this particular way.

When it comes to the second question, what I mean is this: if you hadn't noticed the stats being reported in your NMS, would you have been thinking about this interface at all? If not, and the number of errors or discards is relatively low, then you might just keep an eye on it to see if it gets any worse.
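One way to put a number on "relatively low" is to express the errors or discards as a percentage of the packets seen in the same polling interval. This is only a rough sketch: the function names and the 0.1% threshold are my own illustrative assumptions, not anything Orion NPM or an RFC prescribes, so tune the threshold to your own network.

```python
# Hypothetical threshold: the percentage of bad packets above which
# an interface is worth a closer look. Purely illustrative.
WATCH_THRESHOLD_PERCENT = 0.1


def bad_packet_percent(bad_packets, total_packets):
    """Return errors or discards as a percentage of total packets."""
    if total_packets <= 0:
        return 0.0  # no traffic in the interval, nothing to compare against
    return 100.0 * bad_packets / total_packets


def triage(bad_packets, total_packets, threshold=WATCH_THRESHOLD_PERCENT):
    """Suggest whether to investigate now or just keep watching."""
    if bad_packet_percent(bad_packets, total_packets) >= threshold:
        return "investigate"
    return "keep an eye on it"


if __name__ == "__main__":
    # A couple of hundred discards against ten million packets.
    print(triage(200, 10_000_000))
```

A percentage like this hides bursts, so a few hundred discards arriving all at once can still be worth a look even when the overall rate is tiny.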

But let's assume that you've decided to go investigate these stats. One very important thing to understand is that there's a world of difference between discards and errors. Errors indicate packets that were received but couldn't be processed because there was a problem with the packet. In most cases, when you're seeing inbound errors on a router interface, the issue is upstream of that device. It could be a bad cable, a misconfiguration on one end or the other, and so on. In most cases, these issues are resolved outside of the router where you're seeing the errors. Error reporting is documented in RFC 1213 (among others, including RFC 1573), and the counters are typically pulled from the IF-MIB (ifInErrors and ifOutErrors).

With discards, the situation is almost the opposite. The packets were received with no errors but were dumped before being passed on to a higher layer protocol. A typical cause of discards is the router needing to regain some buffer space. In the case of discards, the issue is almost always with the router that's reporting the discards (not with a next hop device, bad cable, etc.). RFC 1213 also documents discard reporting, and the discard counters (ifInDiscards and ifOutDiscards) sit right beside the error counters within the IF-MIB.
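To make the errors-versus-discards distinction a bit more concrete, here's a minimal sketch of how you might turn two polls of these counters into a per-interval count and a hint about where to look. The OIDs are the standard ifTable columns; the function names, the sample values, and the wording of the hints are my own assumptions for illustration, and this isn't how any particular NMS does it. It assumes 32-bit counters that wrap, and it can't tell a wrap from a counter reset after a reboot.

```python
# Standard ifTable columns (append the ifIndex to address one interface).
IF_COUNTER_OIDS = {
    "ifInDiscards": "1.3.6.1.2.1.2.2.1.13",
    "ifInErrors": "1.3.6.1.2.1.2.2.1.14",
    "ifOutDiscards": "1.3.6.1.2.1.2.2.1.19",
    "ifOutErrors": "1.3.6.1.2.1.2.2.1.20",
}

COUNTER32_MODULUS = 2 ** 32  # these counters are 32 bits wide and wrap


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Packets counted between two polls, allowing for one counter wrap."""
    return (current - previous) % modulus


def where_to_look(in_errors_delta, in_discards_delta):
    """Map per-interval inbound deltas to the rule of thumb in the text."""
    hints = []
    if in_errors_delta > 0:
        hints.append("inbound errors: look upstream (cable, far-end config)")
    if in_discards_delta > 0:
        hints.append("inbound discards: look at this router (buffers, load)")
    return hints


if __name__ == "__main__":
    # Hypothetical samples from two consecutive polls of one interface.
    errors = counter_delta(1500, 1500)
    discards = counter_delta(4294967290, 194)  # the counter wrapped here
    print(errors, discards)
    for hint in where_to_look(errors, discards):
        print(hint)
```

Working from deltas rather than raw counter values is also what lets you spot a burst like the one I described at the top, where a couple of hundred discards land in a single polling interval.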

This blog post is getting long so I'll stop the description here, but ping me if you want to know more about this as I never really tire of talking about packets...

Flame on...