Understanding Errors and Discards

I was working with a customer the other day, and we were analyzing some of the data that Orion NPM is collecting from his core routers. On some of his gigabit interfaces we noticed that every few hours we got a couple of hundred discards (all at once, not spread through the hours). This caused us to investigate the root cause and also got us talking about errors and discards. The more I thought about it, the more I figured that some of this might be useful to other people.

First off, when you see errors or discards within your network management system, you need to ask yourself two questions:

a) Do you trust the NMS?

b) Are you seeing any issues on those interfaces?

I mention trusting your NMS first as I've definitely seen cases where the network management software misreported these stats. If your software is from us here at SolarWinds, then you can skip this part, as in the 10+ years I've been helping to create and use these applications I've never seen them misbehave in this particular way.

When it comes to the second question, what I mean is: if you hadn't noticed the stats being reported in your NMS, would you have been thinking about this interface? If not, and the number of errors or discards is relatively low, then you might just keep an eye on it to see if it gets any worse.

But let's assume that you've decided to go investigate these stats. One very important thing to understand is that there's a world of difference between discards and errors. Errors indicate packets that were received but couldn't be processed because there was a problem with the packet. In most cases, when you're seeing inbound errors on a router interface, the issue is upstream of that device. It could be a bad cable, a misconfiguration on one end or the other, and so on. These issues are usually resolved outside of the router where you're seeing the errors. Error reporting is documented in RFC 1213 (among others, including RFC 1573), and the counters are typically pulled from the IF-MIB (ifInErrors and ifOutErrors).
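One practical detail when you look at these counters yourself: ifInErrors and ifOutErrors are running totals that wrap around, so what you care about is how far they moved between two polls. Here's a minimal Python sketch of that calculation. The function name and the sample readings are made up for illustration, and it assumes a 32-bit counter that wrapped at most once between polls.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Return how far a wrapping SNMP counter advanced between two polls.

    Assumes the counter wrapped at most once and the device did not
    reboot in between; neither case can be detected from two readings.
    """
    modulus = 1 << bits
    # Modular subtraction gives the right answer across a single wrap.
    return (current - previous) % modulus


# Hypothetical ifInErrors readings from two consecutive polls.
print(counter_delta(1_200, 1_450))        # 250
print(counter_delta(4_294_967_200, 104))  # 200 (counter wrapped past 2**32)
```

If you poll a busy interface too rarely, a 32-bit counter can wrap more than once between polls and the delta will come out too small, so keep the polling interval short enough for your traffic.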

With discards, the situation is almost the opposite. The packets were received with no errors but were dumped before being passed on to a higher-layer protocol. A typical cause of discards is when the router needs to regain some buffer space. In the case of discards, the issue is almost always with the router that's reporting the discards (not with a next-hop device, bad cable, etc.). RFC 1213 also documents discard reporting, and the discard counters (ifInDiscards and ifOutDiscards) sit right beside the errors within the IF-MIB.
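To judge whether a burst of discards or errors is "relatively low", it helps to put it next to the number of packets the interface handled in the same polling interval. The sketch below is one rough way to do that in Python. The function name and sample numbers are hypothetical, and the denominator (delivered packets plus discards plus errors) is my simplifying assumption, not a formula from the RFCs.

```python
def inbound_percentages(delivered: int, discards: int, errors: int) -> dict:
    """Express per-interval discards and errors as a share of inbound packets.

    All three arguments are deltas over one polling interval, e.g. the
    change in ifInUcastPkts, ifInDiscards, and ifInErrors.
    Assumption: packets seen = delivered + discards + errors.
    """
    total = delivered + discards + errors
    if total == 0:
        # An idle interface has no meaningful rate.
        return {"discard_pct": 0.0, "error_pct": 0.0}
    return {
        "discard_pct": 100 * discards / total,
        "error_pct": 100 * errors / total,
    }


# Hypothetical interval: 9,700 packets delivered, 200 discarded, 100 errored.
print(inbound_percentages(9_700, 200, 100))
# {'discard_pct': 2.0, 'error_pct': 1.0}
```

A couple of hundred discards on a gigabit interface moving millions of packets per interval will come out as a tiny percentage, which is a useful sanity check before you go hunting for a root cause.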

This blog post is getting long so I'll stop the description here, but ping me if you want to know more about this as I never really tire of talking about packets...


Flame on...
Josh

Anonymous
  • Browsing through the old Geek Speak posts and found this.  Excellent post. It's always bothered me when folks just ignore this kind of stuff.  There is no such thing as data that has no meaning.  Do the work to investigate and understand the cause, then make a decision as far as what you're going to do about it.  I work with a guy that is constantly saying things like "such and so must have just wigged out - don't worry about it".  That doesn't cut it.  Whether you're receiving bad packets or dropping them, there is a root cause.  Find out what it is, then address it if it needs to be addressed.

  • Is it common to see errors and/or discards when you have a trunk interface not configured correctly on one side of the connection? 

  • Hi there!

    Great read on Discards!

    I'm still a little confused about whether I should be worried about my situation, so perhaps you can help.  We have two Cisco Access Points on one of our switches and about once a day, for a minute, they get up to 2.5% discard rates.  The rate seems low and short-lived, so I'm thinking it's nothing to worry about since the interfaces seem to be properly configured.  I'm also thinking that because they are APs they may be more susceptible to this sort of issue since you have many users on the same interface.  Let me know what you think.

    Thanks.

  • What would be nice is if we could understand the ratio of discards to the actual number of packets that were transmitted and then get an idea of the correlated errors, if any. For instance, what this number tells me is that if the ratio gets above a threshold of 3-5% discards, my interfaces are being worked to the max and I may need to look at adding a different resource with more capacity, re-routing uplinks differently, or redesigning my network to allow for the path of least resistance.  Is that a correct assumption?