16 Replies Latest reply on Jan 10, 2011 8:37 AM by pyro13g

    Incorrect Bandwidth reporting in NPM

      We have had issues with incorrect bandwidth utilization reports in NPM for several years now. We mainly notice it when interfaces report more bandwidth utilization than is physically possible, like 65Mbps+ on a DS3 and so forth. When we opened tickets on this issue, Solarwinds reported back that our routers were sending the false information. I never completely believed this answer, but didn't have a good way to refute it. Recently we were bringing up a new OC3, and to test it we ran WAN Killer at 75Mbps in both directions across the circuit. The traffic on the interface was a solid 75Mbps in both directions, as there was no other traffic on the link. The Solarwinds graph for this OC3 interface showed traffic rates ranging from 65Mbps to 110Mbps.

      This was obviously not a correct representation of the traffic on the link, so I set up PRTG (another SNMP tool) to poll the same router. The PRTG graph was a flat line at 75Mbps, as expected, which shows that the router is correctly reporting the bandwidth utilization on the link.

      I've had a ticket open with Solarwinds on this issue for several weeks now, but I'm still waiting on any kind of fix or explanation for the problem.

      I'm curious if any other Solarwinds NPM users have experienced this problem.

        • Re: Incorrect Bandwidth reporting in NPM
          slinky103

          I work with tflake and we also tried polling from different polling engines and from the Engineers Toolset and had the same issues.  Hoping someone else has come across these issues and resolved them.

            • Re: Incorrect Bandwidth reporting in NPM
              Questionario

              Pearson...

              *rant on*

              the company that jacked my money... providing exams and if the exam is faulty you tell them to sort it out with cisco.... :P

              *rant off*

              Sorry, but we haven't experienced this issue. I assume this is a Cisco 7200? Is this the only box you are having problems with? Which IOS version are you running?

            • Re: Incorrect Bandwidth reporting in NPM

              We've seen it too - Juniper routers - T1 interfaces reporting 400Mb/s spikes.

              Solarwinds support says that it's an "IOS bug" :)

              Any ideas on how to "safely" remove the bogus data from the database?

                • Re: Incorrect Bandwidth reporting in NPM
                  sean.martinez

                  We have seen the bandwidth counter roll over on Juniper and some other devices. Changing the Counter Rollover Method under Settings > Polling Settings and switching to 64-bit counters under Node Management usually resolves this issue.
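To make the rollover point concrete, here is a minimal sketch of how an SNMP poller might turn two octet-counter samples into a rate. It is not Orion's actual code: the function names and sample values are made up for illustration, and it assumes the counter wraps at most once between polls. It shows a genuine 32-bit wrap being handled correctly, and a counter reset being misread as a wrap, which produces exactly the kind of impossible spike discussed in this thread.

```python
# Hypothetical illustration only; these names are not part of any SolarWinds API.
COUNTER32_MOD = 2 ** 32   # ifInOctets / ifOutOctets wrap here
COUNTER64_MOD = 2 ** 64   # ifHCInOctets / ifHCOutOctets wrap here


def octet_delta(prev, curr, modulus):
    """Octets transferred between two polls, assuming at most one wrap."""
    return (curr - prev) % modulus


def rate_bps(prev, curr, elapsed_s, modulus):
    """Average bits per second between two counter samples."""
    return octet_delta(prev, curr, modulus) * 8 / elapsed_s


def seconds_to_wrap(link_bps, modulus):
    """How long a counter takes to wrap at a sustained line rate."""
    return modulus * 8 / link_bps


# A genuine wrap: the counter passes 2**32 and the delta is still correct.
real_wrap = octet_delta(2 ** 32 - 1000, 500, COUNTER32_MOD)

# A counter reset (for example after a reload) misread as a wrap:
# the "delta" is billions of octets that were never sent.
bogus_rate = rate_bps(1_000_000_000, 5_000, 60, COUNTER32_MOD)

# Time for a 32-bit octet counter to wrap on a saturated OC3 (155.52 Mbps).
oc3_wrap_s = seconds_to_wrap(155.52e6, COUNTER32_MOD)

print(real_wrap, round(bogus_rate / 1e6), round(oc3_wrap_s))
```

With these sample values the reset case works out to more than 400 Mbps over a 60-second poll, regardless of the circuit's real speed. A saturated OC3 wraps a 32-bit counter in roughly 221 seconds, so any poll interval longer than that can hide a second wrap. 64-bit counters avoid the wrap problem, but they do not help if the counter is reset.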

                    • Re: Incorrect Bandwidth reporting in NPM
                      kweise

                      Sean,

                      One of bkessler's questions was whether there is a safe way to remove the bogus data from the database.  If one has a lot of bogus data in their database, it could definitely skew utilization reporting.  I happen to know bkessler and had suggested using an update statement to replace the bogus data with a more realistic value, like the maximum transfer rate of the circuit in question.  If the counters are really rolling over, even the maximum rate of the circuit is probably wrong, but it wouldn't skew averages as much as 450 Mbps on a T1.

                      Since I'm a network engineer and not a DBA, my suggested update query was really more of a stab in the dark than a solid recommendation on how to remove bogus data from the database.  In bkessler's example, the chart shows 450 Mbps inbound on a T1.  My suggestion was to update the In_Maxbps column in the InterfaceTraffic_Detail table to the maximum rate of the circuit anytime there was a value in the column that was greater than the circuit's maximum rate.

                      UPDATE InterfaceTraffic_Detail
                           SET In_Maxbps = <maximum rate for the circuit, in bps>
                           WHERE NodeID = X AND InterfaceID = Y AND In_Maxbps > <maximum rate for the circuit, in bps>

                      Does anyone know if this would be a valid (and safe) way to remove bogus bandwidth rates from the database?

                        • Re: Incorrect Bandwidth reporting in NPM
                          tdanner

                          Yes, this should be a safe way to chop those values off at the maximum rate for the circuit. You may also want to apply the same logic to In_Averagebps and In_Minbps.

                            • Re: Incorrect Bandwidth reporting in NPM

                              I used Kevin's idea to modify the Maxbps value (this particular case is "out" not "in") and it worked fine but when I tried updating the Averagebps and Minbps values, the result was "1 record changed" but re-querying the db shows no change.

                              I'm going to keep digging but if you have any additional ideas I'd love to hear them.

                                • Re: Incorrect Bandwidth reporting in NPM

                                  Disregard the previous post.  Kevin's suggestion was accurate; it was my implementation that was in error.

                                  Thanks for the assistance!

                                    • Re: Incorrect Bandwidth reporting in NPM

                                      I decided to expand on the logic and look for all database entries for T1 interfaces with stored bandwidth values > 1.536Mb/s.

                                      I used the following query:

                                      SELECT * FROM InterfaceTraffic_Detail
                                      WHERE InterfaceID IN
                                           (SELECT InterfaceID FROM Interfaces WHERE InterfaceName IN
                                                ('t1-1/0/0', 't1-1/0/0.0', 't1-2/0/0', 't1-2/0/0.0', 't1-3/0/0', 't1-3/0/0.0'))
                                      AND (In_Minbps > 1536000 OR In_Maxbps > 1536000 OR In_Averagebps > 1536000
                                           OR Out_Minbps > 1536000 OR Out_Maxbps > 1536000 OR Out_Averagebps > 1536000)

                                      Then, I don't really want to set these bogus values to some other arbitrary and incorrect value so I figured that the best thing to do was to simply delete them from the database.

                                      To do that, I simply changed the "select *" in the above to "delete"

                                      Thanks for everyone's help!

                                      This seems to have done the trick for me but standard disclaimers apply...don't try this at home, I'm a "professional;" make sure you have a good backup; try this in a lab first; don't run with scissors, etc. etc.
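For anyone who wants to rehearse the select-then-delete approach in a lab first, here is a rough sketch using Python's built-in sqlite3 module as a stand-in for the real NPM database. The two-table schema and sample rows below are simplified assumptions, not the actual Orion schema, and `delete_over_limit` is a made-up helper. The only point is to count the matching rows and delete them inside one transaction, so a failure rolls back cleanly.

```python
import sqlite3

T1_MAX_BPS = 1_536_000
RATE_COLUMNS = ("In_Minbps", "In_Maxbps", "In_Averagebps",
                "Out_Minbps", "Out_Maxbps", "Out_Averagebps")


def delete_over_limit(conn, interface_names, limit_bps):
    """Count, then delete, rows whose stored rate exceeds limit_bps."""
    names = ",".join("?" for _ in interface_names)
    over = " OR ".join(f"{col} > ?" for col in RATE_COLUMNS)
    where = ("InterfaceID IN (SELECT InterfaceID FROM Interfaces "
             f"WHERE InterfaceName IN ({names})) AND ({over})")
    params = list(interface_names) + [limit_bps] * len(RATE_COLUMNS)
    with conn:  # commits on success, rolls back if an exception is raised
        (count,) = conn.execute(
            f"SELECT COUNT(*) FROM InterfaceTraffic_Detail WHERE {where}",
            params).fetchone()
        conn.execute(
            f"DELETE FROM InterfaceTraffic_Detail WHERE {where}", params)
    return count


# Simplified lab schema and made-up sample data (assumptions for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Interfaces (InterfaceID INTEGER, InterfaceName TEXT);
CREATE TABLE InterfaceTraffic_Detail (
    InterfaceID INTEGER,
    In_Minbps REAL, In_Maxbps REAL, In_Averagebps REAL,
    Out_Minbps REAL, Out_Maxbps REAL, Out_Averagebps REAL);
INSERT INTO Interfaces VALUES (1, 't1-1/0/0'), (2, 'ge-0/0/0');
INSERT INTO InterfaceTraffic_Detail VALUES
    (1, 900000, 450000000, 1200000, 100000, 200000, 150000),
    (1, 900000, 1000000, 950000, 100000, 200000, 150000),
    (2, 1000000, 900000000, 5000000, 100000, 200000, 150000);
""")

deleted = delete_over_limit(conn, ["t1-1/0/0"], T1_MAX_BPS)
remaining = conn.execute(
    "SELECT COUNT(*) FROM InterfaceTraffic_Detail").fetchone()[0]
print(deleted, remaining)
```

In the sample data only the 450 Mbps row on the T1 matches. The normal T1 row and the row on the non-T1 interface are left alone. Against a production database the same "count first, then delete" habit applies, along with the backup advice above.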

                              • Re: Incorrect Bandwidth reporting in NPM

                                We use "method 2"  and 64-bit counters

                              • Re: Incorrect Bandwidth reporting in NPM

                                bkessler: Solarwinds fixed the issue with v.10.1. I have seen one graph since we upgraded to 10.1 that went over the physical bandwidth of the link, but only the one graph. Prior to upgrading to 10.1 we saw this issue a lot. If you haven't upgraded try that and see if it makes any difference.

                                sean.martinez: We tried 64 bit counters and that didn't make any difference. Thanks for the idea though. For us the 10.1 upgrade is what fixed the problem.

                                  • Re: Incorrect Bandwidth reporting in NPM
                                    pyro13g

                                    Please understand that many vendors count the traffic as it's queued/buffered and not when it actually leaves the interface.  And rest assured that the polling interval is not always spot-on perfect, which is not a Solarwinds issue but just the way things are.

                                     

                                    Is your NPM polling set to 30 seconds (I don't think that's possible), and do NPM and PRTG poll the interface at exactly the same time?  Otherwise it's an apples-to-oranges comparison.
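The polling-interval point can be put into numbers. The sketch below is purely illustrative arithmetic with hypothetical function names, not anything taken from NPM or PRTG. It assumes a poller that divides the counter delta by its nominal interval rather than by the time that actually elapsed between samples, and it uses an assumed 180-second interval against the steady 75 Mbps test traffic from the original post.

```python
def reported_bps(true_bps, actual_elapsed_s, assumed_elapsed_s):
    """Rate a poller reports if it divides by the wrong elapsed time."""
    bits_counted = true_bps * actual_elapsed_s
    return bits_counted / assumed_elapsed_s


def elapsed_needed_s(true_bps, reported, assumed_elapsed_s):
    """Actual elapsed time required to produce a given reported rate."""
    return assumed_elapsed_s * reported / true_bps


# Steady 75 Mbps, poll arrives 10 s late on an assumed 180 s interval.
late_poll = reported_bps(75e6, 190, 180)

# How late would the poll have to be to show 110 Mbps?
needed = elapsed_needed_s(75e6, 110e6, 180)

print(round(late_poll / 1e6, 2), needed)
```

Under these assumptions a poll that lands 10 seconds late reads about 79 Mbps instead of 75, and the next one reads correspondingly low. Showing 110 Mbps would take 264 seconds of traffic counted as 180. Timing error can therefore explain overshoot of a few percent, while much larger spikes need a much larger timing error or some other cause.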

                                • Re: Incorrect Bandwidth reporting in NPM

                                  We've definitely seen this: 100Mb ports reporting spikes of 105/108 Mbps on a 3-minute poll.  The ones that stand out are Internet routers (Cisco 7200s).  Some errors could be mathematical; however, I doubt that the 7200 is buffering an average of 108 Mbps for 3 minutes straight, and discards are at zero.  So it always looked like sloppy code.  I've also seen a variety of odd one-off errors, like a 3750 switch that registers 4 billion discards for every poll on every port, or DMVPN tunnels where every tunnel seems to have one of four different bandwidth values.  Some of this is IOS/Cat irregularities, but some is clearly Orion's problem.  When we upgraded to 9.0 we lost the ability to monitor most of our CAT-based 4006 switches.  7.x and 8.x worked fine; 9.x can't even inventory the blades successfully, and CPU/memory polling fails on all of them.  SW tech support's response was that our supervisors were too slow.  Yes, they are old and under load, but 9.x is so poor at handling slower SNMP devices that they basically don't support older CAT devices.  Older IOS devices, years older than those 4006s and with less muscle, work fine.  CAT support in general has been lagging behind IOS for at least 5 years.

                                    • Re: Incorrect Bandwidth reporting in NPM
                                      pyro13g

                                      RoyalIEF, I've experienced this with Cisco gear for over a decade, and the reason I stated is why.  You might be able to google up some very old discussions on it.  I had the same question when I was writing code and would get these results.  We get the same above-line-rate spikes in other products (Netscout PM, OpsManager, Cascade, WhatsUP, Traverse (old netvigil)).

                                    • Re: Incorrect Bandwidth reporting in NPM
                                      slinky103

                                      Here is another recent example, after the "fix" in 10.1.  One of our OC3s was running at 100% (maximum available bandwidth of 155Mbps), yet the graph shows over 200Mbps.  How is this possible???  Our netflow utility shows it as completely normal.  Even the SW 95th percentile is out of whack!!  We have a second polling engine/database monitoring the other side of the link, with more frequent polls, and we see the same thing.  So it is not environmental.