
What multiplier should be used to estimate actual interface bandwidth utilization when NPM only polls once per 10 minutes?

Management likes to know how much Internet pipe a particular site uses, and I appreciate seeing the growth trend over the last 14 years, too. But the standard NPM polling rate of once every ten minutes reveals just a fraction of the actual consumed bandwidth. This is easily proven by comparing the bandwidth NPM reports to the utilization discovered by Engineer's Toolset Bandwidth Gauges set to poll every ten seconds.

Confirmation can be had by remoting into the switch and using the "show interface" command to see how much traffic is flowing through it.  The Bandwidth Gauge shows a truer representation of the actual throughput than NPM does.

I'm thinking there might be a conversion factor one could easily apply to the graphed utilization to get a more accurate idea of the true utilization. I hate to say it, but Orion NPM doesn't tell the whole truth with its interface utilization graphs.

Here's an example of NPM polling an interface once every ten minutes over twenty-four hours:

[Attached image: pastedImage_0.png]

Compare the same exact interface bandwidth information polled every ten seconds with an Engineer's Toolset Bandwidth Gauge set to display a historical graph:

[Attached image: pastedImage_1.png]

Comparing today's information (the right half of each graph), NPM missed the 710 Mb/s spike. It reports that bandwidth utilized today was between 70 Mb/s and 100 Mb/s.

But the Toolset Bandwidth gauge shows most of the day was spent between 100 Mb/s and 200 Mb/s.

I'd like to be confident when I tell Management that a particular interface has X utilization, and to know by what specific percentage NPM is off.

Have you come up with a kludge for this? I can't trust NPM for an actual statistic--only a relative one that is averaged. Instead I must keep that Bandwidth Gauge running so I know the real utilization--not what Orion NPM tells me. Otherwise I end up with insufficient pipe purchased, and complaints from the users.

1. What do you use to see the actual bandwidth utilized? NPM?  Toolset?  Other?

2. How do you verify your information is valid?

3. What X Factor do you apply to NPM's graphs to reflect the true utilization?

Hoping to see some creative answers here!  Remember:  there are three questions.

Yours,

Rick S.

  • What you could do is change the interface polling interval for your critical interfaces, to give you a more granular view of bandwidth usage. It won't help with the historical stats, but it will resolve things going forward. You can do this by going to 'Manage Nodes', expanding the node in question, and then checking the box next to the interface(s) you need. Then click 'edit properties' from the options at the top, and on the page that follows, change the polling interval:

    [Attached image: Interface_Stats.jpg]

    I would only recommend that you do this for a handful of interfaces, as doing it for hundreds would add significant load to your polling engines, but for a small number, this would do the trick.

  • Thanks for the suggestion. However, it's not the one I need because it only addresses the limited number of interfaces that would be adjusted for higher polling. Given 50,000+ interfaces in this particular Orion monitoring deployment, there will always be times when someone wants to know accurate statistics for a given port over some historical period.

    If there were a good rule of thumb for multiplying the actual stats by an agreed-on fudge factor, one could leave polling settings alone and simply multiply existing graphed stats by that fudge factor.

    Your suggestion is a good one--for a small number of interfaces--and it addresses my specific example satisfactorily.  My example was only the tip of an iceberg, though, so the multiplier is still needed.

  • Hi rschroder,

    You raise a very valid point, which shows the benefit of having Engineer's Toolset available to complement NPM. The way I like to look at it is this: NPM is better suited to watching for sustained utilization, and it can miss the occasional spike, which users will sometimes not even notice when it is short-lived. If you do receive complaints from users, you can run the Toolset to monitor at a more granular level for as long as required.

    You can certainly reduce the statistics polling interval for all of your interfaces; however, this will have an immediate impact on your polling engine throughput and on the size of your database (in particular the detailed interface statistics table), and it could also affect the load times for any views which have interface-specific resources.

    There is no "multiplier" that i am aware of, as the detailed statistics are not averages, but the values observed at that point in time. Depending on the device type, it may be possible to configure a SYSLOG message or TRAP notification for the interface utilization condition so help you identify which interfaces are spiking. You could also look at Solarwinds  "Netflow Traffic Analyser" to help you understand the "who and how" your bandwidth is being consumed, this keeps 30 days of stats at 1 minute intervals be default.

    Hope that helps,

    Thanks,

    Tony

  • No problem, at least I helped with one small part of the issue. The problem with a multiplier is that it'll be no good in the real world; as you can see from the accurate data in the Toolset polls, peaks happen!

  • Hmm... I think you have things a bit wrong. First, even though it polls every 10 minutes, it is polling a counter of ALL data that went through the interface. So it's not missing anything at all; it just reads the counter, divides the change by the time interval, and says that's how much data went through ON AVERAGE. It's really >very< accurate. Multiplying it by something will distort the data, not fix it. Once again, NPM is reading actual counters of the actual bandwidth on an interface, not taking a sample of what data is currently passing every X minutes (in which case a multiplier might work).
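
    Just to make the arithmetic concrete, here's a rough sketch of that delta-and-divide (illustration only, not NPM's actual code; the names and numbers are made up):

        POLL_INTERVAL_SECONDS = 600                      # one poll every 10 minutes

        def average_bps(prev_octets, curr_octets, interval=POLL_INTERVAL_SECONDS):
            """Average bits/sec between two readings of an interface octet counter."""
            delta_octets = curr_octets - prev_octets     # every byte that crossed the interface is in here
            return (delta_octets * 8) / interval         # bytes -> bits, spread over the whole window

    A 20-second burst still lands in the counter; it just gets spread across the whole ten-minute window when graphed.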

    The 710 Mb/s spike in the Toolset is probably actually the problem (or anomaly), rather than the data in NPM. I'm guessing that either the Bandwidth Gauges have a problem handling 64-bit counters, or the counters "wrapped", which they will do every so often, and the Bandwidth Gauges probably don't know how to handle that as well as NPM. The counters "wrap" when they reach their maximum value and go back around to zero. On very high-speed interfaces this can happen quickly and distort the data, especially for anything not using 64-bit counters. In NPM you can set each node to use 64-bit counters, and it is a good idea to do so.
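
    For what it's worth, wrap handling is just modular arithmetic; a minimal sketch, assuming the counter width is known:

        MAX_32 = 2**32        # ifInOctets rolls over at this value
        MAX_64 = 2**64        # ifHCInOctets effectively never wraps

        def counter_delta(prev, curr, max_value=MAX_32):
            """Octets since the last poll, allowing for a single counter wrap."""
            if curr >= prev:
                return curr - prev
            return (max_value - prev) + curr             # the counter wrapped back through zero

    At 1 Gb/s line rate a 32-bit octet counter can wrap in roughly 34 seconds, so a gauge that misses or mishandles a wrap can easily show a huge false spike.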

    Short of going to NTA, reducing the poll time for that interface is the only way to get better resolution. NTA is probably your best bet. NetFlow, rather than being a polled technology like SNMP, is a push technology. The router or other device watches every "flow" of data going through it and reports back to a collector (NTA) every time a flow ends or a timer expires. You need to be cautious with this, though, and set short, defined timeouts for the data to be sent to the collector, or the NetFlow source can send back a single NetFlow packet covering multiple minutes of data, which can also distort the graphs. Less than or equal to a minute for both active and inactive flows is best.

  • It's interesting to think of using NTA that way.

    Regarding NPM missing the spike caught by the Bandwidth Gauge, a couple of items:

    • The Gauge correctly recorded the spike.  I was the one generating a large download of Cisco IOS code, and I watched the download correspond exactly with the Bandwidth Gauge's spike.  So kudos to the Toolset for accurately monitoring and reporting it!
    • NPM missing the spike isn't a concern. Internet use is bursty, and I don't expect to catch the traffic to the Nth degree with NPM. What I WOULD like is for it to show numbers equivalent to those shown by the Bandwidth Gauge. When I call up historical data in NPM and see a trend that averages around 100 Mb/s for six hours on a port, and then compare that to what the Toolset's Bandwidth Gauge shows for that same port and period, there can be a major discrepancy between the two reports.

    In one way it seems to be comparing apples to apples (the same interface monitored by two different solutions over the same period).

    In practice it can be very different, all as a result of the different polling frequencies (every 120 seconds versus every 10 seconds), which makes it look like a comparison between apples and oranges.

    When the Boss asks what the Internet utilization is looking like this week, he wants to know whether the stats suggest we should consider increasing the Internet pipe size, based on what Orion shows. Since I graph that particular pipe with the Toolset's Bandwidth Gauge AND with Orion NPM, whereas he only sees NPM's graph, I want him to understand that NPM's graphs are not showing the complete picture, as the Bandwidth Gauge graphs demonstrate.

    If we could agree on a multiplier, I wouldn't have to keep qualifying the results that NPM provides by showing more real-time Bandwidth Gauge graphs.

  • Since you have the Toolset graph and the NPM graph, you could compute how far out the two are, and perhaps use a multiplier that gets the NPM graph up to the results shown in the Toolset. For me, it's a bit of a bodge, because if you use the same multiplier on any given graph, it may well be wildly inaccurate, even though it fits in the first case.

    I doubt SolarWinds would have an officially sanctioned number you could use because, as cnorborg rightly said, NPM will tend to average things out, based on the graph zoom, the timeframe requested, etc.

    Perhaps do the Toolset/NPM comparison for a decent set of data (say a dozen interfaces) and see if there is a common multiplier? It's still a bit "finger in the air" for me; I'd go with a shorter polling interval (not really an option on 50K interfaces, I grant you).
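
    Something like this back-of-the-envelope comparison would at least show whether a common multiplier exists (just a sketch; the interface names and numbers are invented):

        # Average Mb/s per interface as reported by each tool over the same period.
        toolset_avg_mbps = {"Gi0/1": 160, "Gi0/2": 50, "Gi0/3": 310}
        npm_avg_mbps     = {"Gi0/1":  90, "Gi0/2": 40, "Gi0/3": 120}

        ratios = {ifc: round(toolset_avg_mbps[ifc] / npm_avg_mbps[ifc], 2) for ifc in npm_avg_mbps}
        print(ratios)        # {'Gi0/1': 1.78, 'Gi0/2': 1.25, 'Gi0/3': 2.58}

    If the ratios land all over the place (as they do here), no single X factor is going to be safe.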

  • Thanks Craig,

    The OIDs which are polled provide the "Total Number of Octets / Pkts" and are reset at re-initialization or at "other times as indicated by the value of ifCounterDiscontinuityTime." I believe it may be computing the delta from the last reported value?

    If NPM were dividing by a time interval, it should be possible to graph it down to any resolution, which is not the case; the graphs will only go as detailed as the data available. Also, looking at the database, I don't see the cumulative values I would expect.

    By default, Orion NPM polls the following interface-related OIDs, as listed by type.

    Information Type                                        OID Name                   OID Value
    Bandwidth - 32-bit Counter                              ifInOctets                 1.3.6.1.2.1.2.2.1.10
    Bandwidth - 32-bit Counter                              ifInUcastPkts              1.3.6.1.2.1.2.2.1.11
    Bandwidth - 32-bit Counter                              ifInNUcastPkts             1.3.6.1.2.1.2.2.1.12
    Bandwidth - 32-bit Counter                              ifOutOctets                1.3.6.1.2.1.2.2.1.16
    Bandwidth - 32-bit Counter                              ifOutUcastPkts             1.3.6.1.2.1.2.2.1.17
    Bandwidth - 32-bit Counter                              ifOutNUcastPkts            1.3.6.1.2.1.2.2.1.18
    Bandwidth - 64-bit Counter                              ifHCInOctets               1.3.6.1.2.1.31.1.1.1.6
    Bandwidth - 64-bit Counter                              ifHCInUcastPkts            1.3.6.1.2.1.31.1.1.1.7
    Bandwidth - 64-bit Counter                              ifHCInMulticastPkts        1.3.6.1.2.1.31.1.1.1.8
    Bandwidth - 64-bit Counter                              ifHCOutOctets              1.3.6.1.2.1.31.1.1.1.10
    Bandwidth - 64-bit Counter                              ifHCOutUcastPkts           1.3.6.1.2.1.31.1.1.1.11
    Bandwidth - 64-bit Counter                              ifHCOutMulticastPkts       1.3.6.1.2.1.31.1.1.1.12
    Errors and Discards                                     ifInDiscards               1.3.6.1.2.1.2.2.1.13
    Errors and Discards                                     ifInErrors                 1.3.6.1.2.1.2.2.1.14
    Errors and Discards                                     ifOutDiscards              1.3.6.1.2.1.2.2.1.19
    Errors and Discards                                     ifOutErrors                1.3.6.1.2.1.2.2.1.20
    Interface - General                                     ifName                     1.3.6.1.2.1.31.1.1.1.1
    Interface - General                                     ifAlias                    1.3.6.1.2.1.31.1.1.1.18
    Signal-to-Noise Ratio and Codeword Errors (CMTS only)   docsIfSigQUnerroreds       1.3.6.1.2.1.10.127.1.1.4.1.2
    Signal-to-Noise Ratio and Codeword Errors (CMTS only)   docsIfSigQCorrecteds       1.3.6.1.2.1.10.127.1.1.4.1.3
    Signal-to-Noise Ratio and Codeword Errors (CMTS only)   docsIfSigQUncorrectables   1.3.6.1.2.1.10.127.1.1.4.1.4
    Signal-to-Noise Ratio and Codeword Errors (CMTS only)   docsIfSigQSignalNoise      1.3.6.1.2.1.10.127.1.1.4.1.5

  • I love that SEs like you (in Cork!) read and respond to customers' queries, and that you have access to the technical information behind the scenes!

  • Well, once again, NPM is recording it correctly for what it's doing. If you have one tool monitoring every 10 minutes and there is a 700 Mbps spike for 20 seconds while the rest of the time it's at 20 Mbps, over the 10 minutes it's going to average that back out, probably down to something much closer to 20 Mbps than 700 Mbps. However, if you are monitoring with another tool every 10 to 15 seconds, it's going to capture that spike in its entirety. So, in order for both tools to behave the same, you would have to put them on equal footing and make their polling intervals the same.
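
    Working through those numbers (plain arithmetic, nothing NPM-specific):

        spike_mb    = 700 * 20       # megabits during the 20-second burst
        baseline_mb =  20 * 580      # megabits for the remaining 580 seconds
        average     = (spike_mb + baseline_mb) / 600
        print(average)               # ~42.7 Mb/s: the 700 Mb/s spike all but vanishes in the average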

    There is no multiplier!! It just doesn't make sense... Both tools are graphing simple counters that cover ALL traffic, just over different time periods. Put them on the same polling interval and I'm guessing they'll start looking more similar...

    Maybe another way to illustrate it is, once again, with NetFlow. There are two different types of NetFlow. The one most are used to is where the router reports on every single packet going through it; the other type reports on a random sampling of one out of every X packets. With the first version of NetFlow it doesn't make sense to use a multiplier, because it is reporting on every single traffic flow going through the device. However, because the second is NOT reporting on every flow, your software needs to compensate and apply a multiplier to correctly interpret the data. Let's say it's sampling one out of every 100 packets (1:100); you would tell the software to multiply each sampled packet by 100. With sampled NetFlow you could potentially miss entire conversations and lots of smaller traffic conversations, while with regular NetFlow you would pick them up.
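
    As a trivial example of that compensation (sketch only; the sampling rate and byte count are invented):

        SAMPLING_RATE = 100                  # 1:100 sampled NetFlow
        sampled_flow_bytes = 12_345          # bytes seen in the exported sampled records
        estimated_bytes = sampled_flow_bytes * SAMPLING_RATE
        # Full (unsampled) NetFlow needs no such multiplier: every flow is already reported.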

    Reading an interface counter via SNMP is similar to full NetFlow: it's not reporting on a sampling of packets, it's counting every single packet it sees go over the interface. Applying a multiplier doesn't make sense. Whereas if these counters only recorded one out of every X packets it would be different; you would need a multiplier, like with sampled NetFlow.

    If you want more accurate or granular reports, either change your NPM polling interval to match your Bandwidth Gauge's polling interval (or vice versa), or go with NTA using NetFlow... And be warned, your NetFlow graphs might look different than your NPM graphs!!! Oh yeah, be sure in NetFlow to record all packet types too, not just those for known traffic types.

    Oh, and make sure that NPM and the bandwidth monitor are both using either 32-bit or 64-bit counters; not sure if the Bandwidth Gauges can do 64-bit though...