Availability Statistics Calculation - Request for Input

Greetings all.

We are trying to determine availability percentage figures that are as accurate as possible.

I am looking for guidance on the following scenarios.

1. We PING our devices every 120 seconds for an up/down status; in other words, we are checking for a missed PING.

One missed ping puts the Orion polling engine into a rapid polling algorithm that polls every 10 seconds, according to the documentation. The time limit currently set on our system is 300 seconds. Again according to the documentation, this means that fast polling continues for 300 seconds before the device is declared "Down". Given those values, the total time BEFORE a device is declared "Down" is 7 minutes (420 seconds). Availability is then decremented, and this continues until a successful poll is returned. Potentially, this could mean a minimum of 9 minutes before the device is declared up, even though it came up much sooner.

Here is the problem: should a device fail (in reality it went down, I'm staring right at it) and then come back up, PING would fail for a period of time and fast polling would not succeed immediately, BUT the device returns to an "UP" condition before the 300 seconds is reached. THEREFORE availability is calculated as 100% because the device never reached the "Down" condition in the Orion system. An alert was NOT generated based on "Down" as the condition.

2. Basically the same scenario as above, but we now SHORTEN our fast polling time to 120 seconds. NOW we will declare the device "Down" after 4 minutes (120 + 120 seconds). Alerts are now generated at THAT time.

The potential problem in this scenario is that ALERTS start to be generated for devices that are truly up but don't respond quickly enough to the PING. Devices can be so remote or slow that this does happen. Also, the statistics become artificially negative.

I'm looking for how to be as accurate as possible for Availability Statistics while NOT alerting to a point where no one believes the alerting anymore.

3. I am aware of the Percentage setting in place of Node in the Poller Settings. The problem with that is IF a device misses any number of polls, it is considered "Down" when in reality it is just too busy/slow to respond. It is not down and therefore skews the statistics abnormally negative.
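Scenarios 1 and 2 above can be sketched as a small simulation. This is only a model of the behavior as described in this post (regular polls every 120 seconds, fast polls every 10 seconds after the first miss, "Down" declared once the limit expires with no successful poll); it is not verified against what Orion actually does, and the function name `simulate_outage` and its parameters are hypothetical.

```python
import math

def simulate_outage(outage_start, outage_duration,
                    poll_interval=120, fast_poll=10, warning_level=300):
    """Return the time (s) at which the node is declared Down, or None.

    Assumed model, taken from the post and not from Orion internals:
    regular polls run at t = 0, poll_interval, 2 * poll_interval, ...
    The first missed poll starts fast polling every `fast_poll` seconds.
    The node is declared Down only if every poll fails for `warning_level`
    seconds after that first miss.
    """
    outage_end = outage_start + outage_duration
    # First regular poll that lands inside the outage window.
    first_miss = math.ceil(outage_start / poll_interval) * poll_interval
    if first_miss >= outage_end:
        return None  # outage fell entirely between two regular polls
    declared_at = first_miss + warning_level
    # Any fast poll that finds the device back up cancels the Down state.
    t = first_miss + fast_poll
    while t <= declared_at:
        if t >= outage_end:
            return None
        t += fast_poll
    return declared_at

# A 300 s outage that begins 1 s after a regular poll:
print(simulate_outage(1, 300, warning_level=300))  # scenario 1 setting
print(simulate_outage(1, 300, warning_level=120))  # scenario 2 setting
# A 600 s outage with the scenario 1 setting:
print(simulate_outage(1, 600, warning_level=300))
```

Under these assumptions the 300-second outage is never recorded with the 300-second limit, but is declared Down at t = 240 with the 120-second limit. The 600-second outage is declared Down at t = 420, which matches the 7-minute figure above.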

I need thoughts on this from the forum and from SolarWinds to arrive at a best practice. Your input is gratefully acknowledged in advance.

Thanks!

mcbridea

This is controlled by the Node Warning Level setting. From the NPM Admin Guide:

Node Warning Level

Devices that do not respond to polling within this designated period of time display as Down in the web console. By default, this value is 120 seconds.

mhh351

As I stated, the documentation is ambiguous at best as to what that really means.

This is not an answer. I read the notes.

When does Orion declare a node "Down" and "Up" and what is the delay to that declaration?

What does the delay include, if anything?

mcbridea

If you have set the availability polling at 120 seconds and the Node Warning Level at 300 seconds (I think that is what you are saying), then the maximum downtime is the time difference between the device ping failure and the next ICMP poll (an uncontrollable variable) plus the Node Warning Level, 300 sec. The fast poll will happen every 10 seconds for the time set in the Node Warning Level setting. After the Node Warning period the device returns to 120 sec pings. You are correct that if the node starts responding to ICMP again before the Node Warning period expires, it will not be seen as down.
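As a rough worked example of that sum, assuming the failure can happen at any point between two regular ICMP polls: the gap until the next poll ranges from 0 up to the polling interval, so the delay before a node is marked Down ranges from the Node Warning Level alone up to the polling interval plus the Node Warning Level. The function name `time_to_down_bounds` is hypothetical and the model is an assumption based on this thread, not on Orion internals.

```python
def time_to_down_bounds(poll_interval_s, warning_level_s):
    """Return (shortest, longest) delay in seconds from the real failure
    to the Down declaration.

    Assumption: delay = gap until the next regular ICMP poll
                        + Node Warning Level,
    where the gap lies between 0 and poll_interval_s.
    """
    shortest = warning_level_s                   # failure right at a poll
    longest = poll_interval_s + warning_level_s  # failure right after a poll
    return shortest, longest

for warning_level in (300, 120):
    low, high = time_to_down_bounds(120, warning_level)
    print(f"Node Warning Level {warning_level}s: Down after {low}s to {high}s")
```

With 120-second polling this gives 300 to 420 seconds for a 300-second Node Warning Level and 120 to 240 seconds for a 120-second one. Any outage shorter than the lower figure would never be marked Down under this model.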

You could set up an advanced alert to catch these, but you may get overwhelmed because ICMP can be dropped for many reasons. At least it will work as a measure of your ICMP poll time vs. Node Warning delay time settings, even if you don't use it permanently.

Hope this helps - Andy

[Attachment: warning alert.JPG]

William_Powley

I'm glad I saw this post, as I'm starting this very process right now. So, up front thank you for doing the calculations and posting them for the rest of us to work off of.

My initial thought is that there would have to be room for error in the results that we present to our companies. I understand that in some environments this might be very small, but generally when we (my organization) are asked what our uptime is, we provide a rough estimate based on our own rough calculations. That was before we had the statistics provided by Orion, which help us zero in on those numbers even more, but we realize that there will always be factors outside of our control because of the delay you mention for remote sites, latency, and other factors. We easily see 40-80 ms delay on most of our remotes when everything is quiet, and 300-800 ms when they are hammering the tunnels. There is little we can do about that, except perhaps use Quality of Service to optimize the management traffic a little more.

Granted, I've only started the process myself and am only brainstorming a little here while I think about this, so if I'm going off topic or not helping, then tell me to shush.

I guess what I'm getting at is what would be that acceptable amount of delay that you can absorb when reporting your availability?

That being said, I'll start testing as well and see if we can narrow down that margin of error as much as possible.

William

mcbridea

If the delay gets over 2500 ms it will be seen as an ICMP failure, so you are far from that. To best understand real traffic delay I suggest setting up IP SLA operations if you are a Cisco shop. Check this out for a better understanding.
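To put the latencies quoted above next to that threshold, here is a minimal sketch. It assumes the 2500 ms figure from this reply is the effective ICMP timeout (not verified against the product); `is_icmp_failure` is a hypothetical helper, and the 2600 ms and no-reply samples are made up for illustration.

```python
ICMP_TIMEOUT_MS = 2500  # figure quoted in this thread; treat as an assumption

def is_icmp_failure(rtt_ms, timeout_ms=ICMP_TIMEOUT_MS):
    """A reply slower than the timeout, or no reply at all (None),
    counts as a missed ping."""
    return rtt_ms is None or rtt_ms > timeout_ms

# 40-800 ms are the delays reported above; 2600 and None are invented samples.
samples_ms = [40, 80, 300, 800, 2600, None]
missed = [s for s in samples_ms if is_icmp_failure(s)]
headroom_ms = ICMP_TIMEOUT_MS - 800  # margin left at the worst reported delay

print(missed)
print(headroom_ms)
```

Only the invented 2600 ms sample and the missing reply count as failures; the worst reported delay of 800 ms still leaves 1700 ms of headroom.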

mhh351

William,

What I am trying to gain is knowledge of how Orion calculates "Down" time for a device.

What I am thinking about right now is that, given the 2 minutes between "Ping" polls, one could have missed a "Down" starting 1 or 2 seconds into the next cycle and then have to calculate all the way out to the 2 minutes plus whatever time period elapses before ALERTING is triggered. Taking that as a given, I would then back off my fast poll cycle to limit it to 120 seconds. At the worst, Orion would wait up to 6 minutes of wall time before alerting.

The calculation of when a device is declared "Down" for availability reasons has not yet been answered. I know what the doc says in its ambiguity, but I need to know what the program is really considering as available for the device. Is it (first missed ping), or (from first missed ping + fast poll), or (first missed ping + fast poll + limit of time), or (fast poll exceeded + limit of time), or immediate missed ping? If one looks at the SQL in Reports, the value description says availability + missed ping + fast poll = AVERAGE availability.
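To show why the answer matters, here is a minimal sketch comparing two of those candidate definitions for a single outage: downtime counted from the first missed ping versus downtime counted only from the moment "Down" is declared. Both definitions are hypothetical readings of the documentation; which one Orion uses is exactly the open question. All function names and figures are illustrative.

```python
def availability_pct(period_s, downtime_s):
    """Availability over a reporting period, as a percentage."""
    return 100.0 * (period_s - downtime_s) / period_s

def downtime_by_definition(first_miss_s, limit_s, recovered_s):
    """Downtime (s) for one outage under two candidate definitions.

    first_miss_s: time of the first missed ping
    limit_s:      fast polling limit before Down is declared
    recovered_s:  time of the first successful poll afterwards
    """
    declared_down = first_miss_s + limit_s
    from_first_miss = recovered_s - first_miss_s
    # If the node recovers before the limit expires, it is never declared Down.
    from_declaration = max(0, recovered_s - declared_down)
    return {"from_first_missed_ping": from_first_miss,
            "from_down_declaration": from_declaration}

DAY_S = 86_400
for recovered in (1020, 320):  # a long outage and a short one
    result = downtime_by_definition(first_miss_s=120, limit_s=300,
                                    recovered_s=recovered)
    for name, downtime in result.items():
        print(recovered, name, round(availability_pct(DAY_S, downtime), 4))
```

For the long outage the two definitions differ by 300 seconds of downtime over the day. For the short one, the second definition reports 100% even though the device missed polls for 200 seconds.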

So, there are several questions yet to be answered.

So, now you know where my thought process is.

I agree with your analysis that we have to caveat in our reports what the statistic is based on. As long as everyone knows what the basis is for the statistic, all will be good.

Thanks for your interest and input.

Mark