Greetings all.
We are trying to determine availability percentage figures that are as accurate as possible.
I am looking for guidance on the following scenarios.
1. We PING our devices every 120 seconds to get an up/down status; a missed PING is what triggers everything described below.
One missed PING puts the Orion polling engine into a Rapid Polling algorithm of one poll every 10 seconds, according to the documentation. The time limit currently set on our system is 300 seconds. Again according to the documentation, this means that fast polling continues for 300 seconds before the device is declared "Down". Given those values, the total time BEFORE a device is declared "Down" is 7 minutes (120 + 300 = 420 seconds). Availability is then decremented, and continues to be decremented until a successful poll is returned. Potentially, this could mean a minimum of 9 minutes before the device is declared up again, even though it came back up much sooner.
Here is the problem: suppose a device really fails (it went down, I'm staring right at it) and then comes back up. PING fails for a period of time and fast polling is not immediately successful, BUT the device returns to an "Up" condition before the 300 seconds is reached. THEREFORE Availability is calculated as 100%, because the device never reached the "Down" condition in the Orion system. An alert was NOT generated either, since "Down" is the alert condition.
2. Basically the same scenario as above, but we now SHORTEN our Fast Polling time to 120 seconds. NOW we declare the device "Down" after 4 minutes (120 + 120 = 240 seconds). Alerts are now generated at THAT time.
The potential problem in this scenario is that ALERTS start to be generated for devices that are truly up but don't respond quickly enough to the PING. Some devices are so remote or slow that this does happen. The availability statistics are also skewed artificially negative.
I'm looking for a way to keep Availability statistics as accurate as possible while NOT alerting so often that no one believes the alerts anymore.
3. I am aware of the Percentage setting in place of Node in the Poller Settings. The problem with that is that IF a device misses any number of polls, it is counted as "Down" when in reality it is just too busy or slow to respond. It is not down, so this skews the statistics abnormally negative.
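To make the arithmetic in scenarios 1 and 2 concrete, here is a rough Python sketch of the trade-off. It is only a simplified model built from the numbers in this post, not Orion's actual algorithm: it assumes the worst case where the outage begins right after a successful poll, it assumes recovery is noticed immediately once the device is back, and the names (`PollerSettings`, `recorded_downtime`, and so on), the 6-minute outage, and the 24-hour reporting window are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PollerSettings:
    poll_interval_s: int     # normal status poll interval (120 in this post)
    fast_poll_window_s: int  # how long fast polling runs before "Down"


def seconds_to_down(settings: PollerSettings) -> int:
    """Worst-case delay between the real failure and the "Down" status."""
    return settings.poll_interval_s + settings.fast_poll_window_s


def recorded_downtime(outage_s: int, settings: PollerSettings) -> int:
    """Downtime this simplified model would record for a real outage.

    An outage shorter than the time-to-down threshold is never recorded.
    """
    return max(0, outage_s - seconds_to_down(settings))


def availability_pct(downtime_s: int, window_s: int = 86_400) -> float:
    """Availability over a reporting window (assumed here to be 24 hours)."""
    return 100.0 * (window_s - downtime_s) / window_s


scenario_1 = PollerSettings(poll_interval_s=120, fast_poll_window_s=300)
scenario_2 = PollerSettings(poll_interval_s=120, fast_poll_window_s=120)

real_outage_s = 360  # hypothetical 6-minute outage
for name, cfg in (("scenario 1", scenario_1), ("scenario 2", scenario_2)):
    down = recorded_downtime(real_outage_s, cfg)
    print(
        f"{name}: Down after {seconds_to_down(cfg)} s, "
        f"recorded downtime {down} s, "
        f"reported {availability_pct(down):.3f}% "
        f"vs actual {availability_pct(real_outage_s):.3f}%"
    )
```

Under these assumptions, the longer fast-polling window hides the 6-minute outage completely, while the shorter window records part of it but lowers the bar for alerting on slow devices that are really up. That is exactly the tension I am trying to resolve.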
I need thoughts on this from the forum and from SolarWinds to arrive at a best practice. Your input is gratefully acknowledged in advance.
Thanks!