My understanding is that Group availability and Group member availability count unmanaged time as downtime, thereby lowering true availability.
I can make no sense of either measurement against any of the groups of nodes I have built.
A common assumption I run into is that availability is based on time, but Orion doesn't have much of a sense of time; it doesn't keep an internal clock like that. It just has datapoints for polled data, and the availability % is the percentage of polls of the object that came back up.
So in a normal day with a 2-minute polling interval you have ~720 availability datapoints in the database, each of which could be up or down.
If the node was down for 10 polls during that day then you have 98.61111% up time.
Now bring unmanaging into the mix. Let's say you unmanaged the device for 1 hour, and it was still down for 10 polls during the 23 hours of the day that the node was managed.
Now we have only 690 availability datapoints to work with, and 10 of those were down, so we get 98.5507% up time.
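To make the arithmetic above easy to re-run with your own numbers, here is a minimal Python sketch. The function names are my own and are not anything Orion exposes; it only reproduces the poll-count math described above.

```python
def polls_in(hours: float, interval_minutes: float = 2.0) -> int:
    """Number of availability datapoints collected over a managed period."""
    return int(hours * 60 / interval_minutes)


def availability_pct(total_polls: int, down_polls: int) -> float:
    """Availability as the share of polls that were up, in percent."""
    if total_polls <= 0:
        raise ValueError("need at least one poll")
    return 100.0 * (total_polls - down_polls) / total_polls


# Full day managed: 720 polls, 10 of them down.
full_day = availability_pct(polls_in(24), 10)

# One hour unmanaged: only 23 hours of polls (690), still 10 down.
with_unmanage = availability_pct(polls_in(23), 10)

print(f"{full_day:.4f}% vs {with_unmanage:.4f}%")
```

The same number of down polls weighs slightly more when there are fewer datapoints, which is why unmanaging nudges the percentage even at the node level.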
Other things can skew the numbers further. Nodes with dependencies can become unreachable, and unreachable nodes are effectively unmanaged until their parent comes back up. When a node goes into warning it triggers fast polling, which switches it to a 10-second polling interval and so adds extra datapoints. Manually forced polls add datapoints too.
I've also seen cases where the numbers looked weird until we found out that the polling engine was overloaded or had some other problem that was hurting our polling completion rate.
Essentially, what you need to do to verify the numbers is generate a report of every availability datapoint during the interval and do the math of [ count of up polls ] / [ count of total polls ].
Things get a little more complicated if you are going back further than 2 weeks, because you would also need to factor the weighting of the historical metrics into the calculation.
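As a rough sketch of that verification in Python: assume you have exported the datapoints as (availability, weight) pairs, where availability is 100 for an up poll and 0 for a down poll, and weight is how many raw polls a summarized historical row stands for. That pair layout and the function name are assumptions for illustration, so check them against what your own report actually returns.

```python
from typing import Iterable, Tuple


def weighted_availability(datapoints: Iterable[Tuple[float, int]]) -> float:
    """Weighted mean of availability values, in percent.

    Each datapoint is (availability_percent, weight). With every weight
    equal to 1 this reduces to up polls / total polls.
    """
    weighted_sum = 0.0
    total_weight = 0
    for availability, weight in datapoints:
        weighted_sum += availability * weight
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no datapoints in the interval")
    return weighted_sum / total_weight


# Raw detail data: 710 up polls and 10 down polls, each with weight 1.
raw = [(100.0, 1)] * 710 + [(0.0, 1)] * 10
print(round(weighted_availability(raw), 4))

# Summarized history (hypothetical rows): each row stands for 30 polls.
summarized = [(100.0, 30), (50.0, 30)]
print(weighted_availability(summarized))
```

If the result of this kind of calculation matches the node's reported availability but not the group's, that at least narrows the discrepancy down to how the group number is derived.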
Very interesting, and new information to me, but I still do not understand whether Node availability is calculated differently from Group availability or Group Member availability. What I am seeing suggests that the group measurements count unmanaged time as down, unlike node availability.
Ah, I understand your question a bit better now. To be honest, I have not had a use case where I had to test the logic behind the group member availability feature since it came out. My availability reports all still rely on the individual objects' availability stats directly, so I haven't looked at group member availability for anything.
Thanks. Here is what I see:
I have a Group that only has one member. If I look at the availability for last month of the Node/Server in Node detail, it shows 100% availability.
If I look at the Member Availability for last month, it shows 99.18%, and if I look at the Group availability it also shows 99.18%. So I scratch my head.
If I have a group made up of multiple members, then the Group Availability and the Member Availability do not necessarily match any more.
Somehow I think it is because Group availability, and maybe Group member availability, count unmanaged time as downtime, but I cannot get confirmation of this.
If this is true, Group availability and Group member availability are misnomers and give no true indication of uptime, which is how availability is so often defined.
I have evidence that supports your theory. See the screenshot below.
I am trying to get availability statistics on a pool of two servers. If either of the servers is up, I consider the overall application to be up since it is a high availability pool. To get a metric of the overall availability, I put these two servers in a group configured with best case roll-up. There are no other group members.
We have scheduled maintenance where both servers are rebooted daily at 4AM when there should be no activity. We are using the unmanage utility to unmanage these two servers from 4AM until 5AM and this works great at the node level. The servers show 100% available every day.
However, when we look at the group availability, it shows a 1-hour outage each day, covering the time the servers are unmanaged plus the time they sit in an unknown state after being remanaged, waiting for polled data.
If an unmanaged or unknown node is considered available (the main reason we are unmanaging in the first place) why is the group, which takes its status from said node, considered unavailable?
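To illustrate how much the two interpretations differ for this setup, here is a small Python simulation. It is only a model of the hypothesis discussed in this thread, not a statement of how Orion actually computes group availability: the status names, the ranking used for best-case roll-up, and both formulas are assumptions made for illustration.

```python
from typing import List, Sequence

# Assumed ranking for a best-case roll-up: higher is "better".
RANK = {"up": 2, "unmanaged": 1, "unknown": 1, "down": 0}


def best_case(statuses: Sequence[str]) -> str:
    """Group status for one poll cycle: the best status among the members."""
    return max(statuses, key=RANK.__getitem__)


def pct_excluding_unmanaged(series: List[str]) -> float:
    """Node-style math: only up/down polls count as datapoints."""
    counted = [s for s in series if s in ("up", "down")]
    if not counted:
        raise ValueError("no up/down datapoints")
    return 100.0 * counted.count("up") / len(counted)


def pct_unmanaged_as_down(series: List[str]) -> float:
    """Hypothesized group math: anything that is not up counts against you."""
    if not series:
        raise ValueError("no datapoints")
    return 100.0 * series.count("up") / len(series)


POLLS_PER_HOUR = 30  # 2-minute polling interval

# Both servers up all day except the 4 AM to 5 AM unmanage window.
server_a = (["up"] * (4 * POLLS_PER_HOUR)
            + ["unmanaged"] * POLLS_PER_HOUR
            + ["up"] * (19 * POLLS_PER_HOUR))
server_b = list(server_a)

group = [best_case(pair) for pair in zip(server_a, server_b)]

print(pct_excluding_unmanaged(group))          # unmanaged polls ignored
print(round(pct_unmanaged_as_down(group), 2))  # unmanaged polls counted as down
```

Under the first formula the group comes out fully available for the day; under the second it loses the whole maintenance hour, which is the shape of the discrepancy described above.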