
Availability

I have been asked by senior management to start providing network availability reports in the near future. As it stands now, we have our complete network infrastructure inside Orion, and we are starting to wrap some process around using the unmanage node option to account (assumption) for downtime that occurs via an approved change request. With that, I am curious whether there is any documentation on how Orion actually computes availability. There is an old post which implies that the calculations are not that straightforward.

Is there anyone out there in Orion land that can answer some of these questions: 

  1. What database tables contain the availability data?  I see an availability column in the responsetime_xx table, is that the only spot?
  2. How is availability calculated?
  3. Does the unmanage node option actually play into the availability calculation? The admin guide does not explicitly call out this feature.
    From the admin guide "When you need to perform maintenance on a node, such as upgrading firmware, installing new software, or updating security, you may want to discontinue polling while the device is down. Disabling statistics collection while node maintenance is being performed helps maintain the accuracy of your data and prevents unnecessary and inaccurate alert messages." 
  4. How does Orion deal with the data when Orion is coming back up from a reboot?  This was mentioned elsewhere in the forums also.
  5. Does database statistics summarization skew the availability data?

Thanks!

  • Cam,


     Good questions.


    1. The ResponseTime_xxxxx tables are the repository for the Availability statistics.


    2. In its simplest form, a node is 100% up if it responds to a ping and 0% up if it does not. Take all of those data points, average them over the time period in question, and that is your availability. The alternative is to use a 10-poll moving average so that a few missed pings do not immediately mark the node as 0% up. These two methods are selectable in the advanced settings in Orion.


    3. The Unmanage option does play into it: for the time the node is unmanaged, it is as if the node does not exist. So it will not negatively impact your availability.


    4. Can you be a bit more specific on this one? When Orion or its host system are rebooting or otherwise inoperable no polling will be conducted and for that time period there will be no data recorded.


    5. Summarization will have an effect on availability calculations, as the summarized data points are not weighted.
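
The two methods described in answer 2 can be sketched roughly as follows. This is only an illustration of the idea, not Orion's actual code: the function names are hypothetical, and the exact way Orion applies its 10-poll moving average is an assumption here.

```python
from collections import deque


def simple_availability(polls):
    """Average of per-poll scores: 100 for a reply, 0 for a missed ping."""
    scores = [100.0 if up else 0.0 for up in polls]
    return sum(scores) / len(scores)


def moving_average_scores(polls, window=10):
    """Per-poll score taken as the mean of the last `window` results.

    Assumption: this is one plausible reading of a "10 poll moving average";
    the thread does not spell out Orion's exact formula.
    """
    recent = deque(maxlen=window)
    scores = []
    for up in polls:
        recent.append(100.0 if up else 0.0)
        scores.append(sum(recent) / len(recent))
    return scores


# 20 polls with a single missed ping at position 10.
polls = [True] * 9 + [False] + [True] * 10
print(simple_availability(polls))        # 95.0
print(moving_average_scores(polls)[9])   # 90.0 rather than 0 for the missed poll
```

With the simple method the missed poll is stored as 0; with the moving average the same poll is stored as 90, so a single dropped ping has a much smaller effect on any one data point.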


    Cheers,
    Dan


  •  Dan,

    Good Answers!

    In regards to number four I have figured this out in that I was trying to compare it to some other monitoring tools we use.  So the question was a little confusing due to my confusion.

    So I took some time today to down a server and see what the data looked like and that has spurred a couple more questions.

      

    The server I used for testing has a two minute polling cycle and the server was downed in between polls around 4:19 p.m.

    At 4:20:56 Orion saw the server was down, but avail was still at 100.
    At 4:22:56 Orion displayed an event that the server was down, and avail was still at 100.
    At 4:24:57 Orion recorded avail as 0.

    Questions.

    1.  Looking at the raw data, why is avail still at 100 during two polling cycles where the server is down?
    2.  Would this delay I am seeing cause an alert to be delayed for an unexpected period of time, i.e. two polling cycles?
    3.  What does the archive field indicate?  I assume that during summarization the data is removed.
    4.  Can you please explain your answer to my original question 5 a little more?  I see that as data moves from detail to hourly to daily there is a loss of granularity, as you are taking an average of averaged data.  Is that what your answer is saying?

    As always, thanks! 

  • Cam,


     I am checking/testing the behavior you have seen with the recording of availability. One item we should cover is the Node Warning Interval (set in the Advanced settings of Orion), which is integral to when a node is actually marked as down and when a Node Down alert would be triggered. The Node Warning Interval sets the period of time that we will 'fast poll' the node after it misses a ping. Here's how this comes into play. We ping the node on a regular cycle of 2 minutes by default. If the device fails to respond to a regularly scheduled ping, Orion will set the node's status to Warning/Yellow and commence 'fast polling' of the node. During this period Orion will ping the device every 10 seconds. Should it respond to one of these pings, Orion will mark the node as Up/Green. If it continues not to respond, the node will then be marked as Down/Red. In short, the first missed ping does not immediately set the node to Down. That should help with questions 1 and 2.
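
The Warning/fast-poll behavior described above can be sketched as a small function. This is a simplified model, not Orion's implementation: `resolve_warning` is a hypothetical name, and the 120-second warning interval used as a default is an assumed value for illustration only (the 10-second fast poll is from the description above).

```python
def resolve_warning(fast_poll_replies, warning_interval_s=120, fast_poll_s=10):
    """Model what happens after a regularly scheduled ping has been missed.

    fast_poll_replies: list of booleans, one per fast poll (True = replied).
    warning_interval_s: assumed value; the real setting is configurable.
    Returns (status, seconds_elapsed_since_the_missed_poll).
    """
    max_polls = warning_interval_s // fast_poll_s
    polls_seen = 0
    for replied in fast_poll_replies[:max_polls]:
        polls_seen += 1
        if replied:
            # Any reply during the warning interval clears the warning.
            return "Up", polls_seen * fast_poll_s
    if polls_seen == max_polls:
        # The whole warning interval passed without a reply.
        return "Down", polls_seen * fast_poll_s
    # Fewer fast polls than the interval allows: still waiting.
    return "Warning", polls_seen * fast_poll_s


print(resolve_warning([False, False, True]))  # ('Up', 30)
print(resolve_warning([False] * 12))          # ('Down', 120)
```

Under this model a Node Down alert cannot fire before the warning interval has elapsed, which is one way to read the delay between the first missed poll and the Down event.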


    3. The archive field is a remnant of when all of the stats were kept in a single table and not broken out into three. It was a way of marking records as Detailed, Hourly or Daily.


    4. What I mean by this is that for calculations of availability that span daily, hourly, and detailed records, the overall average can be skewed by the greater number of data points in the detailed time frame.
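
A small worked example of the skew described in answer 4, assuming each record is a hypothetical pair of (availability %, minutes covered). The numbers are made up for illustration and are not taken from an Orion database.

```python
def unweighted_mean(records):
    """Plain average: every record counts the same, whatever span it covers."""
    return sum(avail for avail, _ in records) / len(records)


def time_weighted_mean(records):
    """Average weighted by the number of minutes each record represents."""
    total_minutes = sum(minutes for _, minutes in records)
    return sum(avail * minutes for avail, minutes in records) / total_minutes


# Two fully-up days stored as daily summary records...
records = [(100.0, 1440)] * 2
# ...plus one day of 2-minute detail records containing one hour of downtime.
records += [(100.0, 2)] * 690 + [(0.0, 2)] * 30

print(round(unweighted_mean(records), 2))     # 95.84
print(round(time_weighted_mean(records), 2))  # 98.61
```

The 720 detail records outnumber the 2 daily records, so the one bad hour drags the unweighted figure well below the time-weighted one.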


     HTH,

  • Dan,

    It seems like the more time I spend with this topic the more questions I have.  Thanks for entertaining my obsessive nature on this topic.  A couple more questions if you will:

    1.  If statistics summarization does not occur will the report writer on the fly roll-up hourly into daily so that a report can be generated?

    2.  I am struggling with the skewing of data that can occur with summarization.  If I only have two polls in a day and one of them is 0 and one is 100, it would appear that the node was down for 50% of the day.  That is an oversimplified example but I think it works.  With a two minute poll cycle, simple math says that I should have 720 polls per day (24 hours x 60 minutes = 1440 minutes; 1440 / 2 minute polling = 720).  So instead of 50% availability I end up with 99.86111111% availability (719 of 720 polls) when the weight per poll is applied.  In order for this type of math to work on a regular basis I would have to keep the detail statistics for three plus months, for my example of quarterly reporting.  I say this because I would need to know how many polls happened in an hour.  I also have to incorporate the polling cycle difference per node.  So I guess the question is, does SolarWinds have a fix for availability calculations anywhere on the road map?

    3.  If I were to add a field or two to a couple tables and adjust the summarization stored procedures, how many problems would that create?  I assume that there would be no assurance that future updates would not overwrite my changes but I am concerned that Orion might have some integrity checking that would bark about the new fields or change to the stored procedure.
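
The per-poll arithmetic in question 2 can be checked with a few lines. The function names are hypothetical, and the sketch assumes every expected poll actually ran.

```python
def expected_polls_per_day(poll_interval_min):
    """Number of polls in 24 hours at a fixed polling interval."""
    return (24 * 60) // poll_interval_min


def per_poll_availability(missed_polls, poll_interval_min):
    """Availability when each poll carries equal weight."""
    total = expected_polls_per_day(poll_interval_min)
    return 100.0 * (total - missed_polls) / total


print(expected_polls_per_day(2))                # 720
print(round(per_poll_availability(1, 2), 4))    # 99.8611
```

A node polled every 5 minutes has only 288 polls per day, so a single missed poll costs it more than it costs a node polled every 2 minutes. That is why the polling cycle per node has to be part of the calculation.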

    Thanks!

  • Cam,


     Sorry for the delay in following up on this thread. I travel quite a bit in my position. So let's go through the follow-on questions.


    1. The Report Writer will not do any sort of roll-up on its own. If you are ever curious to what the Report Writer is doing you can view the actual SQL statements by selecting the View SQL option under the View menu.


    2. This behavior is known and is in our tracking system. To mitigate the issue, you can run the reports summarized by day, then take the resulting data points and average those.


    3. Adding fields to the database is not an issue; the Custom Property Editor can do this for you. Adjusting the stored procedures, however, is not advised. Changes to the stored procedures may work, but they would be unsupported and would be overwritten the next time the Configuration Wizard is run.
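
The mitigation in answer 2 (summarize by day, then average the daily values) might look like this outside the Report Writer. It is a sketch under assumptions: `samples` is a hypothetical list of (timestamp, availability %) pairs exported from a report, not a real Orion table layout.

```python
from collections import defaultdict
from datetime import datetime


def average_of_daily_averages(samples):
    """Average each calendar day first, then average the days.

    Every day ends up with equal weight, however many records it holds.
    """
    by_day = defaultdict(list)
    for timestamp, avail in samples:
        by_day[timestamp.date()].append(avail)
    daily = [sum(values) / len(values) for values in by_day.values()]
    return sum(daily) / len(daily)


samples = [
    (datetime(2024, 1, 1, 0, 0), 100.0),   # day 1: a single daily record
    (datetime(2024, 1, 2, 0, 0), 100.0),   # day 2: four detail records
    (datetime(2024, 1, 2, 6, 0), 100.0),
    (datetime(2024, 1, 2, 12, 0), 0.0),
    (datetime(2024, 1, 2, 18, 0), 0.0),
]
print(average_of_daily_averages(samples))  # 75.0
```

A plain average over the same five records gives 60.0, because day 2 contributes four records to day 1's one.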


    Cheers,