1 Reply Latest reply on May 29, 2014 2:45 AM by shuth

    Need help in alert timing

    mahdi87_gh

      Hi everyone, I'm new to the Alerting engine. I configured an alert to send a POST to a URL on my web server. The alert is configured to check every 15 seconds. If I shut an interface on the router, it should send an alert. The problem is that SolarWinds sends the alert after around 100 seconds, not 15. And if the interface comes back up before those 100 seconds have passed, I don't get any alert at all. What is the problem?

        • Re: Need help in alert timing
          shuth

          Assuming you have set the "Alert Evaluation Frequency" to 15 seconds in the alert configuration, the alerting engine will query the database for the alert conditions every 15 seconds.  This is different to the SolarWinds server polling your devices/interfaces for their status/statistics.

           

          You will need to configure your nodes and interfaces with a shorter polling interval (and possibly reduce the warning period) if you want to be notified sooner. If the router is still reachable with the interface down, the interface should go straight to a down status on the next poll (no warning period).
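
          To put rough numbers on this: an alert cannot fire until a poll has written the new status to the database and the next alert evaluation has then read it. Below is a minimal sketch of that arithmetic, assuming the two timers run independently. The function names and the 90-second interface polling interval are made up for illustration and are not taken from your configuration.

```python
def worst_case_alert_delay(status_poll_interval_s, alert_eval_interval_s):
    """Upper bound on the seconds between a status change and the alert firing.

    Assumes the change happens just after a status poll, and that the next
    poll result lands in the database just after an alert evaluation ran.
    """
    return status_poll_interval_s + alert_eval_interval_s


def change_can_be_missed(outage_s, status_poll_interval_s):
    """True if an outage is short enough to fall entirely between two polls."""
    return outage_s < status_poll_interval_s


# Hypothetical values: 90 s interface status polling, 15 s alert evaluation.
print(worst_case_alert_delay(90, 15))  # 105
print(change_can_be_missed(60, 90))    # True
```

          With numbers like these, lowering the alert evaluation frequency alone cannot bring the delay under the polling interval, and an interface that goes down and comes back up between two polls may never be recorded as down at all.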

           

          I've included some info below from a post I wrote the other week about how the node status polling works.

           

          Difference Between Statistics Collection Interval and Rediscovery Interval

           

          Short version: At system defaults, a device can be down for 2-4 minutes before being detected

          Long version: See wall of text below.

           

          As Rob above mentioned, there are separate polling intervals for status, statistics and rediscovery.

           

          The status polling interval is how often the server will check the status, availability, response time and packet loss of your monitored elements (default 2 minutes).

          The statistics polling interval is how often the server will collect statistics from your devices using SNMP/WMI - information such as CPU/Memory utilisation, volume statistics, interfaces statistics, etc.

          The rediscovery interval polls your devices checking for reindexed interfaces, system OID info, node details, as well as checking for technologies like EnergyWise and wireless.

           

          As for how long a node can be down before being detected, this will depend on your status polling interval (default 120 seconds) and node warning level (this setting is further down the page under Calculations & Thresholds - default 120 seconds).

           

          As for how long before the system detects a node is down, my understanding is the following:

          If the polling engine polls a node and does not get a response, the node status is set to Warning (and the icon goes from green to yellow) and goes into a "fast poll mode". In this mode, the server will poll the node every 10 seconds (locked value) for the period defined in the node warning level (default 120 seconds). If there is still no response at the end of this time period then the status is set to Down (and icon changes from yellow to red). If the node responds, the status changes to up (green). If you have dependencies configured, the system will check the parent statuses first before marking a device as down - if all of the parents are down or unreachable, the node will be marked as unreachable.

           

          Therefore, if a node goes down immediately after a successful poll, with default polling intervals you will find out after ~4 minutes.

          T0 - successful poll, node status up

          T1 - device goes offline

          T120 - system polls, node status set to warning, enter fast poll mode - poll every 10 seconds

          T240 - node warning level ends, node status set to down (device has been offline for 3m:59s)
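
          The timeline above can be reproduced with a little arithmetic. This is only a sketch under my understanding described above: it assumes status polls run at fixed multiples of the polling interval and that the node never answers once it is offline. The function name is made up for illustration.

```python
def seconds_until_marked_down(offline_at_s, status_poll_interval_s=120,
                              node_warning_level_s=120):
    """Seconds a node is offline before its status is set to Down.

    Assumes status polls run at t = 0, interval, 2 * interval, ... and that
    the node never responds after going offline at offline_at_s.
    """
    # Index of the first status poll at or after the moment the node went offline.
    polls_elapsed = -(-offline_at_s // status_poll_interval_s)  # ceiling division
    first_failed_poll_s = polls_elapsed * status_poll_interval_s
    # Fast poll runs for the whole warning period before the node is marked Down.
    marked_down_at_s = first_failed_poll_s + node_warning_level_s
    return marked_down_at_s - offline_at_s


print(seconds_until_marked_down(1))    # 239, the 3m:59s case above
print(seconds_until_marked_down(120))  # 120, offline just as the next poll runs
```

          The two calls bracket the "2-4 minutes" range from the short version: roughly 2 minutes if the node fails right before a poll, roughly 4 if it fails right after one.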

           

          I typed the above then remembered I had this bookmarked: How Does Orion Mark a Node as Down?

          I also had another bookmarked thread with some developer quotes on the fast poll cycle: Fast Polling

           

            wrote:

          Here is perhaps a more detailed explanation than what you are looking for.

          On a status poll Orion performs an ICMP ping operation.  If that ping fails then we set the Node to Warning Status and begin Fast Poll.  By default, we will poll in Fast Poll mode for 120 seconds and send a ping every 10 seconds.  (120 seconds configurable from your Polling Settings page - Node Warning Level).  If after the 120 seconds, all of the pings sent every 10 seconds have failed, then we will mark the node as Down.  So on average it will take 13 failed pings to set a node to the Down status.

          Fast Poll pings are not logged against availability.

          There are more settings to ping more than once on the initial status poll, too.
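
          One way to read the figure of 13 in the quote above is one failed ping on the regular status poll plus one ping per fast poll tick. The sketch below assumes exactly that counting and a retry count of 1. The function name is made up and this is not SolarWinds code.

```python
def failed_pings_before_down(node_warning_level_s=120, fast_poll_interval_s=10):
    """Count the failed pings needed before a node is marked Down.

    Assumes one failed ping on the regular status poll, followed by one
    ping per fast poll interval for the whole node warning level period.
    """
    initial_status_poll_pings = 1
    fast_poll_pings = node_warning_level_s // fast_poll_interval_s
    return initial_status_poll_pings + fast_poll_pings


print(failed_pings_before_down())  # 13
```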

           

          There is a setting that you can use to increase the number of times we will try to ping a device on a status poll by default.  Meaning that even before fast polling kicks in, we will try to ping the device X number of times to make sure we didn't just maybe miss a ping because of network load.

          In your Settings table of your Orion database, you may or may not have the entry:

          SWNetPerfMon-Settings-Response Time Retry Count

          This value, even if not in the database defaults to 1.  Meaning we will try to ping the device once before failing and moving into fast poll if necessary.  If you were to update this number "CurrentValue" to 3, then if we were polling the device and the first ping failed, we would try again, and if that one failed, again before going into fast poll mode.  If any one of the 3 succeed, then we immediately just mark the device as up for that poll interval.  There is no time delta between these pings.  If one fails, another is tried as fast as the code can run.

          You probably don't need to change this number unless you are seeing a lot of your nodes go to warning because your network often drops ICMP packets.

          Hopefully, this helps even more.

           

          It will stop as soon as the first ping attempt succeeds.

          So if you set it to 3, then we will try once and if that succeeds then attempt no more.

          Here is the algorithm:

          1. Try to ping the device

          2. Did we succeed?  Yes go to 4. No, go to 3.

          3. Should we try again (we've tried fewer times than we are configured to)? Yes, go to 1. No, go to 4.

          4. Notify the system of the success or failure of the ping.

          So it is really a maximum amount of attempts.  Thus 1 is truly the minimum attempt count.
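
          The four steps above translate almost directly into a loop. This is a sketch of the quoted algorithm, not the actual Orion code: ping_once is a hypothetical callable that returns True when the device replies, and max_attempts stands in for the Response Time Retry Count setting.

```python
def ping_with_retries(ping_once, max_attempts=1):
    """Return True on the first successful ping, False once all attempts fail.

    ping_once is a hypothetical callable returning True when a reply arrives.
    At least one attempt is always made, so 1 is the effective minimum.
    """
    attempts = 0
    while attempts < max(1, max_attempts):
        attempts += 1
        if ping_once():       # steps 1 and 2: try, and stop on success
            return True
    return False              # step 3 said "no more tries": report failure


# Simulated replies: the first two pings are lost, the third succeeds.
replies = iter([False, False, True])
print(ping_with_retries(lambda: next(replies), max_attempts=3))  # True
```

          The return value plays the role of step 4: whatever calls this function is told about the success or failure, and only on a failure would it move on to fast poll mode.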
