Question about the 'Evaluate frequency' option

Hi All,

I would like some clarity on the 'Evaluate frequency' option that is available when configuring an alert.

For example: say I leave my polling settings at the defaults and, for a disk space alert, I set it to evaluate every 30 minutes.

Then:

If disk space usage is observed high at the 10th minute, back to normal at the 16th minute, and high again at the 20th minute, does the trigger action get initiated only once?

I am just trying to understand how this option relates to the polling settings.

  • Someone else may be able to confirm, but my understanding is that in your example it would trigger either once or not at all.  You don't state what happens between minute 21 - 30.   If it remains 'high' I think it will trigger at minute 30.  If it returns to 'normal' again before minute 30 then I would expect it to not trigger at all.

    To answer the direct question, I believe you can think of the evaluate frequency option as how often the database stored procedure runs to check the conditions of the rule.  If it's 30 minutes, the trigger conditions are only evaluated/checked every 30 minutes; values/states in between evaluations are irrelevant.

    I haven't thought of a use case myself for raising the evaluation interval above the default 1 minute.  I suppose it's most logical to simply keep it below the polling interval.  I can imagine that if you had a very high polling rate you might want to evaluate less often, and if you had a longer polling interval and a very busy system, evaluating at a longer interval might free up resources.

    I have adjusted the trigger condition 'must exist for xx minutes/seconds duration' though.  For example, I have some WAN nodes that go down frequently.  I still want to poll them at the default rate, but I don't want any alerts unless the node has been unreachable for 5 minutes.  It still evaluates every 1 minute, but triggers only after 5 consecutive minutes of downtime.  In the absolute worst case, with a 2 minute polling frequency, it would be 8 minutes before the alert triggers after the node goes down (it goes down just after a poll succeeds, so 2 minutes pass before it is polled again; assuming that poll lands a second after the last evaluation, it's another minute until the next evaluation; and then 5 minutes with 'down = true' before the alert triggers).  On average it would be 6.5 minutes.  In the best case I find out just over 5 minutes after it went down.

    Does that help?
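
The reasoning in this reply can be sketched in a few lines of Python. This is only an illustration of the reply's description, not of how the product works internally: the function names, the minute-based timeline, and the assumption that a state change is equally likely at any point within a polling or evaluation cycle are all made up for the example.

```python
def state_at(minute, high_intervals):
    """Return True if `minute` falls inside any [start, end) 'high' interval."""
    return any(start <= minute < end for start, end in high_intervals)


def evaluations(high_intervals, evaluate_every, horizon):
    """Sample the condition only at evaluation times; anything in between is never seen."""
    return {
        t: state_at(t, high_intervals)
        for t in range(evaluate_every, horizon + 1, evaluate_every)
    }


def detection_delay(poll_every, evaluate_every, must_exist_for):
    """Best, average and worst delay (minutes) between a node going down and the alert.

    Assumes the failure can happen anywhere in a polling cycle and the poll can
    land anywhere in an evaluation cycle (hypothetical simplification).
    """
    best = must_exist_for
    average = poll_every / 2 + evaluate_every / 2 + must_exist_for
    worst = poll_every + evaluate_every + must_exist_for
    return best, average, worst


# Disk is high from minute 10 to 16, then high again from minute 20 onwards.
stays_high = evaluations([(10, 16), (20, 31)], evaluate_every=30, horizon=30)
# Same, but the disk returns to normal at minute 25.
recovers = evaluations([(10, 16), (20, 25)], evaluate_every=30, horizon=30)

print(stays_high)                 # condition seen as true at minute 30
print(recovers)                   # condition seen as false at minute 30
print(detection_delay(2, 1, 5))   # the WAN node example: 2 min poll, 1 min evaluation, 5 min wait
```

With a 30 minute evaluation interval the brief spike between minutes 10 and 16 never shows up in either run, which is the "trigger once or not at all" outcome described above.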

  • Thanks for the detailed explanation... Yes, I did read in many places that evaluate means it will run against the database, but I just wanted more clarity.

    Actually, we are sending email alerts to a team for disk space. Now they are saying they want to be informed every 30 minutes, in case they missed the email. So I was thinking of using the Escalate option, but that is tied to acknowledgement, and I won't have anyone acknowledging in the console.

    Your thoughts on this?

  • They want a new email repeating every 30 minutes?  So if they didn't fix the disk for 12 hours, they'd have 24 emails 'reminding' them?  That seems like the wrong approach; it's not like an email from 35 minutes ago goes away without someone touching it...  If the issue is too many total email alerts, then the process is broken and this makes it worse.  The email alerts should be easily manageable and should only surface things that must be actioned; that's why you buy into a system with complex alerting options in the first place...

    But if that's really what you want, you might just clear and re-trigger the alarm after a set time, instead of resetting when the status changes out of warning.  If you won't be using the web interface to acknowledge the fix and manage alarms, it won't really matter that it's triggering and clearing every 30 minutes.  That might be an easier way to keep it actively notifying than trying to build out potentially tens or hundreds of identical 'escalation' rules for every possible future 30 minute interval before it's fixed...

    But I would strongly recommend they try to action the first email, and if a broader team or group needs to be notified at larger intervals (30 minutes > level 2, 2 hours > level 3, etc.) then escalation is the way to go.  Don't forget you can have multiple rules as well, e.g. a 'space warning' rule at 10% free, a 'space critical' alert at 5% free, and a 'space emergency' alert at 1% free or something, each with their own timings, emailing rules, etc.
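
As a rough illustration of the two ideas in this reply (how quickly repeated reminders pile up, and how tiered rules split one problem into separate alerts), here is a small Python sketch. The function names are hypothetical, and treating each threshold as "at or below" is an assumption; the 10% / 5% / 1% values are the ones suggested above.

```python
def reminder_count(outage_minutes, repeat_every_minutes):
    """Number of repeat emails sent if the alert re-fires on a fixed schedule."""
    return outage_minutes // repeat_every_minutes


# Tiers are checked from most to least severe: (free-space threshold in %, rule name).
TIERS = ((1, "space emergency"), (5, "space critical"), (10, "space warning"))


def matching_rule(free_percent, tiers=TIERS):
    """Return the most severe rule whose threshold the free space has reached, or None."""
    for threshold, name in tiers:
        if free_percent <= threshold:
            return name
    return None


print(reminder_count(12 * 60, 30))  # reminders over a 12 hour outage at 30 minute repeats
for free in (50, 7, 3, 0.5):
    print(free, matching_rule(free))
```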

  • Hi slebbon

    I was working on this for a while... What's happening is:

    I have set an alert for CPU > 90% which should wait for 10 minutes before triggering. I am evaluating every 30 minutes and there is no reset condition; I have set it to trigger each time the condition is met.

    The first time, the alert comes in, but at the next evaluation it doesn't. What I observed is that if I remove the wait time, then I get the alert every 30 minutes, since the condition is still true.

    Any idea why it doesn't alert when we have wait time in place?

  • I'm not positive, but I think you might be having trouble because 'evaluate every 30 minutes' is a longer duration than the wait timer.  I don't think the evaluation interval is being used correctly here.  Its intent, I think, is more to reduce database CPU processing time, not to set a repeat interval.  My first thought was that you are not getting repeated alerts because the first alert is not getting reset, so it's still active.  Normal behavior would be to not trigger an alert again until the alert is reset and then re-meets the trigger conditions.  If you want escalating alerts when the original alert isn't responded to, that's taken care of in the alert escalation part of the trigger Actions section.  But that doesn't quite explain your last sentence, that you do get repeated alerts if you remove the wait condition.  Maybe it will help if I break down the 'meaning' of each part in plain language as I understand it:

    Wait time before trigger will be checked as part of trigger condition eg "if a = b and this has been true for last 10 minutes, then trigger alert".

    Evaluate every x minutes just schedules in the DB how often the trigger condition rules are checked at all.

    Since you're only evaluating every 30 minutes, I think your condition of 'has been true for 10 minutes' can't trigger because it hasn't even been checked in the last 10 minutes.  It's possible that if you waited for another check interval (now 1 hour later) 'condition met for 10 minutes' would evaluate true and the alert would trigger, but it is also possible that it isn't that granular in tracking alerts; I've never increased the evaluation interval as high as 30 minutes.  It's possible that you'd have to keep the trigger wait time larger than the evaluation interval to avoid problems.

    I'll say again, I simply think you're approaching this wrong.  When you need to figure out why you're trying to do something a flexible system isn't designed for, it is probably because it doesn't really make sense.

    If you want a CPU alert that emails someone after 10 minutes of > 90%, then leave the evaluation at 1 minute and the wait time at 10 minutes, and let the alert trigger.  If you need 'reminder' emails after x amount of time, probably to a larger group, use the trigger action escalations.  If you need to send the same email to the same person repeatedly, then you are doing it wrong.  The person doesn't need 5, 10, or 1000 emails saying that something needs to be fixed/looked at.  They need to just do their job and look at it after the first email.  Again, email doesn't disappear if the person doesn't look at it right away; it just builds up more and more tasks in their inbox.  If the recipient is so busy with so many emails that they can never look at something that came in more than 30 minutes ago, sending them more emails just makes this worse; it doesn't fix anything.
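
The plain-language reading in this reply (wait time is part of the trigger condition, evaluation interval only decides when that condition is checked, and an active alert does not fire again until it resets) can be written down as a toy model. To be clear, this is a guess at the semantics as described in the post, not the product's actual alert engine; `simulate` and its parameters are invented for the illustration, and it does not attempt to model the 'trigger each time the condition is met' behavior reported earlier in the thread.

```python
def simulate(condition, evaluate_every, wait):
    """Return the minutes at which the toy alert fires.

    condition      -- list of bools, one per minute (index = minute)
    evaluate_every -- minutes between checks of the trigger condition
    wait           -- minutes the condition must have been seen true before firing
    """
    fired = []
    true_since = None  # first evaluation at which the condition was seen true
    active = False     # an active alert does not fire again until it resets
    for minute in range(0, len(condition), evaluate_every):
        if condition[minute]:
            if true_since is None:
                true_since = minute
            if not active and minute - true_since >= wait:
                active = True
                fired.append(minute)
        else:
            # Condition seen false: the alert resets and the wait timer restarts.
            true_since = None
            active = False
    return fired


always_high = [True] * 120
print(simulate(always_high, evaluate_every=1, wait=10))   # fires once, then stays active
print(simulate(always_high, evaluate_every=30, wait=10))  # first chance to fire is the next check

# CPU high for 20 minutes, normal for 5, then high again: a reset allows a second alert.
with_dip = [True] * 20 + [False] * 5 + [True] * 35
print(simulate(with_dip, evaluate_every=1, wait=10))
```

In this model a condition that stays true never produces a second alert, whatever the evaluation interval, which matches the "not reset, so still active" explanation above.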

  • Hi slebbon

    I completely agree with you on the approach selected here. I am not in favor of this either, but the people above me want it this way, so I am trying to figure out what options I have.

    The only problem here is that they want a repeated alert. If the CPU on a device is still high after 30 minutes, 1 hour, etc., they want it to be triggered to them again. Which, as you mentioned, is not correct.

    About the sentence 'if I remove the wait time then the alert triggers continuously': in this case, if the CPU is high for 1 day, for example, then I get an alert every 30 minutes, since I have not put in any reset condition. Hope that makes sense now.

    I have also tried keeping the evaluate time shorter and the wait time longer, like 1 minute evaluation and 10 minutes wait time, but the result was the same. I don't get repeated alerts.

    Maybe I will try to convince them by stating that it's not the correct way and that we should follow the recommended process.

    If you have any other thoughts then please do let me know...

  • You can still use escalations to send additional emails, as I said, just not on a regular schedule without a lot of work building an (infinite?) number of escalations.

    The only scenario I can think of where it makes sense to repeat the alert for the same item is when the original notification wasn't enough and you need to bring in additional people/resources with a reasonable number of escalations.  For example, if a small branch office router on a generic ISP (DSL or cable) goes offline, it's likely a brief interruption, so we only generate an alert in the SolarWinds website, with no email.  If after 10 minutes the site is still down, we escalate that alert to send an email to the primary person/team responsible for the initial investigation.  If the site still remains down for another 50 minutes (1 hour outage total now), and nobody has yet logged into SW at all to acknowledge the alert, a second email is then sent to a larger general team for investigation.  Finally, if after 4 hours it's still down AND still has not been addressed (i.e. acknowledged with a comment in SolarWinds; even a comment just saying that power is out at the site and might not be back for hours will stop the escalations), the alert is escalated once again to a manager who should never be getting alerts.  He will then finally be able to investigate the issue and 1) find out why the site is down, and put in some sort of acknowledgement or comment; and 2) find out what the lower level teams were doing on their shifts instead of addressing the alert as they are supposed to.
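
The branch-office example above is effectively a small schedule of escalation steps that stops once someone acknowledges the alert. A minimal Python sketch of that schedule follows; the step labels and the `notifications` helper are hypothetical, only the timings (0, 10, 60 and 240 minutes) come from the reply, and it assumes an acknowledgement blocks every later step.

```python
# (minutes since the site went down, action taken), as described in the example.
ESCALATIONS = [
    (0, "web console alert only"),
    (10, "email primary team"),
    (60, "email wider team"),
    (240, "email manager"),
]


def notifications(elapsed_minutes, acknowledged_at=None, steps=ESCALATIONS):
    """Return the actions taken so far for an outage lasting `elapsed_minutes`.

    `acknowledged_at` is the minute someone acknowledged the alert (None if
    nobody has); any step scheduled at or after that minute is skipped.
    """
    taken = []
    for delay, action in steps:
        if delay > elapsed_minutes:
            break
        if acknowledged_at is not None and delay >= acknowledged_at:
            break
        taken.append(action)
    return taken


print(notifications(5))                        # brief blip: console only
print(notifications(90))                       # an hour and a half, nobody acknowledged
print(notifications(300, acknowledged_at=45))  # acknowledged with a comment after 45 minutes
```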

  • Thanks... Let's see how it turns out once I explain it to them in detail.

  • Hi slebbon

    Is it possible for you to re-create the same situation in your environment?

    I want to know if you get repeated alerts when evaluation is set to every 30 minutes and the trigger condition is: memory > x value for 10 minutes

    The reset condition would be: no reset, trigger each time the condition is met.

    If it's possible, then please do let me know...