
SolarWinds generating false alerts on a daily basis

Hi all,

We are managing a SolarWinds server (NPM and NCM) to monitor our network devices, but every day around midnight (between 1:00 AM and 1:05 AM) we receive down alerts from NPM for some of the Cisco devices. When we actually check those devices, the uptime is very long and no network issues were seen during those times. Can anyone help me move forward on this?

NOTE: For instance, if we are monitoring 100 devices, I am not getting alerts for exactly the same 100 devices during that window each day, but I will receive alerts for around 85% of them.

Regards,

Vignesh Pandurangan

  • 1. Check your polling interval

    2. Check your alert trigger condition

    We have ours set to poll every 120 seconds and not to alert unless it is down for 2 minutes. So if I understand this correctly, it won't alert us unless it is down for 2 polling cycles (a max of 4 minutes).


    I stand corrected.

  • Please post your alert trigger information. It could be a momentary hiccough.

    Speaking to fcpsolaradmin's comment, what you have there is a VERY tight timeframe.

    If you are polling every 2 minutes, and you are triggering when it's been true for 2 minutes, you run the very real possibility that you will trigger based on one bad ping cycle (because SolarWinds won't have time to re-check the server). Here's the way the timing works out:

    12:00 - SolarWinds sends ONE ping, all good

    12:02:00 - SolarWinds sends ONE ping, it fails

         Server marked "warning"

        12:02:00 - 12:02:50 - SolarWinds sends one ping every 5 seconds until 10 pings in a row fail.

         Let's assume it fails all 10 pings.

         Server marked down

    12:03 - SolarWinds alert manager runs a check, finds this server is now down. The 2-minute timer begins.

        (Remember, the alert manager checks each alert condition once every minute by default, so this delay could be longer, but we'll assume everything is running like a top.)

    12:04 - nothing happens, but let's say the node is now UP.

    12:05 - SolarWinds checks node again. Ping succeeds

    12:05 - MEANWHILE, alert manager checks again, server is still down, creates alert

    12:05:01 - 12:05:50 - SolarWinds sends a ping every 5 seconds until 10 pings succeed

        Let's assume server passes all 10 pings

        Server marked up

    12:06 - technician receives alert. Goes to server

       Server is up.

    12:10 - technician arrives at your desk and begins a 20-minute long rant about how "your" monitoring sucks, and never tells him real information, and that he's not going to respond to any more of your garbage alerts, and please stop monitoring his precious servers because he'll do it himself with WhatsUpGold.

    SUMMARY: While we always want to know as soon as possible when a device is down, you MUST understand the interplay between polling cycles and alert delays so that you don't end up creating alarms when there's been no secondary verification.
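
    If it helps to see the arithmetic, here is a rough sketch of that timeline in plain Python (my own illustration, not anything SolarWinds ships): a 120-second polling cycle, an alert manager that evaluates once a minute, and a 120-second trigger delay. The alert still fires at 12:05, moments before the node is marked up again.

    ```python
    # Rough timeline sketch (plain Python, not SolarWinds code):
    # 120 s polling, alert manager evaluates once a minute,
    # "condition must exist for more than" = 120 s.

    TRIGGER_DELAY = 120  # seconds the node must stay "Down" before the alert fires

    # (seconds after 12:00, node status as SolarWinds records it)
    status_changes = [
        (0,   "Up"),    # 12:00:00 - regular poll succeeds
        (170, "Down"),  # 12:02 poll fails, 10 fast pings fail by 12:02:50
        (350, "Up"),    # 12:05 poll succeeds, 10 fast pings pass by 12:05:50
    ]

    def status_at(t):
        """Return the recorded node status at time t (seconds after 12:00)."""
        current = "Up"
        for when, status in status_changes:
            if when <= t:
                current = status
        return current

    down_since = None
    for t in range(0, 361, 60):  # the alert manager wakes up once a minute
        if status_at(t) == "Down":
            if down_since is None:
                down_since = t   # 12:03 - the 2-minute timer begins
            if t - down_since >= TRIGGER_DELAY:
                print(f"{t} s: alert fires (node has shown Down since {down_since} s)")
        else:
            down_since = None

    # Prints "300 s: alert fires ..." - i.e. 12:05, just before the node is
    # marked Up again at 12:05:50. One bad polling cycle was enough to page someone.
    ```

    Stretch the delay to 240-300 seconds and the 12:05 check still finds the timer short of its threshold, so the 12:05:50 "Up" clears the condition before anyone gets paged.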

  • Leon Adato Great breakdown. I had to explain the same thing the other day and didn't do it justice in comparison to this.

    I will refer others to this post.

  • "12:10 - technician arrives at your desk and begins a 20-minute long rant about how "your" monitoring sucks, and never tells him real information, and that he's not going to respond to any more of your garbage alerts, and please stop monitoring his precious servers because he'll do it himself with WhatsUpGold"

    This ^^ guy was talking to me this morning.

    OK @Leon Adato, so should I just keep "Do not trigger action until the condition exists for more than" at the default 0 seconds?

    I have a mixed environment where some nodes poll every 60 seconds and others poll every 120 seconds.

  • I would actually make the delay longer - between 3 and 5 minutes. That way you get AT LEAST two polling cycles to confirm the box is really down before "that guy" gets a message.

    Alternatively, you can create a custom property ("servicelevel") where "level0" indicates a high-criticality, poll-every-60 box and "level1" indicates a less critical box that's polled every 120. Then you have two alerts - one that triggers after 3 minutes but adds the clause "where Nodes.ServiceLevel is equal to level0", and the other that triggers after 5 minutes but is limited to ServiceLevel = level1.
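
    If you end up scripting that instead of clicking through the web console, here's a minimal sketch using the orionsdk Python client. The "ServiceLevel" property, node name, hostname, and credentials are just placeholders from this example, not anything built into Orion.

    ```python
    # Minimal sketch using the orionsdk Python client (pip install orionsdk).
    # "ServiceLevel", the node name, host, and credentials are placeholders
    # from the example above - adjust to your own custom property and environment.
    from orionsdk import SwisClient

    swis = SwisClient("orion.example.com", "admin", "password")

    # Tag one node as high-criticality ("level0" = poll every 60 s, alert after 3 min)
    node_uri = swis.query(
        "SELECT Uri FROM Orion.Nodes WHERE Caption = @name", name="core-sw-01"
    )["results"][0]["Uri"]
    swis.update(node_uri + "/CustomProperties", ServiceLevel="level0")

    # Sanity-check the grouping before scoping the two alerts to it
    rows = swis.query(
        "SELECT n.Caption, n.CustomProperties.ServiceLevel AS ServiceLevel "
        "FROM Orion.Nodes n WHERE n.CustomProperties.ServiceLevel IS NOT NULL"
    )["results"]
    for row in rows:
        print(row["Caption"], row["ServiceLevel"])
    ```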

  • Thank you. Next question (sorry to thread-jack): I have a secondary site where everything connects to a core switch (the parent). When our WAN goes down, it sometimes sees the child nodes as down before the parent node, creating a flood of alerts. Do I just set that parent node's status polling cycle to a shorter time?

  • Hi all,

    Thanks for your suggestions. I have asked my support team (who actually have access to that server) to check the polling interval settings and the alert trigger condition.

    I will get back with the results ASAP!

    Thanks & Regards,

    Vignesh Pandurangan.

  • In SolarWinds, parent-child really just "injects" an extra polling cycle. So you have a greater chance of seeing the switch go down before the server. BUT if the servers are polling at 60 seconds and the switch is polling at 120, you are right that parent-child is going to fail you sometimes.

    It's another reason to have alert triggers that are double (or more) the polling cycle - there's a quick sketch of that rule of thumb at the end of this post.

    <PHILOSOPHY>

    I realize there are some systems that are hyper-critical. But the majority of systems in an enterprise can be down for 5 or 10 minutes without thousands-of-dollars-per-minute impact. And the fact (which flies in the face of the "five-nines" mentality) is that systems DO go down. All the time. Servers spontaneously reboot. Services drop and come back up. Etc. And life goes on.

    Generally speaking (and based on my experience - 25 years in IT, 12 of that in the monitoring space), a 10 minute outage before action is perfectly acceptable when balanced against repeated false alarms at 2am. That's for MOST systems. Also note that our most important systems are also often our most redundant systems. Sure, the web server crashed. But it's part of a 3-server cluster behind a load balancer. The SERVICE (customers getting to the website) still has five-nines. It's just that one server that had a blip.

    </PHILOSOPHY>
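
    And to put a number on that "double (or more) the polling cycle" rule of thumb - purely my own arithmetic, not a SolarWinds setting:

    ```python
    # Rule-of-thumb sketch (my own arithmetic, not a SolarWinds feature):
    # the trigger delay should cover at least two polling cycles of the
    # slowest poller involved, so parent and child both get re-checked.

    def min_trigger_delay(*poll_intervals_s, cycles=2):
        """Minimum 'condition exists for more than' delay, in seconds."""
        return cycles * max(poll_intervals_s)

    # child nodes polled every 60 s behind a core switch polled every 120 s
    print(min_trigger_delay(60, 120))  # 240 -> round up to a 4-5 minute delay
    ```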

  • Thank you for your advice, I certainly appreciate it.

  • OMG. How I wish I had you working with me when I was still in corporate culture.

    Then my boss would have heard the same rants from TWO employees! ;-)