
Flood of interface alerts triggered

Hello All

I received a flood of interface-down alerts for our network switches, and when I checked, those interfaces were indeed down. When I asked the network team, they found that the interfaces had been down for the last 2 months. My question is: why did the alert trigger only after 2 months, even though the same alert profile has been configured from the beginning?

Regards

Harry

  • Lots of questions on this.

    Let's start with: what alert are you using? Is it an out-of-the-box alert, or custom/edited? I think there are two out-of-the-box ones that could trigger on this, maybe more.

    Did your server go up and down?

    Was the node that the interface was on unmanaged for a period of time and just came off?

    Was this alert active prior to receiving the flood?   

    Some of these questions you might be able to answer by going into "Events" and seeing what happened with the node itself over the past couple days including the timeframe when the alert went off. 

    You might be able to get more information by looking at the "Message Center" and telling it to show "Audit Events" for the timeframe including when the alert went off.  Specifically look for Alert Added/Changed/Edited/Enabled.  If you edit an alert, it will clear and then trigger all the alerts that meet the criteria.
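
The audit check described above can also be done offline if the audit events are exported. The following is only a minimal sketch: the row layout (`time`, `message`), the keyword list, and the sample rows are all made up for illustration and are not the actual Orion export format.

```python
from datetime import datetime, timedelta

# Hypothetical labels; the wording in a real audit export may differ.
ALERT_CHANGE_KEYWORDS = ("alert added", "alert changed", "alert edited", "alert enabled")


def alert_changes_near(rows, flood_time, window_hours=24):
    """Return audit rows that mention an alert change in the window before flood_time."""
    start = flood_time - timedelta(hours=window_hours)
    hits = []
    for row in rows:
        when = datetime.fromisoformat(row["time"])
        text = row["message"].lower()
        if start <= when <= flood_time and any(k in text for k in ALERT_CHANGE_KEYWORDS):
            hits.append(row)
    return hits


# Made-up sample rows, not taken from the original poster's system.
sample_rows = [
    {"time": "2019-08-01T09:15:00", "message": "Alert Edited: Interface Down (custom)"},
    {"time": "2019-08-01T09:40:00", "message": "Node unmanaged: core-sw-01"},
    {"time": "2019-07-20T11:00:00", "message": "Alert Enabled: Interface Down (custom)"},
]
flood = datetime(2019, 8, 1, 10, 0)

for row in alert_changes_near(sample_rows, flood):
    print(row["time"], row["message"])
```

If an alert edit shows up shortly before the flood, that would fit the clear-and-retrigger behavior mentioned above; if nothing shows up, the edit theory becomes less likely.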

  • Were these actually all new alerts that had never fired before, or could they have triggered before and just now auto-retriggered in a flood? If it could have been a refire issue, you can extend the time period between auto-retriggers. The steps for this are covered in the following Success Center doc.

    Success Center

  • Hello Jeff

    Thanks for the reply.

    Below are the answers to your questions.

    1. What alert are you using? ==> A customized interface-down alert, not out of the box.

    2. Did your server go up and down? ==> The device is a switch. When checked on the device, some of the interfaces have been down since June 2019 and some since May. The interface status is showing the correct last status change, Jan 2019.

    Attached image: MBDILNCR13-31-A.PNG

    3. Was the node that the interface was on unmanaged for a period of time and just came off? ==> No, the interface was not unmanaged.

    4. Was this alert active prior to receiving the flood? ==> Yes, but it had not triggered an alert before. The image I have attached looks the same as for other network devices.

    I have checked in the Message Center, and the alert was triggered as per the condition. But I just want to know: is there any possibility that the alerts were stuck in MSMQ, or that the database entry was created only on that day?

  • Hello Jeff,

    1. Were these actually all new alerts that had never fired before, or could they have triggered before and just now auto-retriggered in a flood? ==>

    Yes, these are new alerts; before that, there is no entry for these alerts.

    Is it possible on the database end that no entry was created for these events?

    I want to dig deep into this. Let me know all the possible ways to troubleshoot it.

  • That last status change date is suspiciously close to the max setting that you can set for auto-retriggers. Realistically, I think the data might roll off of the alerthistory table, since there shouldn't be an active event tied to it, and retention of non-active alerts in the alerthistory table should be tied to event retention. Most likely your event retention setting is shorter than the 100 days, unless it was changed from the default. The events table and the alerthistory table are the places I would look to see if there is an alert for the same object that matches the auto-reset.

    For something like this if it is not an auto-reset, I would probably be thinking RabbitMQ or subs/pends and SW info service.  It has been a little while since I have had to chase something like this down, but the timing based on your example looks a lot like the auto-retrigger happened.

    For interfaces that are going to be left down for a long period of time, I usually recommend resolving the down state by shutting down the interface, unmonitoring it, or setting it to unpluggable in Orion.
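
To make the retention argument in this reply concrete, here is a rough Python sketch of the arithmetic. It assumes, as the reply theorizes, that history rows older than the event retention window are purged; the dates are made up, the 100-day gap stands in for the max auto-retrigger interval mentioned above, and the function name is hypothetical.

```python
from datetime import date, timedelta


def looks_like_new_alert(previous_trigger, flood_day, retention_days):
    """True if the earlier trigger would already be older than the retention window."""
    return (flood_day - previous_trigger) > timedelta(days=retention_days)


# Made-up dates for illustration only.
previous_trigger = date(2019, 1, 15)
flood_day = date(2019, 4, 25)

print((flood_day - previous_trigger).days)  # gap between the two triggers, in days
print(looks_like_new_alert(previous_trigger, flood_day, retention_days=30))
print(looks_like_new_alert(previous_trigger, flood_day, retention_days=365))
```

With a short retention window, the earlier trigger would no longer be in the history, so a retrigger would look like a brand-new alert. With a long window, the earlier row should still be findable.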

  • Re: #2, I was wondering if your SW server was rebooted or reconfigured, not whether the switch went up or down. If the server is rebooted or reconfigured, it might retrigger alerts.

    I think the "Last Status Change" on the switchport is the "Up X days" value (on Arista), or the last time the interface went up/down. I don't think that has anything to do with the Orion database. This is in regard to the "retriggered" suggestion.

    You could try going into the database via Database Manager or SWQL Studio to test the theory that the alerts were retriggered because Orion had expired them.

    I couldn't find an "alert" retention in the database, and I think my "event" retention is at the default of 30 days, so I'm not sure whether there is an alert expiration or not.

    If your alert is custom, and you didn't see it edited the day it sent out all the alerts, you might want to post the definition so we can look at it.   Unfortunately I don't see a "last changed" date in the AlertDefinition table, so no way to double check and make sure the alert wasn't modified.   If there isn't something in audit messages about it, then we can't tell for sure...