Alert email notifications inconsistent

After upgrading to 2020.2.1 HF2, some email notifications are not being sent. A user noticed this when he received the "Reset" emails for some alerts but never the initial "Trigger" emails. Some of these alerts also have an API action that executes successfully, but again, the email action is never sent.

It is not an issue with the SMTP configuration, as both the Trigger action and the Reset action use the same SMTP configuration. It isn't a problem with the action configuration either, as I added the same action to another alert and that email is received. If I recreate the alert from scratch (not using the Duplicate & Edit button), everything works as expected.

What worries me is that there may be other alerts with this bug that we don't know about, and we won't be notified when there is a real problem.

Has anyone experienced this before? Is there a query you can run to find out which alerts have this bug, and better yet, is there a query that fixes them so I don't have to recreate every alert just to be sure they all work?
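
As a starting point, something like the query below would at least list every alert with its assigned actions and whether each one is enabled, as a first pass at auditing them. This is only a sketch using the Python orionsdk client; the entity and property names (Orion.ActionsAssignments, ParentID, and so on) are my best guess and should be verified against the schema in SWQL Studio.

    import requests
    from orionsdk import SwisClient

    requests.packages.urllib3.disable_warnings()
    swis = SwisClient("your-orion-server", "your-user", "your-password")

    # Entity and property names are assumptions -- confirm them in SWQL Studio,
    # since the schema can differ between Orion versions.
    results = swis.query("""
        SELECT ac.AlertID, ac.Name AS AlertName, a.Title AS ActionTitle, a.Enabled
        FROM Orion.AlertConfigurations ac
        INNER JOIN Orion.ActionsAssignments aa ON aa.ParentID = ac.AlertID
        INNER JOIN Orion.Actions a ON a.ActionID = aa.ActionID
        ORDER BY ac.Name
    """)

    for row in results['results']:
        print(row['AlertName'], '|', row['ActionTitle'], '| enabled:', row['Enabled'])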

  • This is a new post for the issue I accidentally added to a different thread. Case #00710923

  • Thanks for the heads up. I'll keep an eye on it.

  • Something you could do is add another alert action to log the event to a file with a timestamp and the alert string.

    This way you know the alert fired. To go a step further, add the logging action before the email action and then again after, so if it hangs or gets interrupted you won't get the second log message. Useful in troubleshooting; the sketch below shows one way to check for the missing second entry.
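
    Once both logging actions are in place, a quick script can flag any trigger that wrote the "before" entry but never the "after" one. A rough sketch in Python; the BEFORE-EMAIL / AFTER-EMAIL markers and the log path are just placeholders for whatever text and file you configure in the two logging actions.

        from collections import Counter

        LOG_FILE = r"C:\AlertLogs\alert-actions.log"   # hypothetical path

        pending = Counter()
        with open(LOG_FILE) as log:
            for line in log:
                if "BEFORE-EMAIL" in line:
                    pending[line.split("BEFORE-EMAIL", 1)[1].strip()] += 1
                elif "AFTER-EMAIL" in line:
                    pending[line.split("AFTER-EMAIL", 1)[1].strip()] -= 1

        for alert, count in pending.items():
            if count > 0:
                print(f"{alert}: {count} trigger(s) never reached the email action")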

  • This is my personal best practice. I always log an event to the NetPerfMon log on all alert triggers (and each escalation level) and on alert clear actions. It gives me a record on a per-element basis.

  • That is what we like to do as well.

    It then gets ingested into Splunk so we can pull an entire time period of events and correlate it with other tools like ServiceNow, emails, and actions by other parties. If an alert template fails partway through (inconceivable... right?), having a beginning/ending log entry lets you know it failed somewhere, because you'll never get the ending log entry for that template.

  • So, working with Support, three things came up.

    1. Some old, low-priority alerts had a sender email address that was being blocked. Although I had already fixed that before opening the case by changing the SMTP server used, it was still insightful.

    2. Support gave me some scripts to run and .db files to delete while Orion was shut down, in order to clean up "subscriptions in the database." I'm not sure what the impact of this one was.

    3. Before shutting down Orion I disabled all the actions. While I was in the UI doing that, I noticed some of the actions were already disabled. I didn't stop to check whether they were related to the alerts we weren't getting emails for, but after item 2 above I made sure all actions were enabled again. The "Select All" feature might not actually be "Select All," as I had to do it twice. (Where is the headslap emoji?)
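
    To double-check that nothing was left disabled after re-enabling, the same thing can be queried instead of trusting the UI. A sketch with the Python orionsdk client; the Orion.Actions property names are assumptions to verify in SWQL Studio.

        import requests
        from orionsdk import SwisClient

        requests.packages.urllib3.disable_warnings()
        swis = SwisClient("your-orion-server", "your-user", "your-password")

        # Enabled/Title are assumed property names -- confirm in SWQL Studio.
        disabled = swis.query(
            "SELECT ActionID, Title, Enabled FROM Orion.Actions WHERE Enabled = false"
        )

        for row in disabled['results']:
            print(f"Action {row['ActionID']} is still disabled: {row['Title']}")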

    I am still getting some errors in the logs, though, about a trigger evaluator failing.

    2021-01-26 11:43:11,529 [93] ERROR SolarWinds.Orion.Core.Alerting.Service.ConditionsStateEvaluator - Condition 'AlertId: 387, AlertLastEdit: 1/3/2020 5:06:17 PM, ConditionIndex: 0, Type: Trigger' Evaluator failed - Condition evaluation failed for query = (SELECT E0.[Uri], E0.[DisplayName] FROM Orion.Groups AS E0 WHERE ( ( ( E0.[Name] = @p0 ) ) AND ( ( E0.[Status] = @p1 ) OR ( E0.[Status] = @p2 ) ) )), condition = (AlertConditionDynamic: scope=(AND ([Orion.Groups|Name] = 'M_Citrix_DDC')): ( ([Orion.Groups|Status] = '2') OR ([Orion.Groups|Status] = '3') )) - System.ServiceModel.FaultException`1[SolarWinds.InformationService.Contract2.InfoServiceFaultContract]: Invalid username or password. (Fault Detail is equal to InfoServiceFaultContract [ SolarWinds.Data.AccessDeniedException: Invalid username or password. ] ).

    If I recreate the alert from scratch, the new alert does not fail. I guess my last step is to recreate these 15 alerts from scratch and hope there are no unlogged alerts that are silently failing to trigger.
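
    To catch any other alerts that might have the same problem, one option might be to list the alerts last edited before the upgrade and review those. A sketch with the Python orionsdk client; the LastEdit property name is a guess based on the AlertLastEdit value in the error above, so verify it in SWQL Studio and adjust the cutoff to your own upgrade date.

        import requests
        from orionsdk import SwisClient

        requests.packages.urllib3.disable_warnings()
        swis = SwisClient("your-orion-server", "your-user", "your-password")

        # LastEdit is an assumed property name; the date is a placeholder cutoff.
        candidates = swis.query("""
            SELECT AlertID, Name, LastEdit
            FROM Orion.AlertConfigurations
            WHERE LastEdit < '2020-12-01'
            ORDER BY LastEdit
        """)

        for row in candidates['results']:
            print(row['AlertID'], row['Name'], row['LastEdit'])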