
Alerting Service Processing Issue (2023.2.1)

I hate to ask this question, but I have no choice.  I need to understand how the Alerting Service on your primary poller works from cradle to grave.  The reason I am asking is that, for no obvious reason, after a recent upgrade to 2023.2.1, ours just stops processing.  We have an active case open, we are working with an Application Engineer who is working with development, and we have made multiple production changes, but the issue remains.  I need to understand all the moving parts so I can deep dive on each one.  Hence the question.

First, we found that the team behind one of our core apps was not putting their servers into maintenance mode every night when they perform server maintenance (alert storm) < MUTE the alerts, done

This helped; we even thought it was the root cause.  Then we disabled ALL the canned alerts that we were not using (we should have already done this, ugh).  We made several other core runtime and optimization changes, yet it is still happening.

I know that the alerting service is making simple SQL calls to the database.  I am looking for specifics: does the Job Engine manage all of this via SWIS, etc.?

Can anyone elaborate?

Thanks....

  • Is there a specific time that the alerting stops working, or is it at random? When this occurs, is everything still polling just fine? 

  • HA environment, 10 active, 10 DR.  All production pollers (active) are on the same network segment with 10 Gb links, and the database is on the same network.  New primary (physical) server: Dell PowerEdge MX750c, dual processor, 32 cores, 64 logical processors, averaging 13 to 18% CPU utilization, 128 GB of memory, all SSDs, 10 Gb network connection.

    Your answer: polling good, collection good, database synchronization good, RabbitMQ showing no real queuing.  Syslog, traps, messages, events, all GOOD.

  • Initially it was around 4:00 AM every morning.  We found that an unmanage task I built in 2016 did "not" get properly moved to the new poller 1 server when we upgraded.  It was slamming us; we recreated the job and thought that this event WAS the root cause.  For almost a week we were solid, then this last weekend, here it comes again, the failure.

    Your answer: random; the rest of your answer is in my other response.  No other effects on the core that we can see, just alerting.

  • Development was able to recreate this issue today; it is a confirmed bug in 2023.2.1.

    Description: The Alerting Service/Engine stops processing alerts for an extended period of time.  Initially for us it was hours.  We then made several changes to lessen the runtime load on the engine, which in turn reduced the stall from hours to minutes.

  • Does this exist in 2023.3 as well?

  • YES, it is also in 2023.3, found out today!

    We are running 2023.2.1 across four environments: Test, QA (a small HA test environment), our Watcher environment (which monitors our production Orion environment exclusively), and Production, an HA environment with 9 APEs.

    We have no experience with 2023.3 yet.  I was waiting on you to tell me "it's safe", just kidding....

    Alerting Issue Case Number: 01415926

  • I'm upgrading to 2023.3 next Monday :)

  • I wish you the best of luck.  I did just confirm in our bi-weekly meeting with SolarWinds that this bug DOES exist in 2023.3.

    For this issue:  (Alerting engine STALLS and stops processing Alerts)

    In the Alerting.Service.V2.log on your main poller.

    Look for (1) WARN  SolarWinds.Orion.Core.Alerting.Service.AlertConfigurationLock - Acquire(#668) and

    (2) WARN  SolarWinds.Orion.Core.Alerting.Service.ConditionsStateEvaluator - Long Running Condition - Condition:

    Note: (#668) is the AlertID, FYI
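
    If it helps, here is a minimal sketch (not SolarWinds tooling) for scanning that log for the two WARN markers with Python.  The log path is an assumption based on the default Orion log folder, so adjust it for your install.

    from pathlib import Path

    # Assumed default Orion log location on the main poller -- verify on your box.
    LOG_PATH = Path(r"C:\ProgramData\SolarWinds\Logs\Orion\Alerting.Service.V2.log")

    # The two stall signatures called out above.
    MARKERS = (
        "SolarWinds.Orion.Core.Alerting.Service.AlertConfigurationLock - Acquire",
        "SolarWinds.Orion.Core.Alerting.Service.ConditionsStateEvaluator - Long Running Condition",
    )

    def find_stall_warnings(log_path: Path) -> list[str]:
        """Return WARN lines that match either stall marker."""
        hits = []
        with log_path.open(errors="ignore") as log:
            for line in log:
                if "WARN" in line and any(marker in line for marker in MARKERS):
                    hits.append(line.rstrip())
        return hits

    if __name__ == "__main__":
        for hit in find_stall_warnings(LOG_PATH):
            print(hit)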

    Create an on-demand alert, something real simple, like:

    Trigger Condition

    > Node / Node Name / is equal to / [your main poller]

    > Trigger Action > Email yourself

    Then, from the Alerts & Activity view, find the alert and clear it; it should immediately fire again and be viewable in the Alerts view.

    Now that you have a manual system check (an on-demand check), you can automate it if you want; we did (a rough sketch follows).
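
    For what it's worth, here is one way that check can be automated, assuming the Python orionsdk package and SWIS access.  The host, credentials, and the alert name "On-Demand Alerting Canary" are placeholders, and the Orion.AlertActive / Orion.AlertObjects / Orion.AlertConfigurations column names are what our environment exposes, so verify them in SWQL Studio before trusting the output.

    from orionsdk import SwisClient

    # Placeholder host and credentials -- substitute your own.
    swis = SwisClient("your-main-poller", "swis_user", "swis_password")

    # Look up the most recent trigger time of the canary alert.
    # "On-Demand Alerting Canary" is a placeholder name for the alert built above.
    rows = swis.query(
        """
        SELECT aa.TriggeredDateTime
        FROM Orion.AlertActive aa
        INNER JOIN Orion.AlertObjects ao ON ao.AlertObjectID = aa.AlertObjectID
        INNER JOIN Orion.AlertConfigurations ac ON ac.AlertID = ao.AlertID
        WHERE ac.Name = 'On-Demand Alerting Canary'
        """
    )["results"]

    if not rows:
        print("Canary alert is not active -- if you just cleared it, the engine may be stalled.")
    else:
        print("Canary last triggered at", max(r["TriggeredDateTime"] for r in rows))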

    It appears to me that what is hosing up the Service/Engine is the NUMBER of trigger actions you have on your alerts; logging to a file as a trigger action is one of the biggest drains on the engine, based on our investigation with support/development.

    It seems to have started (for us) in the 2023.2.x code.
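
    If you want to size up where your trigger actions are concentrated, a hedged SWQL sketch like the one below is one way to do it.  The Orion.ActionsAssignments entity and its ParentID-to-AlertID mapping are how things looked in our environment, so treat the entity and column names as assumptions and confirm them in SWQL Studio.

    from collections import Counter
    from orionsdk import SwisClient

    # Placeholder host and credentials -- substitute your own.
    swis = SwisClient("your-main-poller", "swis_user", "swis_password")

    # Assumption: Orion.ActionsAssignments.ParentID maps to the alert's AlertID
    # for alerting actions.  Verify this relationship in SWQL Studio first.
    rows = swis.query(
        """
        SELECT ac.Name AS AlertName, aa.ActionID
        FROM Orion.AlertConfigurations ac
        INNER JOIN Orion.ActionsAssignments aa ON aa.ParentID = ac.AlertID
        """
    )["results"]

    # Alerts with the most trigger actions float to the top.
    counts = Counter(row["AlertName"] for row in rows)
    for name, action_count in counts.most_common():
        print(f"{action_count:>4}  {name}")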

    If you run into this issue after the upgrade, Bob, to the point where it is affecting SolarWinds service delivery, reach out to me directly and I can share several things you can do to mitigate it, at least until a fix is available.

    Again, best of luck with the upgrade, my friend.  Other than this issue, looking at others who have upgraded and reported here on Thwack, it looks pretty clean and stable post-upgrade.  We can only hope this is true.

  • Thank you for sharing. This is great information for anyone running into this issue. 

  • 2023.2 has a bug in which your alerts can get garbled.  If they're garbled, you can get errors, and if you get errors, you can get weird behaviour like this.

    2023.3 is much better

    If you had an alert that was previously not storming and now is, I'd check it and its SWQL to make sure it's not all messed up.

    You can probably find some information in the alerting service log.
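
    One quick, hedged way to do that check is to pull the alert definitions out of SWIS and eyeball them.  This sketch assumes the Python orionsdk package, placeholder host and credentials, and the Orion.AlertConfigurations columns our environment exposes; verify the names in SWQL Studio.

    from orionsdk import SwisClient

    # Placeholder host and credentials -- substitute your own.
    swis = SwisClient("your-main-poller", "swis_user", "swis_password")

    # List every alert definition so garbled names or unexpectedly disabled
    # alerts stand out.  Column names are assumptions -- verify in SWQL Studio.
    rows = swis.query(
        "SELECT AlertID, Name, Enabled FROM Orion.AlertConfigurations ORDER BY Name"
    )["results"]

    for row in rows:
        state = "enabled " if row["Enabled"] else "disabled"
        print(f'{row["AlertID"]:>6}  {state}  {row["Name"]}')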