Alerting Broken After 2023.2 Upgrade

We upgraded our instance from 2023.1.1 to 2023.2. Since the upgrade, 30% of our alerts are broken and no longer fire, mostly our component-based alerts. The Alerting.Service log shows a "Conversion failed" SQL exception for the alert triggers:

2023-04-28 15:04:43,276 [35] WARN SolarWinds.Orion.Core.Alerting.Plugins.Conditions.Swql.ConditionEvaluatorSwql - Condition evaluation failed : RunQuery failed, check fault information.
Conversion failed when converting the nvarchar value 'net-snmp' to data type int.
2023-04-28 15:04:43,276 [35] ERROR SolarWinds.Orion.Core.Alerting.Service.ConditionsStateEvaluator - Condition 'AlertId: 326, AlertLastEdit: 4/5/2023 1:32:36 PM, ConditionIndex: 0, Type: Trigger' Evaluator failed - Condition evaluation failed for query = (SELECT E0.[Uri], E0.[DisplayName]
FROM Orion.APM.Component AS E0
WHERE ( ( ( E0.[Application].[Node].[Status] = @p0*1 ) AND ( E0.[Status] != @p1*1 ) AND ( E0.[Status] != @p2*1 ) AND ( E0.[Status] != @p3*1 ) AND ( E0.[Status] != @p4*1 ) AND ( E0.[Application].[Node].[CustomProperties].[OPS_Targeted_Alert_Node] = @p5*1 ) AND ( E0.[ComponentAlert].[UserNotes] NOT LIKE @p6 ) AND ( E0.[ComponentAlert].[UserNotes] NOT LIKE @p7 ) AND ( E0.[Application].[Node].[CustomProperties].[OPS_Targeted_Non_Crt_Node] = @p8*1 ) AND ( E0.[Application].[ApplicationAlert].[ApplicationName] LIKE @p9 ) AND ( ( E0.[Application].[Node].[Vendor] = @p1*10 ) OR ( E0.[Application].[Node].[Vendor] = @p1*11 ) OR ( E0.[Application].[Node].[Vendor] = @p1*12 ) ) ) AND ( ( E0.[Status] != @p1*13 ) ) )), condition = (AlertConditionDynamic: scope=(
([Orion.Nodes|Status|Application.Node] = '1')
AND ([Orion.APM.Component|Status] != '27')
AND ([Orion.APM.Component|Status] != '9')
AND ([Orion.APM.Component|Status] != '3')
AND ([Orion.APM.Component|Status] != '0')
AND ([Orion.NodesCustomProperties|OPS_Targeted_Alert_Node|Application.Node.CustomProperties] = '1')
AND ([Orion.APM.ComponentAlert|UserNotes|ComponentAlert] NOTCONTAINS 'NonCritcal:')
AND ([Orion.APM.ComponentAlert|UserNotes|ComponentAlert] NOTCONTAINS 'Serious:')
AND ([Orion.NodesCustomProperties|OPS_Targeted_Non_Crt_Node|Application.Node.CustomProperties] = '0')
AND ([Orion.APM.ApplicationAlert|ApplicationName|Application.ApplicationAlert] CONTAINS 'OPS Telnet - EDI Proxy Ports')
AND (
([Orion.Nodes|Vendor|Application.Node] = 'net-snmp')
OR ([Orion.Nodes|Vendor|Application.Node] = 'Sun Microsystems')
OR ([Orion.Nodes|Vendor|Application.Node] = 'Unknown')
)
): (OR ([Orion.APM.Component|Status] != '1'))) - System.ServiceModel.FaultException`1[SolarWinds.InformationService.Contract2.InfoServiceFaultContract]: RunQuery failed, check fault information.
Conversion failed when converting the nvarchar value 'net-snmp' to data type int. (Fault Detail is equal to InfoServiceFaultContract [ System.Data.SqlClient.SqlException (0x80131904): Conversion failed when converting the nvarchar value 'net-snmp' to data type int.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)
at System.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
at System.Data.SqlClient.SqlDataReader.Read()
at SolarWinds.InformationService.DataProviders.SqlQueryRelation.<GetEnumerator>d__8.MoveNext()
at SolarWinds.Data.Query.PhysicalQueryPlan.Provider...). 
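
To gauge how many alerts are affected, the failing alert IDs can be pulled out of the log. The following is a minimal Python sketch that assumes the ERROR line format shown above; the function name `failed_alert_ids` is hypothetical and not part of any SolarWinds tooling.

```python
import re

# Matches the ERROR lines written by ConditionsStateEvaluator, for example:
# "... ERROR ...ConditionsStateEvaluator - Condition 'AlertId: 326, ..."
# (pattern is an assumption based on the log excerpt in this post)
FAILED_CONDITION = re.compile(
    r"ERROR\s+\S*ConditionsStateEvaluator - Condition 'AlertId: (\d+),"
)


def failed_alert_ids(log_text):
    """Return the sorted, de-duplicated alert IDs whose condition evaluation failed."""
    return sorted({int(match.group(1)) for match in FAILED_CONDITION.finditer(log_text)})
```

Reading the alerting service log into a string and passing it to `failed_alert_ids` gives a list of alert IDs to review; DEBUG lines that merely mention an `AlertId` are ignored because the pattern requires the ERROR level.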

Debug-level logging for the Alerting log shows missing entities when alerts are fired:

2023-04-28 14:08:57,850 [48] DEBUG SolarWinds.Orion.Core.Alerting.Service.ConditionsStateEvaluator - EvaluateScheduled: nothing to evaluate, exiting

2023-04-28 14:08:57,850 [36] DEBUG SolarWinds.Orion.Core.Alerting.Service.ConditionsStateEvaluator - EvaluateScheduled: nothing to evaluate, exiting

2023-04-28 14:08:57,909 [46] DEBUG SolarWinds.Orion.Core.Common.ChannelProxy`1 - Invoking <Query>b__0 finished

2023-04-28 14:08:57,909 [46] DEBUG SolarWinds.Orion.Core.Alerting.Plugins.Conditions.Swql.ConditionEvaluatorSwql - } Start exited

2023-04-28 14:08:57,909 [46] DEBUG SolarWinds.Orion.Core.Alerting.Service.ConditionsStateEvaluator - Condition Evaluator OnNext (AlertId: 244, AlertLastEdit: 7/12/2019 6:30:24 PM, ConditionIndex: 0, Type: Trigger)

2023-04-28 14:08:57,910 [46] DEBUG SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider - Missing entity from navigation SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider+RelationsSearchItem (Orion.APM.Application) -> Orion.DPA.DatabaseInstance

2023-04-28 14:08:57,910 [46] DEBUG SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider - Missing entity from navigation SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider+RelationsSearchItem (Orion.APM.Application) -> Orion.DPA.DatabaseInstance

2023-04-28 14:08:57,910 [46] DEBUG SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider - Missing entity from navigation SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider+RelationsSearchItem (Orion.APM.Application) -> Orion.DPA.DatabaseInstance

2023-04-28 14:08:57,910 [46] DEBUG SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider - Missing entity from navigation SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider+RelationsSearchItem (Orion.APM.Application) -> Orion.DPA.DatabaseInstance

2023-04-28 14:08:57,910 [46] DEBUG SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider - Missing entity from navigation SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider+RelationsSearchItem (Orion.APM.Application) -> Orion.DPA.DatabaseInstance

2023-04-28 14:08:57,910 [46] DEBUG SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider - Missing entity from navigation SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider+RelationsSearchItem (Orion.APM.Application) -> Orion.DPA.DatabaseInstance

2023-04-28 14:08:57,910 [46] DEBUG SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider - Missing entity from navigation SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider+RelationsSearchItem (Orion.APM.Application) -> Orion.DPA.DatabaseInstanceApplication

2023-04-28 14:08:57,910 [46] DEBUG SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider - Missing entity from navigation SolarWinds.Orion.Core.Common.InformationService.SwisSchemaProvider+RelationsSearchItem (Orion.APM.Application) -> Orion.DPA.DatabaseInstanceClientApplication
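
Because these messages repeat many times, it can help to tally them by missing target entity. This is a small Python sketch that assumes the message format shown above; `missing_navigation_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches SwisSchemaProvider DEBUG messages such as:
# "Missing entity from navigation <type> (Orion.APM.Application) -> Orion.DPA.DatabaseInstance"
# (pattern is an assumption based on the log excerpt in this post)
MISSING_ENTITY = re.compile(
    r"Missing entity from navigation \S+ \(([^)]+)\) -> (\S+)"
)


def missing_navigation_counts(log_text):
    """Count (source entity, missing target entity) pairs found in the log text."""
    return Counter(MISSING_ENTITY.findall(log_text))
```

The result maps each (source, target) pair to the number of times it was logged, which makes it easier to tell support which navigation targets are reported missing.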

When we create new component-based alerts that mirror our old alerts, we can no longer select Application as a trigger condition. The error is: "Missing field ApplicationName in Orion.APM.ApplicationAlert"

The only fix is to re-create the alerts using Application instead of Component and rewrite the email alerts. Everything was running great on 2023.1.1 and we were pleased with the product. We have an open case with SolarWinds to look at this issue, but support seems to be stumped at the moment.

Also, after upgrading to 2023.2 WPM monitors started to flap, we lost our worker configuration on our players, and that module has become very noisy. Re-recording transactions, adding wait times, resolution, image match adjustments, etc. does not correct the issue. We have an open ticket for this issue as well. 

We thought this 2023.2 upgrade was going to be the same as the 2023.1 and 2023.1.1 upgrades, which completed successfully without issue. The only reason we wanted to get to 2023.2 was to address the UTC bug for last reboot that end users were complaining about. Of course, that led to the system being down with alerting broken. We have decided to wait at least 6 months before performing platform upgrades because of the issues we are seeing.

  • Dang, we are seeing these errors as well with 2023.2.1

    Our Case Number is: 01415926 

    For us the biggest result is the Alert Service stalls, i.e., stops processing alerts.

    We have made several modifications, some of which helped a lot and some not so much (that we can see). We are waiting on the results of this morning's meeting between our assigned AE and our development resource for next steps.

  • Try 2023.3 or 2023.2.2. Both include updates to the job engines. 2023.3 seems to be the most stable recent release.

  • Seems like we are talking about more than one issue in this thread. What I know is that there is an alerting issue we found in 2023.2, and it also exists in 2023.2.1, 2023.2.2 and 2023.3. Development confirmed yesterday that it is a bug; however, there are steps that can be taken to mitigate the problem until a fix is available.

    In the Alerting.Service.V2.log, look for this error:

    WARN  SolarWinds.Orion.Core.Alerting.Service.AlertConfigurationLock

    If you see this, keep an eye on the Alerting runtime, i.e., whether alerts are being processed on a timely basis. It seems to be tied to Alert/Trigger Actions, especially if you log to a file.

    This is NOT related to the JobEngine issue.

    I have a thread on it here: Alerting Service Processing Issue (2023.2.1)
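
    The log check described in this reply can be scripted. The following is a minimal Python sketch that assumes the logger name quoted above appears on the same line as the WARN level; `lock_warning_lines` is a hypothetical helper name.

```python
import re

# Matches WARN lines from the AlertConfigurationLock logger quoted in the reply.
# (the exact message text after the logger name is not known, so only the
# level and logger name are matched)
LOCK_WARNING = re.compile(
    r"\bWARN\s+SolarWinds\.Orion\.Core\.Alerting\.Service\.AlertConfigurationLock\b"
)


def lock_warning_lines(log_text):
    """Return every log line that carries an AlertConfigurationLock warning."""
    return [line for line in log_text.splitlines() if LOCK_WARNING.search(line)]
```

    Running this over the contents of Alerting.Service.V2.log at intervals and comparing the number of returned lines is one way to see whether the warning keeps recurring.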
