Polling Engine down, actions and discovery not working

Question

Hi Thwack,

First-time poster but long-time lurker.  We have an environment with a primary polling agent and two additional agents.  On the polling engine details page, the primary polling engine is not syncing.  Interestingly enough, any nodes that are either added to or already exist on the primary polling agent are not able to complete discovery.  Once the web console gets to "List Resources", it never completes.  When I look at the Discovery logs, I can see the discovery kick off and complete, yet the web console is never updated.

Also, no actions are being fired from any alerts.  An example of the errors we are seeing in the logs is below:

2018-05-01 14:31:32,827 [ActionsExecutionProcessingThread] ERROR SolarWinds.Orion.Core.Alerting.Service.ActionsResolverInternal.PendingExecutionActions - Action ID: 45, ActionType: WriteToNPMEventLog, Title: NetPerfMon Event Log : NetPerMon Event Log: Group ${N=SwisEntity;M=Name} is ${N=SwisEntity;M=Status;F=Status}, Description: Log the Alert in the Network Performance Monitor Event Log, Enabled: True, Order: 1   failed. alertActiveId: 4239427 alertObjectId: 1373320. Error: ProvideFault failed, check fault information.

2018-05-01 14:31:32,858 [ActionsExecutionProcessingThread] ERROR SolarWinds.Orion.Core.Alerting.Service.ActionsResolverInternal.PendingExecutionActions - System.ServiceModel.FaultException`1[SolarWinds.Orion.Core.Common.CoreFaultContract]: ProvideFault failed, check fault information. (Fault Detail is equal to SolarWinds.Orion.Core.Common.CoreFaultContract(Unknown): System.Collections.Generic.KeyNotFoundException: Action WriteToNPMEventLog doesn't exist

at SolarWinds.Orion.Core.Actions.Runners.ActionRunner.Execute(ActionDefinition actionDefinition, ActionContextBase context)

at SolarWinds.Orion.Core.BusinessLayer.CoreBusinessLayerService.ExecuteAction(ActionDefinition actionDefinition, ActionContextBase context)

at SyncInvokeExecuteAction(Object , Object[] , Object[] )

at System.ServiceModel.Dispatcher.SyncMethodInvoker.Invoke(Object instance, Object[] inputs, Object[]& outputs)

at System.ServiceModel.Dispatcher.DispatchOperationRuntime.InvokeBegin(MessageRpc& rpc)

at System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage5(MessageRpc& rpc)

at System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage41(MessageRpc& rpc)

at System.ServiceModel.Dispatcher.ImmutableDispatchRuntime.ProcessMessage4(MessageRpc& rpc)

at System.ServiceModel.Dispatcher.Immuta...).

Our network folks complained that their daily scheduled jobs were not running either, so when we attempted to run them manually we got "An unexpected error has occurred."  When I dig through the logs, I find an error very similar to the above, except it is "Action: Save to Disk" (the job is supposed to save the results to disk).  Another symptom is that the daily database maintenance is not starting automatically.  There are no log entries indicating that it started and failed; there is just nothing.  I have to manually start the database maintenance wizard every morning to keep the database size under control.

Support has had me rebuild the Core (twice), reinstall the Job engine, and run some queries to clear subscriptions and recreate them.  I am losing confidence in SolarWinds Technical Support, considering we are nearly 2 months into this case with all of the features described above still inoperable.  I would have hoped that at some point I would be working directly with someone from the "advanced team" or "engineering" that the technicians keep referencing, so we could establish some continuity of knowledge on this particular case.

Has anyone in the Thwack community had any of these issues, and do you have any suggestions as to what the root cause could be?  We are on NPM 12.1 (Windows 2008 R2, so we are unable to upgrade to NPM 12.2).

jrox904 · Answer

I checked HIPS logs and I don't see any reactions referencing the SW processes.  Furthermore, the other two additional polling agents are working just fine (as far as discovery goes).  For critical services / servers, I have been moving them off of our primary polling engine onto the APEs so I can, at a minimum, alert on disk usage and other statistics (although I have to check the alerts pane or keep it open, and then email the responsible owners of systems manually).

All of the servers are in the same OU in AD, so I have ruled out GPO interference.  I queried the database for the ActionIDs that were referenced in the logs and I was able to find them in the dbo.Actions table, so I know that they exist (assuming this is the table that they are referencing).
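
For anyone who wants to repeat that cross-check, here is a rough Python sketch of how the Action IDs and action types could be pulled out of the alerting log before looking them up in dbo.Actions.  It assumes the log lines follow the same "Action ID: N, ActionType: X," layout as the excerpt in the question; the function names and the sample list are made up for illustration and are not part of any SolarWinds tooling.  Note that the exception in the question names the action type (WriteToNPMEventLog) rather than the numeric ID, so tallying by type may be the more useful view.

```python
import re
from collections import Counter

# Matches the "Action ID: 45, ActionType: WriteToNPMEventLog," fragment
# as it appears in the log excerpt (layout assumed from that excerpt only).
ACTION_RE = re.compile(r"Action ID: (\d+), ActionType: (\w+),")


def failed_actions(lines):
    """Return (action_id, action_type) pairs found on ERROR lines."""
    found = []
    for line in lines:
        if " ERROR " not in line:
            continue
        match = ACTION_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


def tally_by_type(pairs):
    """Count failures per action type, e.g. to see which types never run."""
    return Counter(action_type for _, action_type in pairs)


# Two lines shortened from the excerpt in the question (hypothetical sample input).
SAMPLE_LINES = [
    "2018-05-01 14:31:32,827 [ActionsExecutionProcessingThread] ERROR "
    "SolarWinds.Orion.Core.Alerting.Service.ActionsResolverInternal.PendingExecutionActions - "
    "Action ID: 45, ActionType: WriteToNPMEventLog, Title: NetPerfMon Event Log, "
    "Enabled: True, Order: 1   failed. alertActiveId: 4239427 alertObjectId: 1373320.",
    "2018-05-01 14:31:32,858 [ActionsExecutionProcessingThread] ERROR "
    "SolarWinds.Orion.Core.Alerting.Service.ActionsResolverInternal.PendingExecutionActions - "
    "System.Collections.Generic.KeyNotFoundException: Action WriteToNPMEventLog doesn't exist",
]
```

In practice you would pass an open log file to `failed_actions` instead of `SAMPLE_LINES`, then compare the resulting IDs against the rows in dbo.Actions.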

d09h · Answer

Is there any host intrusion prevention (HIPS) or Group Policy Object (GPO) interference?