Solarwinds Agent Issue RCA & Troubleshooting

Dear Thwack Forum member,

 

I seek your guidance on handling SolarWinds agent issues, from identifying the root cause to troubleshooting them.

Our environment currently monitors 10,000 servers using the SAM agent-based approach. We see about 100 agent issues per day on average, occurring at random. Our troubleshooting guidelines are below.

1. Restart the Solarwinds Service (via automation tools).

2. (If step 1 doesn't help) Re-initialize agents (manually).

While the steps above resolve most issues, we still see random agent issues every day, so we would like to understand the RCA procedure for them. These are the areas where we seek your help:

- SolarWinds agent log-based analysis: which logs should we look at on the agent side?

- Is there an automated or scripted way to re-initialize agents on both Windows and Linux?

     

 

  • Hi @dhinagar_j

    I did this with an alert. I started with an advanced SWQL alert condition that looks for online agents that haven't updated CPU load stats in 180 minutes. That time can be adjusted though.



    I then added a trigger action to call a PowerShell script which restarts the agent via a SWIS API call. You have to pass the AgentID to the script as an argument and handle authentication, but the main PowerShell command to do this is:

    Invoke-SwisVerb -SwisConnection $Swis -EntityName Orion.AgentManagement.Agent -Verb RestartAgent -Arguments @($AgentID)
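
A minimal sketch of how the script around that command might look, assuming the SwisPowerShell module is installed wherever the alert action runs. The function name `Restart-OrionAgent`, the `$OrionServer` parameter, and the choice of `-Trusted` (pass-through Windows authentication) are illustrative assumptions, not the poster's actual script; use `-Credential` with `Connect-Swis` instead if that is how you authenticate.

```powershell
# Sketch only: wraps the RestartAgent verb so an alert action can pass in the AgentID.
function Restart-OrionAgent {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)][int]$AgentID,
        [Parameter(Mandatory)][string]$OrionServer   # hypothetical: hostname of your main Orion server
    )

    # Loaded inside the function so a missing module fails here with a clear error.
    Import-Module SwisPowerShell -ErrorAction Stop

    # Assumption: the account running the alert action has rights in Orion.
    $Swis = Connect-Swis -Hostname $OrionServer -Trusted

    # Same call as quoted above.
    Invoke-SwisVerb -SwisConnection $Swis -EntityName Orion.AgentManagement.Agent -Verb RestartAgent -Arguments @($AgentID)
}

# Hypothetical usage, with the agent's ID supplied by the alert:
# Restart-OrionAgent -AgentID 123 -OrionServer 'orion.example.local'
```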

    I also created an escalation trigger action that moves the agent to another polling engine if the action above doesn't solve the issue. As part of the script it calls, I pull polling engine stats to find the polling engine with the least load, and then move the agent there with the command below.

    Invoke-SwisVerb -SwisConnection $Swis -EntityName Orion.AgentManagement.Agent -Verb AssignToEngine -Arguments @($AgentID, $($TargetPollingEngine.EngineID))
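
One way the "least loaded engine" step could be sketched is shown below. The helper `Select-LeastLoadedEngine` is hypothetical, and using the `Elements` count from `Orion.Engines` as the load measure is an assumption; the post does not say which polling engine stats were used, so substitute whichever metric you trust.

```powershell
# Sketch only: pick the engine with the lowest value of a chosen load property.
function Select-LeastLoadedEngine {
    param(
        [Parameter(Mandatory)][object[]]$Engines,
        [string]$LoadProperty = 'Elements',   # assumption: element count stands in for "load"
        [int]$ExcludeEngineID = -1            # optionally skip the agent's current engine
    )

    $Engines |
        Where-Object { $_.EngineID -ne $ExcludeEngineID } |
        Sort-Object -Property $LoadProperty |
        Select-Object -First 1
}

# Hypothetical usage against a live connection ($Swis and $AgentID as in the commands above):
# $Engines = Get-SwisData $Swis 'SELECT EngineID, ServerName, Elements FROM Orion.Engines'
# $TargetPollingEngine = Select-LeastLoadedEngine -Engines $Engines
```

Keeping the selection separate from the SWIS query makes it easy to try against sample objects before pointing it at production.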

    I had a support case open for six months or so where we were troubleshooting why the agents would stop communicating even though they showed as connected. Below is a fix that resolved 95% of my agent issues.

    1. Edit the following file: "c:\Program Files (x86)\Common Files\SolarWinds\JobEngine.v2\SWJobEngineSvc2.exe.config"
    2. Find the jobEngineSettings_Full element and add the attribute emptyRouterCleanupInterval="-00:00:00.0010000" so that it looks like this:

    <jobEngineSettings_Full maxWorkerProcess="10"
    jobExecutionRetention="01:00:00"
    jobExecutionEnabled="false"
    thresholdStreamByteThreshold="131072"
    maxWorkerProcessLogFileAge="5.00:00:00"
    maxHungJobs="1"
    jobHungTimeout="00:00:30"
    maxJobThreadsPerWorkerProcess="100"
    maxJobThreadsPerWorkerProcess64="1000"
    workerMonitoringPeriod="00:00:01.0000000"
    forceSeparateWorkersOnAgent="true"
    jobErrorSuppressionInterval="01:00:00"
    jobErrorSuppressionCleanupInterval="00:05:00"
    emptyRouterCleanupInterval="-00:00:00.0010000"
    />

    3. Restart the JobEngine v2 service.

    This has to be done on all engines.
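
Since the change has to be repeated on every engine, the config edit could be scripted. The following is a minimal sketch, not a SolarWinds-supplied procedure: the function name `Set-EmptyRouterCleanupInterval` is hypothetical, and it assumes the config file uses no XML namespace, so the element can be found by name. It writes a `.bak` copy before saving.

```powershell
# Sketch only: add or update emptyRouterCleanupInterval on jobEngineSettings_Full.
function Set-EmptyRouterCleanupInterval {
    param(
        [Parameter(Mandatory)][string]$ConfigPath,
        [string]$Value = '-00:00:00.0010000'   # value quoted in the post
    )

    # Resolve to a full path because the .NET XML calls ignore PowerShell's current location.
    $fullPath = (Resolve-Path -LiteralPath $ConfigPath).Path

    # Keep a backup next to the original file.
    Copy-Item -LiteralPath $fullPath -Destination "$fullPath.bak" -Force

    $xml = New-Object System.Xml.XmlDocument
    $xml.PreserveWhitespace = $true
    $xml.Load($fullPath)

    $node = $xml.SelectSingleNode('//jobEngineSettings_Full')
    if ($null -eq $node) { throw "jobEngineSettings_Full not found in $fullPath" }

    # SetAttribute adds the attribute if missing, or overwrites an existing value.
    $node.SetAttribute('emptyRouterCleanupInterval', $Value)
    $xml.Save($fullPath)
}

# Hypothetical usage on one engine:
# Set-EmptyRouterCleanupInterval -ConfigPath 'c:\Program Files (x86)\Common Files\SolarWinds\JobEngine.v2\SWJobEngineSvc2.exe.config'
# Restart-Service -Name 'SWJobEngineSvc2'   # assumption: confirm the service name with Get-Service first
```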
