SolarWinds Agent Issue RCA & Troubleshooting

Dear Thwack forum members,

I seek your guidance on handling SolarWinds agent-related issues, from detecting the cause of an issue to troubleshooting it.

Our current environment monitors 10,000 servers using the SAM agent-based approach. On average we see about 100 random agent issues per day, and below are our troubleshooting guidelines.

1. Restart the SolarWinds agent service (via automation tools).

2. (If step 1 doesn't help) Re-initialize agents (manually).

While the above troubleshooting mostly resolves the issues, since we see random agent issues daily we would like to understand the RCA procedure for them. Below are the areas where we seek your help:

- SolarWinds agent log-based analysis: which logs should we ideally look at on the agent side?

- Is there an automated/scripted way to re-initialize agents on both Windows and Linux?

  • Woah. 10,000 servers.

    So I have a few questions...

    1. How is your SolarWinds environment structured?
    2. Do you have any additional polling engines? If so, how many?
    3. Of those 10k servers, how are they distributed amongst your polling engines?
    4. Are you primarily using active or passive agent communication?

    I ask because the guidelines recommend no more than approximately 1,000 monitored agents per polling engine (at default polling rates). If you're near or over that per polling engine, you may be taxing the engines' capacity.

    In theory, with default polling you would need 10 polling engines at full capacity (10,000 agents ÷ ~1,000 agents per engine).

    https://documentation.solarwinds.com/en/success_center/orionplatform/content/orion_platform_scalability_engine_guidelines.htm

    Depending on the communication method used (active vs. passive), you could be seeing ephemeral port exhaustion.

    https://support.solarwinds.com/SuccessCenter/s/article/Ephemeral-Port-Exhaustion?language=en_US 

    https://documentation.solarwinds.com/en/success_center/orionplatform/content/core-agent-requirements-sw476.htm

  • Thanks for the response; I sincerely appreciate your time in providing this information.

    We have 15 polling engines and split the load equally among the APEs; about 600 agents per APE is our target.

    All agents use agent-initiated communication.

    Looking for some guidance on managing agent health.

  • Can you please share how you are using automation tools to do the SolarWinds agent service restart? What tools, and how are you kicking it off? Are you using an alert to kick off a program?

  • Hi, thanks for engaging in this topic.

    Yes, agent issues are identified via alerts based on two different criteria:

    1. Agent connection status in the 'Manage Agents' view

    2. No data collected for the node in the last 10-20 minutes, based on the 'LastSystemUpTimePollUtc' attribute via a SWQL query (a sketch follows below).
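
    To make the second criterion concrete, this is roughly the shape of our query, simplified. Treat the entity and column names (LastSystemUpTimePollUtc, the agent join) as assumptions to verify against your own SWIS schema; you can test it from PowerShell with the SwisPowerShell module:

    # Sketch: agent-managed nodes whose last uptime poll is older than 20 minutes.
    # Entity/column names are assumptions; verify them in SWQL Studio first.
    Import-Module SwisPowerShell
    # "orion.example.com" is a placeholder for your Orion server.
    $Swis = Connect-Swis -Hostname "orion.example.com" -Credential (Get-Credential)
    $swql = "
        SELECT n.NodeID, n.Caption, n.LastSystemUpTimePollUtc
        FROM Orion.Nodes n
        JOIN Orion.AgentManagement.Agent a ON a.NodeID = n.NodeID
        WHERE n.LastSystemUpTimePollUtc < ADDMINUTE(-20, GETUTCDATE())"
    Get-SwisData -SwisConnection $Swis -Query $swql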

    After that, we use the open-source Ansible tool to perform the agent restart:

    For Windows - restarting the SolarWinds agent Windows service (see the PowerShell sketch below)

    For Linux - systemctl restart swiagent
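
    In case it helps, the Windows-side restart is essentially the following PowerShell. The display-name pattern "SolarWinds Agent*" is an assumption; confirm the exact service name on your hosts with Get-Service first:

    # Restart the SolarWinds agent service on a Windows host.
    # The display-name pattern below is an assumption; verify with Get-Service.
    Get-Service -DisplayName "SolarWinds Agent*" | Restart-Service -Force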

    Looking for guidance on what to do if the above steps don't resolve the issue, and on understanding what caused it.

  • Thank you for providing details on your process. I would like to see this as an option in the product itself, as we also see events daily where the agent goes unresponsive. Since you're using open-source tools, I'll open a feature request to see if they can add this functionality to the suite. I'm eagerly following along, as we experience the same issues (though with a much smaller agent install base).

  • Have you opened a support case on this?

    Pick maybe a small handful of servers assigned to each polling engine and set their agent logging level to debug. This may be difficult if it's a sporadic issue and not always the same agents. Wait until the issue occurs on them, then take diagnostics on one or two agents and on all of Orion.

    Share that with SolarWinds Support.

    documentation.solarwinds.com/.../core-editing-agent-configuration-sw440.htm

  • It may also be good to include your networking team to see if there are any related events occurring, and/or at least rule out the basics (if not done already): ping response time, etc. Basic network triage.

    Maybe even set up a TCP port check on 17778 against their assigned polling engine?
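
    If you script that check, a minimal PowerShell sketch would be (the poller hostname is a placeholder):

    # Verify the agent host can reach its assigned polling engine on port 17778.
    # "poller01.example.com" is a placeholder hostname.
    Test-NetConnection -ComputerName "poller01.example.com" -Port 17778 |
        Select-Object ComputerName, RemotePort, TcpTestSucceeded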

  • Hi @dhinagar_j

    I did this with an alert. I started with an advanced SWQL alert condition that looks for online agents that haven't updated CPU load stats in 180 minutes. That time can be adjusted though.
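
    The condition boils down to something like the sketch below. It is a sketch, not my exact alert SWQL: I'm using the node's LastSystemUpTimePollUtc as a stand-in for the CPU-stats timestamp, and assuming ConnectionStatus = 1 means "Connected"; verify both against your schema.

    # Sketch: agents that show as connected but whose node stats are 180+ minutes old.
    # $Swis is a SWIS connection (see the script sketch further down).
    $swql = "
        SELECT a.AgentID, a.Name
        FROM Orion.AgentManagement.Agent a
        JOIN Orion.Nodes n ON n.NodeID = a.NodeID
        WHERE a.ConnectionStatus = 1
          AND n.LastSystemUpTimePollUtc < ADDMINUTE(-180, GETUTCDATE())"
    Get-SwisData -SwisConnection $Swis -Query $swql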

    I then added a trigger action to call a PowerShell script which restarts the agent via a SWIS API call. You have to pass the AgentID to the script as an argument and handle authentication, but the main PowerShell command to do this is:

    Invoke-SwisVerb -SwisConnection $Swis -EntityName Orion.AgentManagement.Agent -Verb RestartAgent -Arguments @($AgentID)
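
    For completeness, the surrounding script is roughly this. It is a sketch: the Orion hostname is a placeholder, and I'm assuming the SwisPowerShell module; use -Credential instead of -Certificate if the script doesn't run on the Orion server itself.

    param(
        [Parameter(Mandatory)]
        [int]$AgentID
    )

    # SwisPowerShell provides Connect-Swis, Get-SwisData, and Invoke-SwisVerb.
    Import-Module SwisPowerShell

    # "orion.example.com" is a placeholder; -Certificate authenticates with the
    # local Orion certificate when run on the Orion server.
    $Swis = Connect-Swis -Hostname "orion.example.com" -Certificate

    # Restart the agent through the Agent Management verb.
    Invoke-SwisVerb -SwisConnection $Swis -EntityName Orion.AgentManagement.Agent -Verb RestartAgent -Arguments @($AgentID)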

    I also created an escalation trigger action to move the agent to another polling engine if the action above doesn't solve the issue. As part of the script it calls, I pull polling engine stats to determine the polling engine with the least load, and then move the agent there with the command below.

    Invoke-SwisVerb -SwisConnection $Swis -EntityName Orion.AgentManagement.Agent -Verb AssignToEngine -Arguments @($AgentID, $($TargetPollingEngine.EngineID))
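
    The engine-selection piece is essentially the query below. Using Orion.Engines.Elements as the load metric is my simplification; substitute whatever load measure fits your environment.

    # Pick the polling engine currently carrying the fewest elements.
    # $Swis is the SWIS connection from the script sketch above.
    $TargetPollingEngine = Get-SwisData -SwisConnection $Swis -Query "
        SELECT TOP 1 EngineID, ServerName, Elements
        FROM Orion.Engines
        ORDER BY Elements ASC"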

    I had a support case open for six months or so where we were troubleshooting why the agents would stop communicating even though they showed as connected. Below is a fix that resolved 95% of my agent issues.

    1. Edit the following file: "c:\Program Files (x86)\Common Files\SolarWinds\JobEngine.v2\SWJobEngineSvc2.exe.config"
    2. Find the jobEngineSettings_Full element and add the option emptyRouterCleanupInterval="-00:00:00.0010000" so it looks like this:

    <jobEngineSettings_Full maxWorkerProcess="10"
    jobExecutionRetention="01:00:00"
    jobExecutionEnabled="false"
    thresholdStreamByteThreshold="131072"
    maxWorkerProcessLogFileAge="5.00:00:00"
    maxHungJobs="1"
    jobHungTimeout="00:00:30"
    maxJobThreadsPerWorkerProcess="100"
    maxJobThreadsPerWorkerProcess64="1000"
    workerMonitoringPeriod="00:00:01.0000000"
    forceSeparateWorkersOnAgent="true"
    jobErrorSuppressionInterval="01:00:00"
    jobErrorSuppressionCleanupInterval="00:05:00"
    emptyRouterCleanupInterval="-00:00:00.0010000"
    />

    3. Restart the Job Engine v2 service.

    This has to be done on all engines.
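
    If you want to script that rollout, something like this works. It is a sketch: the engine host list is a placeholder, and the service name "SWJobEngineSvc2" is an assumption; confirm it with Get-Service on one engine first.

    # Restart the Job Engine v2 service on every polling engine.
    # Host names are placeholders; verify the service name before running.
    $engines = @("poller01", "poller02", "poller03")
    Invoke-Command -ComputerName $engines -ScriptBlock {
        Restart-Service -Name "SWJobEngineSvc2" -Force
    }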

  • Do you have the case number for that support case that was open for six months? I'd like to share it with my support person to see if it would assist in my environment.

  • Yeah, the case# was 00869953. Now that I look back at it, it was actually open for a little over 9 months!