SolarWinds Agent Issue RCA & Troubleshooting

Dear Thwack Forum member,

 

I seek your guidance on handling SolarWinds agent issues, from identifying the cause of an issue to troubleshooting it.

Our current environment monitors 10,000 servers with SAM using the agent-based approach. We see about 100 agent issues per day on average, seemingly at random. Our current troubleshooting guidelines are below.

1. Restart the SolarWinds service (via automation tools).

2. If step 1 doesn't help, re-initialize the agent (manually).

While the steps above resolve most issues, we still get random agent issues every day, so we would like to understand how to do RCA on them. Below are the areas where we seek your help:

- SolarWinds agent log analysis: which logs should we look at on the agent side?

- Is there an automated or scripted way to re-initialize agents on both Windows and Linux?

     

 

Parents
  • Can you please share how you are using automation tools to restart the SolarWinds agent service? What tools are you using, and how are you kicking them off? Are you using an alert to kick off a program?

  • Hi, thanks for engaging in this topic.

    Yes, agent issues are identified through alerts based on two different criteria:

    1. Agent connection status in the 'Manage agents' view.

    2. No data collected for the node in the last 10-20 minutes, based on the 'LastSystemUpTimePollUt' attribute, checked with a SWQL query.
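
As a rough illustration of the second criterion, the staleness check can be sketched in Python. This is only a sketch under assumptions: it works on timestamps that the SWQL query has already returned (the query itself is not shown), and the names `find_stale_nodes`, `last_polls` and the sample node names are hypothetical, not part of any SolarWinds API.

```python
from datetime import datetime, timedelta, timezone


def find_stale_nodes(last_polls, now, max_age_minutes=20):
    """Return the names of nodes with no poll data within max_age_minutes.

    last_polls maps a node name to the UTC datetime of its last poll,
    or to None if the node has never reported (hypothetical input shape).
    """
    cutoff = now - timedelta(minutes=max_age_minutes)
    stale = [
        name
        for name, polled_at in last_polls.items()
        if polled_at is None or polled_at < cutoff
    ]
    return sorted(stale)


# Hypothetical sample data standing in for the query result.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = {
    "web01": now - timedelta(minutes=5),
    "db01": now - timedelta(minutes=25),
    "app01": None,
}
print(find_stale_nodes(sample, now))  # prints ['app01', 'db01']
```

Treating a missing timestamp as stale is a design choice here; a node that has never polled may deserve a separate alert instead.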

    After that, we use Ansible, an open-source tool, to restart the agent:

    For Windows - restart the SolarWinds agent Windows service

    For Linux - systemctl restart swiagent

    We are looking for guidance on what to do if the above steps don't resolve the issue, and on how to understand what caused it.
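
For readers without Ansible, the per-platform restart described above could be sketched roughly as follows. The Linux command is the one quoted in the post. The Windows service name is not given in the thread, so it is passed in as a parameter, and `ExampleAgentService` below is a made-up placeholder. The function name `build_restart_command` is also hypothetical.

```python
def build_restart_command(os_family, windows_service_name=None):
    """Build the argv list that would restart the agent on one host.

    Only builds the command; running it (for example with subprocess.run
    on the target host) is left to the caller.
    """
    family = os_family.strip().lower()
    if family == "linux":
        # Command taken from the post above.
        return ["systemctl", "restart", "swiagent"]
    if family == "windows":
        if not windows_service_name:
            # The real service name is an assumption the caller must supply.
            raise ValueError("windows_service_name is required for Windows hosts")
        return [
            "powershell",
            "-NoProfile",
            "-Command",
            f"Restart-Service -Name '{windows_service_name}'",
        ]
    raise ValueError(f"unsupported OS family: {os_family!r}")


print(build_restart_command("Linux"))
print(build_restart_command("Windows", "ExampleAgentService"))
```

Keeping command construction separate from execution makes it easy to log or dry-run what would be restarted before touching a large number of hosts.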

  • Thank you for providing details on your process. I would like to see this as an option in the product itself, as we also see events daily where the agent goes unresponsive. As you're using open-source tools, I'll open a feature request to see if they can add this functionality to the suite. I'm eagerly following along, as we experience the same issues (though with a much smaller agent install base).
