Possible to automate Restarting a SW service when polling fails?

Hello, 

We have a recurring issue where our polling engine fails randomly. We have had multiple support tickets, including one open now, and have been unable to find a root cause. So for now, we get failures multiple times a week.

Right now, I have an alert set up that emails me when a polling engine fails (hundreds of emails, which I am OK with). When I get the email, I jump online and restart the Job Engine and Collector services.

Is there a way to automate this in SW? My goal is to narrow it down to one email sent (I know how to accomplish this), automatically restart the services (no idea how), and notify me in an email (I know how to do this).

So, is there any way to get this done? These are two SW services: Job Engine and Collector. This would save a lot of time and missed statistics if I could get this accomplished. 

Thanks in advance!!!!

A future goal is to have the nodes automatically move to a separate polling engine if this process fails, but that is a pipe dream.

1 Solution
Everything you describe is actually very doable.

For the alert action, I would have it launch a PowerShell script that restarts the services, then maybe set up a 5 min escalation that emails you if the situation has not improved.
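As a rough illustration of that flow, here is a minimal Python sketch that shells out to PowerShell's `Restart-Service`, waits five minutes, and only notifies if the engine is still unhealthy. It is an assumption-laden sketch, not a tested Orion integration: the service names are placeholders to be replaced with the real ones from `Get-Service`, and `is_healthy` and `notify` are hypothetical callables you would supply yourself. In practice the wait-and-email step could equally be left to the alert's own escalation settings.

```python
import subprocess
import time

# Placeholder names (assumption): look up the real service names on the poller.
SERVICES = ["<JobEngineServiceName>", "<CollectorServiceName>"]


def restart_command(service):
    """Build the PowerShell invocation that restarts one Windows service."""
    return [
        "powershell",
        "-NoProfile",
        "-Command",
        f"Restart-Service -Name '{service}' -Force",
    ]


def restart_and_escalate(services, is_healthy, notify,
                         run=subprocess.run, sleep=time.sleep,
                         wait_seconds=300):
    """Restart each service, wait, then notify only if still unhealthy.

    is_healthy and notify are hypothetical hooks: the first returns True
    when polling has recovered, the second sends the email or message.
    Returns True if the engine recovered, False if it escalated.
    """
    for service in services:
        run(restart_command(service), check=True)
    sleep(wait_seconds)
    if is_healthy():
        return True
    notify("Polling engine still unhealthy after service restart")
    return False
```

The `run` and `sleep` parameters exist only so the logic can be exercised without touching real services.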

I have also created several variations on load balancing scripts for clients in the past. It centers on figuring out which engine you want to move from and to, and then just updating the EngineID on Orion.Nodes in the API.
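To make the EngineID update concrete, here is a small Python sketch. It assumes a client object with `query` and `update` methods shaped like `SwisClient` from the `orionsdk` Python package; the function name `move_node` is made up for illustration, and this is not the script described in this post.

```python
def move_node(swis, node_id, target_engine_id):
    """Reassign one node to another polling engine and return its URI.

    swis is assumed to behave like orionsdk.SwisClient:
      swis.query(swql, **params) -> {"results": [ {...}, ... ]}
      swis.update(uri, **properties)
    """
    rows = swis.query(
        "SELECT Uri FROM Orion.Nodes WHERE NodeID = @node_id",
        node_id=node_id,
    )["results"]
    if not rows:
        raise LookupError(f"node {node_id} not found")
    uri = rows[0]["Uri"]
    # Changing EngineID is what hands the node to the other poller.
    swis.update(uri, EngineID=target_engine_id)
    return uri
```

With the real package this would be called as `move_node(SwisClient(host, user, password), 42, 3)`, where the host, credentials, and IDs are placeholders.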

I'm on my phone so I can't pull up any examples but hopefully that gives you enough to get started.
https://thwack.solarwinds.com/t5/SAM-Discussions/Powershell-script-off-a-triggered-alert-not-working...
https://github.com/solarwinds/OrionSDK
- Marc Netterfield, Github

5 Replies

Thanks! That was an excellent, excellent start. We were able to put together a rudimentary script but have not tried it yet. Are you able to post more examples, including the load balancing scripts? This is exactly what we need! Your help is very much appreciated!

Unfortunately my script is heavily tailored to the conditions of my environment, so it's not a great example, but the basic logic is as follows. First, we group our APEs based on which ones can be treated as a pool. For example, DMZ APEs should not be interchanged with regular APEs, and we have some pollers in dedicated networks for specific reasons. Most people might not even have to do this if all of their pollers can reach all of their nodes.

Then determine which engine in the pool has the most load and which has the least, based on their load in the Orion.Engines table (if you want to get really fancy you may want to factor in SAM polling load, but I rarely see people doing that).
Once you have your highest and lowest loaded systems, you start moving nodes over, using the Set-SwisObject command to change them to the least loaded engine.
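The pool and selection steps above could be sketched in Python roughly as follows. The `load` key is a stand-in (an assumption) for whichever load figure you read for each engine, and `pool_ids` is a hypothetical set of engine IDs that are allowed to trade nodes with each other.

```python
def pick_source_and_target(engines, pool_ids):
    """Return (most_loaded_id, least_loaded_id) within one pool.

    engines: list of dicts with "EngineID" and a numeric "load" value.
    pool_ids: engine IDs that can be treated as interchangeable.
    """
    pool = [e for e in engines if e["EngineID"] in pool_ids]
    if len(pool) < 2:
        raise ValueError("a pool needs at least two engines to balance")
    busiest = max(pool, key=lambda e: e["load"])
    idlest = min(pool, key=lambda e: e["load"])
    return busiest["EngineID"], idlest["EngineID"]
```

Engines outside the pool, such as a DMZ poller, are simply ignored by the filter.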

Lots of approaches to this could work, but it's important to note that there is about a minute of lag between moving a node to a new poller and the polling load actually reflecting the change. You could try to estimate the desired change in load ahead of time and move X number of nodes at once, but I'm in no rush, so I just wrote my script to pick the node with the highest number of elements on the most loaded poller, move it to the least loaded poller, wait 30 sec, recheck the poller loads, and repeat until all pollers in the group are within x% of each other. Usually I use about 20% as my acceptable spread. Technically, due to the lag time, I tend to overrun and move a few more nodes than I absolutely needed to, but that's not really a problem if the pollers end up 19% unbalanced instead of 20%.
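A simplified Python sketch of that loop is below. It differs from the script described above in two ways that are my assumptions: load is estimated in memory as the sum of each node's element count rather than re-read from Orion after the lag, and a node is only moved if it is smaller than the gap between the two pollers, which guarantees the loop ends. `move` is a hypothetical callback that performs the real reassignment.

```python
import time


def engine_loads(nodes, engine_ids):
    """Estimate each engine's load as the sum of its nodes' elements."""
    loads = {engine_id: 0 for engine_id in engine_ids}
    for node in nodes:
        if node["EngineID"] in loads:
            loads[node["EngineID"]] += node["elements"]
    return loads


def rebalance(nodes, engine_ids, move, tolerance=0.20, pause=30,
              sleep=time.sleep, max_moves=1000):
    """Move the largest helpful node from the busiest to the idlest engine
    until the spread is within tolerance. Returns the moved NodeIDs."""
    moved = []
    while len(moved) < max_moves:
        loads = engine_loads(nodes, engine_ids)
        src = max(loads, key=loads.get)
        dst = min(loads, key=loads.get)
        gap = loads[src] - loads[dst]
        if loads[src] == 0 or gap / loads[src] <= tolerance:
            break
        # Only nodes smaller than the gap actually narrow it.
        candidates = [n for n in nodes
                      if n["EngineID"] == src and 0 < n["elements"] < gap]
        if not candidates:
            break
        node = max(candidates, key=lambda n: n["elements"])
        move(node["NodeID"], dst)
        node["EngineID"] = dst
        moved.append(node["NodeID"])
        sleep(pause)  # give the poller time to reflect the change
    return moved
```

Because the estimate updates instantly, this version does not overrun the way a script reading live poller load would.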

In the past I reworked a lot of that same logic to write a script that moves all the nodes off a poller and redistributes them across the remaining pool members, and then set it up as a trigger action for an alert. So if one of my pollers goes 5 min without writing to the DB, I am going to assume it died. The script can unload the poller immediately to try to keep nodes from missing data, then kicks off a restart of services or a full reboot of the APE, and then if it comes back to life and is happy we can rebalance the pollers again. I should probably also mention that I only include APEs for any of this stuff, since I have 9 of them right now and I don't do very much polling from the main server (the scripts also know to drain the primary server in case anyone does forget and nodes get assigned to it).
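One possible way to plan that unload step, sketched in Python: hand each node from the dead poller to whichever remaining poller currently has the least load, largest nodes first. The greedy ordering and the element-count load estimate are my assumptions rather than details from the script described above, and the function only produces a plan; applying it would mean updating each node's EngineID.

```python
def drain_plan(nodes, dead_engine, remaining_loads):
    """Plan where each node on dead_engine should go.

    nodes: list of dicts with "NodeID", "EngineID", "elements".
    remaining_loads: {engine_id: current_load} for surviving pool members.
    Returns a list of (NodeID, target_engine_id) pairs.
    """
    if not remaining_loads:
        raise ValueError("no remaining engines to receive nodes")
    loads = dict(remaining_loads)  # work on a copy
    stranded = sorted(
        (n for n in nodes if n["EngineID"] == dead_engine),
        key=lambda n: n["elements"],
        reverse=True,
    )
    plan = []
    for node in stranded:
        target = min(loads, key=loads.get)
        plan.append((node["NodeID"], target))
        loads[target] += node["elements"]
    return plan
```

The same function could be pointed at the primary server's engine ID to drain nodes that were assigned to it by mistake.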

- Marc Netterfield, Github

Thanks so much! I am not sure how I didn't see that you replied. I am on the search to see if I can find a step by step somewhere, but I really appreciated you helping me. 


Thank you! I will see if our PowerShell guru can whip me up something fancy.

 

 
