We have a recurring issue where our polling engine fails randomly. We have multiple support tickets, including one open now, and have been unable to find a root cause, so for now we see failures multiple times a week.
Right now I have an alert set up that emails me when a polling engine fails (hundreds of emails - I am OK with that). When I get the email, I jump online and restart the Job Engine and Collector services.
Is there a way to automate this in SW? My goal is to narrow it down to one email sent (I know how to accomplish this), automatically restart the services (no idea how), and then be notified by email (I know how to do this).
So, is there any way to get this done? These are two SW services: the Job Engine and the Collector. It would save a lot of time and missed statistics if I could get this accomplished.
Thanks in advance!!!!
A future goal is to have nodes automatically move to a separate polling engine if this process fails, but that is a pipe dream.
Unfortunately my script is heavily tailored to my environment, so it's not a great example, but the basic logic is: first, group your APEs based on which ones can be treated as a pool (for example, DMZ APEs should not be interchanged with regular APEs, and we have some pollers in dedicated networks for specific reasons). Most people might not even have to do this if all of your pollers can reach all of your nodes.
Then determine which engine in the pool has the most load and which has the least, based on the Orion.Engines table (if you want to get really fancy, you may want to factor in SAM polling load as well, but I rarely see people doing that).
Once you have your highest- and lowest-loaded engines, start moving nodes over using the Set-SwisObject command to reassign them to the least loaded engine.
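To make the selection step concrete, here is a minimal sketch in Python. It assumes you already have the pool as rows from a SWQL query along the lines of `SELECT EngineID, ServerName, Elements FROM Orion.Engines` (using `Elements` as the load metric is my assumption; the actual reassignment described in this thread is done with PowerShell's Set-SwisObject, not this code):

```python
def pick_extremes(engines):
    """Return the (most_loaded, least_loaded) engines in a pool by element count."""
    most = max(engines, key=lambda e: e["Elements"])
    least = min(engines, key=lambda e: e["Elements"])
    return most, least

# Made-up pool, shaped like rows from Orion.Engines.
pool = [
    {"EngineID": 2, "ServerName": "APE-01", "Elements": 9500},
    {"EngineID": 3, "ServerName": "APE-02", "Elements": 4200},
    {"EngineID": 4, "ServerName": "APE-03", "Elements": 7100},
]
most, least = pick_extremes(pool)
# A node on `most` would then be reassigned by changing its EngineID to
# least["EngineID"] (the Set-SwisObject step in the thread).
```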
Lots of approaches to this could work, but it's important to note that there is about a minute of lag between moving a node to a new poller and the polling load actually reflecting the change. You could try to estimate the desired change in load ahead of time and move X number of nodes at once, but I'm in no rush, so I wrote my script to pick the node with the highest number of elements on the most loaded poller, move it to the least loaded poller, wait 30 seconds, recheck the poller loads, and repeat until all pollers in the group are within X% of each other; I usually use about 20% as my acceptable spread. Due to the lag time I tend to overrun and move a few more nodes than I absolutely needed to, but that's not really a problem if the pollers end up 19% unbalanced instead of 20%.
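That loop can be sketched as an in-memory simulation (hedged: the real script would re-query Orion.Engines over SWIS and sleep ~30 seconds between moves; here pollers are plain dicts with made-up element counts, and I add a guard the post doesn't mention so a single oversized node can't ping-pong forever):

```python
def rebalance(pollers, max_spread=0.20):
    """pollers: dict of poller name -> list of per-node element counts.
    Greedily move the biggest helpful node from the most loaded poller
    to the least loaded one until the pool is within max_spread.
    Returns the moves performed as (elements, source, target) tuples."""
    moves = []
    while True:
        loads = {p: sum(nodes) for p, nodes in pollers.items()}
        most = max(loads, key=loads.get)
        least = min(loads, key=loads.get)
        gap = loads[most] - loads[least]
        # Done once the heaviest poller is within max_spread of the lightest.
        if loads[most] == 0 or gap / loads[most] <= max_spread:
            return moves
        # Only nodes smaller than the gap actually reduce the imbalance;
        # without this guard one huge node would bounce back and forth.
        movable = [n for n in pollers[most] if n < gap]
        if not movable:
            return moves
        node = max(movable)
        pollers[most].remove(node)
        pollers[least].append(node)
        moves.append((node, most, least))
```

In the real version each move would be a Set-SwisObject call followed by the ~30 second wait before re-reading the engine loads.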
In the past I reworked a lot of that same logic into a script that moves all the nodes off a poller, redistributes them across the remaining pool members, and runs as a trigger action for an alert. If one of my pollers goes 5 minutes without writing to the DB, I assume it died. The script can unload the poller immediately to try to keep nodes from missing data, then kicks off a restart of the services or a full reboot of the APE, and if it comes back to life and is happy we can rebalance the pollers again. I should also mention that I only include APEs in any of this, since I have 9 of them right now and don't do much polling from the main server (the scripts also know to drain the primary server in case anyone does forget and nodes get assigned to it).
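The evacuation half of that can be sketched the same way (again purely in-memory and illustrative; the real trigger action would reassign each node's EngineID via Set-SwisObject, and the poller names and counts below are made up):

```python
def drain(pollers, dead):
    """Move every node off `dead`, in-place, onto whichever surviving
    poller currently has the least load. Largest nodes go first so they
    spread evenly across the pool."""
    evacuees = sorted(pollers.pop(dead), reverse=True)
    for node in evacuees:
        target = min(pollers, key=lambda p: sum(pollers[p]))
        pollers[target].append(node)
    return pollers

pool = {"APE-09": [400, 300, 200, 100], "APE-01": [50], "APE-02": [60]}
drain(pool, "APE-09")  # APE-09 presumed dead; its nodes land on APE-01/APE-02
```

After the dead poller is back and healthy, the rebalancing pass described earlier evens things out again.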
Thanks so much! I am not sure how I missed that you replied. I am searching to see if I can find a step-by-step somewhere, but I really appreciate you helping me.