SolarWinds self healing (for Primary, APEs, NTA Flow Storage servers)

Our organization is encouraging us to implement self-healing for the SolarWinds environment itself (as opposed to self-healing of the monitored infrastructure). As a baseline, the Windows services should all be set to "Restart the Service" for Recovery. While we don't experience a lot of SolarWinds server/service issues, there are occasional ones. Sometimes they manifest as a problem with a specific module (e.g., NCM), or the web console becomes unresponsive or very sluggish. Sometimes a polling engine (the primary or an APE) has not updated the database in a while. We do leverage the built-in alert for this, but the trigger action is currently just to send us an email. Typically this type of issue, if prolonged, requires restarting Orion services. I doubt we would want a trigger action on this alert that automatically restarts all Orion services, but maybe we could use complex conditions to check something else before deciding a restart is warranted (without a human making that decision)?
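
To make the "check something else first" idea concrete, here is a minimal sketch of the kind of gating logic a restart script could apply before acting. Everything in it is an assumption for illustration: the `EngineState` fields, the 15-minute staleness threshold, and the 2-hour cooldown are hypothetical and not taken from any SolarWinds API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class EngineState:
    """Snapshot of one polling engine. All fields are illustrative."""
    name: str
    last_db_update: datetime                # last time the engine wrote to the database
    last_auto_restart: Optional[datetime]   # None if never restarted automatically
    server_reachable: bool                  # e.g. result of a ping or port check


def should_auto_restart(state, now,
                        stale_after=timedelta(minutes=15),
                        cooldown=timedelta(hours=2)):
    """Return True only when an unattended restart looks reasonable."""
    # If the server itself is unreachable, restarting services will not help.
    if not state.server_reachable:
        return False
    # The engine must have been silent for longer than the threshold.
    if now - state.last_db_update < stale_after:
        return False
    # Avoid restart loops: leave repeat failures to a human.
    if (state.last_auto_restart is not None
            and now - state.last_auto_restart < cooldown):
        return False
    return True
```

The cooldown is the important part of the design: if a restart did not fix the problem the first time, a second automatic restart is unlikely to, and the email alert should take over.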

We do not have the SAM module, and likely won't have this for at least the near term. So any solutions need to leverage existing modules, likely just NPM. For reference, we have: NPM, NCM, VNQM, NTA, and UDT.

Anyway, we really do need to look into this deeper and come up with a self healing strategy, even if it ends up being rather simplistic. We're hoping to get some ideas/opinions from the community and spur some discussion so we can further explore this.

Thanks in advance!!

  • You can use Syslogs and Traps to execute a program or external script by setting up rules for alerts/actions in the Syslog Viewer or Trap Viewer.

    -CharlesH

    Loop1 Systems: SolarWinds Training and Professional Services

  • I have a feeling that I didn't explain this very well. We are looking to put self-healing measures in place that will automatically resolve issues with the Orion platform and modules. A good example, which occurred today and which we have seen occasionally, is when NCM transfers start failing and the Transfer Status screen shows the error "Unable to connect to polling engine...". We're not aware of a way to monitor this condition, as all other functionality (e.g., SNMP polling) appears to be OK. Once we become aware of the issue, we just restart Orion services to resolve it. So monitoring and alerting on this condition would be great, but taking it one step further, how could we put a self-healing measure in place that would automatically restart the Orion services?

    [Screenshot: Config Transfer - unable to connect to polling engine]

  • If you have a working alert you can use that alert to execute a script and restart the services you need to.

    Rather than restarting all services, I would open a support case or reference your logs to determine what is causing the failure and figure out what specific service is being affected.

    Reference these links for help setting this up: "Re: Restart a Service"

    which links to the SolarWinds doc: "Monitor, alert and restart a Windows service from SAM - SolarWinds Worldwide, LLC. Help and Support"

    and another link at the bottom that will eventually get you to: "PSEXEC to start remote process", for help with using PowerShell to make this happen.

    It is not quite 'out of the box' but your service restart can be achieved.
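
As a rough illustration of the "restart only the affected service" approach, here is a minimal Python sketch that an alert action could launch. It assumes the standard Windows `sc` utility is available on the machine running the script and that the account has rights to control the service; `ExampleService` and `server01` in the usage note are placeholders, not real Orion service or host names.

```python
import subprocess
import time


def sc_command(action, service, host=None):
    """Build an sc command line, optionally targeting a remote host."""
    cmd = ["sc"]
    if host:
        cmd.append("\\\\" + host)   # sc expects the form \\hostname
    cmd.extend([action, service])
    return cmd


def wait_for_state(service, wanted, host=None, timeout=60, interval=2,
                   run=subprocess.run):
    """Poll 'sc query' until its output mentions the wanted state."""
    deadline = time.monotonic() + timeout
    while True:
        result = run(sc_command("query", service, host),
                     capture_output=True, text=True)
        if wanted in result.stdout:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def restart_service(service, host=None, timeout=60, interval=2,
                    run=subprocess.run):
    """Stop, then start, one service. Returns True if it ends up RUNNING."""
    run(sc_command("stop", service, host), capture_output=True, text=True)
    if not wait_for_state(service, "STOPPED", host, timeout, interval, run):
        return False   # did not stop in time; leave it for a human
    run(sc_command("start", service, host), capture_output=True, text=True)
    return wait_for_state(service, "RUNNING", host, timeout, interval, run)
```

A call such as `restart_service("ExampleService", host="server01")` would then return `False` when the service fails to stop or start within the timeout, which is the point at which to fall back to the email alert.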

  • That's a good call-out: we do need to understand what is causing the failure. We've had this symptom analyzed in previous support cases, but there was never a smoking gun per se. Having said that, just yesterday I opened a new support case specifically for this issue. And today I discovered that the nightly running-config backup job was stuck, which I'm fairly confident is related. We normally get an email with the log results of the job, and of course we didn't get that email. But again, what we need to know is how to monitor this condition (so we can alert on it somehow).

    As mentioned, we do not have the SAM module and likely won't for the foreseeable future. But as long as we can monitor and alert on the condition, we could launch PsExec to restart processes.

    Hopefully we'll get SolarWinds support to help us really identify the underlying cause.
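
One simple way to turn "the email never arrived" into something that can be alerted on is a freshness check: if the job normally leaves behind a file whose modification time advances on each successful run, a scheduled script can flag the job as stuck when that file gets too old. The sketch below assumes such a file exists; the path and the age limit are hypothetical arguments, not known NCM locations or defaults.

```python
import os
import time


def is_stale(path, max_age_hours, now=None):
    """Return True if the file is missing or older than max_age_hours."""
    now = time.time() if now is None else now
    try:
        modified = os.path.getmtime(path)
    except OSError:
        return True   # a missing file counts as "job did not run"
    return (now - modified) > max_age_hours * 3600


def main(argv):
    """argv: [path_to_job_output, max_age_hours]. Returns an exit code."""
    path, max_age_hours = argv[0], float(argv[1])
    # Non-zero exit code signals "stale" to whatever scheduled the check.
    return 1 if is_stale(path, max_age_hours) else 0
```

For a nightly job, a limit a little over 24 hours (say 26) leaves room for normal variation in run time without hiding a missed night.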

  • Understandable, I have had similar experiences with the software. If you can recreate the event in any way, you will have fresh logs to reference that might help point in a certain direction. If your NCM backup is not fully working, though, recreating the SDFs or jobs might be the first step. A Configuration Wizard run should also replace those SDFs, but it does not recreate a failing job. (Duplicate the backup job, then disable/delete the old one.)

    Since it sounds like your issue may run a little deeper, though, some of those 'fixes' may just bury the issue for a while and delay the next recurrence.

    If you can, reference any Server Events, along with whatever other logs the system itself generates, to help pinpoint what is going on. Aside from that, make sure your SQL Server is not resource-constrained and that no locks or waits on the database are stalling the process.
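
For the locks/waits check, a minimal sketch is below. It queries the SQL Server `sys.dm_exec_requests` view for requests that are currently blocked by another session. It assumes you already have a DB-API cursor connected to the SQL Server instance (for example from a driver such as pyodbc, which is not shown here) and a login with permission to read that view; the 30-second threshold is an arbitrary illustration.

```python
# Requests with a non-zero blocking_session_id are waiting on another session.
BLOCKING_QUERY = """
SELECT session_id, blocking_session_id, wait_type, wait_time
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0
"""


def find_blocked_sessions(cursor, min_wait_ms=30000):
    """Return blocked requests that have waited at least min_wait_ms."""
    cursor.execute(BLOCKING_QUERY)
    blocked = []
    for session_id, blocker, wait_type, wait_ms in cursor.fetchall():
        # wait_time is reported in milliseconds.
        if wait_ms >= min_wait_ms:
            blocked.append({
                "session_id": session_id,
                "blocked_by": blocker,
                "wait_type": wait_type,
                "wait_ms": wait_ms,
            })
    return blocked
```

An empty list means nothing has been blocked for longer than the threshold at that moment; it is a point-in-time check, so it is most useful run repeatedly while the problem is occurring.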