Our organization is encouraging us to implement self-healing for the SolarWinds environment itself (as opposed to self-healing of the monitored infrastructure). The Windows services themselves should all be set to "Restart the Service" for Recovery. While we don't experience a lot of SolarWinds server/service issues, there are occasional ones. Sometimes they manifest as a problem with a specific module (e.g., NCM), or the web console becomes unresponsive or very sluggish.
Sometimes we see that a polling engine (the primary or an APE) has not updated the database in a while. We do leverage the built-in alert for this, but the trigger action is currently just to send us an email. Typically this type of issue, if prolonged, requires restarting the Orion services. I doubt we would want a trigger action for this alert that automatically restarts all Orion services outright, but maybe we could use complex conditions to check something else first, so the alert itself determines whether a restart is warranted without a human making that decision?
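To make the "complex conditions" idea a bit more concrete, here is a rough Python sketch of the kind of gating logic we have in mind. It does not call any SolarWinds API: the `EngineHealth` fields, the thresholds, and the `should_restart` name are all assumptions made up for illustration, and the actual restart would still have to be wired to an alert action or an external script.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class EngineHealth:
    """Snapshot of one polling engine. Every field is a hypothetical input."""
    name: str
    last_db_update: datetime     # last time the engine wrote to the database
    corroborating_failure: bool  # a second, independent symptom was observed
    recent_restarts: int         # automatic restarts already done recently


def should_restart(health, now,
                   stale_after=timedelta(minutes=15),  # assumed threshold
                   max_restarts=2):                    # assumed restart budget
    """Return True only when several independent signals agree."""
    # 1. The engine must actually be stale, not just briefly delayed.
    if now - health.last_db_update <= stale_after:
        return False
    # 2. Never loop: once the restart budget is spent, leave it to a human.
    if health.recent_restarts >= max_restarts:
        return False
    # 3. Require a second symptom so one noisy check cannot trigger a restart.
    return health.corroborating_failure


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    ape = EngineHealth("APE-1", now - timedelta(minutes=30), True, 0)
    print(ape.name, "restart?", should_restart(ape, now))
```

The point of the sketch is only that the restart decision depends on more than the "has not updated the database" condition alone, and that repeated automatic restarts fall back to the existing email notification.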
We do not have the SAM module, and likely won't have it in the near term, so any solution needs to leverage our existing modules, most likely just NPM. For reference, we have: NPM, NCM, VNQM, NTA, and UDT.
Anyway, we really do need to look into this more deeply and come up with a self-healing strategy, even if it ends up being rather simplistic. We're hoping to get some ideas and opinions from the community and spur some discussion so we can explore this further.
Thanks in advance!!