We have a service that hangs but still appears to be running. By monitoring a log file, we can detect when the service loses its connection to a PBX. When that happens, we need to run a batch file that does the following:
REM When a disconnect occurs, the server may be unable to reconnect for 2 or 3 minutes.
REM Set a delay of 3 minutes before restarting the hung services.
Timeout /t 180 /NOBREAK
REM Stop the Windows Service that lost connectivity.
net stop "<ServiceName>"
REM Wait 30 seconds to allow the service to terminate.
Timeout /t 30 /NOBREAK
REM Kill any remaining instances of the service tasks still in memory.
REM This will terminate all running tasks with the name of <ServiceName.exe>
REM The /F flag is for Forcing the tasks to terminate.
TaskKill /IM "<ServiceName.exe>" /F
REM Wait 10 seconds to allow the tasks to terminate.
Timeout /t 10 /NOBREAK
REM Start the service.
Net Start "<ServiceName>"
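For reference, here is a rough sketch of the same steps that polls the service state instead of relying only on fixed waits. It assumes `sc query` reports the state in English (`STOPPED`); the variable names `SVC`, `EXE`, and `tries` and the labels are only illustrative. I have also seen `timeout` refuse to run when a script is launched without a console (input redirected), so the waits may need a different mechanism depending on how the script gets executed.

```bat
@echo off
setlocal
REM Placeholders: substitute the real service name and executable name.
set "SVC=<ServiceName>"
set "EXE=<ServiceName.exe>"

REM Give the server 3 minutes to recover before restarting anything.
timeout /t 180 /nobreak >nul

REM Ask the service to stop.
net stop "%SVC%"

REM Poll once per second, up to 30 times, for the STOPPED state.
set /a tries=0
:waitstop
sc query "%SVC%" | find "STOPPED" >nul
if not errorlevel 1 goto startsvc
set /a tries+=1
if %tries% geq 30 goto forcekill
timeout /t 1 /nobreak >nul
goto waitstop

:forcekill
REM The service did not stop in time; force-terminate leftover processes.
taskkill /IM "%EXE%" /F
timeout /t 10 /nobreak >nul

:startsvc
REM Start the service again.
net start "%SVC%"
endlocal
```

The only behavioral difference from the steps above is that the forced kill runs only if the service has not reached the stopped state within 30 seconds.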
What we need is a way to execute the equivalent of these steps from SolarWinds, on the impacted Node. We have set up a monitor that watches for new instances of the lost-connection string in the application log file. The next step is to send email alerts to the app admin team and automatically restart the hung service.
I believe SAM service restart may be part of this, but I am not sure how to use it when SAM doesn't detect that the service has failed. If someone could point me to an article or tutorial for setting this up, I would appreciate it.
Thanks,