
When Up/Down by IP isn't enough...

Hey Everyone! 

I am having an issue with monitoring our Windows servers. Last night our AV was patched and the servers were restarted, but some failed to boot all the way. They were still pingable, so they didn't trigger any DOWN alerts, but they were not functional.

Any ideas on which services or processes I could monitor on approximately 2500 servers to catch failed patching that leaves a server reporting as UP but non-responsive, without taxing SolarWinds and my polling engines?

Any advice is appreciated!

Thanks!

Scott

PS - Running SAM 6.7.0

  • Ah, yes, the classic case of the spinning Windows 2012/2016 post-reboot, pre-login screen.

    Unfortunately, this one plagues a lot of us. In my experience, it mostly happens with virtual machines.

    There is no effective way to detect this condition if all services are started.

    If you're patching 2500 servers in one shot, that is epic.

    What tool do you use? SCCM? WSUS?

    When I used to patch 700 servers, I'd check status and also do ANOTHER graceful reboot, using a great tool to reboot servers in batches. Of course, this does not always clear the spinning pre-login screen.

    Additionally, DBAs and App folks were forced to do post-patch wellness checks on their servers. This would expose the hung Windows post-reboot pre-login screen.

    In short, patching 2500 servers is not going to be perfect by any means. I'm not sure what services can be checked if they're all "Up".

    Also, you have a duplicate post in the forum. You may want to delete your other post with the same subject.

  • Depends on where exactly the server got stuck and which polling method you are using.

    1. If the server hangs after the reboot:

    Your best bet would be agent polling, with status based on the agent instead of ping. That way, if the SolarWinds agent does not start, you will get a server-down alert.

    2. If the server hangs before the reboot:

    You could use the built-in server inventory and check the last boot time. It is not real time, but if you force an inventory on all servers you should have all the dates polled in around 2 hours (depending on how many APEs you have and how powerful they are).

    3. If they hang both before and after, just combine the two approaches above.

    This is not a 100% sure way, but you mentioned you don't want to be taxing on your servers.
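
The last-boot-time check in option 2 could be sketched roughly like this in Python. This is only an illustration of the comparison, not anything SolarWinds ships: the function name, server names, and timestamps are all made up, and it assumes the boot times have already been exported from the inventory by some other means.

```python
from datetime import datetime


def servers_missing_reboot(last_boot_times, patch_started_at):
    """Return sorted names of servers with no boot recorded since the patch began."""
    stale = []
    for name, booted_at in last_boot_times.items():
        # A missing boot time (None) is treated as suspicious, not as healthy.
        if booted_at is None or booted_at < patch_started_at:
            stale.append(name)
    return sorted(stale)


# Hypothetical data: names and timestamps are invented for illustration.
patch_started_at = datetime(2020, 1, 15, 22, 0)
last_boot_times = {
    "app01": datetime(2020, 1, 15, 22, 40),  # rebooted after patching began
    "db02": datetime(2019, 12, 3, 4, 15),    # last boot predates the patch
    "web03": None,                           # inventory returned no boot time
}

print(servers_missing_reboot(last_boot_times, patch_started_at))
```

With the sample data, `db02` and `web03` are flagged. Note that this only catches servers that never came back from the reboot; a server that booted but hung at the pre-login screen would still show a fresh boot time.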

  • hpstech​ @bogdan.stan@xpo.com

    I am not the one doing the patching. We use SCCM, we don't patch everything at the same time, and 95% of those servers are virtual. This was the security team patching the AV.

    I don't believe all of the services loaded, since other departments were not able to remote into their boxes. I'll have to look at which services load once Windows finally loads and stays up, and possibly monitor one of those.

    We don't want to use Windows Agents in our environment if we don't have to.
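
One agentless way to test "pingable but can't remote in" is a plain TCP connect to the remote-access port. The sketch below is a rough Python illustration, not a SolarWinds feature: the function names are hypothetical, and it assumes RDP on its default port 3389 is what "remote in" means here. It would only catch servers where the listener never came up; a box that accepts the connection but sits at a spinning pre-login screen would still pass.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def find_unresponsive(hosts, port=3389, check=tcp_port_open):
    """Return the hosts whose port could not be reached, in input order."""
    return [host for host in hosts if not check(host, port)]
```

Calling `find_unresponsive(["server01", "server02"])` (placeholder names) returns the hosts that did not accept a connection. The checks run one after another, so for 2500 servers you would probably want to batch or parallelize them and keep the timeout short.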

    Usually when this issue happens you can RDP, but you will see the pre-login screen spinning, just as you would in the vCenter console.

    How is your team "remoting" in - RDP?

    Again, this issue is very common with Windows, and extremely difficult to detect with monitoring.

    I agree about not wanting agents, and they would most likely not be effective for this issue anyway.

    I've fought this same issue for years. I'm surprised more people haven't chimed in to agree.

  • Not functional as in you cannot RDP, or something else?

    You could try monitoring some event IDs, which may help you or your Windows team troubleshoot if this happens every time...