How do I monitor for a hung condition in Windows?

Recently I ran into an issue with one of my application servers.

The server is monitored in SAM, so we're monitoring all its processes and general server utilization.

However, two days ago the server hung, and nobody realized it for a couple of hours, until more and more users complained that the application was timing out.

When we checked, we were unable to RDP to the server, although ping worked fine. We had to force a restart to bring it back to normal.

Throughout this period SAM generated no down alert until we restarted the server.

I'm assuming that when the OS hung, the WMI processes all got stuck as well. Is there a way for SAM to alert us if the OS is hung? There should be a way, right? If the server's WMI is hung, SAM can no longer poll it successfully, so it should be able to send an alert for this scenario, right?

  • "Hung" conditions are a very nebulous space.  In most cases hung servers respond to pings, and depending on the exact problem they may also still respond to some or all WMI queries.  The way I have always approached this topic is that every server exists to provide some service.  I don't mean that it has some service running in Windows, I mean what the server actually does.  For a DC I would expect to be able to shoot LDAP requests at it and get prompt responses, for a web server I should be able to load a particular website, and for a database server I should be able to execute queries against that specific database.  For a server hosting a call center application I should be able to hit some sort of API or similar endpoint and get a proper response back.  Whatever that server exists to do, you aren't really monitoring it properly unless you ask it to do that thing as part of your SAM templates.  Hung servers will always fail this test, but what the test is depends on each application, and it takes a bit more work to find out.

    I also use a Custom Query with the SWQL below to identify systems that haven't responded to their polling method in the last 20 minutes, and for really high priority systems I have a version of it that I can turn on as an alert so we know right away.  Alerting on every polling failure can get really noisy, so I tend not to do that, especially because it paints a target on my back any time we have a hiccup on one of my polling engines.  It's never fun when 800 servers on a given poller try to trigger the same alert at once because I needed to do some maintenance on the poller for a bit and forgot to disable the alert while I worked.

    select   
    n.caption as [Node]  
    ,n.detailsurl as [_linkfor_Node]  
    ,'/Orion/images/StatusIcons/Small-' + n.StatusIcon AS [_IconFor_Node]  
    ,n.ip_address as [IP Address]  
    ,n.detailsurl as [_linkfor_IP Address]  
    ,n.statusdescription as [Status Description]  
    ,n.objectsubtype as [Collection Type]  
    ,e.servername  
    ,n.statcollection as [Interval]  
    ,case when n.objectsubtype !='SNMP' then 'Not Used'  
    when n.community='' then 'Not Used'  
    else n.community  
    end as [SNMPv2 Community]  
    ,case when c.Name is null then 'Not Used'  
    else c.Name   
    end AS [WMI/SNMPv3 Credential]  
    ,tolocal(n.lastsystemuptimepollutc) as [Last Stat Collection]  
    ,tolocal(n.lastsync) as [Last Ping]  
    ,daydiff(lastsystemuptimepollUTC,getutcdate()) as [Days Since Polled]  
    ,'Edit' AS [Edit]  
    , '/Orion/Nodes/NodeProperties.aspx?Nodes=' + ToString(n.NodeID) AS [_LinkFor_Edit]  
    ,'/Orion/images/nodemgmt_art/icons/icon_edit.gif' as [_IconFor_Edit]    
      
    from orion.nodes n  
    left JOIN Orion.NodeSettings ns ON n.NodeID = ns.NodeID and SettingName like '%Credential%' and settingname not like '%palo%'    
    left JOIN Orion.Credential c ON ns.SettingValue = c.ID      
    join Orion.Engines e on e.engineid=n.engineid  
    where status<>'2'  
    and status<>'9'  
    and objectsubtype!='ICMP'  
    and minutediff(lastsystemuptimepollUTC,getutcdate())>20  
      
    Order by Lastsystemuptimepollutc

  • I have seen servers that had blue-screened still respond to pings.

    As said above, you need to interrogate the server in a way that makes it do real work before it can respond: query a database, query an application, make an API call, etc.  SNMP sort of works, but requests get dropped when the server is busy.  That in itself may be a notable symptom, but queries whose responses are not cached are the way to check viability.
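
Both replies come down to the same check: ask the server to do its actual job and expect an answer within a deadline. Below is a minimal Python sketch of that idea. It is not a SAM component or template, and the function names, the example URL, and the default timeouts are all assumptions made for illustration. It contrasts a shallow reachability check (a TCP connect, which a hung server can still pass, much like ping) with a check that needs the application to produce a response.

```python
import http.client
import socket
import urllib.error
import urllib.request

# Talk to the server directly and ignore any proxy settings in the
# environment, so the result reflects the server itself.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Shallow check: True if a TCP connection completes within `timeout`.

    A hung server can still pass this, because the network stack may
    accept connections even when the application never answers.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_service_healthy(url: str, timeout: float = 10.0) -> bool:
    """Service check: True only if `url` returns a 2xx status in time.

    `timeout` applies to each blocking socket operation, so a server
    that accepts the connection but never replies is reported as
    unhealthy once the read times out.
    """
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Covers refused connections, timeouts, 4xx/5xx responses,
        # garbled replies, and malformed URLs.
        return False


# Example usage (hypothetical host and endpoint):
#   if not http_service_healthy("http://appserver01.example.com/health"):
#       print("application is not answering; raise an alert")
```

For the check to mean anything, the URL should hit something that makes the application do uncached work, as the second reply points out. A script like this could be scheduled anywhere that can reach the server; how to wire it into SAM is left open here.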