
Nodes not responding to SNMP or WMI

There are times when a client's devices stop being polled for whatever reason. This could be an issue with the device or a change in credentials. Almost all the clients I have worked with were unaware that polling had stopped.

There is a simple way to notice this: look at the timestamp of the most recent CPU poll. If it is more than 35 minutes old, the node has a problem.

Here is the report for it:

SELECT n.Caption AS Node_Name, n.ip_address AS IP_Address, n.ObjectSubType AS Poll_Type
-- Whole 24-hour periods (seconds / 86400), so the day count agrees with the hh:mm:ss part
,CAST(DATEDIFF(second,MAX(c.datetime),GETDATE()) / 86400 AS varchar) + ' Day(s) ' + CONVERT(char(8),DATEADD(second,DATEDIFF(second,MAX(c.datetime),GETDATE()),0),14) AS Duration
,DATEDIFF(mi,MAX(c.datetime),GETDATE()) AS minutes_since
FROM Nodes n
INNER JOIN CPUload c ON c.NodeID = n.NodeID
WHERE n.status = 1 AND (n.ObjectSubType = 'wmi' OR n.ObjectSubType = 'snmp')
GROUP BY n.Caption, n.StatusDescription, n.ip_address, n.ObjectSubType
HAVING DATEDIFF(mi,MAX(c.datetime),GETDATE()) > 35
ORDER BY minutes_since DESC
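
The Duration column is meant to read as elapsed whole days plus hh:mm:ss. A minimal Python sketch of that formatting, assuming the elapsed time is already available in seconds. The helper name format_duration is hypothetical and not part of the report:

```python
def format_duration(elapsed_s: int) -> str:
    """Render elapsed seconds as 'D Day(s) hh:mm:ss' using whole 24-hour days."""
    days, rem = divmod(elapsed_s, 86400)   # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days} Day(s) {hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_duration(93784))  # 1 Day(s) 02:03:04
```

Counting whole 24-hour periods, rather than calendar-day boundaries crossed, keeps the day count consistent with the time-of-day part.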

Reporting is nice, but a better way to notice this is to create an alert for it, so it can be resolved in a timely manner. For the alert, you need to use custom SQL:

SELECT nodes.NodeID, nodes.caption FROM Nodes
INNER JOIN CPUload c ON c.NodeID = nodes.NodeID
WHERE nodes.status = 1 AND (nodes.ObjectSubType = 'wmi' OR nodes.ObjectSubType = 'snmp')
GROUP BY nodes.Caption, nodes.nodeid
HAVING DATEDIFF(mi,MAX(c.datetime),GETDATE()) > 35

Using both the report and alert will make sure you are getting data from all nodes and avoid the embarrassing situation when a server crashes due to high CPU and the boss comments - "I thought that SolarWinds was monitoring this".

Thanks

Amit Shah

Loop1 Systems

  • Indeed, thanks cscoengineer for sharing.

    Often a device appears green without any indication that polling is not working. I believe this should be addressed by the Orion platform itself, perhaps with an additional state such as "missing polls".

    In addition to tracking the last CPU poll, we also track the Nodes.LastSystemUpTimePollUtc time. If it falls too far behind, we flag this as a polling issue as well.

    Also, we calculate missed polls rather than minutes. This lets us alert based on each node's polling frequency, which differs in some cases.

    SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name, Nodes.NodeID
    FROM Nodes
    ------------------------------------------------------------
    LEFT JOIN
    (SELECT CPULoad_Detail.NodeID, MAX(CPULoad_Detail.DateTime) AS LastCPU
       FROM CPULoad_Detail
       GROUP BY CPULoad_Detail.NodeID) c1 ON Nodes.NodeID = c1.NodeID
    WHERE
    (
    Nodes.Status = 1 OR
    Nodes.Status = 3
    ) AND
    Nodes.Unmanaged = 0 AND
    Nodes.ObjectSubType <> 'ICMP' AND
    (
    (DATEDIFF(ss, c1.LastCPU, GETDATE()) / Nodes.PollInterval > 15 ) OR
    (DATEDIFF(ss, Nodes.LastSystemUpTimePollUtc, GETUTCDATE()) / Nodes.PollInterval > 15)
    )
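
    The "missed polls" arithmetic in the query can be sketched in Python as below. This is only an illustration under assumptions: the function names are hypothetical, the interval is taken to be in seconds, and integer division mirrors how T-SQL divides two integers. A missing timestamp (no CPU rows behind the LEFT JOIN) is treated as "not flagged", which matches how a NULL comparison behaves in SQL.

```python
from datetime import datetime, timedelta
from typing import Optional


def missed_polls(last_sample: Optional[datetime], now: datetime,
                 poll_interval_s: int) -> Optional[int]:
    """Whole polling intervals elapsed since the last sample, or None if no sample exists."""
    if last_sample is None:
        return None  # SQL: DATEDIFF over NULL yields NULL
    elapsed_s = int((now - last_sample).total_seconds())
    return elapsed_s // poll_interval_s  # integer division, as in T-SQL


def is_stale(last_sample: Optional[datetime], now: datetime,
             poll_interval_s: int, max_missed: int = 15) -> bool:
    """True when more than max_missed whole intervals have passed without data."""
    missed = missed_polls(last_sample, now, poll_interval_s)
    return missed is not None and missed > max_missed


now = datetime(2024, 1, 1, 12, 0, 0)
print(is_stale(now - timedelta(seconds=16 * 120), now, 120))  # True
```

    Because of the integer division, "> 15" only fires once 16 full intervals have passed, and a node that has never returned CPU data is not caught by the CPU branch on its own.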

    there is a whole thread about it here

  • If it's of any additional use for detecting WMI/polling anomalies:

    I've got a cheap method that only works for WMI, but a few times it has helped catch cases where a device was pingable and Orion *was* polling correctly but the server had issues (you need to have SAM installed). I call it a "canary" alert because you really don't know what's wrong, just that WMI stopped responding.

    Add the built-in "Windows 2008-2012 counters" SAM template to your Windows servers (all of its components use WMI, and most of these metrics are nice to have anyway). Set a rule to alert only if the entire "application", not each individual component, is unreachable (grey). When WMI breaks or gets corrupted, or Orion cannot poll it (credentials no longer work), or the server starts having weird issues affecting WMI (auth issues so you cannot log in, blue screen, etc.), you get one alert because all the components under the application go "down". Relying on all components failing at the same time helps cut down on false alerts.

    Of course, if Orion stops polling, this doesn't detect it :-). I could probably get away with disabling all but two of the counters to cut down on the WMI polling while keeping the spirit of having multiple WMI checks under the same app, but I like having all the extra data.

  • Hi alexslv

    I did try this query but it doesn't give the expected result.

    I want to know if any of the nodes polled via ICMP/SNMP/WMI are not responding.

    I actually have a few nodes that are currently not responding to SNMP, but this query doesn't identify them. Can you help in this case?

  • Have a look at this article here: Uncovering Polling Problems and Issues in Your Environment 

    It has all the reports you need attached. Run them and let us know if your problematic node gets highlighted. If not, we will need to understand a bit more about your particular circumstances. I would be very interested to find out the answer as well and catch yet another "Gremlin" :-)

  • This may be of interest, as it's a modification of the "check for node not responding" query you originally posted. I'm using it to catch when Windows servers go into a "soft fault" condition (WMI not responding, cannot log in via Remote Desktop or VMware console, but the server pings, the application may be working, and usually you can map a drive and walk the filesystem).

    The caveats are:

    1. It's set up as an alert, not a report.

    2. It's only for WMI issues, not SNMP, but I don't see why it couldn't be used for SNMP (I don't have a way to test it).

    3. It handles the edge case of "the forever broken server" where WMI just no longer works or cannot be repaired (or there's no desire to repair it because the server is old, etc.). For these servers it does not alert, as there is no longer any CPU data being collected within the "collection window we care about".

    4. It doesn't alert when you put a server into maintenance (from the above query) or when the server is actually down. These may already be features of the original query above.

    Probably the most interesting thing about it is the DATEDIFF area. I played around with it a lot since it's an alert and not a report. It takes into account whether you have modified the default CPU stats collection interval globally or at the node level (I hope; I did a little testing), as we have some nodes where we made the stats polling interval longer than the default. The 3.5 number is the only thing that needs to be tuned.

    In my environment, setting it to 3.5 "missed polls" works well for alerting, even though 1 or 2 would seem optimal/logical. When I set it that low, or even at 3, I get spurious "false" alerts (maybe the Orion pollers get a bit behind, or we miss a few polls for various reasons), but at 3.5 it (so far) appears to alert only when the polling/server problem is *definitely* there. I'm not sure this is the best way to handle this part of the query (maybe I'm using the wrong fields or something), but searching Thwack it seems that no one else is doing it this way, so I figured someone might get a kick out of it or tell me it's not really working like I think it is :-)

    SELECT Nodes.NodeID AS NetObjectID, Nodes.Caption AS Name, Nodes.NodeID
    FROM Nodes
    ------------------------------------------------------------
    LEFT JOIN
    (SELECT CPULoad_Detail.NodeID, MAX(CPULoad_Detail.DateTime) AS LastCPU
    FROM CPULoad_Detail
    GROUP BY CPULoad_Detail.NodeID) c1 ON Nodes.NodeID = c1.NodeID
    WHERE
    (

    /* I have forgotten why I specifically included all these conditions. I think it's because Orion keeps the "last known" status, and application status, if you have SAM, can roll up (an admin config option?) so you can get a non-up status  of the node and the node is still pingable, but collection via WMI has stopped */
    Nodes.Status = 1 OR /* up: green */
    Nodes.Status = 3 OR /* warning: yellow */
    Nodes.Status = 15 OR /* partly available: yellow */
    Nodes.Status = 8 OR /* lower layer down: red */
    Nodes.Status = 22 OR /* active: green */
    Nodes.Status = 17 OR /* could not poll: grey */
    Nodes.Status = 0 /* unknown: grey */
    )
    AND
    Nodes.Unmanaged = 0 AND
    Nodes.ObjectSubType = 'WMI'
    AND
    (
    (DATEDIFF(ss, c1.LastCPU, Nodes.NextPoll) > (Nodes.StatCollection * 60 * 3.5) ) /* (how far back the last CPU data is, in seconds) > (how often CPU data is collected, in minutes, converted to seconds) * (how many polls can be missed) */
    )
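
    For anyone tuning the 3.5 multiplier, the threshold arithmetic can be sketched in Python as follows. This is an illustration only, assuming the collection interval is stored in minutes as the query comment describes. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def wmi_gap_alert(last_cpu: Optional[datetime], next_poll: datetime,
                  stat_collection_min: int, allowed_missed: float = 3.5) -> bool:
    """True when the gap from the last CPU sample to the next scheduled poll
    exceeds allowed_missed collection intervals."""
    if last_cpu is None:
        return False  # no CPU rows at all: the SQL comparison is NULL, so no alert
    gap_s = (next_poll - last_cpu).total_seconds()
    threshold_s = stat_collection_min * 60 * allowed_missed
    return gap_s > threshold_s


last = datetime(2024, 1, 1, 12, 0, 0)
# A 10-minute collection interval gives a 2100-second (35-minute) threshold.
print(wmi_gap_alert(last, last + timedelta(minutes=36), 10))  # True
```

    With a 10-minute collection interval, 3.5 missed polls works out to 35 minutes, the same cutoff used in the original post's report and alert.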

  • Thanks Alex :-)

    Let me see whether this works for me. I will post the result.

  • For the Nodes reports, it's throwing an error like "Incorrect syntax near ;"

  • It's actually showing the error wherever "&gt; 3" is used.

  • Well, the most likely reason is that you need to change the database reference in the SQL query. There will be a few places with a database name. Just remove it, leaving the table name only (for example, "... FROM SolarWinds.dbo.Nodes ..." becomes "... FROM Nodes ..."). Follow the steps below:

    In Nodes report you should have the following:

    • Line 43: FROM Nodes WITH(NOLOCK)
    • Line 48: FROM CPULoad_Detail WITH(NOLOCK)

    I have just tested it - works fine.