I have started to monitor for DCOM errors due to a high TCP TimeWait connection count. The connection count is an entirely different conversation, but I'm working with support on it. They did mention that a high count of DCOM/WMI errors could be a potential source for the higher TCP connection counts.
To monitor the DCOM errors, I have built out a SAM monitor utilizing the Windows Event Log monitor, checking the System event log for ID 10028. I then applied the monitor to my various APEs. I was able to get the count down to 0, and then saw the count jump up to around 200 errors (the monitor captures the past 30 minutes of events). I saw that all of the errors were from a single Windows node that went offline. The node itself has been offline for a couple of hours and the DCOM errors are still generating. It's stabilized at approximately 200 errors in the 30-minute log capture window.
Is it me, or does this behavior seem off? I know interfaces and SAM monitors go into an unreachable state when the parent monitor goes down. Shouldn't SolarWinds stop trying to run WMI queries against the node when the ICMP status is reporting the node as down? I guess I'm more curious whether this is expected behavior or if I have something weird going on in my environment. I'm seeing between 5 and 10 failed calls a minute for this particular node. I could understand that happening while the node transitions from an up to a down state, but continuing the failed attempts for hours seems like a problem. Also, setting the node to maintenance mode would stop the polling, but then that would mask the node being down in the console.
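For what it's worth, the numbers hang together: at a steady 5 to 10 failed calls per minute, a 30-minute trailing window should stabilize somewhere between 150 and 300 events, which brackets the ~200 I'm seeing. A quick back-of-envelope check:

```python
# Steady-state count in a trailing window = failure rate * window length,
# once the node has been down longer than the window itself.
window_minutes = 30
low_rate, high_rate = 5, 10  # observed failed WMI calls per minute

low_count = low_rate * window_minutes    # 150
high_count = high_rate * window_minutes  # 300
print(low_count, high_count)  # → 150 300
```

So the ~200 stabilized count is exactly what a constant retry rate would produce; the real question is why the retries keep firing against a node that ICMP already reports as down.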