I have started to monitor for DCOM errors due to a high TCP TimeWait connection count. The connection count is an entirely different conversation, but I'm working with support on it. They did mention that a high count of DCOM/WMI errors could be a potential source for the higher TCP connection counts.
To monitor the DCOM errors, I have built out a SAM monitor utilizing the Windows Event Log monitor, checking the System event log for ID 10028. I then applied the monitor to my various APEs. I was able to get the count down to 0, and then saw the count jump up to around 200 errors (the monitor captures the past 30 minutes of events). I saw that all of the errors were from a single Windows node that went offline. The node itself has been offline for a couple of hours and the DCOM errors are still generating. It's stabilized at approximately 200 errors in the 30-minute log capture window.
Is it me, or does this behavior seem off? I know interfaces and SAM monitors go into an unreachable state when the parent monitor goes down. Shouldn't SolarWinds stop trying to run WMI queries against the node when the ICMP status is reporting the node as down? I guess I'm more curious whether this is expected behavior or if I have something weird going on in my environment. I'm seeing between 5 and 10 failed calls a minute for this particular node. I could understand that happening while the node transitions from an up to a down state, but continuing the failed attempts for hours seems like a problem. Also, setting the node to maintenance mode would stop the polling, but then that would mask the node being down in the console.
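For what it's worth, the numbers hang together: at a steady 5 to 10 failed calls per minute, a 30-minute trailing window should stabilize somewhere between 150 and 300 events, which brackets the ~200 I'm seeing. A quick back-of-envelope check:

```python
# Steady-state count in a trailing window = failure rate * window length,
# once the node has been down longer than the window itself.
window_minutes = 30
low_rate, high_rate = 5, 10  # observed failed WMI calls per minute

low_count = low_rate * window_minutes    # 150
high_count = high_rate * window_minutes  # 300
print(low_count, high_count)  # → 150 300
```

So the ~200 stabilized count is exactly what a constant retry rate would produce; the real question is why the retries keep firing against a node that ICMP already reports as down.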