WMI causing Orion Web Console to crash

When a server that is being polled using WMI goes down, SW tries to poll that node on every port until it reaches port exhaustion, which causes the Orion Web Console to crash. This never happened on previous versions, but ever since upgrading to 2020.2.1 I have had this issue any time a server goes down. Is there a way to make SW stop trying to poll a node using WMI when it is simply unreachable? Or is it better to deploy the agent on Windows servers rather than using WMI for this reason?

This is the DCOM error that shows up in the event logs when SW is trying to poll a Windows Server that is down:

[Screenshot: DCOM error entry from the Windows event log]
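The behavior asked for above (skip the WMI poll when the node is simply unreachable) can be sketched outside of Orion as a quick reachability gate. This is only an illustration of the idea, not a SolarWinds feature or setting: `is_reachable` and `poll_if_reachable` are hypothetical names, and `wmi_poll` stands in for whatever actually performs the WMI query. The check uses TCP port 135, the RPC endpoint mapper that DCOM/WMI connections start from, with a short timeout.

```python
import socket


def is_reachable(host: str, port: int = 135, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def poll_if_reachable(host, wmi_poll, port: int = 135, timeout_s: float = 2.0):
    """Run wmi_poll(host) only if the host answers on the given port.

    wmi_poll is a placeholder callable (an assumption for this sketch).
    Returns None when the host is unreachable, so no WMI/DCOM connections
    are opened toward a node that is down.
    """
    if not is_reachable(host, port, timeout_s):
        return None
    return wmi_poll(host)
```

A single failed connect with a short timeout costs one outbound socket per check, which is the point of gating the heavier WMI poll behind it.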

  • The most common cause is a bad login configured in SAM for that node (a wrong password).

    You should be able to test WMI from the Device Page.

  • That's what I thought initially as well. However, I just tested the credentials and they seem to work fine.

    WMI works fine when the node is up; the issue only happens when a node goes down.

  • Yeah, I wish SAM would ping, and not try WMI if it fails.

    Also, it would be nice if SAM would return a real, useful error when WMI fails. If a link to a building with many servers goes down, my first clue is often that the Orion web page hangs because of these errors. I guess they haven't tested it at scale.

  • Yes, I agree. That would be nice.

  • The best I can offer is to unmanage any server that will be down for longer than a reboot, and to otherwise stagger reboots.

    Also, the polling interval makes a difference, too. Aggressive polling really brings the problem out.

  • This is very helpful. It looks like we are polling nodes every 60 seconds. According to "Configure polling interval settings in the Orion Platform" (solarwinds.com), the default is 120 seconds. I will make that change and see if it helps. Thank you.

  • How many servers you are polling, what type of CPU and how much memory your polling server has, and even how well built your SQL server is can also make a difference.

    Have a look at this to start:

    https://documentation.solarwinds.com/en/Success_Center/orionplatform/content/core-optimization-polling-engines.htm

    I love short polling intervals, but I've found it's best to put them on the most critical devices, and leave less critical devices at longer intervals.

  • We are polling 970 nodes total split between two polling engines. 566 of those are servers. 321 are being polled using WMI.

    Main Polling Engine: 16 CPU(s), 28 GB Memory

    Additional Polling Engine: 4 CPU(s), 16 GB Memory

    SQL Server: 8 CPU(s), 64 GB Memory

    Both polling engines have a polling completion of 100% and the polling rate is around 50% on both.

    Do you think it would make a difference if I moved all Windows Servers using WMI over to my additional polling engine rather than polling them on my main polling engine, where the Orion website is hosted?

  • Those specs and stats look pretty good.

    I was thinking that polling from the web server was causing your problem. But do I hear you right, that WMI polling is from a different server than the web console?

    How many servers that you poll are down at any one time? If it's only a few, then this WMI thing could be a red herring. You may need to open a case to look deeper.

  • Sorry, I may have said that unclearly.

    Currently all servers using WMI polling are being polled from the main polling engine which also hosts the web console. It sounds like moving those nodes over to the additional polling engine might help resolve the issue?

    There are not usually very many Windows Servers down at a time, but every time the web console has crashed, I have seen lots of DCOM errors for a Windows Server using WMI in the event logs just like the one included in my original post. 

  • Well, I wouldn't guarantee that moving WMI-polled servers off the server that hosts the web console would solve it completely. You can probably stop the web console from crashing, but you may still get browser lag if there are servers polled from the web server that are down.

    My experience is that even a few WMI nodes being down simultaneously, each with multiple objects being polled, puts a load on the web server. I think web performance got worse with the 2020 releases, but I haven't had a chance to track anything specific down.

    So I think you may be better off opening a case, and getting to the bottom of it all.
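To put rough numbers on the polling-interval advice in this thread, here is a back-of-the-envelope sketch. The 321 WMI nodes and the 60 and 120 second intervals come from the posts above. Everything else is an assumption for illustration only: the number of down nodes, how many connections one poll of a down node opens, and how long each failed connection holds a local port are not known from the thread. The 16384 figure is the size of the default Windows dynamic port range (49152-65535), which can be changed on a given server.

```python
DEFAULT_DYNAMIC_PORTS = 16384  # default Windows range 49152-65535


def polls_per_second(node_count: int, interval_s: float) -> float:
    """Average number of node polls started per second."""
    return node_count / interval_s


def ports_held(down_nodes: int, connections_per_poll: int,
               interval_s: float, hold_s: float) -> float:
    """Estimate local ports tied up by polls against unreachable nodes.

    Uses Little's law: items in flight = arrival rate * time each one stays.
    connections_per_poll and hold_s are guesses, not measured values.
    """
    arrival_rate = down_nodes * connections_per_poll / interval_s
    return arrival_rate * hold_s


if __name__ == "__main__":
    for interval in (60, 120):
        rate = polls_per_second(321, interval)
        # Hypothetical scenario: 5 nodes down, 20 connections per poll,
        # each failed connection holding its port for 120 seconds.
        held = ports_held(5, 20, interval, 120)
        print(f"{interval}s interval: {rate:.2f} polls/s, "
              f"~{held:.0f} of {DEFAULT_DYNAMIC_PORTS} ports held")
```

Under these assumptions, doubling the interval halves both the poll rate and the number of ports held by down nodes, which matches the observation that aggressive polling brings the problem out. Real numbers would have to come from the polling engine itself, for example by counting connections while a node is down.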
