*** This is by no means a scientific article, meaning I didn't do a massive amount of testing or research. But I think it is still useful. I hope this article is helpful to some of you!
** Also, big thanks to Leon Adato for sharing his research on SNMP vs. WMI monitoring! Very useful document.
In my environment we have 3 poller servers total (all three run NPM and SAM polling). Recently my team had to engage SolarWinds support because of problems with the pollers: the MSMQ folder storage would fill up with orphaned data and cause a spam storm (the alert engine would get stuck in a loop and send the same emails over and over....). We resolved the issue, but a week later it happened again, and then again a week after that. That is when we found the SolarWinds Orion Server Monitors SAM template on Thwack and applied it, which gave us a lot of interesting information on the health of our Orion environment. Thanks to this monitor we have continued to stay on top of performance issues with Orion. But issues kept coming up, such as port exhaustion and the MSMQ folder continuing to fill up with orphaned data. What was going on? After some research and looking at the issue over and over, I remembered how I consistently configure SAM template settings....
Last Thursday, during my lunch break, I decided to manually verify the SAM template configuration of most of the application monitors that use WMI polling. In doing so I discovered that just about ALL of the SAM monitors I had personally set up had their polling interval set to 60 seconds instead of the default 300 seconds. Why did I set most of the monitors to 60 seconds instead of the default? The answer is quite obvious: to detect problems as close to real time as possible. However, in doing this I unknowingly placed a tax on each poller server that severely impacted the performance and reliability of the system as a whole. Not only that, most of the application monitors we have in place did NOT need to be polled that frequently. For example, one of our less-utilized applications can stay at the 300-second interval, but payment processing (i.e. credit card services) needs to be monitored as close to real time as possible.
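If you have more than a handful of monitors, the manual check can be made repeatable. Below is a minimal Python sketch, assuming you have already exported each application monitor's name and polling interval into a list (for example from one of the SAM reports); the `AppMonitor` class, the `critical` flag, and the monitor names are all hypothetical and not part of any SolarWinds API. It flags every monitor polling faster than the 300-second default unless you have deliberately marked it as critical, like payment processing.

```python
from dataclasses import dataclass

# SAM's default application polling interval, in seconds
DEFAULT_INTERVAL_SECS = 300


@dataclass
class AppMonitor:
    name: str
    polling_interval_secs: int
    # Hypothetical flag: True only for apps that really need near-real-time polling
    critical: bool = False


def find_aggressive_monitors(monitors, default=DEFAULT_INTERVAL_SECS):
    """Return non-critical monitors that poll more often than the default interval."""
    return [
        m for m in monitors
        if m.polling_interval_secs < default and not m.critical
    ]


# Hypothetical sample data standing in for an exported monitor list
monitors = [
    AppMonitor("Intranet Wiki", 60),
    AppMonitor("Credit Card Services", 60, critical=True),
    AppMonitor("File Share", 300),
]

for m in find_aggressive_monitors(monitors):
    print(f"{m.name}: {m.polling_interval_secs}s, consider {DEFAULT_INTERVAL_SECS}s")
```

With the sample data, only "Intranet Wiki" is flagged: the credit card monitor is intentionally left at 60 seconds, and the file share is already at the default.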
Why does this matter? According to the "SNMP vs. WMI" document from adatole, switching from SNMP to WMI monitoring in Network Performance Monitor (NPM) should not significantly affect polling server performance, and any change in NPM's performance is minuscule. In my experience, I fully agree with Leon's assessment and research. However, the document focuses only on NPM monitoring using WMI and does not cover WMI performance with SAM templates. I can't blame Leon for leaving that out, because SAM templates can get very complicated very quickly.
So after researching the problem I was having with SAM, and after looking at the differences in how NPM and SAM work, one of the things I found was that it is critical to keep the polling interval of WMI application templates at the 300-second default or greater. Reducing the WMI application template polling interval from 300 seconds to 120 seconds dramatically increased compute AND network resource usage on the SAM poller, which fully explains the issues we were experiencing!
(Unfortunately I don't have any hard evidence for this because I didn't take any screenshots. This is a guesstimate based on my own experience with our custom SAM templates. Most of our custom SAM templates have well over 20 Windows Process and/or Windows Performance Counter monitors each, and we have more than 200 custom SAM monitors, all of them using WMI!)
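To get a feel for why the interval matters so much, here is a back-of-the-envelope calculation using the numbers above. It is a rough sketch under a simplifying assumption: that each component monitor costs one WMI poll per polling cycle. Real WMI cost per component varies, and the total is spread across all pollers rather than landing on one. The helper name `component_polls_per_hour` is purely illustrative.

```python
SECONDS_PER_HOUR = 3600


def component_polls_per_hour(monitors, components_per_monitor, interval_secs):
    """Rough WMI component polls per hour across the whole environment.

    Assumes one WMI poll per component per polling cycle (a simplification).
    """
    return monitors * components_per_monitor * (SECONDS_PER_HOUR // interval_secs)


# Lower bounds from the note above: "more than 200" monitors,
# "well over 20" components each
MONITORS = 200
COMPONENTS = 20

for interval in (60, 120, 300):
    polls = component_polls_per_hour(MONITORS, COMPONENTS, interval)
    print(f"{interval:>3}s interval: {polls:,} polls/hour")
```

This prints 240,000 polls/hour at 60 seconds, 120,000 at 120 seconds, and 48,000 at 300 seconds. Even with these lower-bound figures, a 60-second interval means five times the WMI traffic of the default.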
So on Thursday I changed the polling interval on most of our existing SAM WMI application monitors from 60 seconds back to the default 300 seconds. Since doing this, ALL of the performance issues we had been experiencing with SolarWinds appear to be resolved. For example, our team used to get daily alerts that the SW Job Engine was queuing up jobs and that SW job errors were occurring, false alert emails would trigger, and polling was delayed here and there. None of this is happening anymore, and SolarWinds performance appears to have returned to normal.
So if you rely heavily on WMI application monitoring in your environment, keep this in mind. I created a few SAM reports you can add to your own environment to help you quickly identify the polling interval configured for each of your SAM templates:
I hope this information helps!