This Knowledge Base article should provide the steps you need to balance your polling engines:
As of NPM 10.2, it's all done in the web console, under Web Node Management.
Maybe I didn't make myself clear. I was referring to the polls-per-second tuning, not load balancing.
We no longer allow you to tune the polls per second. Instead, we simply make a best effort to poll everything according to your settings. This means if you want us to poll all your nodes for ICMP status every minute, then you will have at least <your node count> ICMP packets per minute from Orion.
The polling in 10.2 has been optimized from an SNMP standpoint such that we now send fewer SNMP packets wherever possible. So where before you may have seen 7 SNMP packets per interface statistics poll, you will now see only 1.
The polls per second tuning application was specifically tied to the NetPerfMon service, which no longer exists. All the polling is now done by the Collector and Job Engine v2 services.
Does that help?
Thank you for explaining this.
Pardon the interjection, but since users are used to the "old way" of using the Polling Tuner, what options are going to be available to us to get the best performance out of our polling servers? How will we know whether our server resources (RAM, CPU) are sufficient to handle the element count and polling intervals being used?
In other words, what indicators, if any, are in 10.2 that can be used to gauge the health of a server?
The polling tuner was really not a good way of tuning performance (from a generic software perspective). It was necessary because of the architecture of the NPM system. With 10.2 the architecture has changed significantly, such that common performance-measuring methods are more applicable. This means that monitoring CPU and memory on the Orion servers is a better indicator of performance than before.

Also, the polling completion rate in the Engines table is a more useful number in 10.2, in that it now more accurately reflects what is going on system-wide. The percentage is based on how many of the jobs in Job Engine v2 we were able to schedule on time. Jobs getting delayed is a sign that there is a physical performance problem or other limitation in the system.
Another thing to monitor on your Orion system is the MSMQ length of the solarwinds\collector\processingqueue queue. System problems like a slow database connection or other performance issues can cause this queue length to grow and stay large for long periods of time.
Also, monitor your SQL server. Ping times between it and Orion can have a significant impact on the system, as can CPU, memory, disk write queue length, and other SQL health metrics.
There are probably other things that other Orion users have found useful, so the community should speak up here as well.
Removal of the 'polls per second' setting doesn't appear to take into account the 'rediscover' process and how it affects older hardware (2948s in our case).
Upon upgrading to 10.2 we started seeing CPU spikes every 30 minutes on our older switches. I tracked this down to the rediscovery interval and manually initiated a rediscovery of a node while watching the CPU load from the node's CLI: it spiked from around 43% to 70%. When this spike is caught by our CPU HIGH alert, it sends a notification.
It has been a while since I tuned the polling on 10.1, so I may be misremembering what I did and why, but I'm pretty sure I previously turned the polls per second down to lower the impact of the rediscovery process on the older nodes. Is there a way to have it back off so it's not working the nodes so hard? I did change the rediscovery interval to every 120 minutes, but that will just space out the processor spikes.
The new topology polling happens on the rediscovery interval. For these older devices, check via List Resources whether the "Topology Layer 2" or "Topology Layer 3" pollers are assigned to them. Uncheck these boxes and you will lighten the load on the rediscovery interval, at the price of losing up-to-date topology data.
Another option is to see if increasing the number of OIDs we send per SNMP GetBulk request helps the CPU utilization. This works only for SNMPv2 and v3 devices. Check whether your Settings table has an entry for SNMP MaxReps:
SELECT * FROM [dbo].[Settings] WHERE SettingID = 'SWNetPerfMon-Settings-SNMP MaxReps'
This value defaults to 5. Increase it to see if this lightens the CPU load on the devices. Note that changing this value changes the count for all GetBulk requests in the system. We cache that value, so in order to test the change, you will have to restart the SolarWinds Orion Module Engine service and the 3 SolarWinds Collector services. If you don't have that row in your database, here is a SQL statement to help out:
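If the row already exists, you only need to change CurrentValue. A minimal sketch, using the SettingID from the query above (the value 10 is just an illustration, not a recommendation):

```sql
-- Double the GetBulk repetition count from the default of 5.
-- Restart the Orion Module Engine and Collector services afterwards
-- so the cached value is re-read.
UPDATE [dbo].[Settings]
SET CurrentValue = 10
WHERE SettingID = 'SWNetPerfMon-Settings-SNMP MaxReps'
```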
INSERT INTO [dbo].[Settings] -- column list assumed for the 10.2 schema; verify against your database
    (SettingID, Name, Description, Units, Minimum, Maximum, CurrentValue, DefaultValue)
VALUES ('SWNetPerfMon-Settings-SNMP MaxReps', 'SNMP GetBulk Maximum Repetitions',
    '', '', 1, 100, 5, 5)
Let us know if you try this out. Thanks!
It was indeed the layer2 and layer3 topology polling that was happening during the rediscovery cycle.
Further investigation showed that the CPU spike was observed on Cisco 2948 switches running CatOS version 8.3(2)GLX but not 8.4(7)GLX.
Thank you for the definitive response.
Picking up on your mention of the MSMQ queues, I created an APM template using the following components:
Bytes in Journal Queue
Bytes in Queue
Messages in Journal Queue
Messages in Queue
Are all of these of interest when tracking Orion system performance, and if so, what thresholds would be practical?
Messages in Queue is the most interesting performance counter for this situation.
The interesting queue here is ...\private$\solarwinds\collector\processingqueue
A good threshold is probably 10,000. Generally, the queue sits around 0, but the queue usage is meant to handle spikes of job processing delay (due to reasons like a temporary slow DB connection etc.) Spiking up to 10,000 isn't a huge problem if the queue comes back down, as the collector can process thousands of results per second under normal circumstances. So a spike that goes up this high should be monitored to make sure it comes back down quickly (meaning the queue length should steadily come down as the seconds progress after the spike stops). A queue growing consistently for a few minutes indicates a problem with system resources or database query time (check your index fragmentation).
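On the index-fragmentation point: this is a standard SQL Server check, not anything Orion-specific. Something like the following, run against the Orion database, will surface heavily fragmented indexes (the 30% cutoff is a common rule of thumb, not an Orion requirement):

```sql
-- List indexes in the current database with more than 30% fragmentation.
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       i.name                     AS IndexName,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC
```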
Thanks Karlo. So, if I was to set up an advanced alert on the Message Queue threshold being exceeded, what alert frequency and Trigger Condition threshold would you think would be appropriate?
Should I query the database every minute? Should I set the time delay on the Trigger Condition tab to some multiple of the APM polling interval?
I would still poll the database every minute, as that reduces the maximum delay between the issue happening and your notification of the issue. Also, having the trigger condition exist for about 10 minutes may be a good starting point, depending on your polling interval. If you're polling every 5 minutes, then two consecutive polls of > 10,000 messages is worth investigation, although there is a chance that it spiked, recovered and then spiked again.
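As a plain-Python illustration of that trigger condition (just a sketch of the logic, not anything Orion ships; the 10,000 threshold and two-poll window are the assumptions discussed above):

```python
def should_alert(queue_lengths, threshold=10_000, consecutive=2):
    """Fire only when the last `consecutive` polls all exceeded `threshold`,
    i.e. the queue spiked and has not come back down."""
    if len(queue_lengths) < consecutive:
        return False
    return all(length > threshold for length in queue_lengths[-consecutive:])

# A spike that recovers by the next poll should not alert:
print(should_alert([0, 15_000, 200]))       # False
# Two consecutive polls over the threshold should:
print(should_alert([0, 15_000, 22_000]))    # True
```

Requiring the condition to hold across consecutive polls is what filters out the spike-and-recover case; widening `consecutive` trades alert latency for fewer false positives.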
Try it out and see if you are getting false alerts and adjust accordingly.
Let me know what results you see, especially if things go bad.
Can someone drop a hint as to how to find/monitor "...\private$\solarwinds\collector\processingqueue"?
Either via its performance counter:
- Run perfmon and add the appropriate MSMQ Queue counter ("Messages in Queue") for that queue instance.

Or via Computer Management:
- Right-click Computer -> Manage -> Services and Applications -> Message Queuing; the private queues and their message counts are listed there.

Some info is on this blog.