Orion performance best practices - Part 1: monitoring the 10.2 Polling Engine

This post compiles important information you need to know about the "New Poller" metrics and performance measurement.
In NPM 10.2 we implemented a brand new polling mechanism that is responsible for scanning (SNMP, ICMP and WMI queries) Nodes, Volumes and Interfaces. The old poller could use only a single thread, so its performance and scalability were limited, which was one of the main drivers for implementing a totally new poller that can take advantage of multi-threaded (multi-core) environments. Since the new poller is highly scalable, we would like to give you a few tips on how to measure and control its overall performance.
You may also hear the term "Collector" in connection with the new poller, so let me clarify the wording before you dive into the details. "New poller" is the generic term for the new polling mechanism, which you can think of as a set of services and components. The Collector is the controller of the overall polling activity: it maintains all polling schedules and stores poll results in the database. For the execution of polling (SNMP, ICMP and WMI requests) the Collector communicates with the Job Engine, the service that runs polling jobs and returns the results back to the Collector.

What are the new metrics I can use to monitor the new poller?

When we introduced the new poller we also added new important metrics that you can monitor in the main Orion web console and that give you control over overall polling performance.
There are three new important metrics to watch (click Settings -> Polling Engines to see the screen below):
Polling Completion – represents the percentage delay in the primary polling mechanism (it covers all product polling: NPM, APM, UDT and so on). This metric is computed from the delay of every single poll job; when the value is lower than 100 it means polling jobs are delayed, but they are NOT discarded. For example, if you see a polling completion of 80%, polling is on average 20 seconds behind schedule. In other words, in the 10.2 release 1% represents 1 second of delay in the polling schedule (this ratio may change in the future). Polling completion changes frequently because it reflects the average of the last 100 polling jobs. It is mostly affected by CPU and memory resources, so if you see polling completion significantly lower than 100% you should check your CPU and memory utilization.
Total Job Weight (Total Job Weight for NPM Jobs) – represents the complexity of the current polling plan and serves as the base value for calculating the polling rate. It is a combination of all elements managed by Orion and their polling frequencies.
Total Job Weight is basically the sum of the "True Job Weight" of each job in the Job Engine. To get the "True Job Weight" we use the following formulas, where every type of polling job has a predefined "weight" and a polling frequency. Let's assume we have a job with a weight of 100 and a polling interval of 10 minutes:
10 minutes (interval) / 1 minute (throttle ratio unit of time) = 10
100 (job weight) / 10 (interval / throttle ratio unit of time) = 10 (True Job Weight)
We apply this for each job defined in the throttle group and sum the true weights of all jobs. For example, let's say we have 300 jobs with the same job weight and polling interval:
10(True weight for Job 1) + ... + 10(True weight for Job 300) = 3000 (Total Job Weight)
Currently only NPM jobs have a weight larger than "1". For instance, APM poll jobs won't have a big impact on the scale factor until there are a lot of them. Right now this is really just a number that tells us how much "weight" is going through the system per unit of time; we don't have a number that would represent the high end of what can be run per unit of time.
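To make the arithmetic concrete, here is a minimal Python sketch of the True Job Weight / Total Job Weight calculation as described above. The weight of 100, the 10-minute interval and the 1-minute throttle ratio unit are simply the example values from this post, not figures read from a live system.

```python
# Illustrative sketch of the True/Total Job Weight arithmetic described above.
# The job weights and intervals are the hypothetical example values from the
# post, not values taken from an actual Orion installation.

THROTTLE_RATIO_UNIT_MINUTES = 1  # the "throttle ratio unit of time" in the example

def true_job_weight(job_weight, polling_interval_minutes):
    """True Job Weight = job weight / (interval / throttle ratio unit of time)."""
    return job_weight / (polling_interval_minutes / THROTTLE_RATIO_UNIT_MINUTES)

# 300 identical jobs, each with weight 100 and a 10-minute interval,
# reproduce the example from the post: 300 x 10 = 3000.
jobs = [(100, 10)] * 300
total_job_weight = sum(true_job_weight(w, i) for w, i in jobs)
print(total_job_weight)  # 3000.0
```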
Polling Rate – tells you what the current utilization of your Nodes, Volumes and Interfaces polling capacity is. Any value below 100 is OK. If it exceeds 85%, you are approaching the maximum amount of polling your server can handle and you will see a notification banner in the Orion web console (see below). If the polling load is more than the new poller can handle (i.e. more than 100%), the polling intervals are automatically increased to handle the higher load. That means that even if your CPU is not fully utilized, the new poller will stretch the polling intervals once you reach the polling rate limit. The polling rate uses Total Job Weight as its base value. For example, if there are only NPM polling jobs with a total weight of 3000, we use the following formula to get the polling rate:
(3000 (Total Job Weight)/2600 (Maximum polling load for NPM jobs)) x 100 = 115% (polling rate).
Because in this case the value exceeded 100%, throttling is applied, which means the polling intervals of all NPM jobs are multiplied by 1.15.
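The polling rate and throttling step can be sketched the same way. Note that the maximum load of 2600 is simply the figure quoted in the example above and should not be read as a documented limit for your hardware.

```python
# Sketch of the Polling Rate and throttling calculation described above.
# MAX_NPM_LOAD of 2600 is the example figure quoted in the post; treat it
# as illustrative rather than a documented constant.

MAX_NPM_LOAD = 2600

def polling_rate(total_job_weight, max_load=MAX_NPM_LOAD):
    """Polling Rate (%) = Total Job Weight / maximum polling load x 100."""
    return total_job_weight / max_load * 100

rate = polling_rate(3000)
print(round(rate, 1))  # ~115.4 %

# Above 100 % the poller throttles: polling intervals are stretched by the
# same factor, so a 10-minute job would effectively run every ~11.5 minutes.
if rate > 100:
    throttle_factor = rate / 100          # ~1.15
    print(round(10 * throttle_factor, 1))  # new effective interval in minutes
```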
As I mentioned, the old poller was single-threaded but the new poller is multi-thread capable, so we had to introduce a throttling mechanism that prevents the new poller from consuming all available resources for Nodes/Volumes/Interfaces polling alone. Thanks to the throttling mechanism your system should have enough resources left for other applications like SolarWinds APM or UDT, and also for Orion web console performance.
You can also check this KB article.
If you are a real geek, you have probably guessed that there is more to follow than just those three metrics. And you are right: our development team exposed a really nice set of details about the polling engine that you can monitor in real time, so keep reading.

Advanced monitoring of new poller performance

If you really want to understand what is behind the new poller, the best way to monitor the performance of polling, results processing and storing of results to the database is through performance counters.
The complete list of all available counters related to the new poller is attached below. If you need to double-check your performance status, you should watch mainly the following items:
DPPL waiting items – this counter should not be growing constantly over time. Ideally it should go to zero between polling intervals. If this value keeps growing, your polling results are not being processed in the expected time and you may see gaps in charts or in poll reports. This is usually caused by slow hardware. (DPPL stands for Data Processor Pipeline.)
DPPL Avg. Time to Process item – reflects the time needed to write polling results to the database. The optimal value should be less than 0.500 (ms); otherwise you will experience noticeable delays between result processing and storage to the database.
Scale Factor: Orion.Standard.Polling – represents the scale factor/polling rate I mentioned above. It tells you whether the system is throttling and what the current utilization is.
Messages in Queue – if this value is persistently growing, it means you are not able to process all poll results on time. This is usually caused by slow hardware, but database connectivity and performance also play a key role in the Collector's ability to store results and can cause a queue backup.

If you run "perfmon" from the Windows Start menu (or command line) and paste (use CTRL+C and CTRL+V) the attached counter list into the Performance Monitor window on your Orion server, you will see all the counters related to polling performance:
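If you prefer scripting to the Performance Monitor UI, the counters can also be discovered programmatically. Below is a minimal Python sketch that assumes the pywin32 package is available on the Orion server; the keyword filters are assumptions based on the counter names mentioned above, so adjust them against the attached counter list.

```python
# A minimal sketch (assuming the pywin32 package is installed) that lists the
# performance counters on the Orion server whose names look related to the
# new poller. The keyword filters below are guesses based on the counter
# names mentioned above; check the attached countersProperties.txt for the
# exact object and counter names.
import win32pdh

KEYWORDS = ("dppl", "orion", "queue")

for obj in win32pdh.EnumObjects(None, None, win32pdh.PERF_DETAIL_WIZARD, True):
    try:
        counters, _instances = win32pdh.EnumObjectItems(
            None, None, obj, win32pdh.PERF_DETAIL_WIZARD)
    except Exception:
        continue  # some objects cannot be enumerated without extra privileges
    hits = [c for c in counters if any(k in c.lower() for k in KEYWORDS)]
    if hits:
        print(obj)
        for counter in hits:
            print("    " + counter)
```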

countersProperties.txt

  • Nice post, very useful.

    Could you kindly tell me, if the DPPL waiting items are critical, how to find out where the exact problem is?

  • RichT,

    I am as surprised as you, to be honest. You are the first person I have spoken with who actually shares my query/concern. I honestly thought I was going a little crazy to be seemingly the only person chasing these answers. To answer your question though, no, I never got an exact answer either, and I tried asking here and via support tickets.

    For now it seems most people are happy with the extra polling power the multi-threaded poller provides, but I would predict that this will resurface again once users start squeezing their multi-threaded pollers for more than they are capable of; only then will they realise that capacity planning the pollers in almost any way with the provided data is not possible.

    Regards

    Ciaran

  • I require the exact same information, Ciag. Did you get a response?

    1. What is the 'throttle ratio unit of time' shown here as 1 minute?

    2. What is the 'job weight' of individual polling types? All we know is that NPM jobs > 1.

    It would seem that without either of these values the 'Total Job Weight' and hence the polling load can't be calculated.

    And where does the 'maximum load for NPM jobs' of 2600 come from? And what's the 'throttle group' construct?

    For something that's so fundamental to the sizing and scaling of NPM (and apparently in conflict with SLX licensing) I'm surprised there hasn't been more comment here.

  • We have been having quite a few issues with our polling engine stopping after reboots, and very slow console response when trying to access the Orion web views. I have included our perf counters and am curious if anything looks to be out of whack here. Any input would be most welcomed.

    Orion Pref Counters.jpg

  • Hi folks,

    Can someone please explain to me what the 'throttle ratio unit of time' is? Is it a constant or a variable?

    Also, some detail around the weight of different polling types would be very useful. For example, what weight is interface status monitoring, or what weight is interface statistic monitoring?

    It mentions above how to calculate the 'True job weight', which is the weight of a job type divided by the polling interval for that job type, but there is no mention of what job types carry what weight. All that is mentioned is that NPM jobs are the only ones that can carry a weight greater than 1.

    Unless I am missing something here, if you don't know the weight of your different job types you cannot work out your true job weight, or am I mistaken?

    Regards

    Ciaran
