
This post compiles important information you need
to know about the "New Poller" metrics and performance measurement.

 

In NPM 10.2 we implemented a brand new polling mechanism that is responsible for scanning Nodes, Volumes, and Interfaces (via SNMP, ICMP, and WMI queries). The old poller could use only a single thread, so its performance and scalability were limited, which was one of the main drivers for implementing a totally new poller that can take advantage of multi-threaded (multi-core) environments. Since the new poller is highly scalable, we would like to give you a few tips on how to measure and control its overall performance.

 

You may also hear the term "Collector" in connection with the new poller, so let me clarify the wording before you dive into the details. "New poller" is the generic term for the new polling mechanism, which you can think of as a set of services and components. The Collector is the controller of the overall polling activity: it maintains all polling schedules and stores poll results in the database. To execute the polls (SNMP, ICMP, and WMI requests), the Collector communicates with the Job Engine, a service that runs the polling jobs and returns the results back to the Collector.

 

What are the new metrics I can use to monitor the new poller?

 

When we introduced the new poller, we also added new important metrics that you can monitor in the main Orion web console and that give you control over overall polling performance.

 

There are three new important metrics you should watch (click on Settings -> Polling Engines to see the screen below):

 

 

Polling Completion – represents the percentage delay in the primary polling mechanism (it covers all product polling: NPM, APM, UDT...). This metric is computed from the delay of every single poll job; as long as the value is below 100, polling jobs are delayed but NOT discarded. For example, if polling completion is 80%, polling is on average 20 seconds behind, so in the 10.2 release 1% represents 1 second of delay in the polling schedule (this ratio may change in the future). Polling completion changes frequently because it reflects the average of the last 100 polling jobs. It is mostly impacted by CPU and memory resources, so if you see polling completion significantly lower than 100%, check your CPU and memory utilization.
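 

As a rough illustration of the ratio described above (1% of polling completion corresponds to 1 second of average delay, averaged over the last 100 polling jobs), here is a minimal Python sketch. The function name and the exact averaging are my assumptions for illustration, not the actual Orion implementation:

    # Minimal sketch of the 10.2 ratio described above; illustrative only.
    def polling_completion(recent_delays_seconds):
        """Estimate polling completion (%) from the latest poll-job delays."""
        last_100 = recent_delays_seconds[-100:]        # only the most recent 100 jobs count
        avg_delay = sum(last_100) / len(last_100)      # average delay in seconds
        return max(0.0, 100.0 - avg_delay)             # 1 second of delay ~ 1 %

    # Example: jobs running on average 20 seconds late -> 80 % completion
    print(polling_completion([20.0] * 100))            # 80.0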

 

Total Job Weight (Total Job Weight for NPM Jobs) – represents the complexity of the current polling plan and serves as the base value for calculating the polling rate. It is a combination of all elements managed by Orion and their polling frequencies.

 

Total Job Weight is basically the sum of the "True Job Weight" of each job in the Job Engine. To get the "True Job Weight" we use the following formulas, where every type of polling job has a predefined "weight" and a polling frequency. Let's assume we have a job with a weight of 100 and a polling interval of 10 minutes:

 

10 minutes (interval) / 1 minute (throttle ratio unit of time) = 10

100 (job weight) / 10 (interval / throttle ratio unit of time, from above) = 10 (True Job Weight)

 

We apply this to each job defined in the throttle group and sum the true weights of all jobs. For example, let's say we have 300 jobs with the same job weight and polling interval:

 

10 (True Job Weight for Job 1) + ... + 10 (True Job Weight for Job 300) = 3000 (Total Job Weight)
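 

Putting the two formulas above together, here is a minimal Python sketch of the weight arithmetic. The function names and the one-minute "throttle ratio unit of time" are taken from the example in this post; this is an illustration, not the actual Orion code:

    # Minimal sketch of the True/Total Job Weight arithmetic described above.
    THROTTLE_UNIT_MINUTES = 1  # throttle ratio unit of time (1 minute, per the example)

    def true_job_weight(job_weight, polling_interval_minutes):
        """True Job Weight = job weight / (polling interval / throttle unit)."""
        return job_weight / (polling_interval_minutes / THROTTLE_UNIT_MINUTES)

    def total_job_weight(jobs):
        """Total Job Weight = sum of the True Job Weights of all jobs."""
        return sum(true_job_weight(weight, interval) for weight, interval in jobs)

    # Example from the post: 300 jobs, each with weight 100 and a 10-minute interval
    jobs = [(100, 10)] * 300
    print(true_job_weight(100, 10))   # 10.0
    print(total_job_weight(jobs))     # 3000.0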

 

Currently only NPM jobs have a weight larger than "1". For instance, APM poll jobs won't have a big impact on the scale factor until there are a lot of them. Right now, this is really just a number that tells us how much "weight" is going through the system per unit of time. We don't have a number that represents the high end of what can be run per unit of time.

 

Polling Rate – this tells you the current utilization of your Nodes, Volumes, and Interfaces polling capacity. Any value below 100% is OK. If it exceeds 85%, you are approaching the maximum amount of polling your server can handle and you will see a notification banner in your Orion web console (see below). If the polling load is more than what the new poller can handle (i.e. more than 100%), the polling intervals will automatically be increased to handle the higher load. That means that even if your CPU is not fully used, the new poller will increase polling intervals if you reach the polling rate limit. Polling rate uses Total Job Weight as the base value. For example, if there are only NPM polling jobs with a total weight of 3000, we use the following formula to get the polling rate value:

 

3000 (Total Job Weight) / 2600 (maximum polling load for NPM jobs) x 100 = 115% (polling rate)

 

Because the value in this case exceeds 100%, throttling is applied, which means the polling intervals of all NPM jobs are multiplied by 1.15.
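 

Here is a minimal Python sketch of the polling rate and throttling math described above. The 2600 "maximum polling load for NPM jobs" is taken from the example in this post, the function names are mine, and the interval stretching follows the post's statement that polling intervals are automatically increased; it is an illustration, not the actual Orion implementation:

    # Minimal sketch of the polling-rate and throttling math described above.
    MAX_NPM_JOB_WEIGHT = 2600  # maximum polling load for NPM jobs (from the example)

    def polling_rate(total_job_weight):
        """Polling rate (%) = Total Job Weight / maximum NPM job weight * 100."""
        return total_job_weight / MAX_NPM_JOB_WEIGHT * 100.0

    def throttled_interval(interval_minutes, rate_percent):
        """If the rate exceeds 100 %, stretch the polling interval by rate/100."""
        if rate_percent <= 100.0:
            return interval_minutes                      # no throttling needed
        return interval_minutes * rate_percent / 100.0   # e.g. 115 % -> 1.15x longer interval

    rate = polling_rate(3000)                # ~115.4 %
    print(round(rate))                       # 115
    print(throttled_interval(10, rate))      # a 10-minute poll becomes ~11.5 minutes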

 

As I mentioned, the old poller was single-threaded, but the new poller is multi-thread capable, so we had to introduce a throttling mechanism that prevents the new poller from consuming all available resources for Nodes/Volumes/Interfaces polling alone. Thanks to the throttling mechanism, your system should have enough resources for other applications such as SolarWinds APM or UDT, and also for Orion web console performance.

 

 

You can also check this KB article.

 

If you are a real geek, you have probably guessed that there is more you can follow than just those three metrics. And you are right. Our development team exposed a really nice set of details about the polling engine that you can monitor in real time, so keep reading.

 

Advanced monitoring of new poller performance

 

If you really want to understand what is behind the new poller, performance counters are the best way to monitor the performance of polling, results processing, and storing results to the database.

 

 

This is the complete list of all available counters related to the new poller. If you need to double-check your performance status, you should mainly watch the following items:

 

DPPL Waiting Items – this counter should not grow constantly over time. Ideally it should drop to zero between polling intervals. If this value keeps growing, your polling results are not being processed in the expected time and you may see gaps in charts or in poll reports. This is usually caused by slow hardware. (DPPL means Data Processor Pipeline.)

 

DPPL Avg. Time to Process Item – reflects the time it takes to write polling results to the database. The optimal value should be less than 0.500 ms; otherwise you will experience noticeable delays between result processing and storage to the database.

 

Scale Factor: Orion.Standard.Polling – represents the scale factor/polling rate I mentioned above. It tells you whether the system is throttling and what the current utilization is.

 

Messages in Queue – if this value is persistently growing, it means you are not able to process all poll results on time. This is usually because of slow hardware, but database connectivity and performance also play a key role in the Collector's ability to store results and can cause a queue backup.

 

If you run “perfmon” from the Windows Start menu command line on your Orion server and paste (using CTRL+C and CTRL+V) the attached counter list into the Performance Monitor window, you will see all the counters related to polling performance:

 

 

  

 

