I eagerly await answers to this... I sometimes find my pollers have "gone to lunch", using 95% of CPU or so, usually a few months after an upgrade (which may or may not be related to tuning :).
It would be NICE to have a choice of auto-tune or manual, so that in auto mode it adjusts the recommended polls per second as well.
First, as a small start, here are a few links to some other posts that contain some useful nuggets to be aware of. Some of the posts are old so a bit of the info no longer applies, but they are still a good read. Enjoy!!
Good stuff - also note there is an Orion Level II customer training tomorrow and Aug 19 which will cover poller tuning and other topics. Brandon Shopp will be delivering this and he is a guru on this stuff. Tomorrow is booked full but you can still register for the 19th. Here is the reg link...
Thanks for that, guys. I've signed up for the session on the 19th.
I have asked these questions in the past and had intended to add the answers to this post, but I must not have retained them, as I am unable to find them in my documentation. So I'm asking them again.
1. Polling completion percentage:
A. To prevent gaps in graphs it is recommended that the polling completion percentage stays above a certain mark. In the past I was told a minimum completion percentage of 99.5% was enough, but more recently it was suggested that 99.7% or higher should be the minimum. So what minimum polling completion percentage should we aim for to prevent gaps in graphs?
B. How is the completion percentage calculated? Is it taken directly from the polling engine values, or does it also consider exactly what gets written to the database?
C. Is it possible to get a completion percentage of 100%? If not, why?
2. Assuming a poller is correctly tuned and within its limit of elements, what are the most likely causes that slow down or prevent the polling cycle from completing?
E.g. down nodes or slow-responding nodes?
I definitely have some useful documentation about tuning that I will fish out and upload when I get some time.
Poller Tuning Community Tips
NOTE: The information below only applies to the now-old Netperfmon polling engine; it does not apply to the new Collector services. The Netperfmon service will soon be phased out, which will make most of the information herein redundant.
This is an adaptation of a document I created for other Admins I work with who use NPM. Within are some of the tips I have used to successfully tune my Orion NPM polling engines to achieve optimal performance from them. The majority of the information here is meant to be used in addition to that provided in the NPM Admin guide, so it is assumed that the reader is familiar with the core concepts set out there. Please feel free to add your own tips or correct any mistakes. Some of this information has been provided by SW engineers and some of it has been learned through the tuning process. I have tried to be as comprehensive as possible, meaning there is probably more detail in here than most people require, but I'm hoping it will benefit those who are in the midst of tuning or who just want to give the performance of their pollers a little kick.
If you haven't done so already, you should do some basic inventory work, documenting the number of elements of each type per polling engine. Also know the polling intervals for those element types. Knowing these stats is an essential part of tuning. Later in the document you can compare your stats against some known limits of the polling engines; this will tell you whether your expectations for the poller's performance are plausible or beyond its abilities, in which case you may need to scale back your expectations by reducing the frequency of polls or the number of monitored elements. If scaling back is not desirable, you may have to consider buying another polling engine.
Gathering the element type totals and comparing them against the specified limits provided within this document will give you a fair idea of what is possible from your poller. You should gather or be aware of the following numbers for each polling engine:
Total number of elements
Number of Node elements
Number of Interface elements
Number of Volume elements
The current polling interval for each type of monitored element
The desired polling interval for each type of monitored element
Of all the limits, one of the most important, if not the most important, is the maximum number of elements a single poller is capable of polling. The magic number here, as most of you will know, is roughly 8000 elements. That's the limit for elements being polled at 'default' intervals without experiencing gaps in graphs, but it is by no means an exact figure and varies based on the environment your NPM poller is operating in.
This number is important when calculating how much additional strain will be placed on the polling engine if the polling intervals are altered. You need to be sure that your current or expected demands will not exceed this limit and outstrip the capacity of the poller.
So what happens when we start altering polling intervals, and how does altering the intervals affect the number of polled elements? Well, it's really quite simple; let's say I halve the default polling interval setting for an element type. This is the same as doubling the number of polls for that element type over the original time period.
For example, if the polling interval for 1000 node elements is halved from 10 min to 5 min, this effectively doubles the load on the poller. Put simply, instead of polling each node once every 10 min it will now poll each node twice every 10 min, equivalent to 2000 polled elements over 10 minutes.
1000 node elements @ 5min interval = (is equivalent to) 2000 polled node elements every 10 min
This method treats the default polling intervals as a constant: when an interval setting is altered, only the number of polls within the default polling interval changes. This is just one perspective for weighing the number of polls against the number of elements, but it is the one I find useful for calculating the load on a poller. With this in mind, take a look at the scenario below, which can be used to gauge the load on the poller.
Firstly, the default values for the polling intervals are:
Nodes: 10 min
Interfaces: 9 min
Volumes: 15 min
Node elements: 1500 Polling interval: 5 min
Interface elements: 1500 Polling interval: 9 min
Volume elements: 452 Polling interval: 15 min
Total number of elements: 3452
Total number of polls needed to complete the polling cycle: approx. 4952
So how did I come to the figure of 4952? Well, it’s all about the polling intervals.
If you notice, in the above scenario we halved the interval setting for the node statistics. Doing this effectively doubles the load on the poller for that statistic, so instead of polling 1500 nodes every 10 min for statistics it's actually going to poll 1500 nodes every 5 min, which is 3000 every 10 min; that's 2 full cycles over the original default time period.
The same is also true in the opposite direction. If, for example, you increase the interval for volumes from 15 min to 30 min, you halve the load on the poller for volume statistics collection.
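The arithmetic above can be sketched as a quick back-of-the-envelope calculation. This is just my own illustration using the scenario's numbers (the 452 volumes at the 15 min default are implied by the totals), not anything read from NPM itself:

```python
# Convert each element type to an "equivalent" number of polls over its
# default interval, per the method described above. Counts and intervals
# below are the worked scenario from this post, not real NPM output.

DEFAULT_INTERVALS = {"node": 10, "interface": 9, "volume": 15}  # minutes

def equivalent_polls(element_type, count, actual_interval):
    """Polls this element type generates per default-interval window."""
    return count * DEFAULT_INTERVALS[element_type] / actual_interval

scenario = [
    ("node", 1500, 5),       # halved from the 10 min default -> double load
    ("interface", 1500, 9),  # left at the default
    ("volume", 452, 15),     # left at the default
]

total_elements = sum(count for _, count, _ in scenario)
total_polls = sum(equivalent_polls(t, c, i) for t, c, i in scenario)

print(total_elements)      # 3452
print(round(total_polls))  # 4952
```

The nodes alone contribute 1500 * 10/5 = 3000 equivalent polls, which is where most of the extra load in this scenario comes from.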
So we've covered a simplistic view of the max limit for the total number of elements, and that it should not exceed approx. 8000 per poller. But there is another limit that can be important, especially if you are dealing with very short intervals: the number of polls per minute that a poller can handle. Once again this is not an exact figure, but approximately the poller can handle 1000-1500, maybe even 2000, polls per minute. This is a large window, I know, but later in this doc I will outline methods you can use to determine if your poller is near its limit.
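To see how this per-minute limit relates to element counts and intervals, here is a minimal sketch using the same illustrative numbers as the earlier scenario (substitute your own counts and intervals):

```python
# Rough polls-per-minute estimate: each element generates one poll every
# `interval` minutes, so the steady-state rate is count / interval, summed
# over all element types. Compare the result against the ~1000-1500
# (maybe 2000) polls/min ceiling mentioned above. Numbers are illustrative.

elements = [
    (1500, 5),   # nodes at a 5 min interval
    (1500, 9),   # interfaces at a 9 min interval
    (452, 15),   # volumes at a 15 min interval
]

polls_per_minute = sum(count / interval for count, interval in elements)
print(round(polls_per_minute))  # 497
```

At roughly 497 polls/min this hypothetical poller sits comfortably under the ceiling, which is consistent with the scenario being well within the 8000-element limit too.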
Polls per Second tuning
In the last section we worked out how the number of elements and the polling intervals are related; we also covered that the limits of the poller are variable and differ from one environment to another. One tool that helps us compensate and customise performance for some of these environmental variations is the 'Polls per second tuning' tool.
This tool is quite straightforward to use, but there are a few things you need to keep in mind when using it.
Whichever creative methods you use to choose your max PPS (Polls Per Second) value, I would be interested in hearing about them, so feel free to add them below.
Personally I don’t use the recommended values within the tool and the ‘max polls’ calculation in the admin guide doesn’t really suit my environment either. So I have to use other methods for PPS statistics tuning.
For example, currently my max stats collection per second is set to 32 pps, but the recommendation in the tuner tool is 49 pps. I have always found that in my environment choosing a pps value much lower than the recommendation yields better results (there's a good chance this may not be the case for your environment). There are a few tools I was able to use to help me come to a max value of 32 pps, and you too should keep an eye on them while tuning.
The first tool you should always keep an eye on when tuning is the 'Polling status' tool in the System Manager. It is the best way to see what is happening with the engine in real time. What I found when the max PPS was set too high for my poller was that the max outstanding polls figure was constantly up over 300. I also saw prolonged periods where 'Avg Disk Queue Length' on the SQL server was at excessive levels, and most importantly I had gaps in my graphs; always keep checking your graphs!! I was able to view the avg disk queue length in the 'Performance Monitor' on the SQL server; more on using the Performance Monitor later in the doc.
The 'Polling status' tool is for me one of the more useful tools when tuning. However, beware that having the 'System Manager' open puts a significant drain on the polling engine's performance, so try not to keep it open longer than you need to.
Within the 'Polling Status' window there are a number of statistics that are important with regard to the health of your polling cycle. The first of these I always check upon opening the window is 'SNMP Outstanding'. I will watch this for about 10-15 seconds. This counter should bubble up and down pretty much constantly. For example, it might be 0, then 100, then 23, next 276, then 0 again, and so on. This is considered normal, healthy operation. Even if it is briefly bursting up to 300 or more, it should still be OK. You start running into trouble when it never goes below approx. 200-300.
Troubleshooting gaps in graphs
OK, so what if you've tried tuning the engines but there are still gaps in your graphs, grrrr!!!? Fear not, the following sections provide some tips that can be used when troubleshooting gaps in graphs.
- Polling Status
After you have confirmed that your current polling requirements are within the engine's ability and you have tuned your poller accordingly, the first place you should go to confirm everything is healthy is the 'Polling Status' window in the System Manager. This is only available on the polling engine server, is real-time, and is only relevant to the polling server you are logged into. If you have multiple pollers you will have to log into each engine and review the 'Polling Status' for each one.
As mentioned earlier, a healthy polling cycle should show the 'bubbling' behaviour in the 'SNMP Outstanding' figures. Watch out for figures that never go below 200-300 or that never move off 0. Beware that if you hit the 'Max Outstanding Polls' limit (900), polling will pause for a set interval. If this is happening to you it might be caused by one of a few things, but the first thing I would do is recheck your number of elements and compare them to the limits set out above. If these are good, go back and retune the poller. If that yields no improvement, check the SNMP timeouts. More on timeout values below.
Another stat to watch is:
SNMP Statistic Polling Index - In a healthy polling cycle this should climb to the max, i.e. 2000 out of 2000. If the cycle keeps restarting without ever getting close to the max limit, this might suggest the poller is running out of time before completing the cycle.
Network Devices & Timeouts
Another factor to consider is the SNMP timeout setting. It might not sound like much, but if you're pushing your poller hard and it is waiting excessive amounts of time for responses because the timeouts are set too high, this can affect your poller's ability to complete its full cycle. One tip I have picked up via SW is to reduce the SNMP timeout values if you can. First find the devices on your network which are consistently the slowest to respond. Measure their worst response times over a given period, then double that figure and set it as your overall SNMP timeout value. This is a very easy and effective way to increase poller efficiency. The benefits will probably be more apparent in medium to large networks, but I would still highly recommend giving it a try regardless of network size if your network allows for it. Just make sure that you're not decreasing the value too much, or you could start seeing missed polls on slower-responding elements.
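The timeout tip above boils down to a one-liner. The device names and millisecond figures in this sketch are made up purely for illustration:

```python
# Worst observed SNMP response time (ms) per device over some measurement
# window. These hostnames and values are hypothetical examples only.
worst_response_ms = {
    "core-switch-01": 180,
    "wan-router-03": 750,   # slowest responder in this made-up sample
    "dist-switch-07": 240,
}

# Per the tip: take the worst time among your slowest devices and double it.
recommended_timeout_ms = 2 * max(worst_response_ms.values())
print(recommended_timeout_ms)  # 1500
```

In this made-up sample the slowest device answers in 750 ms, so the overall timeout would be set to 1500 ms rather than a much larger default, keeping the poller from stalling on dead nodes.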
- Polling Completion % Accuracy
You might have noticed that not once through the tuning process have I mentioned the polling completion tool. Until recently I used the 'Polling Completion Percentage' statistic from the 'Engines' table within the NPM database as the main verification method for poller tuning. I had treated it as a reliable source, and it wasn't until I started running more detailed troubleshooting of my NPM installation in relation to poller tuning that I began to question the accuracy of the reported values. Before I proceed I just want to clarify that, for the early stages of poller tuning and as part of a general polling engine health check, I would still recommend reviewing the 'Polling Completion Percentage'.
Some common questions about this statistic:
What is a good completion % so that I don’t have gaps in my graphs?
Like other facets of poller tuning, there is no magic number where gaps in graphs suddenly disappear. But from my own experience I would recommend first getting the completion % consistently above 99% as a start. This is probably considered quite a low percentage, and I have been told to get it to 99.5% or above in the past. However, I haven't seen any gaps in my graphs since I finished tuning, and my completion percentage is around 99.0-99.2%. My main point here is: don't allow yourself to become fixated on the completion %. I'll explain in more detail below.
Can the polling completion % be 100%, if not why not?
The short answer is no, not likely. But it's not important to get 100%. Why? To answer this we need to understand just a little of what the completion % measures. Put simply, it is a measurement taken from a 'checklist' of tasks completed, or yet to be completed, by multiple services and components running on the polling engine server. I can't go into more detail because, honestly, I don't know much more than that; but we don't need to. What we can take from this is that the polling completion percentage is a measurement taken at the beginning of the polling process and does not truly measure polling success for the entire process. It does not compare what gets polled against what actually gets written to the database. But as I mentioned above, this statistic is still useful for the early stages of tuning and for general health checks.
- Graph creation Accuracy
When it comes to graph creation in the web GUI there can be anomalous behaviour in the drawing of the graphs. Sometimes gaps can appear in a graph even where a poll was successfully executed and written to the database. The cause is that the polling engine does not poll its elements in the same order each cycle, which means that statistics for elements may not appear in the timeline at exact, set intervals. (There is a good reason for this which is not relevant to this topic.) In some cases, when the graph is being drawn and it finds a data point which is not quite at the position in time where it expects it to be, the graph can leave that point blank and a gap will appear.
Personally this is something I have never seen in our environment, but if you think it may be affecting you there is a very simple way to check: look at the database details for the time of the gap and see whether a poll was actually missed, or whether the data is present in the DB but the graph ignored it. If you see this issue I would suggest contacting SolarWinds about it or submitting a feature request. SW are already aware of the issue, but the more people who highlight it the more likely they are to fix it.
- Server Performance
One of the more difficult things I found to do was accurately analysing SQL server performance. I'm not a SQL specialist by any stretch, but I found a very useful document online that was a big help to me and which I would like to share here. It goes a good deal further than just looking at avg disk queue length and will categorically tell you if you have any weak points in your environment. It will even instruct you on what needs to be done to correct any weaknesses.
This article is available at
And this may also be useful :
If you find that after tuning and troubleshooting there are still a small number of devices exhibiting gaps in graphs, there might be characteristics specific to these devices that are causing the gaps. It could be worth looking into the following topics to find a resolution.
Static interval settings
If you have adjusted interval settings, it is possible that you may not have changed the interval for every node, interface or volume. Statically assigned intervals will take precedence over a mass interval change rollout unless you use the 'Re-apply' option. To check for this I have developed three reports, one for each type of element, which show the elements that do not comply with the specified interval setting. The reports are named below and are available in the content area.
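As a rough illustration of what such a report does, the logic boils down to selecting every element whose interval differs from the value you rolled out. The table and column names in this sketch are a simplified mock built in sqlite, not the real Orion schema:

```python
import sqlite3

# Mock table standing in for a poller's node list; the real Orion database
# schema differs, so treat all names here as placeholders only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (caption TEXT, poll_interval INTEGER)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [("node-a", 300), ("node-b", 600), ("node-c", 300), ("node-d", 120)],
)

EXPECTED_INTERVAL = 300  # seconds; the interval you rolled out en masse

# The report: every node whose statically assigned interval does not comply.
non_compliant = conn.execute(
    "SELECT caption, poll_interval FROM nodes WHERE poll_interval != ?",
    (EXPECTED_INTERVAL,),
).fetchall()
print(non_compliant)  # [('node-b', 600), ('node-d', 120)]
```

The same WHERE-clause idea, pointed at the appropriate node, interface or volume table, gives you one report per element type.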
- Adjusting static interval elements
If you are happy to have the same interval for every polled element of that type, then using the re-apply option is the best choice here.
If you have a large number of devices which need to be set to customised intervals, then you will still need to adjust them manually.
High number of interfaces / slow-responding node
Some Cisco switches, such as the Cat 6500, can have a large number of interfaces. If you are monitoring every one of these interfaces on a busy switch, the SNMP packets may not be getting back in time or they may be getting dropped, as SNMP packets can be among the first things dropped when a Cisco device is overly stressed. In this case you are limited as to what you can do. From the NPM side you can increase SNMP timeout values, increase the polling interval for the node, or reduce the number of polled elements on the device. On the device you might be able to increase memory, implement traffic calming/load balancing measures, or change QoS settings, although altering QoS isn't really a solution.
For the moment that's it. Most users will be familiar with this info, but I hope this doc will prove useful to some. Poller tuning is a bit of a black art; for some environments it's straightforward, while for others it requires significant effort to get right. With my environment being the latter, I can understand the need for a doc like this, and that has been my main encouragement for getting it up on Thwack. I hope others will join in and contribute to this post.
You mentioned that the SNMP Statistic Polling Index should not be "consistently staying at just one number or never moving off 0".
My understanding is that this number measures how many of the SNMP queries generated by the poller are being responded to during the allotted time interval, and that therefore a number as close to 100% as possible is optimal. Why would I want to see the number of received SNMP queries momentarily decrease? Isn't a consistent 100 of 100 better than a momentary value of 25 of 100?
Thanks for your efforts in this thread.
I was under the impression that this was a cyclical counter for the number of SNMP polls completed so far in a polling cycle, where a healthy cycle would start somewhere near the bottom and work its way through the elements until it reached the end of the list, then restart.
Having said that, the last time I discussed this item with an engineer was a few years ago, so my memory might be failing me on this one. If there are any engineers out there who would like to add their 2 cents' worth, I would be appreciative.
Some simple rules I use to keep them in good shape
- Put the poller status resource on your main screen so you can actively monitor your pollers. You will easily catch issues by paying attention to "Polling Completion" and "Last Database Sync". A polling completion that drops significantly should be investigated, especially if it goes below 99%, and obviously if the last DB sync is more than a few seconds old, something is wrong.
- Reboot your pollers at least monthly; this can clear a lot of potential issues before they happen.
- Watch the MEM and CPU utilization of your pollers; a sharp increase in either can mean something is going on.
- Run the "Polls Per Second" utility often, especially after adding or removing a large number of elements.
- 8000 elements is the limit a poller can handle that is suggested by SolarWinds. I find this to be fairly accurate. I would watch your polling completion closely as you approach that number, and if your hardware allows you to go over it I would pay even closer attention. I have had my pollers doing as many as 10k elements each, but I found this made the poller very prone to crashing, or polling would simply stop for some reason. Keeping them below 8k seems to allow them to operate with no errors for long periods of time.
- Polling can be not only MEM and CPU intensive but also disk I/O intensive, so keep your disks in good shape by making sure they are not near capacity and are defragged now and then. Polling causes a ton of small reads and writes, leading to heavy fragmentation.
- Keep an eye on your C:\windows\temp directory. Clearing this now and then can help avoid issues, and seeing things like very large files (in excess of 100MB) can mean you are frequently running out of RAM.
Thanks to everyone so far who has contributed to this post I'm delighted to see some interest in this topic.
You may indeed be correct, but we may also be saying similar things. Does the index represent the total number of polls produced over a certain time period? If so, what time period? In other words, if the CPU is set to produce 30 polls per sec and the current index reads 519, does that represent about 17 seconds of polls?
I would appreciate it if a SW engineer explained, once and for all, each term shown in the Polling Status window of System Manager and how they interact. Step by step, how does Orion collect and process ICMP and SNMP data?
Us details people would like to understand! :)
What a superb article. Thank you.
I was told by SW that if your engines run on VMs, the element count should be reduced from 8000 to 7000.
I record all of these statistics once a month so that I can easily show my boss (& his boss) that we need another polling engine. A snapshot figure is great, but being able to show the trend is crucial in getting additional funding.
I run a report monthly, which is a SQL query, and then I keep it:
SELECT ServerName, IP, KeepAlive, Elements, Nodes, Interfaces, Pollers, PollingCompletion FROM Engines
I also keep an XLS; I'll clean it up & upload it soon.
Does anyone know what effect APM or NetFlow has on the pollers? Surely the poller's capacity is reduced if APM is polling on the same server.
Great post Ciag,
Question on the number of polls needed to complete a polling cycle.
I'll use 1000 Interfaces as an example.
You say the default interval is 9 min, and if you have 1000 interfaces polling at a 4.5 min interval that equates to 2000 polls. That is fine, but that only takes into account statistics pollers.
Do you have to do the same with status polling? The same 1000 interfaces will have status pollers as well, so if we have them set at the 2 min default, that will add another 1000 polls to our total.
So for these 1000 interfaces, with status pollers at 2 min and statistics pollers at 4.5 min, this will equate to 3000 polls??
One last question: my understanding of the difference between a status poller and a statistics poller is that the statistics poller is the one that records to the DB, and the former is used for alerts etc. but not recorded. Is that correct, or can someone explain what the difference is?
Happy Christmas Everyone!!