I haven't seen gaps since version 7-something, but without additional information about your polling completion rate, polling frequency, charting rates, etc., I would have to speculate that your gaps could be from having too many elements on an individual polling engine. I have eight active polling engines, and the highest element count I have is around 5K on a single engine.
I am using 9.1 SP1 of Orion SLX. I have 2 polling engines, and the SQL DB is external. I have 6K to 7K interfaces on each polling engine, and at least from what I am hearing, polling engines are supposed to handle up to 8K.
Thanks. I would love to compare notes on equipment to see if there is something in this environment that is different. The SQL DB is running SQL Server 9.0.1399 on Windows Server 2003 SE (5.2 Build 3790).
The processor is an Intel Xeon with four 3.2 GHz CPUs and 4 GB of RAM.
C drive ("OS") is RAID 1.
D drive ("SQL") is RAID 10 with about a terabyte of disk space.
If I could compare the topology, perhaps there is something in this environment that does not work well. You mentioned a 5 thousand node limit. Did you see a problem after 5K in the past?
I open support tickets, and they age them out after 5 days and then start a new ticket. So far I've been asked for the same information several times, but no one from support has been able to step up to the plate, follow through, and resolve it. The support for this anomaly has been terrible.
I have over 21K of pollers on ONE polling engine. The application server is 2 dual core. The database is now on a 64bit SQL server that is also 2 dual core.
I have had gaps when I try to collect all of the statistics on an interval of less than 10 minutes.
There are several things that I have been successful in tuning.
1. Make sure that your SQL database LOG file is being maintained at intervals of no more than 10 to 15 minutes. The log file can get too big and take a lot of time to maintain, thus causing delays in adding more info to the database. Your DBAs will probably cry about this, but they will change it for you.
2. Make sure that you are NOT running on a SAN. While they are supposed to be fast, they actually cannot handle the rapid stream of small bits of data that Orion sends. If the Orion data gets behind several large packets, there will be degradation. We worked on that one for almost a year! Putting the DB on a physical drive on the DB server really helped.
3. Make sure that Orion is not re-discovering all of the network devices frequently. We use a one-week rediscovery interval (10080 minutes), which seems to help.
4. Cut your SNMP and Ping timeouts way down. Most every device should reply in 1 second unless it is on a very slow link. Also, if there are devices that are known to be down for a length of time, consider unmanaging them. Orion will wait for an answer from each one. If the timeouts are set high, Orion just waits.
5. Make sure that the database has been maintained properly. If the database has not been reorganized or the file structure is fragmented, it will take longer for Orion to get confirmation that the data was processed.
These are just some of the things that we have run into. They are not meant to be the end-all, be-all answers. I just thought that some of them might help you. I hope that they do.
The post at the link below indicates that each polling engine can handle about 1000 elements per minute. This means if you have a five minute polling interval then the polling engine will support around 5,000 elements.
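To make that rule of thumb concrete, here is a minimal sketch of the arithmetic. It assumes the roughly 1000 elements per minute figure quoted above holds linearly; the function names are made up for illustration and are not part of any Orion tool.

```python
# Rule of thumb quoted in the thread: about 1000 elements per minute per polling engine.
# This is an approximation, not a documented hard limit.
ELEMENTS_PER_MINUTE = 1000


def supported_elements(polling_interval_minutes, rate=ELEMENTS_PER_MINUTE):
    """Approximate number of elements one engine can poll within one interval."""
    return int(rate * polling_interval_minutes)


def min_interval_minutes(element_count, rate=ELEMENTS_PER_MINUTE):
    """Approximate shortest polling interval (in minutes) for a given element count."""
    return element_count / rate


if __name__ == "__main__":
    for interval in (5, 7, 10):
        print(interval, "min interval ->", supported_elements(interval), "elements")
    # Element count taken from one of the engines discussed in this thread.
    print("7207 elements ->", min_interval_minutes(7207), "min interval")
```

By this estimate, an engine carrying 6K to 7K elements sits right at the edge with a 7 minute interval, which leaves little headroom for slow or unresponsive devices.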
I went through every step listed in the walkthrough and made any and every change that showed different values. That included decreasing polling to 7 minutes (which is very unfortunate). Each poller has between 6 and 7 thousand elements. So far it looks the same.
I tried every single step mentioned in "http://www.solarwinds.com/support/orion/docs/gaps/gaps.htm"; however, I still have these stubborn gaps. I would happily take any advice to help get past this problem. Support pretty much has me repeating the same steps, then disappears for a few days, then asks me to repeat the same steps and send diagnostics, then goes back to step one. It's been a month. Any advanced help would really be appreciated.
What's the LAN latency between your poller(s) and your DB server?
What does the SNMP poller's "status index" number indicate? I see lower numbers there. Is that reflecting failed SNMP requests?
SNMP Status Polling Index 2738 out of 6597
Orion Network Performance Monitor Version 9.1.0
Network Node Elements 485
Interface Elements 6081
Volume Elements 31
Date Time 12/19/2008 12:17:01 PM
Max Outstanding Polls 900
ICMP Status Polling Index 6597 out of 6597
SNMP Status Polling Index 2738 out of 6597
ICMP Polls per second 0
SNMP Polls per second 52
Max Status Polls Per Second 52
DNS Outstanding 0
ICMP Outstanding 112
SNMP Outstanding 482
ICMP Statistic Polling Index 9264 out of 9709
SNMP Statistic Polling Index 9709 out of 9709
ICMP Polls per second 56
SNMP Polls per second 14
Max Statistic Polls Per Second 56
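For comparing these index lines across engines, a small sketch like the one below can turn each "X out of Y" line into a ratio. It only does the arithmetic on text pasted from the status page and makes no claim about what the index means internally; the function name and regular expression are illustrative assumptions.

```python
import re

# Matches lines such as "SNMP Status Polling Index 2738 out of 6597".
INDEX_LINE = re.compile(
    r"^(?P<label>.+ Polling Index)\s+(?P<done>\d+) out of (?P<total>\d+)$"
)


def parse_polling_indexes(text):
    """Return {label: (done, total, ratio)} for every polling index line found."""
    results = {}
    for line in text.splitlines():
        match = INDEX_LINE.match(line.strip())
        if match:
            done = int(match.group("done"))
            total = int(match.group("total"))
            ratio = done / total if total else 0.0
            results[match.group("label")] = (done, total, ratio)
    return results


# Values copied from the status output above.
sample = """ICMP Status Polling Index 6597 out of 6597
SNMP Status Polling Index 2738 out of 6597
ICMP Statistic Polling Index 9264 out of 9709
SNMP Statistic Polling Index 9709 out of 9709"""

if __name__ == "__main__":
    for label, (done, total, ratio) in parse_polling_indexes(sample).items():
        print(f"{label}: {done}/{total} ({ratio:.1%})")
```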
I second Bryan's question: what is the network latency between your pollers and the DB server?
Sorry if these have all been asked before:
How big is your DB?
What are your Status and Statistics polling intervals?
What does the Polls per Second Tuning application say?
What are your Nightly Maintenance roll-up intervals? (detailed > hourly > daily)
How long are you keeping the Data?
4GB of RAM is marginal for an installation this size.
I would suggest at least 12GB of RAM.
Please add this report to one of your pages to see what your polling completion rate is:
Admin > Customize View (pick a view and edit) > Add > Miscellaneous > Polling Engine Status
Now you can see what is going on with your pollers directly from the web page.
I ran a continuous ping between each of the pollers and the SQL DB. All were < 1 ms; I saw a very few that were 15 ms, but for the most part it is rock solid.
The DB size is about 11.3 GB.
Status and statistics polling intervals are both 420 seconds (scaled back from 300 seconds to see if it would help; it didn't help much).
The polls per second tuner is always re-aligned to be exact. I frequently add new routers, so after each addition I update the PPS field.
Nightly maintenance as follows (nightly 7 days, hourly 30 days, Daily 365 days)
Network events deleted after 30 days
Nice idea adding the polling status to the view. Any other places to look for the cause of the gaps?
Polling Engines Status
Last Database Update 23 seconds ago

Polling Engine on BEDSWPOLLER2
IP Address xx.xx.xx.xx
Last Database Sync 1 minutes, 13 seconds ago
Network Elements 487 Nodes, 6115 Interfaces, 31 Volumes, 6633 Total Elements
Running Since 12/16/2008 10:58:37 AM
Polling Completion 98.96 %
Operating System Windows 2003 Standard Edition
Service Pack 2.0
Package Orion NPM Polling Engine v9 SLX

Polling Engine on SOLARWINDS
IP Address XX.XX.XX.XX
Last Database Sync 1 minutes, 9 seconds ago
Network Elements 206 Nodes, 6913 Interfaces, 88 Volumes, 7207 Total Elements
Running Since 12/16/2008 10:56:36 AM
Polling Completion 99.12 %
Operating System Microsoft Windows NT 5.2.3790 Service Pack 2
Service Pack Service Pack 2
Package Orion NPM v9 SLX
The key for me was having the DB server on the same LAN as the pollers, with latency of less than 5 ms. We tried running it over a WAN with latency of 40 ms and it was gapping pretty badly.
Since it looks like latency is good, my only suggestion is to either scale back the polling even more (i.e., back to 10 minutes) or buy a new poller and spread the elements out even further. I'd audit what you have and remove elements you don't care about. It's amazing how much people have added but don't care about.
Oh, I see you're running NT on one of your servers. Might be time to go to 2003 x64.
Will try the LAN migration and see if that helps.
Thanks for the assistance
Looking at your Polling status, it's apparent your DB server is not up to the task.
Your polling completion should be above 99%.
Your Last Database Sync should be almost real time - usually within 10 seconds.
Once again, your DB server is THE KEY here.
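As a rough way to apply those two rules of thumb to the numbers on the Polling Engine Status resource, a sketch like this could flag engines that miss them. The thresholds (completion above 99%, sync within about 10 seconds) are only the guideline values from this post, not official limits, and the helper names are hypothetical.

```python
import re


def sync_age_seconds(text):
    """Convert '1 minutes, 13 seconds ago', '1 second ago' or 'Now' to seconds."""
    if text.strip().lower() == "now":
        return 0
    minutes = re.search(r"(\d+)\s+minute", text)
    seconds = re.search(r"(\d+)\s+second", text)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


def check_engine(completion_pct, sync_text, min_completion=99.0, max_sync_seconds=10):
    """Return a list of warnings for one polling engine (empty list means OK)."""
    warnings = []
    if completion_pct <= min_completion:
        warnings.append(
            f"polling completion {completion_pct}% is not above {min_completion}%"
        )
    age = sync_age_seconds(sync_text)
    if age > max_sync_seconds:
        warnings.append(
            f"last database sync was {age}s ago (guideline: {max_sync_seconds}s)"
        )
    return warnings


if __name__ == "__main__":
    # Values copied from status output posted in this thread.
    print(check_engine(98.96, "1 minutes, 13 seconds ago"))
    print(check_engine(99.29, "Now"))
```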
I'm running a Raid 0 local SCSI 3 disk array with 14GB of RAM on an Opteron server running Windows X64 & MS SQL 64bit SE.
Note, these are both Opteron x64 servers running Windows 2003 32bit, but are reported as NT 5.2
Here is what mine looks like (very similar to yours):
Last Database Update Now

Polling Engine on APP1234
IP Address 123.456.789.1
Last Database Sync Now
Network Elements 699 Nodes, 6379 Interfaces, 116 Volumes, 7194 Total Elements
Running Since 10/29/2008 9:03:31 AM
Polling Completion 99.29 %
Operating System Microsoft Windows NT 5.2.3790 Service Pack 2
Service Pack Service Pack 2
Package Orion Network Performance Monitor V8 SLX

Polling Engine on APP1235
IP Address 123.456.789.2
Last Database Sync 1 second ago
Network Elements 1028 Nodes, 5870 Interfaces, 77 Volumes, 6975 Total Elements
Running Since 10/29/2008 2:15:42 AM
Polling Completion 99.40 %
Operating System Microsoft Windows NT 5.2.3790 Service Pack 2
Service Pack Service Pack 2
Package Orion V8 SLX Poller
One other major difference is the Status and statistics polling intervals.
I poll for Node status every 300 seconds and CPU, Memory and Volume statistics every 300 seconds.
However, the one that really adds load to the DB is the interface statistics.
I poll critical nodes more often, but they are only about 10% of the total monitored nodes.
I poll for Interface status every 300 seconds but the default statistics poll is every 600 seconds.
I only poll the interfaces on backbone and critical circuits every 300 seconds.
I suggest you add the Polling details view to your Node details page, and Interface Polling details view to your interface details page.
Polling Details
Polling Engine APP1234 (123.456.789.1)
Polling Interval 300 seconds
Next Poll 10:04 AM
Statistics Collection 5 minutes
Enable 64 bit Counters No
Rediscovery Interval 600 minutes
Next Rediscovery 07:10 PM
Last Database Update 23-Dec-08 10:04 AM

Interface Polling Details
Polling Engine APP1234 (123.456.789.1)
Polling Interval 300 seconds
Next Poll 10:10 AM
Statistics Collection 10 minutes
Enable 64 bit Counters No
Rediscovery Interval 600 minutes
Next Rediscovery 07:42 PM
Last Database Update 23-Dec-08 10:05 AM
Hmmm, this is interesting (albeit I am running 9.1 of Orion, not v8). The polling engine with a completion rate above 99% is also having the issue. However, you mentioned the time since the last DB sync. That time should be under 10 seconds? Now that's interesting, given that I am at 2 minutes-ish. Can you elaborate on the sync process a bit? I'm trying to get a handle on what resources the SQL DB needs to complete that task, etc.
Polling Engine on BEDSWPOLLER2
IP Address xx.xx.xxx.xx
Last Database Sync 1 minutes, 58 seconds ago
Network Elements 487 Nodes, 6118 Interfaces, 31 Volumes, 6636 Total Elements
Running Since 12/16/2008 10:58:37 AM
Polling Completion 98.91 %
Operating System Microsoft Windows NT 5.2.3790 Service Pack 2
Service Pack Service Pack 2
Package Orion NPM Polling Engine v9 SLX

Polling Engine on SOLARWINDS
IP Address xx.xx.xx.xx
Last Database Sync 2 minutes, 2 seconds ago
Network Elements 206 Nodes, 6913 Interfaces, 88 Volumes, 7207 Total Elements
Running Since 12/16/2008 10:56:36 AM
Polling Completion 99.12 %
Operating System Microsoft Windows Server 2003 Standard Edition
Service Pack Service Pack 2
Package Orion NPM v9 SLX
I suggest you run the following search in this forum for additional info;
I've discussed this issue some time ago in this forum.
Re: Gap in chart
As for what the polling completion percentage means, check the following link: http://www.solarwinds.com/support/orion/docs/gaps/gaps.htm
What is the average disk queue length, as per the troubleshooting doc above?
As for DB sync, there is no fast or fixed value for this. I'm just comparing this to what I am seeing on my install, which is working fine.
Finally, let me re-iterate, what are the specs on your DB server?
You can save yourself a lot of grief and work by upgrading your DB server, before doing anything else, if it is not up to the specs I mentioned previously.
Forgive me for interjecting here, but this is a very instructive thread. I have a couple of questions:
(1) Are you seeing "gaps" for interfaces on both pollers? The reason I ask is that your polling completion on one poller shows 98.96% while the other is over 99%. Or is that difference not significant?
(2) A more general question about the Polling Status. Exactly how does one read the numbers? You have ICMP and SNMP Index for both Status (Ping?) Pollers and also for Statistics (SNMP) Pollers. It seems you should have one polling index each for ICMP and SNMP, not two. Confusing.
(3) Last, are you currently thinking that the latency over the WAN connection to your database server is the most likely culprit?
- Yes, on both pollers, and I see the interface data gaps on devices on either polling engine, on both local and distant networks. The only point of interest is that I do see "some" difference with devices near my own LAN; however, I still see the problem there as well, just not as much.
- Polling stats question I will leave up to the SW support folks
- Will be placing the SQL DB on the same network as the pollers to see if that helps.
Any update on your situation? Have you been able to move your DB server onto the LAN with your pollers?
Thank you very much I will try each step and let you know