21 Replies Latest reply on Jan 8, 2009 4:32 PM by freemen

    Continuous Gaps in interface utilization Data

    mfraser

      Hi Folks,

      I have had a long-standing problem with continuous gaps in interface utilization data. I am now using 9.1 SP1, but I have seen this problem for at least the past three upgrades. I have opened several tickets and bought a new terabyte SQL DB (RAID 10). I have 2 pollers, each monitoring approximately 300 nodes and no more than 6 to 7 thousand interfaces. I have tried everything I can think of, but no matter what I do, the gaps continue. I have a very large group of application users who are now creating workaround tools like MRTG and Cacti to get the data that should be available in Orion. Anyone care to take a stab at finally resolving this?

        • Re: Continious Gaps in interface utilization Data
          jtimes

          I haven't seen gaps since version 7-something, but without additional information about your polling completion rate, polling frequency, charting rates, etc., I would have to speculate that your gaps could come from having too many elements on an individual polling engine. I have eight active polling engines, and the highest element count I have is around 5K on a single engine.

            • Re: Continious Gaps in interface utilization Data
              mfraser

              I am using 9.1 SP1 of Orion SLX. I have 2 polling engines and the SQL DB is external. I have 6 to 7K interfaces on each polling engine, and at least what I am hearing is that polling engines are supposed to handle up to 8K.

              Thanks. I would love to compare notes on equipment to see if there is something in this environment that is different. The SQL DB is running SQL Server 9.0.1399 on Windows Server 2003 SE (5.2 Build 3790).

              The processor is an Intel Xeon with 4 3.2GHz CPUs and 4 GB of RAM.

              C Drive "OS" is RAID 1
              D Drive "SQL" is RAID 10 with about a terabyte of disk space.

               

              If I could compare the topology, perhaps there is something in this environment that does not work well. You mentioned a count of around 5 thousand per engine; did you see a problem above 5K in the past?

              I open support tickets, they age them out after 5 days, then start a new ticket. So far I've been asked for the same information several times, but no one from support has been able to step up to the plate, follow through, and resolve this. The support for this anomaly has been terrible.

                • Re: Continious Gaps in interface utilization Data
                  mhh351

                  I have over 21K of pollers on ONE polling engine. The application server is 2 dual core. The database is now on a 64bit SQL server that is also 2 dual core.

                  I have had gaps when I try to collect all of the statistics at intervals of less than 10 minutes.

                  There are several things that I have been successful in tuning.

                  1. Make sure that your SQL database LOG file is being maintained in increments of no more than 10 to 15 minutes. The log file can get too big and take a lot of time to maintain, thus causing delays in adding more info to the database. Your DBAs will probably cry about this, but they will change it for you.

                  2. Make sure that you are NOT running on a SAN. While they are supposed to be fast, they cannot actually handle the rapid stream of small bits of data that Orion sends. If the Orion data gets stuck behind several large packets, there will be degradation. We worked on that one for almost a year! Putting the DB on a physical drive on the DB server really helped.

                  3. Make sure that Orion is not re-discovering all of the network devices frequently. We use a one-week rediscovery interval that seems to help (10080 minutes).

                  4. Cut your SNMP and Ping timeouts way down. Almost every device should reply within 1 second unless it is on a very slow link. Also, if there are devices that are known to be down for a length of time, consider unmanaging them. Orion will wait for an answer from each one. If the timeouts are set high, Orion just waits.

                  5. Make sure that the database has been maintained properly. If the database has not been reorganized or the file structure is fragmented, it will take longer for Orion to get confirmation that the data was processed.

                  These are just some of the things that we have run into. They are not meant to be the end-all, be-all answers. I just thought that some of them might help you. I hope that they do.

                  • Re: Continious Gaps in interface utilization Data

                    The post at the link below indicates that each polling engine can handle about 1000 elements per minute.  This means if you have a five minute polling interval then the polling engine will support around 5,000 elements.

                    Re: Gap in chart
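
The rule of thumb above is simple arithmetic, and a minimal sketch makes it easy to try other intervals. The figure of 1000 elements per minute comes from the post; the function name and the extra intervals are only illustrative assumptions, not anything Orion provides.

```python
def max_elements(elements_per_minute: int, polling_interval_seconds: int) -> int:
    """Elements one polling engine can cover within a single polling interval."""
    return elements_per_minute * polling_interval_seconds // 60


# Rule-of-thumb rate quoted in the post above (elements per minute per engine).
RATE = 1000

# 300 seconds is the five minute interval from the post; the others are examples.
for interval in (300, 420, 600):
    capacity = max_elements(RATE, interval)
    print(f"{interval} s interval: roughly {capacity} elements per engine")
```

If an engine carries more elements than this estimate for its interval, lengthening the interval or moving elements to another engine are the two obvious levers.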

                        • Re: Continious Gaps in interface utilization Data
                          mfraser

                          Thanks Mark,

                          I went through every step listed in the walkthrough and changed every setting that showed a different value. That included slowing polling down to a 7 minute interval (which is very unfortunate). Each poller has between 6 and 7 thousand elements. So far it looks the same.

                          • Re: Continious Gaps in interface utilization Data
                            mfraser

                            Hi Mark,

                            I tried every single step mentioned in "http://www.solarwinds.com/support/orion/docs/gaps/gaps.htm" but still have these stubborn gaps. I would happily take any advice to help get past this problem. Support pretty much has me repeating the same steps, then disappears for a few days, then asks me to repeat the same steps and send diagnostics, then goes back to step one. It's been a month. Any advanced help would really be appreciated.

                              • Re: Continious Gaps in interface utilization Data
                                BryanBecker

                                What's the LAN latency between your poller(s) and your DB server?

                                BB
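
One way to put a number on the latency question above is to time TCP connections from a poller to the database server. This is only a sketch: it assumes SQL Server is listening on its default port 1433, and the function names and the host name in the usage comment are made up for illustration. It measures connection setup time, which is a rough stand-in for round-trip latency, not a replacement for a proper ping test.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port=1433, samples=5, timeout=2.0):
    """Time several TCP connects to host:port and return the results in ms.

    Raises OSError (for example a timeout) if the server cannot be reached.
    """
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        # The connection is closed again as soon as the handshake completes.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        results.append(elapsed * 1000.0)
    return results


def summarize(latencies_ms):
    """Reduce a list of latencies to min, median and max."""
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }


# Example usage, run from the poller ("sql-db.example" is a placeholder name):
#   print(summarize(tcp_connect_latency_ms("sql-db.example")))
```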

                                • Re: Continious Gaps in interface utilization Data
                                  mfraser

                                  What does the SNMP poller's "status index" number indicate? I see a lower number there. Is that reflecting failed SNMP requests?

                                   

                                  SNMP Status Polling Index    2738 out of 6597

                                   

                                  Orion Network Performance Monitor Version 9.1.0
                                  NetPerfMon Engine   
                                  Network Node Elements    485
                                  Interface Elements    6081
                                  Volume Elements    31
                                  Date Time    12/19/2008 12:17:01 PM
                                  Paused    False
                                  Max Outstanding Polls    900
                                  Status Pollers   
                                  ICMP Status Polling Index    6597 out of 6597
                                  SNMP Status Polling Index    2738 out of 6597
                                  ICMP Polls per second    0
                                  SNMP Polls per second    52
                                  Max Status Polls Per Second    52
                                  Packet Queues   
                                  DNS Outstanding    0
                                  ICMP Outstanding    112
                                  SNMP Outstanding    482
                                  Statistics Pollers   
                                  ICMP Statistic Polling Index    9264 out of 9709
                                  SNMP Statistic Polling Index    9709 out of 9709
                                  ICMP Polls per second    56
                                  SNMP Polls per second    14
                                  Max Statistic Polls Per Second    56
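
The thread does not explain what the polling index means, so the following is only the arithmetic behind the numbers in the listing above: the "out of" figure for the status pollers matches the sum of node, interface and volume elements, and the SNMP status index sits at well under half of that total at the moment of the snapshot. The variable names are illustrative.

```python
# Values copied from the NetPerfMon Engine listing above.
nodes, interfaces, volumes = 485, 6081, 31
snmp_status_index = 2738

# The status pollers report "... out of 6597", which is the element total.
total_elements = nodes + interfaces + volumes

# Share of the element total that the SNMP status index represents.
snmp_status_share = snmp_status_index / total_elements

print(f"Total elements: {total_elements}")
print(f"SNMP status index as a share of the total: {snmp_status_share:.1%}")
```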

                                    • Re: Continious Gaps in interface utilization Data
                                      Network_Guru

                                      I second Bryan's question: what is the network latency between your pollers and the DB server?

                                      Sorry if these have all been asked before:

                                      How big is your DB?
                                      What are your Status and Statistics polling intervals?
                                      What does the Polls per Second Tuning application say?
                                      What are your Nightly Maintenance roll-up intervals? (detailed > hourly > daily)
                                      How long are you keeping the Data?

                                      4GB of RAM is marginal for an installation this size.
                                      I would suggest at least 12GB of RAM.

                                      Please add this report to one of your pages to see what your polling completion rate is:

                                      Admin > Customize View (pick a view and edit) > Add > Miscellaneous > Polling Engine Status

                                      Now you can see what is going on with your pollers directly from the web page.

                                        • Re: Continious Gaps in interface utilization Data
                                          mfraser

                                          I ran a continuous ping between each of the pollers and the SQL DB. Nearly all replies were < 1ms; I saw a very few that were 15ms, but for the most part it is rock solid.

                                          The DB size is about 11.3 GB.

                                          Status and statistics polling intervals are both 420 seconds (scaled back from 300 seconds to see if it would help; it didn't help much).

                                          The polls per second tuner is always realigned to be exact. I frequently add new routers, so after each addition I update the PPS field.

                                          Nightly maintenance is as follows (nightly 7 days, hourly 30 days, daily 365 days).

                                          Network events are deleted after 30 days.

                                          Nice idea adding the polling status to the view. Are there any other places to look for the cause of the gaps?

                                           

                                           

                                          Polling Engines Status                                                        

                                           

                                          Last Database Update    23 seconds ago

                                          Polling Engine on BEDSWPOLLER2
                                          IP Address    xx.xx.xx.xx
                                          Last Database Sync    1 minutes, 13 seconds ago
                                          Network Elements    487 Nodes, 6115 Interfaces, 31 Volumes, 6633 Total Elements
                                          Running Since    12/16/2008 10:58:37 AM
                                          Polling Completion    98.96 %
                                          Operating System    Windows 2003 Standard Edition
                                          Service Pack    2.0
                                          Package    Orion NPM Polling Engine v9 SLX

                                          Polling Engine on SOLARWINDS
                                          IP Address    XX.XX.XX.XX
                                          Last Database Sync    1 minutes, 9 seconds ago
                                          Network Elements    206 Nodes, 6913 Interfaces, 88 Volumes, 7207 Total Elements
                                          Running Since    12/16/2008 10:56:36 AM
                                          Polling Completion    99.12 %
                                          Operating System    Microsoft Windows NT 5.2.3790 Service Pack 2
                                          Service Pack    Service Pack 2
                                          Package    Orion NPM v9 SLX
                                            • Re: Continious Gaps in interface utilization Data
                                              BryanBecker

                                              The key for me was that the DB server be on the same LAN as the pollers, with latency of less than 5 ms. We tried running it over a WAN with latency of 40ms and it was gapping pretty badly.

                                              Since it looks like latency is good, my only suggestion is to either turn back the polling even more (i.e. back to 10 minutes) or buy a new poller and spread the elements out even further. I'd audit what you have and remove elements you don't care about. It's amazing how much people add but don't care about.

                                              Oh, I see you're running NT on one of your servers. Might be time to go to 2003 x64.

                                              BB

                                                • Re: Continious Gaps in interface utilization Data
                                                  mfraser

                                                  Will try the LAN migration and see if that helps.

                                                  Thanks for the assistance 

                                                    • Re: Continious Gaps in interface utilization Data
                                                      Network_Guru

                                                      Looking at your polling status, it's apparent your DB server is not up to the task.
                                                      Your polling completion should be above 99%.
                                                      Your Last Database Sync should be almost real time, usually within 10 seconds.

                                                      Once again, your DB server is THE KEY here.
                                                      I'm running a RAID 0 local SCSI 3-disk array with 14GB of RAM on an Opteron server running Windows x64 & MS SQL 64-bit SE.
                                                      Note that the two polling engines below are both Opteron x64 servers running 32-bit Windows 2003, but they are reported as NT 5.2.

                                                      Here is what mine looks like (very similar to yours):

                                                      Last Database Update    Now

                                                      Polling Engine on APP1234
                                                      IP Address    123.456.789.1
                                                      Last Database Sync    Now
                                                      Network Elements    699 Nodes, 6379 Interfaces, 116 Volumes, 7194 Total Elements
                                                      Running Since    10/29/2008 9:03:31 AM
                                                      Polling Completion    99.29 %
                                                      Operating System    Microsoft Windows NT 5.2.3790 Service Pack 2
                                                      Service Pack    Service Pack 2
                                                      Package    Orion Network Performance Monitor V8 SLX

                                                      Polling Engine on APP1235
                                                      IP Address    123.456.789.2
                                                      Last Database Sync    1 second ago
                                                      Network Elements    1028 Nodes, 5870 Interfaces, 77 Volumes, 6975 Total Elements
                                                      Running Since    10/29/2008 2:15:42 AM
                                                      Polling Completion    99.40 %
                                                      Operating System    Microsoft Windows NT 5.2.3790 Service Pack 2
                                                      Service Pack    Service Pack 2
                                                      Package    Orion V8 SLX Poller
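
The two rules of thumb in this post (polling completion above 99%, last database sync within about 10 seconds) can be applied to the engine listings in this thread with a short check. This is a hedged sketch: the class and function names are invented for illustration, the values are typed in by hand from the Polling Engine Status output, and "Now" is treated as 0 seconds.

```python
from dataclasses import dataclass


@dataclass
class EngineStatus:
    name: str
    polling_completion_pct: float
    last_db_sync_seconds: int


def find_problems(status, min_completion_pct=99.0, max_sync_seconds=10):
    """Return a list of rule-of-thumb violations for one polling engine."""
    problems = []
    if status.polling_completion_pct <= min_completion_pct:
        problems.append(
            f"polling completion {status.polling_completion_pct}% is not above {min_completion_pct}%"
        )
    if status.last_db_sync_seconds > max_sync_seconds:
        problems.append(
            f"last database sync {status.last_db_sync_seconds}s ago exceeds {max_sync_seconds}s"
        )
    return problems


# Values taken from the status listings posted in this thread.
engines = [
    EngineStatus("BEDSWPOLLER2", 98.96, 73),
    EngineStatus("SOLARWINDS", 99.12, 69),
    EngineStatus("APP1234", 99.29, 0),
    EngineStatus("APP1235", 99.40, 1),
]

for engine in engines:
    issues = find_problems(engine)
    print(engine.name, "OK" if not issues else "; ".join(issues))
```

Both thresholds are one poster's experience rather than documented limits, so they are left as parameters.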
                                                        • Re: Continious Gaps in interface utilization Data
                                                          Network_Guru

                                                          One other major difference is the Status and statistics polling intervals.

                                                          I poll for Node status every 300 seconds & CPU, Memory and Volume statistics every 300 seconds.
                                                          However, the one that really adds load to the DB is the interface statistics.
                                                          I poll critical nodes more often, but they are only about 10% of the total monitored nodes.

                                                          I poll for Interface status every 300 seconds but the default statistics poll is every 600 seconds.
                                                          I only poll the interfaces on backbone and critical circuits every 300 seconds.

                                                          I suggest you add the Polling details view to your Node details page, and Interface Polling details view to your interface details page.

                                                           

                                                          Polling Details
                                                           Polling Engine    APP1234 (123.456.789.1)
                                                           Polling Interval    300 seconds
                                                           Next Poll    10:04 AM

                                                           Statistics Collection    5 minutes
                                                           Enable 64 bit Counters    No

                                                           Rediscovery Interval    600 minutes
                                                           Next Rediscovery    07:10 PM

                                                           Last Database Update    23-Dec-08 10:04 AM



                                                          Interface Polling Details
                                                           Polling Engine    APP1234 (123.456.789.1)
                                                           Polling Interval    300 seconds
                                                           Next Poll    10:10 AM

                                                           Statistics Collection    10 minutes
                                                           Enable 64 bit Counters    No

                                                           Rediscovery Interval    600 minutes
                                                           Next Rediscovery    07:42 PM

                                                           Last Database Update    23-Dec-08 10:05 AM
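
To see why the interface statistics interval matters so much for database load, here is a rough sketch comparing two schemes: every interface collected every 300 seconds, versus a small critical share at 300 seconds and the rest at 600 seconds. The interface count is taken from the APP1234 listing in this thread. Applying the "about 10%" figure to interfaces is an assumption (the post states it for nodes), and the function name is illustrative.

```python
def polls_per_second(interface_count: int, interval_seconds: int) -> float:
    """Average statistics polls per second needed to cover the interfaces."""
    return interface_count / interval_seconds


TOTAL_INTERFACES = 6379  # from the APP1234 engine listing
CRITICAL_SHARE = 0.10    # assumption: critical share applied to interfaces

critical = round(TOTAL_INTERFACES * CRITICAL_SHARE)
regular = TOTAL_INTERFACES - critical

# Scheme 1: every interface collected every 300 seconds.
uniform_rate = polls_per_second(TOTAL_INTERFACES, 300)

# Scheme 2: critical interfaces at 300 seconds, everything else at 600 seconds.
tiered_rate = polls_per_second(critical, 300) + polls_per_second(regular, 600)

print(f"Uniform 300 s: {uniform_rate:.2f} polls per second")
print(f"Tiered 300/600 s: {tiered_rate:.2f} polls per second")
print(f"Reduction: {1 - tiered_rate / uniform_rate:.0%}")
```

Under these assumptions the tiered scheme needs a little over half the statistics polls, and correspondingly fewer database writes, than polling everything at 300 seconds.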

                                                          • Re: Continious Gaps in interface utilization Data
                                                            mfraser

                                                            Hmmm, this is interesting (although I am running Orion 9.1, not v8). The polling engine with a completion rate above 99% is also having the issue. However, you mentioned the time since the last DB sync. That time should be under 10 seconds? Now that's interesting, since I am at about 2 minutes. Can you elaborate on the sync process a bit? I am trying to get a handle on what resources the SQL DB needs to complete that task.

                                                            Polling Engine on BEDSWPOLLER2
                                                            IP Address    xx.xx.xxx.xx
                                                            Last Database Sync    1 minutes, 58 seconds ago
                                                            Network Elements    487 Nodes, 6118 Interfaces, 31 Volumes, 6636 Total Elements
                                                            Running Since    12/16/2008 10:58:37 AM
                                                            Polling Completion    98.91 %
                                                            Operating System    Microsoft Windows NT 5.2.3790 Service Pack 2
                                                            Service Pack    Service Pack 2
                                                            Package    Orion NPM Polling Engine v9 SLX

                                                            Polling Engine on SOLARWINDS
                                                            IP Address    xx.xx.xx.xx
                                                            Last Database Sync    2 minutes, 2 seconds ago
                                                            Network Elements    206 Nodes, 6913 Interfaces, 88 Volumes, 7207 Total Elements
                                                            Running Since    12/16/2008 10:56:36 AM
                                                            Polling Completion    99.12 %
                                                            Operating System    Microsoft Windows Server 2003 Standard Edition
                                                            Service Pack    Service Pack 2
                                                            Package    Orion NPM v9 SLX
                                                          • Re: Continious Gaps in interface utilization Data
                                                            borgan

                                                            Forgive me for interjecting here, but this is a very instructive thread. I have a couple of questions:

                                                            (1) Are you seeing "gaps" for interfaces on both pollers? The reason I ask is that your polling completion on one poller shows 98.96% while the other is over 99%. Or is that difference not significant?

                                                            (2) A more general question about the Polling Status. Exactly how does one read the numbers? You have ICMP and SNMP Index for both Status (Ping?) Pollers and also for Statistics (SNMP) Pollers. It seems you should have one polling index each for ICMP and SNMP, not two. Confusing.

                                                            (3) Last, are you currently thinking that the latency over the WAN connection to your database server is the most likely culprit?

                                                              • Re: Continious Gaps in interface utilization Data
                                                                mfraser
                                                                • Yes, on both pollers. I see the interface data gaps on devices assigned to either polling engine, and on both local and distant networks. The only point of interest is that I do see "some" difference with devices near my own LAN; the problem still occurs there, just not as often.
                                                                • I will leave the polling stats question to the SW support folks.
                                                                • I will be placing the SQL DB on the same network as the pollers to see if that helps.
                                                              • Re: Continious Gaps in interface utilization Data
                                                                freemen

                                                                mfraser,

                                                                Any update on your situation? Have you been able to move your DB server onto the LAN with your pollers?

                                                  • Re: Continious Gaps in interface utilization Data
                                                    mfraser

                                                    Thank you very much. I will try each step and let you know.

                                                     

                                                    Mark