12 Replies Latest reply on Dec 1, 2009 3:22 PM by NinjaNerd56

    Very slow polling and/or alerting

      Running Orion 9.5 SP4 on a Windows Server 2003 SP2 VM.  The VM is a dual-core, 2.6 GHz server with 4 GB of RAM.  CPU averages 50%.  Memory utilization is around 1.67 GB.

      I have a total of 4,708 Network Elements.  290 Nodes, 4,027 interfaces and 391 volumes.  I've deleted out all of the interfaces that I can, i.e. user switch ports.

      I am finding that my alerting is very slow.  That's the first symptom anyway.  I can shut down an interface and it can take upwards of 7 minutes to get an alert.  I've verified that it's not just a slow e-mail system: in Orion System Manager, under Advanced Alerts, I refresh continually after I shut down the interface, and it takes that same 7 minutes or so before the alert shows up there.

      For this particular interface, I have the status polling set to 45 seconds and statistics at 5 minutes.  Everything else in my system is set to the default node polling interval of 120 seconds and 90 seconds for interfaces.  However, I'm wanting to get that time down as low as possible to get alerts faster.
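
      As a rough sanity check, the status-poll rate these default intervals ask for can be estimated from the element counts above.  This is only a sketch: it assumes exactly one status poll per node or interface per interval, and it ignores statistics collection, volumes, and any retries.

```python
# Estimate the status-poll demand implied by the default intervals.
# Assumption: one status poll per element per interval, nothing else.
nodes, node_interval_s = 290, 120
interfaces, interface_interval_s = 4027, 90


def status_polls_per_second(counts_and_intervals):
    """Sum count / interval over each (count, interval_seconds) pair."""
    return sum(count / interval for count, interval in counts_and_intervals)


demand = status_polls_per_second(
    [(nodes, node_interval_s), (interfaces, interface_interval_s)]
)
print(f"{demand:.1f} status polls/sec")  # prints "47.2 status polls/sec"
```

      Nearly all of that demand comes from the interfaces, so the interface interval is the setting that moves this number most.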

      The alert I'm testing has "Check this Alert every" set to 15 seconds.  Once the alert shows up, I get my e-mail right after that, so I know that's working.  My Trigger condition has "Do not trigger this action until condition exists for more than" set to 0 seconds.  No Alert Suppression is configured.  Time of day is from 12:00 AM to 11:59 PM, 7 days a week.

      Now, I say that alerting is the symptom but I think that it is not polling on a timely basis.  I say this because when I look at the node and interface in question, either in the web interface or System Manager during this whole time after I have shut the interface down, it is showing it as up.  I keep refreshing but it still shows it as up.  It's not until right around when the alert happens that it finally shows it as down.  So it seems that even though I have the polling interval set low, it's not polling as it should.

      Any ideas?

        • Re: Very slow polling and/or alerting
          warbird

          What is your polling completion percentage?  If it drops below 99%, then that helps point to the problem.

          Also, are you running your SQL db on a separate server with fast disks (raid 10)?  What is your average disk queue length?

            • Re: Very slow polling and/or alerting

              Polling completion is currently 99.12%.  I don't have direct access to the db server but the dba said that with a couple of exceptions during maintenance periods, the performance is very good.  He did some database tracing that showed that transactions were being entered into the database in microseconds, so very quickly.

              Doing more testing, I shut down an interface that I have monitored for alerting.  The polling interval for this interface is currently set to 90 seconds.  It took 5 minutes and 48 seconds to indicate that the interface was down, and I'm not talking about the alerting function but within System Manager, I hit 'Refresh' every few seconds on the interface.  Once it reflects that the interface is down, my alert fires very quickly.

              I then brought the interface back up and it took 6 minutes and 52 seconds to show the interface was back up.

              I then shut the interface back down, waited 1 minute and 45 seconds (just to give it time for its 90 second poll plus whatever additional fast polls it does) and of course it doesn't change.  However, if I hit the 'Poll' button in System Manager, it changes state right away and my alert fires.
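
              To put numbers on the gap, here is a small sketch comparing the measured times with the delay I would naively expect.  It assumes a status change should show up within one polling interval, and it ignores any confirmation re-polls the engine may do.

```python
# Compare measured status-change delays with the naive expectation.
# Assumption: a change should be visible within one status polling interval.
def worst_case_detection_s(status_poll_s, alert_check_s=0, trigger_delay_s=0):
    """Upper bound on detection delay if every poll runs on schedule."""
    return status_poll_s + alert_check_s + trigger_delay_s


expected_s = worst_case_detection_s(90)   # 90 second interface status poll
observed_down_s = 5 * 60 + 48             # 5 min 48 s to show "down"
observed_up_s = 6 * 60 + 52               # 6 min 52 s to show "up"

for label, observed in (("down", observed_down_s), ("up", observed_up_s)):
    print(f"{label}: {observed} s observed, {observed / expected_s:.1f}x expected")
```

              Both transitions take roughly four times the configured interval, which points at polls being delayed rather than at the alert engine.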

              Think I should open a case on it?

                • Re: Very slow polling and/or alerting
                  warbird

                  Sure would be nice if you could see the avg. disk queue length for yourself.  There is good info about how to get that data in this thread:  Monitoring Orion for Performance?

                  I am uncertain if a very high average disk queue length would directly result in a lower poll completion or not.  Your poll completion number looks good.

                  I would try to monitor the avg disk queue length, just to see what it looks like, and yes, open a case.

                    • Re: Very slow polling and/or alerting

                      I was able to get the data from the dba.  He gave me the following info for the disk queue length. 

                      Average is 1.94 for the D:\ (datafiles) and 0.03657 for the L:\ (transaction logs) drives

                      This data also includes a spike he sees around 2 am (the Orion db archive time) so he said that if we eliminated that spike, these numbers would be even lower.

                        • Re: Very slow polling and/or alerting

                          Just to update this issue, I worked with support and the solution to speeding up my alerting was to increase the 'Maximum Node and Interface Status Polls' per second as well as the 'Maximum Statistics Collection' per second via the 'Polls Per Second Tuning' utility.

                          The utility will give you the recommended settings to use, but I had to play with this a lot because even though my CPU wasn't being impacted (as the tool warns about), I found that my polling completion percentage dropped way too low.  What I learned is that after you change the settings, you have to restart the NetPerfMon service (usually by killing the service off and restarting it) and then let it run for a while.  If at first the polling completion percentage wasn't high enough, I'd give it a little more time to settle down.  After a while longer, if it still wasn't high enough, I'd adjust the number down even further, wash, rinse, repeat.
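
                          The trial-and-error loop above can be sketched roughly as follows.  Everything here is hypothetical: `apply_and_measure` stands in for the manual steps (change the setting, restart NetPerfMon, wait, read the polling completion percentage), and `simulated_poller` is a made-up stand-in for a real poller, not an Orion API.

```python
# Sketch of the step-down tuning procedure; all names are hypothetical.
def tune_down(start, floor, step, target_pct, apply_and_measure):
    """Lower the polls-per-second setting until completion meets the target.

    apply_and_measure(rate) represents: apply the setting, restart the
    polling service, let it settle, and return the completion percentage.
    Returns (rate, completion), or None if no rate down to `floor` works.
    """
    rate = start
    while rate >= floor:
        completion = apply_and_measure(rate)
        if completion >= target_pct:
            return rate, completion
        rate -= step
    return None


def simulated_poller(rate):
    """Made-up behaviour: completion collapses above 85 polls/sec."""
    return 99.1 if rate <= 85 else 40.0


result = tune_down(start=104, floor=50, step=1, target_pct=99.0,
                   apply_and_measure=simulated_poller)
print(result)  # prints (85, 99.1)
```

                          In practice each iteration costs a service restart plus settling time, so a larger step is kinder than stepping one poll per second at a time.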

                          I'm not polling as often as I'd like to and therefore am not alerting as fast as I'd like to.  I don't have an explanation for why I can't shorten the polling interval on the nodes/interfaces or raise the polls per second to the settings the utility recommends without killing the polling engine.  I am well below the poller threshold (right around 4k elements) and CPU and memory are less than 50% utilized.

                          However, it's better than it was before.

                            • Re: Very slow polling and/or alerting
                              warbird

                              Interesting.  Thanks for the update.  I would have thought that increasing the number would have solved your issue??  On my secondary polling engine, the Polls Per Second wizard currently recommends 82/65.  I have it set to 100/100.  That polling engine has just over 7,000 elements and seems to be doing okay.

                                • Re: Very slow polling and/or alerting

                                  Yes, I did increase the polls per second with the utility.  I wanted to decrease my polling interval as well (poll faster) but I end up killing the polling engine.

                                  The utility recommends 104/59 for me but I've only been able to get to 85/69.  Even going to 90/69 kills the poller.

                                  Is your poller running on a VM?  Mine is and I half wonder if that's the issue.

                                    • Re: Very slow polling and/or alerting
                                      warbird

                                      Nope.  I avoid VMs like the plague when it comes to something as system intensive as Orion.  How many other VMs are running on the physical server?

                                      • Re: Very slow polling and/or alerting

                                        Hi guys...

                                        I have a 9.5 SP4 installation with all the SPs and HotFixes. The 2nd Poller is running 2114 Nodes, 3113 Interfaces, 282 Volumes, 5509 Total Elements.

                                        I had used the Polling Tuner and set it to the "Recommended" settings, 100/36. I just finished setting up over a dozen Alerts for network nodes based on location, importance (severity of failure impact), equipment type, Server OS type and role, and so on. All tested fine and I fine tuned the timing and suppression on several. So, I take off Wednesday for Thanksgiving, confident Orion is rocking.

                                        This morning, I had Alert e-mails and checked them against the Event Log and everything looked good. Then, I saw a router go down on our "big map" and waited for the Alert to fire at 5 minutes down. 24 minutes go by and NADA. I then notice I've got 64,000 outstanding SNMP Polls on the 2nd Poller. Did a Stop/Start on NetPerfMon after tuning down to 90/32. Still have a bunch outstanding, so I'm not happy yet.
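
                                        For a feel of how long a backlog like that might take to clear, here is a back-of-the-envelope sketch.  It leans on several assumptions that are not confirmed for this poller: the default 120 second node and 90 second interface status intervals, one status poll per element per interval, the full 90 status polls/sec actually being achieved, and the whole backlog being status polls.

```python
# Rough backlog drain estimate; every input below is an assumption.
def drain_time_s(backlog, capacity_per_s, demand_per_s):
    """Seconds to clear `backlog` polls, or None if it can never drain."""
    surplus = capacity_per_s - demand_per_s
    if surplus <= 0:
        return None
    return backlog / surplus


# Assumed demand: default intervals applied to this poller's element counts.
demand_per_s = 2114 / 120 + 3113 / 90
seconds = drain_time_s(backlog=64_000, capacity_per_s=90,
                       demand_per_s=demand_per_s)
print(f"demand {demand_per_s:.1f}/s, drain time about {seconds / 60:.0f} min")
```

                                        If the real demand is at or above the configured rate, the function returns None, meaning the queue would keep growing instead of draining.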

                                        The Main NPM server, the 2nd Poller, and the SQL 2005 Server are running on VMware ESX partitions. Everybody is on Server 2008, x64 with 8GB of RAM. The SQL DB is the ONLY instance on its VM. Overall performance has been very, very good...until this weirdness.

                                        Anything new in your world?

                                        Thanks!