76 Replies Latest reply on Apr 22, 2010 9:31 AM by andressk

    NPM Polling stops.  No Rhyme or reason.

      We were running NPM 9.0.  Polling would stop at random times.  The service itself was still running, but no information was being updated on the web pages.

      Stopping and starting the service did not resolve the issue.  Rebooting the server would bring everything back to normal, apart from the missing data from the period when polling had stopped.

      Seeing that 9.5 was out, I went ahead and upgraded, with a few hitches that SolarWinds support helped out with.  We are still having the polling stop at random times.  It could be the middle of the weekend, the middle of the day, or late at night.  The service is still running.  Stopping and starting the service does not clear up the problem.  Rebooting the server does.

      I have made a ticket (Case #99603) but wanted to post to see if others have seen similar issues.  I did read other posts but they seemed to be some time in the past, and they had reported stopping and starting the service resolved the issue.

      Thanks for your time and assistance.

        • Re: NPM Polling stops.  No Rhyme or reason.

          Our server had this problem some time back, on v9.  NetPerfMon would stop polling.  A stop / start on NetPerfMon did the trick.  We haven't had this problem for the past several weeks.  One thing you can do is set up a tcpdump on a separate server, filtering for ICMP, and then send an email if it stops seeing pings.
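The tcpdump idea above boils down to noticing a gap in ICMP traffic from the poller. A minimal Python sketch of just the gap check, assuming some capture process supplies one timestamp (in seconds) per ping it sees. The function name `find_silences` and the 120-second threshold are hypothetical, and the capture and email steps are left out.

```python
from typing import Iterable, List, Tuple


def find_silences(ping_times: Iterable[float], now: float,
                  max_gap: float = 120.0) -> List[Tuple[float, float]]:
    """Return (start, end) pairs during which no ping was seen for more than max_gap seconds."""
    silences = []
    previous = None
    for t in sorted(ping_times):
        # A gap between two consecutive pings that exceeds the threshold
        if previous is not None and t - previous > max_gap:
            silences.append((previous, t))
        previous = t
    # A trailing gap: the most recent ping is already too old
    if previous is not None and now - previous > max_gap:
        silences.append((previous, now))
    return silences
```

If the list comes back non-empty, that is the point at which the email would go out. With no pings recorded at all the function returns an empty list, so a real watchdog would want to treat that case separately.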

            • Re: NPM Polling stops.  No Rhyme or reason.
              bshopp

              Sounds like you already have a support ticket.  You might also want to go to the latest release and see if anything changes; we released 9.5 SP1.

                • Re: NPM Polling stops.  No Rhyme or reason.

                  I was having the same problem with 9.1 SP2.  The setup was a 4 GB / quad-core Xeon SLX polling engine, an 8 GB / dual quad-core SLX NPM server, and a 16 GB / dual quad-core Xeon SQL cluster.  I had a case open for a few weeks; after trying multiple options suggested by SW support, nothing changed and there was still no idea why it was stopping.  When 9.5 came out, the server was not in production at the time, so I upgraded it to 9.5, and so far the problem has not recurred.  Having been a network engineer for 10 years, and having resisted Cisco's first response of "upgrade the code" as soon as I open a ticket, I would not suggest this for production environments, unless you can justify it, without first testing it in non-production environments.

              • Re: NPM Polling stops.  No Rhyme or reason.

                Much appreciate all the responses.

                I applied the SP 1 for 9.5 on the day I posted the above message (after I posted that is).  That evening the polling stopped.

                Since the last halt, there has not been another one.  The web interface is moving at a better pace, but nowhere near the speeds it was running at when we were running 9.0.

                We will sit and wait to see if the problem has actually gone away.  But I am eager to see some more improvements to the web interface's speed.

                 

                Thanks.

                  • Re: NPM Polling stops.  No Rhyme or reason.
                    bshopp

                    If it keeps happening, be sure to open a support ticket.  We have made some more improvements in 9.5 and continue to do so with each and every release

                      • Re: NPM Polling stops.  No Rhyme or reason.

                        We have a similar problem: polling has stopped every night since we upgraded to 9.5 SP1.  It looks to happen around the same time the DB maintenance runs.  We are currently running the maintenance during the day, and it looks like polling has come to a halt and the DB server is maxed out.

                          • Re: NPM Polling stops.  No Rhyme or reason.

                            We have the same issue...  Been working on it for two weeks...  It just randomly stops and there are never any logs or reasons for the support group.  Today I found some patches for Server 2008, including a few SNMP fixes for mass SNMP polls from the OS that would cause the SNMP engine to fail.  That didn't work; then I found there were more SNMP fixes in SP2 for 2008.  I've gone to that and have been stable for a few hours now...  I'll give an update tomorrow.

                              • Re: NPM Polling stops.  No Rhyme or reason.
                                rmmagow

                                I too upgraded Orion to 9.5 SP1. Two pollers, both are now stopped and will not restart. Ticket opened w/support and diags uploaded. My 9.5 upgrade is not going smoothly. Web is very slow and various other functions not working properly.

                                • Re: NPM Polling stops.  No Rhyme or reason.
                                  byrona

                                  We seem to be having a similar issue.  We are running 9.5 SP4 on Windows 2008 Server.  At seemingly random times our statistics collections stop with no warning.  The only way I know is when I go to look at historical data such as CPU or memory utilization on a node and the data is missing.  Restarting the NetPerfMon service resolves the problem for the moment but it always seems to stop again.

                                  I can't find any logs to suggest what is going on.  I have a ticket opened with SW but they also can't seem to figure out what the problem is.

                          • Re: NPM Polling stops.  No Rhyme or reason.
                            mattinglyd

                            I was using SNMPv3 and this seemed to be the cause.  This was occurring on random devices - if I moved them to SNMPv2 the problem would stop.

                              • Re: NPM Polling stops.  No Rhyme or reason.
                                DGoodale

                                I had nothing but problems with SNMPv3. After trying to struggle through it with my Orion and CiscoWorks servers, I finally gave up after 3 months of grief. SNMPv3 is much more secure than SNMPv2 but requires extensive configuration changes on the devices themselves to allow access to the data in the MIB tables. I had a few devices that just would NOT talk SNMPv3 no matter what I did. What a relief to roll back to v2.

                                 

                                            Dan Goodale - Network Engineer, Triwest Healthcare 

                              • Re: NPM Polling stops.  No Rhyme or reason.

                                I am the OP.  This problem is still occurring.  At present, we have worked with various people at SolarWinds. The original ticket got closed for resolution on some corrupt database issues.

                                They started a new ticket to track further work on this issue: Case #115898.

                                I have been in contact with Sr Tech management, as well as a developer.  They have me running a debug DLL for the issue and uploading diagnostics after each failure.

                                They are hinting at it possibly being caused by a specific device I am polling, but I personally am not convinced it lies in that direction.  I am deferring to their knowledge, of course, as it is their product.

                                Presently I have Senior Management here at my company very frustrated at the cash we spend on this product, with how long this issue has been going on.

                                  • Re: NPM Polling stops.  No Rhyme or reason.
                                    byrona

                                    Thanks for your reply though it makes me feel even worse about the issue than I already was as it sounds like a much bigger issue than I had originally thought.

                                    We also just sunk a bunch of money into this product for a very large project that we are working on.  We are pretty much supposed to be going live with this project this week for the customer and if this problem doesn't get resolved Sr. Management is going to be very upset at the money they just spent, especially considering the product we were using before was Open Source and pretty much not costing us anything.

                                      • Re: NPM Polling stops.  No Rhyme or reason.
                                        shorn

                                        I work with byrona (above) and am also trying to get this resolved. Since it seems like SolarWinds is aware of the issue I'm going to focus on detecting the problem. Since the engine doesn't actually stop, detecting the problem is a bit tricky. Does anyone know of a way to access the poller statistics from a script? We have noticed that the SNMP Polls Outstanding counter on the Engine Status web page starts climbing when this occurs. My thought is to create a script that checks this value periodically and then sends an email when it crosses a threshold; I'm just not sure where I can grab the data I need. Does anyone have any suggestions? My preference would be to access some of their application components directly instead of parsing a web page.
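For what it's worth, the threshold half of that script is simple enough to sketch independently of where the number comes from. A minimal Python sketch, assuming something else supplies the SNMP Polls Outstanding value on each check. The class name, the threshold of 1000, and the three-consecutive-samples rule are all hypothetical choices, and how to read the counter out of Orion is left open, as it is in the post.

```python
from dataclasses import dataclass


@dataclass
class OutstandingPollMonitor:
    threshold: int = 1000       # assumed alert level, not a SolarWinds value
    required_breaches: int = 3  # consecutive samples over threshold before alerting
    _streak: int = 0
    _alerted: bool = False

    def observe(self, outstanding: int) -> bool:
        """Feed one sample; return True once per incident when an alert should go out."""
        if outstanding > self.threshold:
            self._streak += 1
        else:
            # Back under the threshold: reset so a later incident alerts again
            self._streak = 0
            self._alerted = False
        if self._streak >= self.required_breaches and not self._alerted:
            self._alerted = True
            return True
        return False
```

Requiring several consecutive high samples avoids an email for every brief spike, and the `_alerted` flag keeps a single long incident from sending mail on every check.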

                                        • Re: NPM Polling stops.  No Rhyme or reason.

                                          Hello Everyone--

                                          EEK! Lots of posts on this. Just want to let you all know that I've contacted the Product Managers about your concerns and I know they'll get back to you ASAP to get this addressed.

                                          Thanks for your patience.

                                          Marie

                                            • Re: NPM Polling stops.  No Rhyme or reason.
                                              bshopp

                                              Hey all, we are aware of this issue and are actively working it to further diagnose and see what is going on.  The Dev team has been working directly with some experiencing this issue to further investigate.  Do each of you have support cases open?  If so, could you PM me via thwack the case numbers please?

                                                • Re: NPM Polling stops.  No Rhyme or reason.
                                                  swack

                                                  Was there ever resolution for this issue?  I'm running NPM SP4 and NTA SP2 and even with Netflow service disabled I see the poller stops showing interface data after running for a while (though strangely it still shows availability).  Restarting the NPM services on the poller seems to fix the issue.

                                                    • Re: NPM Polling stops.  No Rhyme or reason.
                                                      byrona

                                                      We have not yet seen any resolution.  We are working directly with the SW team to help get this resolved.  Open a ticket directly with them and reference this issue so they know you are one of the people experiencing this problem.

                                                      • Re: NPM Polling stops.  No Rhyme or reason.

                                                        My issue is still being worked on.

                                                        At present I have been told there are 3 developers working on my issue.

                                                          • Re: NPM Polling stops.  No Rhyme or reason.
                                                            byrona

                                                            I have a theory on this problem.

                                                            I have noticed that whenever I find the system in this failed state, the "SNMP Outstanding" is in the 2k range, which is well above its normal 100ish.

                                                            I get a funny feeling that something is causing the SNMP polls to get behind or wedged so that the system works itself into a failure state over a period of time.

                                                            Yesterday we had a firewall directly upstream of our Orion system get replaced, which caused Orion to lose connection to the world, generating a lot of alerts.  The firewall issue was resolved shortly after it happened.  However, I noticed that after this happened "SNMP Outstanding" ran above 300 (slightly higher than normal) pretty much until I went home for the day.

                                                            This morning around 10:30AM this failure occurred.  I have my system set to do collections every 10 minutes, and looking at the data on several systems, about 1 in every 3 data points is missing, as though the system was starting to drag on its collections.

                                                            It seems that losing connection to the world due to the firewall issue yesterday may have been what started a slow failure of the SNMP collection system.  This is just a theory with some supporting data but I hope it helps to resolve the issue.
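The "1 in 3 data points missing" observation can be checked mechanically. A minimal Python sketch, assuming the collection timestamps for a node have already been exported as seconds. The function name `missing_slots` is hypothetical, and the 600-second default matches the 10-minute collection interval mentioned above.

```python
from typing import List, Sequence


def missing_slots(sample_times: Sequence[float], start: float, end: float,
                  interval: float = 600.0) -> List[float]:
    """Return the expected collection times in [start, end) that have no sample nearby."""
    tolerance = interval / 2
    missing = []
    slot = start
    while slot < end:
        # A slot counts as collected if any sample landed within half an interval of it
        if not any(abs(t - slot) < tolerance for t in sample_times):
            missing.append(slot)
        slot += interval
    return missing
```

Dividing the number of missing slots by the number of expected slots gives a drop rate, which could be tracked over time to see whether collection is degrading before a full stop.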

                                                            We have compiled diag data from our system and sent it to SW via our open ticket on this issue.

                                                          • Re: NPM Polling stops.  No Rhyme or reason.
                                                            swack

                                                            I opened support case 120288 "Node polling okay, but interface polling not working" and was told there is a MEMORY LEAK in the business layer of the Orion systems.  I was given 3 DLL files to replace existing ones and a procedure to replace the overgrown LDF database log file on the poller.  I will continue to monitor, but if nothing breaks in the next 24 hours, I'll consider it fixed.

                                                              • Re: NPM Polling stops.  No Rhyme or reason.

                                                                Great Swack--

                                                                Keep us all posted.

                                                                M

                                                                  • Re: NPM Polling stops.  No Rhyme or reason.
                                                                    byrona

                                                                    I am curious how many of the people experiencing this problem are running on a 64bit OS?

                                                                    We are running Windows 2008 64bit and wondering if Orion has problems under these conditions.

                                                                      • Re: NPM Polling stops.  No Rhyme or reason.
                                                                        swack

                                                                        Well, I am still having problems with this issue even after running 4 SQL scripts to clean up part of the database. I'm pretty disappointed with how unresponsive the support team has been to my problem.  I shouldn't have to call and wait on hold forever just to be told "we'll have your engineer call you back".  I haven't bothered calling, just keep e-mailing back to the support person. What's the STRATEGY? If this doesn't work, what's next? I think it's time to get it escalated to developers! (case 120288)

                                                                          • Re: NPM Polling stops.  No Rhyme or reason.

                                                                            Swack,

                                                                            I understand your frustration.  I am in the same boat.  The only thing I can suggest is to make sure you CC your Sales Rep on your e-mails.  That way they know it is affecting their bottom line.

                                                                            As well, I suggest that the next time you call in, you request that the ticket be escalated to a Senior Tech, and get their direct number.  I agree, with something of this magnitude there should be a much better response.

                                                                            • Re: NPM Polling stops.  No Rhyme or reason.
                                                                              byrona

                                                                              Swack

                                                                              Our company, just like you and Mcampbell, is in the same boat.  We spent a lot of money on NPM, additional pollers, APM, additional web servers; basically a whole suite of the SW products, not to mention the hardware and Microsoft licenses, and we are now unable to go into a production environment because of this problem and are failing to meet our obligations to our customers.

                                                                              We have escalated this issue to the US Technical Support Manager.  Hopefully by all of us banding together and escalating this issue it will get more resources allocated to getting it resolved.

                                                                            • Re: NPM Polling stops.  No Rhyme or reason.
                                                                              ecornwell

                                                                              I am curious how many of the people experiencing this problem are running on a 64bit OS?

                                                                              We are running Windows 2008 64bit and wondering if Orion has problems under these conditions.

                                                                               

                                                                              We're running Windows 2003 64bit and have noticed problems with the services stopping and the poller running but not putting data into the database.

                                                                  • Re: NPM Polling stops.  No Rhyme or reason.
                                                                    BryanBecker

                                                                    I'm in the same boat....services up but no data is collected.  I'm even seeing that launching System Manager shows no devices.  I have to stop all the services and restart them and it works again.

                                                                    If there is a beta dll or a fix I hope they get this out ASAP!  I have seen 2 scenarios where this needs to be fixed.  1.  The services are running but no data and 2. The services just stop for no reason. 

                                                                    Many people, including me, have asked for a way to trigger an alert on these issues and not find out days later of the issue.

                                                                • Re: NPM Polling stops.  No Rhyme or reason.
                                                                  swack

                                                                  I've e-mailed a request to the engineer working on my case, asking that it be escalated up the chain.  I also included a link to this thread so they can maybe start seeing the pattern of similar issues from multiple customers.

                                                                  • Re: NPM Polling stops.  No Rhyme or reason.
                                                                    byrona

                                                                    A few more items I thought I would bring up as we go through this problem...

                                                                    1)  Would AntiVirus software on the polling engine and/or the database contribute to or cause this problem?

                                                                    2)  Which method of SQL DB redundancy would be best for Orion DB?  We are currently using log shipping but are not sure if that is the best in this situation.

                                                                    • Re: NPM Polling stops.  No Rhyme or reason.
                                                                      rdollins

                                                                      Hi

                                                                      I am running SNMPv3. At any time, Orion may stop being able to get stats on a node's interfaces and volumes. Stats from Universal Device Poller on the node keep working with absolutely no problem. If I have the AIX crew restart the SNMP daemon, everything returns to normal at once for the interfaces and volumes. But this issue occurs on several different types of devices, so it is not limited to AIX devices.

                                                                      I started running a packet capture on one of the devices for which interface stats could not be retrieved. What I see is that the poller sends the correct engine boots and correct engine time for everything except the interface stats. It sends an engine ID of 0 and engine time of 0 when it asks for interface stats, and it gets a reply back that says "NotInTimeWindows" or "unknownEngineID". Then a few seconds later, when it polls for a Universal Device Poller stat, it goes back to the correct engine boots and correct engine time, and of course gets back a good reply.
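The rejection seen in that capture is consistent with the timeliness check that the SNMPv3 User-based Security Model (RFC 3414) applies to authenticated requests on the device side. A minimal Python sketch of that check as I understand it from the RFC; the function name is hypothetical, and this only illustrates why zeroed values are refused, not why the poller sends them.

```python
MAX_ENGINE_BOOTS = 2147483647  # latched maximum for snmpEngineBoots in RFC 3414
TIME_WINDOW_SECONDS = 150      # allowed clock difference in RFC 3414


def outside_time_window(msg_boots: int, msg_time: int,
                        local_boots: int, local_time: int) -> bool:
    """Timeliness test an authoritative engine applies to an authenticated message."""
    return (
        local_boots == MAX_ENGINE_BOOTS          # engine boots counter has latched
        or msg_boots != local_boots              # sender has a stale or zero boots value
        or abs(msg_time - local_time) > TIME_WINDOW_SECONDS
    )
```

A request carrying boots 0 and time 0 against a device that has, say, rebooted 12 times and been up for a day fails the second condition immediately, which matches a not-in-time-window report coming back from the device.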

                                                                      This behaviour can start on any node at any time.. The ugly fix is to restart snmp on the device or delete the node from Orion and recreate, and you are good.. Any idea on what is causing this to occur?

                                                                       

                                                                      Rick 

                                                                      • Re: NPM Polling stops.  No Rhyme or reason.
                                                                        rmmagow

                                                                        I am also experiencing random polling stops. I notice it is always my primary poller and it happens most often over the weekend. I usually kill the still-running netperfmon.exe task, restart it, start System Manager and connect to the polling engine. It takes a while but these steps generally bring back the poller. The poller only stops in the early morning (3~5AM) hours. Every time I reboot the server (2003) running the poller I get messages about sending dumps to Microsoft. These dumps are produced by the business layer process. I'm running 9.5 w/current patches. We have decided to NOT buy APM and NPM because of this instability. Since I run an unlimited license, this is costing SW a bit of revenue. Support's been good but the problem is not fixed.

                                                                          • Re: NPM Polling stops.  No Rhyme or reason.
                                                                            BryanBecker

                                                                            Add me to the list...I had my 1st occurrence last week on my new 9.5 SP4 server.  Since I was out sick most of the week no one noticed until like 4 days later.  I have 2 new 9.5 SP4 servers connecting to the same SQL server.  All the servers are on the same LAN basically yet only 1 of the 2 had an issue.  There has to be a way to add a service checker that if the netperfmon service is stopped to try and restart it....if it continues to fail send an email out or something.  I've seen this many times in various Orion versions that we could lose days of data if someone isn't constantly checking.

                                                                            Now, in Windows Services all of these are set to "Restart the Service" on the Recovery tab, but the restart doesn't seem to happen.
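The service checker described above (try a restart, and email if it keeps failing) can be sketched with the service-specific parts passed in as functions. A minimal Python sketch; `watchdog_check` and its three callables are hypothetical, and wiring them to the actual Windows service and a mail sender is left out.

```python
from typing import Callable


def watchdog_check(is_running: Callable[[], bool],
                   restart: Callable[[], None],
                   notify: Callable[[str], None],
                   max_attempts: int = 3) -> bool:
    """Restart a stopped service and notify either way; return the final running state."""
    if is_running():
        return True
    for attempt in range(1, max_attempts + 1):
        restart()
        if is_running():
            notify(f"service was stopped; restarted on attempt {attempt}")
            return True
    notify(f"service still stopped after {max_attempts} restart attempts")
    return False
```

Run from a scheduled task every few minutes, this covers the case where the service has stopped outright. It does not cover the other case in this thread, where the service is running but no data is being collected; that needs a check on the data itself.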

                                                                            BB

                                                                          • Re: NPM Polling stops.  No Rhyme or reason.
                                                                            byrona

                                                                            Well, there is a bit of hope.  SolarWinds got us a development patch yesterday and the very early preliminary results look good.  According to our support engineer our diagnostic files indicated an issue related to wrong handling of some different behaviors of the WinSock API under Windows 2008.

                                                                            I will keep you all posted!

                                                                              • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                warbird

                                                                                I have been experiencing a similar issue but not positive it is exactly the same.  It seems that when we add new nodes or interfaces to our secondary poller, it will crash usually within 24 hours.  9.5 SP4, 32 bit Windows, 2 pollers, standalone SQL server, etc.

                                                                                The netperfmon service will eventually crash, after having gone sideways for many hours, which is noted by the fact it would start doing db syncs at larger intervals until finally crashing altogether.

                                                                                Also, the netperfmon service would not auto recover, even though we have it set to in Windows Services.  It has to be manually restarted.

                                                                                I have not opened my own ticket on this yet, as I have been trying to evaluate the cause.

                                                                                • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                  firehawk_350

                                                                                  byrona, how is the dev patch working?  We currently have this problem with 5 NPM pollers, 2 APM pollers, and 2 Netflow collectors.  I see a ton of "time out for buffer latch wait" messages; in addition, my DB will start to freeze, and SQL is only using 8 GB of the 24 GB on the dedicated SQL server.  When I look in the DB there is a ton of blocking going on inside the netperfmon db.  We are running on Windows 2008 64-bit.

                                                                                    • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                      byrona

                                                                                      The patch seemed to pretty much solve the initial problem and I think (though I am not certain) that the fix was rolled out in the most recent set of patches for everybody.

                                                                                      Now we think we may have a different polling problem where some node level collections will stop on groups of systems at a time.  This problem is a bit more elusive as it doesn't seem to happen frequently and only on some things when it does.  We are working to better understand this problem.

                                                                                      If you haven't already you should try applying the most recent SW patches and see if that helps.

                                                                                        • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                          warbird

                                                                                          byrona,

                                                                                          Do you notice this happening after you have added new nodes or made a bunch of changes (edited node names, added interfaces on already monitored nodes, etc)?  I have been noticing that my secondary polling engine will intermittently go sideways after many new nodes are added or many existing nodes have been edited.  "Many" is a loose term here, as it does not happen if only 2 or 3 nodes are added/edited.  The intermittent nature is making it difficult to track down or collect data.

                                                                                            • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                              byrona

                                                                                              As far as this new issue is concerned, I have not been able to make any specific correlation as it hasn't happened enough times.  I will be sure to post more information about it as it becomes available.  Thanks for asking, though, and I will be sure to watch for this.

                                                                                                • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                  JesperVestergaard

                                                                                                  All

                                                                                                  We now also have this problem, after re-installing our secondary poller on a 2008 box.

                                                                                                  The SNMP polling simply stops. - Only fix is to restart NPM service.

                                                                                                  Any fix yet? (besides applying SPs and hot-fixes to bring it up to the current level)

                                                                                                  (We are on SP5, no hot-fixes)

                                                                                                    • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                      byrona

                                                                                                      Jesper

                                                                                                      After getting up to current SP5 with all hot-fixes we don't seem to be experiencing this problem.  If you are still having this problem after updating your system you should definitely open a support ticket and maybe even reference this thread.

                                                                                                      After talking with the development team that is assigned to this problem, they seemed to think that they had a handle on it and understood what was causing it.

                                                                                                        • Re: NPM Polling stops.  No Rhyme or reason.

                                                                                                          Hi All,

                                                                                                          We are seeing this problem at the moment on our Windows 2008 server running NPM 9.5 SP5 with APM 3.1. SNMP polling stops usually after a few minutes sometimes as long as 45 minutes. The SNMP outstanding packet queue count grows and no new polls are completed. ICMP carries on just fine.

                                                                                                          I have logged  a support case (# 137809) but it seems to be going nowhere at the moment.

                                                                                                          This is really having a huge impact on us, so would appreciate any feedback or assistance in solving this problem.

                                                                                                          Best Regards,

                                                                                                          Stephen Parker

                                                                                                            • Re: NPM Polling stops.  No Rhyme or reason.

                                                                                                              Stephen,

                                                                                                              I would call back in and press the issue, and reference this post.  My issues were resolved by the tech team supplying me with a beta DLL file, which overall corrected the issue.  This did take some time to work through with them, as they supplied me with numerous different DLLs while working on the issue.

                                                                                                               

                                                                                                              Good luck to you.

                                                                                                              • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                bshopp

                                                                                                                I have talked to the support team, you should be getting contacted soon

                                                                                                                • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                  ecklerwr1

                                                                                                                  Not that it helps but these threads are what have kept me from upgrading my NPM 9.5 SP2 to any later Service Packs for fear that this polling engine problem might start up.  Luckily I have had no issues running NTA 3.6RC3 with my NPM 9.5 on SP2.

                                                                                                                    • Re: NPM Polling stops.  No Rhyme or reason.

                                                                                                                      From the posts on this thread, it seems that this problem is with pollers running on Windows 2008. We have a couple of additional NPM - SLX 9.5 - SP5 pollers running on Windows 2003 and these have been fine.

                                                                                                                      Support did contact me yesterday evening and gave me early access to a new version of NPM 9.5.1, which is due to be released imminently.  So far, (12 hours in), it looks promising as SNMP collections are still working. I'll give this a few days before being 100% convinced, but so far so good!

                                                                                                                      Best Regards

                                                                                                                      Stephen Parker

                                                                                                                        • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                          ecklerwr1

                                                                                                                          @Stephen-

                                                                                                                          I'm running on 2k3 but it's good to hear things look better so far with NPM 9.5.1.  Keep us posted on this as I would really like to get my NPM up to 9.5.1 also.  I've been having great experience with the NTA 3.6RC3.  Finding out that the issues that arose during some of the post NPM 9.5SP2 Service Packs with polling engine stopping are being resolved is encouraging.  I give a lot of credit to SW's crew for using the collaboration in Thwack as a tool to add new features we request and to right the ship on bugs that crop up.

                                                                                                                            • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                              chris.lapoint

                                                                                                                              I've been having great experience with the NTA 3.6RC3.  Finding out that the issues that arose during some of the post NPM 9.5SP2 Service Packs with polling engine stopping are being resolved is encouraging.  I give a lot of credit to SW's crew for using the collaboration in Thwack as a tool to add new features we request and to right the ship on bugs that crop up.

                                                                                                                              Thanks Stephen for the positive feedback.  Continuous process improvement is a big focus area for us so it's good to know that the changes we made are working.   We're also looking at different ways to make community engagement and collaboration even better, especially around feature planning, so stay tuned!

                                                                                                                              • Re: NPM Polling stops.  No Rhyme or reason.

                                                                                                                                Hi, just to let everyone know that we have still had problems with SNMP data collection stopping, this time after a couple of days. I've had to reboot my poller (or stop and start services) twice since moving to 9.5.1, so whilst it's better than it was, it doesn't look completely fixed yet.

                                                                                                                                One thing I did notice since the update to 9.5.1 is that the SNMP polling stats and SNMP packet queues do continue to change, whereas before they would freeze when the SNMP engine gave up. I only noticed that we weren't collecting data, or that it wasn't being written to the DB, when I saw big gaps in my graphs.

                                                                                                                                I've passed my findings back to support as well. I'll keep this thread updated with progress.

                                                                                                                                Best Regards

                                                                                                                                SP

                                                                                                                                  • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                    chris.lapoint

                                                                                                                                    Stephen, thanks for the heads-up.  Please continue to work this issue through support.   I'll make sure the dev team is aware if/when this gets escalated.

                                                                                                                                      • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                        chris.lapoint

                                                                                                                                        BTW, for folks following this thread, we've done extensive beta and RC testing of 9.5.1 specifically around the "stopped polling" issue and have validated this service release resolves issues in 99% of the cases.  There were a couple of issues that we're still working through that were specific to a few customers.  

                                                                                                                                        So, if you were experiencing this issue with 9.5, please do upgrade to 9.5.1.   We're are absolutely committed to resolving all issues so please open a support ticket if you continue to experience problems.

                                                                                                                                        Thanks, 

                                                                                                                                          • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                            oiram

                                                                                                                                            The current situation is that APM pollers on 2008 (not R2!) may in turn cause the LSASS service to overutilize the CPU. This is actually related to an internal platform problem, as Microsoft admits that they have performance issues on that platform and that these are fixed in Windows 2008 R2. Unfortunately, Orion isn’t supported on that platform yet, which is why in these rare occasions (especially when LSASS goes through the roof) you may consider temporarily downgrading to 2003, which does not expose these flaws.

                                                                                                                                            Also, we are working to convert as many WMI monitors as possible to Performance Counters, which will substantially improve monitoring and reduce issues related to it (like the one with LSASS).

                                                                                                                                            • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                              BryanBecker

                                                                                                                                              Chris/SW....not to complicate this issue further, but we have instances where, after any network event that interrupts communication between the poller and the DB, the poller just doesn't recover automatically.  We are running 9.5.1 and last night we had a network issue where communication between the poller and DB server was down but came back up.  The services on the poller were "running", but looking at the polling engine page the last db sync/update was down for a long time and the poller was also down.  The network event was fixed but I had to restart all the services this morning to get it working again.

                                                                                                                                              So 2 things here.  Orion needs a better way to self heal in cases like this.  Maybe a heartbeat to the DB server.  If it's lost, it keeps trying, and once it's reestablished it automatically stops and restarts the services.  Second, we need some alert mechanism that the poller services are down and/or the last db sync/update has been over X amount of time.  Both of these are critical and hopefully something you guys can get to us ASAP.
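
                                                                                                                                              A minimal sketch of the heartbeat idea above, in Python, purely to illustrate the logic and not to describe how Orion works internally. The names `watch_database`, `check_db` and `restart_services` are hypothetical: `check_db` stands in for whatever connectivity test you can run against the DB server, and `restart_services` for whatever stops and restarts the poller services.

```python
import time
from typing import Callable, Optional


def watch_database(
    check_db: Callable[[], bool],
    restart_services: Callable[[], None],
    interval_seconds: float = 30.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Heartbeat loop: keep checking the DB and restart the services once
    each time connectivity comes back after an outage.

    Runs forever when max_checks is None. Returns the number of restarts
    triggered, which is mainly useful for testing.
    """
    restarts = 0
    was_down = False
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            is_up = check_db()
        except Exception:
            # Any error while checking is treated as "DB unreachable".
            is_up = False
        if is_up and was_down:
            # Connectivity was lost and is now reestablished.
            restart_services()
            restarts += 1
        was_down = not is_up
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)
    return restarts
```

                                                                                                                                              In practice `check_db` could be a trivial query against the Orion database and `restart_services` a wrapper around your own service-control tooling; both are left open here.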

                                                                                                                                              Thanks.

                                                                                                                                              BB

                                                                                                                                                • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                  chris.lapoint

                                                                                                                                                  So 2 things here.  Orion needs a better way to self heal in cases like this.  Maybe a heartbeat to the DB server.  If it's lost, it keeps trying, and once it's reestablished it automatically stops and restarts the services.

                                                                                                                                                  I'll look into this.  My understanding is we already do this today.  You may have hit the maximum retries.

                                                                                                                                                  .  Second we need some alert mechanism that the poller services are down and/or the last db sync/update has been over X amount of time.  Both of these are critical and hopefully something you guys can get to us ASAP.

                                                                                                                                                  Do you receive anything in the Orion NPM events resource for this issue?    If not, perhaps at the very least, we can provide something like that.   I'll look into this.

                                                                                                                                                    • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                      jedski

                                                                                                                                                      I have the same experience and, as per the SolarWinds Tech Engr, the server is the problem and the load on the server is huge. I have ticket numbers 136549 and 135528. We have 1 main Orion and 2 pollers; only 1 poller crashes or stops. This normally occurs early in the morning.

                                                                                                                                                        • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                          warbird

                                                                                                                                                          Yup.  Our issues were due to too much load on one or both of the polling engines.  I was unable to determine if it was the primary polling engine or the secondary but as soon as we brought a 3rd poller online and I reduced the load on the existing pollers, the strange issues we were seeing went away.

                                                                                                                                                          The curious thing about these load issues is that, in our case, they were 'invisible' until the polling issues started.  Nothing I could monitor via perfmon showed me that the hardware itself was taxed beyond limits.  This leads me to believe that, at least in our circumstance, the load issue is software related and there currently is no metric to map it.  Or I was simply looking in the wrong place.

                                                                                                                                                          I will likely be bringing a 4th poller online soon.  My boss wants NTA running again and since that heavily taxes the primary poller, my plan is to move elements off of the primary and have it only doing NTA and the basic primary poller's job.  If strange issues appear again, I'll post about it here.

                                                                                                                                                      • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                        chris.lapoint

                                                                                                                                                        Just to clarify, there are 2 issues being tracked in this thread which are separate and not necessarily related:

                                                                                                                                                        1. Polling engine loses connectivity to the database and doesn't recover the connection until the service is restarted.  

                                                                                                                                                        • What to do?  We believe we have a fix for this and we're testing it out with a few customers.  Please contact support and we'll see if you're a candidate.

                                                                                                                                                        2. Polling engine stops because of excessive loading on the server.  This could be because you are polling more than the recommended number of elements per polling engine, or because you've ratcheted up the polling frequency (a shorter polling interval), which reduces the max elements per poller.

                                                                                                                                                        • What to do?  As warbird noted, distributing the load across additional polling engines should help.   You should still open a support ticket so that we can verify that your issues aren't related to something else.
                                                                                                                                                    • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                      johan

                                                                                                                                                      Hi,

                                                                                                                                                       

                                                                                                                                                      We are having the same issue running npm 9.5.1 on Win 2003 VM (ESX 3.5/4). Polling engine keeps stopping for no apparent reason. DB server is also a VM on the same cluster. Running SQL 2005 SP3, logged a call - ref Case #155628 

                                                                                                                                                      CPU, memory and disk space is not the issue.

                                                                                                                                                        • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                          ecornwell

                                                                                                                                                          Hi Johan,

                                                                                                                                                          Please let us know what you find.  We have a very similar environment and have the problem randomly as well.

                                                                                                                                                          Thanks,
                                                                                                                                                          Eric

                                                                                                                                                            • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                              andressk

                                                                                                                                                              Same here.

                                                                                                                                                              We have NPM SLX 9.5.1 with no additional pollers, in a very similar environment.

                                                                                                                                                              Our problem is that the polling engine stops for a few minutes and then starts again; that's what we have seen in the Monitor Polling Engine tool with the update time set to 10 seconds. The results are gaps in the data and delayed alerts.

                                                                                                                                                              On a support ticket that we opened, support said that our environment is not the recommended one, and that our DB server may be overloaded because it's on a VM.

                                                                                                                                                              I've passed all the recommendations to our server team, but until they give us a physical server for the DB (and the approval process is very lengthy), we have to wait and work with the VM environment we have.

                                                                                                                                                              Now, the point is: can we do something else until we can get the physical server assigned?

                                                                                                                                                              Thanks in advance for any suggestion.

                                                                                                                                                                • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                                  warbird


                                                                                                                                                                  Now, the point is: can we do something else until we can get the physical server assigned?

                                                                                                                                                                  Reduce your statistical polling interval.  If you are running NTA, reduce the number of flows pointed at your primary poller.

                                                                                                                                                                    • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                                      johan

                                                                                                                                                                      Hi Everyone,

                                                                                                                                                                      I just upgraded NPM to ver 10 RC2 and changed the VM environment to have the servers running on vSphere (4). We also took away 2 of the processors on the front-end/polling engine. Disks are still at RAID 5 on the DB (the recommendation is RAID 10 (1+0) on a physical box, but we are unable to do so at the moment).

                                                                                                                                                                      The short of it is that my baseline calculations ran through without the polling engine freezing once, and the disk read/write queue length stats were much better (they did not flatline nearly as much as before). The stats also do not seem to have those funny gaps in them. The event logs did not have nearly as many errors in them as before, either.

                                                                                                                                                                      I suggest you upgrade to ver 10 of NPM?

                                                                                                                                                                      • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                                        warbird


                                                                                                                                                                         



                                                                                                                                                                        Now, the point is: can we do something else until we can get the physical server assigned?

                                                                                                                                                                        Reduce your statistical polling interval.  If you are running NTA, reduce the number of flows pointed at your primary poller.

                                                                                                                                                                        I mistyped.  To clarify what I meant:  You can decrease load on the server by slowing down how often the statistical polling is done.  You do this by increasing the time between polls (increase the interval, not decrease as I mistyped).
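
                                                                                                                                                                        To put rough numbers on that, here is a small Python sketch. It assumes, as a simplification, that load scales with the average number of polls per second (elements divided by the interval). The element count and intervals are made-up figures, not SolarWinds recommendations, and `polls_per_second` is a hypothetical helper.

```python
def polls_per_second(elements: int, interval_seconds: float) -> float:
    """Average polls per second if every element is polled once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return elements / interval_seconds


if __name__ == "__main__":
    elements = 6000  # hypothetical element count on one poller
    for interval in (300, 600):  # hypothetical intervals, in seconds
        rate = polls_per_second(elements, interval)
        print(f"{elements} elements every {interval}s -> {rate:.1f} polls/s")
```

                                                                                                                                                                        In this simple model, doubling the interval halves the average poll rate the server has to sustain.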

                                                                                                                                      • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                        TimothyGaray

                                                                                                                                        I'll weigh in for our similar problems.

                                                                                                                                        Running 1 Orion NPM, APM, NTA and NCM server and 1 polling engine only.  Both are VMware virtual machines.  Both are Windows Server 2003 Standard x64.

                                                                                                                                        Same situation, polling stops.  Trying to restart service is futile.  Rebooting gets things running.

                                                                                                                                        Did anyone find how to monitor and thereby alert on the SNMP Outstanding value?

                                                                                                                                        Thanks!

                                                                                                                                        -Tim

                                                                                                                                          • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                            rmmagow

                                                                                                                                            Hi, I put this message in yesterday, forgot about this trail. Check your SQL server and Orion logs carefully:

                                                                                                                                            Most of you probably know this but just in case:

                                                                                                                                            Symptom: the poller seems to stop at a certain time each week/day etc. My root cause was that the DB group cycles SQL on all servers every Sunday morning to clear various SQL memory issues etc. Not a reboot, just a down and up on SQL. Anyway, every Monday I'd see red on the monitoring polling engine. I spoke with the SQL team and they do it every Sunday at 3:00 AM. I created two scheduled tasks on my Orion poller to pause netperfmonservice at 2:55 AM Sunday and to continue at 03:15. Works nice and clean, poller is up all the time now.

                                                                                                                                            syntax at the command prompt on the server:

                                                                                                                                            at 02:55 /every:Su net pause netperfmonservice

                                                                                                                                            at 03:15 /every:Su net continue netperfmonservice

                                                                                                                                            There are lots of different ways to accomplish this, such as looking for SQL messages, trap functions, etc. This was simple and works well enough.
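As one of those other ways, a monitoring script could simply ignore polling alarms while SQL is being cycled instead of pausing the service. The following is a minimal Python sketch under that assumption. The window (Sundays, 02:55 to 03:15) is taken from the two scheduled tasks above; the function name `in_sql_restart_window` is hypothetical and not part of Orion.

```python
from datetime import datetime, time

# Assumed maintenance window, copied from the scheduled tasks in this post:
# pause at 02:55 on Sunday, continue at 03:15 on Sunday.
WINDOW_WEEKDAY = 6            # datetime.weekday(): Monday is 0, Sunday is 6
WINDOW_START = time(2, 55)
WINDOW_END = time(3, 15)


def in_sql_restart_window(moment):
    """Return True if `moment` falls inside the weekly SQL restart window.

    The start of the window is inclusive and the end is exclusive, so
    03:15:00 itself already counts as normal polling time.
    """
    return (
        moment.weekday() == WINDOW_WEEKDAY
        and WINDOW_START <= moment.time() < WINDOW_END
    )


if __name__ == "__main__":
    # Hypothetical usage: only raise a "poller stopped" alarm outside the window.
    now = datetime.now()
    if in_sql_restart_window(now):
        print("Inside the SQL restart window, suppressing polling alarms")
    else:
        print("Outside the SQL restart window, alarms are active")
```

This only decides whether an alarm should be suppressed; it does not pause or resume the service itself, which is what the two `at` commands do.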

                                                                                                                                            Thanks

                                                                                                                                              • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                savell

                                                                                                                                                We use the following SQL statement as an ADO user experience monitor within ipMonitor.

                                                                                                                                                select ServerName, KeepAlive, GetDate(), DateDiff(second, KeepAlive, GetDate()) from [dbo].[Engines] where DateDiff(second, KeepAlive, GetDate()) > 60

                                                                                                                                                If the row count returned is > 0, then at least one of your pollers hasn't updated the database within 60 seconds (indicating an issue with the polling process).

                                                                                                                                                ipMonitor then sends out e-mails warning of the failure. The same SQL could be used in a number of different forms outside ipMonitor (i.e. scheduled SSIS job for instance).  
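For anyone without ipMonitor, the same check can be sketched in a few lines of Python. This is only an illustration of the logic in the query above, not something we run: it assumes you have already fetched (ServerName, KeepAlive) rows from [dbo].[Engines] with whatever database driver you use, and the function name `stale_engines` is made up for the example. The age is computed as whole elapsed seconds, which may differ slightly from how DateDiff counts.

```python
from datetime import datetime, timedelta


def stale_engines(rows, now, max_age_seconds=60):
    """Return (server_name, age_in_seconds) for every engine whose
    KeepAlive timestamp is more than `max_age_seconds` old.

    `rows` is an iterable of (server_name, keep_alive) tuples, where
    keep_alive is a datetime, as they would come back from a query on
    the Engines table. Fetching the rows is left to the caller.
    """
    stale = []
    for server_name, keep_alive in rows:
        age = int((now - keep_alive).total_seconds())
        if age > max_age_seconds:
            stale.append((server_name, age))
    return stale


if __name__ == "__main__":
    # Made-up sample data standing in for the result of the SQL query.
    now = datetime(2010, 4, 22, 9, 31, 0)
    rows = [
        ("POLLER-A", now - timedelta(seconds=12)),
        ("POLLER-B", now - timedelta(seconds=300)),
    ]
    for name, age in stale_engines(rows, now):
        print(f"{name} has not updated KeepAlive for {age} s")
```

An empty result means every poller has written its KeepAlive within the limit; anything returned is a candidate for an e-mail or other alert.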

                                                                                                                                                Dave.

                                                                                                                                                Edit: I should note that we did this as a result of seeing this issue within our implementation. In our case it was caused by a switch that failed (rebooted) from time to time; that switch provided the network connection to a couple of pollers and the database server. Once the pollers got behind, they never seemed to recover without a restart. The switch was replaced and the problem has not recurred. I imagine a similar problem would occur with a database restart, firewall connection issues, etc.

                                                                                                                                            • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                              mondtw

                                                                                                                                              We are also having the issue where the polling engine stops nightly and we need to restart the NPM service to get it running again.  I've had a ticket open without resolution.  They have indicated that our poller is overloaded, but with no changes in elements, polling intervals, or polls per second, a system that worked flawlessly under 9.5 now fails nightly under 9.5.1.

                                                                                                                                                • Re: NPM Polling stops.  No Rhyme or reason.
                                                                                                                                                  Lertsa

                                                                                                                                                  Hi all,

                                                                                                                                                  there's another thread about a problem with symptoms similar to this one, though I'm not sure everyone is experiencing them the same way. (Re: NetPerfMon Service Error: Exception in Service Timer Tick - 0)

                                                                                                                                                  I figured I should share my views and experiences of this problem on this thread too, in case it helps someone figure out why this is happening. I bolded some points that might be of some use.

                                                                                                                                                  I've had this problem (Orion NPM stops polling without any visible reason) with versions 9.5.0 SP4 and 9.5.1. I got some help from the SolarWinds helpdesk and we checked the databases for mangled IPs, duplicates, etc. I also disabled UAC and set Orion to run under a local administrator account instead of the system account. I also renamed all my nodes so that no underscore was used (there was something about this in the manual somewhere...). After these changes Orion stayed up and running for 2.5 weeks without problems, until now. Before these changes the problem usually appeared within a week. An NPM service restart fixes the problem.

                                                                                                                                                  Earlier, every time this happened the Application log in the server's Event Viewer was flooded with the error "NetPerfMon Service Error:  Exception in Service Timer Tick - 0". While this was going on, I couldn't start the System Manager because of the compatibility error. The flooding starts with 1850 entries in the first 5 seconds, and after that there is one entry every 5 seconds.

                                                                                                                                                  Now the same thing happens, but I've noticed two separate occasions where NPM stopped polling and there was no flooding in the application log. Yesterday morning the polling had stopped at 7:15, but the application log flooding started at 9:51, in the same second that I logged on to the server using a domain admin account. There were no log entries in between.

                                                                                                                                                  If this is indeed a platform/version problem, people should compare their environments to see if there's something in common.

                                                                                                                                                  Our environment:

                                                                                                                                                  Software:
                                                                                                                                                  Orion NPM SLX versions 9.5.0 SP4 and 9.5.1 tested
                                                                                                                                                  381 nodes 4800 interfaces

                                                                                                                                                  Operating system:
                                                                                                                                                  64-bit Windows 2008 server SP2

                                                                                                                                                  Database:
                                                                                                                                                  64-bit Microsoft SQL Server Express 2008
                                                                                                                                                  Located on the same server with the software

                                                                                                                                                  Hardware:
                                                                                                                                                  Intel Xeon E5540 @ 2.53 GHz, 6.00 GB RAM

                                                                                                                                                  Antivirus+firewall:
                                                                                                                                                  F-secure antivirus for servers 8 (upgraded now to 9)
                                                                                                                                                  I can't remember if the Windows firewall was on or off during the testing; the F-Secure installation turned it on automatically.

                                                                                                                                                  Others:
                                                                                                                                                  The server is connected to the domain, the Orion services are running on a local administrator account.
                                                                                                                                                  .NET 3.5 SP1 with update and hotfixes

                                                                                                                                                  I hope someone figures out a fix for this soon. I have been struggling with this for almost 6 months now (the server motherboard was defective, and that took some weeks at first...). The old EE server has already been shut down, so a working NPM is really needed to monitor the network.

                                                                                                                                                  - Lertsa

                                                                                                                                                  P.S. I'll open a new ticket when the problem appears again.