9 Replies Latest reply on Sep 3, 2014 11:17 AM by dgglynn

    NPM topology calculation problems

    jeffnorton

      I have seen various discussions about problems with the NPM topology implementation.  These range from extra load on Cisco routers to extra load on polling servers, so I thought I would discuss my recent issue.  Let me start by explaining our physical setup.  We have a main Orion poller, backed up to a secondary server using the Orion FOE product.  We then have 4 additional pollers as well as an additional Orion web server set up for https.  We are running NPM, NCM, NTA, VNQM, UDT, and SAM.  We are polling about 50,000 resources.  Recently we upgraded over a short time frame to NPM 10.7, NCM 7.3, NTA 4.0.2, and SAM 6.1.1.  The NCM upgrade started causing us problems when failing over from the Primary to the Secondary via FOE.  Lots of alphabet soup here.  Basically, when we failed over from the Primary to the Secondary, the sync between the two would never complete.  It hung on several files that were still in use on the Primary after the failover.  What we found was that the Orion topology calculator executable never shut down.  Since it's an executable rather than a service, neither FOE nor Service Manager shut it down.  When I shut down the exe via Task Manager, the FOE sync process would complete.  Of course we referred this to support, and the response was not really what I expected.  The explanation was that our network was so big that the topology calculations weren't completing.  I was also advised that Dev was aware of this and would possibly have something out by the next release.  I also got to looking at the amount of memory the topo calc executable was consuming, and it was 1.2 GB, which surpassed even the business layer process, so it's huge.  I then looked at the amount of CPU being used on the SQL server to support this data, and again it was huge.
      I also looked across Thwack and the knowledgebase and saw a lot of queries and responses where people asked about shutting down the topology components.  There were a lot.  I tried one last tedious step to see if I could reduce the resources being used by topology: I went in and removed the topology resources from all the layer 2 switches we had in Orion.  There were 500.  Did I say tedious?  Since there is no way to do this en masse, it was done device by device.  It had very little effect on the amount of memory being used by the topo calc process.  So my last step was to follow the instructions in Knowledgebase article 3523 and disable topology on all pollers, including the main.  That significantly reduced the memory utilization on all the pollers, and it significantly reduced the processor utilization on the SQL server.  Having said all this, the Knowledgebase article also says that the change is not permanent: the next time the config wizard is run, topology will default back to on.

       

      OK, so all of this was and is very wordy, but it has made me think that topology, while nice to have, is not worth the reduced server performance.  I also tend to agree with many of the posts I read that SolarWinds should give NPM users a way, in the admin section, to permanently turn off this function.  Obviously, a more ideal solution would be to fix the entire topology calculation process.  I very much disagree with their assessment that my network is a large one.  Reading many of the Thwack posts, I see references to users with much larger networks.  This just tells me SolarWinds is setting the bar too low for what it considers a large network and thereby engineering its products at a low capacity level.  My observation is that the product is licensed on a resource level but engineered on a device level.  It is entirely possible for a network to have a low number of devices with a high number of resources.

       

      So I am challenging the product managers to get involved here and raise the priority of fixing topology calculation and of giving us a way to manage it.  Personally, I would rather see no more enhancements until SolarWinds fixes many of the performance bugs.

        • Re: NPM topology calculation problems
          jeffnorton

          I really like the silence I get from SolarWinds on this one, NOT.  All one has to do is type "topology" in the Thwack search box and reams of discussions show up.  As of this morning we have turned off topology for the product and flushed all the database tables.  In addition to the problems discussed above, most of the layer 3 data that Orion was showing was inaccurate.  Since we're mostly a Cisco shop, I have substituted a universal poller using the CDP neighbor MIBs (without keeping historical data).  I'm going to reiterate that SolarWinds needs to step back and fix the issues people are complaining about before adding any more features.

            • Re: NPM topology calculation problems
              rob.hock

              Hi Jeff,

               

              Apologies for the difficulties, sir. Is there a support case open that we can reference? In the interim, it sounds like you have been successful in removing topology calculation to reduce load (not that this is a solution, just a workaround). Since you're on 10.7, there is also a method of removing topology polling in bulk through the "Manage Pollers" page:

               

              [Screenshot: 6-19-2014 10-13-09 AM.jpg]

                • Re: NPM topology calculation problems
                  jeffnorton

                  Yes, I found the bulk poller option too and used it.  And yes, I do have a support case open.  It's sitting in dev with a response that the specific topo calc issue has been seen and will be addressed in the next release.  However, the multitude of issues with topology calculation, such as load on routers, load on pollers, and erroneous results, are not being addressed, or at least nobody is saying that they are being looked at.  I for one would like to have some positive feedback from the NPM product manager that these complaints have been heard and are being looked at.

                   

                  It seems that many things just sort of flop out there on Thwack with no positive feedback from the SolarWinds staff.  For instance: reports that come out in specific time zones, customer dev licenses, backing up binary config files in NCM, etc.

                   

                  Again, fix what you have.  There's a ton of things commented on in Thwack.

                    • Re: NPM topology calculation problems
                      rob.hock

                      Jeff,

                       

                      Apologies if there is a communication disconnect, sir, but we most certainly take customer feedback on thwack very seriously. With regard to the topology support case, I'm not sure we've exhausted all avenues of investigation, and I will ask our dev team to look into it. We are indeed constantly improving our polling algorithms and calculations based on feedback and diagnostics through our support channel. Route polling in particular has seen significant efficiency gains as of late. If you happen to have any time next week, we'd love to set up a call to talk through outstanding issues and current pain points. Please feel free to email me (rob.hock{at}solarwinds.com) with availability and we'll set it up straight away.

                       

                      Regards,

                       

                      Rob

                  • Re: NPM topology calculation problems
                    RichardLetts

                    It's just as bad if you are a Juniper shop running LLDP (which does at least have an out of the box poller), but only because the topology calculator will be missing the necessary information to do any work...

                     

                    With LLDP Junipers are not very consistent in this respect -- on switches they advertise unit 0 on LLDP, and on routers they advertise the interface.

                     

                    L2 bridge/MAC addresses are tied to the unit on both platforms

                    Physical interfaces are the place that error counters accrue <-- this is what we monitor for that reason

                    You don't really want to monitor both the unit0 and the physical since that doubles the number of interfaces being monitored.

                     

                    L3 ARP tables are scoped by the VRF, so without VRF polling for the ARP data nothing is returned (see IDEAs on VRF polling).

                     

                    This basically means the topology calculator in my environment lacks the necessary information to tie things together in any meaningful manner

                     

                    I feel that topology calculation might benefit from a more focused discussion by those people who have opinions and ideas on how this could work in a large environment with weird and wonderful equipment.

                  • Re: NPM topology calculation problems
                    alexfoster

                    I have just gone through the pain of this issue (4 days of diagnostics with support).  Our installation is only polling 3000 elements, so 6% of your load - this has nothing to do with the number of resources being polled - this is a bug with the topology poller, pure and simple.  Shortly after upgrading to 10.7, we started hitting disk space issues, and these were attributed to huge crash files being created as a result of the *SolarWinds.JobEngineWorker.v2* application crashing (System.OutOfMemoryException errors).  When the topology poller (process is solarwinds.orion.core.collector) was running, memory consumption was in the order of 1.4-1.5 GB, and eventually the process crashed and then restarted.  Each crash created a crash file equal in size to the memory in use at the time (so a 1.4 GB file) - multiply that by a crash occurring every hour and our disk space was consumed very quickly.  One symptom of this issue was that some charts were showing incomplete data (due to low disk space) - once the disk space was freed up, the data issue was resolved (the true cause was unknown at this point).  Support have changed how often the topology poller runs - to every 6 hours - but the application is still crashing when it runs, creating a 1.4 GB crash file every time.  The only solution, I feel, is to disable the poller.  For those suffering the same issue, the crash files are located in C:\ProgramData\Microsoft\Windows\WER\ReportQueue; each crash creates its own folder, and the crash file has a .hdmp extension.
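For anyone who wants to check how much disk these dumps are eating, here is a minimal sketch that walks the ReportQueue folder mentioned above and totals the .hdmp files. The function names (`find_crash_dumps`, `total_gib`) are made up for this example and are not part of any SolarWinds or Windows tooling; the path is the one quoted in the post and may differ on your server.

```python
from pathlib import Path

# Location quoted in the post above; adjust if your server stores WER reports elsewhere.
DEFAULT_REPORT_QUEUE = r"C:\ProgramData\Microsoft\Windows\WER\ReportQueue"


def find_crash_dumps(root, suffix=".hdmp"):
    """Return (path, size_in_bytes) for every dump file under root, largest first."""
    root = Path(root)
    if not root.is_dir():
        return []
    dumps = [
        (p, p.stat().st_size)
        for p in root.rglob("*")  # each crash sits in its own subfolder
        if p.is_file() and p.suffix.lower() == suffix
    ]
    return sorted(dumps, key=lambda item: item[1], reverse=True)


def total_gib(dumps):
    """Sum the sizes of the dumps and express the result in GiB."""
    return sum(size for _, size in dumps) / 1024 ** 3


if __name__ == "__main__":
    found = find_crash_dumps(DEFAULT_REPORT_QUEUE)
    for path, size in found:
        print(f"{size / 1024 ** 2:10.1f} MiB  {path}")
    print(f"{len(found)} dump file(s), {total_gib(found):.2f} GiB in total")
```

As a rough yardstick from the numbers above: one 1.4 GB dump per hour is about 33.6 GB a day, and even at one run every 6 hours it is still about 5.6 GB a day.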

                      • Re: NPM topology calculation problems
                        sja

                        Same here on NPM 11.1: a 1.3 GB crash file in the same location.

                        Topology is enabled on ONLY 8 core Juniper routers, "so it scales quite well?"

                         

                        /SJA

                        • Re: NPM topology calculation problems
                          dgglynn

                          Thanks for the hint to look in the C:\ProgramData\Microsoft\Windows\WER\ReportQueue directory for crash dumps.

                           

                          We appear to have ten or so ~700 MB hdmp dump files in our system.

                           

                          We have also been having issues where what were previously successfully polled switches for topo are now producing results that are discordant.

                           

                          We are seeing tables that show a polled remote port ID but claim the local port is unknown, while another part of the same table shows the same local port IDs correctly and claims the same remote port is unknown.

                           

                          Oftentimes these half correct, but half blind results are also paired with a table row that includes both local and remote ports successfully ID'd.

                           

                          So I will have three entries for each port pair, local and remote: only one entry is complete and correct, and the other two show only half the info, one with the remote port unknown and one with the local port unknown.
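As a stopgap, the three-rows-per-link pattern described above could be cleaned up after export with something like the sketch below. It is only an illustration: the `Link` tuple and `collapse_partial_rows` are hypothetical names, this is not Orion's table schema, and it assumes the rows come from a single local device and that the remote port string is unique (for example, prefixed with the remote device name). A half-known row is dropped only when a complete row already covers the same port.

```python
from typing import NamedTuple, Optional


class Link(NamedTuple):
    local_port: Optional[str]   # None means the table reported "unknown"
    remote_port: Optional[str]  # None means the table reported "unknown"


def collapse_partial_rows(rows):
    """Drop half-known rows that a complete row already explains."""
    rows = list(rows)
    complete = [r for r in rows if r.local_port is not None and r.remote_port is not None]
    known_local = {r.local_port for r in complete}
    known_remote = {r.remote_port for r in complete}

    kept = []
    for row in rows:
        is_complete = row.local_port is not None and row.remote_port is not None
        if not is_complete:
            # Local port known, remote unknown, but a complete row has this local port.
            if row.remote_port is None and row.local_port in known_local:
                continue
            # Remote port known, local unknown, but a complete row has this remote port.
            if row.local_port is None and row.remote_port in known_remote:
                continue
        if row not in kept:  # also removes exact duplicates
            kept.append(row)
    return kept


if __name__ == "__main__":
    table = [
        Link("Gi1/0/1", "sw2:Gi0/24"),
        Link(None, "sw2:Gi0/24"),
        Link("Gi1/0/1", None),
        Link("Gi1/0/2", None),  # no complete row covers this one, so it is kept
    ]
    for link in collapse_partial_rows(table):
        print(link)
```

Rows that no complete entry covers are left alone on purpose, since they may be the only evidence of a link.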

                           

                          Of course, since it was working before, I am once again filled with dismay that something our team depends on is going from pedantic but workable, with cajoling and discipline, to fricking broken, rendering a working element useless and requiring us to build workarounds to extract semi-useful information for support personnel. Very frustrating.