I have seen various discussions about issues with the NPM topology implementation. These range from extra load on Cisco routers to extra load on polling servers, so I thought I would discuss my recent issue.
Let me start by explaining our physical setup. We have a main Orion poller, and it is backed up to a secondary server using the Orion FOE product. We then have 4 additional pollers as well as an additional Orion web server set up for https. We are running NPM, NCM, NTA, VNQM, UDT, and SAM. We are polling about 50,000 resources. Recently we upgraded, over a short time frame, to NPM 10.7, NCM 7.3, NTA 4.0.2, and SAM 6.1.1. The NCM upgrade started causing us some problems when failing over from the Primary to the Secondary via FOE. Lots of alphabet soup here.
Basically, when we failed over from the Primary to the Secondary, the sync between the two would never complete. It hung on several files that were still in use on the Primary after the failover. What we found was that the Orion topology calculator executable never shut down. Since it's not a service but an executable, neither FOE nor Service Manager shut it down. When I shut down the exe via Task Manager, the FOE sync process would complete.
Of course we referred this to support, and the response was not really what I expected. The explanation was that our network was so big that the topology calculations weren't completing. I was also advised that Dev was aware of this and would possibly have something out by the next release.
I also got to looking at the amount of memory the topology calculator executable was consuming, and it was 1.2 GB, which even surpassed the business layer process, so it's huge. I then looked at the amount of CPU being used on the SQL server to support this data, and again it was huge. I also looked across Thwack and the Knowledgebase and saw a lot of queries and responses where people asked about shutting down the topology components. There were a lot.
I tried one last tedious step to see if I could reduce the resources being used by topology. I went in and removed the topology resources from all the layer 2 switches we had in Orion. There were 500. Did I say tedious? Since there is no way to do this en masse, it was done device by device. It had very little effect on the amount of memory being used by the topology calculator process.
So my last step was to follow the instructions in Knowledgebase article 3523 and disable topology on all pollers, including the main. That significantly reduced the memory utilization on all the pollers, and it significantly reduced the CPU utilization on the SQL server. Having said all this, the Knowledgebase article also says that the change is not permanent: the next time the Configuration Wizard is run, topology will default back to on.
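For anyone who wants to check whether the same leftover process is sitting on their Primary after a failover, here is a rough sketch in Python of how the Task Manager check could be scripted. It only reads the CSV output of the standard Windows `tasklist /FO CSV /NH` command and lists processes whose image name contains a given fragment, along with their memory use. The fragment `"topology"` is an assumption on my part, not the confirmed name of the executable, so check the real image name in Task Manager and adjust it. The script does not stop anything.

```python
import csv
import io
import subprocess

# Hypothetical name fragment: confirm the real image name in Task Manager first.
NAME_FRAGMENT = "topology"


def parse_mem_kb(field):
    """Convert a tasklist 'Mem Usage' field such as '1,234,567 K' to KB as an int."""
    digits = "".join(ch for ch in field if ch.isdigit())
    return int(digits) if digits else 0


def find_processes(tasklist_csv, name_fragment=NAME_FRAGMENT):
    """Return (image_name, pid, mem_kb) for rows whose image name contains the fragment."""
    matches = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Expected columns: Image Name, PID, Session Name, Session#, Mem Usage
        if len(row) < 5 or not row[1].isdigit():
            continue  # skip blank lines and any header row
        image, pid, mem = row[0], row[1], row[4]
        if name_fragment.lower() in image.lower():
            matches.append((image, int(pid), parse_mem_kb(mem)))
    return matches


if __name__ == "__main__":
    # Windows only: /FO CSV gives CSV output, /NH suppresses the header row.
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True
    )
    for image, pid, mem_kb in find_processes(result.stdout):
        print(f"{image} (PID {pid}) is using about {mem_kb / 1024:.0f} MB")
```

If this lists a process on the Primary after failover, that is the one I had to end by hand before the FOE sync would complete.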
OK, so all of this was and is very wordy, but it has given me pause to think that topology, while nice to have, is not worth the reduced server performance. I also tend to agree with many of the posts I read that SolarWinds should give NPM users a way, in the admin section, to permanently turn off this function. Obviously, a more ideal solution would be to fix the entire topology calculation process. I very much disagree with their assessment that my network is a large one. Reading many of the Thwack posts, I see references to users with much larger networks. This just tells me SolarWinds is setting the bar too low for what it considers a large network, and thereby engineering its products at a low capacity level. My observation is that the product is licensed on a resource level but engineered on a device level. It is entirely probable that a user's network could have a low number of devices with a high number of resources.
So I am challenging the product managers to get involved here and raise the priority of fixing and managing topology calculations. Personally, I would rather see no more enhancements until SolarWinds fixes many of the performance bugs.