Starting/Stopping NPM starting to cause serious problems.

Question

Hi,

We have an SLX system with around 3500 nodes totalling around 10000 elements. The slow starting and stopping of the NPM service has been mentioned by others on this forum before, but as our system has been extended the issue has grown into a serious problem. The situation came to a head some weeks ago when I attempted to upgrade NPM 9.0 to NPM 9.1 and SP2 and to install a second poller.

Stopping the service: Well, it just won't stop. It won't stop when I stop it manually from Services, nor will it obey the installation programs. If you don't kill the process, the service will hang in the "stopping" state until the installation bombs. And it's got worse. At least initially the service seemed to have been configured to restart after stopping. I configured "Take No Action" in the service recovery options. Since I have had recurring problems with this, I'm thinking that the setup programs or the configuration wizard might be setting the service back to restart. Am I right? "Start setup which tries to stop the service" --> "service won't stop" --> "I kill the service process" --> "Windows restarts the service" --> "setup continues but won't be able to replace the service .exe" --> "tough luck". I know that this issue should have been addressed in the latest versions, but I didn't notice any improvement. It's possible that I didn't wait long enough, but based on experience I do know to wait for quite a long time. Experiences, others?
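The stop, wait, kill sequence described above can be sketched as a small script. This is only a rough illustration under stated assumptions: the service name and process image name are placeholders (the real NPM names are not given in this thread), the timeout values are arbitrary, and it relies on the standard Windows `sc` and `taskkill` commands. It first sets the recovery actions to none, so that killing the process should not trigger an automatic restart.

```python
import subprocess
import time


def service_state(name):
    """Return the STATE word reported by `sc query`, e.g. RUNNING or STOPPED."""
    out = subprocess.run(["sc", "query", name],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "STATE" in line:
            return line.split()[-1]
    return "UNKNOWN"


def stop_or_kill(name, image, timeout_s=300, poll_s=5,
                 state=service_state, run=subprocess.run, sleep=time.sleep):
    """Stop a service; kill its process if it has not stopped in time.

    `name` and `image` are supplied by the caller (placeholders here).
    `state`, `run` and `sleep` are injectable so the logic can be
    exercised without touching a real service.
    """
    # Clear the recovery actions (the "Take No Action" setting), so a
    # killed process is not restarted by the service control manager.
    run(["sc", "failure", name, "reset=", "0", "actions=", ""])
    run(["sc", "stop", name])

    waited = 0
    while waited < timeout_s:
        if state(name) == "STOPPED":
            return "stopped"
        sleep(poll_s)
        waited += poll_s

    # Still not stopped after the timeout: force-kill by image name.
    run(["taskkill", "/F", "/IM", image])
    return "killed"


# Hypothetical usage on the poller, with placeholder names:
# stop_or_kill("ExampleNpmService", "ExampleNpmService.exe", timeout_s=600)
```

If a setup program or the configuration wizard does reset the recovery options, as suspected above, the `sc failure` step would have to be repeated before each installation.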

Starting the service: No greased lightning? Sluggish, maybe? I timed it: 12 minutes! The startup time of the service seems to depend fairly linearly on the number of elements. Now that I have the elements divided between the two pollers, the startup times are around half of that. But we are still talking about 5+ minutes.
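As a back-of-the-envelope check of the linear relationship, the measured figures (12 minutes for around 10000 elements) can be scaled to other poller sizes. This is only a sketch that assumes startup time is strictly proportional to the element count, which the observations suggest but do not prove; the function name is made up for illustration.

```python
def estimated_startup_minutes(elements, measured_minutes=12, measured_elements=10000):
    """Scale the measured startup time linearly to another element count."""
    return elements * measured_minutes / measured_elements


# Elements split evenly across two pollers.
per_poller = 10000 // 2
print(f"{per_poller} elements: about "
      f"{estimated_startup_minutes(per_poller):.1f} minutes per poller")
```

Under that assumption a poller with half the elements should start in roughly half the time, which is in line with the 5+ minutes seen after the split.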

These problems have now escalated from mere nuisances to problems with tangible consequences. Last time I attempted to make four installations at once that (theoretically should have) required starting and stopping the service four times (NPM 9.1, SP2, SP2 to new poller, moving of nodes to new poller). I had a very generous service window (2 hours), but the frantic service killings and long waits for service startups really took their toll. For some reason the SP2 installation actually failed, as the device tree remained totally empty when opening the System Manager. So I had to reinstall NPM 9.1, which did repair the problem. However, by this time I had used so much time that I didn't attempt to reinstall SP2 anymore. I have been installing and upgrading every version and service pack since summer 2006 and this was the first time I actually quit the task before completion.

Another problem is that moving nodes from one poller to another requires stopping the services, at least according to the warning shown by the application. Does this really mean starting and stopping the NPM service, or have I misunderstood something? Some nodes were not reachable from the new poller because of insufficient firewall rules, so I tossed the nodes back to the original poller. The firewall rules were fixed, but I didn't really want to move the affected nodes to the new poller anymore. It would have taken around 10 minutes again and I had already caused quite a stretch of flat line in the graphs with the initial installations. I'll move them later in conjunction with some other upgrade, configuration or maintenance. Until then, about 20 nodes are located "on the wrong poller considering the guidelines as to which device groups are on which poller". Here we are talking about such a small number of nodes that I could manually delete them and add them to the new poller, but then I would lose the gathered data.

The platform we are running is quite good: a dual-processor Opteron with 4GB memory and Windows 2003 x64 Server SP2. From reading the forum, it seems that other people are also having similar problems with the NPM service, and it doesn't seem to be the only performance-related complaint either. It isn't reassuring when a complaint about 100% CPU load in conjunction with NCM and RTCD is answered with a suggestion of not using the feature until a future version is released. Simple inventories also crippled my server CPU-wise when I was evaluating NCM 5.0. That was just a VMware node, so I thought I might have been given only minute resources. I asked the server guy about the resource configuration and he realized there was no hard limit on the resources. "You sucked the juices of the whole box! You're capped now!".

I have no idea how complex the operations are that NPM performs during startup, so I do not know how big a task it is to speed it up. But Solarwinds asks now and then what features people would like in future versions, and my vote would go to performance optimization. I'm impressed by the speed at which you guys add visible features to the applications, but at this point I would really appreciate some under-the-hood work also. Finally, I'd like to throw a couple of workaround ideas in the air that should be fairly easy to implement from a programming point of view...

- Make it an option that the setup program won't start or stop the services. This way I can manually kill 'em all (did Metallica use one of the early versions?) and then start them after all installations, configuration wizard and potential node reallocations have been done.

- Add a "reload the devices from database" option to System Manager, so that nodes can be reallocated between pollers without touching the services.

Best regards,

Marko

neilmborilla · Answer

Somebody needs to get this resolved.

rdoda · Answer

One thing I do notice in the event logs is an error like:

EventType clr20r3, P1 solarwinds.businesslayerhost.exe, P2 9.1.0.328, P3 48e402ab, P4 system.data, P5 2.0.0.0, P6 471ebf27, P7 23cb, P8 295, P9 system.data.sqlclient.sql, P10 NIL.
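To make an entry like this easier to read, the fields can be split out with a few lines of Python. This is a rough sketch: the label names below are an assumption based on the commonly described layout of clr20r3 buckets for unhandled .NET exceptions, not something confirmed in this thread, and the P9 value is kept exactly as it appears in the log.

```python
from datetime import datetime, timezone

EVENT = ("EventType clr20r3, P1 solarwinds.businesslayerhost.exe, P2 9.1.0.328, "
         "P3 48e402ab, P4 system.data, P5 2.0.0.0, P6 471ebf27, P7 23cb, P8 295, "
         "P9 system.data.sqlclient.sql, P10 NIL.")

# Assumed meaning of each field (not confirmed by the log itself).
ASSUMED_LABELS = {
    "P1": "application name",
    "P2": "application version",
    "P3": "application build timestamp (hex)",
    "P4": "faulting assembly",
    "P5": "faulting assembly version",
    "P6": "faulting assembly timestamp (hex)",
    "P7": "method token",
    "P8": "IL offset",
    "P9": "exception type (as logged)",
}


def parse_bucket(text):
    """Split 'KEY value, KEY value, ...' into a dict of key -> value."""
    fields = {}
    for part in text.rstrip(".").split(", "):
        key, _, value = part.partition(" ")
        fields[key] = value
    return fields


def hex_timestamp(value):
    """Interpret a hex field as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(value, 16), tz=timezone.utc)


fields = parse_bucket(EVENT)
for key, label in ASSUMED_LABELS.items():
    print(f"{key:>3} {label}: {fields[key]}")
print("P3 as a date:", hex_timestamp(fields["P3"]).date())
```

Read this way, the entry would point to an unhandled exception raised from system.data (the SQL client area) inside solarwinds.businesslayerhost.exe.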

rdoda · Answer

We have 64GB of RAM, 8 pollers with about 2500 elements each, still have the same issue.

Also, I have seen that once it loses its connection to the DB or is unable to reconnect, it doesn't establish the connection again until a restart. Has anyone else noticed that? We are running 9.1 SP3.