This discussion has been locked. The information referenced herein may be inaccurate due to age, software updates, or external references.

Possible bug?

Hi folks,

I've been having just the worst time trying to resolve my Information Service v3 issues. First, a .NET Framework issue was identified and resolved. Second, a broken IPAM view was fixed. Third, an issue with NetFlow was resolved. The service has been uninstalled and reinstalled more than 20 times already between my own troubleshooting and troubleshooting from support. One by one the error messages started to clear from the SWIS v3 log, but one issue remained. I noticed that over the course of a week the CPU and memory use slowly creeps up and is never released. The log isn't huge like it once was: we went from several logs being created within a few hours to just one log over the span of a week, and it isn't big at all.

No errors are being observed any longer, but a few warnings are still generated. They do not seem to affect performance. There does appear to be a resource leak, though, where the service takes up all the resources of my server.

Initially we had a build spec'd out at 8 cores at 2.4 GHz. Now we have 24 cores at 2.4 GHz, and all cores are at 50% use, for a total of 50% use of the entire CPU. When I check processes, the only process taking any CPU is Information Service V3, at between 40 and 60 percent CPU and 1 to 2 GB of memory.

Could this be a resource leak in the code that SolarWinds has not caught? Has anyone noticed Information Service V3 using this amount of resources or more in your environment?

Thanks!
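One rough way to tell a steady creep from usage that is merely high but flat is to log the service's memory (or CPU) at intervals and look at the trend. The sketch below fits a least-squares slope to such readings. The sample values, the function names, and the 1 MB/hour threshold are all hypothetical and chosen only for illustration; they are not measurements from this thread.

```python
def slope_per_hour(samples):
    """Least-squares slope of the readings against time in hours."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def looks_like_creep(samples, min_slope):
    """True when usage trends upward faster than min_slope per hour."""
    return slope_per_hour(samples) > min_slope


# Hypothetical readings: (hours since service start, memory in MB).
creeping = [(0, 400), (24, 640), (48, 880), (72, 1120), (96, 1360)]
steady = [(0, 1000), (24, 1010), (48, 990), (72, 1005), (96, 995)]

print(slope_per_hour(creeping))                    # growth in MB per hour
print(looks_like_creep(creeping, min_slope=1.0))   # upward trend
print(looks_like_creep(steady, min_slope=1.0))     # high but flat
```

A positive slope that persists across a week of readings would point toward a leak; a flat line at a high level points more toward sustained legitimate load.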

  • No, my issue is different. In my environment, Information Service v3 slowly creeps up on CPU and memory usage, then stays at high usage and never releases it. I don't see any logs about the service restarting or acting in any weird way. It stays on and it functions, but it just kills my resources.

    We didn't see this with 11.5.3. This started after the NPM 12 upgrade. Initially we had 2008 R2 SP1 running on 8 vCPUs. After the upgrade our CPUs would peg out to the point where we couldn't even open the VM. So we upgraded the VM: we gave it dedicated storage on RAID 10 to remove possible hard drive bottlenecks and increased the vCPU count to 12. No luck. We went from pegging 8 cores to staying at 90% on 12 cores.

    We figured it might have been an issue on the host, since we were on an older, saturated host. So we built a brand new VM on a newer, less used host. And we went crazy on specs: 24 vCPUs, 2012 R2 for the OS, and 48 GB of RAM, whereas before we had 12.

    During investigations we found we did have some broken views, among other issues that contributed to the problem, and those have since been fixed. But now, even with no errors in the logs and a very powerful VM, Information Service v3 still takes a steady 50% CPU and about a gig of RAM. And this is 50% across all 24 cores, due to the better threading and handling in Windows 2012 R2 Datacenter edition.

    This is why I'm thinking it could be a software resource leak somewhere. But I wanted to open this up to see whether this is only in my environment or whether other people are seeing it in their environments too.
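The three configurations described above can be put on a common scale by converting each utilization percentage into "cores' worth" of work. This is only back-of-the-envelope arithmetic: it reads "pegged" as roughly 100% and assumes nearly all of the load is the one service, which the thread suggests but does not measure precisely. The helper name `busy_cores` is made up for this sketch.

```python
def busy_cores(total_cores, utilization_percent):
    """Convert a whole-machine CPU percentage into cores' worth of work."""
    return total_cores * utilization_percent / 100


# (vCPU count, approximate CPU utilization in percent) as reported in the thread.
configs = [(8, 100), (12, 90), (24, 50)]

for cores, pct in configs:
    print(f"{cores} vCPUs at {pct}% is about {busy_cores(cores, pct):.1f} cores busy")
```

Under these assumptions the load works out to roughly 8, 10.8, and 12 cores' worth. In other words, the lower percentage on 24 vCPUs does not mean the service is doing less work; the same demand is just spread over more cores.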

  • screenshot.JPG

    The above is only Information Service v3, with around 3 to 4 percent being everything else. As you can see, this shouldn't be the case on such a powerful server VM. One service killing a machine like this just doesn't sound right to me.

  • Your issue sounds like possible VM overprovisioning. How many cores do the actual processors hosting your VM have? If you have the same number of vCPUs or more provisioned than the physical host has cores, you're going to have IO issues that can end up spiking CPU use substantially.

  • Initially we had 2 sockets and 8 cores: 2 CPUs with 4 cores each, for a total of 8 cores. Performance in this configuration was unusable, and CPU was pegged at 100% nearly all the time. We spent most of our time rebooting the VM just to gain console access.

    Then we switched to 4 sockets by 3 cores, for a total of 12 cores. With this setup we were at 90% CPU and it still did not offer any stability.

    And now we have switched to 6 sockets by 4 cores, for a total of 24 cores. Now we have enough availability to run the environment without reboots.

    I'm not convinced that it's a VM setup issue. I've been through changes time and again, and we've tried many different setups, all giving the exact same result.

    And in our current system, if I stop Information Service v3, I have a max of 3% CPU. I turn the service on and I go to 60% use. It's a service problem and not a hardware problem.

    Again, I was just trying to understand whether this has been seen in any other environment out there or whether it's specific to mine. It would greatly help guide me towards a resolution.

    I've been researching these issues for over a year now, and it's been a difficult ride. I'm hoping I can get some help.
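The overprovisioning question raised earlier comes down to comparing the VM's total vCPUs (sockets times cores per socket) with the host's physical core count. The sketch below applies that rule of thumb to the three layouts listed in this reply. The host core count is never given in the thread, so `HOST_PHYSICAL_CORES = 32` is a purely hypothetical placeholder, and the function names are made up for illustration.

```python
def total_vcpus(sockets, cores_per_socket):
    """Total vCPUs presented to the guest."""
    return sockets * cores_per_socket


def is_overprovisioned(vcpus, host_physical_cores):
    """Rule of thumb from the reply above: flag a VM whose vCPU count
    equals or exceeds the host's physical core count."""
    return vcpus >= host_physical_cores


# Hypothetical value; the thread never states the host's core count.
HOST_PHYSICAL_CORES = 32

# (sockets, cores per socket) for the three layouts tried in the thread.
layouts = [(2, 4), (4, 3), (6, 4)]

for sockets, cores in layouts:
    vcpus = total_vcpus(sockets, cores)
    flagged = is_overprovisioned(vcpus, HOST_PHYSICAL_CORES)
    print(f"{sockets} x {cores} = {vcpus} vCPUs, overprovisioned: {flagged}")
```

Substituting the real host core count would show whether the 24 vCPU layout crosses that line. Other VMs sharing the same host also compete for those cores, which this simple check does not account for.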

  • I ended up calling SolarWinds on this. They have taken a look and agreed that this isn't right. They have gathered logs and are investigating. They are considering logging this as a bug internally and sending it to the DEV team. I'll update when I have an update from them.

  • Issue turned out to be local. An IPAM view had long ago been set to look up UDT information when the previous administrator was testing UDT. UDT was never licensed, but the view was never modified. With SolarWinds we found the view and removed the option. This fixed the view and cleared a majority of the errors in Information Service v3. The other issues were also local and have been fixed. Basically all errors were removed and fixed. Logs are practically error free now.

    There are a few warnings here and there, and everything seemed fine, but CPU and RAM use by the service is still awfully high for the strong system we have deployed. SolarWinds has taken the logs and a few proc dumps and has sent them to their DEV team.

    Because stability was restored, the site has not gone down again and performance has improved. We're now just waiting on SolarWinds to figure out why the service's resource use is so high.

  • Any update on this? I've only had the issue since upgrading to NPM 12, but it is regularly bringing the whole monitoring solution down completely. Information Service v3 doesn't appear to have any unusual errors.

  • In my case it turned out to be a corrupted IPAM view. I had bad data in my database as well. So during a change control window I was able to do some maintenance. I called SolarWinds and they helped me identify the bad view. Once all this was removed, I ran a reindex of my database and my problem was fixed.

    I still do have a bit of usage on Information Service v3, but it turns out the use is legit and not caused by corruption or anything like that. I would say give SolarWinds a call and let them know you are having this issue. They could help you pinpoint the problem.

  • Thanks for your response, I’ll raise it with SolarWinds.

    Regards

    Julian