Are you seeing the same or similar thing as in this thread? SolarWinds.InformationService.ServiceV3.exe - Railing a single processor?
No, my issue is different. In my environment, Information Service v3 slowly creeps up on CPU and memory usage, then stays at a high level and never releases it. I don't see anything in the logs about the service restarting or acting in any weird way. It stays on. It functions, but it just kills my resources.
We didn't see this with 11.5.3. This started after the NPM 12 upgrade. Initially we had 2008 R2 SP1 running on 8 vCPUs. After the upgrade our CPUs would peg out to the point where we couldn't even open the VM. So we upgraded the VM: gave it dedicated storage on RAID 10 to remove possible hard drive bottlenecks and increased the vCPU count to 12. No luck. We went from pegging 8 cores to staying at 90% on 12 cores.
We figured it might have been an issue on the host, since we were on an older, saturated host. So we built a brand new VM on a newer, less used host. And we went crazy on specs: 24 vCPUs, 2012 R2 for the OS, and 48 gigs of RAM, whereas before we had 12.
During our investigation we found some broken views, among other issues, that contributed to the problem; those have since been fixed. But even now, with no errors in the logs and a very powerful VM, Information Service v3 still takes a steady 50% CPU and about a gig of RAM. And this is 50% across all 24 cores, thanks to the better threading and handling in Windows 2012 R2 Datacenter edition.
This is why I'm thinking it could be a software resource leak somewhere. But I wanted to open this up to see whether this is only in my environment or whether other people are seeing it in their environments too.
Your issue sounds like possible VM overprovisioning. How many cores do the actual processors hosting your VM have? If you have the same number of vCPUs provisioned as the physical host has cores, or more, you're going to have IO issues that can end up spiking CPU use substantially.
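To make that check concrete, here is a minimal Python sketch of the comparison. The host core count is hypothetical (the thread never states it) and the function names are made up for illustration; only the 8, 12, and 24 vCPU figures come from the posts above.

```python
def vcpu_ratio(vcpus: int, physical_cores: int) -> float:
    """Return vCPUs provisioned per physical core on the host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return vcpus / physical_cores


def meets_or_exceeds_host(vcpus: int, physical_cores: int) -> bool:
    """True when the VM has as many vCPUs as the host has cores, or more.

    This is the condition described in the post above as likely to cause trouble.
    """
    return vcpus >= physical_cores


if __name__ == "__main__":
    # ASSUMPTION: 16 physical cores is a placeholder, not a figure from the thread.
    hypothetical_host_cores = 16
    for vcpus in (8, 12, 24):  # VM sizes mentioned in the thread
        ratio = vcpu_ratio(vcpus, hypothetical_host_cores)
        flag = meets_or_exceeds_host(vcpus, hypothetical_host_cores)
        print(f"{vcpus} vCPUs on {hypothetical_host_cores} cores: "
              f"ratio {ratio:.2f}, meets/exceeds host: {flag}")
```

Plugging in the real core count of the host would show which of the three VM sizes, if any, crosses that line.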
Initially we had 2 sockets with 4 cores each, for a total of 8 cores. Performance in this configuration was unusable, and CPU was pegged at 100% nearly all the time. We spent more time rebooting the VM than anything else, just to regain console access.
Then we switched to 4 sockets by 3 cores, for a total of 12 cores. With this setup we were at 90% CPU and still had no stability.
And now we have switched to 6 sockets by 4 cores, for a total of 24 cores. With this we have enough headroom to run the environment without reboots.
I'm not convinced that it's a VM setup issue as I've been through changes time and again and we've tried many different setups with all giving the same exact result regardless of VM setup.
And in our current system, if I stop Information Service v3, I have a max of 3% CPU. I turn the service on and I go to 60% use. It's a service problem and not a hardware problem.
Again, I was just trying to understand if this was seen by any other environment out there or if it's just specific to mine. It'll help greatly guide me towards a resolution.
I've been researching these issues for over a year now, and it's been a difficult ride. I'm hoping I can get some help.
I ended up calling SolarWinds on this. They have taken a look and agreed that this isn't right. They have gathered logs and are investigating. They are considering logging this as a bug internally and sending it to the DEV team. I'll update when I have an update from them.
The issue turned out to be local. Long ago, an IPAM view had been set to look up UDT information when the previous administrator was testing UDT. UDT was never licensed, but the view was never modified. With SolarWinds we found the view and removed the option. This fixed the view and cleared the majority of errors in Information Service v3. The other issues were also local and have been fixed. Basically all errors were tracked down and fixed. The logs are practically error free now.
There are a few warnings here and there and everything seems fine, but CPU and RAM use by the service is still awfully high for the strong system we have deployed. SolarWinds has taken the logs and a few proc dumps and has sent them to their DEV team.
Because stability was restored the site has not gone down again and performance improved. We're now just waiting on SolarWinds to figure out why there is a high resource use with the service.
Any update on this? I've only had the issue since upgrading to NPM 12, but it is regularly bringing the whole monitoring solution down completely. Information Service v3 doesn't appear to have any unusual errors.
In my case it turned out to be a corrupted IPAM view. I had bad data in my database as well. During a change control window I was able to do some maintenance. I called SolarWinds and they helped me identify the bad view. Once all of this was removed I ran a reindex of my database and my problem was fixed.
I still have a bit of usage on Information Service v3, but it turns out the usage is legitimate and not caused by corruption or anything like that. I would say give SolarWinds a call and let them know you are having this issue. They could help you pinpoint the problem.
Thanks for your response, I'll raise it with SolarWinds.
Wanted to provide a general answer.
After hearing back from DEV, here is the response I got from SolarWinds:
I have an update from our development team. What they've found is that the memory dump shows a lot of threads stuck waiting for GC. What we want to try is the following:
Add this key to SolarWinds.InformationService.ServiceV3.exe.config inside <runtime> section:
<gcServer enabled="true" />
C:\Program Files (x86)\Common Files\SolarWinds\InformationService\V3 is where you will find the config file.
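For anyone unsure where the key goes, here is a rough Python sketch that checks whether it landed inside the `<runtime>` section. The sample XML is a stripped-down stand-in, not the real SolarWinds config file, and the function name is my own; the only parts taken from the response above are the `<runtime>` section and the `<gcServer enabled="true" />` element.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: simplified stand-ins for a .NET application config file.
# The real ServiceV3 config contains many more sections than this.
SAMPLE_WITH_KEY = """<?xml version="1.0"?>
<configuration>
  <runtime>
    <gcServer enabled="true" />
  </runtime>
</configuration>
"""

SAMPLE_WITHOUT_KEY = """<?xml version="1.0"?>
<configuration>
  <runtime>
  </runtime>
</configuration>
"""


def is_server_gc_enabled(config_xml: str) -> bool:
    """Return True if <configuration>/<runtime>/<gcServer enabled="true"> is present."""
    root = ET.fromstring(config_xml)
    if root.tag != "configuration":
        return False
    gc_server = root.find("runtime/gcServer")
    if gc_server is None:
        return False
    # Compare case-insensitively in case the value was typed as "True".
    return gc_server.get("enabled", "").strip().lower() == "true"


if __name__ == "__main__":
    print("with key:   ", is_server_gc_enabled(SAMPLE_WITH_KEY))
    print("without key:", is_server_gc_enabled(SAMPLE_WITHOUT_KEY))
```

The sketch only reads the XML and changes nothing. The function works on text, so to check the actual file its contents would need to be loaded first.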
This alone provided a huge boost in SWIS v3 performance. However, I was still seeing high memory and CPU use, so I investigated further and found that I had two broken IPAM views (errors appeared in the logs). It took me time to pinpoint the views; once found, the issue was fixed by unchecking the reference to UDT, since we do not have UDT installed in our environment. This cleared the memory issues, but CPU still remained awfully high for our environment.
After some time poring over logs and reading different thwack articles, I was able to see that I had a large amount of corrupted data in my database that had accumulated over time. This was a time-consuming process, but I cleared a good amount of bad data and fixed a lot of other things. This pretty much cleared several logs of errors.
I also found that McAfee was causing problems. I spoke with the security team and was engaged in a long call with McAfee support, who were able to provide some exceptions in the system that worked and fixed the problem. I found bad credentials as well, etc.
Point here is that the issue wasn't one thing causing problems. It was a combination of several problems that caused information service v3 to become highly unstable.
Today the process uses about 60% CPU and a normal amount of memory. From monitoring it, this comes from normal operations for our environment. I still have way too many problems, though, and have decided that at this point a redeployment will be best, so I will be setting up a new deployment for our environment soon.
I hope this information helps anyone out there.