11 Replies Latest reply on Jan 3, 2017 8:59 AM by lcsw2013

    Possible bug?

    lcsw2013

      Hi folks,

       

      I've been having the worst time trying to resolve my Information Service v3 issues. First, a .NET Framework issue was identified and resolved. Second, a broken IPAM view was fixed. Third, an issue with NetFlow was resolved. Between my own troubleshooting and troubleshooting from support, the service has been uninstalled and reinstalled more than 20 times. One by one, the error messages cleared from the SWIS v3 log, but one issue remains: over the course of a week, the service's CPU and memory use slowly creeps up and is never released. The log isn't huge like it once was. We went from several log files created within a few hours to just one log file over the span of a week, and it isn't big at all.

       

      No errors are being logged any longer, only a few warnings, and those don't seem to affect performance. But there does appear to be a resource leak that takes up all the resources of my server.

       

      Initially we had a build spec'd at 8 cores at 2.4 GHz. Now we have 24 cores at 2.4 GHz, and every core sits at about 50% use, for a total of about 50% of the entire CPU. When I check processes, the only one taking any CPU is Information Service V3, at between 40 and 60 percent CPU and 1 to 2 GB of memory.

       

      Could this be a resource leak in the code that SolarWinds has not caught? Has anyone noticed Information Service V3 using this amount of resources or more in your environment?

       

      Thanks!

        • Re: Possible bug?
          John Handberg

          Are you seeing the same or similar thing as in this thread?  SolarWinds.InformationService.ServiceV3.exe - Railing a single processor?

            • Re: Possible bug?
              lcsw2013

              No, my issue is different. In my environment, Information Service v3 slowly creeps up in CPU and memory usage, then stays at high usage and never releases it. I don't see anything in the logs about the service restarting or behaving strangely. It stays up and it functions, but it kills my resources.

               

              We didn't see this with 11.5.3; it started after the NPM 12 upgrade. Initially we had 2008 R2 SP1 running on 8 vCPUs. After the upgrade, our CPUs would peg to the point where we couldn't even open the VM. So we upgraded the VM: we gave it dedicated storage on RAID 10 to rule out hard drive bottlenecks and increased the vCPU count to 12. No luck. We went from pegging 8 cores to staying at 90% on 12 cores.

               

              We figured it might have been an issue on the host, since we were on an older, saturated host. So we built a brand-new VM on a newer, less-used host, and we went crazy on specs: 24 vCPUs, 2012 R2 for the OS, and 48 GB of RAM, whereas before we had 12.

               

              During our investigation we found some broken views, among other problems that contributed to the issue, and those have since been fixed. But even now, with no errors in the logs and a very powerful VM, Information Service v3 still takes a steady 50% CPU and about a gig of RAM. And this is 50% across all 24 cores, thanks to the better threading and handling in Windows 2012 R2 Datacenter edition.

               

              This is why I'm thinking it could be a software resource leak somewhere. But I wanted to open this up to see whether this is only in my environment or whether other people are seeing it in theirs too.

            • Re: Possible bug?
              lcsw2013

              screenshot.JPG

               

              The above is only Information Service v3, with everything else accounting for around 3 to 4 percent. As you can see, this shouldn't be the case on such a powerful VM. One service killing a machine like this just doesn't seem right to me.

                • Re: Possible bug?
                  designerfx

                  Your issue sounds like possible VM overprovisioning. How many cores do the physical processors hosting your VM have? If you have provisioned as many vCPUs as the physical host has cores, or more, you're going to have I/O issues that can end up spiking CPU use substantially.

                    • Re: Possible bug?
                      lcsw2013

                      Initially we had 2 sockets and 8 cores: 2 CPUs with 4 cores each, for a total of 8 cores. Performance in this configuration was unusable, and CPU was pegged at 100% nearly all the time. We spent more time rebooting the VM just to gain console access than doing anything else.

                       

                      Then we switched to 4 sockets by 3 cores, for a total of 12 cores. With this setup we were at 90% CPU and still had no stability.

                       

                      Now we have switched to 6 sockets by 4 cores, for a total of 24 cores, and we finally have enough headroom to run the environment without reboots.

                       

                      I'm not convinced that it's a VM setup issue. I've been through changes time and again, and we've tried many different setups, all giving the exact same result.

                       

                      And on our current system, if I stop Information Service v3, I have a max of 3% CPU. When I turn the service back on, I go to 60%. It's a service problem, not a hardware problem.

                       

                      Again, I'm just trying to understand whether any other environment out there has seen this or whether it's specific to mine. Knowing that would greatly help guide me towards a resolution.

                       

                      I've been researching these issues for over a year now, and it's been a difficult ride. I'm hoping I can get some help.

                  • Re: Possible bug?
                    lcsw2013

                    I ended up calling SolarWinds about this. They took a look and agreed that this isn't right. They have gathered logs and are investigating, and they are considering filing this internally as a bug and sending it to the dev team. I'll post an update when I hear back from them.

                    • Re: Possible bug?
                      lcsw2013

                      The issue turned out to be local. Long ago, an IPAM view had been set to look up UDT information while the previous administrator was testing UDT. UDT was never licensed, but the view was never modified. Working with SolarWinds, we found the view and removed the option. This fixed the view and cleared the majority of the errors in Information Service v3. The other issues were also local and have been fixed. Basically, all errors were tracked down and fixed, and the logs are practically error-free now.

                       

                      There are a few warnings here and there and everything seems fine, but CPU and RAM use by the service is still awfully high for the strong system we have deployed. SolarWinds has taken the logs and a few process dumps and sent them to their dev team.

                       

                      Because stability was restored, the site has not gone down again and performance has improved. We're now just waiting on SolarWinds to figure out why the service's resource use is so high.

                      • Re: Possible bug?
                        central.services@emishealth.com

                        Any update on this? I've only had the issue since upgrading to NPM 12, but it is regularly bringing the whole monitoring solution down completely. Information Service v3 doesn't appear to show any unusual errors.

                          • Re: Possible bug?
                            lcsw2013

                            In my case it turned out to be a corrupted IPAM view. I had bad data in my database as well. During a change control I was able to do some maintenance: I called SolarWinds and they helped me identify the bad view. Once all of this was removed, I ran a reindex of my database and my problem was fixed.

                             

                            I still do have a bit of usage on Information Service v3, but it turns out the usage is legitimate and not caused by corruption or anything like that. I would say give SolarWinds a call and let them know you are having this issue. They can help you pinpoint the problem.

                          • Re: Possible bug?
                            lcsw2013

                            Wanted to provide a general answer.

                             

                            After hearing back from DEV, here is the response I got from SolarWinds:

                             

                            ****

                            I have an update from our development team. What they've found is that the memory dump shows a lot of threads stuck waiting for GC. What we want to try is the following:

                             

                             

                            Add this key to SolarWinds.InformationService.ServiceV3.exe.config inside <runtime> section:

                            <gcServer enabled="true" />

                             

                             

                            C:\Program Files (x86)\Common Files\SolarWinds\InformationService\V3 is where you will find the config file.

                            ****
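
                            For anyone unsure where the element goes, here is a minimal sketch of the placement, assuming the usual .NET Framework application config layout. Only the <gcServer> line comes from the SolarWinds response above; the surrounding skeleton is illustrative, and the real file will already contain other settings that should be left as they are.

                            ```xml
                            <configuration>
                              <!-- other existing sections stay unchanged -->
                              <runtime>
                                <!-- existing runtime settings, if any, stay unchanged -->
                                <!-- the one line from the SolarWinds response: enable server GC -->
                                <gcServer enabled="true" />
                              </runtime>
                            </configuration>
                            ```

                            As far as I understand it, gcServer switches the .NET runtime from workstation to server garbage collection, which gives each logical CPU its own heap and GC thread. The setting is read when the process starts, so the service has to be restarted before it takes effect, and backing up the config file first is a sensible precaution.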

                             

                            This alone provided a huge boost in SWIS v3 performance. However, I was still seeing high memory and CPU use, so I investigated further and found that I had two broken IPAM views (errors appeared in the logs). It took me time to pinpoint the views; once found, the issue was fixed by unchecking the reference to UDT, since we do not have UDT installed in our environment. This cleared the memory issues, but CPU still remained awfully high for our environment. After some time poring through logs and reading different thwack articles, I saw that I had a large amount of corrupted data in my database that had accumulated over time. Cleaning it up was time-consuming, but I cleared a good amount of bad data and fixed a lot of other things, which pretty much cleared the errors from several logs. I also found that McAfee was causing problems; I spoke with our security team and spent a long call with McAfee support, who were able to provide some exceptions in the system that fixed the problem. I also found bad credentials, etc.

                             

                            The point here is that there wasn't one single thing causing problems. It was a combination of several problems that made Information Service v3 highly unstable.

                             

                            Today the process uses about 60% CPU and a normal amount of memory. From monitoring it, though, this comes from normal operations in our environment. I still have way too many problems, so I've decided that a redeployment will be best at this point, and I will be setting up a new deployment for our environment soon.

                             

                            I hope this information helps anyone out there.
