SolarWinds VMAN Configuration Polling

We have an issue I'm trying to solve. Every 12 hours (at midnight and noon) CPU on our ESX hosts spikes for about 15 minutes. We've figured out that VMware VMAN configuration polling is causing it: when we shut that off, the spikes did not recur, and the database job for it is set up to run every 12 hours, which matches the timing.

My question is: why does this happen, and how do we keep it from affecting the environment like this? As you can imagine, a spike like that across all our VM hosts causes issues and confusion. It's simply not healthy.

So, my questions are these. Is there anything we can do to prevent the CPU spikes, such as:
1. Can this be scheduled to run every hour instead, so that each run maybe pulls a smaller amount of data, or will it make the same large hit against the ESX hosts, just more often?
2. Is there a way to schedule this to run at, say, 7am and 7pm instead of midnight and noon?
3. What is this doing that causes such a large spike, and can SolarWinds change the app so it is less of a hit on the system?
4. Any other thoughts?

Thanks in advance.

  • Seen this many times. 

    Is there a reason that you are querying the ESXi hosts directly instead of polling your vCenter? It doesn't exactly solve the problem, it mostly moves it, but high CPU consumption on the vCenter carries a lot less risk than having it happen on each ESXi host.

    1 Running more often does not reduce the impact of the requests. It just means you hit the API more often, so the CPU gets taxed more often.

    2 No, there is no built-in mechanism to control the scheduling directly. The fact that yours happens at midnight and noon is probably just coincidence; I've seen it happen at all kinds of hours, and you don't really get to pick. Maybe it starts counting after a server reboot or something similar, but it would be very clunky to try to actively manage that.

    3 The simple explanation for why it happens is that the config poll sends a whole series of requests to the vCenter/ESXi about every existing object and its relationships. Depending on how many VMs, datastores, disks, snapshots and everything else you have, responding to all those queries can be a bit of a burden.

    4 I've run VMAN in environments with tens of thousands of VMs where we wrestled with this, but in the end we just had to give the vCenter a lot of resources so it could respond to the queries in a timely way, and our Orion pollers had a lot of CPU and RAM as well. It's definitely not fun to leave the vCenter maxed out for long periods, but throwing more CPU at it let it work through everything faster: instead of being pegged for an hour, the vCenter was still pegged, but only for a few minutes at a time.
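
To put rough numbers on answers 1 and 2, here is a minimal Python sketch. It assumes every config poll costs the same fixed spike (the 15 minutes reported in the question) no matter how recently the last one ran, and that the schedule simply counts forward from the first poll. Both are assumptions taken from this thread, not confirmed SolarWinds behaviour, and the function names are made up for illustration.

```python
from datetime import datetime, timedelta


def daily_spike_minutes(interval_hours: float, spike_minutes: float = 15.0) -> float:
    """Minutes per day spent in a spike, assuming each config poll is a
    full inventory walk that costs the same however often it runs."""
    polls_per_day = 24.0 / interval_hours
    return polls_per_day * spike_minutes


def poll_times(first_poll: datetime, interval_hours: float, count: int) -> list[datetime]:
    """Poll start times if the timer just counts forward from the first poll
    (hypothetical scheduling model, e.g. counting from a service restart)."""
    step = timedelta(hours=interval_hours)
    return [first_poll + i * step for i in range(count)]


if __name__ == "__main__":
    # Shorter intervals multiply the total time spent spiking.
    for hours in (12, 6, 1):
        print(f"every {hours:>2} h -> {daily_spike_minutes(hours):.0f} spike-minutes/day")

    # Under the counting-forward assumption, a 07:00 start gives 07:00 / 19:00 polls.
    for t in poll_times(datetime(2024, 1, 1, 7, 0), 12, 3):
        print(t.strftime("%Y-%m-%d %H:%M"))
```

Under those assumptions, hourly polling would go from 30 to 360 spike-minutes per day, which is why polling more often makes things worse rather than better.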

  • We are actually polling the vCenter, not the ESX hosts, but the configuration polling of the vCenter spikes the ESX hosts as well. I'm guessing it's gathering stats, etc. Thanks a ton for the information and feedback!
