
My Top esxtop Metrics For Performance Monitoring

Level 11

Dropping into an SSH session and running esxtop on an ESXi host can be a daunting task!  With well over 300 metrics available, esxtop can throw numbers and percentages at sysadmins all day long – but without a solid understanding of them, they are of little use when troubleshooting issues.  Below are a handful of metrics that I find useful when analyzing performance issues with esxtop.
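
esxtop can also dump every counter to CSV in batch mode, which makes it much easier to hunt through those 300-plus metrics offline. Below is a minimal Python sketch that shortlists the CPU columns discussed in the next section from such a capture; the file name, capture command, and exact counter wording are my assumptions rather than anything from this post, so check them against your own environment.

    import csv

    # A minimal sketch, assuming a batch-mode capture saved locally, e.g.:
    #   esxtop -b -d 5 -n 60 > esxtop.csv
    # The counter wording below is how I recall the perfmon-style header looking;
    # verify it against your own capture, as the exact text can differ by build.
    CSV_PATH = "esxtop.csv"                       # hypothetical file name
    KEYWORDS = ("% Ready", "% Used", "% CoStop")  # counters to shortlist

    with open(CSV_PATH, newline="") as f:
        header = next(csv.reader(f))

    # Each header entry looks roughly like \\host\Group Cpu(1234:vmname)\% Ready,
    # so a simple substring match is enough to find the columns of interest.
    for index, column in enumerate(header):
        if any(keyword in column for keyword in KEYWORDS):
            print(index, column)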

CPU

Usage (%USED) - CPU is usually not the bottleneck when it comes to performance issues within VMware, but it is still a good idea to keep an eye on the average usage of both the host and the VMs that reside on it.  High CPU usage on a VM may indicate a need for more vCPUs, or be a sign that something has gone awry within the guest OS.  Chronic high CPU usage on the host may indicate the need for more resources, in terms of either additional cores or more ESXi hosts within the cluster.

Ready (%RDY) - CPU Ready (%RDY) is a very important metric that is brought up in nearly every blog post dealing with VMware and performance.  Simply put, CPU Ready measures the amount of time that a VM is ready to run on a physical CPU but is waiting for the ESXi CPU scheduler to give it time to do so.  Normally this is caused by other VMs competing for the same resources.  VMs experiencing high %RDY will definitely see performance implications; this may indicate the need for more physical cores, or can sometimes be solved by removing unneeded vCPUs from VMs that do not require more than one.

Co-Stop (%CSTP) - Similar to Ready, Co-Stop measures the amount of time the VM was delayed by the ESXi CPU scheduler – the difference being that Co-Stop only applies to VMs with multiple vCPUs, while %RDY can also apply to VMs with a single vCPU.  A high number of VMs with high Co-Stop may indicate the need for more physical cores within your ESXi host, too high a consolidation ratio, or quite simply, too many multi-vCPU VMs.  A quick per-vCPU sanity check for both counters is sketched below.
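
Both %RDY and %CSTP are reported by esxtop summed across all of a VM's vCPUs, so it helps to normalize them per vCPU before judging them. The sketch below uses hypothetical sample values and commonly cited rule-of-thumb thresholds (roughly 10% ready and 3% co-stop per vCPU); none of these figures come from the post itself.

    # Hypothetical per-VM samples; %RDY and %CSTP are summed across vCPUs in esxtop.
    vms = [
        {"name": "db01",  "vcpus": 8, "rdy": 96.0, "cstp": 30.0},
        {"name": "web01", "vcpus": 2, "rdy": 4.0,  "cstp": 0.2},
    ]

    RDY_PER_VCPU_WARN = 10.0   # rule-of-thumb warning level, not a hard limit
    CSTP_PER_VCPU_WARN = 3.0   # ditto

    for vm in vms:
        rdy_per_vcpu = vm["rdy"] / vm["vcpus"]
        cstp_per_vcpu = vm["cstp"] / vm["vcpus"]
        flags = []
        if rdy_per_vcpu >= RDY_PER_VCPU_WARN:
            flags.append(f"%RDY/vCPU {rdy_per_vcpu:.1f}")
        if cstp_per_vcpu >= CSTP_PER_VCPU_WARN:
            flags.append(f"%CSTP/vCPU {cstp_per_vcpu:.1f}")
        if flags:
            print(f"{vm['name']}: " + ", ".join(flags))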

Memory

Active (%ACTV) - Just as it’s a good idea to monitor the average CPU usage on both hosts and VMs, the same goes for active memory.  Although we cannot necessarily use this metric for right-sizing, due to the way it is calculated, it can be used to see which VMs are actively and aggressively touching memory pages.
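
If you capture esxtop in batch mode (as in the earlier sketch), you can average active memory per VM over the whole capture to see which machines are touching pages most aggressively. The counter and group names matched below are assumptions about the perfmon-style header, so verify them against your own CSV before trusting the output.

    import csv
    from statistics import mean

    CSV_PATH = "esxtop.csv"      # hypothetical batch-mode capture
    COUNTER = "% Active"         # assumed wording of the memory counter in the header

    with open(CSV_PATH, newline="") as f:
        rows = list(csv.reader(f))

    header, samples = rows[0], rows[1:]
    for index, column in enumerate(header):
        # Limit the match to memory groups so unrelated counters are not picked up.
        if COUNTER in column and "Memory" in column:
            values = [float(row[index]) for row in samples if row[index].strip()]
            if values:
                print(f"{column}: average {mean(values):.1f}")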

Swapping (SWR/s, SWW/s, SWTGT, SWCUR) - Memory swapping is a very important metric to watch.  Essentially, if we see this metric anywhere above 0 it means that we are actively swapping memory pages out to the swap file that is created when the VM powers on.  This means that instead of serving memory from RAM, we are using much slower disk to do so.  If we see swapping occurring, we may be in the market for more memory in our physical hosts, or looking to migrate certain VMs to other hosts with free physical RAM.

Balloon (MCTLSZ, MCTLTGT) - Ballooning isn’t necessarily a bad thing in itself, but it can definitely be used as an early warning symptom for swapping.  When a value is reported for ballooning, it basically states that the host cannot satisfy the VMs’ memory requirements and is reclaiming unused memory back from other virtual machines via the balloon driver.  Once the balloon driver has reclaimed all it can, swapping is the next logical step, which can be very detrimental to performance.
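
To tie the swapping and ballooning counters together, a simple escalation check works well: ballooning is the early warning, swap activity is the real alarm. The values below are hypothetical samples plugged in by hand, not something esxtop hands to a script directly.

    # Hypothetical sampled values for one host/VM (MB and MB/s respectively).
    balloon_mb = 512.0       # current balloon size
    swap_read_mbs = 0.0      # SWR/s
    swap_write_mbs = 1.3     # SWW/s

    if swap_read_mbs > 0 or swap_write_mbs > 0:
        print("CRITICAL: actively swapping VM memory to disk - add RAM or migrate VMs")
    elif balloon_mb > 0:
        print("WARNING: ballooning in progress - the host is reclaiming guest memory")
    else:
        print("OK: no balloon or swap activity observed")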

Disk

Latency (DAVG, GAVG, KAVG, QAVG) - When it comes to monitoring disk I/O, latency is king.  Within a virtualized environment, however, there are many different places where latency may occur as an I/O leaves the VM and passes through the VMkernel, the HBA, and on to the storage array.  To help understand total latency we can look at the following metrics.

  • KAVG – This is the amount of time that the I/O spends within the VMkernel
  • QAVG – This is the amount of time that the I/O spends queued before being handed to the HBA driver (queue time is counted as part of KAVG)
  • DAVG – This is the amount of time the I/O takes to leave the HBA, reach the storage array, and return
  • GAVG – We can think of GAVG (Guest Average) as the total latency seen by the guest – effectively DAVG plus KAVG (with QAVG already counted inside KAVG) – essentially the total amount of latency as it pertains to the applications within the VM.

As you might be able to determine, a high QAVG/KAVG can most certainly be the result of too small a queue depth on your HBA – that, or your host is simply too busy and VMs need to be migrated elsewhere.  A high DAVG (>20ms) normally indicates an issue with the storage array itself: either it is incorrectly configured or it is too busy to handle the load.
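
Here is a small worked example of how those counters fit together, using hypothetical latency values and the same rough thresholds mentioned above (the ~2 ms figure for KAVG/QAVG is a commonly used rule of thumb, not a number from this post):

    # Hypothetical sampled latencies in milliseconds.
    davg = 24.5   # device/array latency
    qavg = 1.5    # time spent queued (counted within KAVG)
    kavg = 1.8    # total time in the VMkernel, including QAVG
    gavg = davg + kavg   # roughly what the guest sees

    print(f"GAVG ~ {gavg:.1f} ms")
    if davg > 20:
        print("DAVG above ~20 ms: check the storage array's configuration and load")
    if kavg > 2 or qavg > 2:
        print("KAVG/QAVG elevated: check HBA queue depth or overall host I/O load")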

Network

Dropped packets (DRPTX/DRPRX) - As far as network performance goes, there are only a couple of metrics we can monitor at the host level.  DRPTX and DRPRX track packets dropped on the transmit and receive side respectively.  When we begin to see these counters climb above zero, we can conclude that network utilization is very high and we may need to increase bandwidth out of the host, or possibly somewhere along the path the packets are taking.
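
A trivial check along those lines, using made-up per-uplink samples of the two counters:

    # Hypothetical per-uplink samples of the dropped-packet counters.
    uplinks = {
        "vmnic0": {"DRPTX": 0, "DRPRX": 0},
        "vmnic1": {"DRPTX": 3, "DRPRX": 0},
    }

    for nic, drops in uplinks.items():
        if drops["DRPTX"] > 0 or drops["DRPRX"] > 0:
            print(f"{nic}: drops seen (TX={drops['DRPTX']}, RX={drops['DRPRX']})"
                  " - check utilization along the path")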

As I mentioned earlier, there are over 300 metrics within esxtop – the above are simply the core ones I use when troubleshooting performance.  Certainly, having a third-party monitoring solution can help you baseline your environment and use these stats to your advantage by summarizing them in more visually appealing ways.  For this week I’d love to hear about some of your real-life situations: When was there a time you noticed a metric was “out of whack”, and what did you do to fix it?  What are some of your favorite performance metrics to watch, and why?  Do you use esxtop, or do you have a favorite third-party solution you like to utilize?

Thanks for reading!

10 Comments
MVP

Thanks, gives me some stuff to refer to regarding VM performance.

Level 17

Really great insight, thanks for the read. I'm trying to get our VM folks to let SW peer a little into their environment.

Level 12

Good information, thanks mwpreston. cahunt, our VM group is also very protective of their environment; we can only get basic monitoring with SW.

Level 13

Charles

Virtualization Manager is perfect for insight into hypervisor performance.

Level 13

Scott

Let me know if there are any questions that need answering to alleviate the concerns of the VM group.  Virtualization Manager uses the vSphere API, and WMI for Hyper-V, and needs basic read-only access for general monitoring. Additional permissions will be required if executing management actions from Virtualization Manager.

Great info. You're right, going over 300 metrics to decide which to monitor and measure is a tad bit overwhelming. These are very helpful. Thanks!

Great read! Thanks for the info SomeClown!

I, like most, am not an expert in "all the things", and I really appreciate it when people share their wisdom. It not only allows me to build my own knowledge base, but it can join the dots between things I do know, leading to 'EUREKA' moments.

Thanks again.

Level 21

And that is if you are only using VMware; we use both VMware and Hyper-V, so having the right tools in place, like VMAN, is absolutely necessary to manage these types of environments.

MVP

keep it simple and scalable...

Level 9

Hi

I was asked about DAVG due to some strange storage/ESXi issues we were encountering, and discovered your nice article above. Can I ask, did you manage to get these metrics out? Any chance you have any advice? SSH level, script used? Performance counters to be checked/set up?

Thank you