Dropping into an SSH session and running esxtop on an ESXi host can be a daunting task! With well over 300 metrics available, esxtop can throw numbers and percentages at sysadmins all day long – but without a solid understanding of what they mean, those numbers are of little use for troubleshooting. Below are a handful of metrics that I find useful when analyzing performance issues with esxtop.
Usage (%USED) - CPU is usually not the bottleneck when it comes to performance issues within VMware, but it is still a good idea to keep an eye on the average usage of both the host and the VMs that reside on it. High CPU usage on a VM may indicate that it needs more vCPUs, or be a sign that something has gone awry within the guest OS. Chronically high CPU usage on the host may indicate the need for more resources – either additional cores, or more ESXi hosts within the cluster.
Ready (%RDY) - CPU Ready (%RDY) is a very important metric that is brought up in nearly every blog post dealing with VMware and performance. Put simply, CPU Ready measures the amount of time that a VM is ready to run on a physical CPU but is waiting for the ESXi CPU scheduler to find it the time to do so. Normally this is caused by other VMs competing for the same resources. VMs experiencing a high %RDY will definitely see performance implications; this may indicate the need for more physical cores, or can sometimes be solved by removing unneeded vCPUs from VMs that do not require more than one.
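To put a number on it: vCenter exposes CPU Ready as a summation in milliseconds per sampling interval, while esxtop shows a percentage. A minimal sketch of the conversion follows – the 20-second default interval matches vCenter's real-time charts, and the per-vCPU normalization parameter is my own addition for multi-vCPU VMs:

```python
def cpu_ready_percent(ready_ms, interval_s=20.0, num_vcpus=1):
    """Convert a CPU Ready summation (milliseconds accrued during the
    sampling interval) into the percentage figure esxtop displays.

    Pass num_vcpus > 1 to normalize a multi-vCPU VM's aggregate value
    to a per-vCPU figure.
    """
    return ready_ms * 100.0 / (interval_s * 1000.0 * num_vcpus)

# A VM that accrued 2000 ms of ready time in a 20 s interval spent
# 10% of that interval waiting on the CPU scheduler.
print(cpu_ready_percent(2000))               # 10.0
print(cpu_ready_percent(2000, num_vcpus=4))  # 2.5 per vCPU
```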
Co-Stop (%CSTP) - Similar to Ready, Co-Stop measures the amount of time a VM was incurring delay due to the ESXi CPU scheduler – the difference being that Co-Stop only applies to VMs with multiple vCPUs, while %RDY also applies to VMs with a single vCPU. A large number of VMs with high Co-Stop may indicate the need for more physical cores within your ESXi host, too high a consolidation ratio, or quite simply too many multi-vCPU VMs.
Active (%ACTV) - Just as it's a good idea to monitor average CPU usage on both hosts and VMs, the same goes for active memory. Although we cannot necessarily use this metric for right-sizing due to the way it is calculated, it can be used to see which VMs are actively and aggressively touching memory pages.
Swapping (SWR/s, SWW/s, SWTGT, SWCUR) - Memory swapping is a very important metric to watch. Essentially, if we see this metric anywhere above 0 it means we are actively swapping memory pages out to the swap file that is created when the VM powers on. Instead of serving those pages from RAM, we are using much slower disk to do so. If we see swapping occurring we may be in the market for more memory in our physical hosts, or looking to migrate certain VMs to other hosts with free physical RAM.
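The "anything above zero" rule can be captured in a small helper. This is a sketch only: the counter names mirror the esxtop columns, but the input dictionary is a hypothetical shape you would populate from your own batch-mode export, and the VM names are made up:

```python
def swap_alerts(vm_stats):
    """Flag VMs that are actively swapping (SWR/s or SWW/s above zero)
    or that the host intends to swap further (SWTGT above SWCUR).

    vm_stats: {vm_name: {"SWR/s": .., "SWW/s": .., "SWTGT": .., "SWCUR": ..}}
    Returns {vm_name: [reasons]} for VMs that need attention.
    """
    alerts = {}
    for name, s in vm_stats.items():
        reasons = []
        if s.get("SWR/s", 0) > 0 or s.get("SWW/s", 0) > 0:
            reasons.append("actively swapping pages to/from disk")
        if s.get("SWTGT", 0) > s.get("SWCUR", 0):
            reasons.append("swap target above current level - more swapping likely")
        if reasons:
            alerts[name] = reasons
    return alerts

# Hypothetical sample: app01 is writing to swap and has a rising target.
stats = {
    "app01": {"SWR/s": 0.0, "SWW/s": 1.2, "SWTGT": 512, "SWCUR": 256},
    "db01":  {"SWR/s": 0.0, "SWW/s": 0.0, "SWTGT": 0,   "SWCUR": 0},
}
print(swap_alerts(stats))  # only app01 is flagged
```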
Balloon (MCTLSZ, MCTLTGT) - Ballooning isn't necessarily a bad sign in itself, but it can definitely be used as an early warning symptom for swapping. When a value is reported for ballooning it basically means the host cannot satisfy all of its VMs' memory requirements and is reclaiming unused memory back from the guests via the balloon driver. Once the balloon driver has reclaimed all it can, swapping is the next logical step, which can be very detrimental to performance.
Latency (DAVG, GAVG, KAVG, QAVG) - When it comes to monitoring disk I/O, latency is king. Within a virtualized environment there are many different places where latency can be introduced: leaving the VM, passing through the VMkernel and HBA, and traversing the storage array. To help understand total latency we can look at the following metrics.
- KAVG – This is the amount of time the I/O spends within the VMkernel.
- QAVG – This is the amount of time the I/O spends waiting in queue before being issued to the device; QAVG is counted within KAVG.
- DAVG – This is the amount of time the I/O takes to leave the HBA, reach the storage array, and return.
- GAVG – We can think of GAVG (Guest Average) as the total latency seen by applications within the VM. It works out to KAVG + DAVG (with QAVG already counted inside KAVG).
As you might be able to determine, a high QAVG/KAVG can certainly be the result of too small a queue depth on your HBA – that, or your host is simply too busy and VMs need to be migrated elsewhere. A high DAVG (>20 ms) normally indicates an issue with the storage array itself: either it is incorrectly configured or it is too busy to handle the load.
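A sketch tying those rules of thumb together – the >20 ms DAVG line comes from the text above, while the ~2 ms KAVG threshold is my own assumption (a common informal figure, not an official VMware limit):

```python
def diagnose_latency(davg_ms, kavg_ms):
    """Derive guest-visible latency and flag the likely culprit.

    GAVG works out to KAVG + DAVG (QAVG is already counted inside KAVG).
    Returns (gavg_ms, list_of_findings).
    """
    gavg = davg_ms + kavg_ms
    findings = []
    if davg_ms > 20:
        findings.append("DAVG high: storage array misconfigured or overloaded")
    if kavg_ms > 2:  # assumed informal threshold, not an official limit
        findings.append("KAVG high: check HBA queue depth or host load")
    return gavg, findings

gavg, findings = diagnose_latency(davg_ms=25.0, kavg_ms=0.5)
print(gavg)      # 25.5
print(findings)  # points at the storage array
```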
Dropped packets (%DRPTX/%DRPRX) - As far as network performance goes, there are only a couple of metrics we can monitor at the host level. %DRPTX and %DRPRX track the packets dropped on the transmit and receive side respectively. When we see these metrics climb above 1 we may conclude that network utilization is very high, and that we need either more bandwidth out of the host or an investigation somewhere along the path the packets are taking.
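If you are working from raw interface counters rather than esxtop's percentage columns, the figure is a simple ratio. A minimal sketch – the ~1% alarm line is the informal threshold from the text:

```python
def drop_pct(dropped, total):
    """Dropped packets as a percentage of all packets on one side of
    the link (the shape of esxtop's %DRPTX / %DRPRX columns)."""
    return 100.0 * dropped / total if total else 0.0

# 150 drops out of 10,000 packets is 1.5% - above the ~1% line,
# so worth a closer look at bandwidth or the network path.
print(drop_pct(150, 10_000))  # 1.5
```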
As I mentioned earlier, there are over 300 metrics within esxtop – the above are simply the core ones I use when troubleshooting performance. Certainly, a third-party monitoring solution can help you baseline your environment and put these stats to better use by summarizing them in more visually appealing ways. For this week I'd love to hear about some of your real-life situations: When was there a time you noticed a metric was "out of whack", and what did you do to fix it? What are some of your favorite performance metrics to watch, and why? Do you use esxtop, or do you have a favorite third-party solution you like to utilize?
Thanks for reading!