Fire-fighting mode can be stressful for DBAs when co-workers and managers are breathing down their necks over application slow-downs and/or outages. Logic says something changed, but what? In a worst-case scenario, the database instance itself looks fine, nothing changed within the database, and the SQL being executed was running fine before. Of course, the SysAdmin says nothing is wrong with the physical server or storage, which makes it even more puzzling. Hmm, could you be running in a virtual machine (VM)? Is your VM resource starved and competing with other VMs?
According to Gartner’s Market Guide for Server Virtualization, “Hypervisor-based server virtualization is now mature, with 80% to 90% of server workloads running in a virtual machine (VM) for most midsize to large enterprises.” Additionally, anecdotal evidence states 70% of all databases are virtualized. In fact, here at SolarWinds, 50% of our database instances run in a VM. For all the benefits of virtualization like cost savings and ease of migrating workloads, the abstraction of the virtual layer from the physical hardware can introduce some challenges.
And let’s not forget the elephant in the room, snapshots. Many DBAs I’ve talked to are at a loss as to why SysAdmins and IT ops perform snapshots of their database instance VMs, which in turn can cause performance issues, especially if a memory snapshot is invoked which renders the VM inactive while the memory is written to disk. Database backups are best left to DBAs who ensure referential integrity is maintained to recover a database.
Which Metrics Matter?
If you find yourself running your database instances in a VMware VM, what do you need to look for to see if the VM your database is running in has problems? There are many metrics available, so let’s review the usual suspects.
This metric, CPU ready time, indicates the VM (and the database trying to run inside it) was ready to run but instead sat idle, waiting behind other VMs contending for the same shared resources, such as physical CPUs or memory.
For example, a vSphere host has six physical CPUs, and two VMs are configured to each require four virtual CPUs (vCPUs) before they can run. This situation means only one VM can run at a time. You can eliminate the VMs queueing behind each other by either moving a VM to another host or configuring both VMs to require three or fewer virtual CPUs.
The term “oversubscription” simply means you’ve assigned more virtual resources than there are physical resources to run all VMs concurrently. It may seem a bit strange, but reducing the number of vCPUs assigned to a VM may dramatically increase that VM’s performance. Generally, oversubscription should not go above 5%. With the SolarWinds® Database Performance Analyzer (DPA) VM Option, an easy way to see how many physical CPUs your host server has is to view the Host tab on the VM CONFIG page.
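The six-CPU example above can be checked with a few lines of arithmetic. The following is a minimal sketch, not how DPA or vSphere computes anything: the `VM` class and both function names are hypothetical, and defining the oversubscription percentage as excess vCPUs relative to physical CPUs is our assumption, since the text does not spell out a formula.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class VM:
    """Hypothetical record: a VM name and the vCPUs it is configured with."""
    name: str
    vcpus: int


def oversubscription_pct(physical_cpus: int, vms: Sequence[VM]) -> float:
    """Percent by which assigned vCPUs exceed physical CPUs (0 if they fit).

    Assumed definition: (total vCPUs - physical CPUs) / physical CPUs * 100.
    """
    total_vcpus = sum(vm.vcpus for vm in vms)
    excess = max(0, total_vcpus - physical_cpus)
    return 100.0 * excess / physical_cpus


def can_run_concurrently(physical_cpus: int, vms: Sequence[VM]) -> bool:
    """True if every VM can hold all of its vCPUs at once (simplified model)."""
    return sum(vm.vcpus for vm in vms) <= physical_cpus


# The example from the text: six physical CPUs, two VMs with four vCPUs each.
before = [VM("db-vm", 4), VM("app-vm", 4)]
# After the suggested fix: both VMs reduced to three vCPUs.
after = [VM("db-vm", 3), VM("app-vm", 3)]

print(can_run_concurrently(6, before), round(oversubscription_pct(6, before), 2))
print(can_run_concurrently(6, after), round(oversubscription_pct(6, after), 2))
```

With four vCPUs each, the two VMs ask for eight CPUs on a six-CPU host and must queue behind each other. At three vCPUs each they fit exactly and can run side by side.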
VM CPU Usage
Actively used CPU as a percent of total available virtual CPU in the virtual machine.
Host CPU Usage
Actively used CPU as a percent of total available CPU on the machine. If this number is high you might see VMs with high CPU ready and/or co-stop.
Active CPU is approximately equal to the ratio of the used CPU to the available CPU, where available CPU = number of physical CPUs × clock rate.
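To make the ratio concrete, here is a small sketch of the calculation. The function names, the use of MHz as the unit, and the sample numbers are illustrative assumptions and are not taken from DPA or vSphere.

```python
def available_cpu_mhz(physical_cpus: int, clock_rate_mhz: float) -> float:
    """Available CPU = number of physical CPUs x clock rate."""
    return physical_cpus * clock_rate_mhz


def host_cpu_usage_pct(used_mhz: float, physical_cpus: int,
                       clock_rate_mhz: float) -> float:
    """Used CPU as a percent of the available CPU on the host."""
    return 100.0 * used_mhz / available_cpu_mhz(physical_cpus, clock_rate_mhz)


# Hypothetical host: 6 physical CPUs at 2,600 MHz, with 11,700 MHz in use.
print(available_cpu_mhz(6, 2600))          # 15600
print(host_cpu_usage_pct(11700, 6, 2600))  # 75.0
```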
When your database instance is running in a VM, with the VM Option, DPA automatically expands the data in the CPU tab to include this information along with other VM specific metrics.
The time a VM waits for a vCPU is due to scheduling, that is, a lack of resources. Your VM can be waiting on physical CPU resources in use by other VMs. If you see high Host CPU Usage, this is probably a sign there are too many VMs on this host and/or you need more physical CPU resources.
VM Memory Swap Rate
Nonzero “swap in” and “swap out” rates generally mean you have a shortage of physical memory on the host, so memory is being swapped out to disk and back in from it.
VM Active Memory Usage
This is the memory in use as a percent of the memory configured for the VM.
Host Memory Usage
This is the memory usage on the host (consumed memory / total machine memory). If this is high (e.g., greater than 90%), it could indicate host memory over-commit, which could lead to high VM swap rates.
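As a quick illustration of the ratio and the 90% rule of thumb, consider the sketch below. The function names and sample figures are hypothetical. The threshold is the example value from the text, not a hard limit.

```python
def host_memory_usage_pct(consumed_gb: float, total_gb: float) -> float:
    """Consumed memory as a percent of total machine memory."""
    return 100.0 * consumed_gb / total_gb


def possible_overcommit(consumed_gb: float, total_gb: float,
                        threshold_pct: float = 90.0) -> bool:
    """True when usage exceeds the threshold (90% is the example value above)."""
    return host_memory_usage_pct(consumed_gb, total_gb) > threshold_pct


# Hypothetical host with 256 GB of physical memory.
print(host_memory_usage_pct(240, 256), possible_overcommit(240, 256))  # 93.75 True
print(host_memory_usage_pct(128, 256), possible_overcommit(128, 256))  # 50.0 False
```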
VM Memory Overhead
This is simply the amount of memory used to run the VM. Over-configuring memory (or excess vCPUs, for that matter) will unnecessarily increase overhead. That said, some overhead is unavoidable: memory is needed by ESXi itself and by the virtual machine (for example, the virtual machine frame buffer).
VM Memory Balloon
The balloon driver reclaims pages on the server considered less valuable. The crux of this VMware proprietary technique is to match the behavior of a guest OS. You should only see this when the host is running low or out of physical memory.
If you see that the virtual machine your database instance is running in has a certain percent of its memory claimed by the balloon driver, look for memory swapping, which could affect your VM’s performance. However, if you don’t see any swapping, ballooning alone doesn’t necessarily mean you have a performance problem.
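The balloon-versus-swap reasoning above amounts to a small decision rule. Here is a minimal sketch of it. The function name, parameter names, and returned labels are all hypothetical, and treating any nonzero swap rate as a warning is a simplifying assumption.

```python
def memory_pressure(balloon_pct: float, swap_in_rate: float,
                    swap_out_rate: float) -> str:
    """Classify VM memory pressure from balloon and swap metrics.

    Swapping is checked first because it is the condition the text ties
    to performance impact; ballooning alone is only a prompt to look closer.
    """
    if swap_in_rate > 0 or swap_out_rate > 0:
        return "swapping: likely performance impact"
    if balloon_pct > 0:
        return "ballooning only: watch for swapping"
    return "ok"


print(memory_pressure(balloon_pct=12.0, swap_in_rate=0.0, swap_out_rate=0.0))
print(memory_pressure(balloon_pct=12.0, swap_in_rate=35.0, swap_out_rate=0.0))
print(memory_pressure(balloon_pct=0.0, swap_in_rate=0.0, swap_out_rate=0.0))
```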
VM Disk Commands
The number of disk commands executed is an indication of how busy the disks are. That said, unless you see large queues developing and commands starting to be aborted, there isn’t a problem.
If you see aborted disk commands, then your storage is severely overloaded, which can lead to serious application response issues.
VM Disk Usage
Available if you aren’t using an NFS datastore, this metric shows the average disk I/O rates across all virtual disks on the VM.
VM Read / Write Rates
VM disk read rate is the average amount of data read from the disk each second during the collection interval. For a VM, this is the rate at which data is read from each virtual disk to the virtual machine.
VM disk write rate is the average amount of data written to disk each second during the collection interval. For a VM, this is simply the rate at which data is written to each virtual disk on the VM.
Host Disk Device Read / Write Rates
The host disk read-and-write rate is the average read/write rate across all disks/LUNs on the host. The rate represents the read/write throughput at the host level across all disks/LUNs and VMs running on the host.
If the database instance has I/O performance issues, you may have another VM on the same host causing the delays. Compare this metric to the physical I/O rate from the database instance. If the host rate is noticeably higher, then it’s likely another VM is the problem. Otherwise, the VM your instance is running in may be placing too much demand on the underlying physical storage.
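The comparison described above can be sketched as a simple check. This is illustrative only: the function name, the labels, and the 10% `margin` (used so that small measurement differences don’t count as “higher”) are assumptions rather than anything defined by DPA.

```python
def likely_io_culprit(host_rate_mbps: float, instance_rate_mbps: float,
                      margin: float = 0.10) -> str:
    """Compare host-level disk throughput with the instance's physical I/O rate.

    If the host rate exceeds the instance rate by more than `margin`,
    other VMs on the host are generating significant I/O of their own.
    """
    if host_rate_mbps > instance_rate_mbps * (1.0 + margin):
        return "other_vm"
    return "this_vm"


# Hypothetical readings taken over the same interval, in MB/s.
print(likely_io_culprit(host_rate_mbps=400.0, instance_rate_mbps=120.0))  # other_vm
print(likely_io_culprit(host_rate_mbps=130.0, instance_rate_mbps=125.0))  # this_vm
```

Both rates need to cover the same time window and use the same units for the comparison to mean anything.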
Host Max Disk Latency
This is the highest latency value across all disks used by this host.
Host Disk Latency
Read latency is the average amount of time the host takes to process a read command issued to a disk (across all VMs). High disk latency indicates storage may be slow or overloaded.
Write latency is the equivalent for writes: the average amount of time the host takes to process a write command issued to a disk (across all VMs).
Expected disk latencies will depend on the nature of the workload, such as read/write mix, randomness, and I/O size, along with the capability of the storage subsystem.
In addition to these metrics being found in DPA, you can execute the “esxtop” command from your VMware ESXi host or look at various utilization metrics from the VMware ESXi console. SolarWinds Virtualization Manager also reports on all of these metrics and more in a friendlier format with both historical and real-time data.
Sample Nightmare Scenario Avoided
As I mentioned when I started off, a nightmare scenario could be when everything associated with the database instance seems fine—nothing changed. Since we’ve covered the essential VM metrics you should be monitoring, let’s walk through a hard-to-find problem for a database instance running in a VM using SolarWinds Database Performance Analyzer (DPA) with the VM Option. In the 2019.4 release of DPA, we expanded the VM option to go beyond the basic resource metrics to include additional HOST metrics and to make note of events, as seen in the DPA CPU tab in RESOURCES.
* Example of event logging in DPA 2019.4
Let’s walk through our sample “nightmare” scenario.
Problem ticket open for poor application performance response time
Users complained that on the morning of Monday, December 2, “around 8 a.m.,” they experienced abnormally long wait times.
No outages were recorded from the IT Ops group
You go to DPA to look at the Database instance supporting the application
You notice a longer than normal wait occurrence on December 2, and the machine learning anomaly detection flags this time as a critical wait time delta from what is normally expected at this time of day.
You then look at the tab ADVISORS for additional data for this day.
As it turns out, a specific query accounts for the top amount of execution time.
You select this query to find out more about it and what occurred at the time. On the QUERY DETAIL page, you see the longest wait time was for memory/CPU, so you click on the green bar for memory/CPU to explore further by drilling down to the hour.
Once you get down to the hourly view, you see a noticeable spike in wait time in the morning hours when the application response time issue was occurring.
As you scroll down the page to the end where VM metrics are shown, you see the new co-stop metric where there’s a corresponding spike. By hovering over the annotation dots, you see during this time the VM was being moved via vMotion from one host to another.
Just as with snapshots, vMotion events can have a negative impact on the performance of the VM the database instance is running in. Without visibility into the virtualized infrastructure, it can be time consuming to find the culprit of poor performance. With DPA, you can easily line up all of your resources for a specific time to pinpoint the problem.
With VMware’s 500,000 customers and tens of millions of VMs, virtualization is here to stay. Since many on-premises-to-cloud database migrations involve virtualization (e.g., Azure VMs), many of the same challenges found on-premises will exist in IaaS environments. DBAs don’t have to be virtual admins, but they do need to be aware of the environment their database instances run in and the impact those environments have on database performance.
That said, I’ve discovered many DPA customers have no idea there’s a purpose-built option for VMware that can be added to the product. It’s easy to see if you have the option by looking for the VIRTUALIZATION tab on the home page.
* This all-in-one view lets you line up all your resources in a single view to look for problems on a specific date and time.
Our goal at SolarWinds is to listen to our customers which is why we’ve enhanced the VM option for DPA. If you are a DPA customer, be sure to utilize our THWACK® feature request page to request and vote on feature enhancements.
Gartner, Market Guide for Server Virtualization, published 24 April 2019, ID G00350674.