Virtualized Databases, Friend or Foe?

john.maxwell over 4 years ago 9 minute read time

Background

Fire-fighting mode for DBAs can be stressful when they have co-workers and managers breathing down their necks due to application slow-downs and/or outages. Logic says something changed, but what? In a worst-case scenario, the database instance itself looks fine, nothing changed within the database and the SQL being executed was running fine before. Of course, the SysAdmin says nothing is wrong with the physical server or storage which makes it even more questionable. Hmm, could you be running in a virtual machine (VM)? Is your VM resource starved and competing with other VMs?

According to Gartner’s Market Guide for Server Virtualization[1], “Hypervisor-based server virtualization is now mature, with 80% to 90% of server workloads running in a virtual machine (VM) for most midsize to large enterprises.” Additionally, anecdotal evidence states 70% of all databases are virtualized. In fact, here at SolarWinds, 50% of our database instances run in a VM. For all the benefits of virtualization like cost savings and ease of migrating workloads, the abstraction of the virtual layer from the physical hardware can introduce some challenges.

And let’s not forget the elephant in the room, snapshots. Many DBAs I’ve talked to are at a loss as to why SysAdmins and IT ops perform snapshots of their database instance VMs, which in turn can cause performance issues, especially if a memory snapshot is invoked which renders the VM inactive while the memory is written to disk. Database backups are best left to DBAs who ensure referential integrity is maintained to recover a database.

Which Metrics Matter?

If you find yourself running your database instances in a VMware VM, what do you need to look for to see if the VM your database is running in has problems? There are many metrics available, so let’s review the usual suspects.

CPU Ready

This metric indicates the VM (and the database trying to run inside it) was ready to run but instead sat idle waiting behind other VMs contending to control the same shared resources such as physical CPUs or memory.

For example, a vSphere host has six physical CPUs, and two VMs are configured to each require four virtual CPUs (vCPUs) before they can run. This situation means only one VM can run at a time. You can eliminate the VMs queueing behind each other by either moving a VM to another host or configuring both VMs to require three or fewer virtual CPUs.

- The term “oversubscription” simply means you’ve assigned more virtual resources than what physical resources exist to run all VMs concurrently. It may seem a bit strange but reducing the number of vCPUs may dramatically increase its performance. Generally, oversubscription should not go above 5%.With the SolarWindsDatabase Performance Analyzer (DPA) VM Option, an easy way to see how many physical CPUs your host server has is to view the Host tab on the VM CONFIG page.

VM CPU Usage

Actively used CPU as a percent of total available virtual CPU in the virtual machine.

Host CPU Usage

Actively used CPU as a percent of total available CPU on the machine. If this number is high you might see VMs with high CPU ready and/or co-stop.
- Active CPU is approximately equal to the ratio of the used CPU to the available CPU where: Available CPU = # of physical CPUs x clock rate.
- When your database instance is running in a VM, with the VM Option, DPA automatically expands the data in the CPU tab to include this information along with other VM specific metrics.

Co-Stop

The time a VM waits for a vCPU is due to scheduling (lack of resources). So basically, your VM can be waiting on physical CPU resources in use by other VMs. If you see high Host CPU Usage this is probably a sign there are too many VMs on this host and/or you need more physical CPU resources.

VM Memory Swap Rate

The “swap in” and “swap out” rates generally mean you have a shortage of physical memory on the host, so the memory is swapped out and in from disk.

VM Active Memory Usage

This is the memory in use as a percent of the memory configured for the VM.

Host Memory Usage

This is the memory usage on the host (consumed memory / total machine memory). If this is high (e.g., GT 90%) this could indicate host memory over-commit which could lead to high VM swap rates.

VM Memory Overhead

This is simply the amount of memory used to run the VM. Over-configuring memory (or excess vCPU for that matter) will unnecessarily increase overhead. That said, there’s memory needed by ESXi itself and the virtual machine (virtual machine frame buffer).

VM Memory Balloon

The balloon driver reclaims pages on the server considered less valuable. The crux of this VMware proprietary technique is to match the behavior of a guest OS. You should only see this when the host is running low or out of physical memory.
If you see the virtual machine your database instance is running in has a certain percent of memory claimed by the balloon driver, look for memory swapping which could affect your VM’s performance. However, if you don’t see any swapping issues you don’t and won’t necessarily have a performance problem.

VM Disk Commands

Number of disk commands executed is an indication of how busy the disks are. That said, unless you see large queues developing and commands start to be aborted there isn’t a problem.
If you see aborted disk commands, then your storage is severely overloaded and can lead to serious application response issues.

VM Disk Usage

Available if you aren’t using a NFS datastore, it will show the average disk I/O rates across all virtual disks on the VM.

VM Read / Write Rates

VM disk read rate is the average amount of data read from the disk each second during the collection interval. For a VM, this is the rate at which data is read from each virtual disk to the virtual machine.
VM disk write rate is the average amount of data written to disk each second during the collection interval—simply the rate data is written to each virtual disk on the VM.

Host Disk Device Read / Write Rates

The host disk read-and-write rate is the average read/write rate across all disks/LUNs on the host. The rate represents the read/write throughput at the host level across all disks/LUNs and VMs running on the host.
- If the database instance has I/O performance issues, you may have another VM on the same host causing the delays. Compare this metric to the physical I/O rate from the database instance. If the Host rate is higher, then it’s likely another VM is the problem. Otherwise, the VM your instance is running in may be causing too much of a demand on the underlying physical storage.

Host Max Disk Latency

This is the highest latency value across all disks used by this host.

Host Disk Latency

Read latency is the average amount of time to process a read command to a disk to the host (across all VMs). High disk latency indicates storage may be slow or overloaded.
Write latency is similar to read and is the average amount of time to process a write command from the specific disk across all VMs.
- Disk Write Latency = Kernel Write Latency + Device Write Latency
Expected disk latencies will depend on the nature of the storage like read/write mix, randomness and I/O size along with the capability of the storage subsystem.

In addition to these metrics being found in DPA, you can execute the “esxtop” command from your VMware ESXi host or look at various utilization metrics from the VMware ESXi console. SolarWinds Virtualization Manager also reports on all of these metrics and more in a friendlier format with both historical and real-time data.

esxtop screen capture.png ESXi Screen 2.PNG

Sample Nightmare Scenario Avoided

As I mentioned when I started off, a nightmare scenario could be when everything associated with the database instance seems fine—nothing changed. Since we’ve covered the essential VM metrics you should be monitoring, let’s walk through a hard-to-find problem for a database instance running in a VM using SolarWinds Database Performance Analyzer (DPA) with the VM Option. In the 2019.4 release of DPA, we expanded the VM option to go beyond the basic resource metrics to include additional HOST metrics and to make note of events, as seen in the DPA CPU tab in RESOURCES.

* Example of event logging in DPA 2019.4

Let’s walk through our sample “nightmare” scenario.

Problem ticket open for poor application performance response time
- Users complained the morning of Monday, December 2 “around 8 a.m.” they experienced abnormally long wait times.

No outages were recorded from the IT Ops group

You go to DPA to look at the Database instance supporting the application
- You notice a longer than normal wait occurrence on December 2, and the machine learning anomaly detection flags this time as a critical wait time delta from what is normally expected at this time of day.

You then look at the tab ADVISORS for additional data for this day.
1. As it turns out, a specific query accounts for the top amount of execution time.
You select this query to find out more about it and what occurred at the time. From the QUERY DETAIL page, you see the longest wait time was for memory/CPU from which you click on the green bar for memory/CPU to explore further by going down to the hour.

Once you get down to the hourly view, you see a noticeable spike in wait time in the morning hours when the application response time issue was occurring.
As you scroll down the page to the end where VM metrics are shown, you see the new co-stop metric where there’s a corresponding spike. By hovering over the annotation dots, you see during this time the VM was being moved via vMotion from one host to another.

Just as with snapshots, vMotion events can have a negative impact on the performance of the VM the database instance is running in. Without visibility into the virtualized infrastructure, it can be time consuming to find the culprit of poor performance.
With DPA, you can easily line up all of resources for a specific time to pinpoint the problem as seen below.

Summary

With VMware’s 500,000 customers and tens of millions of VMs, virtualization is here to stay. Since many database on-premises to cloud migrations involve virtualization, e.g., Azure VM, many of the same challenges existing on-premises will exist in IaaS environments. DBA’s don’t have to be virtual admins, but they do need to be aware of the environment their database instances run in and the impact those environments have on database performance.

That said, I’ve discovered many DPA customers have no idea there’s a purpose-built option for VMware that can be added to the product. It’s easy to see if you have the option by looking for the VIRTUALIZATION tab on the home page.

* This all-in-one view lets you line up all your resources in a single view to look for problems on a specific date and time.

Our goal at SolarWinds is to listen to our customers which is why we’ve enhanced the VM option for DPA. If you are a DPA customer, be sure to utilize our THWACK feature request page to request and vote on feature enhancements.

[1] Gartner Market Guide for Server Virtualization, Published 24 April 2019, ID G00350674

Top Comments

df112 over 4 years ago +1

I've had ok experiences with test workloads on VMs. Production workloads have been much more problematic. Latency, weird glitches, unexplained and intermittent slowness have been the order of the day…
ecklerwr1 over 4 years ago +1

I've also noticed with hosts... I always run out of memory before I run out of cores. Fewer fast cores and as much memory as I can afford is what I want in my hosts. I use dedicated VM's for Orion's SQL…
marcrobinson over 4 years ago +1

Great information, and I agree with ecklerwr1 about virtualizing SolarWinds SQL and Orion too. Virtualization shoudl not be painful - just remember that just 'cuz it is virtual, it does not stop using…

marcrobinson over 4 years ago

Great information, and I agree with ecklerwr1 about virtualizing SolarWinds SQL and Orion too. Virtualization shoudl not be painful - just remember that just 'cuz it is virtual, it does not stop using resources; it changes how you think about them.
One warning: "Hey lets put all of those physical servers into a single host! We will save space and I heard that we will not have any problems!" If you hear that - run. Run fast. :smileylol:The posts about learning your virtual environment and how to allocate resources properly should be re-read. Every datacenter has its own quirks.
Anyone have insight about (v)numa, its affect on virtual cpu usage/socket allocation/usage and SQL licensing? That topic gets interesting real fast. I have not found a good guide either.
- Cancel
- Vote Up +1 Vote Down
- More
- Cancel
ecklerwr1 over 4 years ago

I've also noticed with hosts... I always run out of memory before I run out of cores. Fewer fast cores and as much memory as I can afford is what I want in my hosts. I use dedicated VM's for Orion's SQL Server and Orion itself as well. Never had a problem. Putting Orion on a SQL Server with other databases isn't a great idea I've found though.
B
- Cancel
- Vote Up +1 Vote Down
- More
- Cancel
bobmarley over 4 years ago

Looks like I will still be on a VM but on a dedicated SQL server instead of on a shared SQL with many other non SolarWinds DB's running.
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
jkump over 4 years ago

Really helpful post. Thanks!
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel
zennifer over 4 years ago

Great information, I have already forwarded to my team! Thanks for taking the time to put this information together john.maxwell
- Cancel
- Vote Up 0 Vote Down
- More
- Cancel