cancel
Showing results for 
Search instead for 
Did you mean: 
Create Post

Virtualized Databases, Friend or Foe?

Product Manager

Background

 

Fire-fighting mode for DBAs can be stressful when they have co-workers and managers breathing down their necks due to application slow-downs and/or outages. Logic says something changed, but what? In a worst-case scenario, the database instance itself looks fine, nothing changed within the database and the SQL being executed was running fine before. Of course, the SysAdmin says nothing is wrong with the physical server or storage which makes it even more questionable. Hmm, could you be running in a virtual machine (VM)? Is your VM resource starved and competing with other VMs?

 

According to Gartner’s Market Guide for Server Virtualization[1], “Hypervisor-based server virtualization is now mature, with 80% to 90% of server workloads running in a virtual machine (VM) for most midsize to large enterprises.” Additionally, anecdotal evidence states 70% of all databases are virtualized. In fact, here at SolarWinds, 50% of our database instances run in a VM. For all the benefits of virtualization like cost savings and ease of migrating workloads, the abstraction of the virtual layer from the physical hardware can introduce some challenges.

 

And let’s not forget the elephant in the room, snapshots. Many DBAs I’ve talked to are at a loss as to why SysAdmins and IT ops perform snapshots of their database instance VMs, which in turn can cause performance issues, especially if a memory snapshot is invoked which renders the VM inactive while the memory is written to disk. Database backups are best left to DBAs who ensure referential integrity is maintained to recover a database.

 

Which Metrics Matter?

 

If you find yourself running your database instances in a VMware VM, what do you need to look for to see if the VM your database is running in has problems? There are many metrics available, so let’s review the usual suspects.

 

CPU Ready

 

  • This metric indicates the VM (and the database trying to run inside it) was ready to run but instead sat idle waiting behind other VMs contending to control the same shared resources such as physical CPUs or memory.

    For example, a vSphere host has six physical CPUs, and two VMs are configured to each require four virtual CPUs (vCPUs) before they can run. This situation means only one VM can run at a time. You can eliminate the VMs queueing behind each other by either moving a VM to another host or configuring both VMs to require three or fewer virtual CPUs.

 

    • The term “oversubscription” simply means you’ve assigned more virtual resources than what physical resources exist to run all VMs concurrently. It may seem a bit strange but reducing the number of vCPUs may dramatically increase its performance. Generally, oversubscription should not go above 5%.With the SolarWinds®Database Performance Analyzer (DPA) VM Option, an easy way to see how many physical CPUs your host server has is to view the Host tab on the VM CONFIG page.

 

pastedImage_0.png

VM CPU Usage

  • Actively used CPU as a percent of total available virtual CPU in the virtual machine.

Host CPU Usage

 

  • Actively used CPU as a percent of total available CPU on the machine. If this number is high you might see VMs with high CPU ready and/or co-stop.
    • Active CPU is approximately equal to the ratio of the used CPU to the available CPU where: Available CPU = # of physical CPUs x clock rate.
    • When your database instance is running in a VM, with the VM Option, DPA automatically expands the data in the CPU tab to include this information along with other VM specific metrics.

 

pastedImage_5.png

Co-Stop

 

  • The time a VM waits for a vCPU is due to scheduling (lack of resources). So basically, your VM can be waiting on physical CPU resources in use by other VMs. If you see high Host CPU Usage this is probably a sign there are too many VMs on this host and/or you need more physical CPU resources.

 

VM Memory Swap Rate

 

  • The “swap in” and “swap out” rates generally mean you have a shortage of physical memory on the host, so the memory is swapped out and in from disk.

 

VM Active Memory Usage

  • This is the memory in use as a percent of the memory configured for the VM.

 

Host Memory Usage

  • This is the memory usage on the host (consumed memory / total machine memory). If this is high (e.g., GT 90%) this could indicate host memory over-commit which could lead to high VM swap rates.

VM Memory Overhead

  • This is simply the amount of memory used to run the VM. Over-configuring memory (or excess vCPU for that matter) will unnecessarily increase overhead. That said, there’s memory needed by ESXi itself and the virtual machine (virtual machine frame buffer).

 

VM Memory Balloon

  • The balloon driver reclaims pages on the server considered less valuable. The crux of this VMware proprietary technique is to match the behavior of a guest OS. You should only see this when the host is running low or out of physical memory.
  • If you see the virtual machine your database instance is running in has a certain percent of memory claimed by the balloon driver, look for memory swapping which could affect your VM’s performance. However, if you don’t see any swapping issues you don’t and won’t necessarily have a performance problem.

 

VM Disk Commands

  • Number of disk commands executed is an indication of how busy the disks are. That said, unless you see large queues developing and commands start to be aborted there isn’t a problem.
  • If you see aborted disk commands, then your storage is severely overloaded and can lead to serious application response issues.

 

VM Disk Usage

  • Available if you aren’t using a NFS datastore, it will show the average disk I/O rates across all virtual disks on the VM.

 

VM Read / Write Rates

  • VM disk read rate is the average amount of data read from the disk each second during the collection interval. For a VM, this is the rate at which data is read from each virtual disk to the virtual machine.
  • VM disk write rate is the average amount of data written to disk each second during the collection interval—simply the rate data is written to each virtual disk on the VM.

 

Host Disk Device Read / Write Rates

  • The host disk read-and-write rate is the average read/write rate across all disks/LUNs on the host. The rate represents the read/write throughput at the host level across all disks/LUNs and VMs running on the host.
    • If the database instance has I/O performance issues, you may have another VM on the same host causing the delays. Compare this metric to the physical I/O rate from the database instance. If the Host rate is higher, then it’s likely another VM is the problem. Otherwise, the VM your instance is running in may be causing too much of a demand on the underlying physical storage.

Host Max Disk Latency

  • This is the highest latency value across all disks used by this host.

 

Host Disk Latency

  • Read latency is the average amount of time to process a read command to a disk to the host (across all VMs). High disk latency indicates storage may be slow or overloaded.
  • Write latency is similar to read and is the average amount of time to process a write command from the specific disk across all VMs.
    • Disk Write Latency = Kernel Write Latency + Device Write Latency
  • Expected disk latencies will depend on the nature of the storage like read/write mix, randomness and I/O size along with the capability of the storage subsystem.

 

In addition to these metrics being found in DPA, you can execute the “esxtop” command from your VMware ESXi host or look at various utilization metrics from the VMware ESXi console. SolarWinds Virtualization Manager also reports on all of these metrics and more in a friendlier format with both historical and real-time data.

 

esxtop screen capture.pngESXi Screen 2.PNG

 

 

Sample Nightmare Scenario Avoided

 

As I mentioned when I started off, a nightmare scenario could be when everything associated with the database instance seems fine—nothing changed. Since we’ve covered the essential VM metrics you should be monitoring, let’s walk through a hard-to-find problem for a database instance running in a VM using SolarWinds Database Performance Analyzer (DPA) with the VM Option. In the 2019.4 release of DPA, we expanded the VM option to go beyond the basic resource metrics to include additional HOST metrics and to make note of events, as seen in the DPA CPU tab in RESOURCES.

 

pastedImage_10.png

* Example of event logging in DPA 2019.4

 

 

Let’s walk through our sample “nightmare” scenario.

 

  • Problem ticket open for poor application performance response time
    • Users complained the morning of Monday, December 2 “around 8 a.m.” they experienced abnormally long wait times.

 

  • No outages were recorded from the IT Ops group

 

  • You go to DPA to look at the Database instance supporting the application
    • You notice a longer than normal wait occurrence on December 2, and the machine learning anomaly detection flags this time as a critical wait time delta from what is normally expected at this time of day.

pastedImage_5.png

 

 

  • You then look at the tab ADVISORS for additional data for this day.

    1. As it turns out, a specific query accounts for the top amount of execution time.
      pastedImage_12.png

  • You select this query to find out more about it and what occurred at the time. From the QUERY DETAIL page, you see the longest wait time was for memory/CPU from which you click on the green bar for memory/CPU to explore further by going down to the hour.

    pastedImage_13.png

 

  • Once you get down to the hourly view, you see a noticeable spike in wait time in the morning hours when the application response time issue was occurring.
  • As you scroll down the page to the end where VM metrics are shown, you see the new co-stop metric where there’s a corresponding spike. By hovering over the annotation dots, you see during this time the VM was being moved via vMotion from one host to another.


pastedImage_14.png

pastedImage_16.png

  • Just as with snapshots, vMotion events can have a negative impact on the performance of the VM the database instance is running in. Without visibility into the virtualized infrastructure, it can be time consuming to find the culprit of poor performance. 
    With DPA, you can easily line up all of resources for a specific time to pinpoint the problem as seen below.

    pastedImage_17.png

 

Summary

 

With VMware’s 500,000 customers and tens of millions of VMs, virtualization is here to stay. Since many database on-premises to cloud migrations involve virtualization, e.g., Azure VM, many of the same challenges existing on-premises will exist in IaaS environments. DBA’s don’t have to be virtual admins, but they do need to be aware of the environment their database instances run in and the impact those environments have on database performance.

 

That said, I’ve discovered many DPA customers have no idea there’s a purpose-built option for VMware that can be added to the product. It’s easy to see if you have the option by looking for the VIRTUALIZATION tab on the home page.

 

pastedImage_18.png

*  This all-in-one view lets you line up all your resources in a single view to look for problems on a specific date and time.

 

Our goal at SolarWinds is to listen to our customers which is why we’ve enhanced the VM option for DPA. If you are a DPA customer, be sure to utilize our THWACK® feature request page to request and vote on feature enhancements. 

 

 

 

 


[1] Gartner Market Guide for Server Virtualization, Published 24 April 2019, ID G00350674

11 Comments
Level 16

Ran my Orion DB server for years on a physical SQL server with no problems. Was forced to go to a VM a few years ago and its been a headache ever since. Hoping to go back to a physical server again soon.

It's not just the issue of physical -vs- virtual, its also the access rights. When running it on a physical SQL server I could upgrade my installer ID to SA whenever doing installs and they always went great. Now the installs/upgrades

are done on a shared SQL instance and SA rights are no longer allowed. Many times the installer will hang with some erroneous error during the install and you are stuck with a corrupted system and database. Restore back to scratch and

try again.

Level 13

I've had ok experiences with test workloads on VMs.  Production workloads have been much more problematic.  Latency, weird glitches, unexplained and intermittent slowness have been the order of the day.  Given how affordable physical is we've moved all production workloads back to physical.

Level 14

About to put in a suite of Solarwinds products at the current client.  Looking forward to not having to use VMWare's vTop to looks for this stuff.  

Level 14

I've run some failry beefy SQL boxes in VMWare without issue.  I guess it depends on what else is running and how it is configured.  That said, my new Solarwinds installation will be on two physical boxes as it will be monitoring the virtual environments we have.

We find that the Data Center licensing of VM SQL workloads make far more sense to us, and offsets other costs that increase. I will not pretend that SQL on a VM is easier, there are simply more things to think about to get it optimized for your workloads. Now that SQL has been completely virtual for 3 years, the initial bumps have been ironed out. I can also say that since we have one model for SQL, we deliver much better service for the smaller databases then we would have otherwise. The heavy lifting for big stuff (and well funded stuff) rolls over to everything else. With standardization it is not more work to bring the performance up for everyone, and overall has been a much smoother transition than I imagined. 

Level 12

As a non-DBA I appreciate the information shared here.

MVP
MVP

Great information, I have already forwarded to my team!  Thanks for taking the time to put this information together john.maxwell

Level 15

Really helpful post.  Thanks!

Level 16

Looks like I will still be on a VM but on a dedicated SQL server instead of on a shared SQL with many other non SolarWinds DB's running. 

Level 20

I've also noticed with hosts... I always run out of memory before I run out of cores.  Fewer fast cores and as much memory as I can afford is what I want in my hosts.  I use dedicated VM's for Orion's SQL Server and Orion itself as well.  Never had a problem.  Putting Orion on a SQL Server with other databases isn't a great idea I've found though.

B

Level 11

Great information, and I agree with @ecklerwr1 about virtualizing SolarWinds SQL and Orion too. Virtualization shoudl not be painful - just remember that just 'cuz it is virtual, it does not stop using resources; it changes how you think about them.

One warning: "Hey lets put all of those physical servers into a single host! We will save space and I heard that we will not have any problems!" If you hear that - run. Run fast. Smiley LOLThe posts about learning your virtual environment and how to allocate resources properly should be re-read. Every datacenter has its own quirks. 

Anyone have insight about (v)numa, its affect on virtual cpu usage/socket allocation/usage and SQL licensing? That topic gets interesting real fast. I have not found a good guide either.