VM Snapshots and Performance or: How I Learned to Love My Martian Data Center
It takes a radio signal about 1.28 seconds to get to the Moon (about 239,000 miles away), and about 2.5 seconds for round trip communication between our secret moon base and Earth. So, therefore this common SQL Server error message number 833...
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\SQL\database.mdf] in database [database]. The OS file handle is 0x0000000000000000. The offset of the latest long I/O is: 0x00000000000000
...implies that the round trip time is over 15 seconds, so using 7.5 seconds (as a minimum estimate, we really don't know how long it is taking) we see the underlying SAN disks are over 1,396,500 miles away, or about 5.8 times as far away as the Moon. No, I don't have any idea how they got there, either. But how else to explain this error? For all I know this SAN could be on Mars!
Now, I've seen this error message many times in my career. The traditional answers you find on the internet tell you to look at your queries and try to figure out which ones are causing you the I/O bottleneck. In my experience, this guidance was wrong more than 95% of time. In fact, this is the type of guidance that usually results in people wanting to just throw hardware at the problem. I've seen that error message appear with little to no workload being run against the instance.
In my experience the true answer was almost always "shared storage" or "the nature of the beast that is known as a SAN". Turns out that when several servers share the same storage arrays you can end up being a victim to what is commonly called a "noisy neighbor". One workload, on one particular server, causing performance pain for a seemingly unrelated server elsewhere.
What's more frustrating is that sometimes the only hint of the issue is with the SQL Server error message. Often the conventional tools used to monitor the SAN don't necessarily show the problem, as they are focusing on the overall health of the SAN and not on the health of specific applications, database servers, or end-user experience.
And just when I thought I had seen it all when it comes to the error message above, along comes something new for me to learn.
Snapshot and Checkpoints
No, they aren't what's new. I've been involved with the virtualization of database servers for more than eight years now and the concept of snapshots and checkpoints are not recent revelations for me. I've used them from time to time when building personal VMs for demos and I've seen them used sparingly in production environments. Why the two names? To avoid confusion, of course. (Too late.)
The concept of a snapshot or checkpoint is simple: to create a copy of the virtual machine at a point in time. The reason for wanting this point in time copy is simple as well: recovery. You want to be able to quickly put the virtual machine back to the point in time created by the snapshot or checkpoint. Think of things like upgrades or service packs. Take a snapshot of the virtual machine, apply changes, sign off that everything looks good, and remove the snapshot. Brilliant!
How do they work?
For snapshots in VMWare, the documentation is very clear:
When you create a snapshot, the system creates a delta disk file for that snapshot in the datastore and writes any changes to that delta disk.
So, that means the original file(s) used for the virtual machine become read-only, and this new delta file stores all of the changes. To me, I liken this to the similar "copy-on-write" technology in database snapshots inside of SQL Server. In fact, this VMWare KB article explains the process in the same way:
The child disk, which is created with a snapshot, is a sparse disk. Sparse disks employ the copy-on-write (COW) mechanism, in which the virtual disk contains no data in places, until copied there by a write.
OK, so we know how they work, so let's talk about their performance impact.
Are they bad?
Not at first, no. But just like meat left out overnight they can become bad, yes. And the reason why should be very clear: the longer you have them, the more overhead you will have as the delta disk keeps track of all the changes. Snapshots and checkpoints are meant to be a temporary thing, not something you would keep around. In fact, VMware suggests that you keep a snapshot for no more than 72 hours, due to the performance impact. Here's a brief summary of other items from the "Best practices for virtual machine snapshots in the VMware environment" KB article:
- Snapshots are not backups, and do not contain all the info needed to restore the VM. If you delete the original disk files, the snapshot is useless.
- The delta files can grow to be the same size as the original files. Plan accordingly.
- Up to 32 snapshots are supported (unless you run out of disk), but you are crazy to use more than 2-3 at any one time.
- Rarely, if ever, should you use a snapshot on high-transaction virtual machines such as email and database servers.
- Snapshots should only be left unattended for 24-72 hours, and don't feed them after midnight, ever.
OK, I made that last part up. You aren't supposed to feed them, ever, otherwise they become like your in-laws at Christmas and they will never leave.
So, snapshots and checkpoints can have an adverse affect on performance! And I found out about it through this Spiceworks thread, then from other articles on the internet that detailed this very same issue.
So this performance issue wasn't exactly an unknown, but rather new to me since I hadn't come across issues related to snapshots or thought to check for them in production. And, from what I can tell, most people don't have this experience either, hence the reason for scratching our heads when we see the affects of snapshots and checkpoints on our database servers.
Do I have one?
I don't know, you'll need to look for yourself. For VMware, you have three methods as detailed in this KB article:
1. Using vSphere
2. Using the virtual machine Snapshot Manager
3. Viewing the virtual machine configuration file
4. Viewing the virtual machine configuration file on the ESX host
Yes, that's four things. I didn't write the KB article. You can read it for yourself. Consider number 4 to be a bonus option or something. Or maybe they meant to combine the last two. Again, I didn't write the article, I'm just pointing you to what it says.
Now, for Hyper-V, we can look at the Hyper-V Manager GUI as well, which is essentially similar to using vSphere. But we could also use the Hyper-V Powershell cmdlets as listed here. In fact, this little piece of code is all you really need:
PS C:\> get-vm "vmName" | get-vmsnapshot
Snapshots and checkpoints are fine tools for you to use, but when you are done with them you should get rid of them, especially for a database server. Otherwise you can expect to see a lot of disk latency and high CPU as a result. And should you see such things but your server team reports back that everything looks normal, I hope this post will stick in your head enough for you to remember to go looking for any rogue snapshots that may exist.