What is VM sprawl ?
VM sprawl is defined as a waste of resources (compute : CPU cycles and RAM consumption) as well as storage capacity due to a lack of oversight and control over VM resource provisioning. Because of its uncontrolled nature, VM sprawl has adverse effects on your environment’s performance at best, and can lead to more serious complications (including downtime) in constrained environments.
VM Sprawl and its consequences
Lack of management and control over the environment will cause VMs to be created in an uncontrolled way. This means not only the total number of VMs in a given environment, but also how resources are allocated to these VMs. You could have a large environment with minimal sprawl, but a smaller environment with considerable sprawl.
Here are some of the factors that cause VM sprawl:
- Oversized VMs: VMs which were allocated more resources than they really need. Consequences:
- Waste of compute and/or storage resources
- Over-allocation of RAM will cause ballooning and swapping to disk if the environment falls under memory pressure, which will result in performance degradation
- Over-allocation of virtual CPU will cause high co-stops, which means that the more vCPUs a VM has, the more it needs to wait for CPU cycles to be available on all the physical cores at the same moment. The more vCPUs a VM has, the less likely it is that all the cores will be available at the same time
- The more RAM and vCPU a VM has, the higher is the RAM overhead required by the hypervisor.
- Idle VMs: VMs up and running, not necessarily oversized, but being unused and having no activity. Consequences:
- Waste of computer and/or storage resources + RAM overhead at the hypervisor level
- Resources wasted by Idle VMs may impact CPU scheduling and RAM allocation while the environment is under contention
- Powered Off VMs and orphaned VMDKs eat up space resources
How to Manage VM sprawl
Controlling and containing VM sprawl relies on process and operational aspects. The former covers how one prevents VM sprawl from happening, while the latter covers how to tackle sprawl that happens regardless of controls set up at the process level.
On the process side, IT should define standards and implement policies:
- Role Based Access Control which defines roles & permissions on who can do what. This will greatly help reduce the creation of rogue VMs and snapshots.
- Define VM categories and acceptable maximums: while not all the VMs can fit in one box, standardizing on several VM categories (application, databases, etc.) will help filter out bizarre or oversized requests. Advanced companies with self-service portals may want to restrict/categorize what VMs can be created by which users or business units
- Challenge any oversized VM request and demand justification for potentially oversized VMs
- Allocate resources based on real utilization. You can propose a policy where a VM resources will be monitored during 90 days after which IT can adjust resource allocation if the VM is undersized or oversized.
- Implement policies on snapshots lifetime and track snapshot creation requests if possible
In certain environments where VMs and their allocated resources are chargeable, you should contact your customers to let them know that a VM needs to be resized or was already resized (based on your policies and rules of engagement) to ensure they are not billed incorrectly. It is worthwhile to formalize your procedures for how VM sprawl management activities will be covered, and to agree with stakeholders on pre-defined downtime windows that will allow you to seamlessly carry any right-sizing activities.
Even with the controls above, sprawl can still happen. It can be caused by a variety of factors. For example, you could have a batch of VMs provisioned for one project, but while they passed through the process controls, they can sit idle for months eating up resources because the project could end up being delayed or cancelled and no one informed the IT team.
In VMware environments where storage is thin provisioned at the array level, and where Storage DRS is enabled on datastore clusters it’s also important to monitor the storage consumption at the array level. While storage capacity will appear to be freed up at the datastore level after a VM is moved around or deleted, it will not be released on the array and this can lead to out-of-storage conditions. A manual triggering of the VAAI Unmap primitive will be required, ideally outside of business hours, to reclaim unallocated space. It’s thus important to have, as a part of your operational procedures, a capacity reclamation process that is triggered regularly.
The usage of virtual infrastructure management tools with built-in resource analysis & reclamation capabilities, such as Solarwinds Virtualization Manager, is a must. By leveraging software capabilities, these tedious analysis and reconciliation tasks are no longer required and dashboards present IT teams with immediately actionable results.
Even with all the good will in the world, VM sprawl will happen. Although you may have the best policies in place, your environment is dynamic and in the rush that IT Operations are, you just can’t have an eye on everything. And this is coming from a guy whose team successfully recovered 22 TB of space previously occupied by orphaned VMDKs earlier this year.