
Geek Speak

4 Posts authored by: maxmortillaro

Following my review of Solarwinds Virtualization Manager 6.3, the fair folks at Solarwinds gave me the opportunity to put my hands on their next planned release, namely VMAN 6.4. While there is no official release date yet, I would bet on an announcement within Q4-2016. The version I tested is 6.4 Beta 2. So what’s new with this release?

 

From a UI perspective, VMAN 6.4 is very similar to its predecessor. As with VMAN 6.3, you install the appliance and either install VIM (the Virtual Infrastructure Monitor component) on a standalone Windows Server, or integrate with an existing Orion deployment if you already use other Solarwinds products. You’d almost think that nothing has changed until you head over to the “Virtualization Summary” page. The new killer feature of VMAN 6.4 is called “Recommendations”, and while it seems like a minor UI improvement, there is much more to it than meets the eye.

 

While in VMAN 6.3 you are presented with a list of items requiring your attention (over/under-provisioned VMs, idle VMs, orphaned VMDK files, snapshots, etc. – see my previous review), in VMAN 6.4 all of these items are aggregated in the “Recommendations” view.

 

Two types of recommendations exist: active and predicted. Active recommendations are immediate recommendations correlated with issues currently showing up in your environment. If you are experiencing memory pressure on a given host, an active recommendation would propose moving one or more VMs to another host to balance the load. Predicted recommendations, on the other hand, focus on proactively identifying potential issues before they become a concern, based on the usage history of your environment.

 

The “Recommendations” feature is very pleasant to use and introduces a few elements that are quite important from a virtualization administrator’s perspective:

 

  • First of all, administrators have the possibility to apply a recommendation immediately or schedule it for a later time (out of business hours, change windows, etc.).
  • Secondly, an option is offered to either power down a VM to apply the recommendation or to attempt to apply it without any power operations. This feature comes in handy if you need to migrate VMs, as you may run into cases where a power off/power on is required, while in other cases a vMotion / live migration will suffice.
  • Last but not least, the “Recommendations” module will check whether the problem still exists before actually applying a recommendation (see the sketch after this list). This makes particular sense for active recommendations, which may no longer be relevant by the time you decide to apply them (for example, if you schedule a recommendation but the issue is no longer reported by the scheduled time).
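
To make that last point concrete, here is a minimal Python sketch of the check-before-apply pattern as I understand it. The data model and the helper callables are entirely hypothetical and are in no way part of VMAN or its API:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional

# Hypothetical data model, for illustration only (not a VMAN object).
@dataclass
class Recommendation:
    description: str
    scheduled_for: datetime          # start of the chosen change window
    allow_power_operations: bool     # may the VM be powered off to apply the change?

def apply_scheduled(rec: Recommendation,
                    issue_still_present: Callable[[], bool],
                    apply_live: Callable[[], None],
                    apply_with_power_cycle: Callable[[], None],
                    now: Optional[datetime] = None) -> str:
    """Apply a recommendation only once its window has arrived and only if the
    triggering issue is still present."""
    now = now or datetime.now()
    if now < rec.scheduled_for:
        return "waiting"             # change window not reached yet
    if not issue_still_present():
        return "skipped"             # the issue resolved itself in the meantime
    if rec.allow_power_operations:
        apply_with_power_cycle()     # power off, migrate, power on
    else:
        apply_live()                 # e.g. a vMotion / live migration, no downtime
    return "applied"

# Toy usage with stubbed-out actions:
rec = Recommendation("Move VM app01 off host esx-03 (memory pressure)",
                     scheduled_for=datetime(2016, 10, 1, 2, 0),
                     allow_power_operations=False)
print(apply_scheduled(rec,
                      issue_still_present=lambda: True,
                      apply_live=lambda: print("migrating app01 live..."),
                      apply_with_power_cycle=lambda: print("power cycling app01...")))
```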

 

A nice and welcome touch in the UI is the visual aid that appears when hovering your mouse over a proposed recommendation: a simple, readable graphical simulation of the before and after state of the affected object (cluster, datastore, etc.), should you decide to apply it.

 

Max’s take

 

The “Recommendations” function, while apparently modest from a UI perspective, is in fact an important improvement that goes beyond the capacity reclamation and VM sprawl controls included in VMAN 6.3. Administrators are now presented with actionable recommendations that are relevant not only in the context of immediate operational issues, but also as countermeasures to prevent future bottlenecks and capacity shortages.

 

A few side notes: if you plan to test the beta version, reach out to the Solarwinds engineers first. The new “Recommendations” function is still being fine-tuned, and you may not be able to see it if you integrate the appliance with your current VIM or Orion environment. Once you install VMAN 6.4, you should let it run for approximately a week in order to get accurate recommendations.

In my previous posts, I have covered how to manage VM sprawl and how to do proper Capacity Planning. In this post, I would like to share my experience of Solarwinds Virtualization Manager 6.3.

 

Today’s virtualized data centers are dynamic environments where a myriad of changes (provisioning, snapshotting, deletions, etc.) executed by numerous people are difficult to track. Furthermore, virtualization has lifted the traditional physical limits on resource consumption, where a workload was bound to a single physical server. For consumers, even private data centers have turned into magical clouds where unicorns graze and endlessly expand existing capacity in a flash of rainbow. Unfortunately, down-to-earth administrators know that, unlike the universe, data centers have a finite amount of resources available for consumption.

 

With this in mind, maintaining a healthy data center environment while attempting to satisfy consumers is a challenge for many administrators. Let’s see how Solarwinds Virtualization Manager 6.3 helps tackle this challenge.

 

VM Sprawl Defeated

As highlighted in my previous article, even with the best intentions in the world, some organic sprawl will still make its way into your data center, no matter how carefully you cover your back with processes. The VM Sprawl console of VMAN 6.3 allows administrators to immediately see sprawl-related issues and address them before they start causing serious performance problems.

 

The VM sprawl dashboard covers the following sprawl issues:

  • Oversized / Undersized VMs
  • VMs with large snapshots
  • Orphaned VMDKs (leftover VMDK files not linked to any existing VM – see the sketch after this list)
  • VMs suffering from high co-stops
  • Idle VMs
  • Powered Off VMs
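
As a side note, the concept behind orphaned-VMDK detection is simple enough to sketch: compare the VMDK files present on your datastores against the disks actually attached to registered VMs. The helper below is a hypothetical Python illustration with hard-coded inputs rather than a real inventory query; it is not how VMAN implements it:

```python
# Compare every .vmdk file found on storage with the disks attached to
# registered VMs; whatever is left over belongs to no VM and is a candidate
# for reclamation. Both input lists are invented for the example.
def find_orphaned_vmdks(files_on_datastores, disks_attached_to_vms):
    """Return VMDK paths that exist on storage but belong to no registered VM."""
    return sorted(set(files_on_datastores) - set(disks_attached_to_vms))

files_on_datastores = [
    "[ds01] app01/app01.vmdk",
    "[ds01] app01/app01_1.vmdk",
    "[ds02] old-test/old-test.vmdk",   # leftover from a deleted VM
]
disks_attached_to_vms = [
    "[ds01] app01/app01.vmdk",
    "[ds01] app01/app01_1.vmdk",
]

print(find_orphaned_vmdks(files_on_datastores, disks_attached_to_vms))
# ['[ds02] old-test/old-test.vmdk']
```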

 

While it’s good to detect sprawl issues, it’s even better to address them as soon as possible. What I find clever in VMAN 6.3 is that, for all of the issues enumerated above, administrators can remediate directly from within VMAN, without having to jump from the monitoring tool to their vSphere client or PowerShell. The amount of information provided in each panel is adequate, so there is no ambiguity about identifying the culprits and remediating the problems.

 

I like the fact that everything is presented in a single view: there is no need to run reports here and there to determine how VMs should be right-sized, and no treasure hunting to find orphaned VMDK files.

 

 

Doing Capacity Planning

VMAN 6.3 has a dedicated Capacity Planning dashboard that highlights current resource consumption, trends and expected depletion dates for CPU, RAM and storage, as well as network I/O usage. Here again, a simple but complete view of what matters: do I still have enough capacity? Is a shortage in sight? When should I start making preparations to procure additional capacity?

 

Besides the Capacity Planning dashboard, VMAN 6.3 is equipped with a Capacity Planner function that enables administrators to simulate the outcome of a wide variety of “what-if” scenarios, with the necessary granularity. I appreciate the ability to choose between three modeling options: peak, 95th percentile and 75th percentile. Peak takes usage spikes into consideration, which can be necessary if the workloads cannot tolerate any contention or resource constraint. The latter two make it possible to “smooth” the data used for modeling by eliminating usage spikes from the calculation. While the benefit may not be immediately apparent in smaller environments, it can have a decisive financial impact on larger clusters.
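
To illustrate what these three options mean in practice, here is a small Python sketch on a made-up series of usage samples. The figures are invented and this is obviously not VMAN’s internal model, just the generic idea of smoothing via percentiles:

```python
import numpy as np

# Twelve made-up hourly CPU-usage samples (percent of cluster capacity);
# the last one is a spike.
cpu_usage = np.array([35, 38, 40, 42, 45, 47, 50, 52, 55, 58, 62, 95])

peak = cpu_usage.max()                 # sizes for the single worst sample
p95  = np.percentile(cpu_usage, 95)    # ignores the most extreme ~5% of samples
p75  = np.percentile(cpu_usage, 75)    # much smoother, sizes for "typical busy" load

print(f"peak: {peak:.0f}%, 95th: {p95:.1f}%, 75th: {p75:.1f}%")
# Sizing on the peak keeps headroom for the spike; the percentile options trade
# a little contention risk for a smaller (cheaper) capacity target.
```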

 

A corollary to the capacity planning activities is the Showback dashboard. Provided that you have organized your resources in folders, you are able to show users what they are actually consuming. You can also run chargeback reports in which you define pricing for consumed resources. These can be helpful not only from a financial perspective but also from a political one, as they help, in most mentally stable environments, bring back a level of awareness and accountability into how resources are consumed. If a division has successfully deployed its new analytics software, which ends up starving the entire environment, showback/chargeback will be decisive in explaining the impact of the deployment (and in obtaining, or coercing, an eventual financial contribution to expanding capacity).
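
As a purely illustrative example of the underlying idea, here is a trivial showback calculation in Python. Folder names, usage figures and unit prices are all invented and have nothing to do with VMAN’s own reporting:

```python
# Per-folder consumption multiplied by (invented) monthly unit prices.
PRICES = {"vcpu": 15.0, "ram_gb": 5.0, "storage_gb": 0.10}

usage_by_folder = {
    "Analytics": {"vcpu": 64, "ram_gb": 512, "storage_gb": 8000},
    "Web-Tier":  {"vcpu": 24, "ram_gb": 96,  "storage_gb": 1200},
}

for folder, usage in usage_by_folder.items():
    cost = sum(usage[res] * PRICES[res] for res in PRICES)
    print(f"{folder:<10} {cost:>10.2f} / month")
# Analytics     4320.00 / month
# Web-Tier       960.00 / month
```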

 

Going further

Time Travel, a feature that correlates alerts along the flow of time, is a powerful aid for troubleshooting and root cause analysis. By snapshotting metrics from the environment at regular intervals, you are able to understand which events were happening at a given point in time. The sudden performance degradation of a VM becomes easier to investigate by reviewing what happened in parallel: you can determine whether the issue was caused by intensive I/O on a shared storage volume, or by extremely high network traffic creating congestion.

 

VMAN 6.3: The Chosen One?

VMAN 6.3 provides an end-to-end virtualization management experience that covers not only analysis, correlation and reporting but also actionable insights. It empowers administrators with the necessary tools to have a full overview of their data center health. Last but not least, the integration with the Solarwinds Orion platform and other management software from Solarwinds (Network Performance Monitor, Database Performance Analyzer, etc.) provides enterprises with a true and unique single-pane-of-glass experience (a term I use extremely rarely because of how often it is abused) to monitor their entire data center infrastructure.

 

So is Solarwinds the Chosen One that will bring balance to the data center? No, you – the administrator – are the Chosen One. But you will need the help of the Solarwinds Force to make the prophecy become a reality. Use it wisely.

Capacity Planning 101

The objective of Capacity Planning is to adequately anticipate current and future capacity demand (resource consumption requirements) for a given environment. This helps to accurately evaluate demand growth, identify growth drivers and proactively trigger any procurement activities (purchase, extension, upgrade etc.).

 

Capacity planning is based primarily on two inputs. The first is analyzing historical data to obtain organic consumption and growth trends. The second is predicting the future by analyzing the pipeline of upcoming projects, also taking into consideration migrations and hardware refreshes. IT and the business must work hand in hand to ensure that any upcoming projects are known well in advance.
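
For illustration, here is a minimal Python sketch of how those two inputs can be combined into a depletion estimate: a linear trend fitted on historical usage, plus a known project landing at a given month. All figures (historical usage, installed capacity, pipeline demand) are invented:

```python
import numpy as np

months = np.arange(12)                       # last 12 months of history
used_tb = np.array([40, 41.5, 43, 44, 45.5, 47, 48, 49.5, 51, 52, 53.5, 55])
capacity_tb = 80.0                           # installed storage capacity

slope, intercept = np.polyfit(months, used_tb, 1)   # organic growth in TB/month
pipeline_tb = 6.0                                   # new project landing at month 15

for m in range(12, 37):                             # project two years ahead
    projected = intercept + slope * m + (pipeline_tb if m >= 15 else 0)
    if projected >= capacity_tb:
        print(f"capacity exhausted around month {m} "
              f"({projected:.1f} TB needed vs {capacity_tb} TB installed)")
        break
```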

 

The Challenges with Capacity Planning or “the way we’ve always done it”

 

Manual capacity planning, by running scripts here and there, exporting data, compiling it and leveraging Excel formulas, can work. However, there are limits to one’s available time, and it comes at the expense of focusing on higher-priority issues.

 

The time spent manually parsing, reconciling and reviewing data can be nothing short of a huge challenge, if not a waste of time. The larger an environment grows, the larger the dataset becomes and the longer it takes to prepare capacity reports. And the more manual the work is, the more prone it is to human error. While it’s safe to assume that any person with Excel skills and a decent set of instructions can generate capacity reports, the question remains as to their accuracy. It’s also important to point out that new challenges have emerged for those who like manual work.

 

Space-saving technologies like deduplication and compression have complicated things. What used to be a fairly simple calculation of linear growth based on growth trends and year-over-year estimates is now complicated by non-linear factors such as compression and deduplication savings. Since both compression and deduplication ratios are dictated by the type of data as well as the specifics of the technology (in-line vs. at-rest deduplication, block size, etc.), it becomes extremely complicated to factor them into a manual calculation process. Of course, you could “guesstimate” compression and/or deduplication factors for each of your servers, but the expected savings can also fail to materialize for a variety of reasons.
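
A tiny, purely illustrative calculation shows why this matters: the same logical growth translates into very different physical needs depending on the reduction ratio you assume, and that ratio is the part you cannot reliably guess up front.

```python
# Why data-reduction ratios break simple linear projections. All figures invented.
logical_growth_tb = 30.0          # expected logical data growth over the next year

for label, ratio in [("no reduction", 1.0),
                     ("conservative 1.5:1", 1.5),
                     ("optimistic 3:1", 3.0)]:
    physical_tb = logical_growth_tb / ratio
    print(f"{label:<20} -> {physical_tb:.1f} TB of physical capacity needed")
```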

 

Typical mistakes in capacity management and capacity planning involve space reclamation activities at the storage array level, or rather the lack of awareness and activity on the matter. Monitoring storage consumption at the array level without relating it to the way storage has been provisioned at the hypervisor level may result in discrepancies. For example, not running Thin Provisioning Block Space Reclamation (through the VMware VAAI UNMAP primitive) in VMware environments may lead some to believe that a storage array is reaching critical capacity levels while, in fact, a large portion of the allocated blocks is no longer active and can be reclaimed.
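
A back-of-the-envelope illustration of that discrepancy, with invented figures:

```python
# Blocks still allocated on the array versus space actually in use at the
# hypervisor/datastore level; if UNMAP has not been run, the difference is
# potentially reclaimable rather than genuinely consumed.
array_allocated_tb = 58.0     # what the storage array reports as consumed
datastore_used_tb  = 41.0     # what the hypervisor reports as in use

potentially_reclaimable_tb = max(array_allocated_tb - datastore_used_tb, 0)
print(f"~{potentially_reclaimable_tb:.1f} TB could likely be reclaimed with UNMAP")
```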

 

Finally, in manual capacity planning, any attempt to run “what-if” scenarios (adding n VMs with a given usage profile for a new project) is a wild guess at best. Even with the best intentions and focus, you are likely to end up either with an under-provisioned environment and resource pressure, or with an over-provisioned environment and idle resources. While the latter is preferable, it is still a waste of money that could have been invested elsewhere.
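
For comparison, even a toy automated check is more structured than a wild guess. The sketch below tests whether n new VMs with an assumed profile fit into the remaining headroom; the profile and headroom figures are invented, and a real planner would also model growth, failover reserves and scheduling effects:

```python
# Does adding N VMs with a given profile fit within the remaining cluster headroom?
headroom = {"vcpu": 120, "ram_gb": 768, "storage_gb": 20000}   # remaining capacity
profile  = {"vcpu": 4,   "ram_gb": 16,  "storage_gb": 200}     # per new VM
new_vms = 25

for resource, free in headroom.items():
    needed = profile[resource] * new_vms
    status = "OK" if needed <= free else "SHORTFALL"
    print(f"{resource:<10} need {needed:>6} / free {free:>6} -> {status}")
```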

 

Capacity Planning – Doing It Right

 

As we’ve seen above, the following factors can cause incorrect capacity planning:

  • Multiple sources of data collected in different ways
  • Extremely large datasets to be processed/aggregated manually
  • Manual, simplistic data analysis
  • Key technological improvements not taken into account
  • No simple way to determine the effect of a new project on infrastructure expansion plans

 

Additionally, all of the factors above are prone to human error.

 

Because processing this data manually is nearly impossible, and highly inefficient to boot, precious allies such as Solarwinds Virtualization Manager are required to identify real-time issues, bottlenecks and potential noisy neighbors, as well as wasted resources. Once these wasted resources are reclaimed, capacity planning can provide a better evaluation of the actual expected growth in your environment.

 

Capacity planning activities are not just about looking into the future, but also about managing the environment as it is now. The link between capacity planning and capacity reclamation activities is crucial. Just as you keep your house tidy before planning an extension or buying new furniture, the same needs to be done with your virtual infrastructure.

 

Proper capacity planning should factor in the following items:

  • Central, authoritative data source (all the data is collected by a single platform)
  • Automated data aggregation and processing through a software engine
  • Advanced data analysis based on historical trends and usage patterns
  • What-If scenarios engine for proper measurement of upcoming projects
  • Capacity reclamation capabilities (Managing VM sprawl)

 

Conclusion

 

Enterprises must consider whether capacity planning done “the way we’ve always done it” is adding any value to their business or is rather the Achilles heel of their IT strategy. Because of its criticality, capacity planning should not be treated as a recurring manual data collection/aggregation chore assigned to “people who know Excel”. Instead, it should be run as a central, authoritative function that measures current usage, warns about potential issues and provides key insights to plan future investments in time.

What is VM sprawl?

VM sprawl is defined as a waste of compute resources (CPU cycles and RAM) as well as storage capacity, due to a lack of oversight and control over VM resource provisioning. Because of its uncontrolled nature, VM sprawl degrades your environment’s performance at best, and can lead to more serious complications (including downtime) in constrained environments.

 

VM Sprawl and its consequences

A lack of management and control over the environment will cause VMs to be created in an uncontrolled way. This concerns not only the total number of VMs in a given environment, but also how resources are allocated to those VMs. You could have a large environment with minimal sprawl, and a smaller environment with considerable sprawl.

 

Here are some of the factors that cause VM sprawl:

 

  • Oversized VMs: VMs which were allocated more resources than they really need. Consequences:
    • Waste of compute and/or storage resources
    • Over-allocation of RAM will cause ballooning and swapping to disk if the environment falls under memory pressure, which will result in performance degradation
    • Over-allocation of virtual CPUs will cause high co-stop values: the VM must wait until enough physical cores are free at the same moment, and the more vCPUs it has, the less likely it is that all of those cores will be available simultaneously
    • The more RAM and vCPUs a VM has, the higher the RAM overhead required by the hypervisor.

 

  • Idle VMs: VMs that are up and running, not necessarily oversized, but unused and showing no activity. Consequences:
    • Waste of compute and/or storage resources, plus RAM overhead at the hypervisor level
    • Resources wasted by idle VMs may impact CPU scheduling and RAM allocation when the environment is under contention
  • Powered-off VMs and orphaned VMDKs eat up storage resources

 

 

How to Manage VM sprawl

Controlling and containing VM sprawl relies on both process and operational aspects. The former covers how to prevent VM sprawl from happening, while the latter covers how to tackle the sprawl that happens regardless of the controls set up at the process level.

 

Process

On the process side, IT should define standards and implement policies:

 

  • Role-Based Access Control, defining roles & permissions on who can do what. This will greatly help reduce the creation of rogue VMs and snapshots.
  • Define VM categories and acceptable maximums: while not all VMs can fit in one box, standardizing on several VM categories (application, database, etc.) will help filter out bizarre or oversized requests. Advanced companies with self-service portals may want to restrict/categorize which VMs can be created by which users or business units.
  • Challenge any oversized VM request and demand justification for potentially oversized VMs.
  • Allocate resources based on real utilization. You can propose a policy whereby a VM’s resources are monitored for 90 days, after which IT can adjust the allocation if the VM is undersized or oversized (see the sketch after this list).
  • Implement policies on snapshot lifetimes and track snapshot creation requests where possible.
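
To make the 90-day review concrete, here is a hypothetical right-sizing rule of thumb in Python. The formula, safety margin and input figures are invented and should be replaced by whatever your own policy dictates:

```python
# Suggest a vCPU count from observed peak utilization of the current allocation,
# keeping a safety margin on top of the observed peak.
def recommend_vcpu(allocated_vcpu, peak_cpu_percent, safety_margin=1.3):
    """Suggest a vCPU count based on observed peak usage of the allocation."""
    needed = allocated_vcpu * (peak_cpu_percent / 100) * safety_margin
    return max(1, round(needed))

print(recommend_vcpu(allocated_vcpu=8, peak_cpu_percent=20))   # -> 2  (oversized)
print(recommend_vcpu(allocated_vcpu=2, peak_cpu_percent=95))   # -> 2  (about right)
```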

 

In environments where VMs and their allocated resources are chargeable, you should contact your customers to let them know that a VM needs to be resized, or has already been resized (based on your policies and rules of engagement), to ensure they are not billed incorrectly. It is worthwhile to formalize procedures for how VM sprawl management activities will be handled, and to agree with stakeholders on pre-defined downtime windows that allow you to seamlessly carry out any right-sizing activities.

 

Operational

Even with the controls above, sprawl can still happen, and it can be caused by a variety of factors. For example, a batch of VMs provisioned for one project may have passed through the process controls, yet end up sitting idle for months, eating up resources, because the project was delayed or cancelled and no one informed the IT team.

 

In VMware environments where storage is thin provisioned at the array level, and where Storage DRS is enabled on datastore clusters, it’s also important to monitor storage consumption at the array level. While capacity will appear to be freed up at the datastore level after a VM is moved around or deleted, it will not be released on the array, and this can lead to out-of-storage conditions. A manual run of the VAAI UNMAP primitive will be required, ideally outside of business hours, to reclaim the unallocated space. It’s therefore important to have, as part of your operational procedures, a capacity reclamation process that is triggered regularly.

 

The use of virtual infrastructure management tools with built-in resource analysis & reclamation capabilities, such as Solarwinds Virtualization Manager, is a must. By leveraging software capabilities, these tedious analysis and reconciliation tasks are no longer required, and dashboards present IT teams with immediately actionable results.

 

Conclusion

Even with all the good will in the world, VM sprawl will happen. You may have the best policies in place, but your environment is dynamic and, in the rush that IT operations are, you just can’t keep an eye on everything. And this is coming from a guy whose team successfully recovered 22 TB of space occupied by orphaned VMDKs earlier this year.
