Geek Speak

9 Posts authored by: maxmortillaro

In the age of exploration, cartographers navigated the world and mapped the coastlines of unexplored continents. The coastline of IT, and moreover its inner landscapes and features, has become much more complex than it was a decade ago. The cost and effort needed to perform adequate mapping the old way have risen sharply, and manual mapping is no longer an affordable endeavor, let alone a productive one. Organizations and administrators need a solution to the problem, but where should they start?


To continue the analogy, explorers of old had a few things to help them: maps of the known world, navigation instruments, and the stars. They also set sail to discover the vast world and uncover its riches, at a price most of us now know. Back to our modern world: our goal is to understand which services are critical to a business service, and the reason is clear. We want to ensure the delivery of IT services with the best possible uptime and performance, without disruptions if possible.


It’s essential to start from the business service view. Like explorers of old, we need to rely on existing maps and features as reference points. Each organization will have its own way of documenting (hopefully), but the most likely starting point is a service Business Impact Analysis (BIA). The BIA describes the upstream and downstream dependencies of a given service and the application platforms (and eventually named systems) involved in supporting it. From there, we can be led to documentation that describes an application, its components, architecture, and systems.


Creating and maintaining a catalog of business impact analyses diverges from the usual kind of work IT personnel does. It might not even be a purely IT endeavor, as compliance departments in larger organizations may own the process. Nevertheless, it is essential that IT is involved, because a BIA is the ideal place to capture criticality requirements. It helps articulate how a given process or service affects the organization’s ability to conduct business operations, assess how the organization is impacted in case of failure, and determine the steps to recover the service. Capturing adverse impact is a key activity because it helps classify the criticality of the service itself in case of failure. Impact can be financial (loss of revenue, loss of business), reputational (loss of trust from investors/customers/partners, press scrutiny), or regulatory (loss of trust from regulatory bodies/legislative authorities, regulatory scrutiny, regulatory audits, and eventually even revocation of the license to operate in a given country/region for regulated businesses).


The drawback of any BIA or written document is that it is a point-in-time description of a service, cast in stone until the next documentation revision date. It is therefore necessary to engage with the business process owners, and eventually with application teams, to understand whether any changes were introduced. While this allows for a better view of the current state, it has the disadvantage of being a manual process with a lot of back-and-forth interaction. Another challenge we might encounter is that the BIA strictly covers a single process, without mentioning any of the upstream/downstream dependencies, or perhaps mentioning them without referring to any document (because no BIA was done for another service, for example). It might even be impossible to get one done at all, because a given process could rely on a third-party service or data source over which we have no control.


There’s also another challenge looming: shadow IT. Shadow IT broadly characterizes any IT systems that support an organization’s business objectives but fall outside of IT’s scope, either by omission or by a deliberate intent to conceal their existence from IT. Because these systems exist outside of a formally documented scope, or are not known to IT organizations, it is very difficult to assert their criticality, at least from an IT standpoint. Portions of business processes or entire business divisions may be leveraging external or third-party services over which IT has no oversight or control, and yet IT would be held responsible in case of failure.


How can IT understand the criticality of a given application service in the context of a business service when the view is incomplete or even unknown?


  • From a business perspective, the organization’s leadership should assert or reassert IT’s role in the organization’s digital strategy by making IT the one-stop shop for all IT-related matters. Roles and responsibilities must be well established, and the organization’s leadership (CIO/CTO) should take an official stance on how to handle shadow IT projects.
  • From a compliance perspective, clear processes must be established around service and system documentation. The necessity to document business processes and the underlying technical systems/platforms is evident: critical services from a business perspective should be documented via a Business Impact Analysis and collected/regularly reviewed in the documentation that covers the organization’s business continuity strategy (usually a Business Continuity Plan).
  • From a technical perspective, the IT organization should be involved in compliance/documentation processes, not only for review purposes but also to provide the technical standpoint and the necessary technical steps that fall under the Business Continuity/Disaster Recovery strategy.


To encompass these three perspectives, regular checkpoints, meetings, or reviews can help maintain the consistency of the view and the strategy. Is this sufficient, however? Unfortunately, not always. These concepts work perfectly with consistent and stateful processes/systems, but with the gradual advent of ephemeral workloads that can be spun up or scaled down on demand, it becomes difficult to keep full track of them.


While a well-defined documentation framework is necessary to establish processes that must be adhered to, and while documented processes with prioritization and criticality levels are essential, it is also necessary to complement this approach with a dynamic and real-time view of the systems.


Modern IT operations management tools should allow the grouping of assets not only by category or location, but also by logical constructs, such as an application view or even a process view. These capabilities have existed in the past, but grouping was always performed manually. Advanced management platforms should leverage traffic flow monitoring capabilities to understand which systems interact together, and logically group them based on traffic types. This requires a certain level of intelligence built into the tool. For example, in a Windows-based environment, many systems will communicate with the Active Directory domain controllers, or with a Microsoft System Center Configuration Manager installation. The existence of traffic between multiple servers and these servers doesn’t necessarily imply an application dependency. The same could be said of a Linux environment, where traffic flows between many servers and an NTP server or a yum repository. On the other hand, traffic on other ports could hint at application relationships. A web server communicating with another server via port 3306 probably means a MySQL database is being accessed, and would constitute plausible evidence of an application dependency.
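The filtering logic described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the behavior of any specific product; the port lists and host names are assumptions chosen to mirror the AD/NTP versus MySQL examples.

```python
# Ports whose traffic is usually infrastructure "background noise"
# (DNS, Kerberos, NTP, LDAP, SMB) rather than an application dependency.
NOISE_PORTS = {53, 88, 123, 389, 445}

# Ports that plausibly hint at an application-level dependency.
APP_PORT_HINTS = {
    3306: "MySQL",
    5432: "PostgreSQL",
    1433: "SQL Server",
    6379: "Redis",
}

def classify_flow(src, dst, dst_port):
    """Return a (verdict, detail) tuple for one observed traffic flow."""
    if dst_port in NOISE_PORTS:
        return ("infrastructure-noise", None)
    if dst_port in APP_PORT_HINTS:
        return ("likely-app-dependency",
                f"{src} -> {dst} ({APP_PORT_HINTS[dst_port]})")
    return ("unclassified", None)

flows = [
    ("web01", "db01", 3306),  # web server talking to a MySQL port
    ("web01", "dc01", 389),   # LDAP chatter with a domain controller
]

for flow in flows:
    print(classify_flow(*flow))
```

A real platform would of course learn these patterns from observed traffic rather than from a static table, but the principle of separating background chatter from plausible application relationships is the same.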


Knowing which services are critical to a business service doesn’t require the use of a Palantir. It should be a wise blend of relying on solid business processes and on modern IT operations management platforms, with a holistic view of interactions between multiple systems and intelligent categorization capabilities.

No, it’s not the latest culinary invention from a famous Italian chef: spaghetti cabling (a nice wording for cabling inferno) is a sour dish we’d rather not eat. Beyond this unsavory term hides the complexity of many environments that have grown organically, where “quick fixes” have crystallized into permanent solutions, and where data center racks are entangled in cables, as if they had become a modern version of Shelob’s Lair.


These cabling horrors are not works of art. Instead, they prosaically connect systems together to form the backbone of infrastructures that support many organizations. Having dealt with spaghetti cabling in the past, I vividly remember the endless back-and-forth discussions with my colleagues. It usually happened when one of us was trying to identify the switch-port-to-patch-panel connectivity while the other checked whether the system’s network interface was up or down. That then turned into figuring out whether patch panel ports were correctly mapped to wall outlet plug identifiers. All of this to troubleshoot a problem that would have been trivial were it not for careless and unprofessional cabling.


The analogy with other infrastructure assets is very similar: it can be very difficult for administrators to find a needle in the haystack, especially when the asset is not physical and the infrastructure is large. Multi-tiered architectures, or daisy-chained business processes relying on multiple sources of data, increase potential failure points in the data processing stream. This sometimes makes troubleshooting a far more complex endeavor than it used to be due to upstream or downstream dependencies.


One would expect upstream dependencies to impact a system in such a way that it can no longer process data and thus comes to a halt without impacting downstream systems. While this can be a safe assumption, there are also cases where the issue isn’t a hard stop but data corruption, either through handing over incorrect data or through handing over only fragments of usable data. In such occurrences, it is also necessary to identify the downstream systems and stop them to avoid further damage until the core issue has been investigated and fixed.
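Finding every downstream consumer of a failed system is, at its core, a graph traversal. The sketch below assumes a simple dependency map (the system names are invented for illustration) and walks it breadth-first to list everything that should be stopped.

```python
from collections import deque

# Maps each system to the systems that consume its output (downstream).
# These names are hypothetical examples, not a real topology.
downstream = {
    "feed-ingest": ["pricing-db"],
    "pricing-db": ["reporting", "reconciliation"],
    "reporting": [],
    "reconciliation": ["regulator-export"],
    "regulator-export": [],
}

def systems_to_stop(failed):
    """Breadth-first walk of the downstream graph from the failed system."""
    to_stop, seen = [], set()
    queue = deque(downstream.get(failed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        to_stop.append(node)
        queue.extend(downstream.get(node, []))
    return to_stop

print(systems_to_stop("pricing-db"))
# -> ['reporting', 'reconciliation', 'regulator-export']
```

In practice this map would come from a CMDB or a monitoring platform’s discovered relationships rather than a hand-written dictionary, but the traversal logic is the same.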


Thus, there is a real need for mapping the upstream and downstream dependencies of an application. There are cases in which it’s preferable to bring an entire system to a halt rather than risk financial losses (and eventually litigation, not to mention sanctions), if incorrect data makes its way into production systems. In that case, it would ultimately impact the quality of a manufactured product (think critical products, such as medicines, food, etc.) or a data batch meant for further consumption by a third party (financial reconciliation data, credit ratings, etc.).


Beyond troubleshooting, it’s crucial for organizations to have an end-to-end vision of their systems and assets, preferably in a System of Record. This could be for inventory purposes or for management processes, whether based on ITIL or not. The IT view is not always the same as the business view, but both are bound by a common goal: to help the organization deliver on its business objectives. The service owner will focus on business and process outcomes, while the IT organization will usually focus on uptime and quality of service. Understanding how assets are grouped and interact together is key to maintaining fast reaction capabilities, if not acting proactively to avoid outages.


There is no magic recipe to untangle the webs of spaghetti cabling. However, advanced detection/mapping capabilities, combined with existing information in the organization, should help IT and the business obtain a precise map of existing systems and, with a little detective work, understand how data flows in and out of them.


In our view, the following activities are key enablers to obtain full-view clarity on the infrastructure:

  • Business service view: the business service view is essential in understanding the dependencies between assets, systems, and processes. Existing service maps and documentation, such as business impact assessments, should ideally contain enough information to capture the process view and system dependencies.


  • Infrastructure view: it is advisable to rely on infrastructure monitoring tools with advanced mapping / relationship / traffic-flow analysis capabilities. These can be used to complement/validate existing business service views listed above (for lucky administrators / IT departments), or as a starting point to map traffic flows first, then reach out to business stakeholders to formalize the views and system relationships.


  • Impact conditions and parent-child relationships: these would usually be captured in a System of Record, such as a CMDB, but might also be available in a monitoring system. An event impacting a parent asset would usually cascade down to child assets.


  • Finally, regular service mapping review sessions between IT and business stakeholders are advised to assert any changes.


Taken in its tightest interpretation, the inner circle of handling “spaghetti cabling” problems should remain within the sphere of IT Operations Management. However, professional and conscientious system administrators will always look at how to improve things, and will likely expand into the other activities described above.


In our view, it is an excellent way to further develop one’s skills. First, by going above and beyond one’s scope of activities, it can help us build a track record of dependability and reliability. Second, engaging with the business can help us foster our communication skills and move from a sometimes tense and frail relationship to building bridges of trust. And finally, the ability to understand how IT can contribute to the resolution of business challenges can help us move our vision from a purely IT-centric view to a more holistic understanding of how organizations work, and how our prioritization of certain actions can help better grease the wheels.

I love watching those modern movies where IT works magically. In these movies, any average Joe with access to a computer or terminal can instantly access anything with seamless effort. Everything is clear and neat: we’re presented with impeccably clean data centers, long alleys of servers with no spaghetti cables lying around, and operators who know pretty much every single system, address, and login.


For a sizeable portion of the IT workforce, though, this idyllic vision of the IT world is pure fiction. I see an interesting parallel with the IT world that we know. No matter how well intentioned we are, or how meticulous we are about tracking our infrastructure and virtual machines, we will eventually lose track at some point, especially if we do not have some kind of automated solution that, at the very least, regularly scans our infrastructure.


In my previous post, I discussed business services and their challenges in complex environments, and explained how critical it is to properly map business service dependencies through a Business Impact Analysis (BIA), among other tools. To be able to look at systems from a business service perspective, we have to readjust our vision to see the hidden wavelengths of the business services spectrum. Let’s put on our X-ray business vision glasses and look beyond the visible IT infrastructure spectrum.


Why is it so complicated to keep track of business systems? Shouldn’t we just have a map of everything, automatically generated and up-to-date? And shouldn’t each business service be accountable for their own service / system map?


In an idyllic world, the answer would be yes. However, with our busy lives and a focus on delivering, attention sometimes slips and priorities get adjusted accordingly, especially in reactive environments. Lack of documentation, high personnel turnover, and the absence of proper handover/knowledge transfer sessions can all contribute to losing track of existing systems. But even in well-intentioned cases, we may have some misses. It could be that new project your deputy forgot to tell you about while you were on holiday, where a couple dozen systems were onboarded into IT support because everyone was too busy firefighting. Or it could be that recently purchased smaller company, where nobody informed IT that some new IT systems must be supported. Or, again, that division which now runs the entire order processing system directly on a public cloud provider’s infrastructure.


Finger pointing aside, and looking at the broader scope, there has to be a way to do a better job at identifying those unknown systems, especially in the context of an IT organization that is oriented towards supporting business services. Mapping business service dependencies should be a collaborative effort between IT and business stakeholders, where the goal is to cover end-to-end all of the IT systems that participate in a given business process or function. In an ideal world, this mapping activity should be conducted through various interviews with individual stakeholders, service line owners and key contributors to a given process.


It is, however, difficult to achieve 100% success in such activities. A real-life example I encountered during my career was the infamous “hidden innocuous desktop computer under a table”: a regular-looking machine running a crazy Excel macro that pumped multicast Reuters financial feeds and pushed the data into an SQL server, which was then queried by an entire business division. This innocuous desktop was a critical component of a business activity with a turnover of approximately 500 million USD per day. With this computer off, the risks were regulatory fines, loss of reputation, and loss of business from disgruntled customers. The component was eventually virtualized once we figured out it existed. But for each one found, how many are left lying around in cabinets and under tables?


Organizations have evolved, and long gone is the time when a monolithic core IT system was used across the company. Nowadays, a single business service may rely on multiple applications, systems, and processes, and the opposite is also true: one application may service multiple business services. Similarly, the traditional boundaries between on-premises and off-premises systems have been obliterated. Multi-tiered applications may run on different infrastructures, with portions sometimes out of sight from an IT infrastructure perspective.


In this complex, entangled, and dynamic world, the ability to document and maintain one-to-one relationships between systems requires exponentially more time and is at best a difficult endeavor. So how do we cope with this issue and where do we go next?


Not all business services and processes are created equal: some are critical to the organization and others are secondary. A process that requires data to be collated and analyzed once every two weeks for regulatory compliance has a lower priority than a process that handles manufacturing and shipping. This classification is essential because it is fundamentally different from the IT infrastructure view. In IT operations, a P1 incident often indicates downtime and service unavailability for multiple stakeholders. That incident classification already derives from multiple inputs, such as the number of affected users and whether the environment is production or not. But with additional context, such as business service priority, it becomes easier to effectively assess the impact and better manage expectations, resource assignment, and resolution.
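To make the idea concrete, here is a hedged sketch of how business service criticality could be blended with the usual technical inputs to derive an incident priority. The weights, thresholds, and P1/P2/P3 labels are illustrative assumptions, not any vendor's actual scoring model.

```python
def incident_priority(affected_users, is_production, business_criticality):
    """Derive a priority label from technical and business inputs.

    business_criticality: 1 (critical) .. 3 (secondary) -- an assumed scale.
    """
    score = 0
    # Technical signals: user impact and environment.
    score += 2 if affected_users > 100 else 1 if affected_users > 10 else 0
    score += 2 if is_production else 0
    # Business context: critical services weigh heavily.
    score += {1: 3, 2: 1, 3: 0}[business_criticality]
    if score >= 6:
        return "P1"
    if score >= 4:
        return "P2"
    return "P3"

# A production outage on a critical business service affecting many users:
print(incident_priority(500, True, 1))  # -> P1 (score 2 + 2 + 3 = 7)
# The same technical footprint on a secondary service:
print(incident_priority(500, True, 3))  # -> P2 (score 2 + 2 + 0 = 4)
```

The point of the example is the last two lines: identical technical symptoms yield different priorities once business service priority is part of the calculation.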


Automated application mapping and traffic flow identification capabilities in monitoring systems are essential for mapping system dependencies (and thus business service dependencies) and avoiding situations like the ones described above. Moreover, tools that allow the organization to break away from the classical IT infrastructure view and incorporate a business services view are the most likely to succeed.

Business services and infrastructure services have divergent interests and requirements: business services do not focus on IT. They may leverage IT, but their role is to be a core enabler for the organization to execute on its business strategy, i.e., delivering tangible business outcomes to internal or external customers that help the organization move forward. One business service could focus on the timely analysis and delivery of market data within an organization to drive its strategy; another could allow external customers to make online purchases.


Infrastructure services will instead focus on providing and managing a stable and resilient infrastructure platform to run workloads. It will not necessarily matter to the organization whether these are running on-premises or off-premises. What the organization leadership expects from infrastructure services (i.e. IT) is to ensure business services can leverage the infrastructure to execute whatever is needed without any performance or stability impact.

Considering that the audience is very familiar with infrastructure services, we will focus the discussion here on what business services are and what makes them so sensitive to any IT outages or performance degradation.


Business services, while seemingly independent, are very often interconnected with other organization IT systems, and sometimes even with third-party interfaces. A business service can thus be seen (from an IT perspective and at an abstract level) as a collection of systems that expect inputs from humans or other sources of information, perform processing activities, and deliver outputs (again, either to humans or to other systems).


One of the challenges with business services lies in the partitioning of their software components: not everybody may know the “big picture” of which components are required to make the entire service/process work. Within the business service, there will usually be a handful of individuals who have been around long enough to know the big picture, but this may not always be properly documented. The impossibility, inability, or even lack of awareness that the upstream and downstream dependencies of an entire business service must be properly documented is often the culprit behind extended downtimes with laborious investigation and recovery activities.


In the author’s view, one of the ways to properly map the dependencies of a given business service is to perform a Business Impact Analysis (BIA) exercise. The BIA is interesting because it covers exactly the business service from a business perspective: what is the financial and reputational impact, how much money would be lost, what happens to employees, will the organization be fined or even worse have its business license revoked?


Beyond these questions, it also delves into identifying all of the dependencies required to make the business service run. These might be the availability of infrastructure services and qualified service personnel, but also the availability of upstream sources, such as data streams necessary for the business service to execute its processes. Finally, the BIA also looks at the broader picture: if a location is lost because of a major disaster, it may no longer make sense to “care” about a given business service or process when priorities have shifted elsewhere.


Depending on the size of the organization, its business focus and the variety of business services it delivers, the ability to map dependencies will greatly vary. Smaller organizations that operate in a single industry vertical might have a simplified business services structure and thus a simpler underlying services map, coupled with easier processes. Larger organizations, and especially regulated ones (think of the financial or pharmaceutical sectors, for example), will have much more complex structures which impact business services.


Keeping in mind the focus is on business services in the context of upstream/downstream dependencies, complexities can be induced by the following:

  • organizational structure (local sites vs. headquarter)
  • regulatory requirements (necessity to take into account in business processes the requirement to provide outputs to their regulatory body)
  • environmental requirements - production processes depend on external factors (temperature/humidity, quality grade of raw materials, etc.)
  • availability of upstream data sources & dependency on other processes (inability to invest if market data is not available, inability to manufacture drugs if environmental info is missing, inability to reconcile transaction settlements etc.)


In these complex environments, the cause of a disruption to a business service may not be immediately evident, so adequate service mapping will help, especially in the context of a BIA. Needless to say, getting this done may not always be a walk in the park, especially if the key members of the organization who were the only ones to understand the full context are gone. It can be much worse in the case of a disaster or an unfortunate life incident (the author has experienced this in at least two organizations).


What about IT / infrastructure services, and how can they help with the challenges of business services? It would be wrong to assume that IT is the panacea to all problems and the all-seeing-eye of an organization. There is however a tendency to assume that because business services execute on top of infrastructure services, IT has an all-encompassing view of which application servers are interacting with which databases, and this leads organizations to believe that only IT can fully map a business service.


The belief holds partially true: IT organizations that leverage advanced monitoring solutions are able to map the majority of infrastructure/application dependencies and view traffic flows between systems. In our view, these solutions should always be leveraged because they drastically improve the MTTR (Mean Time To Resolution) of an incident. Nevertheless, in the context of a BIA and of the business view of services, we believe that while IT should definitely be a contributor to business service mapping, it should not be the owner of such plans. A full view of business services requires the organization not only to incorporate IT’s inputs, but also to capture the entire process flow for any given business process, to understand which inputs are required and which outputs are provided, as those may not always end in a handshake with an IT infrastructure service process.

The use of cloud technology and services, especially public cloud, has become nearly ubiquitous; it has made its way into even the most conservative organizations. Some assume that supporting the service after adoption is no longer IT’s problem: supportability resides with the public cloud provider, and the business unit that decided to leverage public cloud is on its own (and, while we’re at it, well done for them, since they didn’t want to use our own internal infrastructure, or private cloud if we’re a more advanced organization).


Sometimes It Isn't Up to IT

But to what extent does this binary (and somewhat logical) vision of things hold true? The old adage that says, "If it has knobs, it’s supported by our internal IT department," is once again proving correct. Even with public cloud, an infrastructure that is (hopefully) managed by a third-party provider, there is very little chance that our organization will exonerate us from the burden of supporting any applications that run in the cloud. Chances are even slimmer for IT to push back on management decisions: they may seem inconsiderate from an IT perspective, but make sense (for better or worse) from a business perspective.


Challenges Ahead

With business units’ entitlement to leverage cloud services comes the question of which public clouds will be used, or rather the probability that multiple cloud providers will be used without any consideration of IT supportability. This makes it very difficult for IT to support and monitor the availability of services without having IT operations jump from cloud provider A’s monitoring console to their on-premises solution, and then back to cloud provider B’s own pane of glass.


With that comes the question of onboarding IT personnel into each of the public cloud providers' IAM (Identity & Access Management) platforms and managing different sets of permissions for each application and each platform. This adds heavy and unnecessary management overhead on top of IT’s responsibilities.


And finally comes the relevance of monitoring the off-premises infrastructure with off-premises tools, such as those provided by public cloud operators. One potential issue, although unlikely, is the unavailability of the off-premises monitoring platform, or a major outage at the public cloud provider. Another issue could be, in the case where an internal process relies on an externally hosted application, that the off-premises application reports as being up and running at the public cloud provider, and yet is unreachable from the internal network.


The option of running an off-premises monitoring function exists, but it presents several risks. Beyond the operational risk of being oblivious to what is going on during a network outage or dysfunction (either because access to the off-premises platform is unavailable, or because the off-premises solution cannot see the on-premises infrastructure), there is the more serious and insidious threat of exposing an organization’s entire network and systems topology to a third party. While this may be a minor problem for smaller companies, larger organizations operating in regulated markets may think twice about exposing their assets and will generally favor on-premises solutions.


Getting Cloud Monitoring Right

Cloud monitoring doesn’t differ from traditional on-premises infrastructure monitoring, and shouldn’t constitute a separate discipline. In the context of hybrid IT, where boundaries between on-premises and off-premises infrastructures dissolve to place applications at the crossroads of business and IT interests, there is intrinsic value to be found with on-premises monitoring of cloud-based assets.


A platform-agnostic approach to monitoring on-premises and cloud assets via a unified interface, backed by consistent naming of metrics and attributes across platforms, will help IT operators instantly understand what is happening, regardless of the infrastructure in which the issue occurs, and without necessarily having to learn the taxonomy imposed by a given cloud provider.
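The consistent-naming idea can be sketched as a simple translation layer. The mapping below is illustrative only: the provider metric names shown are examples of the kind of divergence operators face, and the canonical `cpu.utilization.percent` taxonomy is an assumption for this sketch.

```python
# Provider-specific metric names mapped to one canonical taxonomy.
METRIC_MAP = {
    ("aws", "CPUUtilization"): "cpu.utilization.percent",
    ("azure", "Percentage CPU"): "cpu.utilization.percent",
    ("onprem", "cpu_busy_pct"): "cpu.utilization.percent",
}

def normalize(provider, metric, value):
    """Translate a provider-specific sample into the unified taxonomy."""
    canonical = METRIC_MAP.get((provider, metric))
    if canonical is None:
        # Keep unknown metrics, but flag their origin for later mapping.
        canonical = f"unmapped.{provider}.{metric}"
    return {"metric": canonical, "value": value, "source": provider}

print(normalize("aws", "CPUUtilization", 73.2)["metric"])
print(normalize("azure", "Percentage CPU", 73.2)["metric"])
# Both print the same canonical name: cpu.utilization.percent
```

With every sample normalized on ingestion, dashboards and alert rules can be written once against the canonical names instead of once per cloud provider.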


IT departments can thus attain a holistic view that goes beyond infrastructure silos or the inherent differences between clouds, and focus on delivering the value the business expects from them: guaranteeing the availability and performance of business systems regardless of their location, and ensuring the monitoring function is not impacted by external events, all while respecting SLAs and maintaining control over their infrastructure.

Following my review of SolarWinds Virtualization Manager 6.3, the fair folks at SolarWinds gave me the opportunity to get my hands on their next planned release, namely VMAN 6.4. While there is no official release date yet, I would bet on an announcement within Q4 2016. The version I tested is 6.4 Beta 2. So what’s new in this release?


From a UI perspective, VMAN 6.4 is very similar to its predecessor. As with VMAN 6.3, you install the appliance and either install VIM (the Virtual Infrastructure Monitor component) on a standalone Windows Server, or integrate with an existing Orion deployment if you already use other SolarWinds products. You’d almost think nothing has changed until you head over to the “Virtualization Summary” page. The new killer feature of VMAN 6.4 is called “Recommendations,” and while it seems like a minor UI improvement, there’s much more to it than meets the eye.


While in VMAN 6.3 you are presented with a list of items requiring your attention (over/under-provisioned VMs, idle VMs, orphaned VMDK files, snapshots, etc. – see my previous review), in VMAN 6.4 all of these items are aggregated in the « Recommendations » view.


Two types of recommendations exist: Active and Predicted. Active recommendations are immediate recommendations correlated with issues currently showing up in your environment. If you are experiencing memory pressure on a given host, an active recommendation might propose moving one or more VMs to another host to balance the pressure. Predicted recommendations, on the other hand, focus on proactively identifying potential issues before they become a concern, based on usage history in your environment.


The « Recommendations » feature is very pleasant to use and introduces a few elements that are quite important from a virtualisation administrator perspective:


  • First of all, administrators have the possibility to apply a recommendation immediately or schedule it for a later time (out of business hours, change windows, etc.)
  • Secondly, an option is offered to either power down a VM to apply the recommendation or to attempt to apply the recommendation without any power operations. This feature comes in handy if you need to migrate VMs, as you may run into cases where a Power Off/Power On is required, while in other cases a vMotion / live migration will suffice
  • Last but not least, the « Recommendations » module will check whether the problem still exists before actually applying a recommendation. This makes particular sense for active recommendations that may no longer be relevant by the time you decide to apply them (for example, if you schedule a recommendation but the issue is no longer reported by the scheduled time)


A nice and welcome touch in the UI is a visual aid that shows up when hovering your mouse over the proposed recommendations. You will see a simple, readable graphical simulation of the before and after status of any given object (cluster, datastore, etc.) should you decide to apply the recommendation.


Max’s take


The “Recommendations” function, while apparently modest from a UI perspective, is in fact an important improvement that goes beyond the capacity reclamation and VM sprawl controls included in VMAN 6.3. Administrators are now presented with actionable recommendations that are relevant not only in the context of immediate operational issues, but also as countermeasures to prevent future bottlenecks and capacity issues.


A few side notes: if you plan to test the beta version, reach out to the SolarWinds engineers. The new “Recommendations” function is still being fine-tuned, and you may not be able to see it if you integrate with your current VIM or Orion environment. Once you install VMAN 6.4, you should let it run for approximately a week in order to get accurate recommendations.

In my previous posts, I covered how to manage VM sprawl and how to do proper capacity planning. In this post, I would like to share my experience with SolarWinds Virtualization Manager 6.3.


Today’s virtualized data centers are dynamic environments where a myriad of changes (provisioning, snapshotting, deletions, etc.) executed by numerous people are difficult to track. Furthermore, virtualization has lifted the traditional physical limits on resource consumption, where a workload was bound to a physical server. For consumers, even private data centers have turned into magical clouds where unicorns graze and endlessly expand existing capacity in a flash of rainbows. Unfortunately, down-to-earth administrators know that unlike the universe, data centers have a finite amount of resources available for consumption.


Maintaining a healthy data center environment while attempting to satisfy consumers is therefore a challenge for many administrators. Let’s see how SolarWinds Virtualization Manager 6.3 helps tackle this challenge.


VM Sprawl Defeated

As highlighted in my previous article, even with the best intentions in the world, some organic sprawl will still make its way into your data center, no matter how carefully you cover your back with processes. The VM Sprawl console of VMAN 6.3 allows administrators to immediately see sprawl-related issues and address them before they start causing serious performance issues.


The VM sprawl dashboard covers the following sprawl issues:

  • Oversized / Undersized VMs
  • VMs with large snapshots
  • Orphaned VMDKs (leftover VMDK files not linked to any existing VM)
  • VMs suffering from high co-stops
  • Idle VMs
  • Powered Off VMs


While it’s good to detect sprawl issues, it’s even better to address them as soon as possible. What I find clever with VMAN 6.3 is that for all of the issues enumerated above, administrators can remediate directly from within VMAN, without having to jump from the monitoring tool to their vSphere client or PowerShell. The amount of information provided in each panel is adequate, leaving no ambiguity when identifying the culprits and remediating the problems.


I like the fact that everything is presented in a single view: there is no need to run reports here and there to determine how VMs should be right-sized, and no treasure hunting to find orphaned VMDK files.



Doing Capacity Planning

VMAN 6.3 has a dedicated Capacity Planning dashboard that highlights current resource consumption and trends/expected depletion dates for CPU, RAM, and storage, as well as network I/O usage. Here again, a simple but complete view of what matters: do I still have enough capacity? Is a shortage in sight? When should I start making preparations to procure additional capacity?


Besides the Capacity Planning dashboard, VMAN 6.3 is equipped with a Capacity Planner function that enables administrators to simulate the outcome of a wide variety of “what-if” scenarios, with the necessary granularity. I do appreciate the ability to use three options for modeling: peak, 95th percentile, and 75th percentile. Peak takes usage spikes into consideration, which can be necessary in some cases if the workloads cannot tolerate any contention or resource constraints. The latter two make it possible to “smoothen” the data used for modeling by eliminating usage spikes from the calculation. While the benefit may not be immediately apparent in smaller environments, it can have a decisive financial impact on larger clusters.
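To see why the choice of model matters, here is a minimal sketch (not VMAN’s actual algorithm) comparing the three options on the same synthetic usage history:

```python
import numpy as np

# One sample per 5-minute interval over a week: a steady ~40% CPU load
# with a handful of 95% spikes, standing in for collected usage history.
rng = np.random.default_rng(42)
usage = np.clip(rng.normal(40, 5, 2016), 0, 100)
usage[rng.choice(2016, 20, replace=False)] = 95  # rare spikes

peak = usage.max()
p95 = np.percentile(usage, 95)
p75 = np.percentile(usage, 75)

print(f"peak: {peak:.1f}%  95th: {p95:.1f}%  75th: {p75:.1f}%")
# Sizing on peak reserves headroom for every spike; the percentile models
# ignore the rarest spikes and yield a smaller (cheaper) capacity estimate.
```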


A corollary to the capacity planning activities is the Showback dashboard. Provided that you have organized your resources in folders, you are able to show users what they are actually consuming. You can also run chargeback reports where you define pricing for consumed resources. These can be helpful not only from a financial perspective but also from a political one, as they help, in most mentally stable environments, to bring back a level of awareness and accountability into how resources are consumed. If a division has successfully deployed their new analytics software which ends up starving the entire environment, showback/chargeback will be decisive in explaining the impact of their deployment (and obtaining or coercing their eventual financial contribution to expanding capacity).
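As a rough illustration of the chargeback idea, with entirely hypothetical unit rates and VM names (real rates would come from your own cost model, not from VMAN):

```python
# Hypothetical monthly unit rates for consumed resources.
RATES = {"vcpu": 15.0, "ram_gb": 4.0, "storage_gb": 0.10}

def monthly_charge(vms):
    """Sum consumption-based charges for a list of VM records."""
    total = 0.0
    for vm in vms:
        total += (vm["vcpu"] * RATES["vcpu"]
                  + vm["ram_gb"] * RATES["ram_gb"]
                  + vm["storage_gb"] * RATES["storage_gb"])
    return total

# The analytics division from the example above, post-deployment:
analytics_div = [
    {"name": "analytics-01", "vcpu": 8, "ram_gb": 64, "storage_gb": 500},
    {"name": "analytics-02", "vcpu": 8, "ram_gb": 64, "storage_gb": 500},
]
print(f"Analytics division: ${monthly_charge(analytics_div):,.2f}/month")
```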


Going further

Time Travel, a feature which correlates alerts with the flow of time, is a powerful aid in troubleshooting and performing root cause analysis. By snapshotting environment metrics at regular intervals, you are able to understand which events were happening at a given point in time. The sudden performance degradation of a VM becomes easier to investigate by reviewing what happened in parallel. Now you can determine whether the issue was caused by intensive I/O on a shared storage volume, or by extremely high network traffic causing congestion problems.


VMAN 6.3: The Chosen One?

VMAN 6.3 provides an end-to-end virtualization management experience that covers not only analysis, correlation, and reporting but also actionable insights. It empowers administrators with the necessary tools to have a full overview of their data center health. Last but not least, the integration with the SolarWinds Orion platform and other management software from SolarWinds (Network Performance Monitor, Database Performance Analyzer, etc.) provides enterprises with a true and unique single pane of glass experience (a term I use extremely rarely due to its abuse) to monitor their entire data center infrastructure.


So is SolarWinds the Chosen One that will bring balance to the data center? No, you – the administrator – are the Chosen One. But you will need the help of the SolarWinds Force to make the prophecy become a reality. Use it wisely.

Capacity Planning 101

The objective of Capacity Planning is to adequately anticipate current and future capacity demand (resource consumption requirements) for a given environment. This helps to accurately evaluate demand growth, identify growth drivers and proactively trigger any procurement activities (purchase, extension, upgrade etc.).


Capacity planning is based primarily on two items. The first is analyzing historical data to obtain organic consumption and growth trends. The second is predicting the future by analyzing the pipeline of upcoming projects, also taking into consideration migrations and hardware refreshes. IT and the business must work hand-in-hand to ensure that any upcoming projects are well known in advance.
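The historical-trend half of the equation can be as simple as fitting a line to past consumption and extrapolating to the depletion date. A minimal sketch with made-up numbers:

```python
from datetime import date, timedelta
import numpy as np

# Twelve months of observed storage consumption in TB (organic growth).
months = np.arange(12)
used_tb = np.array([40, 41.5, 43, 44.2, 46, 47.5, 49, 50.8, 52, 54, 55.5, 57])
capacity_tb = 80.0

# Fit a linear trend: used = slope * month + intercept
slope, intercept = np.polyfit(months, used_tb, 1)

# Months until the trend line crosses total capacity (30.44 = avg days/month)
months_to_full = (capacity_tb - intercept) / slope
depletion = date(2016, 1, 1) + timedelta(days=30.44 * months_to_full)
print(f"~{slope:.2f} TB/month; capacity reached around {depletion}")
```

This is the organic-growth baseline only; project-driven demand from the pipeline must be layered on top of it.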


The Challenges with Capacity Planning or “the way we’ve always done it”


Manual capacity planning by running scripts here and there, exporting data, compiling it, and leveraging Excel formulas can work. However, it is limited by one’s available time, and it comes at the expense of focusing on higher-priority issues.


The time spent manually parsing, reconciling, and reviewing data can be nothing short of a huge challenge, if not a waste of time. The larger an environment grows, the larger the dataset becomes and the longer it takes to prepare capacity reports. And the more manual the work is, the more prone it is to human error. While it’s safe to assume that any person with Excel skills and a decent set of instructions can generate capacity reports, the question remains about their accuracy. It’s also important to point out that new challenges have emerged for those who like manual work.


Space saving technologies like deduplication and compression have complicated things. What used to be a fairly simple calculation of linear growth based on growth trends and YoY estimates is now complicated by non-linear aspects such as compression and dedupe savings. Since both compression and deduplication ratios are dictated by the type of data as well as the specifics of the technology (see in-line vs. at-rest deduplication, as well as block size), it becomes extremely complicated to factor this into a manual calculation process. Of course, you could “guesstimate” compression and/or deduplication factors for each of your servers. But the expected savings can also fail to materialize for a variety of reasons.
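A small sketch of why data reduction breaks linear projections: the same array holding the same logical data can have wildly different remaining capacity depending on the observed reduction ratio. The ratio here is an input you measure on your own workload, not a constant you can safely guess up front:

```python
def effective_free_tb(raw_capacity_tb, logical_written_tb, reduction_ratio):
    """Estimate remaining logical capacity given an observed data
    reduction ratio (dedupe + compression combined)."""
    physical_used = logical_written_tb / reduction_ratio
    physical_free = raw_capacity_tb - physical_used
    # If the ratio holds for future data, each free physical TB absorbs
    # `reduction_ratio` logical TB -- a big "if", since the ratio depends
    # on the data itself (databases, encrypted data, media compress poorly).
    return physical_free * reduction_ratio

# Same 100 TB array, same 180 TB of logical data, two plausible outcomes:
print(effective_free_tb(100, 180, 3.0))  # optimistic 3:1 ratio
print(effective_free_tb(100, 180, 1.8))  # pessimistic 1.8:1 ratio
```

With a 3:1 ratio the array still has room for 120 TB of logical data; at 1.8:1 the very same array is already full, which is exactly the kind of non-linearity a manual spreadsheet projection tends to miss.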


Typical mistakes in capacity management and capacity planning involve space reclamation activities at the storage array level – or rather, the lack of awareness and activity on the matter. Monitoring storage consumption at the array level without relating it to the way storage has been provisioned at the hypervisor level may result in discrepancies. For example, not running Thin Provisioning Block Space Reclamation (through the VMware VAAI UNMAP primitive) in VMware environments may lead some individuals to believe that a storage array is reaching critical capacity levels while in fact a large portion of the allocated blocks is no longer active and can be reclaimed.
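The discrepancy can be estimated by comparing what the hypervisor reports as used against what the array reports as allocated. A hedged sketch with illustrative numbers (nothing here is pulled from a real array or vCenter API):

```python
datastores = [
    # (name, TB used at the VMFS/hypervisor level, TB allocated at the array level)
    ("ds-prod-01", 6.2, 9.8),
    ("ds-prod-02", 4.1, 4.3),
    ("ds-test-01", 1.0, 3.5),
]

# Dead space = blocks still allocated on the array but no longer used by
# the hypervisor; this is what a VAAI UNMAP run would give back.
flagged = []
for name, vmfs_used, array_alloc in datastores:
    reclaimable = array_alloc - vmfs_used
    if reclaimable > 1.0:  # arbitrary 1 TB alerting threshold
        flagged.append((name, reclaimable))
        print(f"{name}: ~{reclaimable:.1f} TB likely reclaimable via UNMAP")
```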


Finally, in manual capacity planning, any attempt to run “what-if” scenarios (adding n VMs with a given usage profile for a new project) is a wild guess at best. Even with the best intentions and focus, you are likely to end up either with an under-provisioned environment under resource pressure, or with an over-provisioned environment full of idle resources. While the latter is preferable, it is still a waste of money that might have been invested elsewhere.


Capacity Planning – Doing It Right


As we’ve seen above, the following factors can cause incorrect capacity planning:

  • Multiple sources of data collected in different ways
  • Extremely large datasets to be processed/aggregated manually
  • Manual, simplistic data analysis
  • Key technological improvements not taken into account
  • No simple way to determine effects of a new project into infrastructure expansion plans


Additionally, all of the factors above are also prone to human errors.


Because processing such data manually is nearly impossible and highly inefficient, precious allies such as SolarWinds Virtualization Manager are required to identify real-time issues, bottlenecks, potential noisy neighbors, and wasted resources. Once these wasted resources are reclaimed, capacity planning can provide a better evaluation of the actual estimated growth of your environment.


Capacity planning activities are not just about looking into the future, but also about managing the environment as it is now. The link between Capacity Planning and Capacity Reclamation activities is crucial. Just as you want to keep your house tidy before planning an extension or improving it with new furniture, the same needs to be done with your virtual infrastructure.


Proper capacity planning should factor in the following items:

  • Central, authoritative data source (all the data is collected by a single platform)
  • Automated data aggregation and processing through software engine
  • Advanced data analysis based on historical trends and usage patterns
  • What-If scenarios engine for proper measurement of upcoming projects
  • Capacity reclamation capabilities (Managing VM sprawl)




Enterprises must consider whether capacity planning done “the way we’ve always done it” is adding any value to their business, or rather becoming the Achilles’ heel of their IT strategy. Because of its criticality, capacity planning should not be considered a recurring manual data collection/aggregation chore assigned to “people who know Excel”. Instead, it should be run as a central, authoritative function that measures current usage, informs about potential issues, and provides key insights to plan future investments in time.

What is VM sprawl?

VM sprawl is defined as a waste of resources – compute (CPU cycles and RAM) as well as storage capacity – due to a lack of oversight and control over VM resource provisioning. Because of its uncontrolled nature, VM sprawl has adverse effects on your environment’s performance at best, and can lead to more serious complications (including downtime) in constrained environments.


VM Sprawl and its consequences

Lack of management and control over the environment will cause VMs to be created in an uncontrolled way. This means not only the total number of VMs in a given environment, but also how resources are allocated to these VMs. You could have a large environment with minimal sprawl, but a smaller environment with considerable sprawl.


Here are some of the factors that cause VM sprawl:


  • Oversized VMs: VMs which were allocated more resources than they really need. Consequences:
    • Waste of compute and/or storage resources
    • Over-allocation of RAM will cause ballooning and swapping to disk if the environment falls under memory pressure, which will result in performance degradation
    • Over-allocation of virtual CPU will cause high co-stops: a VM must wait for CPU cycles to be available on all of its assigned physical cores at the same moment, and the more vCPUs it has, the less likely it is that all those cores will be free simultaneously
    • The more RAM and vCPU a VM has, the higher is the RAM overhead required by the hypervisor.


  • Idle VMs: VMs up and running, not necessarily oversized, but unused and showing no activity. Consequences:
    • Waste of compute and/or storage resources, plus RAM overhead at the hypervisor level
    • Resources wasted by idle VMs may impact CPU scheduling and RAM allocation while the environment is under contention
  • Powered-off VMs and orphaned VMDKs eat up storage capacity
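For reference, vSphere exposes co-stop as milliseconds accumulated over each sampling interval, and a common sanity check is to convert that into a percentage. A minimal sketch (the per-vCPU normalization is one common convention, and the ~3% attention threshold is a practitioner rule of thumb, not an official limit):

```python
def costop_percent(costop_ms, interval_s=20, vcpus=1):
    """Convert a summed co-stop value (ms accumulated over one sampling
    interval, as vSphere reports it) into a percentage. The 20-second
    default matches vCenter's real-time sampling interval."""
    return costop_ms / (interval_s * 1000 * vcpus) * 100

# An 8-vCPU VM that accumulated 4000 ms of co-stop in a 20 s interval:
pct = costop_percent(4000, vcpus=8)
print(f"co-stop: {pct:.1f}%")  # sustained values above ~3% usually warrant attention
```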



How to Manage VM sprawl

Controlling and containing VM sprawl relies on process and operational aspects. The former covers how one prevents VM sprawl from happening, while the latter covers how to tackle sprawl that happens regardless of controls set up at the process level.



On the process side, IT should define standards and implement policies:


  • Role-Based Access Control, which defines roles & permissions on who can do what. This will greatly help reduce the creation of rogue VMs and snapshots.
  • Define VM categories and acceptable maximums: while not all VMs can fit in one box, standardizing on several VM categories (application, database, etc.) will help filter out bizarre or oversized requests. Advanced companies with self-service portals may want to restrict/categorize which VMs can be created by which users or business units
  • Challenge any oversized VM request and demand justification for potentially oversized VMs
  • Allocate resources based on real utilization. You can propose a policy where a VM’s resources are monitored for 90 days, after which IT can adjust the resource allocation if the VM is undersized or oversized.
  • Implement policies on snapshot lifetimes and track snapshot creation requests if possible
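The 90-day right-sizing policy above can be reduced to a simple rule. A sketch with illustrative thresholds (the 95th-percentile input and 30% headroom are policy choices of my own, not VMAN defaults):

```python
def rightsize_vcpu(allocated_vcpu, p95_cpu_util_pct, headroom=1.3):
    """Suggest a vCPU count from the 95th-percentile CPU utilization
    observed over the monitoring window, keeping ~30% headroom."""
    needed = allocated_vcpu * (p95_cpu_util_pct / 100) * headroom
    return max(1, round(needed))

# After the 90-day observation window: an 8-vCPU VM whose 95th-percentile
# CPU utilization never exceeded 20% can be trimmed down, while a 4-vCPU
# VM running at 90% is a candidate for more resources.
print(rightsize_vcpu(8, 20))
print(rightsize_vcpu(4, 90))
```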


In certain environments where VMs and their allocated resources are chargeable, you should contact your customers to let them know that a VM needs to be resized or was already resized (based on your policies and rules of engagement) to ensure they are not billed incorrectly. It is worthwhile to formalize procedures for how VM sprawl management activities will be covered, and to agree with stakeholders on pre-defined downtime windows that will allow you to seamlessly carry out any right-sizing activities.



Even with the controls above, sprawl can still happen, and it can be caused by a variety of factors. For example, a batch of VMs provisioned for one project may pass through all the process controls, yet sit idle for months eating up resources because the project ended up delayed or cancelled and no one informed the IT team.


In VMware environments where storage is thin provisioned at the array level, and where Storage DRS is enabled on datastore clusters, it’s also important to monitor storage consumption at the array level. While storage capacity will appear to be freed up at the datastore level after a VM is moved around or deleted, it will not be released on the array, and this can lead to out-of-storage conditions. A manual triggering of the VAAI UNMAP primitive will be required, ideally outside of business hours, to reclaim the dead space. It’s thus important to have, as part of your operational procedures, a capacity reclamation process that is triggered regularly.


The use of virtual infrastructure management tools with built-in resource analysis & reclamation capabilities, such as SolarWinds Virtualization Manager, is a must. By leveraging software capabilities, these tedious analysis and reconciliation tasks are no longer required, and dashboards present IT teams with immediately actionable results.



Even with all the good will in the world, VM sprawl will happen. Although you may have the best policies in place, your environment is dynamic, and in the rush of IT operations you just can’t keep an eye on everything. And this is coming from a guy whose team successfully recovered 22 TB of space previously occupied by orphaned VMDKs earlier this year.
