Geek Speak

13 Posts authored by: mwpreston

With the influx of natural disasters, hacks, and increasingly common ransomware, being able to recover from a disaster is quickly moving up the priority list for IT departments across the globe. In addition to this awareness, we are seeing our data centers move from a very static deployment to an ever-changing environment. Each day we see more and more applications getting deployed, either on-premises or in the cloud, and each day we, as IT professionals, have a duty to ensure that when disaster strikes we can recover these applications. Without the proper procedures in place to consistently update our DR plans, no matter how well-crafted or detailed they are, our confidence in completing successful failovers decreases. So what now?


We’ve already discussed the first step in our DR process: creating our plan. We’ve also touched on the second step, which is to make it a living document to accommodate data center change. But there is one more step we need to put in place for a successful failover, and that's testing. Testing boosts the confidence of the IT department and the organization as a whole.


Testing our DR plan - We learn by doing!


When thinking of DR plan testing, I always like to compare it to a child. I know, a weird analogy, but if we think about how children learn and get better, it begins to make sense. Children learn by doing; they learn to talk by talking, learn to play sports by playing, etc. The point is that by “walking the walk,” we tend to improve ourselves. The same applies to our DR plans. We can have as many details and processes laid out on paper as we want, but if we can't restore when we need to, we've failed. Essentially, our DR plans are set up for success by also walking the walk, aka testing.


Start small, get bigger!


I’m not recommending going and pulling the plug on your data center tomorrow to see if your plan works. That would certainly be a career-changing move. Instead, you should start small. Take a couple of key services as defined in your DR plan and begin to draft a plan on how to test a failover of the components and servers contained within them. Just as when creating our DR plan, details and coordination are the key to success when creating our testing plan. Know exactly what you are testing for. Don’t simply count the servers booting as a success. Instead, go deeper. Can you log into the application? Can you use the application? Can a member of the department that owns the application sign off stating that it is indeed functioning normally? By knowing exactly what the end goal is, you can sign off on a successful test or, on the flip side, take the failures that occurred, learn from them, update your plan to reflect any changes, and be prepared for the next testing cycle.
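The "go deeper" checks above can be written down as explicit pass/fail criteria, so a test is only signed off when every one of them holds. This is a minimal sketch in Python; the class and field names are hypothetical and not taken from any DR product.

```python
from dataclasses import dataclass


@dataclass
class FailoverTest:
    """One service's failover test, with the checks named in the text."""
    service: str
    servers_booted: bool = False
    login_works: bool = False
    application_usable: bool = False
    owner_signed_off: bool = False

    def failed_checks(self):
        checks = {
            "servers booted": self.servers_booted,
            "login works": self.login_works,
            "application usable": self.application_usable,
            "owner sign-off": self.owner_signed_off,
        }
        return [name for name, ok in checks.items() if not ok]

    def passed(self):
        # A booted server alone is not a success; every check must pass.
        return not self.failed_checks()


# Hypothetical test run: servers are up and login works, nothing more yet.
test = FailoverTest("Accounting", servers_booted=True, login_works=True)
print(test.passed())         # False
print(test.failed_checks())  # ['application usable', 'owner sign-off']
```

The list of failed checks doubles as the list of things to learn from and fix before the next testing cycle.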


Once you have a couple of services defined, go ahead and begin to integrate more, and ensure that recurring time has been set aside and defined within the DR plan to carry out these tests. A full-scale DR test is not something that can be performed on a regular basis, but we can carry out smaller tests on a monthly or quarterly basis. Without a consistent schedule and attention to detail, we can almost guarantee that items like configuration drift will soon creep in and cause our DR testing to fail or, worse, our DR execution to fail.


I’ve mentioned before that not keeping our DR plans up to date is perhaps one of the biggest flaws in the whole DR process. However, not applying a consistent testing plan trumps this. Disaster Recovery, in my opinion, cannot be classified as a project. It cannot have an end date and a closing. We must always ensure, when deploying new services and changing existing applications, that we revisit the DR plan, updating both the process of recovering and the process for testing said recovery. Testing our DR plan is a key component in ensuring that all the work we have done in creating our plan will pay off when the plan is most needed. Let’s face it. A failed recovery will put a blemish on the entire DR planning process and all the work that has gone into it. Test and test often to make sure this doesn't happen to you.


I’d love to hear from all of you regarding how you go about testing, or if you even do? Are there any specific starting points for tests that you recommend? Do you start small and then expand? Do you utilize any specific pieces of software, resources or tools to help test your recovery? If you do test, how often? And finally, let’s hear those horror/success stories of any incidents gone bad (or extremely well) as it relates directly to your DR testing procedures. Thanks for reading!

Thus far, we have gone over how to classify our disasters and how to have some of those difficult conversations with our organization regarding Disaster Recovery (DR). We've also briefly touched on Business Continuity, an important piece of disaster recovery. Now the time has come to gather all our information and put together something formal in terms of a Disaster Recovery plan. As easy as it sounds, it can be quite a daunting task once you begin. DR plans, just like their disasters, come in all forms, and you can go as broad or as detailed as you like. There is no real “set in stone” template or set of instructions for DR plan creation. For example, some DR plans may just cover how to get services back up and going at the 100-foot level, maybe focusing on more of a server level. Others may contain application-specific instructions for restoring services, while others cover how to recover from yet another disaster at your secondary site. The point is that it’s your organization's DR plan, so you can do as you like. Just remember that it might not be you, or even your IT department, executing the failover, so the more details the better. That said, I mentioned that once we begin to create our DR plan, it can become quite overwhelming. That is why I always recommend starting at that 100-foot level and circling back to input details later.


So, with all that said, we can conclude that our DR plans can be structured however we wish, and that’s true. A quick Google search will yield hundreds of different templates for DR plans, each unique in their own way. However, to have a legible, solid, successful DR plan, there are five sections it needs to contain.




Introduction


The introduction of a DR plan is as important as one found in a textbook. Basically, this is where you summarize both the objectives and the scope of the plan. A good introduction will include all the IT services and locations that are protected, as well as the RTOs and RPOs associated with each. Aside from the technical aspect, the introduction should also contain the testing schedule and maintenance scope for the plan, as well as a history of revisions that have been made to the plan.


Roles and Responsibilities


We have talked a lot in this series about including stakeholders and application owners outside of the IT department in our primary discussions. This is the section of the plan where you will formally list all your internal and external departments and personnel who are key to each DR process that has been covered in our DR plan. Remember, execution of this plan is normally run under the event of a disaster, so names are not enough. You need brief descriptions of their duties, contact information, and even alternate contact information to ensure that no one is left in the dark.


Incident Response(s)


This is where you will include how a disaster event is declared, who has the power to do so, and the chain of communication that shall immediately follow. Remember, we can have many different types of disasters, therefore we can also have many different types of disaster declarations and incident responses. For instance, a major fire will yield a different incident response than that of an attempted ransomware attack. We need to know who is making the declaration, how they are doing so, and who will be contacted, so on and so forth, down the chain of command.


DR Procedures


Once your disaster has been declared, those outlined within the Roles and Responsibilities can begin to act on steps to bring the production environment back up within your secondary location. This is where those procedures and instructions are laid out, step by step, for each service that is identified within the plan’s scope. A lot of IT departments will jump right into this step, and this is where our plan creation can tend to get out of control. A rule of thumb is to start broad with your process, define any prerequisites, and then dive into details. Once you are done with that, you can circle back for yet another round of details.


For example, “Recover Accounting Services” may be a good place to start. You then can dive into the individual servers that support the service as a whole, listing out all the servers (names, IPs, etc.) you need to have available. You can then get into finer details about how to get each server up and running to support the service as a whole. Even further, you may need to make changes to the application for it to run at your secondary location (maybe you have a different IP scheme, different networks, etc.), or have support for external hardware, such as a fax server to send out purchase orders.
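The broad-to-detailed layering described here (service, then servers, then per-server steps, then secondary-site changes) maps naturally onto a small nested structure. The sketch below is only an illustration; the class names, server names, and addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    ip: str
    steps: list = field(default_factory=list)


@dataclass
class ServiceRunbook:
    name: str
    prerequisites: list = field(default_factory=list)
    servers: list = field(default_factory=list)
    site_changes: list = field(default_factory=list)

    def outline(self):
        """Flatten the runbook into indented lines, broadest level first."""
        lines = [self.name]
        lines += [f"  prerequisite: {p}" for p in self.prerequisites]
        for server in self.servers:
            lines.append(f"  server: {server.name} ({server.ip})")
            lines += [f"    step: {s}" for s in server.steps]
        lines += [f"  secondary-site change: {c}" for c in self.site_changes]
        return lines


# Hypothetical content; the addresses come from a documentation-only range.
runbook = ServiceRunbook(
    name="Recover Accounting Services",
    prerequisites=["secondary-site network is up"],
    servers=[
        Server("acct-db01", "192.0.2.10", ["restore database", "verify service starts"]),
        Server("acct-app01", "192.0.2.11", ["power on", "point application at acct-db01"]),
    ],
    site_changes=[
        "update application IPs for the secondary network",
        "reconnect the fax server used for purchase orders",
    ],
)
print("\n".join(runbook.outline()))
```

Starting with only the `name` and `servers` filled in, and adding steps on a later pass, mirrors the "circle back for another round of details" approach.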




Appendices


This is where you place a collection of any other documents that may be of value to your organization in the event of a disaster. Vendor contacts, insurance policies, and support contracts can all go into an appendix. If there is a certain procedure to recover a server (for example, you use the same piece of software to protect all services) and you've already provided an exhaustive list of instructions in the DR Procedures section, you can always add it here as well and simply reference it from within the DR plan.


With these five sections filled out, you should be certain that your organization is covered in the event of a disaster. A challenge, however, may be keeping your document up to date as your production environment changes. Today’s data centers are far from the static providers they once were. We are always spinning up new services, retiring old ones, moving things to and from the cloud. Every time that happens--to be successful in DR--we need to reassess that service within our DR plan. It needs to be a living document, right from its creation, and must always be kept up to date! And remember, it’s your DR plan, so include any other documents or sections that you or your organization wants to. At the end of the day, it’s better to have more information available than not enough, especially if you aren’t the person responsible for executing it! Also, please store a copy of this at your secondary location and/or in the cloud. I’ve heard too many stories of organizations losing their DR plan along with their production site.
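One way to catch a plan that has fallen behind production is to compare the list of running services against the services the plan covers. This is a minimal sketch that assumes both lists can be exported as plain names; the function and the service names are hypothetical.

```python
def plan_drift(production_services, dr_plan_services):
    """Compare what runs in production with what the DR plan covers."""
    production = set(production_services)
    planned = set(dr_plan_services)
    return {
        # Running in production but never added to the plan.
        "missing_from_plan": sorted(production - planned),
        # Still in the plan but retired from production.
        "stale_in_plan": sorted(planned - production),
    }


# Hypothetical inventories.
production = ["accounting", "payroll", "intranet", "crm"]
dr_plan = ["accounting", "payroll", "fax"]
print(plan_drift(production, dr_plan))
# {'missing_from_plan': ['crm', 'intranet'], 'stale_in_plan': ['fax']}
```

Both lists are worth reviewing: missing entries are unprotected services, and stale entries are wasted steps during a real failover.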


I’d love to hear your thoughts about all this! How do you structure your DR plans? Are you more detailed or broader in terms of laying out the instructions to recover? Have you ever had to execute a DR plan you weren’t a part of? If so, how did that change your views on creating these types of procedures and documents? Thanks for reading!

All too often, especially if disaster recovery (DR) is driven and pushed by the IT department, organizations can fall into the common mistake of assuming that they are “good to go” in the event disaster hits. While IT departments can certainly handle the technical side of things, ensuring services are up and running if production goes down, they are not necessarily the key stakeholder in ensuring that business processes and services can also be maintained. These business processes and activities can really be summed up in one key term that goes hand in hand with DR - business continuity (BC). Essentially, business continuity oversees the processes and procedures that are carried out in the event of a disaster to help ensure that business functions continue to operate as normal – the key here being business functions. Sure, following the procedures within our disaster recovery plan is a very big piece of our business continuity plan (BCP), but true BCPs will encompass much more in terms of dealing with a disaster.


BCP: Just a bunch of little DR plans!


When organizations embark on tackling business continuity, it's sometimes easier to break it all down into a bunch of little disaster recovery plans – think DR for IT, DR for accounting, DR for human resources, DR for payroll, etc. The whole point of business continuity is to keep the business running. Sometimes, if it is IT pushing for this, we fall into the trap of just looking at the technical aspects, when really it needs to involve the whole organization! So, with that said, what should really be included in a BCP? Below, we will look at what I feel are four major components that a solid BCP should consider.


Where to go?


Our DR plan does a great job of ensuring that our data and services are up and running in the event disaster hits. However, what we often don’t consider is how employees will access that data. Our employees are used to coming in, sitting down, and logging into a secure internal network. Now that we have restored operations, does a secondary location offer the same benefit to our end-users? Are there enough seats, DHCP addresses, and switches to handle all of this? Or, if we have utilized some sort of DRaaS, do they offer seats or labs in the event we need them? Furthermore, depending on the type of disaster incurred (say it was a flood), will our employees even be able to travel to alternate locations at all?


Essential Equipment


We know we need to get our servers back up and running. That’s a no brainer! But what about everything else our organization uses to carry out its day-to-day business? It’s the items we take for granted that tend to be forgotten. Photocopiers, fax machines, desks, chairs, etc. Can ALL essential departments maintain their “business as usual” at our secondary site, either fully or in some sort of limited fashion? And aside from equipment, do we need to think of the infrastructure within our secondary site, as well? Are there phone lines installed? And can that be expanded in the event of long-term use of the facility? Even if these items are not readily available, having a plan on how to obtain them will save valuable time in the restoration process. Have a look around you at all the things on your desk and ask yourself if the same is available at your designated DR facility.




Communication


Here’s the reality: your building is gone, along with everything that was inside of it! Do you have plans on how to keep in touch with key stakeholders during this time? A good BCP will have lists upon lists of key employees with their contact information, both current and emergency. This can be as simple as listing employees’ home and cell phone numbers and, if you host your own email servers, alternate email addresses that are checked on a regular basis. The last thing you want is a delay in executing your BCP because you can’t get the go-ahead from someone you are simply unable to contact.


Updated Organizational Charts


While having an updated org chart is great to include within a BCP, it is equally, or perhaps even more, important to have alternate versions of these charts in the event that someone is not available. We may not want to think about it, but the possibility of losing someone within the disaster itself is not far-fetched. And since the key function of the BCP is to maintain business processes, we will need to know exactly who to contact if someone else is unavailable. The last thing we need at times like these is staff arguing, or worse, not knowing who will make certain key decisions. Having alternate org charts prepared and ready is critical to ensuring that recovery personnel have the information they need to proceed.
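An alternate org chart can be thought of as an ordered fallback list per decision: if the first person cannot be reached, the next one takes over. A minimal sketch, with a hypothetical role name and hypothetical titles:

```python
def decision_maker(role, org_chart, unavailable):
    """Return the first reachable person for a role, or None if nobody is."""
    for person in org_chart.get(role, []):
        if person not in unavailable:
            return person
    return None


# Hypothetical chain of alternates for one key decision.
org_chart = {"declare disaster": ["CIO", "IT Director", "Operations Manager"]}

print(decision_maker("declare disaster", org_chart, unavailable={"CIO"}))
# IT Director
```

A `None` result is the case the BCP should never allow: a key decision with no one left on the list to make it.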


These four items are just the tip of the iceberg when it comes to properly crafting a BCP. There is much more out there that needs to be considered. Paper records, back-up locations, insurance contacts, emergency contacts, vendor contacts, payroll, banking; essentially every single aspect of our business needs to have a Plan B to ensure that you have an effective, holistic, and more importantly, successful Business Continuity Plan in place. While we as IT professionals might not find these things as “sexy” as implementing SAN replication and metro clusters, the fact of the matter is that we are often called upon when businesses begin their planning around BC and DR. That’s not to say that BC is an IT-related function, because it most certainly is not. But due to our major role in the technical portion of it, we really need to be able to push BC back onto other departments and organizations to ensure that the lights aren’t just on, but that there are people working below them as well.


I’d love to hear from some of you that do have a successful BCP in place. Was it driven by IT to begin with, or was IT just called upon as a portion of it? How detailed (or not) is your plan? Is it simply, “Employees shall report to a certain location,” or does it go as far as prioritizing the employees who gain access? What else might you have inside your plan that isn’t covered here? If you don’t have a plan, why not? Budget? Time? Resources?


Thank you so much for all of the recent comments on the first two articles. Let's keep this conversation going!

Getting right into the technical nitty gritty of the Disaster Recovery (DR) plan is probably my favorite part of the whole process. I mean, as an IT Professional this is our specialty – developing requirements, evaluating solutions, and implementing products. And while this basic process of deploying software and solutions may work great for single task-oriented, department type applications, we will find that in terms of DR there are many more road blocks and challenges that seem to pop up along the way. And if we don’t properly analyze and dissect our existing production environments, or we fail to involve many of the key stakeholders at play, our DR plan will inevitably fail – and failure during a disaster event could be catastrophic to our organizations and, quite possibly, our careers.


So how do we get started?


Before even looking at software and solutions we really should have a solid handle on the requirements and expectations of our key stakeholders. If your organization already has Service Level Agreements (SLAs) then you are well on your way to completing this first step. However, if you don’t, then you have a lot of work and conversations ahead of you. In terms of disaster recovery, SLAs will drive both the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). An RPO essentially dictates the maximum amount of data, measured in time, that an organization can afford to lose. For instance, if a service has an RPO of 4 hours, we would need to ensure that no matter what, we can always restore our service with no more than 4 hours of data loss, meaning we would have to ensure that restore points are created on a 4-hour (or smaller) interval. An RTO dictates the maximum amount of time allowed to get our service restored and running after a failure. Thus, an RTO of 4 hours would essentially mean we have 4 hours to get the service up and running after the notification of a failure before we would begin to massively impact our business objectives.
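The two definitions above translate directly into two small checks: how far back the last restore point is at the moment of failure (RPO), and how long it takes from notification to a restored service (RTO). This is a rough sketch with made-up timestamps; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(restore_points, failure_time):
    """Gap between the failure and the latest restore point at or before it."""
    usable = [point for point in restore_points if point <= failure_time]
    if not usable:
        return None  # nothing to restore from
    return failure_time - max(usable)


def meets_rpo(restore_points, failure_time, rpo):
    loss = data_loss_window(restore_points, failure_time)
    return loss is not None and loss <= rpo


def meets_rto(notified_at, restored_at, rto):
    # RTO runs from notification of the failure until the service is back.
    return restored_at - notified_at <= rto


# Hypothetical example: restore points every 4 hours, failure at 11:30.
day = datetime(2024, 1, 1)
points = [day + timedelta(hours=h) for h in (0, 4, 8)]
failure = day + timedelta(hours=11, minutes=30)
four_hours = timedelta(hours=4)

print(data_loss_window(points, failure))       # 3:30:00
print(meets_rpo(points, failure, four_hours))  # True
print(meets_rto(failure, failure + timedelta(hours=3, minutes=15), four_hours))  # True
```

If the restore points were taken every 6 hours instead, a failure just before the next one would lose almost 6 hours of data and miss the 4-hour RPO, which is why the interval must be at or below the RPO.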


Determining both RTO and RPO can become a very challenging process and really needs to involve all key stakeholders within the business. Our application owners and users will certainly always demand lower RPO and RTO values; however, IT departments may inject a dose of reality into the process when a dollar value is placed on meeting those low RPOs and RTOs. The point of the exercise, though, is to define what’s right for the organization and what can be afforded, and to create formal expectations for the organization.


Once our SLAs, RTOs, and RPOs have been determined, IT can really get started on determining a technical solution to ensure that these requirements can be met. Hopefully we can begin to see the importance of having the expectations set beforehand. For instance, if we had a mission-critical business service with an RTO of 10 minutes, we would most likely not rely on a tape backup to protect that service, as it would take much longer than that to restore from tape. Instead, we would most likely implement some form of replication. On the flip side, a file server, or more specifically the data on the file server, may have an RTO of, say, 10 hours, at which point it could be cost effective to rely on backup to protect this service. My point is, having RTO and RPO set before beginning any technical discovery is key to getting a proper, cost-effective solution.
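The tape-versus-replication reasoning can be reduced to one comparison: does a restore from backup fit inside the RTO? The sketch below assumes a restore-time estimate is available; the 6-hour figure is an invented placeholder, not a number from this post.

```python
from datetime import timedelta


def choose_protection(rto, estimated_backup_restore):
    """Pick backup when a restore from backup fits inside the RTO,
    otherwise fall back to replication."""
    if estimated_backup_restore <= rto:
        return "backup"
    return "replication"


# Assumed restore estimate of 6 hours from tape (illustrative only).
tape_restore = timedelta(hours=6)

print(choose_protection(timedelta(minutes=10), tape_restore))  # replication
print(choose_protection(timedelta(hours=10), tape_restore))    # backup
```

A real decision would also weigh RPO and cost, but the comparison shows why the RTO has to be agreed before any product is evaluated.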


What else is there to consider?


Ten years ago, we would be pretty much done with our preliminary work for a DR plan by simply determining RTO and RPO and could begin investigating solutions – but in today’s modern datacenters that’s simply not the case. We have a lot more at play. What about cloud? What about SaaS? What about remote workers? Today’s IT deployments don’t just operate within the four walls of our datacenters and are most often stretched into all corners of the world – and we need to protect and adhere to our SLA policies no matter where the workload runs. What if Office 365 suddenly had an outage for 3 hours? Is this acceptable to your organization? Do we need to archive the mail somewhere else so at the very least the CEO can get that important message he needed? Same goes for our workloads that may be running in public clouds like Amazon or Azure – we need to ensure that we are doing all we can to protect and restore these workloads.


The upfront work of looking at our environments holistically, determining our SLAs, and developing RTOs and RPOs really does set IT up for success when it comes time to evaluate a technical solution. Quite often we won’t find just one solution that fits our needs – and in most deployments, we will see many different solutions deployed to satisfy a well-built DR plan. We may have one solution that handles backup of cloud and another that handles on-premises workloads. We could also have one solution that replicates to the cloud, and another that moves workloads to our designated DR site. The point is that focusing most of our time on the development of RPO, RTO, and business practices lets the organization, and not IT, drive the disaster recovery processes – which in turn lets IT focus on the technical deployment and solutions built around it.


Thus far we have had two posts regarding developing our DR plan which dictate taking a step back and having discussions with our organizations before even beginning to evaluate and implement anything technical. I’d love to hear feedback on this. How do you begin your DR plans? Have you had those conversations with your organization around developing SLAs? If so, what challenges present themselves? Quite often organizations will look to IT for answers that should really be dictated by the business requirements and processes – what are your feelings on this? Leave me a comment below with your thoughts. Thanks for reading!

When organizations first take on the challenge of setting up a disaster recovery plan, it’s almost always based on the premise that a complete failure will occur. With that in mind, we take the approach of planning for a complete recovery. We replicate our services and VMs to some sort of secondary site and go through the processes of documenting how to bring them all up again. While this may be the basis of the technical recovery portion of a DR plan, it’s important to take a step back before jumping right into the assumption of having to recover from a complete failure. Disasters come in all shapes, forms, and sizes, and a great DR plan will accommodate as many types of disasters as possible. For example, we wouldn’t use the same “runbook” to recover from simple data loss that we would use to recover from the total devastation of a hurricane. This just wouldn’t make sense. So even before beginning the recovery portions of our disaster recovery plans we really should focus on the disaster portion.


Classifying Disasters


As mentioned above, the human mind always seems to jump into planning for the worst-case scenario when hearing the words disaster recovery: a building burning down, flooding, etc. What we fail to plan for is other, minor, less significant disasters, such as temporary loss of power or loss of entrance due to quarantine. So, with that said, let’s begin to classify disasters. For the most part, we can lump a disaster into two main categories:


Natural Disasters – These are the most recognized types of disasters. Think of events such as a hurricane, flooding, fire, earthquake, lightning, water damage, etc. When planning for a natural disaster, we can normally go under the assumption that we will be performing a complete recovery or avoidance scenario to a secondary location.


Man-made Disasters – These are the types of disasters that are lesser known to organizations when looking at DR. Think about things such as temporary loss of power, cyberattacks, ransomware, protests, etc. While these intentional and unintentional acts are not as commonly approached, a good disaster recovery plan will address some of these as the recovery from them is often much different from that of a natural disaster.


Once we have classified our disaster into one of these two categories, we can then move on by further drilling down on the disasters. Performing a risk and impact assessment of the disaster scenarios themselves is a great next step. Answers to questions like the ones listed below should be considered when performing our risk assessment because it allows us to further classify our disasters, and, in turn, define expectations and appropriate responses accordingly.


  • Do we still have access to our main premises?
  • Have we lost any data?
  • Has any IT function been depleted or lost?
  • Do we have loss of skill set?


How these questions are answered as they pertain to a disaster can completely change our recovery scenarios. For example, if we have had a fire in the data center and lost data, we would most likely be failing over to another building in a designated amount of time. However, if we had also lost employees in that fire, more specifically IT employees, then the time to recover will certainly be extended, as we most likely would have lost the skill sets and talent needed to execute the DR plan. Another great example comes in the form of ransomware. While we would still have physical access to our main premises, the data loss scenario could be much greater due to widespread encryption from the ransomware itself. If our backups were not air-gapped or separate from our infrastructure, then we may also have encrypted backups, meaning we have lost an IT function, thus provoking a possible failover scenario even with physical access to the building. On the flip side, our risks may not even be technical in nature. What is the impact of losing physical access to our building as a result of protests or chemical spills? Some disasters like this may not even require a recovery process at all, but still pose a threat due to the loss of access to the hardware.
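The four assessment questions can be captured as four yes/no answers, with the scenarios above falling out of simple rules. This is a loose sketch of that idea rather than a complete risk model; the class, the function, and the wording of the notes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    premises_accessible: bool
    data_lost: bool
    it_function_lost: bool
    skill_set_lost: bool


def suggest_response(a):
    """Turn the four answers into rough planning notes."""
    notes = []
    if a.data_lost or a.it_function_lost:
        notes.append("recover data or fail over to the secondary site")
    elif not a.premises_accessible:
        notes.append("no recovery needed, but plan for loss of access to hardware")
    else:
        notes.append("no recovery action indicated")
    if a.skill_set_lost:
        notes.append("expect a longer recovery: staff who execute the plan are unavailable")
    return notes


# Ransomware with encrypted backups: building is fine, data and backups are not.
ransomware = Assessment(premises_accessible=True, data_lost=True,
                        it_function_lost=True, skill_set_lost=False)
# Protest or chemical spill: nothing lost, but nobody can get in.
lockout = Assessment(premises_accessible=False, data_lost=False,
                     it_function_lost=False, skill_set_lost=False)

print(suggest_response(ransomware))
print(suggest_response(lockout))
```

Recording the answers per disaster type is a way to see that one generic runbook cannot cover them all.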


Disaster recovery is a major undertaking, no matter what size the company or IT infrastructure, and can take copious amounts of time and resources to get off the ground. With that said, don’t make the mistake of only planning for those big natural disasters. While it may be a great starting point, it’s best to also list out some of the more common, more probable types of disasters and document the risks and recovery steps for each in turn. In the end, you are more likely to be battling cyber attacks, power loss, and data corruption than you are to be fighting off a hurricane. The key takeaway is – classify many different disaster types, document them, and in the end, you will have a more robust, more holistic plan you can use when the time comes. I would love to hear from you in regards to your journeys with DR. How do you begin to classify disasters or construct a DR plan? Have you experienced any "uncommon" scenarios which your DR plan has or hasn't addressed? Leave some comments below and let's keep this conversation going.

Dropping into an SSH session and running esxtop on an ESXi host can be a daunting task! With well over 300 metrics available, esxtop can throw numbers and percentages at sysadmins all day long – but without completely understanding them, they will prove to be quite useless for troubleshooting issues. Below are a handful of metrics that I find useful when analyzing performance issues with esxtop.




CPU


Usage (%USED) - CPU is usually not the bottleneck when it comes to performance issues within VMware, but it is still a good idea to keep an eye on the average usage of both the host and the VMs that reside on it. High CPU usage on a VM may indicate a need for more vCPUs or be a sign of something that has gone awry within the OS. Chronic high CPU usage on the host may indicate the need for more resources, in terms of either additional cores or more ESXi hosts within the cluster.


Ready (%RDY) - CPU Ready (%RDY) is a very important metric that is brought up in nearly every single blog post dealing with VMware and performance. Simply put, CPU Ready measures the amount of time that the VM is ready to run on physical CPUs but is waiting for the ESXi CPU scheduler to find the time to do so. Normally this is caused by other VMs competing for the same resources. VMs with a high %RDY will definitely see a performance impact. A high %RDY may indicate the need for more physical cores, or can sometimes be solved by removing unneeded vCPUs from VMs that do not require more than one.


Co-Stop (%CSTP) - Similar to Ready, Co-Stop measures the amount of time the VM was delayed by the ESXi CPU scheduler. The difference is that Co-Stop only applies to VMs with multiple vCPUs, while %RDY can also apply to VMs with a single vCPU. A high number of VMs with a high Co-Stop may indicate the need for more physical cores within your ESXi host, too high of a consolidation ratio, or, quite simply, too many multi-vCPU VMs.
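Once %RDY and %CSTP have been read off esxtop for a VM, the reasoning above can be applied mechanically. The sketch below takes the thresholds as arguments because this post does not give any; the limits used in the example are illustrative assumptions only, and the function name is hypothetical.

```python
def cpu_findings(vcpus, rdy_pct, cstp_pct, rdy_limit, cstp_limit):
    """Interpret one VM's %RDY and %CSTP against caller-supplied limits."""
    findings = []
    if rdy_pct > rdy_limit:
        findings.append("high %RDY: VM is waiting on the CPU scheduler")
    # Co-Stop only applies to VMs with more than one vCPU.
    if vcpus > 1 and cstp_pct > cstp_limit:
        findings.append("high %CSTP: consider fewer vCPUs on this VM")
    return findings


# Illustrative limits (assumptions, not values from the post).
print(cpu_findings(vcpus=4, rdy_pct=12.0, cstp_pct=5.0, rdy_limit=10.0, cstp_limit=3.0))
print(cpu_findings(vcpus=1, rdy_pct=2.0, cstp_pct=0.0, rdy_limit=10.0, cstp_limit=3.0))
```

Passing the limits in keeps the rule of thumb visible and easy to adjust per environment.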




Memory


Active (%ACTV) - Just as it’s a good idea to monitor the average CPU usage on both hosts and VMs, the same goes for active memory. Although we cannot necessarily use this metric for right-sizing due to the way it is calculated, it can be used to see which VMs are actively and aggressively touching memory pages.


Swapping (SWR/s,SWW/s,SWTGT,SWCUR) - Memory swapping is a very important metric to watch. Essentially, if we see this metric anywhere above 0, it means that we are actively swapping memory pages out to the swap file that is created upon VM power on. This means that instead of serving those pages from RAM, we are using much slower disk to do so. If we see swapping occur, we may be in the market for more memory on our physical hosts, or looking to migrate certain VMs to other hosts with free physical RAM.


Balloon (MEMCTLGT) - Ballooning isn’t necessarily a bad metric for memory consumption but can definitely be used as an early warning symptom for swapping. When a value is reported for ballooning it basically states that the host cannot satisfy the VM’s memory requirements and is reclaiming unused memory back from other virtual machines. Once the balloon driver has reclaimed all it can, swapping is the next logical step, which can be very detrimental to performance.
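The progression described here (ballooning first as an early warning, then swapping) can be expressed as a small check on values read from esxtop. This is a simplified sketch; the parameter names are hypothetical stand-ins for the swap and balloon counters above.

```python
def memory_state(swap_read_rate, swap_write_rate, swapped_mb, balloon_mb):
    """Classify a VM's memory pressure from its swap and balloon values."""
    # Any swap value above 0 is treated as swapping, as the text suggests.
    if swap_read_rate > 0 or swap_write_rate > 0 or swapped_mb > 0:
        return "swapping"
    # Ballooning on its own is only an early warning.
    if balloon_mb > 0:
        return "ballooning"
    return "ok"


print(memory_state(0.0, 0.0, 0.0, 0.0))     # ok
print(memory_state(0.0, 0.0, 0.0, 512.0))   # ballooning
print(memory_state(1.5, 0.0, 128.0, 512.0)) # swapping
```

Swapping is checked first because it is the more serious condition and usually appears alongside a non-zero balloon value.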




Disk


Latency (DAVG, GAVG, KAVG, QAVG) - When it comes to monitoring disk I/O, latency is king. Within a virtualized environment there are many different areas where latency may occur, though: leaving the VM, going through the VMkernel, the HBA, and the storage array. To help understand total latency we can look at the following metrics.

  • KAVG – This is the amount of time that the I/O spends within the VMkernel.
  • QAVG – This is the amount of time that the I/O spends waiting in a queue within the VMkernel; it is counted as part of KAVG.
  • DAVG – This is the amount of time the I/O takes to leave the HBA, get to the storage array, and return.
  • GAVG – We can think of GAVG (Guest Average) as the sum of KAVG and DAVG, essentially the total latency as seen by the applications within the VM.


As you might guess, a high QAVG/KAVG can most certainly be the result of too small a queue depth on your HBA, or possibly your host is simply too busy and VMs need to be migrated to other hosts. A high DAVG (>20ms) normally indicates an issue with the storage array itself: either it is incorrectly configured or it is too busy to handle the load.
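The triage logic above can be sketched in a few lines of Python. This is only an illustration: the function name and return values are hypothetical, the 20 ms DAVG limit is the one quoted above, and the 2 ms limit for KAVG/QAVG is an assumption picked for the example rather than a figure from this post.

```python
DAVG_LIMIT_MS = 20.0  # threshold mentioned in the text above
KAVG_LIMIT_MS = 2.0   # assumed value, for illustration only


def triage_latency(davg_ms, kavg_ms, qavg_ms):
    """Return a list of places to look, based on esxtop latency counters."""
    suspects = []
    if davg_ms > DAVG_LIMIT_MS:
        # Time spent outside the host points at the array or the fabric.
        suspects.append("array")
    if kavg_ms > KAVG_LIMIT_MS or qavg_ms > KAVG_LIMIT_MS:
        # Time spent inside the host points at queue depth or a busy host.
        suspects.append("host")
    return suspects


print(triage_latency(davg_ms=25.0, kavg_ms=0.3, qavg_ms=0.0))
print(triage_latency(davg_ms=4.0, kavg_ms=6.5, qavg_ms=5.9))
```

Both checks run independently, since a busy host and a struggling array can show up at the same time.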




Dropped packets (DRPTX/DRPRX) - As far as network performance goes, there are only a couple of metrics that we can monitor at the host level. DRPTX and DRPRX track the packets that are dropped on the transmit and receive ends, respectively. When we begin to see these metrics go above 1, we may conclude that we have very high network utilization and may need to increase bandwidth, either out of the host or possibly somewhere along the path the packets are taking.


As I mentioned earlier, there are over 300 metrics within esxtop; the above are simply the core ones I use when troubleshooting performance. Certainly, having a third-party monitoring solution can help you baseline your environment and use these stats more to your advantage by summarizing them in more visually appealing ways. For this week I’d love to hear about some of your real-life situations. When was there a time where you noticed a metric was “out of whack,” and what did you do to fix it? What are some of your favorite performance metrics that you watch, and why? Do you use esxtop, or do you have a favorite third-party solution you like to utilize?


Thanks for reading!

Let’s face it! We live in a world now where we are seeing a heavy reliance on software instead of hardware. With Software Defined Everything popping up all over the place, we are seeing traditional hardware-oriented tasks being built into software. This provides an extreme amount of flexibility and portability in how we choose to deploy and configure various pieces of our environments.


With this software management layer taking hold of our virtualized datacenters we are going through a phase where technologies such as private and hybrid cloud are now within our grasp.  As the cloud descends upon us there is one key player that we need to focus on – the automation and orchestration that quietly executes in the background, the key component to providing the flexibility, efficiency, and simplicity that we as sysadmins are expected to provide to our end users.


To help drive home the importance and reliance of automation let’s take a look at a simple task – that of deploying a VM.  When we do this in the cloud, mainly public,  it’s just a matter of swiping a credit card, providing some information in regards to a name and network configuration, waiting a few minutes/seconds and away we go. Our end users can have a VM setup almost instantaneously!


The ease of use and efficiency of the public cloud, as in the scenario above, is putting added pressure on IT within our respective organizations. We are now expected to create, deliver, and maintain these flexible, cloud-like services within our businesses, and to do so with the same efficiency and simplicity that cloud brings to the table. Virtualization certainly provides a decent starting point for this, but it is automation and orchestration that will take us to the finish line.


So how do we do it?


Within our enterprise, I think we can all agree that we don’t simply create a VM and call it “done”! There are many other steps that come after we power up that new VM. We have server naming to contend with, as well as networking configuration (IP, DNS, firewall, etc.). We have monitoring solutions that need to be configured in order to properly monitor and respond to outages and issues that may pop up, and I’m pretty certain we will also want to include our newly created VM in some sort of backup or replication job in order to protect it. With more and more software vendors exposing public APIs, we are now living in a world where it’s possible to tie all of these different pieces of our datacenter together.
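To picture how those post-deployment steps might be tied together, here is a minimal Python sketch of an orchestration pipeline. Everything in it is hypothetical: the step functions are placeholders that only record what they would do, and in a real environment each would call the API of whatever naming, network, monitoring, or backup product you actually run.

```python
def assign_name(vm):
    # Placeholder for a naming standard: <role>-<two-digit index>.
    vm["name"] = f"{vm['role']}-{vm['index']:02d}"


def configure_network(vm):
    # Placeholder for IP assignment, DNS registration, firewall rules.
    vm["network"] = {"ip": vm["ip"], "dns_registered": True}


def register_monitoring(vm):
    # Placeholder for adding the VM to a monitoring solution.
    vm["monitored"] = True


def add_to_backup_job(vm):
    # Placeholder for adding the VM to a backup or replication job.
    vm["backup_job"] = "daily"


PIPELINE = [assign_name, configure_network, register_monitoring, add_to_backup_job]


def provision(request):
    """Run every step in order and record which ones completed."""
    vm = dict(request)
    vm["completed_steps"] = []
    for step in PIPELINE:
        step(vm)
        vm["completed_steps"].append(step.__name__)
    return vm


print(provision({"role": "web", "index": 3, "ip": "10.0.0.15"}))
```

Keeping the steps in an ordered list means a new requirement, such as a CMDB entry, becomes one more function appended to the pipeline rather than another manual checklist item.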


Automation and orchestration don’t stop at just creating VMs either; there’s room for them throughout the whole VM life cycle. The concept of the self-healing datacenter comes to mind: having scripts and actions performed automatically by monitoring software in an effort to fix issues within your environment as they occur. This is all made possible by automation.


So with this, I think we can all conclude that automation is a key player within our environments, but the question always remains: should I automate task x? Meaning, will the time savings and benefits of the automation outweigh the effort and resources it will take to create the process? With all this in mind, I have a few questions. Do you use automation and orchestration within your environment? If so, what tasks have you automated thus far? Do you have a rule of thumb that dictates when you will automate a certain task? Believe it or not, there are people in this world who are somewhat against automation, whether out of fear for their jobs or simply a reluctance to adapt. How do you help “push” these people down the path of automation?
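One rough way to approach the “should I automate task x?” question is a break-even estimate: how many times must the task run before the time saved pays back the time spent building the automation? The Python sketch below is only one possible rule of thumb; the function name and the sample figures are made up, and it ignores maintenance effort and the value of fewer human errors.

```python
import math


def break_even_runs(build_hours, manual_minutes, automated_minutes=0.0):
    """Number of runs needed before automation pays for itself.

    Returns None when the automated run is no faster than the manual one,
    because in that case the build effort is never recovered.
    """
    saved_per_run = manual_minutes - automated_minutes
    if saved_per_run <= 0:
        return None
    return math.ceil(build_hours * 60 / saved_per_run)


# Hypothetical figures: 8 hours to script a task that takes 20 minutes
# by hand and 2 minutes of attention once automated.
print(break_even_runs(build_hours=8, manual_minutes=20, automated_minutes=2))
```

A task performed weekly would cross that threshold in about half a year, which is the kind of number that makes the decision easier to defend.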

More often than not, application owners look to their vendors to provide a list of requirements for a new project, and the vendor forwards specifications that were developed around maximum load and in the age of physical servers. These requirements eventually make their way onto the virtualization administrator’s desk: 8 Intel CPUs at 2 GHz or higher, 16 GB of memory. The virtualization administrator is left to fight the good fight with both the app owner and the vendor. Now, we all cringe when we see requirements such as these. We’ve worked hard to build out our virtualization clusters, pooling CPU and memory to present back to our organizations, and we constantly monitor our environments to ensure that resources are available to our applications when they need them. And then a list of requirements like the above comes across through a server build form or email.


So what’s the big deal?


We have lots of resources, right? Why not just give the people what they want? Before you start giving away resources, let’s take a look at both CPU and memory and see just how VMware handles the scheduling and overcommitment of each.




One of the biggest selling points of virtualization is the fact that we can have more vCPUs attached to VMs than we have physical CPUs on our hosts, and let the ESXi scheduler take care of scheduling the VMs’ time on CPU. So why don’t we just go ahead and give every VM 4 or 8 vCPUs? You would think granting a VM more vCPUs would increase its performance, which it most certainly could, but the problem is that you can actually hurt performance as well, not just on that single VM but on other VMs running on the host. Since the physical CPUs are shared, there may be times when the scheduler has to place CPU instructions on hold and wait for physical cores to become available. For instance, a VM with 4 vCPUs will have to wait until 4 physical cores are available before the scheduler can execute its instructions, whereas a VM with 1 vCPU only has to wait for 1 core. As you can tell, having multiple VMs each containing multiple vCPUs could end up causing a lot of queuing, waiting, and CPU ready time on the host, resulting in a significant impact on performance. Although VMware has made strides in CPU scheduling by implementing “Relaxed Co-scheduling,” it still only allows for a certain time drift between the execution of instructions across cores and does not completely solve the issues around scheduling and CPU Ready. It’s always best practice to right-size your VMs in terms of number of vCPUs to avoid as many scheduling conflicts as possible.




vSphere employs many techniques when managing our virtual machine memory. VMs can share memory pages with each other, eliminating redundant copies of the same memory pages. vSphere can also compress memory, as well as use ballooning techniques that allow one VM to essentially borrow memory allocated to another. This built-in intelligence almost masks any performance issues we might see from overcommitting RAM to our virtual machines. That said, memory is often one of the resources we run out of first, and we should still take precautions to right-size our VMs to prevent waste. The first thing to consider is overhead. By assigning additional, unneeded memory to our VMs, we increase the amount of overhead memory that the hypervisor uses in order to run the virtual machine, which in turn takes memory from the pool available to other VMs. The amount of overhead is determined by the amount of assigned memory as well as the number of vCPUs on the VM, and although this number is small (roughly 150MB for a 16GB RAM/2 vCPU VM), it can begin to add up as our consolidation ratios increase. Aside from wasting memory, overallocated memory also causes unnecessary waste on the storage end of things. Each time a VM is powered on, a swap file is created on disk, equal in size to the allocated memory. Again, this may not seem like a lot of wasted space at the time, but as we create more and more VMs it can certainly add up to quite a bit of capacity. Keep in mind that if there is not enough free space available to create this swap file, the VM will not be able to power on.
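The storage side of that argument is simple arithmetic, sketched below in Python. The VM names and sizes are invented for the example, and the calculation assumes, as described above, that each powered-on VM gets a swap file the size of its allocated memory (in other words, that no memory reservations are set).

```python
def swap_space_gb(vms, key="allocated_gb"):
    """Total datastore space consumed by swap files for powered-on VMs."""
    return sum(vm[key] for vm in vms)


# Hypothetical inventory: what was requested vs. what monitoring
# suggests each VM actually needs.
vms = [
    {"name": "app01", "allocated_gb": 16, "right_sized_gb": 6},
    {"name": "app02", "allocated_gb": 16, "right_sized_gb": 8},
    {"name": "web01", "allocated_gb": 8, "right_sized_gb": 4},
]

current = swap_space_gb(vms)
right_sized = swap_space_gb(vms, key="right_sized_gb")

print(f"swap space today:      {current} GB")
print(f"swap space right-sized: {right_sized} GB")
print(f"reclaimable:           {current - right_sized} GB")
```

Three VMs barely register, but the same ratio across a few hundred VMs is the kind of capacity that is worth bringing to an application owner.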


Certainly these are not the only impacts that oversized virtual machines have on our environment.  They can also impact certain features such as HA, vMotion times, DRS actions, etc but these are some of the bigger ones.  Right-sizing is not something that’s done once either – it’s important to constantly monitor your infrastructure and go back and forth with application owners as things change from day to day, month to month.  Certainly there are a lot of applications and monitoring systems out there that can perform this analysis for us so use them!


All that said, discovering the over- and under-sized VMs within our infrastructure is probably the easiest leg of the reclamation journey. Once we have some solid numbers and metrics in hand, we need to somehow present these to other business units and application owners to try to claw back resources. This is where the real challenge begins. Placing a dollar figure on everything and utilizing features such as showback and chargeback may help, but again, it’s hard to take something back after it’s been given. So my questions to leave you with this time are as follows. First up, how do you ensure your VMs are right-sized? Do you use a monitoring solution available today to do so? And if so, how often are you evaluating your infrastructure for right-sized VMs (monthly, yearly, etc.)? Secondly, what do you foresee as the biggest challenges in trying to claw back resources from your business? Do you find that it’s more of a political challenge or simply an educational challenge?

In the past few years, there has been a lot of conversation around the “hypervisor becoming a commodity.” It has been said that the underlying virtualization engines, whether they be ESXi, Hyper-V, KVM, etc., are essentially insignificant, stressing the importance of the management and automation tools that sit on top of them.


These statements do hold some truthfulness: in its basic form, the hypervisor simply runs a virtual machine. As long as end-users have the performance they need, there's nothing else to worry about. In truth, though, the three major hypervisors on the market today (ESXi, Hyper-V, KVM) do this, and they do it well, so I can see how the “hypervisor becoming a commodity” works in these cases. But to SysAdmins, the people managing everything behind the VM, the commoditized hypervisor theory isn't bought quite so easily.


When we think about the word commodity in terms of IT, it’s usually defined as a product or service that is indistinguishable from its competitors, except for maybe price. With that said, if hypervisors were a commodity, we shouldn’t care what hypervisor our applications are running on. We should see no difference between the VMs that are sitting inside an ESXi cluster or a Hyper-V cluster. In fact, in order to be a commodity, these VMs should be able to migrate between hypervisors. The fact is that VMs today are not interchangeable between hypervisors, at least not without changing their underlying anatomy. While it is possible to migrate between hypervisors, there is a process that we have to follow, covering configurations, disks, etc. The files that make up a VM are all proprietary to the hypervisor it is running on and cannot simply be migrated to and run by another hypervisor in their native forms.


Also, we stressed earlier the importance of the management tools that lie above the hypervisor, and how the hypervisor didn’t matter as much as the management tools did. This is partly true. The management and automation tools put in place are the heart of our virtual infrastructures, but the problem is that these management tools often create a divide in the features they support on different hypervisors. Take, for instance, a storage array providing support for VVOLs, VMware’s answer to per-VM, policy-based storage provisioning. This is a standard that allows us to completely change the way we deploy storage, eliminating LUNs and making VMs and their disks first-class citizens on their storage arrays. That said, these are storage arrays that are connected to ESXi hosts, not Hyper-V hosts. Another example, this time in favor of Microsoft, is in the hybrid cloud space. With Azure Stack coming down the pipe, organizations will be able to easily deploy and deliver services from their own data centers, but with Azure-like agility. The similar VMware solution, involving vCloud Air and vCloud Connector, is simply not at the same level as Azure when it comes to simplicity, in my opinion. They are two very different feature sets that are only available on their respective hypervisors.


So with all that, is the hypervisor a commodity? My take: No! While all the major hypervisors on the market today do one thing, which is to virtualize x86 instructions and provide abstraction to the VMs running on top of them, there are simply too many discrepancies between the compatible third-party tools, features, and products that manage these hypervisors for me to call them commoditized. So I’ll leave you with a few questions. Do you think the hypervisor is a commodity? When/if the hypervisor fully becomes a commodity, what do you foresee our virtual environments looking like? Single or multi-hypervisor? Looking forward to your comments.


To the cloud!

Posted by mwpreston Sep 28, 2014

Private, Public, Hybrid, Infrastructure as a Service, Database as a Service, Software Defined Datacenter; call it what you will, but for the sake of this post I’m going to sum it all up as cloud. When virtualization started to become mainstream, we saw a lot of enterprises adopt a “virtualization first” strategy, meaning new services and applications introduced to the business would first be considered for virtualization unless a solid case for acquiring physical hardware could be made. As the IT world shifts, we are seeing this strategy move more towards a “cloud first” strategy. Companies are asking themselves questions such as “Are there security policies stating we must run this inside of our datacenter?”, “Will cloud provide a more highly available platform for this service?”, and “Is it cost effective for us to place this service elsewhere?”.


Honestly, for a lot of services the cloud makes sense!  But is your database environment one of them?  From my experiences I’ve seen database environments stay relatively static.  The database sat on different pieces of physical hardware and watched us implement our “virtualization first” strategies.  We’ve long virtualized web front ends, the application servers and all the other pieces of our infrastructure but have yet to make the jump on the database.  Sometimes it’s simply due to performance, but with the advances in hypervisors as of late we can’t necessarily blame it on metrics anymore.  And now we are seeing cloud solutions such as DBaaS and IaaS present themselves to us.  Most of the time, the database is the heart of the company.  The main revenue driver for our business and customers, and it gets so locked up in change freezes that we have a hard time touching it.  But today, let’s pretend that the opportunity to move “to the cloud” is real.


When we look at running our databases in the cloud we really have two main options; DBaaS (Database functionalities delivered directly to us) and IaaS (The same database functionality being provided, but allowing us to control a portion of the infrastructure underneath it.)  No matter the choice we make, to me, the whole “database in the cloud” scenario is one big trade off.  We trade away our control and ownership of the complete stack in our datacenters to gain the agility and mobility that cloud can provide us with.


Think about it! Currently, we have the ability to monitor the complete stack that our database lives on. We see all traffic coming into the environment and all traffic going out, and we can monitor every single switch, router, and network device that is inside of our four datacenter walls. We can make BIOS changes to the servers our database resides on. We have complete control over how our database performs (with the exception of closed vendor code). In a cloudy world, we hand over that control to our cloud provider. Sure, we can usually still monitor performance metrics based on the database operations, but we don’t necessarily know what else is going on in the environment. We don’t know who our “neighbors” are or if what they are doing is affecting us in any way. We don’t know what changes or tweaks might be going on below the stack that hosts our database. On the flip side though, do we care? We’ve paid good money for these services and SLAs and put our trust in the cloud provider to take care of this for us. In return, we get agility. We get functionality such as faster deployment times. We aren’t waiting anymore for servers to arrive or storage to be provisioned. In the case of DBaaS, we get embedded best practices. A lot of DBaaS providers do one thing and one thing alone: make databases efficient, fast, resilient, and highly available. Sometimes the burden of DR and recovery is taken care of for us. We don’t need to buy two of everything. Perhaps the biggest advantage, though, is the fact that we only pay for what we use. As heavy resource peaks emerge, we can burst and scale up automatically. When those periods are over, we can retract and scale back down.


So thoughtful remarks for the week – What side of the “agility vs control” tradeoff do you or your business take?  Have you already made a move to hosting a database in the cloud?  What do you see as the biggest benefit/drawback to using something like DBaaS?   How has cloud changed the way you monitor and run your database infrastructure?


There are definitely no right or wrong answers this week – I’m really just looking for stories. And these stories may vary depending on your cloud provider of choice. To some providers, this trade-off may not even exist. To those doing private cloud, you may have the best of both worlds.


As this is my fourth and final post with the ambassador title I just wanted to thank everyone for the comments over the past month...  This is a great community with tons of engagement and you can bet that I won’t be going anywhere, ambassador or not!

For the most part, database performance monitoring tools do a great job at real-time monitoring. By that I mean alerting us when certain counter thresholds are reached, such as Page Life Expectancy falling below 300 or Memory Pages per Second being too high. Although this is definitely crucial to have set up within our environment, having hard alerts does pose a problem of its own. How do we know that reaching a page life expectancy of 300 is a problem? Maybe this is normal for a certain period of time, such as month-end processing.


This is where the baseline comes into play. A baseline, by definition, is a minimum or starting point used for comparisons. In the database performance analysis world, it’s a snapshot of how our databases and servers are performing at a given point in time when they are not experiencing any issues. We can then take these performance snapshots and use them as a starting point when troubleshooting performance issues. For instance, take into consideration a few of the following questions…


  1. Is my database running slower now than it was last week?
  2. Has my database been impacted by the latest disk failure and RAID rebuild?
  3. Has the new SAN migration impacted my database services in any way?
  4. Has the latest configuration change/application update impacted my servers in any way?
  5. How has the addition of 20 VMs into my environment impacted my database?


With established baselines we are able to quickly see by comparison the answer to all of these questions.  But, let’s take this a step further, and use question 5 in the following scenario.


Jim is currently comparing how his database server is performing now, after adding 20 new VMs into his environment, against a baseline he took a few months back. He concludes, with the data to back him up, that his server is indeed running slower. He is seeing increased read/write latency and increased CPU usage. So is the blame really to be placed on the newly added VMs? Well, this all depends. What if something else currently going on is causing the latency to increase? Say month-end processing and backups are happening now and weren’t when the older baseline was captured.


We can quickly see that baselines, while they are important, are really only as good as the time that you take them.  Comparing a  period of increased activity to a baseline taken during a period of normal activity is really not very useful at all.
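Jim’s problem can be illustrated with a small Python sketch that keeps one baseline per type of period and compares like with like. All of the numbers are invented, the period labels are hypothetical, and the “more than three standard deviations from the mean” test is simply a common rule of thumb chosen for the example, not something prescribed by any particular monitoring product.

```python
from statistics import mean, stdev


def is_abnormal(current, baseline_samples, tolerance=3.0):
    """True if `current` is more than `tolerance` standard deviations
    away from the mean of the baseline samples."""
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    return abs(current - mu) > tolerance * sigma


# Hypothetical read latency samples (ms), one baseline per period type.
baselines = {
    "normal": [4.8, 5.1, 5.0, 5.3, 4.9],
    "month_end": [11.5, 12.2, 12.0, 11.8, 12.4],
}

current_latency_ms = 12.1  # measured during month-end processing

for period, samples in baselines.items():
    verdict = "abnormal" if is_abnormal(current_latency_ms, samples) else "normal"
    print(f"vs {period} baseline: {verdict}")
```

Measured against the everyday baseline, the same reading looks like an incident; measured against the month-end baseline, it is business as usual, which is exactly why the timing of a baseline matters.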


So this week I ask you to simply tell me about how you tackle baselines.

  1. Do you take baselines at all?  How many?  How often?
  2. What counters/metrics do you collect?
  3. Do you baseline your applications during peak usage?  Low usage?  Month end?
  4. Do you rely solely on your monitoring solution for baselining?  Does it show you trending over time?
  5. Can your monitoring solution tell you, based on previous data, what is normal for this period of time in your environment?


You don’t have to stick to these questions – let's just have a conversation about baselining!

A lot of times as administrators or infrastructure people we all too often get stuck “keeping the lights on”.  What I mean by this, is we have our tools and scripts in place to monitor all of our services and databases, we have notification set up to alert us when they are down or experiencing trouble, and we have our troubleshooting methodologies and exercises that we go through in order to get everything back up and running.

The problem being, that's where our job usually ends.  We simply fix the issue and must move on to the next issue in order to keep the business up.  Not that often do we get the chance to research a better way of monitoring or a better way of doing things.  And when we do get that time, how do we get these projects financially backed by a budget?


Throughout my career there have been plenty of times where I have mentioned the need for better or faster storage, more memory, more compute, and different pieces of software to better support me in my role. However, the fact of the matter is that without proof of how these upgrades or greenfield deployments will impact the business, or better yet, how the business will be impacted without them, there’s a pretty good chance that the answer will always be no.


So I’m constantly looking for that silver bullet if you will – something that I can take to my CTO/CFO in order to validate my budget requests.  The problem being, most performance monitoring applications spit out reports dealing with very technical metrics.  My CTO/CFO do not care about the average response time of a query.  They don’t care about table locking and blocking numbers.  What they want to see is how what I’m asking for can either save them money or make them money.


So this is where I struggle and I’m asking you, the thwack community for your help on this one – leave a comment with your best tip or strategy on using performance data and metrics to get budgets and projects approved.  Below are a few questions to help you get started.


  • How do you present your case to a CTO/CFO?  Do you have some go to metrics that you find they understand more than others?
  • Do you correlate performance data with other groups of financial data to show a bottom line impact of a performance issue or outage?
  • Do you map your performance data directly to SLAs that might be in place?  Does this help in selling your pitch?
  • Do you have any specific metrics or performance reports you use to show your business stakeholders the impact on customer satisfaction or brand reputation?


Thanks for reading – I look forward to hearing from you.

We all have them lurking around in our data centers. Some virtual, some physical. Some small, some large. At times we find them consolidated onto one server; other times we see many of them scattered across multiple servers. There’s no getting away from it – I’m talking about the database. Whether it’s relational or somewhat flattened, the database is perhaps one of the oldest and most misunderstood technologies that we find inside a business’s IT infrastructure today. When applications slow down, it’s usually the database that gets the fingers pointed at it, and we as IT professionals need to know how to pinpoint and solve issues inside these mystical structures of data, as missing just a few transactions could potentially result in a lot of lost revenue for our company.


I work for an SMB, where we don’t have teams of specialists or DBAs to look after these things. This normally results in my time and effort being focused on staying proactive by automating things such as index rebuilds and database defragmentation. That said, we still experience issues, and when we do, seeing as I have a million other things to take care of, I don’t have the luxury of taking my time when troubleshooting database issues. So my questions for everyone in my first week as a thwack ambassador are:


  1. Which database application is mostly used within your environment (SQL Server, Oracle, MySQL, DB2, etc)?
  2. Do you have a team and/or a person dedicated solely as a DBA, monitoring performance and analyzing databases?  Or is this left to the infrastructure teams to take care of?
  3. What applications/tools/scripts do you use to monitor your database performance and overall health?
  4. What types of automation and orchestration do you put in place to be proactive in tuning your databases (things such as re-indexing, re-organizing, defragmentation, etc).  And how do you know when the right time is to kick these off?


Thanks for your replies and I can’t wait to see what answers come in.


Related resources:


Article: Hardware or code? SQL Server Performance Examined — Most database performance issues result not from hardware constraint, but rather from poorly written queries and inefficiently designed indexes. In this article, database experts share their thoughts on the true cause of most database performance issues.


Whitepaper: Stop Throwing Hardware at SQL Server Performance — In this paper, Microsoft MVP Jason Strate and colleagues from Pragmatic Works discuss some ways to identify and improve performance problems without adding new CPUs, memory or storage.


Infographic: 8 Tips for Faster SQL Server Performance — Learn 8 things you can do to speed SQL Server performance without provisioning new hardware.
