
Geek Speak


Dropping into an SSH session and running esxtop on an ESXi host can be a daunting task!  With well over 300 metrics available, esxtop can throw numbers and percentages at sysadmins all day long – but without completely understanding them, they are of little use when troubleshooting issues.  Below are a handful of metrics that I find useful when analyzing performance issues with esxtop.

 

CPU

 

Usage (%USED) - CPU is usually not the bottleneck when it comes to performance issues within VMware, but it is still a good idea to keep an eye on the average usage of both the host and the VMs that reside on it.  High CPU usage on a VM may be an indicator that it needs more vCPUs, or a sign that something has gone awry within the guest OS.  Chronic high CPU usage on the host may indicate the need for more resources, in terms of either additional cores or more ESXi hosts within the cluster.

 

Ready (%RDY) - CPU Ready (%RDY) is a very important metric that is brought up in nearly every single blog post dealing with VMware and performance.  Simply put, CPU Ready measures the amount of time that the VM is ready to run on physical CPUs but is waiting for the ESXi CPU scheduler to find the time to do so.  Normally this is caused by other VMs competing for the same resources.  VMs experiencing a high %RDY will definitely see some performance impact; it may indicate the need for more physical cores, or it can sometimes be solved by removing unneeded vCPUs from VMs that do not require more than one.

 

Co-Stop (%CSTP) - Similar to Ready, Co-Stop measures the amount of time the VM was incurring delay due to the ESXi CPU scheduler – the difference being that Co-Stop only applies to VMs with multiple vCPUs, whereas %RDY can also apply to VMs with a single vCPU.  A high number of VMs with high Co-Stop may indicate the need for more physical cores within your ESXi host, too high of a consolidation ratio, or quite simply, too many multi-vCPU VMs.
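
One thing that trips people up is that vCenter reports CPU Ready as a summation in milliseconds rather than the percentage esxtop shows.  Here is a minimal Python sketch of the conversion; the 5% per-vCPU warning threshold and the sample numbers are illustrative assumptions, not hard rules.

```python
# Convert a vCenter "CPU Ready" summation value (milliseconds) into the
# %RDY-style percentage that esxtop reports. A common rule of thumb is to
# start digging somewhere around 5% ready time per vCPU, but treat that
# threshold as an assumption to validate in your own environment.

def ready_percent(ready_ms, interval_seconds, num_vcpus=1):
    """Percentage of the sample interval spent waiting to be scheduled,
    normalized per vCPU."""
    total_pct = (ready_ms / (interval_seconds * 1000.0)) * 100.0
    return total_pct / num_vcpus

if __name__ == "__main__":
    # Example: a 4-vCPU VM reporting 4,000 ms of ready time over vCenter's
    # 20-second real-time interval.
    pct = ready_percent(ready_ms=4000, interval_seconds=20, num_vcpus=4)
    print(f"~{pct:.1f}% ready per vCPU")   # ~5.0% -- worth a closer look
```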

 

Memory

 

Active (%ACTV) - Just as it’s a good idea to monitor the average CPU usage on both hosts and VMs, the same goes for active memory.  Although we cannot necessarily use this metric for right-sizing, due to the way it is calculated, it can be used to see which VMs are actively and aggressively touching memory pages.

 

Swapping (SWR/s, SWW/s, SWTGT, SWCUR) - Memory swapping is a very important metric to watch.  Essentially, if we see this metric anywhere above 0 it means that we are actively swapping memory pages out to the swap file that is created at VM power-on.  This means that instead of serving memory from RAM, we are using much slower disk to do so.  If we see swapping occurring, we may be in the market for more memory in our physical hosts, or looking to migrate certain VMs to other hosts with free physical RAM.

 

Balloon (MCTLSZ/MCTLTGT) - Ballooning isn’t necessarily a bad sign in itself, but it can definitely be used as an early warning symptom for swapping.  When a value is reported for ballooning, it basically states that the host cannot satisfy all of its VMs’ memory requirements and is reclaiming unused memory back from other virtual machines.  Once the balloon driver has reclaimed all the memory it can, swapping is the next logical step, which can be very detrimental to performance.

 

Disk

 

Latency (DAVG, GAVG, KAVG, QAVG) - When it comes to monitoring disk I/O, latency is king.  Within a virtualized environment, though, there are many different places where latency can occur – from leaving the VM, through the VMkernel and the HBA, out to the storage array.  To help understand total latency we can look at the following metrics.

  • KAVG – This is the amount of time that the I/O spends within the VMkernel
  • QAVG – This is the amount of time that the I/O spends waiting in the queue before being issued to the HBA driver (it is counted as part of KAVG)
  • DAVG – This is the amount of time the I/O takes to leave the HBA, get to the storage array, and return back
  • GAVG – We can think of GAVG (Guest Average) as the total of DAVG + KAVG – essentially the total amount of latency as seen by the applications within the VM

 

As you might be able to determine, a high QAVG/KAVG can most certainly be the result of too small a queue depth on your HBA – that, or possibly your host is far too busy and VMs need to be migrated to others.  A high DAVG (>20ms) normally indicates an issue with the storage array itself – either it is incorrectly configured or it is too busy to handle the load.
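
If you capture esxtop in batch mode (esxtop -b > stats.csv) you can sift through these latency counters offline.  Here is a rough Python sketch of that idea; the "MilliSec/Command" column-name substring match and the file name are assumptions to adjust against your own export, since the exact counter naming varies by ESXi version.

```python
import csv

# Scan an esxtop batch-mode export for per-device latency columns and flag
# anything whose average exceeds a threshold (the ~20 ms DAVG rule of thumb).
LATENCY_SUBSTRING = "MilliSec/Command"   # DAVG/KAVG/GAVG-style counters (assumed naming)
THRESHOLD_MS = 20.0

def flag_latency(csv_path):
    sums, counts = {}, {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for column, value in row.items():
                if LATENCY_SUBSTRING not in column:
                    continue
                try:
                    sums[column] = sums.get(column, 0.0) + float(value)
                    counts[column] = counts.get(column, 0) + 1
                except (TypeError, ValueError):
                    continue   # skip blank or non-numeric samples
    for column, total in sorted(sums.items()):
        avg = total / counts[column]
        if avg > THRESHOLD_MS:
            print(f"{avg:6.1f} ms  {column}")

if __name__ == "__main__":
    flag_latency("stats.csv")
```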

 

Network

 

Dropped packets (DRPTX/DRPRX) - As far as network performance goes, there are only a couple of metrics we can monitor from the host level.  DRPTX and DRPRX track the packets that are dropped on the transmit and receive ends respectively.  When we begin to see this metric go above 1, we can usually conclude that we have very high network utilization and may need to increase bandwidth, either out of the host or possibly somewhere along the path the packets are taking.

 

As I mentioned earlier, there are over 300 metrics within esxtop – the above are simply the core ones I use when troubleshooting performance.  Certainly, having a third-party monitoring solution can help you baseline your environment and use these stats to your advantage by summarizing them in more visually appealing ways.  For this week I’d love to hear about some of your real-life situations – When was there a time where you noticed a metric was “out of whack,” and what did you do to fix it?  What are some of your favorite performance metrics to watch and why?  Do you use esxtop or do you have a favorite third-party solution you like to utilize?

 

Thanks for reading!

Let’s face it!  We live in a world now where we are seeing a heavy reliance on software instead of hardware.  With Software Defined Everything popping up all over the place, we are seeing traditional hardware-oriented tasks being built into software – this provides an extreme amount of flexibility and portability in how we choose to deploy and configure various pieces of our environments.

 

With this software management layer taking hold of our virtualized datacenters, we are going through a phase where technologies such as private and hybrid cloud are now within our grasp.  As the cloud descends upon us, there is one key player we need to focus on – the automation and orchestration that quietly executes in the background, the component that delivers the flexibility, efficiency, and simplicity we as sysadmins are expected to provide to our end users.

 

To help drive home the importance of, and reliance on, automation let’s take a look at a simple task – deploying a VM.  When we do this in the cloud, mainly public, it’s just a matter of swiping a credit card, providing some information such as a name and network configuration, waiting a few minutes/seconds, and away we go.  Our end users can have a VM set up almost instantaneously!

 

The ease of use and efficiency of the public cloud in scenarios like the one above is putting added pressure on IT within their respective organizations – we are now expected to create, deliver, and maintain these flexible, cloud-like services within our businesses, and do so with the same efficiency and simplicity that cloud brings to the table.  Virtualization certainly provides a decent starting point for this, but it is automation and orchestration that will take us to the finish line.

 

So how do we do it?

 

Within our enterprise I think we can all agree that we don’t simply create a VM and call it “done”!  There are many other steps that come after we power up that new VM.  We have server naming to contend with, and networking configuration (IP, DNS, firewall, etc.).  We have monitoring solutions that need to be configured in order to properly monitor and respond to outages and issues that may pop up, and I’m pretty certain we will want to include our newly created VM within some sort of backup or replication job in order to protect it.  With more and more software vendors exposing public APIs, we are now living in a world where it’s possible to tie all of these different pieces of our datacenter together.
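
To make that "glue" a little more concrete, here is a deliberately simplified Python sketch of a post-provisioning workflow.  Every URL, payload, and product behind it is hypothetical – substitute the real APIs of your own IPAM, monitoring, and backup tooling.

```python
import requests

# Once a VM is created, walk it through the rest of the lifecycle by calling
# out to the other systems' APIs. All endpoints below are placeholders.
def provision(vm_name, ip_address):
    steps = [
        # Register DNS / IPAM
        ("https://ipam.example.local/api/records",
         {"name": vm_name, "ip": ip_address}),
        # Add the VM to the monitoring platform
        ("https://monitoring.example.local/api/nodes",
         {"hostname": vm_name, "poller": "default"}),
        # Drop it into the nightly backup job
        ("https://backup.example.local/api/jobs/nightly/members",
         {"vm": vm_name}),
    ]
    for url, payload in steps:
        resp = requests.post(url, json=payload, timeout=30)
        resp.raise_for_status()        # fail loudly so the workflow can stop/roll back
        print(f"OK: {url}")

if __name__ == "__main__":
    provision("app-web-01", "10.10.20.31")
```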

 

Automation and orchestration doesn’t stop at creating VMs either – there’s room for it throughout the whole VM life cycle.  The concept of the self-healing datacenter comes to mind – having scripts and actions performed automatically by monitoring software in an effort to fix issues within your environment as they occur – this is all made possible by automation.
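
As a bare-bones illustration of that self-healing idea, here is a tiny Python webhook listener that a monitoring tool could POST alerts to, which then runs a canned remediation.  The alert payload shape ("service", "host") is made up for illustration – match it to whatever your monitoring product actually sends, and keep the auto-remediation list deliberately small.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alert = json.loads(self.rfile.read(length) or b"{}")
        service = alert.get("service")
        if service in ("httpd", "nginx"):          # only auto-heal known-safe services
            subprocess.run(["systemctl", "restart", service], check=False)
            print(f"Restarted {service} after alert from {alert.get('host')}")
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```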

 

So with this I think we can all conclude that automation is a key player within our environments, but the question always remains – should I automate task x?  Meaning, will the time savings and benefits of the automation outweigh the effort and resources it will take to create the process?  So with all this in mind I have a few questions – Do you use automation and orchestration within your environment?  If so, what tasks have you automated thus far?  Do you have a rule of thumb that dictates when you will automate a certain task?  Believe it or not, there are people in this world who are somewhat against automation, whether out of fear for their jobs or simply from not adapting – how do you help “push” these people down the path of automation?

More often than not, application owners look to their vendors to provide a list of requirements for a new project, and the vendor forwards specifications that were developed around maximum load and in the age of physical servers.  These requirements eventually make their way onto the virtualization administrator’s desk: 8 Intel CPUs at 2GHz or higher, 16 GB of memory.  The virtualization administrator is left to fight the good fight with both the app owner and the vendor.  Now, we all cringe when we see requirements like these – we’ve worked hard to build out our virtualization clusters, pooling CPU and memory to present back to our organizations, and we constantly monitor our environments to ensure that resources are available to our applications when they need them – and then a list of requirements like the above comes across in a server build form or email.

 

So what’s the big deal?

 

We have lots of resources, right?!?  Why not just give the people what they want?  Before you start giving away resources, let’s take a look at both CPU and memory and see just how VMware handles its scheduling and overcommitment of both.

 

CPU

 

One of the biggest selling points of virtualization is the fact that we can have more vCPUs attached to VMs than we have physical CPUs on our hosts, and let the ESXi scheduler take care of scheduling the VMs’ time on CPU.  So why don’t we just go ahead and give every VM 4 or 8 vCPUs?  You would think granting a VM more vCPUs would increase its performance – which it most certainly could – but the problem is that you can actually hurt performance as well – not just on that single VM, but on other VMs running on the host.  Since the physical CPUs are shared, there may be times when the scheduler has to place CPU instructions on hold, or wait for physical cores to become available.  For instance, a VM with 4 vCPUs has to wait until 4 physical cores are available before the scheduler can execute its instructions, whereas a VM with 1 vCPU only has to wait for 1 core.  As you can tell, having multiple VMs each containing multiple vCPUs can end up causing a lot of queuing, waiting, and CPU Ready time on the host, resulting in a significant impact on our performance.  Although VMware has made strides in CPU scheduling by implementing “relaxed co-scheduling,” it still only allows a certain amount of skew between the execution of instructions across cores, and it does not completely solve the issues around scheduling and CPU Ready – it’s always best practice to right-size your VMs in terms of the number of vCPUs to avoid as many scheduling conflicts as possible.
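
For a quick back-of-the-napkin view of that scheduling pressure, here is a small Python sketch that compares total assigned vCPUs to a host's physical cores and calls out multi-vCPU VMs.  The 4:1 warning ratio and the inventory are purely illustrative assumptions, not VMware-published limits.

```python
def overcommit_report(host_cores, vms, warn_ratio=4.0):
    """vms: mapping of VM name -> assigned vCPU count."""
    total_vcpus = sum(vms.values())
    ratio = total_vcpus / host_cores
    print(f"{total_vcpus} vCPUs on {host_cores} cores -> {ratio:.1f}:1 overcommit")
    if ratio > warn_ratio:
        print("Ratio is high – expect %RDY/%CSTP to climb under load")
    for name, vcpus in sorted(vms.items()):
        if vcpus > 1:
            # multi-vCPU VMs add co-scheduling pressure on the host
            print(f"  {name}: {vcpus} vCPUs")

if __name__ == "__main__":
    overcommit_report(host_cores=16,
                      vms={"sql01": 8, "web01": 2, "web02": 2, "dc01": 1, "app01": 4})
```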

 

Memory

 

vSphere employs many techniques when managing our virtual machine memory – VMs can share memory pages with each other, eliminating redundant copies of the same pages.  vSphere can also compress memory, as well as use ballooning techniques that allow the host to reclaim unused memory from one VM and make it available to another.  This built-in intelligence almost masks away any performance issues we might see when overcommitting RAM to our virtual machines.  That said, memory is often one of the resources we run out of first, and we should still take precautions to right-size our VMs and prevent waste.  The first thing to consider is overhead – by assigning additional, unneeded memory to our VMs we increase the amount of overhead memory utilized by the hypervisor to run the virtual machine, which in turn takes memory from the pool available to other VMs.  The amount of overhead is determined by the amount of assigned memory as well as the number of vCPUs on the VM, and although this number is small (roughly 150MB for a 16GB RAM/2 vCPU VM), it can begin to add up as our consolidation ratios increase.  Aside from memory waste, overcommitted memory also causes unnecessary waste on the storage end of things.  Each time a VM is powered on, a swap file is created on disk equal in size to the allocated memory (less any memory reservation).  Again, this may not seem like a lot of wasted space at the time, but as we create more and more VMs it can certainly add up to quite a bit of capacity.  Keep in mind that if there is not enough free space available to create this swap file, the VM will not power on.
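
To put a number on that swap-file waste, here is a minimal Python sketch that totals up the .vswp footprint of a set of powered-on VMs (configured memory minus reservation).  The inventory values are fabricated for illustration; feed it whatever your own reporting tool exports.

```python
def swap_footprint_gb(vms):
    """vms: iterable of (name, configured_gb, reservation_gb)."""
    total = 0.0
    for name, configured, reserved in vms:
        # Swap file size = configured memory minus any memory reservation
        swap = max(configured - reserved, 0.0)
        total += swap
        print(f"{name:10s} {swap:5.1f} GB swap file")
    return total

if __name__ == "__main__":
    inventory = [("sql01", 64.0, 32.0), ("web01", 16.0, 0.0), ("app01", 32.0, 0.0)]
    print(f"Total: {swap_footprint_gb(inventory):.1f} GB of datastore consumed by swap files")
```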

 

Certainly these are not the only impacts that oversized virtual machines have on our environment – they can also affect features such as HA, vMotion times, DRS actions, etc., but the above are some of the bigger ones.  Right-sizing is not something that’s done once, either – it’s important to constantly monitor your infrastructure and go back and forth with application owners as things change from day to day and month to month.  Certainly there are a lot of applications and monitoring systems out there that can perform this analysis for us – so use them!

 

All that said, discovering the over- and under-sized VMs within our infrastructure is probably the easiest leg of the journey of reclamation.  Once we have some solid numbers and metrics in hand, we need to somehow present them to other business units and application owners to try and claw back resources – this is where the real challenge begins.  Placing a dollar figure on everything and utilizing features such as showback and chargeback may help, but again, it’s hard to take something back after it’s been given.  So my questions to leave you with this time are as follows – First up, how do you ensure your VMs are right-sized?  Do you use a monitoring solution available today to do so?  And if so, how often are you evaluating your infrastructure for right-sized VMs (monthly, yearly, etc.)?  Secondly, what do you foresee as the biggest challenges in trying to claw back resources from your business?  Do you find that it’s more of a political challenge or simply an educational challenge?

In the past few years, there has been a lot of conversation around the “hypervisor becoming a commodity." It has been said that the underlying virtualization engines, whether they be ESXi, Hyper-V, KVM etc. are essentially insignificant, stressing the importance of the management and automation tools that sit on top of them.

 

These statements do hold some truth: in its basic form, the hypervisor simply runs a virtual machine.  As long as end-users have the performance they need, there's nothing else to worry about.  In truth, the three major hypervisors on the market today (ESXi, Hyper-V, KVM) all do this, and they do it well, so I can see how the “hypervisor becoming a commodity” argument works in those cases.  But SysAdmins, the people managing everything behind the VM, don't buy the commoditized hypervisor theory quite so easily.

 

When we think about the word commodity in terms of IT, it’s usually defined as a product or service that is indistinguishable from its competitors, except perhaps on price. With that said, if the hypervisor were a commodity, we shouldn’t care which hypervisor our applications are running on. We should see no difference between the VMs sitting inside an ESXi cluster and those in a Hyper-V cluster. In fact, in order to be a commodity, these VMs should be able to migrate between hypervisors. The fact is that VMs today are not interchangeable between hypervisors, at least not without changing their underlying anatomy. While it is possible to migrate between hypervisors, there is a process that we have to follow, covering configurations, disks, etc. The files that make up a VM are all proprietary to the hypervisor they run on and cannot simply be migrated and run by another hypervisor in their native forms.

 

Also, we stressed earlier the importance of the management tools that sit above the hypervisor, and how the hypervisor doesn’t matter as much as the management tools do. This is partly true. The management and automation tools put in place are the heart of our virtual infrastructures, but the problem is that these management tools often create a divide in the features they support on different hypervisors. Take, for instance, a storage array providing support for VVOLs, VMware’s answer to per-VM, policy-based storage provisioning. This is a standard that allows us to completely change the way we deploy storage, eliminating LUNs and making VMs and their disks first-class citizens on the underlying storage arrays. That said, these are storage arrays connected to ESXi hosts, not Hyper-V hosts.  Another example, this time in favor of Microsoft, is in the hybrid cloud space. With Azure Stack coming down the pipe, organizations will be able to easily deploy and deliver services from their own data centers, but with Azure-like agility. The comparable VMware solution, involving vCloud Air and vCloud Connector, is simply not at the same level as Azure when it comes to simplicity, in my opinion. They are two very different feature sets that are only available on their respective hypervisors.

 

So with all that, is the hypervisor a commodity?  My take: No! While all the major hypervisors on the market today do one thing – virtualize x86 instructions and provide abstraction to the VMs running on top of them – there are simply too many discrepancies between the compatible third-party tools, features, and products that manage these hypervisors for me to call them commoditized. So I’ll leave you with a few questions. Do you think the hypervisor is a commodity?  When/if the hypervisor fully becomes a commodity, what do you foresee our virtual environments looking like? Single or multi-hypervisor? Looking forward to your comments.


To the cloud!

Posted by mwpreston Sep 28, 2014

Private, Public, Hybrid, Infrastructure as a Service, Database as a Service, Software Defined Datacenter; call it what you will, but for the sake of this post I’m going to sum it all up as cloud.  When virtualization started to become mainstream, we saw a lot of enterprises adopt a “virtualization first” strategy, meaning new services and applications introduced to the business would first be considered for virtualization unless a solid case for acquiring physical hardware could be made.  As the IT world shifts, we are seeing this strategy move toward a “cloud first” strategy.  Companies are asking themselves questions such as “Are there security policies stating we must run this inside our datacenter?”, “Will cloud provide a more highly available platform for this service?”, and “Is it cost-effective for us to place this service elsewhere?”.

 

Honestly, for a lot of services the cloud makes sense!  But is your database environment one of them?  From my experience, database environments stay relatively static.  The database sat on its own physical hardware and watched us implement our “virtualization first” strategies.  We’ve long since virtualized the web front ends, the application servers, and all the other pieces of our infrastructure, but have yet to make the jump on the database.  Sometimes it’s simply due to performance, but with the advances in hypervisors as of late we can’t necessarily blame it on metrics anymore.  And now we are seeing cloud solutions such as DBaaS and IaaS present themselves to us.  Most of the time, the database is the heart of the company – the main revenue driver for our business and customers – and it gets so locked up in change freezes that we have a hard time touching it.  But today, let’s pretend that the opportunity to move “to the cloud” is real.

 

When we look at running our databases in the cloud we really have two main options: DBaaS (database functionality delivered directly to us) and IaaS (the same database functionality, but allowing us to control a portion of the infrastructure underneath it).  No matter which choice we make, to me the whole “database in the cloud” scenario is one big trade-off.  We trade away our control and ownership of the complete stack in our datacenters to gain the agility and mobility that cloud can provide.

 

Think about it!  Currently, we have the ability to monitor the complete stack that our database lives on.  We see all traffic coming into the environment and all traffic going out; we can monitor every single switch, router, and network device inside our four datacenter walls.  We can make BIOS changes to the servers our database resides on.  We have complete control over how our database performs (with the exception of closed vendor code).  In a cloudy world, we hand that control over to our cloud provider.  Sure, we can usually still monitor performance metrics based on database operations, but we don’t necessarily know what else is going on in the environment.  We don’t know who our “neighbors” are or whether what they are doing is affecting us in any way.  We don’t know what changes or tweaks might be going on below the stack that hosts our database.  On the flip side, though, do we care?  We’ve paid good money for these services and SLAs and put our trust in the cloud provider to take care of this for us.  In return, we get agility.  We get faster deployment times.  We aren’t waiting anymore for servers to arrive or storage to be provisioned.  In the case of DBaaS we get embedded best practices.  A lot of DBaaS providers do one thing and one thing alone: make databases efficient, fast, resilient, and highly available.  Sometimes the burden of DR and recovery is taken care of for us – we don’t need to buy two of everything.  Perhaps the biggest advantage, though, is the fact that we only pay for what we use.  As heavy resource peaks emerge, we can burst and scale up automatically.  When those periods are over, we can retract and scale back down.

 

So thoughtful remarks for the week – What side of the “agility vs control” tradeoff do you or your business take?  Have you already made a move to hosting a database in the cloud?  What do you see as the biggest benefit/drawback to using something like DBaaS?   How has cloud changed the way you monitor and run your database infrastructure?

 

There are definitely no right or wrong answers this week – I’m really just looking for stories.  And these stories may vary depending on your cloud provider of choice.  To some providers, this trade-off may not even exist.  To those doing private cloud, you may have the best of both worlds.

 

As this is my fourth and final post with the ambassador title I just wanted to thank everyone for the comments over the past month...  This is a great community with tons of engagement and you can bet that I won’t be going anywhere, ambassador or not!

For the most part, database performance monitoring tools do a great job at real-time monitoring – by that I mean alerting us when certain counter thresholds are reached, such as Page Life Expectancy dropping below 300 or Memory Pages per Second climbing too high.  Although this is definitely crucial to have set up within our environment, having hard alerts does pose a problem of its own.  How do we know that a page life expectancy of 300 is a problem?  Maybe this is normal for a certain period of time, such as month-end processing.

 

This is where the baseline comes into play.  A baseline, by definition, is a minimum or starting point used for comparison.  In the database performance analysis world, it’s a snapshot of how our databases and servers perform when they are not experiencing any issues at a given point in time.  We can then take these performance snapshots and use them as a starting point when troubleshooting performance issues.  For instance, take into consideration a few of the following questions…

 

  1. Is my database running slower now than it was last week?
  2. Has my database been impacted by the latest disk failure and RAID rebuild?
  3. Has the new SAN migration impacted my database services in any way?
  4. Has the latest configuration change/application update impacted my servers in any way?
  5. How have the addition of 20 VMs into my environment impacted my database?

 

With established baselines we are able to quickly see by comparison the answer to all of these questions.  But, let’s take this a step further, and use question 5 in the following scenario.

 

Jim is currently comparing how his database server is performing now against a baseline he took a few months back, before 20 new VMs were added to his environment.  He concludes, with the data to back him up, that his server is indeed running slower.  He is seeing increased read/write latency and increased CPU usage.  So is the blame really to be placed on the newly added VMs?  Well, that all depends – what if something else is currently going on that is causing the latency to increase?  Say month-end processing and backups are happening now and weren't during the snapshot of the older baseline.

 

We can quickly see that baselines, while important, are really only as good as the time at which you take them.  Comparing a period of increased activity to a baseline taken during a period of normal activity is not very useful at all.
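
To illustrate the kind of comparison I'm describing, here is a small Python sketch that compares the current window against historical samples taken during the same kind of period (same hour of day, same day of month, and so on).  The sample latency values and the two-sigma cutoff are made-up assumptions; feed it whatever your monitoring tool exports.

```python
from statistics import mean, stdev

def compare_to_baseline(current, baseline, label):
    """Flag the current window if it sits more than ~2 standard deviations
    away from the matched historical baseline."""
    base_mean, base_sd = mean(baseline), stdev(baseline)
    cur_mean = mean(current)
    sigmas = (cur_mean - base_mean) / base_sd if base_sd else float("inf")
    status = "OUT OF RANGE" if abs(sigmas) > 2 else "normal"
    print(f"{label}: now {cur_mean:.1f} vs baseline {base_mean:.1f} "
          f"({sigmas:+.1f} sigma) -> {status}")

if __name__ == "__main__":
    # Disk read latency (ms) sampled during previous month-end runs vs right now.
    month_end_baseline = [18.0, 22.0, 19.5, 21.0, 20.5]
    current_month_end = [31.0, 29.5, 33.0]
    compare_to_baseline(current_month_end, month_end_baseline, "read latency (ms)")
```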

 

So this week I ask you to simply tell me about how you tackle baselines.

  1. Do you take baselines at all?  How many?  How often?
  2. What counters/metrics do you collect?
  3. Do you baseline your applications during peak usage?  Low usage?  Month end?
  4. Do you rely solely on your monitoring solution for baselining?  Does it show you trending over time?
  5. Can your monitoring solution tell you, based on previous data, what is normal for this period of time in your environment?

 

You don’t have to stick to these questions – let's just have a conversation about baselining!

As administrators or infrastructure people, we all too often get stuck “keeping the lights on.”  What I mean by this is that we have our tools and scripts in place to monitor all of our services and databases, we have notifications set up to alert us when they are down or experiencing trouble, and we have our troubleshooting methodologies and exercises that we go through in order to get everything back up and running.


The problem is, that's where our job usually ends.  We simply fix the issue and must move on to the next one in order to keep the business up.  Not often do we get the chance to research a better way of monitoring or a better way of doing things.  And when we do get that time, how do we get these projects financially backed by a budget?

 

Throughout my career there have been plenty of times where I have mentioned the need for better or faster storage, more memory, more compute, and different pieces of software to better support me in my role.  However, the fact of the matter is that without proof of how these upgrades or greenfield deployments will impact the business – or, better yet, how the business will be impacted without them – there's a pretty good chance that the answer will always be no.

 

So I’m constantly looking for that silver bullet, if you will – something that I can take to my CTO/CFO in order to validate my budget requests.  The problem is, most performance monitoring applications spit out reports dealing with very technical metrics.  My CTO/CFO do not care about the average response time of a query.  They don’t care about table locking and blocking numbers.  What they want to see is how what I’m asking for can either save them money or make them money.

 

So this is where I struggle, and I’m asking you, the thwack community, for your help on this one – leave a comment with your best tip or strategy on using performance data and metrics to get budgets and projects approved.  Below are a few questions to help you get started.

 

  • How do you present your case to a CTO/CFO?  Do you have some go to metrics that you find they understand more than others?
  • Do you correlate performance data with other groups of financial data to show a bottom line impact of a performance issue or outage?
  • Do you map your performance data directly to SLAs that might be in place?  Does this help in selling your pitch?
  • Do you have any specific metrics or performance reports you use to show your business stakeholders the impact on customer satisfaction or brand reputation?

 

Thanks for reading – I look forward to hearing from you.

We all have them lurking around in our data centers.  Some virtual, some physical.  Some small, some large.  At times we find them consolidated onto one server; other times we see many of them scattered across multiple servers.  There’s no getting away from it – I’m talking about the database.  Whether it’s relational or somewhat flattened, the database is perhaps one of the oldest and most misunderstood technologies we find inside business IT infrastructures today.  When applications slow down, it’s usually the database at which the fingers get pointed – and we as IT professionals need to know how to pinpoint and solve issues inside these mystical structures of data, as missing just a few transactions could potentially result in a lot of lost revenue for our company.

 

I work for an SMB, where we don’t have teams of specialists or DBAs to look after these things.  This normally results in my time and effort focusing on being proactive by automating things such as index rebuilds and database defragmentation.  That said, we still experience issues, and when we do – seeing as I have a million other things to take care of – I don’t have the luxury of taking my time troubleshooting database issues.  So my questions for everyone on my first week as a thwack ambassador are:

 

  1. Which database application is mostly used within your environment (SQL Server, Oracle, MySQL, DB2, etc)?
  2. Do you have a team and/or a person dedicated solely as a DBA, monitoring performance and analyzing databases?  Or is this left to the infrastructure teams to take care of?
  3. What applications/tools/scripts do you use to monitor your database performance and overall health?
  4. What types of automation and orchestration do you put in place to be proactive in tuning your databases (things such as re-indexing, re-organizing, defragmentation, etc.)?  And how do you know when the right time is to kick these off?  (A rough sketch of the kind of job I have in mind follows this list.)
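
For question 4, here is a hedged Python sketch of the kind of proactive re-index job I mean, using the common 5%/30% reorganize-vs-rebuild rule of thumb against SQL Server's sys.dm_db_index_physical_stats DMV.  The connection string, server, and database names are placeholders – test something like this against a non-production copy first.

```python
import pyodbc

CONN = ("DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sql01.example.local;DATABASE=AppDB;Trusted_Connection=yes;")

FRAG_QUERY = """
SELECT OBJECT_SCHEMA_NAME(ips.object_id) AS schema_name,
       OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent  AS frag
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE i.name IS NOT NULL AND ips.avg_fragmentation_in_percent > 5;
"""

def maintain_indexes():
    # Reorganize lightly fragmented indexes, rebuild anything over ~30%.
    with pyodbc.connect(CONN, autocommit=True) as conn:
        cur = conn.cursor()
        for schema, table, index, frag in cur.execute(FRAG_QUERY).fetchall():
            action = "REBUILD" if frag > 30 else "REORGANIZE"
            stmt = f"ALTER INDEX [{index}] ON [{schema}].[{table}] {action};"
            print(f"{frag:5.1f}%  {stmt}")
            cur.execute(stmt)

if __name__ == "__main__":
    maintain_indexes()
```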

 

Thanks for your replies and I can’t wait to see what answers come in.

 

Related resources:

 

Article: Hardware or code? SQL Server Performance Examined — Most database performance issues result not from hardware constraint, but rather from poorly written queries and inefficiently designed indexes. In this article, database experts share their thoughts on the true cause of most database performance issues.

 

Whitepaper: Stop Throwing Hardware at SQL Server Performance — In this paper, Microsoft MVP Jason Strate and colleagues from Pragmatic Works discuss some ways to identify and improve performance problems without adding new CPUs, memory or storage.

 

Infographic: 8 Tips for Faster SQL Server Performance — Learn 8 things you can do to speed SQL Server performance without provisioning new hardware.
