
Geek Speak

9 Posts authored by: mwpreston

When organizations first take on the challenge of setting up a disaster recovery plan, it’s almost always based on the premise that a complete failure will occur. With that in mind, we take the approach of planning for a complete recovery. We replicate our services and VMs to some sort of secondary site and go through the process of documenting how to bring them all up again. While this may be the basis of the technical recovery portion of a DR plan, it’s important to take a step back before jumping right into the assumption of having to recover from a complete failure. Disasters come in all shapes, forms, and sizes, and a great DR plan will accommodate as many types of disasters as possible. For example, we wouldn’t use the same “runbook” to recover from simple data loss that we would use to recover from the total devastation of a hurricane. That just wouldn’t make sense. So, even before beginning the recovery portions of our disaster recovery plans, we really should focus on the disaster portion.

 

Classifying Disasters

 

As mentioned above, the human mind always seems to jump to planning for the worst-case scenario when hearing the words disaster recovery: a building burning down, flooding, etc. What we often fail to plan for are the other, less dramatic disasters, such as a temporary loss of power or loss of access to the building due to quarantine. So, with that said, let’s begin to classify disasters. For the most part, we can lump a disaster into two main categories:

 

Natural Disasters – these are the most recognized types of disasters. Think of events such as a hurricane, flooding, fire, earthquake, lightning, water damage, etc. When planning for a natural disaster, we can normally go under the assumption that we will be performing a complete recovery or avoidance scenario to a secondary location.

 

Man-made Disasters – These are the types of disasters that organizations less often consider when looking at DR. Think about things such as temporary loss of power, cyberattacks, ransomware, protests, etc. While these intentional and unintentional acts are not as commonly planned for, a good disaster recovery plan will address some of them, as the recovery from them is often much different from that of a natural disaster.

 

Once we have classified our disaster into one of these two categories, we can then drill down further on the disasters themselves. Performing a risk and impact assessment of the disaster scenarios is a great next step. Answers to questions like the ones listed below should be considered when performing our risk assessment because they allow us to further classify our disasters and, in turn, define expectations and appropriate responses accordingly.

 

  • Do we still have access to our main premises?
  • Have we lost any data?
  • Has any IT function been depleted or lost?
  • Do we have loss of skill set?

 

How these questions are answered for a given disaster can completely change our recovery scenario. For example, if we have had a fire in the data center and lost data, we would most likely be failing over to another building in a designated amount of time. However, if we had also lost employees, more specifically IT employees, in that fire, then the time to recover will certainly be extended, as we most likely would have lost the skill sets and talent needed to execute the DR plan. Another great example comes in the form of ransomware. While we would still have physical access to our main premises, the data loss could be much greater due to widespread encryption from the ransomware itself. If our backups were not air-gapped or separate from our infrastructure, then we may also have encrypted backups, meaning we have lost an IT function, thus provoking a possible failover scenario even with physical access to the building. On the flip side, our risks may not even be technical in nature. What is the impact of losing physical access to our building as the result of protests or chemical spills? Some disasters like this may not even require a recovery process at all, but still pose a threat due to the loss of access to the hardware.
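
To make the classification exercise a little more concrete, here is a minimal sketch (in Python, with entirely made-up scenario names, impact fields, and runbook labels) of how a simple classification matrix could be kept alongside the DR plan, mapping each scenario to its risk-assessment answers and the runbook that should be executed:

    # Hypothetical classification matrix: every name and value here is illustrative.
    scenarios = {
        "ransomware": {
            "category": "man-made",
            "premises_accessible": True,
            "data_lost": True,
            "it_function_lost": True,     # e.g. backups encrypted as well
            "skill_set_lost": False,
            "runbook": "restore-from-airgapped-backups",
        },
        "hurricane": {
            "category": "natural",
            "premises_accessible": False,
            "data_lost": True,
            "it_function_lost": True,
            "skill_set_lost": False,
            "runbook": "full-failover-to-secondary-site",
        },
        "chemical-spill": {
            "category": "man-made",
            "premises_accessible": False,
            "data_lost": False,
            "it_function_lost": False,
            "skill_set_lost": False,
            "runbook": "remote-operations-only",
        },
    }

    def pick_runbook(scenario_name):
        """Return the documented runbook for a classified disaster scenario."""
        return scenarios[scenario_name]["runbook"]

    print(pick_runbook("ransomware"))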

 

Disaster recovery is a major undertaking, no matter the size of the company or IT infrastructure, and it can take copious amounts of time and resources to get it off the ground. With that said, don’t make the mistake of only planning for those big natural disasters. While they may be a great starting point, it’s best to also list out some of the more common, more probable types of disasters, and document the risks and recovery steps for each. In the end, you are more likely to be battling cyber attacks, power loss, and data corruption than you are to be fighting off a hurricane. The key takeaway is: classify many different disaster types, document them, and in the end you will have a more robust, more holistic plan you can use when the time comes. I would love to hear from you in regards to your journeys with DR. How do you begin to classify disasters or construct a DR plan? Have you experienced any "uncommon" scenarios which your DR plan has or hasn't addressed? Leave some comments below and let's keep this conversation going.

Dropping into an SSH session and running esxtop on an ESXi host can be a daunting task!  With well over 300 metrics available, esxtop can throw numbers and percentages at sysadmins all day long – but without a solid understanding of what they mean, they will prove quite useless for troubleshooting issues.  Below are a handful of metrics that I find useful when analyzing performance issues with esxtop.
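
Before digging into the individual counters, it helps to know that esxtop doesn’t have to be consumed live. Below is a minimal sketch of collecting the data in batch mode and sifting through it offline; the exact column header wording in the CSV can vary between builds, so treat the string matching as an assumption to adjust for your hosts:

    # On the host, batch mode dumps every counter to a (very wide) CSV file:
    #
    #   esxtop -b -d 5 -n 60 > /tmp/esxtop.csv
    #
    # (-b = batch mode, -d = sample delay in seconds, -n = number of samples)
    # Copy the file off the host, then pull out just the counters of interest.

    import csv

    KEEP = ("% Ready", "% Used", "% CoStop")   # CPU counters discussed below

    with open("esxtop.csv", newline="") as f:
        rows = csv.reader(f)
        header = next(rows)
        wanted = [i for i, name in enumerate(header) if any(k in name for k in KEEP)]
        for sample in rows:
            for i in wanted:
                print(header[i], sample[i])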

 

CPU

 

Usage (%USED) - CPU is usually not the bottleneck when it comes to performance issues within VMware, but it is still a good idea to keep an eye on the average usage of both the host and the VMs that reside on it.  High CPU usage on a VM may be an indicator that more vCPUs are required, or a sign that something has gone awry within the OS.  Chronic high CPU usage on the host may indicate the need for more resources, in terms of either additional cores or more ESXi hosts within the cluster.

 

Ready (%RDY) - CPU Ready (%RDY) is a very important metric that is brought up in nearly every single blog post dealing with VMware and performance.  Simply put, CPU Ready measures the amount of time that the VM is ready to run on physical CPUs but is waiting for the ESXi CPU scheduler to find the time to do so.  Normally this is caused by other VMs competing for the same resources.  VMs experiencing a high %RDY will definitely see some performance impact, which may indicate the need for more physical cores, or can sometimes be solved by removing unneeded vCPUs from VMs that do not require more than one.

 

Co-Stop (%CSTP) - Similar to Ready, Co-Stop measures the amount of time the VM was incurring delay due to the ESXi CPU scheduler – the difference being that Co-Stop only applies to VMs with multiple vCPUs, while %RDY also applies to VMs with a single vCPU.  A high number of VMs with high Co-Stop may indicate the need for more physical cores within your ESXi host, too high a consolidation ratio, or quite simply, too many multi-vCPU VMs.
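
One thing worth remembering when reading %RDY is that esxtop reports it summed across all of a VM’s vCPUs, so a big number on a large VM isn’t always as scary as it looks. The helper below is a rough sketch of that normalization; the 5% per-vCPU warning level is a common community rule of thumb rather than an official VMware threshold, so treat it as an assumption to tune:

    # Normalize a VM's group %RDY by its vCPU count and flag it against a
    # per-vCPU threshold (assumed here to be 5%).

    def ready_per_vcpu(group_rdy_pct, num_vcpus):
        return group_rdy_pct / num_vcpus

    def flag_ready(group_rdy_pct, num_vcpus, warn_at=5.0):
        per_vcpu = ready_per_vcpu(group_rdy_pct, num_vcpus)
        return per_vcpu, per_vcpu >= warn_at

    # An 8-vCPU VM showing 40% ready is really about 5% per vCPU
    print(flag_ready(40.0, 8))   # (5.0, True) -> worth a closer look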

 

Memory

 

Active (%ACTV) - Just as it’s a good idea to monitor the average CPU usage on both hosts and VMs, the same goes for active memory.  Although we cannot necessarily use this metric for right-sizing due to the way it is calculated, it can be used to see which VMs are actively and aggressively touching memory pages.

 

Swapping (SWR/s, SWW/s, SWTGT, SWCUR) - Memory swapping is a very important metric to watch.  Essentially, if we see this metric anywhere above 0, it means that we are actively swapping memory pages out to the swap file that is created upon VM power-on.  This means that instead of serving those pages from RAM, we are using much slower disk to do so.  If we see swapping occurring, we may be in the market for more memory in our physical hosts, or looking to migrate certain VMs to other hosts with free physical RAM.

 

Balloon (MCTLSZ, MCTLTGT) - Ballooning isn’t necessarily a bad sign in itself, but it can definitely be used as an early warning symptom of swapping.  When a value is reported for ballooning, it basically states that the host cannot satisfy its VMs’ memory requirements and is essentially reclaiming unused memory back from other virtual machines.  Once we are through reclaiming memory via the balloon driver, swapping is the next logical step, which can be very detrimental to performance.

 

Disk

 

Latency (DAVG, GAVG, KAVG, QAVG) - When it comes to monitoring disk I/O, latency is king.  Within a virtualized environment, though, there are many different places where latency can be introduced as an I/O leaves the VM and passes through the VMkernel, the HBA, and the storage array.  To help understand total latency we can look at the following metrics.

  • KAVG – This is the amount of time that the I/O spends within the VMkernel.
  • QAVG – This is the amount of time that the I/O spends waiting in the device queue inside the VMkernel (it is counted as part of KAVG).
  • DAVG – This is the amount of time the I/O takes to leave the HBA, get to the storage array, and return.
  • GAVG – We can think of GAVG (Guest Average) as DAVG plus KAVG (with QAVG already included in KAVG) – essentially the total latency as seen by the applications within the VM.

 

As you might be able to tell, a high QAVG/KAVG can most certainly be the result of too small a queue depth on your HBA – that, or your host is simply too busy and VMs need to be migrated to other hosts.  A high DAVG (>20ms) normally indicates an issue with the storage array itself: either it is incorrectly configured or it is too busy to handle the load.
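
As a quick way of turning those numbers into a first guess at where the problem lives, here is a small triage sketch. The >20ms DAVG figure is the one mentioned above; the ~2ms KAVG/QAVG figure is a commonly quoted guideline and is an assumption you may want to adjust for your environment:

    # Rough first-pass triage of the esxtop disk latency counters.

    def triage_latency(davg_ms, kavg_ms, qavg_ms):
        findings = []
        if davg_ms > 20:
            findings.append("DAVG high: look at the storage array, fabric, or its load")
        if kavg_ms > 2 or qavg_ms > 2:
            findings.append("KAVG/QAVG high: check HBA queue depth or an overly busy host")
        return findings or ["latency looks healthy"]

    print(triage_latency(davg_ms=32.0, kavg_ms=0.4, qavg_ms=0.1))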

 

Network

 

Dropped packets (DRPTX/DRPRX) - As far as network performance goes, there are only a couple of metrics we can monitor at the host level.  DRPTX and DRPRX count the packets that are dropped on the transmit or receive side respectively.  When we begin to see these counters climb, we can conclude that network utilization is very high and we may need to increase the bandwidth out of the host, or possibly somewhere along the path the packets are taking.

 

As I mentioned earlier, there are over 300 metrics within esxtop – the above are simply the core ones I use when troubleshooting performance.  Certainly, having a third-party monitoring solution can help you baseline your environment and use these stats more to your advantage by summarizing them in more visually appealing ways.  For this week I’d love to hear about some of your real-life situations - When was there a time where you noticed a metric was “out of whack,” and what did you do to fix it?  What are some of your favorite performance metrics to watch and why?  Do you use esxtop or do you have a favorite third-party solution you like to utilize?

 

Thanks for reading!

Let’s face it!  We live in a world now where we are seeing a heavy reliance on software instead of hardware.  With Software Defined Everything popping up all over the place, we are seeing traditionally hardware-oriented tasks being built into software – this provides an extreme amount of flexibility and portability in how we choose to deploy and configure various pieces of our environments.

 

With this software management layer taking hold of our virtualized datacenters we are going through a phase where technologies such as private and hybrid cloud are now within our grasp.  As the cloud descends upon us there is one key player that we need to focus on – the automation and orchestration that quietly executes in the background, the key component to providing the flexibility, efficiency, and simplicity that we as sysadmins are expected to provide to our end users.

 

To help drive home the importance of and reliance on automation, let’s take a look at a simple task – deploying a VM.  When we do this in the cloud, mainly public, it’s just a matter of swiping a credit card, providing some information regarding a name and network configuration, waiting a few minutes or seconds, and away we go.  Our end users can have a VM set up almost instantaneously!

 

The ease of use and efficiency of the public cloud in scenarios like the one above is putting added pressure on IT within their respective organizations – we are now expected to create, deliver, and maintain these flexible, cloud-like services within our businesses, and do so with the same efficiency and simplicity that cloud brings to the table.  Virtualization certainly provides a decent starting point for this, but it is automation and orchestration that will take us to the finish line.

 

So how do we do it?

 

Within our enterprises I think we can all agree that we don’t simply create a VM and call it “done”!  There are many other steps that come after we power up that new VM.  We have server naming to contend with, and networking configuration (IP, DNS, firewall, etc.).  We have monitoring solutions that need to be configured in order to properly monitor and respond to outages and issues that may pop up, and I’m pretty certain we will want to include our newly created VM in some sort of backup or replication job in order to protect it.  With more and more software vendors exposing public APIs, we are now living in a world where it’s possible to tie all of these different pieces of our datacenter together.
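
As a rough illustration of what tying those pieces together might look like, here is a minimal sketch of a post-provisioning step driven over REST. Every URL, payload, and endpoint below is a made-up placeholder rather than a real vendor API; the point is simply that once each tool exposes an API, the hand-offs become a short script instead of a checklist:

    # Hypothetical post-provisioning orchestration: register a freshly built VM
    # with monitoring, backup, and the CMDB. All endpoints and payloads are
    # placeholders, not real product APIs.

    import requests

    def onboard_vm(vm_name, ip_address):
        # 1. Register the new VM with the monitoring platform
        requests.post("https://monitoring.example.local/api/nodes",
                      json={"name": vm_name, "ip": ip_address}, timeout=30).raise_for_status()

        # 2. Add it to an existing backup/replication job
        requests.post("https://backup.example.local/api/jobs/nightly/members",
                      json={"vm": vm_name}, timeout=30).raise_for_status()

        # 3. Record the build in the CMDB/change system
        requests.post("https://cmdb.example.local/api/ci",
                      json={"name": vm_name, "type": "virtual-machine"}, timeout=30).raise_for_status()

    onboard_vm("app-web-07", "10.10.20.57")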

 

Automation and orchestration doesn’t stop at creating VMs either – there’s room for it throughout the whole VM life cycle.  The concept of the self-healing datacenter comes to mind – having scripts and actions performed automatically by monitoring software in an effort to fix issues within your environment as they occur – and this is all made possible by automation.

 

So with this I think we can all conclude that automation is a key player within our environments, but the question always remains – should I automate task x?  Meaning, will the time savings and benefits of creating the automation outweigh the effort and resources it will take to create the process?  So with all this in mind I have a few questions: Do you use automation and orchestration within your environment?  If so, what tasks have you automated thus far?  Do you have a rule of thumb that dictates when you will automate a certain task?  Believe it or not, there are people in this world who are somewhat against automation, whether out of fear for their jobs or simply from not adapting – how do you help “push” these people down the path of automation?

More often than not, application owners look to their vendors to provide a list of requirements for a new project, and the vendor forwards specifications that were developed around maximum load, back in the age of physical servers.  These requirements eventually make their way onto the virtualization administrator’s desk: 8 Intel CPUs at 2GHz or higher, 16 GB of memory.  The virtualization administrator is left to fight the good fight with both the app owner and the vendors.  Now, we all cringe when we see requirements such as these – we’ve worked hard to build out our virtualization clusters, pooling CPU and memory to present back to our organizations, and we constantly monitor our environments to ensure that resources are available to our applications when they need them – and then a list of requirements like the above comes across through a server build form or email.

 

So what’s the big deal?

 

We have lots of resources, right?  Why not just give the people what they want?  Before you start giving away resources, let’s take a look at both CPU and memory and see just how VMware handles the scheduling and overcommitment of each.

 

CPU

 

One of the biggest selling points of virtualization is the fact that we can have more vCPUs attached to VMs than we have physical CPUs on our hosts, and let the ESXi scheduler take care of scheduling the VMs’ time on CPU.  So why don’t we just go ahead and give every VM 4 or 8 vCPUs?  You would think granting a VM more vCPUs would increase its performance – which it most certainly could – but the problem is that you can actually hurt performance as well – not just on that single VM, but on other VMs running on the host.  Since the physical CPUs are shared, there may be times when the scheduler has to place CPU instructions on hold, waiting for physical cores to become available.  For instance, a VM with 4 vCPUs will have to wait until 4 physical cores are available before the scheduler can execute its instructions, whereas a VM with 1 vCPU only has to wait for a single core.  As you can tell, having multiple VMs each containing multiple vCPUs could in fact end up with a lot of queuing, waiting, and CPU ready time on the host, resulting in a significant impact on our performance.  Although VMware has made strides in CPU scheduling by implementing “relaxed co-scheduling,” it still only allows for a certain time drift between the execution of instructions across cores, and it does not completely solve the issues around scheduling and CPU Ready – it’s always best practice to right-size your VMs in terms of the number of vCPUs to avoid as many scheduling conflicts as possible.
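
A simple number to keep an eye on while right-sizing is the host or cluster vCPU:pCPU ratio. The sketch below assumes a 3:1 comfort level, which is a rule of thumb rather than a VMware recommendation, since the ratio a given environment can tolerate depends entirely on how busy the workloads are:

    # Back-of-the-napkin vCPU:pCPU over-commitment check for a host or cluster.

    def overcommit_ratio(total_vcpus_assigned, physical_cores):
        return total_vcpus_assigned / physical_cores

    ratio = overcommit_ratio(total_vcpus_assigned=96, physical_cores=32)
    if ratio > 3:   # assumed comfort level, not an official threshold
        print(f"{ratio:.1f}:1 vCPU to pCPU - keep a close eye on %RDY and co-stop")
    else:
        print(f"{ratio:.1f}:1 vCPU to pCPU - plenty of scheduling headroom")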

 

Memory

 

vSphere employs many techniques when managing our virtual machine memory – VMs can share memory pages with each other, eliminating redundant copies of the same pages.  vSphere can also compress memory, as well as use ballooning techniques which allow one VM to essentially borrow allocated memory from another.  This built-in intelligence almost masks away any issues we might see from overcommitting RAM to our virtual machines.  That said, memory is often one of the resources we run out of first, and we should still take precautions to right-size our VMs to prevent waste.  The first thing to consider is overhead – by assigning additional, unneeded memory to our VMs, we increase the amount of overhead memory that is used by the hypervisor in order to run the virtual machine, which in turn takes memory from the pool available to other VMs.  The amount of overhead is determined by the amount of assigned memory as well as the number of vCPUs on the VM, and although this number is small (roughly 150MB for a 16GB RAM/2vCPU VM), it can begin to add up as our consolidation ratios increase.  Aside from memory waste, overcommitted memory also causes unnecessary waste on the storage end of things.  Each time a VM is powered on, a swap file is created on disk, equal in size to the allocated memory less any memory reservation.  Again, this may not seem like a lot of wasted space at the time, but as we create more and more VMs it can certainly add up to quite a bit of capacity.  Keep in mind that if there is not enough free space available to create this swap file, the VM will not be able to power on.
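
To put a number on the storage side of that waste, the sketch below adds up the swap file footprint for a handful of VMs, using the fact that each .vswp file is sized at configured memory minus any reservation. The VM names and sizes are purely illustrative:

    # Estimate datastore capacity consumed by VM swap (.vswp) files.
    vms = [
        {"name": "app01", "mem_gb": 16, "reservation_gb": 0},
        {"name": "db01",  "mem_gb": 64, "reservation_gb": 32},
        {"name": "web01", "mem_gb": 8,  "reservation_gb": 0},
    ]

    swap_gb = sum(vm["mem_gb"] - vm["reservation_gb"] for vm in vms)
    print(f"Swap files consume roughly {swap_gb} GB of datastore capacity")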

 

Certainly these are not the only impacts that oversized virtual machines have on our environment – they can also affect features such as HA, vMotion times, DRS actions, etc. – but these are some of the bigger ones.  Right-sizing is not something that’s done once, either – it’s important to constantly monitor your infrastructure and go back and forth with application owners as things change from day to day and month to month.  Certainly there are a lot of applications and monitoring systems out there that can perform this analysis for us, so use them!

 

All that said, discovering the over- and under-sized VMs within our infrastructure is probably the easiest leg of the reclamation journey.  Once we have some solid numbers and metrics in hand, we need to somehow present these to other business units and application owners to try and claw back resources – this is where the real challenge begins.  Placing a dollar figure on everything and utilizing features such as showback and chargeback may help, but again, it’s hard to take something back after it’s been given.  So my questions to leave you with this time are as follows – First, how do you ensure your VMs are right-sized?  Do you use a monitoring solution available today to do so?  And if so, how often are you evaluating your infrastructure for right-sized VMs (monthly, yearly, etc.)?  Secondly, what do you foresee as the biggest challenges in trying to claw back resources from your business?  Do you find that it’s more of a political challenge or simply an educational challenge?

In the past few years, there has been a lot of conversation around the “hypervisor becoming a commodity." It has been said that the underlying virtualization engines, whether they be ESXi, Hyper-V, KVM etc. are essentially insignificant, stressing the importance of the management and automation tools that sit on top of them.

 

These statements do hold some truthfulness: in its basic form, the hypervisor simply runs a virtual machine. As long as end-users have the performance they need, there's nothing else to worry about. In truth, though, the three major hypervisors on the market today (ESXi, Hyper-V, KVM) do this, and they do it well, so I can see how the “hypervisor becoming a commodity” works in these cases. But to SysAdmins, the people managing everything behind the VM, the commoditized hypervisor theory isn't bought quite so easily.

 

When we think about the word commodity in terms of IT, it’s usually defined as a product or service that is indistinguishable from its competitors’ offerings, except maybe for price. With that said, if the hypervisor were a commodity, we shouldn’t care which hypervisor our applications are running on. We should see no difference between the VMs that are sitting inside an ESXi cluster or a Hyper-V cluster. In fact, in order to be a commodity, these VMs should be able to migrate between hypervisors. The fact is that VMs today are not interchangeable between hypervisors, at least not without changing their underlying anatomy. While it is possible to migrate between hypervisors, there is a process that we have to follow, covering configurations, disks, etc. The files that make up a VM are all proprietary to the hypervisor it runs on and cannot simply be migrated and run by another hypervisor in their native forms.

 

Also, we stressed earlier the importance of the management tools that lie above the hypervisor, and how the hypervisor didn’t matter as much as the management tools did. This is partly true. The management and automation tools put in place are the heart of our virtual infrastructures, but the problem is that these management tools often create a divide in the features they support on different hypervisors. Take, for instance, a storage array providing support for VVOLs, VMware’s answer to per-VM, policy-based storage provisioning. This is a standard that allows us to completely change the way we deploy storage, eliminating LUNs and making VMs and their disks first-class citizens on the underlying storage arrays. That said, these are storage arrays that are connected to ESXi hosts, not Hyper-V hosts.  Another example, this time in favor of Microsoft, is in the hybrid cloud space. With Azure Stack coming down the pipe, organizations will be able to easily deploy and deliver services from their own data centers, but with Azure-like agility. The VMware solution, which is similar, involving vCloud Air and vCloud Connector, is simply not at the same level as Azure when it comes to simplicity, in my opinion. These are two very different feature sets that are only available on their respective hypervisors.

 

So with all that, is the hypervisor a commodity?  My take: No! While all the major hypervisors on the market today do one thing – virtualize x86 instructions and provide abstraction to the VMs running on top of them – there are simply too many discrepancies between the compatible third-party tools, features, and products that manage these hypervisors for me to call them commoditized. So I’ll leave you with a few questions. Do you think the hypervisor is a commodity?  When/if the hypervisor fully becomes a commodity, what do you foresee our virtual environments looking like? Single or multi-hypervisor? Looking forward to your comments.


To the cloud!

Posted by mwpreston Sep 28, 2014

Private, Public, Hybrid, Infrastructure as a Service, Database as a Service, Software Defined Datacenter; call it what you will, but for the sake of this post I’m going to sum it all up as cloud.  When virtualization started to become mainstream, we saw a lot of enterprises adopt a “virtualization first” strategy, meaning new services and applications introduced to the business would first be considered for virtualization unless a solid case for acquiring physical hardware could be made.  As the IT world shifts, we are seeing this move more towards a “cloud first” strategy.  Companies are asking themselves questions such as “Are there security policies stating we must run this inside of our datacenter?”, “Will cloud provide a more highly available platform for this service?”, and “Is it cost effective for us to place this service elsewhere?”.

 

Honestly, for a lot of services the cloud makes sense!  But is your database environment one of them?  From my experience, database environments stay relatively static.  The database has sat on its own physical hardware and watched us implement our “virtualization first” strategies.  We’ve long since virtualized the web front ends, the application servers, and all the other pieces of our infrastructure, but have yet to make the jump with the database.  Sometimes it’s simply due to performance, but with the advances in hypervisors as of late we can’t necessarily blame it on metrics anymore.  And now we are seeing cloud solutions such as DBaaS and IaaS present themselves to us.  Most of the time, the database is the heart of the company: the main revenue driver for our business and customers, and it gets so locked up in change freezes that we have a hard time touching it.  But today, let’s pretend that the opportunity to move “to the cloud” is real.

 

When we look at running our databases in the cloud we really have two main options: DBaaS (database functionality delivered directly to us) and IaaS (the same database functionality, but with us controlling a portion of the infrastructure underneath it).  No matter which we choose, to me the whole “database in the cloud” scenario is one big trade-off.  We trade away our control and ownership of the complete stack in our datacenters to gain the agility and mobility that cloud can provide.

 

Think about it!  Currently, we have the ability to monitor the complete stack that our database lives on.  We see all traffic coming into the environment and all traffic going out, and we can monitor every single switch, router, and network device that is inside our four datacenter walls.  We can make BIOS changes to the servers our database resides on.  We have complete and utter control over how our database performs (with the exception of closed vendor code).  In a cloudy world, we hand over that control to our cloud provider.  Sure, we can usually still monitor performance metrics based on the database operations, but we don’t necessarily know what else is going on in the environment.  We don’t know who our “neighbors” are or if what they are doing is affecting us in any way.  We don’t know what changes or tweaks might be going on below the stack that hosts our database.  On the flip side though, do we care?  We’ve paid good money for these services and SLAs and put our trust in the cloud provider to take care of this for us.  In return, we get agility.  We get functionality such as faster deployment times.  We aren’t waiting anymore for servers to arrive or storage to be provisioned.  In the case of DBaaS we get embedded best practices.  A lot of DBaaS providers do one thing and one thing alone: make databases efficient, fast, resilient, and highly available.  Sometimes the burden of DR and recovery is taken care of for us.  We don’t need to buy two of everything.  Perhaps the biggest advantage, though, is the fact that we only pay for what we use.  As heavy resource peaks emerge we can burst and scale up, automatically.  When those periods of time are over we can retract and scale back down.

 

So thoughtful remarks for the week – What side of the “agility vs control” tradeoff do you or your business take?  Have you already made a move to hosting a database in the cloud?  What do you see as the biggest benefit/drawback to using something like DBaaS?   How has cloud changed the way you monitor and run your database infrastructure?

 

There are definitely no right or wrong answers this week – I’m really just looking for stories.  And these stories may vary depending on your cloud provider of choice.  To some providers, this trade-off may not even exist.  To those doing private cloud, you may have the best of both worlds.

 

As this is my fourth and final post with the ambassador title I just wanted to thank everyone for the comments over the past month...  This is a great community with tons of engagement and you can bet that I won’t be going anywhere, ambassador or not!

For the most part, database performance monitoring tools do a great job at real-time monitoring – by that I mean alerting us when certain counter thresholds are reached, such as Page Life Expectancy dropping below 300 or Memory Pages per Second climbing too high.  Although this is definitely crucial to have set up within our environment, having hard alerts does pose a problem of its own.  How do we know that reaching a Page Life Expectancy of 300 is a problem?  Maybe this is normal for a certain period of time, such as month-end processing.

 

This is where the baseline comes into play.  A baseline, by definition, is a minimum or starting point used for comparison.  In the database performance analysis world, it’s a snapshot of how our databases and servers are performing when not experiencing any issues at a given point in time.  We can then take these performance snapshots and use them as a starting point when troubleshooting performance issues.  For instance, take into consideration a few of the following questions…

 

  1. Is my database running slower now than it was last week?
  2. Has my database been impacted by the latest disk failure and RAID rebuild?
  3. Has the new SAN migration impacted my database services in any way?
  4. Has the latest configuration change/application update impacted my servers in any way?
  5. How has the addition of 20 VMs to my environment impacted my database?

 

With established baselines, we are able to quickly answer all of these questions by comparison.  But let’s take this a step further and use question 5 in the following scenario.

 

Jim is comparing how his database server is performing now, after adding 20 new VMs to his environment, against a baseline he took a few months back.  He concludes, with the data to back him up, that his server is indeed running slower: he is seeing increased read/write latency and increased CPU usage.  So is the blame really to be placed on the newly added VMs?  Well, that all depends – what if something else is currently going on that is causing the latency to increase?  Say month-end processing and backups are happening now and weren't during the snapshot of the older baseline.

 

We can quickly see that baselines, while they are important, are really only as good as the time that you take them.  Comparing a period of increased activity to a baseline taken during a period of normal activity is really not very useful at all.
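
As a small illustration of comparing like with like, the sketch below pulls the current Page Life Expectancy from SQL Server and compares it against a baseline stored for the same kind of period (month-end vs. month-end).  The connection string, the dbo.PerfBaselines table, and the 50% alert threshold are all assumptions for the sake of the example:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql01;DATABASE=master;"
        "Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    # Current Page Life Expectancy from the SQL Server DMVs
    cur.execute("""
        SELECT cntr_value
        FROM sys.dm_os_performance_counters
        WHERE counter_name = 'Page life expectancy'
          AND object_name LIKE '%Buffer Manager%'
    """)
    current_ple = cur.fetchone()[0]

    # Hypothetical table of baselines keyed by workload period ('normal', 'month-end', ...)
    cur.execute("SELECT avg_ple FROM dbo.PerfBaselines WHERE period = ?", "month-end")
    baseline_ple = cur.fetchone()[0]

    if current_ple < baseline_ple * 0.5:
        print(f"PLE {current_ple} is well below the month-end baseline of {baseline_ple}")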

 

So this week I ask you to simply tell me about how you tackle baselines.

  1. Do you take baselines at all?  How many?  How often?
  2. What counters/metrics do you collect?
  3. Do you baseline your applications during peak usage?  Low usage?  Month end?
  4. Do you rely solely on your monitoring solution for baselining?  Does it show you trending over time?
  5. Can your monitoring solution tell you, based on previous data, what is normal for this period of time in your environment?

 

You don’t have to stick to these questions – let's just have a conversation about baselining!

As administrators or infrastructure people, we all too often get stuck “keeping the lights on”.  What I mean by this is that we have our tools and scripts in place to monitor all of our services and databases, we have notifications set up to alert us when they are down or experiencing trouble, and we have our troubleshooting methodologies and exercises that we go through in order to get everything back up and running.


The problem is that this is where our job usually ends.  We simply fix the issue and move on to the next one in order to keep the business up.  Not often do we get the chance to research a better way of monitoring or a better way of doing things.  And when we do get that time, how do we get these projects financially backed by a budget?

 

Throughout my career there have been plenty of times where I have mentioned the need for better or faster storage, more memory, more compute, and different pieces of software to better support me in my role.  However, the fact of the matter is that without proof of how these upgrades or greenfield deployments will impact the business, or better yet, how the business will be impacted without them, there's a pretty good chance that the answer will always be no.

 

So I’m constantly looking for that silver bullet, if you will – something that I can take to my CTO/CFO in order to validate my budget requests.  The problem is that most performance monitoring applications spit out reports dealing with very technical metrics.  My CTO/CFO do not care about the average response time of a query.  They don’t care about table locking and blocking numbers.  What they want to see is how what I’m asking for can either save them money or make them money.
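
For what it’s worth, the closest thing I’ve found to that translation is doing the dollar math myself before the meeting.  The sketch below is deliberately simplistic and every input is made up; the point is converting availability and firefighting numbers into a figure that can sit next to the price of the upgrade:

    # Turn outage and firefighting numbers into a rough bottom-line figure.
    revenue_per_hour = 12_000          # assumed cost of an order-processing outage per hour
    outage_hours_last_quarter = 6.5    # from the monitoring tool's availability report
    staff_hours_firefighting = 40      # time spent on repeat incidents
    loaded_staff_rate = 85             # assumed fully loaded hourly cost

    lost_revenue = revenue_per_hour * outage_hours_last_quarter
    labour_cost = staff_hours_firefighting * loaded_staff_rate

    print(f"Last quarter these incidents cost roughly ${lost_revenue + labour_cost:,.0f}")
    print("versus a proposed upgrade priced at $45,000")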

 

So this is where I struggle and I’m asking you, the thwack community for your help on this one – leave a comment with your best tip or strategy on using performance data and metrics to get budgets and projects approved.  Below are a few questions to help you get started.

 

  • How do you present your case to a CTO/CFO?  Do you have some go-to metrics that you find they understand more than others?
  • Do you correlate performance data with other groups of financial data to show the bottom-line impact of a performance issue or outage?
  • Do you map your performance data directly to SLAs that might be in place?  Does this help in selling your pitch?
  • Do you have any specific metrics or performance reports you use to show your business stakeholders the impact on customer satisfaction or brand reputation?

 

Thanks for reading – I look forward to hearing from you.

We all have them lurking around in our data centers.  Some virtual, some physical.  Some small, some large.  At times we find them consolidated onto one server, other times we see many of them scattered across multiple servers.  There’s no getting away from it – I’m talking about the database.  Whether it’s relational or somewhat flattened, the database is perhaps one of the oldest and most misunderstood technologies that we find inside a business’s IT infrastructure today.  When applications slow down, it’s usually the database at which the fingers get pointed – and we as IT professionals need to know how to pinpoint and solve issues inside these mystical structures of data, as missing just a few transactions could potentially result in a lot of lost revenue for our company.

 

I work for an SMB, where we don’t have teams of specialists or DBAs to look after these things.  This normally means my time and effort are focused on being proactive by automating things such as index rebuilds and database defragmentation (a rough sketch of the kind of thing I mean follows the questions below).  That said, we still experience issues, and when we do – seeing as I have a million other things to take care of – I don’t have the luxury of taking my time when troubleshooting database issues.  So my questions for everyone in my first week of partaking as a thwack ambassador are:

 

  1. Which database application is most commonly used within your environment (SQL Server, Oracle, MySQL, DB2, etc.)?
  2. Do you have a team and/or a person dedicated solely as a DBA, monitoring performance and analyzing databases?  Or is this left to the infrastructure teams to take care of?
  3. What applications/tools/scripts do you use to monitor your database performance and overall health?
  4. What types of automation and orchestration do you put in place to be proactive in tuning your databases (things such as re-indexing, re-organizing, defragmentation, etc.)?  And how do you know when the right time is to kick these off?
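
To put question 4 into more concrete terms, below is the kind of minimal sketch I have in mind for the re-indexing side of things: check fragmentation and emit a REBUILD or REORGANIZE statement based on the commonly cited 5%/30% thresholds.  The connection details are assumptions, and it only prints the statements for review rather than running them:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql01;DATABASE=MyAppDB;"
        "Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    # Find fragmented indexes using the physical stats DMV
    cur.execute("""
        SELECT OBJECT_NAME(ips.object_id) AS table_name,
               i.name                     AS index_name,
               ips.avg_fragmentation_in_percent
        FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
        JOIN sys.indexes i
          ON i.object_id = ips.object_id AND i.index_id = ips.index_id
        WHERE i.name IS NOT NULL AND ips.avg_fragmentation_in_percent > 5
    """)

    # Print (rather than execute) the maintenance statements for review
    for table_name, index_name, frag in cur.fetchall():
        action = "REBUILD" if frag > 30 else "REORGANIZE"
        print(f"ALTER INDEX [{index_name}] ON [{table_name}] {action};  -- {frag:.1f}% fragmented")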

 

Thanks for your replies and I can’t wait to see what answers come in.

 

Related resources:

 

Article: Hardware or code? SQL Server Performance Examined — Most database performance issues result not from hardware constraint, but rather from poorly written queries and inefficiently designed indexes. In this article, database experts share their thoughts on the true cause of most database performance issues.

 

Whitepaper: Stop Throwing Hardware at SQL Server Performance — In this paper, Microsoft MVP Jason Strate and colleagues from Pragmatic Works discuss some ways to identify and improve performance problems without adding new CPUs, memory or storage.

 

Infographic: 8 Tips for Faster SQL Server Performance — Learn 8 things you can do to speed SQL Server performance without provisioning new hardware.
