Geek Speak


No matter how much automation, redundancy, and protection you build into your systems, things are always going to break. It might be a change breaking an API to another system. It might be a change in a metric. Perhaps you just experienced a massive hardware failure. Many IT organizations have traditionally had a postmortem, or root cause analysis, process to try to improve the overall quality of their processes. The major problem with most postmortem processes is that they devolve into circular finger-pointing matches: the database team blames the storage team, who in turn blames the network team, and everyone walks out of the meeting angry.

 

As I’m writing this article, I’m working on a system where someone restarted a database server in the middle of a large operation, causing database corruption. This is a classic example of an event that might trigger a postmortem. In this scenario, we moved to new hardware and no one tested the restore times of the largest databases. That is currently problematic, as the database restore is still running a few hours after I started this article. Other scenarios would be any situation where you have unexpected data loss, on-call pages, or a monitoring failure that didn’t capture a major system fault.

 

How can we do a better postmortem? The first thing to do is execute blameless postmortems. This process assumes that everyone involved in an incident had good intentions and acted reasonably based on the information available at the time. This technique originates in medicine and aviation, where human lives are at stake. Instead of assigning blame to any one person or team, the situation is analyzed with an eye toward figuring out what happened. Writing a blameless postmortem can be hard, but the outcome is more openness in your organization. You don’t want engineers trying to hide outages to avoid an ugly, blame-filled process.

 

Some common talking points for your postmortems include:

 

  • Was enough data collected to determine the root cause of the incident?
  • Would more monitoring data help with the analysis?
  • Is the impact of the incident clearly defined?
  • Was the outcome shared with stakeholders?

 

In the past, many organizations did not share a postmortem outside of the core engineering team. This is a process that has changed in recent years. Many organizations like Microsoft and Amazon, because of the nature of their hosting businesses, have made postmortems public. By sharing with the widest possible audience, especially in your IT organization, you can garner more comments and deeper insights into a given problem.

 

One scenario referenced in Site Reliability Engineering by Google is the notion of integrating postmortems into disaster recovery activities. By incorporating these real-world failures, you make your disaster recovery testing as real as possible.

 

If your organization isn’t currently conducting postmortems, or only conducts them for major outages, think about introducing them more frequently for smaller problems. As mentioned above, paged incidents are a good place to start. This gets you thinking about how to automate responses to common problems, and it helps ensure the process can be followed correctly, so that when a major issue occurs you’re not focused on how to conduct the postmortem, but on finding the real root cause of the problem.

“The price of reliability is the pursuit of the utmost simplicity.” C.A.R. Hoare, Turing Award lecture.

 

Software and computers in general are inherently dynamic and never in a state of stasis. The only way IT, servers, software, or anything else made of 1s and 0s can be perfectly stable is if it exists in a vacuum. If we think about older systems that were offline, we frequently had higher levels of stability. The trade-off was fewer updates, fewer new features, and longer development and feedback cycles, which meant you could wait years for simple fixes to a relatively basic problem. One of the goals of IT management should always be to keep these two forces, agility and stability, in balance.

 

Agile’s Effect on Enterprise Software

 

Widespread adoption of Agile frameworks across development organizations has meant that even enterprise-focused organizations like Microsoft have shortened release cycles on major products to (in some cases) less than a year, and if you are using cloud services, as short as a month. If you work in an organization that does a lot of custom development, you may be used to daily or even hourly builds of application software. This creates a couple of challenges for traditional IT organizations: supporting new releases of enterprise software like Windows or SQL Server, and supporting developers who are employing continuous integration/continuous deployment (CI/CD) methodologies.

 

How This Changes Operations

 

First, let’s talk about supporting new releases of enterprise software like operating systems and relational database management systems (RDBMSs). I was recently speaking at a conference where I was asked, “How are large enterprises with general patch management teams supposed to keep up with a monthly patch cycle for all products?” This was a hard answer to deliver, but since the rest of the world has changed, your processes need to change along with it. Just like you shifted from physical machines to virtual machines, you need to be able to adjust your operations processes to deal with more frequent patching cycles. It’s not just about the new functionality you are missing out on. The array and depth of security threats means software is patched more frequently than ever, and if you aren’t patching your systems, they are vulnerable to threats from both internal and external vectors.

 

How Operations Can Help Dev

 

While as an admin I still get nervous about pushing out patches on day one, the important thing is to develop a process that applies updates in near real-time to dev/test environments, with automated error checking, and then moves the same patches relatively quickly into QA and production. If you lack development environments, patch your lower-priority systems first, before moving on to higher-priority systems.
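
As an illustration, here is a minimal PowerShell sketch of the kind of automated check that can support this process: comparing the patch level of a dev/test server against production so you know what still needs to be rolled forward. The server names are placeholders.

    # Sketch: compare installed patches between a dev/test server and production.
    # Server names are placeholders; Get-HotFix needs remote access to the targets.
    $devPatches  = Get-HotFix -ComputerName 'SQLDEV01'  | Select-Object -ExpandProperty HotFixID
    $prodPatches = Get-HotFix -ComputerName 'SQLPROD01' | Select-Object -ExpandProperty HotFixID

    # Patches already proven in dev/test but not yet applied to production
    Compare-Object -ReferenceObject $devPatches -DifferenceObject $prodPatches |
        Where-Object SideIndicator -eq '<=' |
        Select-Object -ExpandProperty InputObject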

 

Supporting internal applications is a little bit of a different story. As your development teams move to more frequent build processes, you need to maintain infrastructure support for them. One angle for this can be to move to a container-based deployment model--the advantage there is that the developers become responsible for shipping the libraries and other OS requirements their new features require, as they are shipped with the application code. Whatever approach you take, you want to focus on automating your responses to errors that are generated by frequent deployments, and work with your development teams to do smaller releases that allow for easier isolation of errors.

 

Summary

 

The IT world (and the broader world in general) has shifted to a cycle of faster software releases and faster adoption of features. This all means IT operations has to move faster to support both vendor and internally developed applications, which can be a big shift for many legacy IT organizations. Automation, smart administration, and more frequent testing will be how you make this happen in your organization.

One of the most interesting changes I have observed in my career is Microsoft shifting from being just a development organization to truly becoming a DevOps organization, at least in the case of the SQL Server team. The product code developers are the operations support for the cloud service hosting Azure SQL Database. Many other large development organizations have this model, but my relationships and experience with SQL Server have allowed me to observe this product much more closely. My main observation is that dev cycles turn so much faster because problems that in traditional on-premises software might have taken months between patches, or even years between full releases, now get fixed in the course of a few weeks. This has really changed the way Microsoft releases SQL Server on-premises: after a major release, there is a cumulative update released every month for the first year. This means more problems get fixed faster, and if you are using the cloud service, the release cadence is even faster.

 

I know most of us do not work for hyper-scale public cloud providers with massive development resources, but I used this example because it is a very visible, real-world illustration of the way a DevOps model can transform an organization. So how does this apply to running our infrastructure organizations? I think even in the most off-the-shelf traditional shop, you should be thinking about how to automate the following kinds of work:

 

  • Manual work
  • Repetitive work
  • Work with no enduring value
  • Work that doesn’t scale with your application

 

As a system admin, your main job task is to keep the lights (and more importantly, the servers) on in your organization. By adopting a mindset of trying to automate as many of these tasks as possible, you will enjoy your job more, and have more time to focus on tasks that are more strategic to your organization. You can take advantage of frameworks like PowerShell and Python scripting in your environments (yes, you will have to write some code) to bring this all together, but this mindset will change the way you view system administration.

 

Where do you start with automation? Identify the common tasks that consume the most time. For example, if you are using VM templates to deploy your operating system environments (that’s a really fancy way of saying VMs), you have already started down this path. A logical next step might be to automate the process of keeping your templates up to date with patches, or to use scripting to ensure that any post-deployment configuration tasks happen without human intervention. To do this really well, you should adopt a developer mindset: all of your scripts should be in source control, you should develop a unit testing process (the one downside to automation is that if you break something badly, you can break a lot of things badly, and quickly), and you should borrow some aspects of a development methodology to keep progress moving forward.
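
To make the idea concrete, here is a hedged PowerShell sketch of a post-deployment configuration script. The specific settings, paths, and server name are illustrative assumptions, not a prescription; the point is that the same standards get applied without a human clicking through them.

    # Sketch: post-deployment configuration applied to every new VM.
    # All values below are examples only -- substitute your own standards.
    param(
        [string]$NewName = 'SQLVM042'   # hypothetical name following your convention
    )

    Set-TimeZone -Id 'Eastern Standard Time'                                  # consistent time zone
    powercfg /setactive SCHEME_MIN                                            # High Performance power plan
    New-Item -Path 'D:\Data', 'L:\Log' -ItemType Directory -Force | Out-Null  # standard volume layout
    Rename-Computer -NewName $NewName -Restart                                # apply naming standard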

 

Moving into this mindset is a big shift for many organizations. However, in modern IT, influences like cloud and the aforementioned rapid development cycles mean that everything just moves faster. Another benefit of this automation, which should be a major influence on which tasks you automate, is the reduction in alert pages. If you can automate away your most common pager responses, your entire team and organization will benefit. The final, largest benefit of automation is that you know a script will run the same way every time it's executed, as opposed to your junior admin, who may or may not get a complex task sequence correct each time.

Building a monitoring and alerting system should always be driven by your business needs. There is always a tension here: the IT organization tends to focus on granular measures, whereas business users would like to see more of an end-to-end picture of the organization. An example of this is uptime. As a DBA, if my database is available and servicing requests, I feel as though I’ve met my uptime goals, whatever they may be. However, if a load balancer goes down and takes away access to the application tier, the application is unavailable to users, and that is all that matters. Building a monitoring solution that looks at systems holistically is challenging, and sometimes requires working backwards from desired monitoring objectives (is the system up?) to choosing indicators (is the database service available and writeable?), and then building a target.

 

Defining Service Level Objectives

 

You want to focus on what your users care about, not necessarily what is easy to measure. There are two main areas you will want to use to define these objectives: performance and uptime. One notion that comes from Google’s Site Reliability Engineering is the error budget: a rate at which these service level objectives can be missed. Having an error budget can allow you to be more aggressive with upgrades and with resolving technical debt. While evaluating projects and change control efforts, you know that if you are well ahead of your SLO budget, you can be more aggressive with a rollout; if you are behind the curve, you may curtail some migration efforts.
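
As a quick worked example (a sketch, assuming a 99.9% availability target and a 30-day month), the error budget is simply the allowed failure rate multiplied by the period:

    # Sketch: translate an availability SLO into a monthly error budget.
    $slo          = 0.999                       # example target: 99.9% availability
    $minutesMonth = 30 * 24 * 60                # 43,200 minutes in a 30-day month
    $errorBudget  = (1 - $slo) * $minutesMonth  # 43.2 minutes of downtime allowed
    "{0:N1} minutes of downtime allowed this month" -f $errorBudget

If you have only burned a few of those minutes by mid-month, a riskier rollout may be acceptable; if you have burned most of them, it probably is not.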

 

Target Values for SLOs

 

Target values will be a negotiation between IT and the business. From an IT perspective, it is important not to overpromise. For example, if you only have one physical server in your stack, you probably aren’t going to reach 99.99% uptime. This is important for a few reasons, but in my opinion the biggest is helping business users understand the correlation between resource cost and availability. In the one-server example above, if the business wants that application to deliver 99.99% uptime, it is going to have to invest in redundancy at several levels. There are a few other tenets to think about:

 

  • Past performance isn’t a predictor of future performance--While building a performance target off of your historic baseline is a good start, it does not address the problem of a system that performs well at its current level, but that will fall off a cliff without a major reengineering effort.
  • Don’t overthink your targets--While it may be tempting to bring in someone from the data science team to create your new targets using a complex machine learning K-means clustering algorithm, you are better off creating simple targets like percentage uptime and throughput. If you can’t explain your target in a sentence, it is likely too complex.
  • Absolutes are bad--The notion of a system that is always available and can scale infinitely is completely unrealistic. Even hyperscale cloud providers have difficulty delivering 99.999% uptime. It’s better to promise what you can deliver and make the business understand what the cost of delivering more is.

 

This process allows you to set clear expectations with your business and reduces some of the finger pointing during outages. It does require a strong relationship between IT management and senior leadership of your organization, but in the end delivers IT that can be kept up to date while meeting the business needs of the organization.


 

Monitoring has always been a loosely defined and somewhat controversial term in IT organizations. IT professionals have very strong opinions about the tools they use, because monitoring and alerting is one of the key components of keeping systems online, performing well, and delivering IT’s value proposition to the business. However, as the world has shifted more toward a DevOps mindset, especially in larger organizations, the mindset around monitoring systems has also shifted. You are not going to stop monitoring systems, but you may want to rationalize which metrics you are tracking in order to give them better focus.

 

What To Monitor?

 

While operating systems, hardware, and databases all expose a litany of metrics that can be tracked, trying to watch them all makes it challenging to pay attention to the critical things that may be happening in your systems. Such deep-dive analysis is best reserved for troubleshooting one-off problems, not day-to-day monitoring. One approach to consider is classifying systems into broad categories and applying specific indicators that allow you to evaluate the health of each system.

 

User-facing systems like websites, e-commerce systems, and of course everyone’s most important system, email, have availability as their most important metric. Latency and throughput are secondary metrics, though for customer-facing systems they can be nearly as important.
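
A minimal probe for that kind of system might look like the following PowerShell sketch, which records availability and latency for a single request. The URL is a placeholder for one of your own applications.

    # Sketch: probe a user-facing endpoint and record availability and latency.
    $url   = 'https://intranet.example.com/health'   # placeholder URL
    $start = Get-Date
    try {
        $response  = Invoke-WebRequest -Uri $url -UseBasicParsing -TimeoutSec 10
        $available = ($response.StatusCode -eq 200)
    }
    catch {
        $available = $false
    }
    [pscustomobject]@{
        Timestamp = $start
        Url       = $url
        Available = $available
        LatencyMs = [math]::Round(((Get-Date) - $start).TotalMilliseconds, 0)
    }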

 

Storage and Network Infrastructure should emphasize latency, availability, and durability. How long does a read or write take to complete, or how much throughput is a given network connection seeing?

 

Database systems, much like front-end systems, should be focused on end-to-end latency, but also throughput--how much data is being processed, how many transactions are happening per time period.

 

It is also important to think about which aspects of each metric you want to alert on (page an on-call operator for). I like to approach this with two key rules: any page should be about something actionable (a service down, a hardware failure), and always remember there is a human cost to paging, so if you can have automation respond to a page and fix something with a shell script, so much the better.
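
Here is a hedged sketch of that second rule in practice: if a service is found down, try to start it once before involving a human. The service name is just an example, and Send-AlertPage is a hypothetical placeholder for whatever paging hook your monitoring tool actually provides.

    # Sketch: automated first response to a "service down" condition.
    # 'SQLSERVERAGENT' is an example; Send-AlertPage is a hypothetical placeholder
    # for your monitoring tool's paging mechanism.
    $serviceName = 'SQLSERVERAGENT'

    if ((Get-Service -Name $serviceName).Status -ne 'Running') {
        Start-Service -Name $serviceName -ErrorAction SilentlyContinue
        Start-Sleep -Seconds 30
        if ((Get-Service -Name $serviceName).Status -ne 'Running') {
            # Automation failed -- now it is worth waking someone up
            Send-AlertPage -Message "$serviceName failed to start on $env:COMPUTERNAME"
        }
    }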

 

It is important to think about the granularity of your monitoring. For example, the availability of a database system might only need to be checked every 15 seconds or so, but the latency of the same system should be checked every second or more often to capture all query activity. You will want to think about this at each layer of your monitoring. This is a classic tradeoff: a higher volume of data collection in exchange for more detailed information.

 

Aggregation

 

In addition to doing real-time monitoring, it is important to understand how your metrics look over time. This can give you key insights into things like SAN capacity and the health of your systems over time. It also lets you identify anomalies and hot spots (e.g., end-of-month processing) and plan for peak loads. This leads to another point: you should treat collected metrics as distributions of data rather than averages. For example, if most of your storage requests are answered in less than 2 milliseconds, but you have several that take over 30 seconds, those anomalies will be masked in an average. By using histograms and percentiles in your aggregation, you can quickly identify out-of-bounds values in an otherwise well-performing system.
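
The following sketch makes the point with made-up numbers: a thousand fast storage requests plus a handful of 30-second stalls produce an average that looks merely mediocre, while a high percentile makes the stalls obvious.

    # Sketch: average vs. percentile on fabricated storage latency samples (ms).
    $latencies = @(1) * 1000 + @(30000) * 5     # 1,000 fast requests, 5 pathological ones

    $average = ($latencies | Measure-Object -Average).Average
    $sorted  = $latencies | Sort-Object
    $p999    = $sorted[[int][math]::Ceiling(0.999 * $sorted.Count) - 1]

    "Average latency:   {0:N0} ms" -f $average   # ~150 ms -- the 30-second stalls are hidden
    "99.9th percentile: {0:N0} ms" -f $p999      # 30,000 ms -- the stalls are visible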

 

 

Standardize

 

Defining a few categories of systems and standardizing your data collection allows for common definitions and can drive toward common service level indicators. This lets you build a common template for each indicator and a common goal of higher levels of service.
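
As a sketch of what such a template could look like (the categories, indicators, targets, and check intervals below are purely illustrative assumptions):

    # Sketch: a common service level indicator template shared across system categories.
    $sliTemplates = @(
        [pscustomobject]@{ Category = 'UserFacing'; Indicator = 'Availability';       Target = '99.9%';    CheckIntervalSec = 15 }
        [pscustomobject]@{ Category = 'UserFacing'; Indicator = 'RequestLatencyP95';  Target = '< 500 ms'; CheckIntervalSec = 1 }
        [pscustomobject]@{ Category = 'Storage';    Indicator = 'WriteLatencyP99';    Target = '< 5 ms';   CheckIntervalSec = 1 }
        [pscustomobject]@{ Category = 'Database';   Indicator = 'TransactionsPerSec'; Target = '> 2000';   CheckIntervalSec = 5 }
    )
    $sliTemplates | Format-Table -AutoSize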

Unlike most application support professionals, or even system administrators, as database professionals, you have the ability to look under the hood of nearly every application that you support. In my fifteen-plus years of being a DBA, I have seen it all. I’ve seen bad practices and best practices, and I’ve worked with vendors who didn’t care that what they were doing was wrong, and others with whom I worked closely to improve the performance of their systems.

 

One of my favorite stories was an environmental monitoring application—I was working for a pharmaceutical company, and this was the first new system I helped implement there. The system had been up for a week and performance had slowed to a crawl. After running some traces, I confirmed that there was a query without a WHERE clause that was scanning 50,000 rows several times a minute. Mind you, this was many years ago, when my server had 1 GB of RAM—so this was a very expensive operation. The vendor was open to working together, and I helped them design a WHERE clause, an indexing strategy, and a parameter change to better support the use of the index. We were able to quickly implement a patch and get the system performing well again.

 

Microsoft takes a lot of grief from DBAs for products like SharePoint and Dynamics, and some of the interesting design decisions made within them. I don’t disagree—there are some really bad database designs. However, I’d like to give credit to whoever designed System Center Configuration Manager (SCCM)—this database has a very logical data model (it uses integer keys—what a concept!), and I was able to build a reporting system against it.

 

So what horror stories do you have about vendor databases? Or positives?

Career management is one of my favorite topics to write and talk about, because I can directly help people. Something I notice as a consultant going into many organizations is that many IT professionals aren’t thinking proactively about their careers, especially those who work in support roles (supporting an underlying business, rather than directly contributing to revenue like a consulting firm or software development organization). One key thing to think about is how your job role fits into your organization—this is a cold, hard, ugly fact that took me a while to figure out.

 

Let’s use myself as an example—I was a DBA at a $5B/yr medical device company that didn’t have tremendous dependencies on data or databases. The company needed someone in my slot—but frankly it did not matter how good they were at their job beyond a point. Any competent admin would have sufficed. I knew there was a pretty low ceiling on how far my salary and personal success could go at that company. So I moved to a very large cable company—they weren’t a technology company per se, but they were a large enough organization that high-level technologist roles were available—I got onto a cross-platform architectural team that was treated really well.

 

I see a lot of tweets from folks who often seem frustrated in their regular jobs—the unemployment rate in database roles is exceedingly low, especially for folks like you who are actively reading and staying on top of technology—so don’t be scared to explore the job market; you might be pleasantly surprised.

If you are an Oracle DBA reading this, I am assuming all of your instances run on *nix and you are a shell scripting ninja. For my good friends in the SQL Server community, if you haven’t gotten up to speed on PowerShell, now is the time. Last week, Microsoft introduced the latest version of Windows Server 2016, and it does not come with a GUI. Not like, click one thing and you get a GUI—more like run through a complex set of steps on each server and you eventually get a graphical interface. Additionally, Microsoft has introduced an extremely minimal server OS called Windows Nano, which will be ideal for high-performing workloads that want to minimize OS resources.

 

One other thing to consider is automation and cloud computing—if you live in a Microsoft shop, this is all done through PowerShell, or maybe DOS (yes, some of us still use DOS for certain tasks). So my question for you is: how are you learning scripting? In a smaller shop the opportunities can be limited—I highly recommend the Scripting Guy’s blog. Also, doing small local operating system tasks via the command line is a great way to get started.
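
If you want a starting point, here is a sketch of the kind of small local tasks I mean; nothing below requires any modules beyond what ships with Windows.

    # Sketch: small local tasks that make good first PowerShell exercises.

    # 1. Free space on every local file system drive, in GB
    Get-PSDrive -PSProvider FileSystem |
        Select-Object Name, @{ Name = 'FreeGB'; Expression = { [math]::Round($_.Free / 1GB, 1) } }

    # 2. The five processes using the most memory right now
    Get-Process | Sort-Object WorkingSet64 -Descending | Select-Object -First 5 Name, Id, WorkingSet64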

Last week at their Build developer conference, and this week at Ignite, Microsoft introduced a broad range of new technologies. In recent years, Microsoft has become a more agile and dynamic company. For you and your organization to take advantage of this rapid innovation, your organization needs to keep up with the change and quickly adapt to new versions of technology, like Windows 10 or SQL Server 2016. Or maybe you work with open source software like Hadoop and are missing out on some of the key new projects like Spark or the newer non-MapReduce solutions. Or perhaps you are using a version of Oracle that doesn’t support online backups. It’s not your fault; it’s what management has decided is best.

 

As an IT professional it is important to keep your skills up to date. In my career as a consultant, I have the good fortune to be working with software vendors, frequently on pre-release versions, so it is easy for me to stay up to date on new features. However, in past lives, especially when I worked in the heavily regulated health care industry, it was a real challenge to stay on top of new features and versions. I recently spoke with a colleague there and they are still running eight-year-old operating systems and RDBMSs.

 

So how do you manage these challenges in your environment? Do you do rogue side projects (don’t worry, we won’t share your name)? Or do you just keep your expert knowledge of old software? Do you pay for training on your own? Attend a SQL Saturday or Code Camp? What do your teammates do? Do you have tips to share for everyone on staying current when management thinks “we are fine with old technology”?

Back when I was an on-call DBA, I got paged one morning because a database server had high CPU utilization. After I punched the guy who set up that alert, I brought it up in a team meeting: is this something we should even be reporting on, much less alerting on? Queries and other processes in our systems use CPU cycles, but frequently as a production DBA you are at the mercy of some third-party application’s “interesting” coding decisions causing more CPU cycles than is optimal.

 

Some things that can really hammer CPUs include:

  • Data type conversions
  • Overuse of functions—or using them in a row by row fashion
  • Fragmented indexes or file systems
  • Out of date database statistics
  • Poor use of parallelism

 

Most commercial databases are licensed by the core—so we are talking about money here. Also, with virtualization, we have more options for easily changing CPU configurations, but remember that overallocating CPUs on a virtual machine leads to less-than-optimal performance. At the same time, CPUs are a server’s ultimate limiter on throughput—if your CPUs are pegged, you are not going to get any more work done.

 

The other angle to this is that since you are paying for your databases by the CPU, you want to utilize them. So there is a happy medium of adjusting and tuning.

Do you capture CPU usage over time? What have you done to tune queries for CPU use?
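
If you are not capturing CPU over time yet, a hedged starting point is a simple performance counter collection like the sketch below; the sample interval, sample count, and output path are arbitrary choices.

    # Sketch: log total CPU utilization every 15 seconds for an hour to a CSV baseline.
    Get-Counter -Counter '\Processor(_Total)\% Processor Time' -SampleInterval 15 -MaxSamples 240 |
        ForEach-Object {
            [pscustomobject]@{
                Timestamp  = $_.Timestamp
                CpuPercent = [math]::Round($_.CounterSamples[0].CookedValue, 1)
            }
        } | Export-Csv -Path 'C:\PerfLogs\cpu_baseline.csv' -NoTypeInformation -Append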

Throughout my career I have worked on a number of projects where standards were minimal. You have seen this type of shop—the servers may be named after superheroes or rock bands. When the company has ten servers and two employees, it’s not that big of a deal. However, when you scale up and suddenly become even a small enterprise, these things become hugely important.

 

A good example is having the same file system layout for your database servers—without this, it becomes hugely challenging to automate the RDBMS installation process. Do you really want to spend your time clicking Next 10 times every time you need a new server?
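
For SQL Server, the usual alternative to clicking Next is an unattended install driven by a configuration file, roughly like the sketch below. The paths and passwords are placeholders, and the exact switches vary a bit between versions, so treat this as an outline rather than a copy-and-paste command.

    # Sketch: unattended SQL Server installation from a standard configuration file,
    # so every server ends up with the same layout. All values are placeholders.
    & 'D:\SQLMedia\setup.exe' /Q `
        /IACCEPTSQLSERVERLICENSETERMS `
        /ACTION=Install `
        /CONFIGURATIONFILE='C:\Build\SQLConfigurationFile.ini' `
        /SQLSVCPASSWORD='<service account password>' `
        /SAPWD='<sa password>'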

 

One of my current projects has some issues with this—it is a massive virtualization effort, but the inconsistency of the installations, the failure to follow industry best practices, and the lack of common standards across the enterprise have led to many challenges in the migration process. Some of these challenges include inconsistent file system names and even hard-coded server names in application code. I did a very similar project at one of my former employers, who had outstanding standards, and everything went as smoothly as possible.

 

What standards do you like to enforce? The big ones for me are file system layout (e.g., having data files and transaction/redo logs in the same locations every time, whether that is D:\ and L:\, or /data and /log) and server naming (having clearly defined names makes server location and identification easier). Some other standards I’ve worked with in the past include how to perform disaster recovery for different tiers of apps, or which tier of application is eligible for high availability solutions.

In the fifteen-plus years that I have worked in information technology organizations, there has been a lot of change. From Unix to Linux, the evolution of Windows-based operating systems, and the move to virtualized servers that led to the advent of cloud computing, it’s been a time of massive change. One of the major changes I’ve seen take place is the arrival of big data solutions.

 

There are a couple of reasons why this change happened (in my opinion). At very large scale, relational database licensing is REALLY expensive. Additionally, as much as IT managers like to proclaim that storage is cheap, enterprise-class storage is still very expensive. Most of the companies that initiated the movement to systems such as Hadoop and its ecosystem were startups where capital was low and programming talent was abundant, so having to do additional work to make a system perform better was a non-issue.

 

So, like any of the other sea changes you have seen in the industry, you will have to adjust your skills and career focus to align with these new technologies. If you are working on a Linux platform, your skills will likely transfer a little more easily than from a Windows platform, but systems are still systems. What are you doing to match your skill set to the changing world of data?

I’m in the middle of a tough project right now: a large client is trying to convert a large number of physical SQL Servers to virtual machines. They’ve done most of the right things—the underlying infrastructure is really strong, the storage is more than adequate, and they aren’t overprovisioning the virtual environment.

 

Where the challenge comes in is how to convert from physical to virtual. The classical approach is to build new database VMs, restore backups from the physical servers to the VMs, and ship log files until cutover time. However, in this case there are some application-level challenges preventing that approach (mainly heavily customized application tier software). Even so, my preferred method here is to virtualize the system drives and then restore the databases using database restore operations.

 

This ensures the consistency of the databases and rules out any corruption. Traditional P2V products have challenges handling the rate of change in databases—many people think read-only database workloads don’t generate many writes, but remember you are still writing to a cache, and frequently using temp space. What challenges have you seen in converting from physical to virtual?
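
For the database side of that cutover, a hedged sketch of the restore sequence looks like this: seed the new VM from the last full backup with NORECOVERY, keep applying log backups until the cutover window, then recover and run a consistency check. Server, database, and backup names are placeholders, and Invoke-Sqlcmd (from the SqlServer module) is assumed to be available.

    # Sketch: restore-based cutover to the new VM, then verify consistency.
    $vm = 'SQLVM01'   # placeholder target VM

    # 1. Seed from the last full backup, leaving the database ready for log restores
    Invoke-Sqlcmd -ServerInstance $vm -Query "RESTORE DATABASE SalesDB FROM DISK = N'\\backups\SalesDB_full.bak' WITH NORECOVERY;"

    # 2. Keep applying transaction log backups until the cutover window
    Invoke-Sqlcmd -ServerInstance $vm -Query "RESTORE LOG SalesDB FROM DISK = N'\\backups\SalesDB_log_001.trn' WITH NORECOVERY;"

    # 3. At cutover: recover the database and confirm it is free of corruption
    Invoke-Sqlcmd -ServerInstance $vm -Query "RESTORE DATABASE SalesDB WITH RECOVERY;"
    Invoke-Sqlcmd -ServerInstance $vm -Query "DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS;"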
