Software Updates in the IT Stack: (Not) a Game of Jenga

Level 9

Two weeks ago, I talked about upgrading storage systems: more specifically, how to determine the right upgrade interval for them. What that post did not cover is the fact that your storage system is part of a larger ecosystem. It attaches to a SAN and a LAN, and is accessed by clients. Technologies such as backup, snapshots, and replication protect the data on it. Monitoring systems ensure you are aware of what is going on with all the individual components. The list goes on and on…

Each component in that ecosystem will receive periodic updates. Some components are relatively static: a syslog server may receive a few patches, but it is unlikely that these patches fundamentally change how the syslog server works. Moreover, in case of a bad update, it is relatively easy to roll back to a previous version. As a result, it is easy to keep a syslog server up to date.

Other components change more frequently or have a larger feature change between patches. For example, hyper-converged infrastructure, which is still a growing market, receives many new features to make it more attractive to a wider audience. It is more of a gamble to upgrade these systems: new features might break old functions that your peripheral systems rely on.

Finally, do not forget the systems that sit like a spider in the web, such as hypervisors. They run on hardware that needs to be on a compatibility list. Backup software talks to them, using snapshots or features like Changed Block Tracking to create backups and restore data. Automation tools talk to them to create new VMs. Plus, the guest OS in a VM receives VM hardware and tools upgrades. These systems are, again, more difficult to upgrade, simply because so many aspects of the software are exposed to other components of the ecosystem.

So how can you keep this ecosystem healthy, without too many “Uh-oh, I broke something!” moments with the whole IT stack collapsing like a game of Jenga?

Reading, testing, and building blocks

Again, read! Release notes, compatibility lists, advisories, etc. Do not just look for changes in the product itself, but also for changes to APIs or peripheral components. A logical drawing of your infrastructure helps: visualize which systems talk to which other systems.
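A dependency map does not have to stay a drawing; even a small script can answer "what do I need to re-check if I upgrade X?" Here is a minimal sketch in Python, with purely illustrative component names rather than a real inventory:

```python
# A minimal dependency map: which systems talk to which others.
# Component names are illustrative, not a real inventory.
talks_to = {
    "hypervisor": ["storage", "san", "lan", "backup", "automation", "monitoring"],
    "storage":    ["san", "backup", "replication", "monitoring"],
    "backup":     ["hypervisor", "storage", "monitoring"],
    "syslog":     ["monitoring"],
}

def impacted_by(component):
    """Everything with a direct interface to `component`, and therefore
    everything that needs a compatibility check before that component is upgraded."""
    impacted = set(talks_to.get(component, []))
    impacted |= {name for name, peers in talks_to.items() if component in peers}
    return impacted

print(sorted(impacted_by("hypervisor")))
# ['automation', 'backup', 'lan', 'monitoring', 'san', 'storage']
```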

Next is testing. A vendor tests upgrade paths and compatibility, but no environment is like your own. Test diligently. If you cannot afford a test environment, then at the very least test your upgrades on a production system of lesser importance. After the upgrade, test again: does your backup still run? No errors? We had a “no upgrades on Friday afternoon” policy at one customer: it avoids having to pull a weekender to fix issues, or missing three backups because nobody noticed something was broken.
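That "test again after the upgrade" step is easy to make repeatable. Below is a minimal smoke-test sketch; the host names and the backup-check command are placeholders, so substitute whatever actually proves your stack still works:

```python
import subprocess

# Each check is a shell command that must exit 0. Host names and the backup
# check command are placeholders for your own post-upgrade checks.
CHECKS = {
    "storage array reachable": ["ping", "-c", "3", "array01.example.local"],
    "hypervisor reachable":    ["ping", "-c", "3", "esxi01.example.local"],
    "backup job succeeded":    ["/usr/local/bin/check_backup_job.sh", "nightly-vms"],
}

failures = []
for name, cmd in CHECKS.items():
    try:
        ok = subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        ok = False  # the check command itself is missing
    print(("OK     " if ok else "FAILED ") + name)
    if not ok:
        failures.append(name)

if failures:
    raise SystemExit("Post-upgrade checks failed: " + ", ".join(failures))
```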

As soon as you find the ideal combination of software and hardware versions, create a building block out of it. TOGAF can help: it is a lightweight and adaptable framework for IT architecture. You can tailor it to your own specific needs and capabilities. Moreover, you do not need to do “all of it” before you can reap the benefits: you can pick the components you like.

Let us assume you want to run an IaaS platform. It consists of many systems: storage, SAN, servers, hypervisor, LAN, etc. You have read the HCLs and done the required testing, so you are certain that a combination of products works for you. Whatever keeps your VMs running! This is your solution building block.

Some components in the solution building block may need careful specification: for example, Cisco UCS firmware 3.2(2d) with VMware ESXi 6.5U1 needs fnic driver Y. Others can be specified more loosely: syslogd, any version.
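Captured as data, such a building block could look like the sketch below. The version strings mirror the example above (the fnic driver version stays the placeholder "Y"), and the "any" marker flags the loosely specified components:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    name: str
    version: str       # an exact version, a range, or "any"
    notes: str = ""

@dataclass
class BuildingBlock:
    name: str
    components: List[Component] = field(default_factory=list)

iaas_block = BuildingBlock(
    name="IaaS platform",
    components=[
        Component("Cisco UCS firmware", "3.2(2d)"),
        Component("VMware ESXi", "6.5U1"),
        Component("fnic driver", "Y", notes="required by UCS 3.2(2d) + ESXi 6.5U1"),
        Component("syslogd", "any"),
    ],
)
```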

Next, track the life cycle of these building blocks, starting with the one you are currently running in production: the active standard. Think ESXi 6.5U1 with UCS blades on firmware 3.2(2d), on a VMAX with SRDF/Metro for replication and Veeam for backup and recovery. Again, specify versions or version ranges where required.

You might also be testing a new building block: the proposed or provisional standard. That could be built with newer versions of software (like vSphere 6.7) or with different components. It could even be completely different and use hyper-converged infrastructure such as VxRail.

Finally, there are the old building blocks, either phasing out or retired. The difference between these states is the amount of effort you put into upgrading or removing them from your landscape. A building block with ESXi 5.5 could be “phasing-out” in late 2017, which means you will not deploy new instances of it, but you also do not actively retire it. Now though, with the EOL of ESXi 5.5 around the corner, that building block should transition to “retired”: you need to remove it from your environment because it is an impending support risk.
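The lifecycle states can live in the same inventory as the building blocks themselves. A minimal sketch, with the block names shortened from the examples above:

```python
from enum import Enum

class Lifecycle(Enum):
    PROPOSED = "proposed"        # being tested, not yet a standard
    ACTIVE = "active"            # the current standard for new deployments
    PHASING_OUT = "phasing-out"  # no new deployments, existing ones tolerated
    RETIRED = "retired"          # must be removed: impending support risk

standards = {
    "ESXi 6.5U1 / UCS 3.2(2d) / VMAX SRDF/Metro / Veeam": Lifecycle.ACTIVE,
    "vSphere 6.7 / VxRail": Lifecycle.PROPOSED,
    "ESXi 5.5 building block": Lifecycle.RETIRED,
}

# Anything phasing out or retired belongs on the upgrade/removal backlog.
backlog = [name for name, state in standards.items()
           if state in (Lifecycle.PHASING_OUT, Lifecycle.RETIRED)]
print(backlog)  # ['ESXi 5.5 building block']
```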

By doing the necessary legwork before you upgrade, and by documenting the software and hardware versions you use, upgrades should become less of a game of Jenga where one small upgrade brings down the entire stack.

13 Comments
Level 13

Good Article. Many of the points raised are pertinent to other areas of IT too.

MVP

Good article

This emphasizes the complexity of our environments, the need for the right number of staff to get the job done, and the importance of getting professional training for the people responsible for the house of cards that is the network.

Level 14

This pretty much sums up what I have been telling our management.  We are about to implement a £3m HCI infrastructure based on VxRail with a DS60 + DD6800 data domain and all the associated gubbins in an active / active data centre pair.  After that we will just need to migrate about a thousand VM servers onto it.  Should be fun.

Level 20

Often storage systems don't get updated much unless there's a reason to I've found.

Level 13

JENGA!!!!!!!!!!!!!!!

Level 8

It is NOT a game of JENGA!

API Changes when upgrading.... fond memories....

My biggest pet peeve of IT. Version upgrades/patches/firmwares, etc. We in Support are a mouse on a wheel working hard and running in place when it comes to versioning. So much time. Soooo much technical debt. This needs to be improved.

This type of Jenga block-removal could provide incentive for software writers to perform at their very best levels:

Slow Motion Whip Jenga - YouTube

Level 9

I'm seeing an increasing amount of "critical" upgrades lately, where they fix things ranging from faulty SSD wear leveling to entire "if you don't upgrade, your datacenter is going to burn down"-PSU issues. Plus lots of features being added or improved (data reduction mostly). I tend to upgrade more frequently now than compared to 10 years ago!

Level 9

You could probably also apply this to the people performing infrastructure upgrades... pinpoint accuracy and lightning fast upgrade processes

MVP

Good article. There are several elements to a proper change process (whether that is software updates, patches, network changes, or even developing policies and procedures). I'm a documentation and checklist kind of guy, but I find it's important to build a path to success by defining the steps and parts necessary for the desired outcome to be attained. But staffing has to be made available to fulfill all of the steps, and management has to hold each individual accountable.

About the Author
Based out of the south of the Netherlands, working with everything from datacenters to operating systems. I specialize in storage, back-up & server virtualization systems and will blog about experiences in the field and industry developments. Certified EMCIEe, EMCTAe and EMCCAe on a number of Dell EMC products and constantly trying to find the time to diversify that list. When not tinkering with IT you can find me on a snowboard, motorbike or practicing a variety of sports.