
Two weeks ago, I talked about upgrading storage systems. More specifically, how to determine the right upgrade interval for your storage systems. What the previous post did not cover is the fact that your storage system is part of a larger ecosystem. It attaches to a SAN and LAN, and is accessed by clients. A number of technologies such as backup, snapshots, and replication protect data. Monitoring systems ensure you are aware of what is going on with all the individual components. This list can go on and on…

 

Each component in that ecosystem will receive periodic updates. Some components are relatively static: a syslog server may receive a few patches, but it is unlikely that these patches fundamentally change how the syslog server works. Moreover, in case of a bad update, it is relatively easy to roll back to a previous version. As a result, it is easy to keep a syslog server up to date.

 

Other components change more frequently or have a larger feature change between patches. For example, hyper-converged infrastructure, which is still a growing market, receives many new features to make it more attractive to a wider audience. It is more of a gamble to upgrade these systems: new features might break old functions that your peripheral systems rely on.

 

Finally, do not forget the systems that sit like a spider in the web, such as hypervisors. They run on hardware that needs to be on a compatibility list. Backup software talks to them, using snapshots or features like Changed Block Tracking to create backups and restore data. Automation tools talk to them to create new VMs. Plus, the guest OS in each VM receives virtual hardware and tools upgrades. These systems are again more difficult to upgrade, simply because so many aspects of the software are exposed to other components of the ecosystem.

 

So how can you keep this ecosystem healthy, without too many “Uh-oh, I broke something!” moments with the whole IT stack collapsing like a game of Jenga?

 

Reading, testing, and building blocks

Again, read! Release notes, compatibility lists, advisories, etc. Do not just look for changes in the product itself, but also for changes to APIs or peripheral components. A logical drawing of your infrastructure helps: visualize which systems talk to which other systems.

 

Next is testing. A vendor tests upgrade paths and compatibility, but no environment is like your own. Test diligently. If you cannot afford a test environment, then at the very least test your upgrades on a production system of lesser importance. After the upgrade, test again: does your backup still run? No errors? We had a “no upgrades on Friday afternoon” policy at one customer: it avoids having to pull a weekender to fix issues, or missing three backups because nobody noticed something was broken.

 

As soon as you find the ideal combination of software and hardware versions, create a building block out of it. TOGAF can help: it is a lightweight and adaptable framework for IT architecture. You can tailor it to your own specific needs and capabilities. Moreover, you do not need to do “all of it” before you can reap the benefits: you can pick the components you like.

 

Let us assume you want to run an IaaS platform. It consists of many systems: storage, SAN, servers, hypervisor, LAN, etc. You have read the HCLs and done the required testing, so you are certain that a combination of products works for you. Whatever keeps your VMs running! This is your solution building block.

 

Some components in the solution building block need to be specified precisely. For example, Cisco UCS firmware 3.2(2d) with VMware ESXi 6.5U1 needs fNIC driver Y. Others are more loosely specified: syslogd, any version.

 

Next, track the life cycle of these building blocks, starting with the building block that you’re currently running in production: the active standard. Think ESXi 6.5U1 with UCS blades on firmware 3.2(2d) on a VMAX with SRDF/Metro for replication and Veeam for backup and recovery. Again, specify versions or version ranges where required.

 

You might also be testing a new building block: the proposed or provisional standard. That could be with newer versions of software (like vSphere 6.7) or different components. It could even be completely different and use hyper-converged infrastructure such as VxRail.

 

Finally, there are the old building blocks, either phasing-out or retired. The difference between these stages is the amount of effort you put into upgrading or removing them from your landscape. A building block with ESXi 5.5 could be “phasing-out” in late 2017, which means you will not deploy new instances of it, but you also do not actively retire it. Now though, with the EOL of ESXi 5.5 around the corner, that building block should transition to retired. You need to remove it from your environment because it is an impending support risk.
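
To make the idea concrete, here is a minimal sketch of how such building blocks, their lifecycle status, and their pinned or loose version constraints could be captured in a few lines of Python. The block names are invented; the component names and versions are just the examples from this post, not a recommendation.

```python
# Minimal sketch: building blocks, their lifecycle status, and version pinning.
# Names and versions are only the examples from this post.
BUILDING_BLOCKS = {
    "iaas-v1": {
        "status": "retired",            # was "phasing-out" in late 2017
        "components": {"ESXi": "5.5"},
    },
    "iaas-v2": {
        "status": "active",             # the active standard
        "components": {
            "ESXi": "6.5U1",
            "UCS firmware": "3.2(2d)",  # pinned exactly: needs a specific fNIC driver
            "syslogd": "any",           # loosely specified
        },
    },
    "iaas-v3": {
        "status": "proposed",           # provisional standard, still under test
        "components": {"vSphere": "6.7"},
    },
}

def deployable():
    """Only the active standard should receive new deployments."""
    return [name for name, bb in BUILDING_BLOCKS.items() if bb["status"] == "active"]

print(deployable())  # ['iaas-v2']
```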

 

By doing the necessary legwork before you upgrade, and by documenting the software and hardware versions you use, upgrades should become less of a game of Jenga where one small upgrade brings down the entire stack.

After you’ve installed your new storage systems and migrated your data onto them, life slows down a bit. Freshly installed systems shouldn’t throw any hardware errors in the first stages of their lifecycle, apart from a drive that doesn’t fully realize it’s DOA. Software should be up to date. Maybe you’ll spend a bit more time to fully integrate the systems into your documentation and peripheral systems. Or deal with some of the migration aftermath, where new volumes were sized too small. But otherwise, it should be “business as usual.”

 

That doesn’t mean you can lie back and fall asleep. Storage vendors release new software versions periodically. The cadence used to be a couple of releases a year, apart from new platforms, which might get a few extra patches to iron out the early difficulties. But with the agile mindset of developers, and the constant drive to squash bugs and add new features, software is now often released monthly. So, should you upgrade or not?

 

If It Ain’t Broke…

One camp will go to great lengths to avoid upgrading storage system software. While the theory of “if it ain’t broke, don’t fix it!” has its merits up to a point, it usually comes from fear. Fear that a software upgrade will go wrong and break something. Let’s be honest though: over time, the gap between your (old) software version and the newer software only becomes bigger. If you don’t feel comfortable with an upgrade path from 4.2.0 to 4.2.3, how does an upgrade path from 4.2.0 to 5.0.1 make you feel? Especially if your system shows an uptime of 800+ days?

 

On the other hand, there’s no need to rush either. Vendors perform some degree of QA testing on their software, but it's usually a safe move to wait 30-90 days before applying new software to your critical production systems. Try it on a less critical system first, or let the new installs in the field flush out some additional bugs that slipped through the net. Code releases have been revoked more than once, and you don’t want to be hitting any new bugs while patching old bugs.

 

Target and latest revisions

Any respectable storage vendor should at the very least have a release matrix that shows release dates, software versions, adoption rates, and the suggested target release. This information can help you balance “latest features and bugfixes” versus “a few more new bugs that hurt more than the previous fixes.”
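
What such a matrix boils down to is a table you can query, combined with the “let it mature first” rule from above. A hypothetical sketch; the release dates and adoption figures are invented, and the version numbers are just the ones used as examples in this post.

```python
from datetime import date

# Hypothetical release matrix: version, release date, adoption rate, vendor target flag.
releases = [
    {"version": "4.2.0", "released": date(2018, 1, 15), "adoption": 0.55, "target": False},
    {"version": "4.2.3", "released": date(2018, 6, 1),  "adoption": 0.30, "target": True},
    {"version": "5.0.1", "released": date(2018, 9, 20), "adoption": 0.05, "target": False},
]

def pick_release(matrix, min_age_days=60):
    """Prefer the vendor's suggested target release, but only once it has matured."""
    today = date.today()
    candidates = [r for r in matrix
                  if r["target"] and (today - r["released"]).days >= min_age_days]
    return candidates[-1]["version"] if candidates else None

print(pick_release(releases))  # '4.2.3' -- the mature, suggested target release
```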

 

Again, don’t be lazy and hide behind the target release matrix. Once a new release comes out, check the release notes to see if anything in it applies to your environment. Sometimes it really does make sense to upgrade immediately, such as with critical security or stability patches. Often, the system will check for the latest software release and show some sort of alert. In the last couple of months, I’ve seen patches for premature SSD media wear, overheating power supplies that can set fire to your DC, and a boatload of critical security patches. If you keep up to date with code and release notes, it doesn’t even take that much time to scroll through the latest fixes and feature additions.

 

One step up, there are also vendors that look beyond a simple release matrix. They will look at your specific system and configuration, and select the ideal release and hotfixes for your setup. All of this is based on data they collect from their systems at customers around the globe. And if you fall behind in upgrades and need intermediate updates, they will even select the ideal intermediate upgrades, blacklisting the ones that don’t fit your environment.

 

How often do you upgrade your storage systems? And what’s your biggest challenge with these upgrades? Let me know in the comments below!

A storage system on its own is not useful. Sure, it can store data, but how are you going to put any data on it? Or read back the data that you just stored? You need to connect clients to your storage system. For this post, let’s assume that we are using block protocols (like iSCSI or FC) and traditional block storage systems. This article also applies to file protocols (like NFS and SMB) and to some extent even to hyper-converged infrastructure, but we will get back to that later.

 

Direct attaching clients to the storage system is an option. There is no contention between clients on the ports, and it is cheap. In fact, I still see direct-attached solutions in cases where low cost wins over client scalability. However, direct attaching your clients to a storage system does not scale well with the number of clients: front-end ports on a storage array are expensive and limited.

 

Add some network

Therefore, we add some sort of network. For block protocols, that is a SAN. The two most commonly used protocols are the FC protocol (FCP) and iSCSI. Both protocols use SCSI commands, but the network equipment is vastly different: FC switches vs. Ethernet switches. Both have their advantages and disadvantages, and IT professionals will usually have a strong preference for one of the two.

 

Once you have settled on a protocol, the switch line speed is usually the first thing that comes up. FC commonly uses 16Gbit switches, with 32Gbit switches entering the market lately. Ethernet, however, is making bigger jumps, with 10Gbit being standard within a rack or wiring closet and 25/40/100Gbit commonly used for uplinks to the data center cores.

 

The current higher speeds of Ethernet networks are often one of the arguments why “Ethernet is winning over FC.” 100Gbit Ethernet has already been on the market for quite some time, and the next obvious iteration of FC is “only” going to achieve 64Gbit.

 

Oversubscription

Once you start attaching more clients to a storage system than it has storage ports, you start oversubscribing. 100 servers attached to 10 storage ports means you have on average 10 servers on each storage port. Even worse, if those servers are hypervisors running 30 virtual machines each, you will now have 300 VMs competing for resources on a single port.
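
The arithmetic behind that example is simple enough to automate, which pays off once port counts and consolidation ratios start changing over time:

```python
# Oversubscription math from the example above.
servers = 100
storage_ports = 10
vms_per_hypervisor = 30

servers_per_port = servers / storage_ports             # 10 servers per storage port
vms_per_port = servers_per_port * vms_per_hypervisor   # 300 VMs per storage port

print(f"{servers_per_port:.0f} servers, {vms_per_port:.0f} VMs per storage port")
```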

 

Even the most basic switch will have some sort of bandwidth/port monitoring functionality. If it does not have a management GUI that can show you graphs, third-party software can pull that data out of the switch using SNMP. As long as traffic in/out does not exceed 70% you should be OK, right?
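
Whatever tool draws the graph, the underlying calculation is the same: sample the interface octet counters twice (for example via SNMP) and convert the delta into a utilization percentage. A minimal sketch with made-up numbers:

```python
def port_utilization(octets_t0, octets_t1, interval_s, line_speed_bps):
    """Utilization (%) of a port, from two samples of its octet counter."""
    bits = (octets_t1 - octets_t0) * 8
    return 100.0 * bits / (interval_s * line_speed_bps)

# Example: a 10Gbit port that moved ~190 GB in a 5-minute polling interval.
util = port_utilization(0, 190_000_000_000, 300, 10_000_000_000)
print(f"{util:.0f}% utilized")  # ~51% -- below the 70% rule of thumb
```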

 

The challenge is that this is not the whole truth. Other, more obscure limitations might ruin your day. For example, you might be sending a lot of very small I/O to a storage port. Storage vendors often brag about 4KB I/O performance specs. 25,000 4KB IOps only accounts for roughly 100MB/s or 800Mbit (excluding overhead). So, while your SAN port shows a meager 50% utilization, your storage port or HBA could still be overloaded.
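
The numbers in that example are easy to verify:

```python
# Small-block I/O barely registers on a bandwidth graph.
iops = 25_000
io_size_bytes = 4 * 1024                              # 4KB I/O

throughput_mb_s = iops * io_size_bytes / 1_000_000    # ~102 MB/s
throughput_mbit_s = throughput_mb_s * 8               # ~820 Mbit/s, excluding overhead

print(f"{throughput_mb_s:.0f} MB/s, {throughput_mbit_s:.0f} Mbit/s")
```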

 

It becomes more complex once you start connecting SAN switches together and distributing clients and storage systems across this network of switches. It is hard to keep track of how much traffic between client and storage ports traverses the ISLs (Inter-Switch Links). In this case, it is a smart move to keep your SAN topology simple and to be careful with oversubscription ratios. Do the oversubscription math, and look beyond the standard bandwidth graphs. Check error counters, and in an FC SAN that has long-distance links, check whether the Buffer-to-Buffer credits deplete on a port.

 

Ethernet instead of FC

The same principles apply to Ethernet. One argument why a company chooses an Ethernet-based SAN is that it already has LAN switches in place. In these cases, be extra vigilant. I am not opposed to sharing a switch chassis between SAN and normal client traffic. However, ports, ISLs, and switch modules/ASICs are prime contention points. You do not want your SAN performance to drop because a backup, restore, or large data transfer starts between two servers and both types of traffic start fighting for the available bandwidth.

 

Similarly, hyper-converged infrastructure solutions like VxRail and other VMware vSAN-based systems place high demands on the Ethernet uplinks. Ideally, you would want to ensure that VMware vSAN uses dedicated, high-speed uplinks.

Which camp are you in? FC or Ethernet, or neither? And how do you ensure that the SAN doesn’t become a bottleneck? Comment below!

At some point, your first storage system will be “full.” I’m writing it as “full” because the system might not actually be 100% occupied with data at that exact point in time. The system could be full for another technical reason. For example, shared components in a system (e.g., CPUs) are overloaded before you ever install the maximum number of drives, and upgrading those would be too expensive. Or it could be an administrative decision not to hand out new capacity from an existing system. For example, you’re expecting rapid organic growth of several thin-provisioned volumes, which would soon use up the capacity headroom of the current system.

 

The fact that a single system has reached its maximum capacity, either for technical or administrative reasons, does not mean you need to turn away customers. IT should be a facilitator to the business. If the business needs to store additional data, there’s often a good reason for it. In health care, it could be storing medical images. For a service/cloud provider, hosting more (paying) customers. So instead of communicating “Sorry, we’re full, go somewhere else!”, we should say something along the lines of “Yes, we can store your data, but it’s going to land on a different system.” In fact, just store the data and leave out the system part!

 

More of the same?

When your first system is full and you’re buying another one, you could buy a similar system and install it next to the original one. It might be a bit faster, or a bit more tuned for capacity. Or completely identical, if you were happy with the previous one.

 

On the other hand, this might also be a good moment to differentiate between the types of data in your company. For example, if you started out with block storage, maybe this is the time to buy a NAS and offload some of the file data to it.

 

Regardless of type, introducing a second system will create a couple of challenges for the IT department. First, you’ll now have to decide which system you want new data to land on. With identical systems, it might be a fill-and-spill approach, where you fill up the first system and then move over to the second box.

 

Once you introduce different types and speeds of systems though, you need to differentiate between types of data and the capabilities of systems. Some data might be better suited to land on a NAS, other data on a spinning-disk SAN array, and another flavor of data on an all-flash SAN array. And you need to keep track of which clients/devices are attached to which systems, so documentation and a clear naming convention are paramount.
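
Once more than one system can receive new data, the placement rule itself is worth writing down, whether as documentation or as a bit of code. A hypothetical sketch that picks a target system by data type first and free capacity second; the system names and capacities are invented:

```python
# Hypothetical inventory of storage systems: type and free capacity in TB.
systems = [
    {"name": "nas-01",    "type": "file",  "free_tb": 40},
    {"name": "san-01",    "type": "block", "free_tb": 5},   # spinning disk, nearly full
    {"name": "af-san-01", "type": "block", "free_tb": 25},  # all-flash
]

def place(data_type, size_tb):
    """Pick a system of the right type with the most free capacity left."""
    candidates = [s for s in systems
                  if s["type"] == data_type and s["free_tb"] >= size_tb]
    if not candidates:
        raise RuntimeError("No system can take this data -- time to expand or buy.")
    return max(candidates, key=lambda s: s["free_tb"])["name"]

print(place("block", 10))  # 'af-san-01'
```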

 

Keeping it running

Then there’s the challenge of keeping all the storage systems running. You can probably monitor a handful of systems with the in-box GUI, but that doesn’t scale well. At some point, you need to add at least central monitoring software, to group all the alerts and activities in a single user interface. Even better would be central management, so you don’t have to go back to the individual boxes to allocate LUNs and shares.

 

With an increasing number of storage systems comes an increasing number of attached servers and clients. Ensuring that all clients, interconnects, and systems are on the right patch levels is a vertical task across all these layers. You should look at the full stack to ensure you don’t break anything by patching one layer to a newer level.

 

If you glue too many systems together, you’ll end up with a spaghetti of shared systems that makes patch management difficult, if not impossible. Some clients will be running old software that prevents other layers (like the SAN or storage array) from being patched to the newest levels. Other attached clients might rely on those newer code levels because they run a newer hypervisor. You’ll quickly end up with a very long string of upgrades that need to be performed before you’re fully up to date and compliant. So, it’s probably best to create building blocks of some sort.
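
One way to see how long that string of upgrades gets is to write the support matrix down as explicit constraints and check your current and desired versions against it. A heavily simplified, hypothetical sketch; the layer names and version numbers are invented:

```python
# Hypothetical, heavily simplified support matrix: which hypervisor versions each
# backup release supports, and which array firmware each hypervisor release supports.
backup_supports = {"9.0": {"6.0"}, "9.5": {"6.0", "6.5"}}
hypervisor_supports = {"6.0": {"5.0", "5.1"}, "6.5": {"5.1", "5.2"}}

def stack_ok(backup, hypervisor, array_fw):
    """True if every layer in the stack is supported by the layer above it."""
    return (hypervisor in backup_supports.get(backup, set())
            and array_fw in hypervisor_supports.get(hypervisor, set()))

print(stack_ok("9.0", "6.0", "5.0"))  # True: the current stack is supported
print(stack_ok("9.0", "6.0", "5.2"))  # False: patching the array alone breaks support
print(stack_ok("9.5", "6.5", "5.2"))  # True: but only after backup + hypervisor upgrades
```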

 

How do you approach the “problem” of growing data? Do you throw more systems at it, or upgrade capacity/performance of existing systems? And how do you ensure that the infrastructure can be managed and patched? Let me know!

When designing the underlying storage infrastructure for a set of applications, several metrics are important.

 

First, there’s capacity. How much storage do you need? This is a metric that’s well understood by most people. People see GBs and TBs on their own devices and subscription plans on a daily basis, so they’re well aware of it.

 

There’s also performance, which is a bit more difficult. People tend to think in terms of “slow vs. fast,” but these are subjective measures. For storage, the most customer-centric metric is response time: how long does it take to process a transaction? Response time is, however, a function of a few other metrics, including I/O operations per second, the size of an I/O, and the queue depth of other I/O in front of you.

 

Sizing a storage system

If you size a storage system to meet both capacity and peak performance requirements, you will generally have low response times. Capacity is easy; I need X Terabytes. Ideally, you’d also have some performance numbers to base the size of your system on, including expected IOps, I/O size, and read:write ratio to name a few. If you don’t have these performance requirements, a guesstimate is often the closest you can get.
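
As a back-of-the-envelope illustration of how those numbers combine; the workload figures and the RAID write penalty below are assumptions made up for the example, not guidance:

```python
# Back-of-the-envelope sizing from front-end performance requirements.
# All input numbers are example assumptions, not recommendations.
frontend_iops = 20_000
io_size_kb = 8
read_ratio, write_ratio = 0.7, 0.3
raid_write_penalty = 4                       # classic RAID-5 small-write penalty

backend_iops = frontend_iops * (read_ratio + write_ratio * raid_write_penalty)
throughput_mb_s = frontend_iops * io_size_kb / 1024

print(f"~{backend_iops:.0f} back-end IOps, ~{throughput_mb_s:.0f} MB/s front-end")
```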

 

With this information, and an idea of which response time you’re aiming for, it’s possible to configure a system that should be in the sweet spot. Small enough to make it cost effective, yet large enough that you can absorb some growth and/or unexpected peaks in performance and capacity. Depending on your organization and budget, you might undersize it to only cover the 95th percentile peak performance, or you might oversize it to facilitate growth in the immediate future.
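
Sizing for the 95th percentile rather than the absolute peak is a small calculation once you have a performance history; the samples below are made up:

```python
# Sizing for the 95th percentile instead of the absolute peak (naive percentile).
hourly_iops = [4_000, 5_500, 6_200, 7_000, 9_500, 6_800, 5_900, 30_000]  # made-up samples

p95 = sorted(hourly_iops)[int(0.95 * (len(hourly_iops) - 1))]
print(p95)  # 9500 -- the single 30,000 IOps spike no longer drives the sizing
```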

 

Let it grow, let it grow… and monitor it!

Over time though, your environment will start to grow. Data sets increase and more users connect to it. Performance demands grow in step with capacity. This places additional demands on the system; demands that it wasn’t sized for initially.

 

Monitoring is crucial in this phase of the storage system lifecycle. You need to accurately measure the capacity growth over time. Automated forecasts will help immensely. Keep an eye on the forecasting algorithms and the statistics history. If the algorithm doesn’t use enough historical data, it might result in extremely optimistic or pessimistic predictions!
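
A naive linear forecast illustrates why the amount of history matters: fit only the last few samples and one busy quarter dominates the prediction. The capacity numbers are invented for the example.

```python
# Naive linear capacity forecast; the history window drives the outcome.
used_tb = [50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 75, 85]  # monthly samples (invented)

def forecast(history, months_ahead=6):
    """Least-squares line through the samples, extrapolated forward."""
    n = len(history)
    x_mean, y_mean = (n - 1) / 2, sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
             / sum((x - x_mean) ** 2 for x in range(n)))
    return history[-1] + slope * months_ahead

print(f"{forecast(used_tb):.0f} TB")       # a full year of history: ~101 TB
print(f"{forecast(used_tb[-3:]):.0f} TB")  # only the last quarter: ~136 TB, far steeper
```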

 

Similarly, performance needs to be guaranteed throughout the life of the array. The challenge with performance monitoring is that it’s usually a chain of components that influence each other: disks connect to buses, which connect to processors, which connect to front-end ports, and you need to monitor them all. Depending on the component that’s overloaded, you might be able to upgrade it. For example, connect additional front-end ports to the SAN or upgrade the storage processors. At some point though, you’re going to hit a limit. Then what?
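
A crude but useful way to reason about such a chain: the most utilized component sets the ceiling, no matter how healthy the rest looks. A hypothetical sketch with invented utilization figures:

```python
# The busiest link in the chain is the effective bottleneck (hypothetical numbers).
utilization = {
    "disks": 0.45,
    "back-end bus": 0.30,
    "storage processors": 0.88,
    "front-end ports": 0.55,
    "SAN ISLs": 0.40,
}

bottleneck = max(utilization, key=utilization.get)
print(f"Bottleneck: {bottleneck} at {utilization[bottleneck]:.0%}")
```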

 

Failure domain

Fewer, larger systems have several advantages over multiple smaller arrays. There are fewer systems to manage, which saves you time in monitoring and day-to-day maintenance. Plus, there is less waste, as silos tend not to be fully utilized.

 

One important aspect to consider, though, is the failure domain. What’s the impact if a system or component fails? Sure, you could grow your storage system to the largest possible size. But if it fails, how long would you need to restore all that data? In a multi-tenancy situation, how many customers would be impacted by a system failure? Licenses for larger systems are sometimes disproportionately more expensive than those for their smaller cousins; does this offset the additional hassle of managing multiple systems? Multiple approaches are possible. Let me know which direction you’d choose: fewer, bigger systems, or multiple smaller systems!
