
Do You Still Need To Worry About Performance Once You Go All-Flash?

Level 9

Flash storage can be really, really fast. Crazy fast. So fast that some have openly asked if they really need to worry about storage performance anymore. After all, once you can throw a million IOPS at the problem, your bottleneck has moved somewhere else!

So do you really need to worry about storage performance once you go all-flash?

Oh yes, you definitely do!

All-Flash Storage Can Be Surprisingly Slow

First, most all-flash storage solutions aren't delivering that kind of killer performance. In fact, most all-flash storage arrays can push "only" tens of thousands of IOPS, not the millions you might expect! For starters, those million-IOPS storage devices are internal PCIe cards, not SSDs or storage arrays. So we need to revise our IOPS expectations downwards to the "hundred thousand or so" that an SSD can deliver. Then it gets worse.

Part of this is a common architectural problem found in all-flash storage arrays which I like to call the "pretend SSDs are hard disks" syndrome. If you're a vendor of storage systems, it's pretty tempting to do exactly what so many of us techies have done with our personal computers: Yank out the hard disk drives and replace them with SSDs. And this works, to a point. But "storage systems" are complex machines, and most have been carefully balanced for the (mediocre) performance characteristics of hard disk drives. Sticking some SSDs in just over-taxes the rest of the system, from the controller CPUs to the I/O channels.

But even storage arrays designed for SSD's aren't as fast as internal drives. The definition of an array includes external attachment, typically over a shared network, as well as redundancy and data management features. All of this gets in the way of absolute performance. Let's consider the network: Although a 10 Gb Ethernet or 8 Gb Fibre Channel link sounds like it would be faster than a 6 Gb SAS connection, this isn't always the case. Storage networks include switches (and sometimes even routers) and these add latency that slows absolute performance relative to internal devices. The same is true of the copy-on-write filesystems protecting the data inside most modern storage arrays.
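
To put rough numbers on that, here's a back-of-the-envelope sketch: at a queue depth of one, IOPS per stream is roughly the inverse of total latency, so every switch hop and protocol layer comes straight out of your throughput. The latency figures below are illustrative assumptions, not measurements of any particular device or network.

```python
# Back-of-the-envelope: how added fabric latency erodes per-stream IOPS.
# All latency figures are illustrative assumptions, not measurements.

def effective_iops(latencies_us, queue_depth=1):
    """At a given queue depth, IOPS is roughly queue_depth / total latency."""
    total_seconds = sum(latencies_us) / 1_000_000
    return queue_depth / total_seconds

internal_ssd = [100]                     # hypothetical internal SSD read latency (microseconds)
networked_array = [100, 25, 10, 10, 30]  # same media, plus controller, two switch hops, protocol overhead

print(f"Internal SSD, QD1:    {effective_iops(internal_ssd):,.0f} IOPS per stream")
print(f"Networked array, QD1: {effective_iops(networked_array):,.0f} IOPS per stream")
```

Higher queue depths hide some of this, but the per-I/O latency penalty never goes away, which is exactly what latency-sensitive applications notice.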

And maximum performance can really tax the CPU found in a storage array controller. Would you rather pay for a many-core CPU so you'll get maximum performance or for a bit more capacity? Most storage arrays, even specialized all-flash devices, under-provision processing power to keep cost reasonable, so they can't keep up with the storage media.

Noisy Neighbors

Now that we've reset our expectations for absolute performance, let's consider what else is slurping up our IOPS. In most environments, storage systems are shared between multiple servers and applications. That's kind of the point of shared networked storage, after all. Traditionally, storage administrators have carefully managed this sharing because maximum performance was naturally quite limited. With all-flash arrays, there is a temptation to "punt" and let the array figure out how to allocate performance. But this is a very risky choice!

Just because an array can sustain tens or even hundreds of thousands of I/O operations per second doesn't mean your applications won't "notice" if some "noisy neighbor" application is gobbling up all that performance. Indeed, performance can get pretty bad since each application can have as much performance as it can handle! You can find applications starved of performance and trudging along at disk speeds...

This is why performance profiling and quality of service (QoS) controls are so important in shared storage systems, even all-flash. As an administrator, you must profile the applications and determine a reasonable amount of performance to allocate to each. Then you must configure the storage system to enforce these limits, assuming you bought one with that capability!

Note that some storage QoS implementations are absolute, while others are relative. In other words, some arrays require a hard IOPS limit to be set per LUN or share, while others simply throttle performance once things start "looking hot". If you can't tolerate uneven performance, you'll have to look at setting hard limits.
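
To make the "hard limit" flavor of QoS concrete, here's a minimal token-bucket sketch of the kind of throttle an array or hypervisor might apply per LUN or share. It's purely illustrative; the class, method names, and per-LUN numbers are made up, not any vendor's actual implementation.

```python
import time

class IopsLimiter:
    """Toy token-bucket throttle: admit at most `limit_iops` I/Os per second."""

    def __init__(self, limit_iops):
        self.limit = limit_iops
        self.tokens = float(limit_iops)
        self.last = time.monotonic()

    def admit(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at one second's worth.
        self.tokens = min(self.limit, self.tokens + (now - self.last) * self.limit)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # I/O proceeds immediately
        return False      # caller must queue or delay this I/O

# Hypothetical per-LUN limits an administrator might set after profiling:
limits = {"oltp-db": IopsLimiter(20_000), "file-share": IopsLimiter(2_000)}
```

A relative ("throttle when things look hot") implementation skips the fixed limit and backs off only under contention, which is why it produces more uneven results.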

Tiered Flash

If you really need maximum performance, tiered storage is the only way to go. If you can profile your applications and segment their data, you can tier storage, reserving maximum-performance flash for just a few hotspots.

Today's hybrid storage arrays allow data to be "pinned" into flash or cache. This delivers maximum performance but can "waste" precious flash capacity if you're not careful. You can also create higher-performance LUNs or shares in all-flash storage arrays using RAID-10 rather than parity or turning off other features.
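
As a sketch of how you might decide what deserves pinning, here's a toy hotspot ranking: given per-extent access counts from your profiling, pin the "hottest" data (most accesses per gigabyte) until the flash you're willing to reserve is full. The extent names, sizes, and access counts are hypothetical profiling data, not from a real system.

```python
# Toy tiering decision: pin the hottest extents until the reserved flash is full.
# Sizes and access counts are hypothetical profiling data.

extents = [
    {"name": "db-index",   "gb": 200,  "reads_per_day": 9_000_000},
    {"name": "db-tables",  "gb": 800,  "reads_per_day": 1_500_000},
    {"name": "vm-boot",    "gb": 400,  "reads_per_day": 600_000},
    {"name": "file-share", "gb": 4000, "reads_per_day": 120_000},
]

flash_budget_gb = 1000
pinned, used = [], 0

# Rank by "heat density" (accesses per GB) so small, busy extents win the flash.
for ext in sorted(extents, key=lambda e: e["reads_per_day"] / e["gb"], reverse=True):
    if used + ext["gb"] <= flash_budget_gb:
        pinned.append(ext["name"])
        used += ext["gb"]

print(f"Pin to flash: {pinned} ({used} GB of {flash_budget_gb} GB budget)")
```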

But if you want maximum performance, you'll have to move the data off the network. It's pretty straightforward to install an NVMe SSD directly in a server, especially in modern servers with disk-like NVMe slots or M.2 connectors. These deliver remarkable performance but offer virtually no data protection. So doing this with production applications puts data at risk and requires a long, hard look at the application.

You can also get data locality by employing a storage caching software product. There are a few available out there (SanDisk FlashSoft, Infinio, VMware vFRC, etc.) and these can help mitigate the risks of local data by ensuring that writes are preserved outside the server. But each has its own performance quirks, so none is a "silver bullet" for performance problems.
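
The risk-mitigation idea those products share is keeping the authoritative copy off the server: reads can be served from local flash, but writes are persisted to shared storage before they're acknowledged. Here's a minimal write-through sketch of that idea; the cache and backend objects are hypothetical stand-ins, not any product's API.

```python
class WriteThroughCache:
    """Toy write-through cache: local flash accelerates reads, but every write
    is persisted to the shared array before it is acknowledged."""

    def __init__(self, backend):
        self.backend = backend   # hypothetical client for the shared array
        self.local = {}          # stands in for server-side flash

    def read(self, block):
        if block in self.local:          # cache hit: served at local flash speed
            return self.local[block]
        data = self.backend.read(block)  # cache miss: fetch from the array
        self.local[block] = data
        return data

    def write(self, block, data):
        self.backend.write(block, data)  # array copy first, so losing the server can't lose the write
        self.local[block] = data         # then update the local copy
```

Write-back variants acknowledge from local flash first and flush later, which is faster but reintroduces exactly the data-protection gap described above.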

Stephen's Stance

Hopefully I've given you some things to think about when it comes to storage performance. Just going "all-flash" isn't going to solve all storage performance problems!

I am Stephen Foskett and I love storage. You can find more writing like this at blog.fosketts.net, connect with me as @SFoskett on Twitter, and check out my Tech Field Day events.

10 Comments
Level 21

I would first like to say that I think you make some excellent points here!

We manage a considerable number of storage systems, ranging from SATA shelves all the way up to flash systems.  I think another important point to add is that, due to the underlying storage system architecture, not all flash systems are created equal.  Some flash systems have much better hardware in their controllers and some have much better software for managing those systems, all of which can significantly impact performance.

To support your point regarding performance management still being very important: we just recently had a client completely overwhelm their flash storage system, which they didn't think would be possible.  By running a single (very inefficient) backup process against all of their VMs at the same time, all of which lived on the same flash array, they overwhelmed the controllers' capabilities and latencies shot through the roof.  The entire environment came to a screeching halt.  This is a perfect example of why performance management is still critically important.  You need to know the limitations of your systems and architecture and monitor thresholds so you don't hit those limitations.

MVP

Good points!

The topic is a red-flag header:  Anytime someone asks whether you still need to worry about performance once you've made change "X", the answer is always "Yes."

Unless change "X" involves retiring the application without a replacement.

"moving the bottleneck..." a game I used to play many years ago with the engineering team. Make the disks go fast? Well then the controllers are too slow. Upgrade the controllers and then the server's motherboard is too slow. Upgrade that and then the NIC/CPU is too slow. Upgrade them and then the network... it doesn't sound like much has changed. At least the technology continues to improve. The amount of data that can move today as compared to 15 years ago... wowza!

Level 12

All of these excellent points aside, does anyone know if there is real-world data out there on the reliability of flash-based shared storage?  One of the biggest problems with spinning disks has always been that whole mechanical failure issue.  Even if flash offered little to no performance gains (and I don't believe that to be true for even a second), if it offered a significantly lower failure rate in the component storage, wouldn't that still be worth something?

Great article here!

SSD reliability has been tested & reported on pretty thoroughly, and I'd tend to treat it as equivalent to flash-based storage:

Which SSDs are the most reliable? Massive study sheds some light | ExtremeTech

Google did some intense testing here:  SSD reliability in the real world: Google's experience | ZDNet   I'm impressed with their discoveries.


Level 13

Nice points, as I just moved my storage to a tiered solution.

Level 13

@Stephen, I couldn't agree more. In fact, I wrote a similar post on my blog not too long ago, prompted by a conversation that arose from a TFD event: Why IOPS? – Virtuallytiedtomydesktop's Blog. There are so many variables that affect the performance of an AFA. A perfect example would be the overtaxing of a SAS card. Putting 48 SSDs on a four-channel SAS card is simply bad practice. Like anything else, you need to build your configurations based on what kinds of performance you're hoping to achieve.  Connectivity fabric is also a huge performance enhancement or hindrance. Let's get ourselves away from slow Ethernet or Fibre Channel.

I'd add that many applications aren't able to handle the kinds of I/O that super-fast disk pushes back. Often, your bottleneck has nothing to do with the storage, but the way that the app handles its I/O.

Level 14

We tend to have power and space issues.  Losing power greatly shortens spinning drive life.  SSD is more resilient when it comes to power.  It also takes up less space.

Level 9

Hah! You're so right. Plus, there's Betteridge's law of headlines... So yeah, you see what I did there...

About the Author
Former sysadmin and storage consultant, present cat herder for Tech Field Day and Gestalt IT, future old man shouting “on-premises” at passing business droids