
It is Always the Storage. Always.

Level 11

I’ve always loved the story of how Henry Ford built his automotive empire. During the Industrial Revolution, it became increasingly important to automate the construction of products to gain a competitive advantage in the marketplace. Ford understood that building cars faster and more efficiently would be hugely advantageous, so he developed an assembly line as well as a standardized selling method (you could buy a Model T in any color, as long as it was black). If you want to know more about how Ford changed the automotive industry (and much more), there is plenty of information on the interwebs.

In the next couple of posts, I will dive a little deeper into the reasons why keeping your databases healthy in the digital revolution is so darn important. So please, let’s start with the first piece of the puzzle: that important part of the database infrastructure we call storage.

As I already said, I really love the story of Ford and the way he changed the world forever. We, however, live in a revolutionary time that is changing the world even faster. This revolution seems -- and seems is the right word if you ask me -- to focus on software instead of hardware. Given that the Digital Revolution is still relatively young, we must be like Henry and think like pioneers in this new space.

In the database realm, it seems to be very hard to know where the performance, or lack thereof, comes from and where we should look to solve the problems at hand. In a lot of cases, it is almost automatic to blame it all on the storage, as the title implies. But knowledge is power, as my friend SpongeBob has known for so long.


Storage is an important part of the database world, and with constantly changing and evolving hardware technology, we can squeeze more and more performance out of our databases. That being said, there is always a bottleneck. Of course, it could be that storage is the bottleneck we’re looking for when our databases aren’t performing the way they should. But in the end, we need to know what the bottleneck is and how we can fix it. Even more important is the ability to analyze and monitor the environment so that we can predict database performance and adjust it as needed before problems occur.
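For SQL Server shops (the platform most of this thread revolves around), one quick way to test the “it is always the storage” claim is to look at wait statistics before blaming the disks. Below is a minimal sketch in Python, assuming the pyodbc driver and a hypothetical DSN named MyServer; the short list of I/O wait types is illustrative, not exhaustive.

import pyodbc

# Wait types commonly tied to storage I/O (illustrative, not exhaustive)
IO_WAITS = {"PAGEIOLATCH_SH", "PAGEIOLATCH_EX", "WRITELOG", "IO_COMPLETION"}

# "DSN=MyServer" is a hypothetical data source name
conn = pyodbc.connect("DSN=MyServer;Trusted_Connection=yes;")
rows = conn.execute(
    "SELECT wait_type, wait_time_ms FROM sys.dm_os_wait_stats "
    "WHERE wait_time_ms > 0"
).fetchall()

total_ms = sum(row.wait_time_ms for row in rows) or 1
io_ms = sum(row.wait_time_ms for row in rows if row.wait_type in IO_WAITS)
print(f"I/O-related waits: {io_ms / total_ms:.1%} of total wait time")
# A small percentage here suggests the bottleneck probably isn't storage.

If I/O waits turn out to be a small slice of the total, the storage is probably innocent, and the tuning time is better spent elsewhere.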

Henry Ford was looking for ways to fine-tune the way a car was built, and ultimately developed an assembly line for that purpose. His invention cut the amount of time it took to build a car from 12 hours to a surprising two-and-a-half hours. In the database world, speed is important too, but blaming storage and focusing on solving only part of the database puzzle is short-sighted. Knowing your infrastructure and being able to tweak it and solve problems before they start messing with your performance is where it all starts. Do you think otherwise? Please let me know if I forgot something, or got it all wrong. I would love to start the discussion, and see you on the next post.

24 Comments
MVP
MVP

Nice post

Level 20

From my experience, storage problems can be a HUGE headache. Once storage has problems, the datastores do too, and then the VMs start having problems. It can quickly spiral out of control!

Level 17

Maybe I was just lucky, but in my years of being a production DBA I can count on one hand the number of times storage was the bottleneck. If I had to rank the resource bottlenecks in order from most likely to cause issues to least, I would go with network, memory, CPU, and then storage. And I'd put locking/blocking ahead of storage as well. In fact, I've got a list of the five most common problems with SQL Server, and storage isn't there: The Top 5 Most Common Problems With SQL Server - Thomas LaRock

I'm not saying storage can be ignored, just that it isn't much of a factor in a virtualized world combined with SSD arrays. If you are spending time managing disk in order to squeeze out performance then you are doing infrastructure wrong (IMO).

Level 20

Yes but you've gotta admit that when there is a storage problem... the effects can be really really bad.

Level 17

Agreed, no question.

MVP
MVP

Nothing quite like a misconfigured SAN switch, or a SAN switch with buggy software, to make life "interesting".

Why not simplify troubleshooting from the DBA/NPM point of view: create a Bottleneck Detector and escalate its alert to the top of NPM/DBA?


Level 21

As odd as it may sound, the biggest problem I have had with databases is backups. Backups that have caused storage slowdowns, backups that have caused failovers, backups that have caused VMs to quiesce (which databases really don't like), etc.

MVP
MVP

yes.....this ^^^^

After upgrading to NPM 12.2, I've had issues with SolarWinds' database when the DBAs do their re-indexes or defrags. I'll come into work in the morning and NPM is hung, with this message:

[screenshot of the error message]

So there's a classic case of a database impacting customers.

In this case I resolved it by rebooting the NPM main instance; perhaps I could have recovered more quickly by stopping and starting one or more Orion services.

The point, though, is that a normal database action caused a service outage. It could have affected other apps besides Orion; I'm "lucky" in that only my team and our SolarWinds customers were affected.

Level 17

But that's an issue with the switch, not the storage itself. Hence my comments from earlier. Often the issue isn't with the disk itself, but with the act of *trying* to get to the disk.

Level 17

Probably because you don't want to see "bad code" at the top of the list each morning.

Level 17

A single backup causing this issue? Or do you mean all servers doing database dumps at once? We had issues when all the servers would kick off at the same time, and we would stagger our backups to help offset that issue. But if a single backup is causing an issue then you need to talk with someone about how best to store the backups so that you don't interfere with any other processing. We would have dedicated backup servers/datastores, segregated by themselves to minimize performance issues.
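To make the staggering idea concrete, here is a minimal sketch that spreads backup start times evenly across a window; the server names and window bounds are hypothetical.

from datetime import datetime

# Hypothetical server list and backup window (01:00-05:00)
servers = ["sql01", "sql02", "sql03", "sql04", "sql05"]
window_start = datetime(2017, 10, 13, 1, 0)
window_end = datetime(2017, 10, 13, 5, 0)

# Divide the window evenly so no two dumps kick off at the same time
step = (window_end - window_start) / len(servers)
for i, server in enumerate(servers):
    print(f"{server}: backup starts at {window_start + i * step:%H:%M}")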

It's not exactly that--I'd actually LIKE to see a "bad code" message at the top when I come in the morning, if it meant I could get the root problem corrected.

Treating a symptom by rebooting NPM, or rebooting any other server, is never the right answer--it's just a temporary work-around until the root cause can be identified and corrected.

Level 17

I don't know all the specifics of the upgrade or the maintenance process in play for you there, but it seems that you've lost your connection. This is a common issue for database servers and applications, and it depends on the nature of the work and how the apps try to connect. Your DBA team should be able to explain what is happening and how to avoid this issue. It could be a simple fix on their end.
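For what it's worth, here is a minimal sketch of the kind of reconnect logic that helps in this situation, assuming the pyodbc driver; the DSN and retry counts are hypothetical.

import time
import pyodbc

def connect_with_retry(conn_str, attempts=5, delay_seconds=10):
    # Keep retrying so a short maintenance window doesn't hang the app
    for attempt in range(1, attempts + 1):
        try:
            return pyodbc.connect(conn_str, timeout=5)  # 5-second login timeout
        except pyodbc.Error:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay_seconds)

# "DSN=OrionDB" is a hypothetical data source name
conn = connect_with_retry("DSN=OrionDB;Trusted_Connection=yes;")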

Level 17

Yes, *you* would like that, but not the dev team. At least not any dev I ever worked with.

Level 21

This isn't a single backup, it's backups for everything. We have staggered them as much as we can and still keep them all running within the backup window. We also have a bunch of different backup types taking place: app-aware backups using agents, agentless VM snapshot backups, etc., all depending on the systems/environments and what our clients' requirements are. We are currently in the process of evaluating the entire thing and looking at moving to less impactful backups that take place completely on the storage side. Most of our production storage is Pure flash storage, so we are really doing something wrong if we are causing problems for it.

Level 17

Likely not an issue with the Pure array, but with the data getting to the array. I've seen issues when the VM kernel gets flooded, and SQL Server *thinks* it is disk related, but it's not. Unless you are running the esxtop command, though, you might never know the kernel was the issue. It sounds like you've got a lot more going on there than just a few database dumps.
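If you want to check the kernel angle, esxtop's batch mode makes it easy to capture the disk latency counters for later review. A minimal sketch, assuming SSH access to the host; the host name and sample counts are hypothetical.

import subprocess

# -b: batch mode, -d 2: 2-second samples, -n 30: 30 samples (about a minute)
result = subprocess.run(
    ["ssh", "root@esx01", "esxtop -b -d 2 -n 30"],
    capture_output=True, text=True, check=True,
)
with open("esxtop_batch.csv", "w") as f:
    f.write(result.stdout)
# In the disk columns, high KAVG (kernel) alongside low DAVG (device)
# latency points at the hypervisor/queueing, not the array.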

My DBA saw the issue and has been analyzing logs and troubleshooting it. It turns out the system hosting my Orion database is also hosting several other databases, one of which was out of control. So FAR out of control that it impacted the other databases on that system.

Yes, my system was disconnected from the Orion database because of that--nicely diagnosed!

Now, if only I could get the DBAs to buy into SAM and DPA...

Level 16

Back in the day it was always a network problem, until you proved it wasn't. Now it's storage, SQL, or the VM cluster, and the network is rarely the culprit.

Level 12

The Network, in the court of IT, is guilty until proven innocent.

Many Network Engineers were harmed in the making of this statement.

I agreed with that statement several years ago.  Well, actually, back in 2004. 

But that sentence no longer applies in my business.

NPM has been a Godsend, showing me where the network can be improved immensely, and as a result of those corrections I've changed the reputation of the network such that the Help Desk, the SAs, the DBAs, the Apps folks, and the Desktop Support people no longer send tickets to my Network team automatically.

Using NPM correctly, and making changes to the network accordingly, has resulted in SolarWinds being a mirror with an MTTI (mean time to innocence) rating of <1 minute. Frankly, it's sub-second, since people have stopped blaming the network and started looking further before dropping it in my team's Inbox for triage & diagnosis.

Level 17

Having the right tools makes all the difference in the world. You get to cut through the blamestorming and focus on where the problems are located.

This is where viewing the whole stack is essential.