Next on a very special episode of SolarWinds Lab. [light guitar music]
Leon, the school called again. They tell me you've put the SolarWinds database on RAID 5 again. Who's teaching you to do these things, Leon?
I learned it from you, okay. I learned it from watching you. You're not my real DBA.
An episode no IT pro should miss.
Leon, open this door. You can't keep locking and blocking like this forever. It's not healthy.
Think of the data. The data!
A tragic situation that affects many in our community.
We're going to get through this together. I've made you a nice fresh SSD array.
If you or someone you love is showing the warning signs of poor database storage design, watch SolarWinds Lab.
Thank you for believing in my dataset. I promise I won't let you down ever again. [Light guitar music] As you can see, we feel pretty strongly about putting databases on RAID 5.
Well not always, because some databases will do fine on RAID 5 storage.
Okay, seriously guys. I thought we had this completely figured out already.
No, Tom has a point. The relationship between database and storage is not as simple as a "more power" answer.
So, how's anyone to really know whether their setup is adequate or not?
Well for starters, there are some guidelines that you can follow.
Right, and I'd assume monitoring figures in there somewhere, right?
Well then we better get started.
Hi. I'm Leon Adato.
I'm Thomas LaRock.
I'm Kevin Sparenberg, Product Manager for the online demo.
I'm just crazy excited to be on set of SolarWinds Lab.
He means to say that he's James Honey, Senior Product Marketing Manager and our resident storage enthusiast, as you can see.
Yes, so in this episode, we're going to talk to you about the right setup for your SolarWinds database.
By extension, any database right?
Along the way, we'll show you how to figure out the health of your database and your storage environments.
Meanwhile, if you have questions, you can ask us in the chat window that you see all the way over there. If you don't see a chat window, it means you aren't watching us live. To do that, head over to lab.solarwinds.com and sign up for reminders, or to catch up on past episodes, or leave us a comment about what you'd like to see us cover in future shows.
Does that mean we'll finally be able to show off some cool DPA and SRM features?
Yeah, we should have time. [Clears throat] NetPath. [Leon clears throat]
We will, honey badger, we will. [Zapping]
Before we dive into the implications of having a database on RAID 5, or on RAID 10, or whatever, I think it's important to get our terms defined. So what is RAID 5 versus RAID 10 versus, you know. What are we talking about there? How does it work?
Okay, well, RAID is basically a way to take a bunch of disks and either put them together for increased performance or increased capacity. Years ago, when this was built for smaller disks, when disks were very expensive, you bought a lot of smaller disks, and you could put them together in a RAID group and get as much capacity as very, very expensive disks. In fact, I've actually heard people refer to it years ago as redundant array of inexpensive disks, as opposed to independent. So you can get increased performance, and you can create fault tolerance, and you can create larger disk groups. But there's always a tradeoff between them. RAID 0 is striping. What RAID 0 says is, "I've got two disks, and I write a little bit to this one, then I write it to this one, and then I go back and forth, back and forth." So, the information gets split across those channels. Then we've got RAID 1. RAID 1 is just a mirror set. So literally, everything that gets written to one disk immediately gets written to that second disk. So that creates your fault tolerance. Now on RAID 0, no fault tolerance. Lose either of the disks; all of your data's gone. Then RAID 5 was an extension on those with a little tweak to it and that's striping with parity. What that really means is, we write it to all the disks, save one, and then on that last one, write parity information that basically goes back and tracks that. Then for the next write, we do the same thing. Keep a parity block, so on and so forth. That's kind of your standard RAIDs, when you talk about 0, 1, and 5. Then we jump into what we were talking about. 1+0, or a lot of times it's referred to as 10, is nested RAID. That's actually a RAID inside of a RAID. So when you talk nested RAID, you're taking a RAID 1, and you're taking a RAID 0, and you're putting RAID 1s inside of a RAID 0. So you get RAID 1+0 or typically just called RAID 10. Now what this gets you is, you've got fault tolerance because you've got mirrored sets. 
But you also have some performance because you're writing to separate disks instead of one disk, filling it up, moving to the next, so on and so forth. You can split the writes out. So this is one of the reasons that was used a lot. In fact, at a previous job, I had an Orion database on a RAID 5. My friend Tony used DPA to look at it and told me, "This is why your performance is horrendous." He gave me a laundry list of things and right at the top was "you're on the wrong types of disks." I said, "How do I fix that?" He got me in touch with the storage guys and we just talked to the storage people. I said, "I need to migrate this data over to whatever he said." That's when I tried to really learn about what this was all about.
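To make the parity idea concrete, here is a minimal Python sketch of one stripe on a hypothetical four-disk RAID 5 set. It assumes plain XOR parity over equal-sized blocks; the block contents and the helper name xor_blocks are invented for illustration, and real controllers work on much larger blocks and spread parity across the disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks (made-up contents) plus one parity block.
data_blocks = [b"ORIO", b"N-DB", b"DATA"]
parity = xor_blocks(data_blocks)

# Simulate losing the second disk and rebuilding its block from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

XOR-ing the surviving data blocks with the parity block gives back the missing block. That is also why a rebuild has to read every surviving disk in the set.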
I think one of the key things to understand, too, with RAID, is the fault tolerance aspect, as you get larger and larger drives, specifically in a RAID 5 environment. I mean today, we have 8-terabyte drives. If you're using hard drives and you lose an 8-terabyte drive in a large RAID 5 set, the rebuild time is days. It can be days on end. Not only do you have a performance hit because of the rebuild, but you have a risk to your data, a risk to your environment, because there's no available disk if something happens. If you lose something, you're in trouble.
Right, if you lose a second disk during the rebuild...
You lose the array. So here's the question. So you talk about the redundant array of disks. Is RAID an acceptable backup mechanism, recovery mechanism? [All laughing]
I would say...
The answer is no.
So don't hesitate.
No, no, no.
The answer is no.
It wasn't a trick question.
No, what it is, is it eliminates one more failure domain, which is something that I heard for years, didn't understand what people were talking about. Now all of a sudden, after working with servers for so many years, I understand the way they're talking about it. If I take every single point where there could be a failure, and say, "I've got different ways that I can mitigate that failure," then all of a sudden, I mitigated that failure domain. RAID arrays, regardless of the type, are one way to do that.
To this day, I still see people say, "Hey, I'm using RAID for my disks, so I don't have to worry about backups."
No, no, no.
Right, and I'm taking snapshots, and those are good backups also. Right, no.
On that RAID 5 set.
These are bad things.
Okay, so I think we've done a really good job of defining the different kinds of RAID. We've left storage to the side. But we'll get back to that in a minute. We're starting to talk about what each kind of RAID is good for. But I want to spell that out. If you had a problem in front of you and you were saying, "Gosh, I don't know what I'm going to do. Ah, RAID 5, that's the solution. That's going to fix it." What is the problem we just solved for RAID 5? How do we know that if we've implemented the right kind of RAID 5? I'll say specifically for databases. When is RAID 5 the right answer for databases?
Oh, that's easy. When you don't have to write.
RAID 5, the only issue with a RAID 5—I'm sorry, the overhead is the parity bit, right? So if I have a whole bunch of disks. What would you say? 15, I have 15 disks. One of those has to write that extra bit for parity. The other 14 get all the data. So if you think about it, if it's just the write penalty, if I just need to read data from those disks, there is no penalty.
Now I'm reading from a whole bunch of disks. I have a whole bunch of controllers, spindles, whatever you're using. But I can read many times and very quickly from all of those disks.
And maximize the amount of space that you have.
And maximize space, yes. So there are certain database workloads that are going to run just fine on RAID 5 storage. For a lot of people, they use RAID 5 because it maximizes storage, it's just the way they configure their environments. 90% of your database workloads are going to be just fine, even in a mixed-use situation, where you're going to be reading and writing quite a bit. Is there a penalty? Absolutely. But is it acceptable performance? That's really what you have to decide. If you have a database that is much more read intensive than write, RAID 5 may be just fine. It'd be perfectly acceptable. Now these days I see, though, whenever somebody talks about, "Hey, let's build a database server. What should my storage be?" It's OBR 10, One Big RAID 10. I see that in a lot of the forums, "OBR 10." It took me a while to figure out what OBR 10 was. I thought it was a new array. But "One Big RAID 10." These days, disk is now cheap and it's really easy to configure a RAID 10, at least for disk. That's what people are using more as the default for a lot of their database workloads. RAID 10 is going to be just fine as well, for read and write. So I don't think, these days, we're in an era where you're really diving in and configuring the storage like that in order to get the performance. Mostly in the hybrid environment, you don't have a lot of these options. In some environments, you can flip back and forth. Say you're using vSAN. Say you're using Storage Spaces up in an Azure VM. You have options where you say, "Oh you know what? The nature of my workload changed today. Let me change the configuration of the disks on that server, or let me change the policy." Now with my vSAN, I can have a policy tied to reads, and writes, and whichever one. Then just flip over and say, "Now I've hit this threshold, let me move the storage." So you have so many more options than with the types of decisions that we used to have to make all the time. 
Here's a server and you get only so many bays of disks in this box. What are you going to do with them? Those days are gone. So you don't have to really worry about it as much. But yes, there are going to be times when you're going to look at it and say, "RAID 5 for this particular workload is going to be just fine. RAID 10 for just about everything else."
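The "maximize space" tradeoff can be put in numbers with a short Python sketch. It assumes identical disks and the textbook layouts (one disk's worth of parity for RAID 5, every disk mirrored for RAID 10); the function name usable_capacity_tb and the 16 x 1 TB example are invented for illustration.

```python
def usable_capacity_tb(level, disks, disk_tb):
    """Usable capacity of a RAID set built from identical disks (textbook layouts)."""
    if level == "RAID 5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * disk_tb   # one disk's worth of space holds parity
    if level == "RAID 10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, at least four")
        return (disks // 2) * disk_tb  # every disk has a mirror partner
    raise ValueError(f"unsupported level: {level}")

# Hypothetical shelf of sixteen 1 TB disks.
raid5_tb = usable_capacity_tb("RAID 5", 16, 1.0)    # 15.0
raid10_tb = usable_capacity_tb("RAID 10", 16, 1.0)  # 8.0
```

On the same sixteen disks, RAID 5 leaves almost twice the usable space of RAID 10. That is the capacity side of the decision; the write penalty is the performance side.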
Got it, and I think that you just indicated why we, at SolarWinds, have for so long in our documentation, and everything said that the SolarWinds database— and we'll talk about what I mean when I say that in a minute. But the SolarWinds database should be on RAID 10--that we absolutely do not want it to be on RAID 5.
Do we write a lot of data to that?
Occasion— all the time.
Just a little bit.
Yeah, it is constant. It's huge.
A little bit.
SolarWinds monitoring, especially NPM--no, the main modules, are so incredibly write intensive that RAID 5 is really a challenge for a lot of customers that choose to go down that path and then realize later on that they have to make an adjustment.
That was me.
Yeah, guilty also. But now we've explained why. So hopefully you're able to go back to your storage or database folks and be able to say that SolarWinds is very write intensive. You've used the products, you know how write intensive it is. It's small writes, that's the other thing. It's not just writing large blob files or whatever. It's constant writing on thousands of elements on a minutes’ basis, if not seconds’ basis.
But it's concurrency. It's not just one thing trying to do a small write at a time, because that would probably be fine. It's the concurrency aspect. It's many things trying to write and communicate with the database. The concurrency aspect is really what comes into play with the storage.
Yeah, and don't forget, we're also pulling information out of this database, because that's where all the configuration is for how to speak to these devices. Whether it's with NPM or with SAM or whatever. But we pull that information out of the database. Then we split that off and we say, we've got these 15 jobs to poll these metrics. Get them back, sanitize them, then inject all of them into the database.
You have 30 users who want to see the webpages update, and that comes from the same database.
Okay, so I think we've identified why we're so eager here at SolarWinds to see our customers on RAID 10. I just want to sort of wrap up a little bit by talking about the storage considerations. We've mentioned a few things. We've mentioned spindles. So there's spindles. There's also flash.
There's also—I think it's important to point out there's NAS. Not that we would suggest it, but I'm saying that when you're talking about storage, you want to talk about NAS, and SAN, and the one that people always overlook, you still have a choice of local. You can still do local and that's something that when I've got small- or medium-sized NPM installation, and I've got physical hardware, go with a physical disk. Go with a straight disk, direct connected to the server. You can get a big enough three terabyte, whatever it is, drive. I don't think that there's any problem with that.
No. No, your performance, your latency's going to be low, your IOPS are going to be high, and that's what you're looking for. That's what we need for one of these transaction-processing systems.
So we don't require RAID. We're not saying that if you have a database and you have storage, you have to have it on RAID. It's just if you do, you need to understand what kind of things are going on.
It provides a level of fault tolerance that a single disk doesn't.
That's the only other thing that's really big about going to a single disk versus some type of even small RAID group. Now that you can do that on local direct-attached storage, why not?
Okay, very good. Last point. When we say the SolarWinds database, throwing that around. We've been throwing that around. But there isn't one SolarWinds database.
No, there's not.
I'll throw out the first one, which is the actual Orion database. The one that NPM, and SAM, they attach to. But we have a few other ones there. You want to throw some out there?
We've got Web Help Desk, which will talk to MySQL or Microsoft SQL Server. Let's see. We've got VMAN, which is on? [Mumbles]
Postgres. Yep, and LEM's got one on the backside. I honestly don't know what that is. There's a slew of them all over. A lot of the ones that are non-Microsoft or non-MySQL are the ones that are kind of running on our environments for appliances.
I'd be remiss if we didn't mention that DPA...
MySQL. So we can have three different types of repositories. MySQL, Microsoft SQL Server, or an Oracle back-end.
Right. So when we say the SolarWinds database, right now when we're talking about performance, we're primarily talking about the NPM, SAM, the main data repository. But there are a lot of databases that you're dealing with. Each of them still has these considerations that we want to be thinking about, that we want to make sure that the storage is matched up to the kind of data that we're talking about. So I wanted to mention that. [Zapping] With the theory out of the way, I think the next place, the best place to go, is to actually show what this looks like with real data. So we've set up a couple of comparative environments, the same data set in each of the environments. Then we're going to see how they're performing. So let's start off with SRM. James?
Yeah, so, Kevin set up a really good demo environment, a good setup for us. So we can kind of see. Then we created a custom report that we'll include in the show notes. But as we see here, we have a custom report, and we have a few LUNs here. Created one on a flash device. Then we have two more LUNs, one created in a RAID 5 and then another one in a RAID 10. So we can see some comparisons there. Obviously, with the flash, that's a custom setup because most flash vendors today, they set up their own special set, RAID set on there. But it's flash, so we have a lot of performance to work with there. As we see here, we've got, like I said, a custom report, but it shows IOPS and latency over a period of time that we were putting a heavy load on the system. If you look here, these top two lines, we're showing read and write IOPS. The reason why I want to show read and write IOPS is, like Tom was saying earlier, there's a difference. If I'm writing to disk, it's going to be different than when I'm reading from disk. Same with SSDs for flash. As we can see here, just in this period of time, and it was a pretty heavy load that you put on the system.
Yeah, this was actually, for people that are really worried about how we set it up. I built three Orion servers, and gave it a 12 gig Orion database, and had those three servers run 10 of the applications. So 10 parts of the suite. They were tuned to one-half of the traditional polling. No, I'm sorry, one-third of the traditional polling. So if the traditional polling was 30 minutes, I told it to do it every 10 minutes. So, super aggressive, lots of reads, because we wanted to really show that if you have something that's either a very aggressive environment, where you want to have these statistics as much as possible. Or you have a very, very large environment and you only do so many in parallel, we wanted to kind of mimic that out as much as possible.
It's worth mentioning that we're going to have the design for this in the show notes. So you can check the show notes for schematics and diagrams, and some of the explanation of how we built this.
It's also worth mentioning: that's why he's in charge of the Orion demo.
Yes. [All laughing] Right so when you go to demo.solarwinds.com, that's his baby.
Well, me and my team.
Yes, oh sorry, the team.
I think the key here is what we kind of see is if you look on the IOPs—the flash device, no problems. I mean it handled it very well. If you look at these bottom lines down here, the very, very bottom, we're looking at a RAID 5 going on there. Then right above it, is the RAID 10. So, basically what Kevin did is, he brought a RAID 5 set to its knees. I mean, it's just not going to perform.
I've used that database. That's been my life right there.
That's the one you built. [Leon laughs]
That's the first one I ‘architected’ too, because I didn't know any better. I was like, "I need tons of storage, so we'll do this."
We were young; we needed the work. It's okay.
Then, as we can see here, the RAID 10 performed better. I mean if you compare it to the RAID 5, if this was a RAID 5, RAID 10 graph, it would be a ton better. But then when you put flash in there and I think this ties into the conversations around using flash for your database. It's just the magnitude is so much more right there. This is an easy report out of SRM that literally Kevin and I took five minutes, if that, to create.
We're going to post it to THWACK and we'll have a link to that in the show notes, also.
Absolutely. So you can do this across volumes. You can do it across arrays. It's to really look, again, to the data of what's going on. Next to it, we have latency. Now latency didn't work out so well because one of the devices wasn't showing.
One of the devices doesn't actually report latency at the LUN level. So we could've gone higher instead of at the storage array level. But that's not exactly the same statistic.
Exactly, but if we look here, our flash device, even when we do have a spike, we have a spike of 12.5 milliseconds of latency.
On read and six on write. I think that's livable for most environments.
Well SQL, the best practices with Microsoft as published is under 10 milliseconds, happy, happy. 10 to 20, meh. Then 20 plus, expect performance problems.
Now real world.
Real world. [All laughing]
No. So for years, especially in the virtual environment, I would tell my customers, because a lot of times they would complain if they got to that 15 to 20 milliseconds. They'd go over and tell the storage team, "Hey, something's wrong." Unfortunately, with the idea of virtualization, and shared storage, and all that, the reality is there was more like 30 to 40 milliseconds is what I would see on average with a lot of customers. So that became kind of my baseline. I would say if you do get to 30, you should just walk over and say, "Hey, how you doing?" [All laughing] "Everything going well today? I see we're about 30. Seems to be fairly normal. Does it seem high? Because sometimes they say 20." You should just have a nice, polite conversation. If you see a spike of 70 to 80 milliseconds is roughly the old floppy drive, so that's when you walk over and say, "So you see..."
Five and one quarter?
Yeah, right. But when you see 60, 70, and more, now go have a conversation. You might get the answer, come back, hey, there's shared storage, something else spiked, caused a problem for you, I understand. But don't expect and think you're going to get those 15 to 20. Those were written years ago for different standards. Don't expect you're going to get 15 to 20 milliseconds of latency if you're hybrid. Don't expect any of that anymore. Things have changed quite a bit. You don't really have control of the storage as much. When you go flash, yes you should expect some pretty low latency numbers. But you can still have bottlenecks, especially in virtualization at the kernel layer. You can have other bottlenecks that still exist. It's always a nice conversation starter, though. But yes, the reality these days what I see in shared environments is roughly about 30 milliseconds. That is all the layers in between you and your data. The storage itself actually returns back the data pretty quick. It's just traversing the network and everything else involved before you actually get it back.
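The two sets of latency numbers in this exchange can be written down as a small Python sketch. The first function encodes the published guidance as quoted (under 10 ms, 10 to 20, 20 plus); the second encodes the shared-storage rule of thumb described here (around 30 ms as normal, 60 and up worth a conversation). The function names, the labels, and the exact boundary handling are assumptions for illustration, not anything taken from a SolarWinds product.

```python
def published_guidance(latency_ms):
    """Bands as quoted in the discussion: <10 ms, 10 to 20 ms, more than 20 ms."""
    if latency_ms < 10:
        return "happy"
    if latency_ms <= 20:
        return "meh"
    return "expect performance problems"

def shared_storage_rule_of_thumb(latency_ms):
    """Rule of thumb for virtualized or shared storage, as described above."""
    if latency_ms < 30:
        return "normal"
    if latency_ms < 60:
        return "polite check-in with the storage team"
    return "go have a conversation"

# The flash LUN's spikes from the report: 12.5 ms on read, 6 ms on write.
read_spike = published_guidance(12.5)
write_spike = published_guidance(6)
```

By the published bands the 12.5 ms read spike is "meh" and the 6 ms write spike is "happy"; by the shared-storage rule of thumb both are well inside normal.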
In this case, what we're looking at, we really can see, especially on the graph on the left, that RAID 5 really is having a problem. So I just want to clarify, I want to ask it out loud. Is there anything that we can do short of going to RAID 10 or flash? Is there anything else that any of the viewers can do to make it better? Before they pull the trigger and go to this other thing to fix that condition?
Well, I think one of the first things—this is obvious— is what is all on that RAID 5 set? Is this the only thing on the RAID 5 set? Because a lot of times, for a lot of customers, they get in a lot of trouble when they have a good RAID 5 set, it's performing how they want to perform, plenty of space. Somebody goes, "Hey, let's put this on there." "Hey, we need to stand something else up." That's when you run into a noisy neighbor situation. We see that a lot with SRM customers, that look at it and go, "Ah, I've got noisy neighbors." They've kicked off this report or somebody's doing something crazy, and everything else is crashing. So to me, the first thing is, what is all sitting on that RAID 5 set?
If you want to talk architecture, if it is a RAID 5 and let's say your organization is kind of bound to that, you have to stick with it. You can put more spindles behind it if we're still talking spinning disks. If we're talking hybrid with spinning and flash, then maybe give a portion of your flash array to that group. Certain vendors support it, other ones don't. But you can kind of tweak it up a little bit. But there will be a threshold that you hit where the write penalty just is all you have left.
Can't get around it.
Got it. [Zapping] So we've talked about write penalties and just the penalties involved a couple of times. I know you set up some numbers to explain like what do those penalties really look like? Again, because we want the folks who are watching to be able to go back to the storage teams and the database teams, and explain the impacts that we're seeing because the SolarWinds database is so write intensive, and other applications are too. So, talk us through some of these numbers that you've put together.
Okay, so the write penalty is basically the logic of how many actual operations take place for one single input-output operation, or one I/O. When you write to a standard disk, you just write to that disk. They call that a penalty of one. When you write to a RAID 5, you've got to read the existing block and the existing parity. Then you have to XOR that data together, technical thing. Then you have to write both the block and the parity back. So you're talking two reads and two writes. There's four operations for one write. Okay, but there's some benefits to RAID 5. When you go to RAID 10, because it's the mirroring, you literally write once, and do the copy. So that's two total. The striping is taken care of on its own behind the scenes. So you have a cost of two for RAID 10. You have a cost of four for RAID 5. And you have a cost of one for a single disk. Now what this means is if I have 10,000 write IOPS heading out to a RAID 5, those 10,000 turn into 40,000 actual transactions that have to happen across the disks. If I do the same thing against a RAID 10, it goes from 10,000 to 20,000. So literally, RAID 10 is twice as efficient as RAID 5 on write performance. That's just everything under the covers. That number is used a lot. Like, if you have to calculate the absolute max throughput, use these as multipliers to work out which ones can come through at what speeds.
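The multipliers above can be turned into a quick calculation. The Python sketch below uses the penalties quoted here (1 for a single disk, 2 for RAID 10, 4 for RAID 5) and the commonly used sizing formula that divides raw back-end IOPS by a read/write-weighted penalty. The function names, the 3,000 raw IOPS figure, and the 70 percent write mix are assumptions for illustration, not measurements from the demo.

```python
# Back-end operations per front-end write, as quoted in the discussion.
WRITE_PENALTY = {"single disk": 1, "RAID 10": 2, "RAID 5": 4}

def backend_ios(write_iops, layout):
    """Disk operations generated by a given number of front-end writes."""
    return write_iops * WRITE_PENALTY[layout]

def max_frontend_iops(raw_iops, write_fraction, layout):
    """Front-end IOPS a set can sustain, given raw disk IOPS and the write mix."""
    penalty = WRITE_PENALTY[layout]
    return raw_iops / ((1 - write_fraction) + write_fraction * penalty)

# The example from the discussion: 10,000 writes per second.
raid5_backend = backend_ios(10_000, "RAID 5")    # 40000
raid10_backend = backend_ios(10_000, "RAID 10")  # 20000

# Hypothetical set with 3,000 raw IOPS and a 70% write workload.
raid5_ceiling = max_frontend_iops(3_000, 0.7, "RAID 5")
raid10_ceiling = max_frontend_iops(3_000, 0.7, "RAID 10")
```

With the hypothetical 70 percent write mix, the same spindles deliver roughly 970 front-end IOPS as RAID 5 and roughly 1,760 as RAID 10. The gap widens as the workload gets more write heavy.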
Very good. [Zapping] So with you standing there at the controls, we know that the next thing that we're going to talk about is obviously network. No, database.
Obviously the greatest thing ever, data.
Oh, not bacon?
Okay, the second greatest thing ever: databases. So anyway, we're going to take a look at the same dataset--but from the perspective of the database, and what a jewel like DPA has to show us.
Exactly, because as a database administrator, you want to know all the possible bottlenecks. Of course, there's really only a handful of buckets. Storage, disk, network, CPU, you can even say locking and blocking. So when it comes to the storage aspect of things, you had asked earlier, how do you know if the storage subsystem is keeping up with the workload, is really the question you have to ask. How do I know if storage is a bottleneck? Well, what we did was, we took Kevin's wonderful demo that he had set up. What's the most intensive part of that?
For the database side?
Yeah, for the database side.
That's running the Configuration Wizard.
So that's what we did. What we did was we took the RAID 5, the RAID 10, the flash servers, and we ran the configuration setup. So we have some numbers to walk through. What I'm going to do is drill into— I'll start with the RAID 5. We'll drill into—ooh, look at that. Big bar, bad. Actually, the configuration, we're about 10 o'clock on August 15th. So we'll drill into there and now that I've drilled into this timeframe, I'm going to bounce over to this aspect of DPA that we put in about a year ago called Storage I/O. A year or two ago now. So now, I get some metrics at the file level for this entire server. I could filter by files; I can look at read, write, and current. What we're going to do is, I'm going to come over and just show what's I/O wait by file. We'll come down here, so this is for the RAID 5. You can see where we're at and there's a spike here of about 52 seconds on that data file.
Which is an eternity in database time.
So this is the total wait time across all executions. I want to give you an idea of what the workload is. This was not waiting actually 52,000 milliseconds. It's across all executions for all the activity that the configuration has done. When you're in DPA, things are shown to you at an aggregate, almost always. So this is an aggregate amount of information for that data file. If you look, the timeframe was at 10:18. This was the one-minute level of detail. So out of 60 seconds in the minute, 52 seconds of this was waiting for I/O.
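The aggregation described here, many short waits rolled up into one number per minute, can be illustrated with a small Python sketch. This is not DPA's API or its data model; the sample tuples and the function name wait_seconds_per_minute are made up, chosen only so that the 10:18 bucket adds up to the 52 seconds mentioned above.

```python
from collections import defaultdict

def wait_seconds_per_minute(samples):
    """Sum individual I/O waits (in ms) into total seconds per one-minute bucket."""
    totals = defaultdict(float)
    for minute, wait_ms in samples:
        totals[minute] += wait_ms / 1000.0
    return dict(totals)

# Hypothetical (minute, wait in milliseconds) pairs for one data file.
samples = [
    ("10:17", 900),
    ("10:18", 20_000), ("10:18", 17_000), ("10:18", 15_000),
    ("10:19", 1_500),
]

totals = wait_seconds_per_minute(samples)
busiest = max(totals, key=totals.get)          # "10:18"
share_of_minute = totals[busiest] / 60.0       # fraction of the minute spent waiting
```

No single wait here lasted 52 seconds. The 10:18 bucket reaches 52 seconds only because every wait recorded in that minute is summed, which is the sense in which the chart shows an aggregate.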
But hey, it's just busy. It's doing work. You paid for the disk to do the work. Do your work. All right, so it's working hard. Now, I'm going to show you the same time, the same graph, but let's look at what it is for RAID 10. How are those write penalties looking?
So not only is it doing a lot less work, a lot less activity is happening for this RAID 10 workload in that particular hour. This is fabulous. That's a huge difference alone. You saw— in case you forgot, let me show you. RAID 5. RAID 10. RAID 5. RAID 10. RAID 5. Right? Big difference between those two. You want to look at the flash?
Absolutely, we want to look at the flash.
Let's look at the flash. Remember now, this spike was eight seconds. So here's the flash. Less lines. It has a spike over here for the configuration of six seconds out of the minute. Remember, that's across all executions. The top chart, also, a lot less activity.
It's pretty much empty because it finished all the work.
It's pretty much empty for the hour. How long did the configuration take on the flash? Four minutes and 22 seconds.
Four minutes and 22 seconds.
That's why I'm here, to correct you. Four minutes and 22 seconds.
That and the other ones took a half hour?
33 minutes and 48 minutes respectively.
So I just want to, because as SolarWinds folks, you understand that the install, or upgrade, or whatever, you don't do it every day. But that is an experience that is sort of visceral for you--you know what it means. So the difference between a 43 minute running the Configuration Wizard, which, depending on the systems— like yeah, on a big system, 43 minutes versus a four minute and 32 second. Now you're not going to make a business justification to your company by saying, "Oh, I need flash storage because my Configuration Wizard is going to go much faster." That's not a business justification, but it is a way of exemplifying the speed that all of the database operations take. Whether that's your minute-by-minute read/writes on your network and server, and application data or anything else that's going on. So I just want to point out, we're not trying to tell you that the install speed is the selling point. It's just an example point of how all the rest of your data is going to be reacting also.
Go buy flash. [All laughing]
Go buy flash.
Did we mention the array we were using?
No, this is a Pure Storage M20 array that was given to us on a loan by the nice people at Pure Storage, which SRM now fully supports.
Thank you, Pure.
Yep, you can build it and it's really super easy to build. I just went in, I said, "Build me one of these. Do this, do that to the other. Attach it to this machine." Done. The beauty also of that particular array, and I think a lot of the flash arrays, is that it also has native de-dupe. So although this was a 12-gig database, if I do the compressions right in my head, it's actually being stored in around three gigs of actual space. So that SAN array that we got from them that was 10 tera, we can probably hold 60, 70, 80 without a problem.
I think this highlights something, Leon, you touched on it, is you're not going to justify buying flash just out of this. But it all adds up. It's all cumulative over time. "Hey, this takes an hour. Hey, that takes an hour."
Well, I mean, if we go back to the main screen for SolarWinds and you look at just the instances with the highest wait time, you can see the difference between— Look at five days of wait, total the wait. How do you get five days of wait in one day?
That's a long day.
That's a very long day.
Look, but for the RAID 10, it's only three. Then for the storage, zero. I mean, that graph alone can just tell you that the total amount of wait and work that you're doing for these workloads we've set up on these demo machines, it's just not in the right ratio that you would expect. Three, two, and zero. It's very telling. So, to me? Yes, this is a business justification. When you walk in and say, "All workload." I'm not just talking configuration.
Right, that's important.
Configuration was one specific example. I can sit here, and look at this graph, and I can show you that it's your entire workload. I have another report that'll do this. So I went into DPA and I created the report. I created three reports and then I grouped them together. So I have a report that runs against all of this. Now look, this is for the entire day from August 15th to August 16th. This tells me what the waits were. So you can see, WRITELOG, write completion. We're really for RAID 5. If you don't know what WRITELOG is, in DPA, you click. You get a description, you get an idea of what it is and who's likely to resolve it. But look, RAID 5, you can see in terms of hours here. Three, all the way up to a spike of about 10 hours for that RAID 5. You can see here it's only about six hours’ worth of a spike for the RAID 10. Then you can see for flash, we're measuring in minutes. It's just about two hours’ worth of wait. WRITELOG was the top wait for all of these for that particular day. It was just waiting on WRITELOG, waiting for the buffer to write and acknowledge back. So that alone tells me right away, flash is the guy. That's what you want to buy for my entire... no, but this is for the entire workload. This isn't just one particular thing, the configuration. This is all the activity on that server. I'm telling you flash, and I can tell you RAID 10, if we wanted that, would get us a little bit, but flash is going to get us the most.
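One way to reason about why a write-heavy wait like WRITELOG hurts more on RAID 5 than on RAID 10 is the commonly cited RAID write penalty rule of thumb: each logical write costs roughly four back-end I/Os on RAID 5 and two on RAID 10. The Python sketch below applies that rule of thumb. The raw IOPS figure and the read/write mix are made-up illustration values, not measurements from the demo machines, and the model ignores controller cache, so treat it as an estimate only.

```python
# Commonly cited back-end I/Os per logical write (rule of thumb, not a vendor spec).
WRITE_PENALTY = {"RAID 5": 4, "RAID 10": 2}


def effective_iops(raw_iops: float, write_fraction: float, penalty: int) -> float:
    """Estimate front-end IOPS a disk group can serve for a given read/write mix.

    raw_iops:       combined IOPS of the underlying disks (hypothetical input)
    write_fraction: share of the workload that is writes, from 0.0 to 1.0
    penalty:        back-end I/Os generated per logical write
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0.0 and 1.0")
    read_fraction = 1.0 - write_fraction
    return raw_iops / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    raw = 1000.0  # hypothetical raw IOPS for the disk group
    for mix in (0.0, 0.5, 1.0):
        for level, penalty in WRITE_PENALTY.items():
            iops = effective_iops(raw, mix, penalty)
            print(f"{level:7s} writes={mix:.0%}: about {iops:.0f} usable IOPS")
```

With no writes the two layouts come out the same, and the gap widens as the write share grows, which is consistent with a write-bound wait type dominating the RAID 5 instance.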
Now, I'm going to actually turn it around for a second. Kevin, you've got the greatest depth with this. Say I'm on RAID 5. From a SolarWinds NPM standpoint, what am I going to be experiencing on RAID 5? What's that going to look like? Is it going to look like my system is crashing? Probably not. What is it going to be? Because I think for a lot of our viewers, it's going okay, and it's okay because they don't have anything to compare it to. What's the experience of SolarWinds in each of these environments?
So in RAID 5, there is no read penalty. So normally working with the web browser and going through the web console, you don't see much of a problem. If you have a problem there, there's normally something different underlying. It's not the disk subsystem. Where you have a problem is you don't hit your completions. So if your polling completion drops below 90%, and you don't have like an entire office down, that means that you're making the request, the information's getting back, and you're trying to write back to the database but you're not. Because you're not, there's a timeout, and it just throws that data away and says, "I'll just wait for the next polling cycle." That is a huge...
So you're going to have empty graphs, or breaks in your graphs. You're going to see a line, and then a break, and then a line, and then a break, and whatever for certain periods. It's going to be different. You're trying to line it up and say, "Well what happened during that window?" But it's going to be different on every element because your completion, it may be somewhere else in the completion cycle. So that's part of it.
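To make the "polling completion" and "breaks in your graphs" ideas concrete, here is a small Python sketch. It assumes you already have, for one monitored element, the number of polls that should have happened and the timestamps of the samples that actually reached the database. The 90% threshold comes from the discussion above; the function names, the 1.5x tolerance, and the sample data are hypothetical and are not part of any SolarWinds API.

```python
from datetime import datetime, timedelta


def polling_completion(expected_polls: int, stored_polls: int) -> float:
    """Percentage of expected polls whose results were actually stored."""
    if expected_polls <= 0:
        raise ValueError("expected_polls must be positive")
    return 100.0 * stored_polls / expected_polls


def find_gaps(timestamps, interval: timedelta, tolerance: float = 1.5):
    """Return (start, end) pairs where consecutive samples are too far apart.

    A gap is reported when the spacing exceeds interval * tolerance, which
    means at least one polling cycle produced no stored sample.
    """
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > interval * tolerance:
            gaps.append((earlier, later))
    return gaps


if __name__ == "__main__":
    start = datetime(2016, 8, 15, 0, 0)
    interval = timedelta(minutes=9)
    # Hypothetical samples: the polls at +27 and +36 minutes never got written.
    samples = [start + timedelta(minutes=m) for m in (0, 9, 18, 45, 54)]

    pct = polling_completion(expected_polls=7, stored_polls=len(samples))
    print(f"Polling completion: {pct:.1f}%")
    if pct < 90.0:
        print("Below 90%: check whether writes to the database are timing out")
    for gap_start, gap_end in find_gaps(samples, interval):
        print(f"Gap in graph from {gap_start:%H:%M} to {gap_end:%H:%M}")
```

Running the same check per element shows why the gaps do not line up: each element misses different cycles depending on where it falls in the polling schedule.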
Yeah. That's really the big, big one. Because when you really get down to it, the SolarWinds Orion database, and I use that name because that's what it's called if you just do next, next, finish, install, that database is just a repository of statistics. I mean, there's a lot of other good information there, but it's that repository. When you end up with those gaps, especially when someone says, "I need an exec report for how much all my WAN interfaces were using, because I need to purchase next year." Well, if you don't know you're missing these polls, you're going to be in real trouble, because you're not able to get that.
Right, and I'll say that also cascades to missed alerts because you weren't collecting that data. The data didn't hit the database, therefore it wasn't queried, therefore it wasn't, you know, and so on, and so forth. So that's the experience. It's not going to be that the system crashes. I know that I, when I was trying to make the case in one of my former companies, they said, "Well, if it runs okay, then we'll leave it there." It did run okay. It ran okay for eight months on almost 10,000 devices. It ran okay, but there were these weird quirks, these gaps in data. I had to do lots of restarts on services and things like that. Ultimately, we were able to use this data, both SRM and also DPA, to say, "Look at how hard it's working. This is why we're seeing these weird instances, these weird situations."
Especially if you're dealing with 10,000 endpoints or managed endpoints, or managed entities, excuse me. If you're polling all of them, and you've got it set to go every 30 minutes, or every nine minutes for detail statistics. Well, you've got to remember, even just for one interface, detail statistics, there's like four metrics for that. I mean you've got in, out, that's two there. Then you've got packet drops, you've got bandwidth, you've got those kinds. Those are a lot of metrics you pull back even for one node. And to realize that, you take that, and you know. Maybe your database isn't a problem on RAID 5 right now. Maybe you're not having any problems. But then you add the second office. You add a third office. You add a fourth office. Then all of a sudden you're getting these gaps in the data and you don't know where it's coming from. When you troll through the logs, you'll see SQL timeout, SQL timeout, SQL timeout, SQL timeout. Because it's trying to write, and the database can't write that back to the disk, and get it back fast enough.
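To get a feel for how the write volume grows as offices are added, here is a rough Python estimate. It takes the figures mentioned above (about 10,000 managed entities, roughly four metrics per interface, detailed statistics every nine minutes) and assumes, purely for illustration, that every entity behaves like one interface with four metrics. The function names and the per-office entity count are hypothetical, and real installs will have a different mix of elements and intervals.

```python
def points_per_second(entities: int, metrics_per_entity: int, interval_minutes: float) -> float:
    """Average data points the database must absorb each second."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return entities * metrics_per_entity / (interval_minutes * 60)


def points_per_day(entities: int, metrics_per_entity: int, interval_minutes: float) -> float:
    """Total data points written over 24 hours at a fixed polling interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    polls_per_day = 24 * 60 / interval_minutes
    return entities * metrics_per_entity * polls_per_day


if __name__ == "__main__":
    metrics = 4            # roughly four metrics per interface, per the discussion
    interval = 9           # detailed statistics interval in minutes
    per_office = 10_000    # hypothetical entity count added with each office

    for offices in range(1, 5):
        entities = offices * per_office
        rate = points_per_second(entities, metrics, interval)
        daily = points_per_day(entities, metrics, interval)
        print(f"{offices} office(s): {rate:6.1f} points/s, {daily:,.0f} points/day")
```

The averages look small, but polls tend to arrive in bursts, so a disk layout that just keeps up today can start timing out on writes once another office doubles the burst size.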
Right, so just to wrap up with this. I think that we've done a really good job of covering what you're going to see, what you're going to experience as the harried SolarWinds administrator, the person who has a sense that something's not right. This is one of the reasons why we offer the 30-day demo. It's an unlimited demo, meaning you can throw that element, that module, at as many devices as you want to. In this case, you probably won't have to, because you're talking about just the SolarWinds database. But you can download the demo of DPA, and the demo of SRM, and point them at those devices, either on a separate VM that you set up, or however you're going to do it. You can start to bring these metrics back to your teams. When we started off, I said that sometimes the application team, and the storage team, and the database team aren't so friendly. Because we're all busy. We all have other jobs to do, and other people that we owe answers to. But here you can do a little bit of self-serve by pulling down those modules, pointing them at the SolarWinds database, and being able to go back to those teams and say, "Look, here's the information. I'm not telling you what to do. I'm telling you here's the data. Now let's talk about how to fix it."
Sometimes that is completely out of your purview. If you've got really siloed environments, you can say, "This is my application. It's running poorly. Here are the stats I'm seeing." You give that to the storage team. They say, "We'll take that. We'll ingest it. This is what you need. We will change it." That's actually the easiest way, most times: give the people exactly what they need in exactly the frame they need it. Give them the power to do their job better.
I'm just going to stare at the flash numbers. [Zapping]
I always thought that database stability had more to do with the CPU and RAM you threw at it than it had to do with storage. I never realized the impact that storage could have.
Most people don't. At least until they're neck deep in a poorly performing database.
Yeah but it's nice to know that it's possible to find these problems given the right set of tools.
I don't know what you all are talking about because this is just another day in the average life of your bacon-eating DBA.
Okay but I've been at companies where the interaction, the closeness between the application, and the storage, and the database teams isn't really so friendly. So in those situations, knowing how to make your case can be valuable for the application owner.
I also think this is good information for the storage engineers, because clearly not all databases are equal in terms of read/write volume.
Yeah, no they're not, my friend. But I think we've covered enough for this one lab. So for SolarWinds Lab, I'm Kevin Sparenberg.
I'm Thomas LaRock.
I'm Leon Adato.
I'm still just totally stoked to be on this set.
Yeah. [All laughing]
[Upbeat happy music]