Performance Tuning for the Accidental Cloud DBA
As Hybrid IT enters the mainstream, there will be a rise in the number of "accidental cloud administrators." Data professionals will need to develop a base understanding of networks in order to transition their skills as the cloud DBA of the future. Traditional performance-tuning methods never consider the network as a tuning opportunity, but in the cloud, your network bandwidth will be a priority. Head Geeks Thomas LaRock and Leon Adato will discuss necessary skills for cloud DBAs during the next three to five years.
[Funky upbeat music]
Hi, I'm Thomas LaRock, and joining me for this THWACKcamp session is my fellow Head Geek, Leon Adato.
Now, we've heard the term "accidental DBA" a lot. In fact, it was part of the basis for our 2014 THWACKcamp session that we called The Accidental Admin. And I've been an accidental DBA once or twice in my career, but I'm curious, how is this session different than what we did back then? Are we going to be helping DBAs understand that networks aren't to be feared?
Well, that would be a good thing, yes. That is the goal for today, and we're talking about performance tuning for the accidental cloud DBA. And what we're going to help people understand is— well, first of all, as you mentioned, you are the DBA accidental once or twice.
That's usually the first thing you have to understand is, am I the DBA? We're going to help you figure that out. You might be the DBA, and then there's the realization, "Oh, I am the DBA, now what am I going to do?" And what am I going to do about this data that we have in the cloud? [Leon sighs]
Right. Now, on our last IT trend survey, we discovered that database was the fastest-growing segment of people's cloud adoption or migrations— you know, AWS, RDS, Azure SQL are all a thing. They're really happening, but I'm pretty sure that you're going to love—well, you always love your data, but you are going to love your data, no matter where it may be.
True that, so let's get started.
Are you going to use a deck or anything?
I am going to use a deck, and it's not going to hurt much, maybe just a little, but some of these slides are possibly made of bacon.
Of course they are.
Okay, so first, let's talk a little bit— we talked about why we're going to be here today, but let's make sure that we know why they're here today, right? So it's possible that maybe you've heard about this cloud thing. I mean, it is in the name of the title, right?
Rumors, yeah, cloud, and maybe you might be worried that maybe you are the DBA, and if things or you, might be on fire at some point. Or maybe you just know that users are kind of scary when they're mad, that all of these things— if any of these things apply to you, then you're in the right place today. So, we mentioned here about hybrid IT being a real thing, right? So, this is from our trends report, and we can see here some of the data behind people that have been migrating to the cloud over the past 12 months. So hybrid IT, it's real, it's here, even if you don't know it yet. If it hasn't hit you yet, it will at some point. So some of the key areas that we have from the trends report. 74% of businesses already migrate applications to the cloud, followed by storage at 50%, database by 35. So, it's growing, and databases were some of the last pieces of technology to be moved to the cloud, and they are rapidly becoming, I want to say "cloud-ified." They're the things that you see the most growth area, because they were the things not being done yet.
Right. And it's worth mentioning that, first of all, we're going to have a link to the trends report in the show notes?
But also, this was a pretty large survey. This was over 800 businesses were represented from across all sectors: private, public, and all of that. And these are the numbers that have been growing over the last five years that we've been doing this. So again, databases people were saving until last, because they want to make sure that they had their cloud feet under them, if that makes any sense, and—but it's time, it's time to start getting these major application components out there.
And the other point we have here is why are people migrating? And what they have cited is the potential for ROI, right? So, the potential for return on investment, or even the potential to lower their total cost of ownership, their TCO. These are the reasons that people— so, it's financial, more than anything else, is the biggest driver for cloud adoption. We're seeing it. It's here. So if it hasn't hit you yet, it will. But now, let's shift a little bit.
Let's talk a little bit about signs why you might be a—or, how you might be an accidental DBA. I have a picture of a tweet here I put out a while back. Basically, I tried to summarize it, because in the glory of 140 characters. But there's really two ways you end up becoming a DBA, it's almost like a gateway. One is you reset the password. That means, first of all, you have the ability to do it. And if you did it correctly, you know how to do it, and the other is you recover a database. Also, you have the ability to do that, you have the provisions to do it, and you have the know-how to put somebody back where they need to be. So you've given them access and given them— put the data where it needs to be, guess what? You are now the DBA.
It's slightly more rigorous than passing within five feet of a database, but not by much.
Not by much, so this is one of the major reasons, and I got some good feedback on this. What ended up happening was people said, "Well, there are the other things, too, that ended up," and I'm like, "Well, what are some of the other things?" And maybe run the scripts again to database, or you're the only one doing backups or restores. Or how about this? Maybe you report to the DBA manager.
That would be a clue.
If you're reporting to the DBA manager, you might be a DBA. You can create logins, reset passwords, we mentioned, how about if you think about sets a lot, no?
Sets, sets of data, I have this table, I have this, I need to join them, how is this going to happen? Maybe you're the expert in query tuning. That actually seems to be kind of a gateway for a lot of people, especially if they're doing SQL development work, they're writing store procedures. Hey, that they run this fast, how do I make this run a little bit faster? You start learning some of the nuances of tuning, making an instance run better, guess what? You might be the DBA. You might not be doing the backups and restores yet, but you're not far off from it.
It also puts a lot of our folks who are developers, you know, think about it, you might actually be the DBA, even though you think you're just, you know, the programmer.
Yeah, maybe if you're on call? [Laughs]
DBA for a day.
Yes, I've had that happen, where somebody said, "I need to take a day off, and if anyone needs a database restored or a password reset, I'll just have them call you, okay?" So I'm on call now? Oh, maybe I'm the DBA after all. You get to decide what to name tables and columns, that's a popular one.
You're the decider.
You're the decider; you're the one that gets to help design a database. And if the people are coming to you for that, they probably assume you have a level of experience in such things, such that you are a database administrator. How about if you can do schema changes, so you have permissions to do things?
Right, again, back to that, if you have access to do it, then you probably are...
Yeah, or maybe you just get blamed all the time, which is a familiar— actually, when you look at some of these things, there's a lot of crossover between being a database administrator, a system administrator, a network administrator, especially the blame part.
You're getting blamed all the time for things that may or may not have been your responsibility, but you are the one that actually has to take action on something. So these were some of the responses I got back as to all the different ways that you may find yourself operating in the role of an accidental DBA. So, that is a traditional earthed DBA, we'll say. What about cloud DBA? So who's the most likely person to become a cloud DBA? It could be all of those things we talked about. It could be any one of those things: they're running scripts against a database, it happens to be in the cloud. Chances are cloud DBAs, the most likely candidate for cloud DBA, would be an existing DBA. We already hear somebody. Usually how it happens is they say, "I have a database somewhere, it happens to be over there, and you're the DBA, you should go take care of it," right?
So a cloud DBA, there's no difference as to how— well, there's no real difference in how it ends up in your lap, because it's just a collection of data, a repository of data that sits somewhere. And somebody thinks you're the one responsible for it, whether it's earth or cloud, doesn't really matter.
No, actual DBAs, people who actually have it on their business card and they know that's part of their title. And they're not surprised about that, are probably used to this, because a lot of times, you become the accidental Oracle DBA. Well, you know, you're the Microsoft SQL DBA, but it's just data, right?
So you're in charge of the Oracle database now, or you're in charge of the MySQL, or the Postgre. You know, it's just data. How different could it possibly be? So the cloud is just another permutation of that, but I think that they key thing that I want everyone to take away is that if you are already an accidental DBA— it may not have glommed onto that concept yet— that cloud is coming, like we said. And you might find yourself also the accidental cloud DBA along with, so this could be a double whammy for some of you.
Right, because it doesn't really matter if it's Microsoft SQL Server or Oracle or the artist formerly known as Sybase--none of that matters to them as to where it sits. If it's in your data center or if it's in Azure or AWS, that doesn't matter to them. It's just a database, you're the DBA, you should be taking care of that. So the most likely candidate would be existing DBAs, and as usual, they're likely to feel overwhelmed at first, but I always try to get people to understand, as long as you focus on the core mission, then it doesn't really matter what the platform is or where it is. Earth, cloud, Sybase, Oracle, Microsoft SQL Server, you just have to focus on that core mission. And the core mission is, all right, you get paid for performance, but you keep your job with recovery. If you can't recover data to the point in time that the business needs it, you cannot keep your job as the DBA, and so that's true, whether it's earth or cloud. It doesn't matter where it sits. If you can't recover, then update resume, leave town. That's the way it works. So, I'd like to tell people how we really—sorry?
Well, I was just thinking that that's not an excuse to back out of the DBA. Like, if you are an accidental DBA, don't think you're going to get out of it by saying, "Oh, I can't recover, so, you know, don't give me—" Don't try to fail your way out of the job, no one's going to be happy with that. You'll still end up updating your resume and having to leave town.
Okay, so just didn't want to give people an out that way.
No, [laughs] but the other thing about when you become an accidental DBA or a cloud DBA, I tell people, when it comes to troubleshooting is, to think in buckets. Is it this or that? So, back in the day when we were going into a virtualized environment, right? It became—well, is the issue with the database? Is it maybe the issue in the virtual layers somewhere? And how will that get the information so I would know what that is? So what's the first thing I would look at? Is it virtual or not? Oh, if it's not a virtual problem, then I'll go look somewhere else, and then you can dive into specific resource bottlenecks. Is it a memory issue? Is it CPU? Is it disk? Is it network? It's never the network, right?
It's never the network, no.
Never the network, it's always the network. We're going to find out what it is. So thinking in buckets is a great way for you to find a way to get to that root cause. Is it this or that? And as long as you can have a clear picture of what it is, one way or the other, you can just keep thinking in terms of the biggest buckets and eventually you narrow it down. Like a decision tree, right?
Right, or in problem solving, for those folks who've studied, you know, problem solving in a more rigorous environment, you know problem domains. You can work from the back of the box forward, you can work from the network area to the—I mean, there's different ways to divvy it up, but what you're saying is, don't try to think about the whole problems at the same time. Figure out these domains in your own head, and then start checking each of those domains.
Right. So, when you go cloud, though, I still would tell people to think in buckets, but what I try to remind them is that there's a gap. There's a bit of a gap in their knowledge and— because traditionally, for earth DBAs, there's a bit of a gap, and the gap is, quite simply, it's the network. So as a traditional earth DBA, first, my background is Microsoft SQL Server. There's no information inherently inside of SQL Server regarding network. There might be a few little things, but not when it comes to things like CPU and memory. When it comes to network, SQL Server doesn't really have a lot of information about; because it just doesn't really care about network...
It's a sad state of affairs.
SQL Server doesn't care about disk, either. It assumes that there's a cold cache, and you're going to have to pull every page off of disk for whenever it tries to build an execution plan. It doesn't care about network, it doesn't say, "Well, this query would run faster because I'm on a one gig network or, versus 10." It knows nothing about it, and DBAs, traditionally, don't know a lot about the networks, either. We just know if things are slow, and sometimes we look at it, and we just assume that the network and the pipe is enough for our traffic. And most of the time it is, it's like plumbing, you just don't ever think...
And that it's consistent, and it never changes, and there's no other impacts to it, and it's all, like, right.
So over time, what I learned as an earth DBA was, keep calm and blame the network. Now, I was fortunate, I worked with a great network team, right, but I worked with a great network team, but they were good about it. If I said— I actually found what was a slight network issue once, and I felt bad about it, and they said, "Don't worry, you should just tell people that the issue's with the network." I couldn't believe that. I asked, "Aren't you worried about me blaming you?" And they go, "No, no, because that lets us remind management that they didn't buy the right equipment the first time around. And that let's management go to the business and say, 'Remember, we asked for a couple million dollars for all that network stuff? That's why your queries are slow.'" And then the business sat on it for a while, figuring out if it was worth to upgrade or not, and it was the IT circle of life. [Leon laughs]
If nothing else, it bought me time, "It could be a network issue, let me go look at it."
Right, well, okay, and I want to– we love joking about, you know, blame the network, blame the DBAs, but the reality is, don't baselessly blame the network. In that example that you just gave, you had information. You said, "Look here, this is how I believe or know that the network is an issue." There's a very big difference between throwing your hands up and saying, "Ahh, it must be the network," and saying, "Well, we have this data; we have this information that is leading me to believe that it's the network." That is where you make a friend out of the network team, because you're helping them narrow it down. This is a story that we've told many times.
You need to have friends on the network team.
Always, always. They are your friends. So the gap, of course, being network knowledge, this becomes even more important when you go cloud. And now I tell people, "Keep calm and monitor the network."
There we go.
Right, cloud demands that you become proficient in troubleshooting basic network issues. [Leon gasps] It does.
Say that again. [Both laugh] I love that. So you, database administrators...
Or cloud DBAs.
Yes, a cloud DBA needs to add network troubleshooting, basic troubleshooting into your toolbox.
I'm just soaking it, just like, breathe it in, breathe it in.
I'm not doing it again. [Leon laughs] So, the thing is, and there's going to be a reason why. And well, let's talk about it. There's—becomes so important, and I've already mentioned it, is because your primary focus is recovery.
If you can't recover, you can't keep your job. So if you already know this to be true and you've gone cloud, now you have to realize, just as we did with virtualization, then there became extra layers of abstraction between the end-user and data. Now you've gone cloud, now there are further layers of abstraction between the end-user and the data. And the big part of that is a network infrastructure. It turns out that the internet is a complicated place. And that network infrastructure, you really don't control, which means, you need to be able to have the insight into what's happening. Otherwise, you're never going to know where the problems really are. More importantly, when it comes to recovery— so remember, recovery, we've talked about before; it's measured in acronyms, right? I have service—this is your favorite, by the way— Service Level Agreement, RPO and RTO, so Recovery Point Objective, Recovery Time Objective, this is your favorite one--
There you go.
MTTI: Mean Time to Innocence, right? So...
It's not me!
You put all these together; these are all the important acronyms when it comes to recovering data. What's my service level? What did we agree upon to the business I could do? What was the amount of data loss we could do? So, an RPO says, "15 minutes, I can restore your data to how exactly it was 15 minutes ago." Well, that sounds great, but how long will that take you? That's recovery time objective. "I can get that done in five minutes." If it takes you longer than 15 minutes to go 15 minutes back, you have a different problem.
But you need to know that, and that could be the case, actually, if you go cloud. You could say, "Whoa, I have an RPO of 15 minutes, I have the backups, but right now the network is such that I'll never move that volume of data within that 15-minute window."
And now becomes MTTI, "Wait, was that your fault?" "No, it wasn't my fault; it was the ISP that we're using. It was a network issue."
Well, and there's an important point, and we've made this on SolarWinds Lab before, and when we're giving talks at SWUGs and places, is that there's three things that you need to be successful as an IT pro. It doesn't matter, DBA, network engineer, whatever. The three things that you need are accountability, you're accountable for it, you get that when you walk in the door. Again, you pass within five feet of an object, and you are now accountable for it. Responsibility also comes along for the ride. You are now responsible for that thing. On the way to the bathroom you passed a server, it's now yours, okay. But then the third thing to be successful, and this is the one we have to fight for, is authority. I can't be successful if I don't have the authority to affect change on that thing. Well, internet, you—I don't have the authority to change Time Warner Cable's network, I don't have, whatever. But there's a fourth thing that you can get that's almost as good as authority, and that's visibility. If you have visibility into something— and I'm not just talking about, you know, I can see my bits traveling, whatever, you really can see the objects that are along the way, then it's almost as good as authority. Because, like I said before, if the database team comes to me and says, "We have this information, it specifically says this happened at this time on this device," that's a lot of insight that saves me time and tells me where to go look. Nobody's calling me saying, "The internet is down." No, it's not, you know, that doesn't help me. So visibility is, what you're speaking to is, how do we provide that, how do we find that, because that visibility will speak to whether or not I can accomplish the RTO or the RPO.
Absolutely. That's a great point about visibility. I haven't thought about it quite that way, I love those four acronyms, you should write something like that...
I might, someday I'll write something about...
Maybe we could record that and remember he said he'd do that. So, let's talk a little bit more about why networks matter so much for cloud DBAs, and I want to talk about a couple of things that come up all the time when it comes to RPO and RTO. I see a lot of people talk about, and I have here in the lower left-hand corner, which is one second of data loss and one second of downtime. And you hear people, 99.9, maybe five nines, so know they've lost, and then I see, mentions that— in order to have that, it's going to cost millions of dollars. Well, what I realized, one day, finally it hit me. I go, actually, there's a huge part that's missing, and that is, you have to ask the question: how much data?
How much—if somebody just tells you, "Hey, one second, RPO and RTO, millions of dollars, no questions asked." No, no. The first question is, how much data are we talking? Because it might not be a lot of data, and there might be ways around it with caching or eventual consistence. Maybe you're not using the right tools in order to meet this type of RPO and RTO. So when you go cloud, now you have to think, how much data are we really talking about? Because in cloud, the way you pay is by compute and by storage, and egress. You pay for the network traffic coming out. It doesn't cost you anything to get in, but it costs you to get out. It's like New Jersey. So data volume—if you only have a gigabyte of data, that's different than if I have to move a terabyte within 15 minutes. This changes the nature of the conversation. So as a cloud DBA, or cloud architect, now you have to start thinking a little bit different, because you don't control that network. It's not going to be as easy for you to meet your SLAs and you kind of have to revisit all of them.
And it's not even guaranteed to be a consistent flow, and this is something that non-DBAs, just regular assistant administrators, started discovering about a year and a half ago with Office 365 coming out. And everyone made this bit push to, "Oh, we're going to run Exchange on Office 365," and suddenly they realized it was like, a month and a half to get. Why? Because we can only push so much data up to the cloud through our internet connection, and it was just a pure math exercise. But people just thought, "Well, you know, I have a two terabyte PST file," but you have 52 executives who all have those. So that was the discovery that people made, that getting it into the cloud was costly, in terms of time and effort and things like that, and this is the same thing. If you have to back up this data, how much data are you moving around?
30 years’ worth of email history. That shouldn't take any time at all to push up to the cloud.
It goes very fast in the movies.
Yes. [Laughs] So, let's talk about how the internet is a complicated place. So, I have a picture here of NetFlow, NetPath, and NetPath is showing me— basically, what I've outlined here is a server in Austin, and the server in the cloud. And this is one—this is a series of a path, this will give me an indication of how my traffic might move between these two physical locations.
Right, and notice that—I mean, it's an interesting slide with a lot of data, but there's different speeds between each of those nodes, and those speeds change on a second-by-second basis. It is not a consistent number, so.
What else, though, is down here, I get this little measurement that tells me, hey, by the way, it went yellow, so the latency was not good— as good—as it could be, as the green. And that's the point where you look at that and say, "Well, at that moment in time, what if there's an incident, no fault of my own. What if there's an incident? What if somebody just deletes a table from production?" And they say, "Hey, can you go get that table back for me? It's kind of important." Well, what if you're in that yellow zone, and now, all of a sudden, I can't get my data back for you, or I can't get it to the cloud for you and everybody else inside of the SLA. It has nothing to do with you as a cloud DBA, as a person, necessarily, right?
This was things that are outside of your control, or if somebody's just running a query right now. You're looking at that, and they're going, "Huh, well, it seems to me right now, there might be a slight issue between me and where your data is. And it's inside the ISP or I don't..."
Unless you can fix, unless you can fix the internet.
Unless you can fix the internet.
Now, in terms of problem solving, just like sideways, stepping sideways a little bit, is that's the whole point of NetPath, is to identify "oh, it's this node in so-and-so's network," and now you know how to call. That's the visibility piece. But it doesn't help you in the middle of an emergency, if you're the accidental DBA, trying to do that restore.
That's right. So more about, you know, why this matters so much for cloud DB is again, it's recovery. There's HA/DR, these are all— this was current as of a month or two ago. The cloud changes fast, so there could be new HA/DR deployment architectures, but here's a partial list that I know exists just for Azure. Amazon would have a similar-looking list. But if you look at each and every one of these— so, which one's right for you? You could go through, figure out RPO, RTO, the amount of data, go through all of that, and you can figure out, well, this is the right technology, then guess what? All of those depend on network.
Wait, okay, can you just name a technology for me that doesn't actually depend upon the network?
That is not true. [Both laugh] That's absolutely not true.
Okay, I can't think of any right now. You're right. The network rules all. But with this HA/DR stuff, this is how you set— this is your business continuity planning centers around all of this. Without your data, your business doesn't exist, so all these technologies depend on network. If you're the DBA, you'd better be thinking about how to do basic troubleshooting for the network, so let's talk about that a little bit. So, troubleshooting, of course, usually starts by somebody saying, "The system is slow," something is slow. So we're going to talk—I've mentioned thinking in buckets, right, so there's only a finite number of resource bottlenecks. So you've got memory, CPU, Disk, Network. You also have this thing, locking and blocking, inside of a relational database system. I have listed here, highlighted network as becoming the most important bucket now, if you are a cloud DBA. This becomes the first bucket you will want to check. All right. If you are a cloud DBA, the first bucket you're going to want to check is just say, "Is there a network problem or not?" Because if there's a network problem, you're going to want to go figure that out, or you're going to want to talk to more about that before you try to tune a query or do anything else. So that becomes the most important bucket. In order to get to these other resource bottlenecks, we have what's called the waits and queues tuning method. This is what DPA, Database Performance Analyzer— this is what DPA excels at, is waits and queues, we simply can go in, look at what the query is waiting for, we know what resource it's waiting for, then you know how to fix it, right? So the waits and queues tuning method, it basically lets you do things like this. So, I have a picture here of a server that's been virtualized. At the top, I have the metrics from inside the engine. Underneath that is that the guest, then the host, then the storage layer. I've highlighted a couple things, because you could see, the engine itself is saying, "Yeah, I'm having problems with single block disk read time and commit time." Well, it sounds like that could be an issue with disk, maybe. I don't know, but what I find more interesting is that, for this disk device down here, I can see that I'm having issues with the latency at the host-disk-device level, with the read and write. So if I'm having an issue at my storage layer, how much time do I want to spend fixing a query inside the engine? The answer is none, because I want to go and address the storage issue before I spend all day trying to fix this query.
Now why doesn't that happen? Because a lot of times, if you are the real DBA, you can affect this thing up here. Like, that's the lever I have to pull, so it's more comfortable for me to keep on messing with this, than the uncomfortable part, which is having to go to talk to other humans. And go down to the storage team and things like that, so there. But you have to sort of overcome that.
Well, it's not—it's also, a lot of times, they don't have visibility. Here, we have a correlated view in the picture that kind of helps us understand what's really happening here. Matter of fact, let's dive into DPA, and I want to show you a little bit more.
So, I've got DPA up here on the screen here. I've picked out the server. It's 2016 Microsoft SQL Server that we're looking at. I've got the range of dates, also for this information. What you see is all these bars, and each color actually coordinates or goes back to one particular query. On this particular day of June 6th, and you can see information about the number of executions, 232. On average, everyone's seven seconds. I see that's an update statement. I get a lot of base information about this particular statement for that day. So if I drill—I'm going to drill, let's see, going to June 6th, we'll look at— I'll pick this. The biggest bar means the most amount of the weight for that particular timeframe, so we'll just grab 2:00 p.m. So now, I come to this information and I can see this particular query was the one that had the largest amount of weight. I see that for this particular query, what it was was memory CPU, which means it was either running or runnable. Okay, so that might be a good thing, or it might not necessarily be a bad thing, but I get information about what's really happening so I have an understanding of what to do. This is what waits and queues is, right. So every time a request is sent to SQL Server, what we usually call a query, your statement is one of the three states. It's either running, runnable, or suspended. If it's suspended, then there's always a weight associated with it to let you know. So sometimes, you might get suspended because you have to go and read a page, a data page off a disk and into memory. Or you might be locked, or you could be waiting on network, so all these things—writelog, I'm just simply trying to get information written to the log because they do write ahead logging. This is how the database works, and you get information about all of these. So the idea behind waits and queues is simply to have an understanding of what your queries are waiting for, because if you know what they're waiting for, then you know how to go and solve that. DPA gives what I call the Rosetta Stone of information when it comes to this stuff. So, I see ASYNC_NETWORK_IO. I'm simply waiting for the client to acknowledge that it received the data that it was sent to. SQL Server's like, here's your data, and it's waiting for a signal.
For an Ack-Back.
For an Ack-Back. But this weight, this concept, exists in every relational database platform. The names change but the crimes don't. So DPA is this Rosetta Stone, because if I was an Oracle DBA, I wouldn't know what ASYNC_NETWORK_IO means, but I know what it means that if I'm waiting for the client application to fetch the data. And I can do the translation, right, and I say, "Oh, I know what that is, and I know how to solve it." So, waits and queues is this great way of aggregating that type of information. Now, I mentioned those four: memory, CPU, disk, and network, but I also talked about locking and blocking, because in the other screen, we saw that there were some locks for some of the waits. What does that mean? Well, that's how databases work. Sometimes you have a resource like a table, and you might want to update what you think is a small part, but maybe it's so large that it might be more efficient for the engine to simply lock the table for a moment, and to do the update, and then release the lock on the table.
Well, if that's the case, that sounds good, sounds efficient for that query, but other queries might be waiting. So what you see is, we have a nice way of aggregating the information. Let me just grab June 1st here, and I can see that these queries were being blocked. We'll drill into 4:00 p.m., why not? Now we can get information about what was happening at this period of time, because a lot of times, what you find is that the reason a particular user's having performance issues in a relational database, is simply through the fact that there's locking and blocking that occurs naturally as a way that the engine works. Because relational means that there are relations. In order to keep these relations, you have transactional properties that maintain consistency of your data. Data quality is fairly important for most people.
Right, yeah, if you have really messy data, it's a fast database, but...
Yeah, yeah, but that's why you have no SQL solutions. This is really fast and eventually everything will be consistent, but I don't—yeah, that look on your face. [Both laugh] So for this particular query, I can see the wait time is a root blocker— 10 seconds across all the executions here. 10 seconds, this particular query was taking in order to execute, but other queries are trying to do things, and that caused 39 seconds of wait across those child SPIDs, right? So that's important information to have, so when you're trying to figure out— even as a cloud DBA, I need to know where the problem is. Was it network or not? Well, actually, it wasn't the network at all; somebody was just locked, blocked, whatever is happening. I can see the resource wait was somewhere else, but if you come here and you see none of that, then you have to be thinking: could it be the network? You have to be thinking of that, and to me, that's the first bucket you think of. But if you don't have access to that type of information, at least if you have something like DPA, you can see where it might be inside of the engine or in the virtual layers itself.
Right, I think it goes back to what you were saying, is that in an earthed environment, your first thought as a DBA is locking and blocking, maybe questioning your storage environment. But if the storage was okay yesterday, it's probably okay today.
It's probably—but now, moving to a cloud environment, the network moves up in priority of what could— where could the pressure be, where could my problems be? You still need to have a solution like this that gives you this kind of insight, because things happen. But as a cloud-based network DBA, accidental or not, you need to think about network coming up the chain of priority of, "Mm, check that next, because it could very likely be."
The other thing I like about this thinking in buckets, especially if you look at resource waits versus, say, a locking/blocking. This lets me understand how I can scale the application. Because if I'm suffering from a lot of locking and blocking, moving to cloud or buying more hardware, none of that is going to solve the issue, which is essentially a transactional issue, right? Locking and blocking is the nature of the workload. Now, if I see them having a memory CPU wait, things of that nature, well, maybe I can scale some hardware. Actually, maybe the cloud's a better solution, because I can scale easily, more easily there with flash storage or whatever.
More multi-threading, right.
But if my root cause is ultimately like, yeah, you're blocked for like, an hour every night because Joe in accounting is running a report; it doesn't matter where your stuff is. So, this tells me, also, I can be kind of a cloud architect. I can help you understand a little bit about how you can actually use the resources, either earth or cloud, in order to maximize them. This type of tool, or this type of method of waits and queues was very valuable for me as a DBA, because it didn't just help me solve what was happening now. It helped me architect what was going to be best for everybody, right? So, that type of tool, like DPA, really helps me understand what's happening. Because when the phone rings— when that little red phone, that iconic red phone that's, we all know what it is, but I won't say— when that phone rings, you better be able to answer it and be able to take action, be responsible, be responsive, have the visibility that you need. But now the real problem for the cloud DBA, of course: now, the first bucket is that network. And I've got a picture here where I simply try to call out, you see, I get information about what's happening inside of my utilization right now. I get details about what's happening. I can see SQL Server or how much data's going in and out. And I can also see that right here, in this pipe, the bulk of the traffic is world wide web. It wasn't DPA. What's this internet traffic doing? I want my data to go through there.
Right, okay, but, and here's where understanding what you're looking at and understanding the visibility that your tools may give you really pays off. So, I just want to speak to this slide for a minute, and we'll have some more visual aids in a second. But this, just looking at this, you see that whole pie chart and you think, "Oh my gosh, that's where all of my bandwidth is going." No, that is a picture of how the bandwidth you are using, how all the bandwidth that you are using, is being used.
So this could be a picture of half or a quarter of your total available bandwidth. And then you're fine, because it means that your SQL traffic, which is the— I think that the light blue there, is fine, it's got all the room it needs. If it needs more, it's got more, it's got room to go, but you know, if this is all of the bandwidth that you have, then you have to think about prioritizing a little bit.
Okay, so fair point. So what I've grabbed here is just a screenshot of the current traffic in the pipe, but that doesn't show me how big the pipe is.
All right, so what's in the pipe right now, then? And how big is it? As a DBA, there's no DMV, there's no system view for me to look at and say, "Hey, how big is my, what's the throughput on this NIC card?" I maybe can PowerShell my way and do some WMI in order to get those details from hardware out. So, as a cloud DBA, how am I going to get the information to know, not just what's in the pipe, but what is the pipe actually capable of?
Right, so let me show you a couple of things to give you some visibility. We're going to switch back to our demo. So here, we're looking at the actual NetFlow page. You've got the same graph, or very similar graph here, with that information. And again, this is the total amount of bandwidth that I'm using at this moment, but it's not the total amount of bandwidth I might have.
So the first thing, to answer your question, is how do I know how much— how big my pipe is and how much is getting used? Well, one place that a lot of DBAs will look is on the server itself, because that's what they have access to. And just to take you to that screen, this is a look at the— this just happens to be a virtualized instance; this is the Hyper-V network adaptor on this database server. And if I scroll down here on the left, I can see—and this is information you can get out of WMI, you can get it out of SNMP, you can get this information, that the interface bandwidth is, a 10 gigabyte per second pipe. I'm only using, on the server, currently, two—a little bit more than two—in and a little bit more than 28 megabytes per, not gigabytes, megabytes per second out. So that's what I'm talking about. But this is the server, and maybe the server I'm concerned about, it may be the server that I care about, but that's not the traffic that this screen is talking about. This is talking about all of the traffic going through a network interface.
So, remember, you've got your server, virtualized or not, and it's connected to a whole bunch of network nodes, switches, and routers, and things. And typically, the routers are routing lots of traffic that are moving through it. So you have to have an awareness of that architecture. Now, there's some ways to map it out. You can get a topology diagram that tells you that your server's connected to— this switch was connected to that server, but in this view, I actually already have a sense, because looking at this traffic, I can drill into just the SQL traffic, which I said was the light blue. Now, this doesn't look like an exciting screen, this is all Microsoft SQL--
Single server's using all of the traffic, this is beautiful.
It's using all of the traffic, except now look down. It's using 1.1 gig in, and 772 meg out. That's it, on a 10 gigabyte per second pipe. If I'm talking about just that one server, I'm not.
So how would I know—sorry to interrupt— but how would I know if that node device is at capacity? Could I get more data throughput?
So, here's where I can look and see, just by clicking this little down arrow, I can see here are all the interfaces that I'm collecting information from, and how much of the just SQL traffic is coming through? So here, of that 1.1 gig total— this interface here, this device— actually, I should say, this fast ethernet on this device is passing that much traffic. There's a ProCurve. I've got Steelhead, I've got a Juniper here, and so I can break down the—so, again, awareness of your architecture, which is important, I mean, you have to know your schema as a DBA, you've got to know your network architecture as an IT professional of all stripes. And so you have to know where it's— what plumbing it's going through. So if I wanted to know whether that interface was at capacity, that tells me that, you know, the fast ethernet, and I can look that up and in the same kind of screen and say, "What's the actual capacity rating of that?" Then I would know if I have actual network pressure. But, you're still not really asking the clear question, or there's other levers you have to pull.
Let me go back here. Let's say that this screen is true, it's your full amount of capacity, it is completely used up, and world wide web is there. You have an option—or you, the IT department, you, the DBA, have people to go to--your network staff. You can ask them to create a traffic shaper that will, you know--a QoS setting that will change the quality of service, that will change the way traffic is routed, and it will prioritize SQL traffic over the world wide web traffic, so that it gets as much of the pipe as it needs, prioritized over. Now typically, this is used for voice traffic, because voice traffic is extremely sensitive to delays and things like that. But if you're trying to move, if you're trying to back up that data from your primary server to your backup and the, or vice versa for recovery, and that timing matters, your RTO and RPO matters, you can ask for that traffic shaper to prioritize that data traffic so that you're not being choked out by other things.
So as a cloud DBA, chances are you don't have voice traffic going to your—as your server...
One would hope not.
But it's not just that thing. It's also—you have this concept that it might, your voice data and database traffic might be shared until it gets out.
So you might want to prioritize, or you might want to have different networks for how that data is handled, or for you to get out to the cloud.
And again, as a DBA, being aware that this exists, but I still think that we're not asking completely the right question, which is: “I'm being slowed down, but why?”
Okay, so that's where you get into, is it—so, we're going to understand that it is slow.
We're not arguing that point anymore. But is it slow because of the network or is it slow because of the server? The server isn't handing off the data, and this is where de-pack and inspection, quality of experience comes in. And here on this screen, you can see where we have a variety of things, and if we just look at the SQL traffic, down at the bottom there, I can see that the application response time is being slowed down, versus the network response time. So, you want to have the ability to look into that, as well. This is all, once again: responsibility, accountability, authority— can't always have that— visibility. If you have the visibility, then you can point to it.
So I think, Leon, what you just showed was really good for a DBA, an earth DBA transitioning to a cloud DBA, because you're going to have those questions. How much am I hitting my throughput? What else is in the pipe? How are we routing the traffic? How are we taking care of the traffic as it gets between my earth position to the cloud? These are all things you're going to want to know because you might find—you say, "Whoa, what happened here?" Maybe there's an issue, things got moved around and now you're routing some of your voice traffic, and now you're sharing a pipe for the next few hours. You're going to want to know that, as a DBA, before—as a cloud DBA, especially— before you start spending time trying to troubleshoot and fix the things you can fix. You're going to want that visibility. So, we have some tips and tricks for cloud DBAs. One, as we just talked about the network, part of it, using the Azure Express Route, so if you're not familiar with what Azure Express Route is, it's basically a dedicated pipe to the cloud. You get a dedicated connection; you can work through some of the ISPs that are offering it. So in certain cities, for example, they've just got a building that just goes right to the cloud. It has a bolt of lightning and everything. But yeah, if you haven't heard about the Azure Express Route, you should look into it, because that's an easy--they're making it easier for you to get dedicated connections to the cloud, so you don't have to go--so basically, the ISPs are knowing to route that cloud traffic through dedicated channels, instead of everybody that's online in your neighborhood trying to share the same pipe, right? The other thing I like to mention is having an alternative RPO and RTO. So if things go completely wrong for you, and you're out of the water and you all of a sudden realize that the network is now the issue and you can't meet those RPO/RTOs. Well, what's the alternative? What's your worst-case scenario? So you ask "what if," and this is when we talk about focusing on the data and not necessarily the system itself. What will it take for me to get the data to where it needs to be? Have an alternative?
Right, and this is a key point, because in the past, the alternative is, you know, can I spin up another system? We're in the cloud, it's elastic, you can bring up new containers like nothing, if you're really in that environment. That's not your pressure anymore; it's getting the data where it needs to go, that is the focus.
And I mentioned focusing on reducing the data volume between, because not only is it a billing thing, but it's just going to make things a little bit more robust for you, especially if you have to do an alt-RPO. The example I give is Delta, when they had their outage. The alternative was, just reboot everything right now, you know you're down for four hours and it will disrupt things for days. But that's a lot better than us trying to just struggle along and fix whatever is happening right now. So yeah, were they meeting their RPO and RTO? I don't know at the point, but it was easier for them to have the alternative, and that was a known for them. Think of the same thing. You go, well, if all of a sudden I was out of the water, what would I do? Well, is it okay if I— I have a local copy of a backup from, a full backup from last night; we can just go to there and go forward. And that might be okay, but you have to have those discussions, otherwise you'll never know. And then, of course, use the right tools. Know things like QoE and QoS, quality of experience, quality of service. DBAs need access to network monitoring data. You don't get it natively. You've got to have access to it. Otherwise, you're going to be kind of, you're going to be out of the water, so to speak. And think in buckets. Is it network or not? Is it this resource constraint, that one or not, as well. [Thomas sighs] [Leon laughs]
Just a little bit to cover.
Just a little bit to cover for you cloud DBAs. So, I want to thank you for joining me, Leon. This was fun, if you can believe that.
Oh, it was my pleasure, yeah...
And thank you to the viewers, both of you out there.
And both of you that came to hear about data and cloud databases and networks.
Yeah, this was fun. We should do it again sometime because there's probably a little bit more for us to cover, and hopefully we can get together after you've finally stopped learning to blame the network.
I will stop blaming the network when you stop blaming the database.
Okay, we are at an impasse.
For THWACKcamp, I'm Thomas LaRock.
And I'm Leon Adato. Thanks for joining us. [Upbeat music]