Using Monitoring to Reduce Your Hardware & Software Investments
IT organizations are being tasked with doing more with less. This is especially true with virtualized and cloud environments, where the assumption is that you can provision more resources in seconds at the click of a button. In the IT world, we know that while this may be true, you can’t just keep providing resources indefinitely, and they sure aren’t free. The relationship between performance, efficiency, and cost is becoming ever more evident.
See how you can get more out of your monitoring software than just up/down and load status. In this session, you may well experience a fundamental shift in how you see and use your monitoring software: using it to reduce your hardware spend and to optimize your current workloads.
Hi, I'm Jared Hensle, part of the PMM team for our systems management products here at SolarWinds. Welcome to this THWACKcamp session. I'm joined today by PM Chris Paap and Senior PMM Rob Mandeville. Gentlemen, thanks for joining me.
Thanks for having us.
Yeah, thanks for having us.
All three of us were system administrators in our previous lives. Have you ever found yourself tasked with a project, only to find out you don't have enough free resources to run it? Ever wonder where that IT refresh budget went?
What if you use your existing monitoring tools to provide more than just operational status?
Ever think about using it to optimize and reduce your company's hardware and software investments?
Well, guys, we're going to do just that, in a few repeatable steps.
Yeah, if you know what reports to run and what alerts and thresholds to set, you can do just that.
Okay, guys, let's get started. Before you begin any type of project, migration, or upgrade, you need to know what you're working with. You need to inventory everything. You can't come up with a good game plan without knowing what's out there.
Exactly. You don't know what you don't know. Doing a regular inventory scan will help identify what's in your environment. A great tool to do this with would be our Server and Application Monitor. Here, you can quickly do a setup of a discovery scan.
Chris, why don't I let you drive the demo?
Okay, so the first place we want to go is Main Settings and Administration. Under Getting Started with Orion is our Discovery Central, which we want to click on. Once you're in there, choose Discover My Network, then just click Start. It's the normal wizard we go through, typical of all Orion products.
Hey, Chris, I noticed there's a couple ways to do the scan—one by IP and one by Active Directory?
Why would you do one versus the other?
So, that's a good question. An IP-range scan is for when you want to catch everything in your environment—including things you may not know about, like development machines that aren't in Active Directory. If you keep a clean house in Active Directory and add everything there, it's a good place to start. But for most people, the first scan is a full subnet scan.
So like, for the shadow IT people?
Exactly, for those people. For that server that's critical to a business application but sitting under somebody's desk—so they scan the whole subnet. And for this example, that's what we'll do.
That makes sense.
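The IP-range approach Chris describes can be sketched in miniature. This is not how Orion's discovery engine works internally (real discovery also uses SNMP and WMI credentials); it's just a toy illustration of sweeping every address in a subnet and probing a few common ports. The port list and timeout are arbitrary choices.

```python
import ipaddress
import socket


def _port_open(host, port, timeout):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def discover_subnet(cidr, ports=(22, 80, 135, 443, 3389), timeout=0.2):
    """Sweep a subnet and report which hosts answer on common ports.

    A toy stand-in for an IP-range discovery scan: enumerate every
    address in the range and probe it. Returns {ip: [open ports]}.
    """
    found = {}
    for host in ipaddress.ip_network(cidr).hosts():
        open_ports = [p for p in ports if _port_open(str(host), p, timeout)]
        if open_ports:
            found[str(host)] = open_ports
    return found
```

Pointing this at a full subnet is the "catch everything" scan from the discussion, including that critical server under somebody's desk.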
So now that our scan is complete, we're going to go back to the All Settings page and go down to Product Specific Settings. In this case, we're going to go down to the SAM settings, so that from here, what we can do is scan each of the nodes that we identified and scan them for applications.
So, this list is the machines we just discovered in the prior step, when we scanned by IP address.
Exactly. So we would scan them here. And then we can choose—like, the templates that come up by default are the most popular. Or we have a list of numerous different templates that might apply to your environment, everything from Exchange, Active Directory, Apache, whatnot. In this case, we'll search for Citrix. Hit Next.
So another method of doing this is the agent deployment method, which you could tie to Active Directory—so that when a new machine is found, or via a login script, an agent gets deployed to the machine. Correct?
That's exactly right. You roll those agents out, and they're actually installed by the server, versus using the API to connect to it. The agent then polls the node it's on, finds what applications are on it, and reports that back to the main poller.
Okay, so it's an auto-discovery, automated.
Correct, correct. It's an agent installed on the server itself—a very small footprint, but that's the key element. Versus the traditional SolarWinds approach, which is agentless: you're just scanning that server rather than installing anything on it.
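To make the idea of application templates concrete, here's a hypothetical sketch. The template names and service names are invented, not SAM's real component definitions: a template is just a set of expected components, and a node "matches" an application when all of that template's components are present on it.

```python
# Hypothetical application templates: application name -> the set of
# service names that must all be running for the template to match.
APP_TEMPLATES = {
    "Active Directory": {"ntds", "dns", "kdc"},
    "Apache": {"httpd"},
    "Citrix": {"ctxsvc", "ica"},
}


def match_templates(running_services, templates=APP_TEMPLATES):
    """Return the applications whose required components are all present.

    running_services is whatever inventory the agent (or agentless scan)
    reported for the node.
    """
    running = set(running_services)
    return [name for name, needed in templates.items() if needed <= running]
```

Whether that service inventory comes from an installed agent or an agentless scan, the matching step is the same.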
Okay. So now, once we've got either agent or agentless method deployed, we've then inventoried our system. We're now able to determine, "I've got five Active Directory servers," "I've got machines I didn't even know existed." It really allows you to baseline and inventory your environment to see what's actually out there and figure out where to go next.
Right, and I think the key thing you just said there was "baseline." If you don't know what good looks like, it's important to get that inventory to find out. Then you can determine whether the changes you've made to your environment have been beneficial or detrimental—at least you have that before-and-after baseline.
Chris, now that we've discovered our items, applied application templates, now we have the ability to run reports on these, right?
Exactly, and true to SolarWinds nature, if you just go to All Reports, you'll bring up every type of out-of-the-box report we have. But specifically, since we were talking about applications, if you select Application Reports, you can run application availability, load utilization—whether CPU or memory—and so on down the line. Those are all out of the box, and of course you can always create custom reports based on the data you've polled.
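An application-availability report boils down to simple arithmetic over the polled status history. A minimal sketch, assuming each poll result is just up/down:

```python
def availability_pct(poll_results):
    """Percent of polling cycles in which the application was up.

    poll_results is a sequence of booleans, one per polling cycle --
    a stand-in for the status history an availability report runs over.
    """
    if not poll_results:
        return 0.0
    return 100.0 * sum(poll_results) / len(poll_results)
```

Run over a day's polls per application, this is exactly the kind of number an availability report surfaces, and the same shape of calculation applies to CPU or memory load reports.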
So now, we have a good inventory of our servers and applications and are able to determine duplicate roles—but more importantly, find the servers and applications that are completely unused.
So let's assume that we're in a virtualized environment. Where can we start to find unused resources?
Right—analysts estimate about 75% of enterprises use virtualization. So the next step is to monitor your virtual environment.
Now, will traditional server monitoring apply here?
Sort of. You really should monitor from a hypervisor point of view to truly understand how a VM's performance can affect one another.
Okay, Chris. Why don't you show us in the demo where we can start reclaiming those resources?
Absolutely. Where I want to take you in Virtualization Manager is the Virtualization Sprawl page, because this truly speaks to our point of reclaiming and saving resources—and not only reclaiming them, but applying them to the VMs that actually need them because they've been undersized.
So we've got a VM Sprawl screen right out of the box, correct?
Correct. That didn't take any extra configuration—it's there by default, out of the box. Everything you're monitoring will come in here. We broke it up into top-10 lists: top 10 VMs under-allocated by vCPUs, meaning you haven't allocated enough resources, and top 10 VMs over-allocated by vCPUs—the opposite, where you're wasting resources, which is what we're truly talking about today. We do the same thing for memory, snapshot usage, and orphaned VMDKs. When I talk to customers about orphaned VMDKs, they often don't realize they're even out there.
That's huge—deleting from inventory but not deleting from disk.
And, you know, to jump back: you don't know they're out there. It's not something that normally pops up unless you're running something like a PowerShell script against your environment. So we bring that up for you, and we give you the ability not only to identify what's there but to delete those files and actually reclaim the resources.
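The PowerShell scripts Chris alludes to usually reduce to a set difference: list every VMDK file on the datastore, list every disk referenced by a registered VM, and whatever's left over is orphaned. A minimal sketch of that logic (file names here are illustrative):

```python
def find_orphan_vmdks(datastore_files, attached_disks):
    """VMDK files on the datastore that no registered VM references.

    These are candidates for cleanup -- but, as the session stresses,
    confirm a backup exists before deleting anything.
    """
    vmdks = {f for f in datastore_files if f.endswith(".vmdk")}
    return sorted(vmdks - set(attached_disks))
```

A "delete from inventory" leaves the VMDK in the first set while removing it from the second, which is exactly how these files end up orphaned.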
Now, how long does the sprawl take to calculate? Is it instantaneous, or does it take a day or two?
That's a good question—it doesn't. We really have to monitor for at least a week before those resources start popping up. And it's not static: if something changes, the polling picks it up over time, and the Sprawl page updates as well. So Virtualization Sprawl is the perfect page to come to when you're looking at what you can reclaim in terms of right-sizing and resource reclamation. On the left side of the screen, we have memory and CPU that's been over-allocated—and, for that matter, under-allocated—so you can right-size. On the right side, we have VMs that have been idle for the last week. That's important because if nothing's been running and nobody's been logged in for that time, you can power the VM off right from the console and reclaim those resources. The same is true for VMs powered off for more than 30 days: those haven't been on, so you can actually reclaim that storage by deleting the whole VM.
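The Sprawl page's buckets can be thought of as simple rules applied to a week or more of polling data. This is a hypothetical simplification—the thresholds below are illustrative, not VMAN's actual logic—but it captures the categories Chris walks through, reclaim candidates first, then right-sizing:

```python
def classify_vm(avg_cpu_pct, days_idle, days_powered_off):
    """Bucket a VM the way the sprawl discussion does.

    avg_cpu_pct: average CPU utilization over the polling window.
    days_idle: consecutive days with no activity or logins.
    days_powered_off: consecutive days the VM has been off.
    All thresholds here are invented for illustration.
    """
    if days_powered_off >= 30:
        return "powered off 30+ days: delete VM, reclaim storage"
    if days_idle >= 7:
        return "idle for a week: consider powering off"
    if avg_cpu_pct < 20:
        return "over-allocated: reclaim vCPUs"
    if avg_cpu_pct > 80:
        return "under-allocated: add vCPUs"
    return "right-sized"
```

Because the inputs come from rolling polling data, a VM that gets a new workload naturally "pulls itself out" of a reclaim bucket on the next evaluation, just as described below.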
Chris, this looks like some really cool information. But how long would it take for me to get this view?
Great question. So, it's not immediate. We require at least seven days' worth of polling before this starts populating with usable data.
Just like anything else, the more data we have, the more accurate we can be with it.
So I see the idle VMs and then I see the powered off for 30 days. My assumption is if you turn something off, then after 30 days, it'll drop down to the section below?
Correct. Nothing's static here. So it's as time goes on, and if something changes, or if one of these VMs started spinning up and they put a new workload on it...
It would pull itself out.
It would pull itself out, exactly.
All right, yeah. I know I would recommend taking a backup of that VM before purging it, because I know that on day 31, I'd purge it, and day 32, they'd be like, hey!
Best practices apply, right? Don't go out and start deleting everything without confirming that you have a fail-safe. We actually point that out for orphaned VMDKs—we have it highlighted, because this is monitoring, and you know your own environment. If it's an orphaned VMDK and we haven't found any VM attached to it, go ahead and delete it, but ensure you have a backup first.
Now, I know that's huge. I hear about plenty of people doing a delete from inventory instead of a delete from disk, unaware that the VMDKs are still sitting there, because they're not right in front of their faces. I noticed you had snapshots on the bottom. Is that based on the snapshots we're taking from here, or anything that's using a snapshot?
Anything that's using a snapshot—it's snapshot disk usage. The reason it's important from a performance standpoint is that the larger a snapshot grows, the more adverse an effect it can have on your VM. But it also catches things like backups.
Oh, that'd be a good way to confirm.
Yeah, you can actually catch backups that have left snapshots behind—they haven't done their normal cleanup, and it's a very common occurrence. Or somebody comes in to do maintenance on a server, manually creates a snapshot, and you aren't aware of it. So before it grows too big, you can identify and remove it from your environment, reclaiming resources and preserving the performance of that machine.
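Catching leftover snapshots before they grow is again a thresholding exercise over polled data. A sketch, with invented field names and limits:

```python
def stale_snapshots(snapshots, max_age_days=3, max_size_gb=10):
    """Snapshots old or large enough to start hurting VM performance --
    e.g., ones a backup job forgot to clean up, or a manual pre-maintenance
    snapshot somebody left behind. Thresholds are illustrative.
    """
    return [
        s["name"]
        for s in snapshots
        if s["age_days"] > max_age_days or s["size_gb"] > max_size_gb
    ]
```

Anything this flags is worth a look: either the backup tool failed its cleanup, or someone forgot a manual snapshot.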
About a year ago, we did a TechValidate survey of customers using the VM Sprawl feature, and on average they reclaimed approximately 23% of their CPU, RAM, and disk back into their virtualization pools just by looking at it. People had spun up 64-GB-RAM, 32-core SQL servers they just didn't need—way over-provisioned. So it really is a great tool to pull things back into the fold.
Right. And it speaks to exactly that—it's a constant battle as an admin to validate what you've provisioned in your environment. Sometimes it's just by the book: an application vendor says it needs x amount, so you validate whether it's actually using it. If it is, great. If not, get those resources back to something that does need them.
So virtual resources aren't infinite and they don't just keep scaling like people think they do?
Contrary to popular belief, no. So, the next step you pointed out was recommendations. We can go into Recommendations and actually see the options available there.
Now, what's the difference between the recommendations versus the sprawl page? Or are you going to see some things in common, or are you going to see some things different?
You will see some overlap, but recommendations use more of our intelligent logic, and there are two types you'll see. Active recommendations are more after-the-fact. Our predictive recommendations—which also need a minimum of seven days' worth of data before they show up—get you ahead of the problem: before an alert fires, they tell you actions to take in your environment.
So you're trending stuff out to see that kind of future impact.
So, yeah, the predictive is obviously looking at, "Hey, the CPU's been low and we haven't done anything. Pull it back." The active could be, "Hey, this job's been running for the last couple hours. I'm maxed out. I need something right now."
Right. And what we try to do with recommendations is not just look at what the server's been averaging over a week; we try to find when the peak workloads occur. Perfect example: if you have a server whose work always happens between three and five in the morning, or always on Friday—that is the actual workload of that VM. You don't want to average it out and normalize something that hits 100% utilization and then does nothing the rest of the week.
Oh, like a VDI environment that gets hammered eight to five—eight a.m. in particular, when everybody's booting up—but is idle on the weekends. You don't want to say, hey, I'm pretty idle on Saturday and Sunday, let's start reclaiming these things.
Correct—where it would otherwise look like these things are only doing 10% utilization. So it intelligently identifies when that server's utilization actually matters.
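Chris's point about not normalizing away peaks is worth making concrete. A naive mean hides a workload that briefly pegs the CPU; reporting an upper quantile alongside the mean keeps the real peak visible. Using the 95th percentile is our illustrative choice here, not necessarily what the recommendations engine does.

```python
import statistics


def utilization_summary(samples, quantile=0.95):
    """Return (mean, upper-quantile) of CPU utilization samples.

    The quantile preserves short, recurring peaks -- the VM's true
    workload -- that a plain average would smooth into oblivion.
    """
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return statistics.mean(samples), ordered[idx]
```

For a VM that idles all week but pegs the CPU every Friday, the mean says "reclaim" while the 95th percentile says "leave it alone"—which is exactly the VDI scenario above.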
Now, it looks like you can actually schedule them—you don't need to get out of this tool. You can run it right from here, correct?
Correct. What I've seen customers do is, if it's low-hanging fruit—an action that doesn't require an interruption of services, or something in a dev or test environment without change control—they'll execute it right there. Most customers in a production environment have change control to follow, so they'll schedule it for the day they're allowed to run it. If that's Thursday at four a.m., they'll schedule it for four a.m., automate it, and the recommendation will apply by itself.
It's awesome that once you've looked at it and verified it, and the software has made the suggestion, you can apply it hands-off during maintenance windows or after hours—so you, as an administrator, don't have to be up at two in the morning to do that reboot or vMotion, or something along those lines.
That's what I was thinking.
Correct. And here's a perfect example. We want to provide the data to you to validate what we're recommending. We're trying not to be a black box that just spits out recommendations—we want to provide the data so you can follow it and use your background and knowledge of your environment to make the decision. In this example, you're seeing what I was talking about with normalization: utilization drops below 50%, and then, cyclically, it spikes above—it's maxing out. It's above your baseline and past your threshold, and that is the actual true workload. So yes, you'd want to make a change to give it resources and bring it back under that threshold.
Awesome. So, Chris, being a database guy, databases and storage go hand-in-hand.
Storage is near and dear to my heart. So, how can we optimize storage resources within SRM?
Within the Capacity Dashboard, there are two things I like to look at first. One is knowing when you're going to run out—planning is key to everything, so you can avoid the outages and the fire drill, especially with storage, because that doesn't sneak up overnight. That gives you an advantage not only in staying ahead of the problem, but also on one of the less-considered items: negotiations when buying new hardware. Because storage is not cheap.
Individual drives are, but going out and buying arrays for an enterprise environment is not cheap. So the more time you have to plan a migration to a new platform, or to purchase a whole new platform, the better it is for you.
Yeah, I'd equate it to buying airplane tickets. If you know you're going on vacation in six months, start shopping now and look for the specials, as opposed to the last-minute flight where you pay whatever you have to because you need to get on board. And from a management perspective, being able to say, hey, we need to allocate this money come June or July—six months out, a year out—looks a heck of a lot better than, hey boss, I ran out of storage, I need to get something overnighted, and I'm going to burn the midnight oil this weekend moving us.
Exactly. And when you come into the Capacity Dashboard, we do have that projected capacity run-out. We show subscribed capacity, oversubscribed capacity, and when each is going to run out. We also handle the two different types of storage, thin and thick—people are most familiar with thin from the hypervisor side. All modern hypervisors abstract that: they'll tell you you've provisioned 100 gigs when you're only actually using 10, so it's important to know the difference. That's why you'll see over-provisioning here—in this case, thin LUNs by capacity: provisioned capacity is 9.8 terabytes against a total size of 10 terabytes, 98% used. We're still drawing your eyes to what's important, highlighting alerts and thresholds and taking you to the warnings you should probably be looking at, so you get ahead of an issue before it becomes critical. And we have widgets like Storage Objects by Capacity Risk, showing the resource name, what it is—a LUN or a volume—the array name, and what it's looked like over the last seven days. Obviously, in this example, 100% used is absolutely bad—not something we want to see.
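The two Capacity Dashboard numbers discussed here—projected run-out and thin-provisioning oversubscription—come down to straightforward arithmetic. A sketch (linear growth is an assumption; real forecasting may fit a trend instead):

```python
def days_until_full(total_tb, used_tb, daily_growth_tb):
    """Linear projection of when an array runs out of capacity."""
    if daily_growth_tb <= 0:
        return None  # flat or shrinking usage: no projected run-out
    return (total_tb - used_tb) / daily_growth_tb


def provisioned_pct(provisioned_tb, total_tb):
    """Thin provisioning can promise more than physically exists; this
    mirrors the demo's 9.8 TB provisioned on a 10 TB pool = 98%."""
    return 100.0 * provisioned_tb / total_tb
```

A run-out date months away is negotiating leverage; a run-out date next week is the fire drill the dashboard is meant to prevent.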
Is that my database?
Yeah, it probably is, absolutely. Hopefully not.
So, SRM doesn't let you reclaim resources right there the way the VM side does, but it does show you how your resources are allocated—the overall makeup. Seeing that this array is at 100% and that array is at 10% helps you load-balance and make sure your workloads are in the correct spot, too, correct?
Right, exactly, and that's key. Not all storage is created equal. I mean, you have flash. That's not infinite. You have these all-flash storage arrays, but increasingly, what you're seeing in these environments are these hybrids, right?
I thought with de-dupe and compression—100 to one, or 20 to one, or whatever it is—it is infinite! [Chris laughs]
I'm actually glad you brought that up, because that's key. If you're putting 100 VMs on a de-duped, compressed LUN on flash storage, that's going to compress and de-dupe a lot better than a SQL database, which won't get much compression at all. So if you're planning strategically, that determines how long the storage will last you. You obviously have to factor in how important application response time is—for a database, that's always going to be key—but also what you're putting on there, because the capacity you get depends on what's on the LUN being compressed and de-duped. We all love high compression, but putting all of your VDI on there and then putting Rob's database on some sort of slow storage may not be the most beneficial thing for the organization.
You know, we've been talking at a very high level about de-dupe and compression, as well as thin versus thick provisioning. But at the end of the day, the key things, as you all know, are latency and IOPS, right?
Right, that's what I care about.
Everybody's looking at that; that's what you're baselining at the end of the day. Well, we have that: Storage Objects by Performance Risk. Not only are we showing these items, we're taking you to the ones you need to look at first. So, IOPS are 4,289 with a throughput of 65.28—and this is actually flash storage we're running on. But latency is the key here: we're showing the equivalent of about 3.8 seconds of latency, which probably means it's offline.
That's an eternity in computer time.
Exactly, when it should be, what is the average? What would you say, about 20 milliseconds for virtual?
I like to see less than 20 milliseconds.
Microsoft's, I think, best practice is 10 or 20, somewhere along those lines. It's low.
Right, 10 to 20.
It's not 3,000 plus.
Right, right. Not that high.
So all of these are key in determining where you're going to place workloads. Just because you can reclaim storage doesn't mean you can automatically apply it someplace else—you're not going to put SATA disks under your tier-one SQL database, right? Based on best practices, you take that environmental knowledge and apply it to what you're seeing on your SRM dashboard.
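The latency rules of thumb in this exchange translate directly into a threshold check. The tiers below are our illustrative choices, built around the "10 to 20 milliseconds" guidance the speakers mention:

```python
LATENCY_WARN_MS = 20    # the "10 to 20 ms" rule of thumb from the discussion
LATENCY_CRIT_MS = 100   # illustrative: far past healthy for any disk tier


def latency_status(latency_ms):
    """Classify storage latency against rule-of-thumb thresholds."""
    if latency_ms <= LATENCY_WARN_MS:
        return "ok"
    if latency_ms <= LATENCY_CRIT_MS:
        return "warning"
    return "critical"  # seconds of latency usually means offline/unreachable
```

The demo's 3.8 seconds (3,800 ms) lands deep in "critical," which is why Chris reads it as probably offline rather than merely slow.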
Now, SRM integrates with the other products and lets me see a noisy-neighbor situation—I can see all of the servers on an array, so I can say, hey, this is production-only, why is this dev/test box on here? Or I can separate my ERP or SQL Server workloads, making sure my SQL Servers aren't all in the same spot, because of...
Not competing for resources.
Correct. That would be key to optimizing the workloads—making sure that their peaks aren't in sync with one another, that they're offset.
Right, right. And thank you—you touched on something there about syncing up to see where those workloads live. How many times, as a database administrator in a siloed environment, have you had to talk to your storage admin to figure out where your storage actually comes from? Or, when you're making a change to production, to make sure you're changing the correct volume? There's so much abstraction, and there's a chain to follow—and that's key. In this example, you're seeing the servers that are on this pool, and you can drill down to the server level. Here it happens to be an ESX demo server, and it shows that it's attached to this storage pool. It's that chain of connection all the way through. That's important, right?
Yeah, no, that's great, because sometimes, it does seem like we're speaking different languages. So it's great to be on the same page.
Yeah—you're saying "I want fast," he's saying "I want lower latency," and I'm thinking capacity. So I'm putting you on completely different storage than what you need, and you can identify that here.
And these views are all dynamic, right? In a virtual environment, things are moving and shaking—that's the greatness of a virtual environment, that you can move things around—and these views are always updating. It's not, hey, I put the app on this server a year ago and it looks identical today.
Right. And it's key to know that's tied to your polling, too. It's not going to be immediate, but if it was like that on day one, that doesn't mean it's like that now. You're looking at updates within about 10 minutes.
Relatively reasonable amount of time.
So now that we've tuned our storage—saying, hey, I want Rob's database on tier one, or Rob, you're on some archaic storage—how would we go further and start tuning the databases or applications themselves? Can we take it a step further and actually look inside the application to start optimizing things?
Absolutely. As long as you're looking at it alongside Server & Application Monitor and they're all together—that's the strength of the proposition we bring. You're looking from the application all the way down to the spindle, with the hypervisor sitting in between. And if you have DPA, you've got the database—one more piece of the puzzle. Nothing happens in isolation; one change affects the others.
Yeah, that's kind of nice, because some of the roles kind of sit in the middle there. So it's nice to be able to see upstream and downstream.
Yeah, it's nice to figure out where the actual problem is, fix it, and admit whether it's yours or somebody else's—but at the end of the day, to get the application running as smoothly as possible with as few resources as possible. Some applications need a ton of resources; it is what it is. But more times than not, I've thrown memory and CPU at a problem thinking that would fix it, and it didn't—it's still slow. The issue is inside the application, not the bare resources.
And sometimes, due to licensing considerations, you might not want to do that anyway, because you can expose yourself to liability under a software license if it's core-based.
Okay, well, why don't you show us inside DPA how we can actually tune a database? That was key in one of the environments I worked in: I kept giving a database more and more resources, and it never responded as quickly as it should have. We purchased DPA and were able to show the SQL guy hard proof: here is the problem—now go fix it.
Right. It's not always a resource problem, right? Okay.
From a database perspective, you've got to make sure you're getting everything you can out of the hardware you've provisioned. Are there inefficient or poorly written SQL statements in your environment? And what are the biggest hitters to your performance and resource consumption? To illustrate this, let's take a look at an example in Database Performance Analyzer. I'm in a SQL instance here, DPASQL2016, and I'm going to drill into today, with my interval set to one full day just to get some good statistics. When I click on that, it brings up all the activity happening within my database engine. You'll note the number-one activity is Memory/CPU. That's actually a good thing—it means workload is getting done; things are processing. But if I want to know what's contributing most to that activity, I can click on it and drill in, and it gives me a list of all the SQL statements contributing to CPU and memory pressure. If I click on this specific hash, which represents one SQL statement, right away I get some great statistics associated with it: how many times it ran during this one-day period, and how many logical reads, physical reads, and rows processed it caused my database engine to do. Looking at this one, as a DBA or a developer, I see an opportunity: the number of logical reads is fairly high compared to the rows processed. Doing some quick math in my head, that's about a 270-to-one ratio. That's not good, right? Definitely an opportunity for improvement. Now, Chris, the nice thing here is that I'm not going back to my storage array to do any reads—it's all in cache.
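Rob's "quick math in my head" is the logical-reads-to-rows ratio. A sketch of that check (the 100:1 cutoff is an illustrative threshold, not a DPA setting):

```python
def reads_per_row(logical_reads, rows_processed):
    """Buffer-cache page reads the engine did per row returned.

    A high ratio hints at an inefficient plan -- a missing index or a
    bad join order -- even when everything stays in cache.
    """
    if rows_processed == 0:
        return float("inf")
    return logical_reads / rows_processed


def needs_tuning(logical_reads, rows_processed, threshold=100):
    """Flag statements whose read-to-row ratio is suspiciously high."""
    return reads_per_row(logical_reads, rows_processed) > threshold
```

The statement in the demo, at roughly 270 reads per row, would be flagged; a well-indexed lookup doing a handful of reads per row would not.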
That's good to know—you read my mind. I was going to ask: does it put any additional load, like a testing load, on your current array or database, or is it just collecting this data and summarizing it for you?
Yeah, great question. And it's really not. There's less than one percent overhead on your monitored instance, and everything we're grabbing already exists in cache.
Now, what about all the KPIs that we were polling? Is this all out of the box, or does this take additional setup to pick what you want to see?
So I'd say probably 98% of it is fully out of the box. There's a little bit of customization you might want to do just to tweak it—put some bells and whistles on it to your liking. But as a DBA or developer, I would look at this and say that ratio is too high: I'm asking the server to do way too much work to get the number of rows I actually want to act on.

Now, would DPA recognize that, though? Being a former system administrator, I'd look at it and think, looks good to me. I'm trying to figure out where I would see that—a "danger, Will Robinson" red flag, somewhere visual that says, hey, this exceeds some threshold; this is where our problem is.
Right. So within the existing resources that we've already had allocated to us, as long as we're staying in cache, there's really not that big of a performance hit. But when you start flushing those pages out of memory, then you're starting to go back to disk. That's where the inefficiencies and the performance hits really come in. Yeah, great point.
So we've identified that there are inefficiencies in the read ratios. What's the next step? How would I, as a new admin coming into this, proceed?
Yeah, great point. And this is where a DBA really has to be a DBA. They'll look at the strategies and tools in their toolbox to fix or mitigate this: indexing; maybe data archiving—which ties back to storage, because you might put that archive data on a lower tier to reduce cost; maybe examining the join order, parsing through the query plan or execution plan for inefficient steps to address.

I think you've hit on something key there. Once you go back and make these changes, part of anybody's plan should be to re-baseline, correct?
Absolutely. Yep. And since we're constantly grabbing that historical data, this kind of does it for you. DPA will give you that historical baseline over a day, over the past 30 days, so we kind of keep that information for you.
That's good to know.
Yep, great to look back.
Now, you made reference earlier that, you know, as a system administrator--database is slow, more RAM, more CPU, just feed the beast. And that could cause the database to be out of compliance, using more cores than it's licensed for or something along those lines. Is that a visual here, or is that something you're just aware of--that hey, you just gave it more cores than what we're licensed for, and that still did not fix the problem?
Right. So, my primary point was really to be careful--at least be aware of any licensing considerations that might come into play if you're going to throw more cores at it, because database platforms do tend to use core-based licensing. So you've just got to be careful not to expose your company or yourself to liabilities or additional costs that you weren't intending.
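The licensing exposure Rob warns about is easy to quantify before anyone adds cores. A minimal sketch of the arithmetic (the per-core price is a made-up placeholder, not any vendor's actual rate):

```python
def added_license_cost(current_cores: int, proposed_cores: int,
                       price_per_core: float) -> float:
    """Incremental licensing cost of growing a core-licensed database server."""
    added = max(proposed_cores - current_cores, 0)  # shrinking costs nothing extra
    return added * price_per_core

# Hypothetical: bumping a database VM from 8 to 12 cores at $7,000 per core.
cost = added_license_cost(current_cores=8, proposed_cores=12, price_per_core=7_000)
print(f"Extra licensing cost: ${cost:,.0f}")  # Extra licensing cost: $28,000
```

Running this kind of estimate first makes "feed the beast" a costed decision rather than a reflex.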
Yeah. And a lot of times, the point here is that it's not always resources. That may be step number two, three, or four to look at, but first you want to look and see where there are opportunities to tune the existing environment.
Now, the example you just showed, where the logical reads and the writes--was it that they were out of sync?
Where were we at before, and where could we get to? Actually, go back a screen--I see seconds. So this is how much wait time we're at, and we can decrease that by changing the query up. I'm using my own words here, but...
Right, right, right. Indexing or looking at inefficient query plan steps, things like that. The nice thing here is that if I go ahead and click into this, I can also see the historical chart for this specific SQL statement. So here, I can see this is kind of the baselining question that you said, right? I can look back the past 30 days, and I can tell you how things ran back on June 28th versus today. Is it different? Is it the same? And one of the most important things is what's driving it? Is it the number of executions? Is it the physical reads? Is it the logical reads?
So you would see, Black Friday, for example, you would see, hey, I'm an e-comm site on Thanksgiving. I got executed a couple hundred times. Two a.m. comes around; I'm getting executed thousands of times, because everybody's trying to buy whatever at the discount price.
Yeah, no, absolutely. Yep. And that's where we kind of take our trending and all that information, that historical data, to understand our baseline. You have to have that baseline to understand, what is the norm? And now, how do we compare against that norm? Is it better? Is it worse? Is it the same?
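Comparing today against the norm, as Chris describes, can be sketched as a simple baseline check: compute the mean and standard deviation over the trailing window and flag anything several deviations out. A minimal sketch, assuming made-up sample numbers and a 3-sigma cutoff (this is not how DPA computes its baselines):

```python
from statistics import mean, stdev

def against_baseline(history: list[float], today: float, sigmas: float = 3.0) -> str:
    """Classify today's metric against the trailing-window norm."""
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return "same" if today == mu else "different"
    z = (today - mu) / sd
    if z > sigmas:
        return "worse"   # e.g. wait time well above the norm
    if z < -sigmas:
        return "better"
    return "normal"

# Hypothetical: ~30 days of total wait time (seconds), then today's reading.
history = [110, 105, 98, 102, 115, 99, 107] * 4 + [101, 104]
print(against_baseline(history, today=106))  # normal
print(against_baseline(history, today=400))  # worse
```

Without that stored history you have nothing to compare against, which is exactly why re-baselining after every change matters.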
And it truly is, to that point, it truly is changing the way you work. If you don't have that norm, finding out what is good, what is bad--there's no way you can make a change and see if it's actually had an effect.
That's right. And since this is taking it from an end-user point of view, an end-user proxy, really what you're displaying here is end-user pain, right? They don't really care what's happening underneath the hood of the database engine. They just want to be on time for lunch with their friends.
Yeah, no, I completely agree. As I alluded to earlier, I purchased DPA at my former company, and the website was constantly running slow. I was getting thrown under the bus as the system administrator. There was broadcast traffic, so I created VLANs. It's the DNS server--we made a new DNS server. I mean, we literally bought a Dell server and made a $15,000 server a DNS server. Talk about overkill. But we were grasping at straws on what the problem was. Finally, we put DPA in--I should have done that a lot sooner--and quickly, we were able to see, even from my perspective, this query seems to be running an awfully long time, and frequently. I was able to rope in the SQL guys, and they were like, oh, yeah. They made a couple changes. Two weeks later, the websites were hauling. I was asked, what did I do? I didn't do anything different. And then I actually was able to re-provision all that hardware we had thrown at the problem. That DNS server was reallocated to something else. I know the problem wasn't necessarily on the database server, but all the stuff we were throwing at a problem that didn't exist, we were actually able to reallocate to what it was truly supposed to be.
Makes sense. And going back to something that you had said earlier is that you want to go through this in an iterative process. You want to look at the most impactful SQL statement, do what you can to mitigate that, get that off the radar, move onto the next, right? Rinse and repeat.
Rinse and repeat. I mean, as Chris brought back up, you tune your database, you tune your applications, you go in there and re-baseline. If you threw extra RAM at it, you'll probably now see, hey, you were at 60 or 70% capacity; now you're back down to 20 or 30. You can give some of those cores back--pull them back in. So really, going back in there, using VMAN, using SAM to analyze and baseline your environment, and then, like you just said, rinse and repeat. Reallocate the inventory or hardware, then baseline again and go from there.
Yeah, and that's one of the great things about having all those products kind of work in concert with each other, because it really gives you a good idea, especially being kind of in the middle. Like, I'm not front-end application. I'm not back-end storage or anything, but I can kind of see both upstream and downstream. I'm not existing within a silo so much.
Right. I think it's key to identify where is the problem at? But it's also very easy to mask a problem by throwing more resources at it versus actually solving what the issue is, right? It's like trying to put a Band-Aid on a broken leg. It's not going to fix the issue.
Right. Throwing tons of resources at an issue, throwing better storage--it can really mask a lot of problems, but it can also increase software licensing costs, and it can mask a problem that's eventually going to rise up and bite you again.
Take your resources away from the things that you need.
So throwing resources at things is not a strategy? [Rob laughs]
It is a strategy, but maybe not a good one, right.
Not necessarily the right one.
Maybe not a good one.
Rob, that was an awesome DPA demo. That really showed how to optimize and tune a query where I'd just been throwing resources--CPU and RAM--at it, trying to fix it the whole time.
Yeah, no, thanks. It's been great.
Well, that about wraps it up here today for this THWACKcamp session. Hopefully you've learned how you can use monitoring software to pull some hardware and software back into the fold. This is a never-ending process, and what was tuned and optimized yesterday is not necessarily tuned and optimized today. For THWACKcamp, I'm Jared Hensle.
I'm Chris Paap.
And I'm Rob Mandeville. And thank you for being with us today.