The Seven Deadly Sins of Data Management
All too often, we data professionals are our own worst enemy when it comes to handling data and data management. Many data professionals and system administrators fail to recognize that their own habits increase the risk that the business will fall short of its goals. The danger may not be as destructive as an all-out data breach, but we are often to blame for enabling our business end users to lust after BIG DATA, resulting in data hoarding that leads to ROT (redundant, outdated, trivial information).
So, while the world’s collective media shines a light on the never-ending list of security breaches, we suggest that there are common, and bigger, threats that data professionals need to guard against. Not all data professionals are guilty of every one of these sins. Rather, the collection of individuals who work in modern enterprise IT shops is culpable. Head Geeks™ Thomas LaRock and Destiny Bertucci will share examples of data management, or rather, mismanagement.
Hi, I'm Thomas LaRock.
And I'm Destiny Bertucci. And today we're going to take you to church to confess our sins.
Well, that's not quite right. We're not going to church, but we are going to talk about sin.
But we will have a chance to make them confess, right?
Oh, yes. Yes, before this is done, there's going to be plenty of chances for them to admit that they have data management sins. In fact, that's the title of this session. The Seven Deadly Data Management Sins, which we are going to apply to Orion data.
And I better hear a confession from someone before this is over, right?
Before we begin, I want to make sure people understand a little bit about what data management means. For a lot of people, it's almost like being a database administrator, and that can mean a lot of different things: more development or more administration. Data management has a lot of different layers. I've written just a few of them down: data architecture, administration, access, data quality, data security, and so on. And as you can see, yeah, it's not the most exciting topic to talk about.
No, definitely not. It sounds like a lot of work.
Right, it is a lot of work, but the fact is that, for most of the people out there, what they probably don't realize is they're already doing each one of those things to some degree if they're an Orion administrator. Right, did I just blow your mind?
You did. I'm actually taken aback by that. [Laughs]
Most people just skip over it: "I'm not a DBA." Well, are you the one installing Orion and configuring the database and things like that? Then you're actually a part-time DBA. Are you in charge of the quality of the Orion data? Probably. How about security? How about access? Who has access to the Orion database? All of those things fall into one or more layers of data management. So there are a lot of ways things can go wrong, and I want to bring some awareness to what those possibilities are. One way of doing that, of course, is looking at the seven deadly sins applied to data management. We're going to go through each one of these and see examples of where you might be sinful [chuckles] in how you take care of your Orion data.
This sounds like a plan.
All right, let's get started. The first deadly data management sin is lust. In traditional companies, when you think about lusting after data, the easy example, of course, is that everybody wants big data. They want their data big, they want it now, they want it fast and accurate, and they need all the things. They basically lust after data. Every company out there knows that without data, they wouldn't exist. Data is the most critical asset they own. And if that's true, then obviously they want to own more of it. They need more of it. They need all the data. Lots of times, they just want to pull in more and more data without an effective data management strategy. They just say, "Get more of it and we'll figure things out as we go."
And that's the thing you hear a lot of the time: "Let's save it for a really long time." They don't understand the storage that would actually require, the security issues of keeping everything, or the integrity of it. So when they go back through all of it, it's like when you're not the DBA but you're accidentally in that role: you've just inherited a bigger mess to deal with.
That's right. So, lusting after data is obviously a sin but let's think about that. What does that mean in terms of Orion? So, Orion, you think about polling and tracing all the things. Is that fairly common?
Yes, and we see a lot of people go wide open. Like, you guys know what I mean here. [Laughs] We'll literally turn on every syslog and every trap, send it all to us, and then figure it out later.
Exactly. There's also the concept of paralysis by analysis. When you lust after all of that data and try to trace everything possible, it gets hard to do any sort of effective analysis. In my world as a DBA, the question becomes: do you want to trace and capture every event that happens while a query is running, or do you just want to poll and take a sample every few seconds to see what's happening? That's how DPA works: it uses sampling and polling rather than tracing. Tracing can lead to a lot of overload. It's basically lusting after every piece of data imaginable in the hope that some little piece of it might pay dividends down the road, should you have to do forensics on an incident. And in Orion, we do see that. You have a picture here of all the alerts.
Yeah, definitely. Especially because you get that alert overload. We have the out-of-the-box alerts, but then people say, "I want this alert, and this one, and this one." Technically they may have 10 alerts set up that are all using your resources, when one well-scoped alert would cover the whole thing and you wouldn't have 10 of them alerting you. That saves your inbox, your time, and your resources, because you're being specific about what you need to alert on instead of just piling on more.
And we've talked before here on THWACKcamp and on Labs many times about the difference between an alert and a notification. An alert requires action. So if you've already got an inbox full of alerts and you're building rules to sort them, you've already lost.
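To make the sampling-versus-tracing trade-off concrete, here is a minimal Python sketch. It is not how DPA is implemented; the one-minute timeline, the state names, and the five-second interval are all hypothetical. Tracing keeps every event, while sampling keeps one snapshot every few seconds and scales the counts back up into an estimate of time spent in each state.

```python
# Hypothetical timeline: what one session was waiting on in each second
# of a 60-second window. State names are illustrative only.
timeline = ["CPU"] * 20 + ["LOCK"] * 30 + ["DISK"] * 10


def trace(events):
    """Tracing: keep every event, so storage grows with activity."""
    return list(events)


def sample(events, interval):
    """Sampling: keep one observation every `interval` seconds."""
    return events[::interval]


def estimate_seconds(samples, interval):
    """Scale each sampled state back up to estimated seconds spent in it."""
    estimate = {}
    for state in samples:
        estimate[state] = estimate.get(state, 0) + interval
    return estimate


traced = trace(timeline)
sampled = sample(timeline, 5)
print(len(traced), len(sampled))      # 60 12
print(estimate_seconds(sampled, 5))   # {'CPU': 20, 'LOCK': 30, 'DISK': 10}
```

In this toy timeline the state changes line up with the sampling interval, so the estimate comes out exact. With real workloads sampling is an approximation, traded for storing one row in five.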
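A minimal sketch of the "one alert instead of ten" idea, assuming hypothetical metric names and thresholds. Orion alerts are configured in its own alert manager, so this Python only models the logic: evaluate several conditions in one rule and send one message per node listing what tripped, rather than one email per condition.

```python
# Hypothetical thresholds; this models the idea of one rule that covers
# several conditions, not Orion's actual alert engine.
THRESHOLDS = {"cpu_pct": 90, "mem_pct": 90, "disk_pct": 95, "response_ms": 500}


def breached(metrics):
    """Return the metric names on one node that exceed their thresholds."""
    return sorted(name for name, limit in THRESHOLDS.items()
                  if metrics.get(name, 0) > limit)


def consolidated_alerts(nodes):
    """One message per node with any breach, instead of one per condition."""
    messages = {}
    for node, metrics in nodes.items():
        hits = breached(metrics)
        if hits:
            messages[node] = f"{node}: " + ", ".join(hits)
    return messages


nodes = {
    "web01": {"cpu_pct": 97, "mem_pct": 93, "disk_pct": 40, "response_ms": 820},
    "db01": {"cpu_pct": 35, "mem_pct": 60, "disk_pct": 50, "response_ms": 120},
}
for msg in consolidated_alerts(nodes).values():
    print(msg)
# web01: cpu_pct, mem_pct, response_ms
```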
A lot of noise.
A lot of noise. We want to avoid the noise. Lust leads to a lot of excess, going after all the things, and that leads to a lot of noise. All right, next sin: gluttony. For me as a DBA, when I think of the sin of gluttony when it comes to data, it's basically data hoarding. Pretty much everybody out there is a data hoarder. If I could make a TV show about data hoarding, I would.
Let's just have a moment for the hoarders.
Okay, we're good.
Nobody seems to be archiving data. Everybody just collects it. We talked about lust: you go after the data. Do you have an archiving strategy for it? "Not really. I just know that I have access to a terabyte of data a day, and some of it might be useful, some of it not. But I'm going to keep it all because it might be useful to me one day, and I just don't know when." So there's no real archiving, no strategy in place that says, "After 30 days, roll it over to a different set of disks and cold storage." There are companies that make software-defined solutions that move data between hot and cold storage for you. SQL Server and Azure have a feature called Stretch Database: you can take your older data and push it up to the cloud. But you know what would be better than that? Not having to keep it at all to begin with. It's like having a garage with a car and newspapers piled all around it; you just want to get rid of all that extra trash you don't need. So for Orion, where do we see that?
You see it with all the data points that just keep piling up out there, like response time. The volume is out of this world, but we're not really doing anything with it. What we want to do is trend it. If we know what the baseline is, then we can set a threshold to know what falls outside of normal, for example for response time. When you're monitoring, you don't need to keep data forever, but you do want that baseline. That's what you really need on the monitoring side: baseline awareness. You don't need five years of data, because are you really going to remember five years from now what was going on? On top of that, that's five years of data you have to run security checks on. You also have to keep it available, and your reports have to load it. If you try to load five years’ worth of data into a report, naturally everything is going to crunch. And when you're accidentally the DBA, especially for Orion, you're not preparing for that on the back end. A lot of you who are just starting out say, "Oh, we're going to save five years," when you're really only good for maybe six or seven months. That's when you start to see what holding the data that long actually does. It's not just a resource you can pull data from; it's resource-intensive to retrieve and use that data. So, do you actually need this data?
So what we ended up doing with DPA was this: you don't keep the details for five years, you have rollups and summaries. Your detailed data rolls up into hourly data at some point. That hourly data is kept for a few months, and after that it gets summarized into daily data. That's usually enough for most people, and it significantly reduces the amount you're carrying. It's close to a proper archive strategy. When you ask somebody, "Hey, how often have you had to go back in time and run a report on detailed data from an event four and a half years ago?" the answer is usually, "That never happens. But if I needed to, I'd really want the data to be there."
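A sketch of the "after 30 days, roll it over to cold storage" rule as a simple age check, assuming each row carries a date. The function names and dates are hypothetical, and the 30-day default just mirrors the example above; tools such as Stretch Database or tiering software would do the actual data movement.

```python
from datetime import date, timedelta

HOT_DAYS = 30  # the 30-day cutoff from the talk; adjust to your own policy


def storage_tier(row_date, today, hot_days=HOT_DAYS):
    """Return 'hot' for rows within the cutoff, 'cold' once they age past it."""
    return "hot" if today - row_date <= timedelta(days=hot_days) else "cold"


def split_by_tier(row_dates, today):
    """Partition row dates into (hot, cold) lists."""
    hot = [d for d in row_dates if storage_tier(d, today) == "hot"]
    cold = [d for d in row_dates if storage_tier(d, today) == "cold"]
    return hot, cold


today = date(2024, 6, 30)
hot, cold = split_by_tier([date(2024, 6, 15), date(2024, 5, 1)], today)
print(hot, cold)  # [datetime.date(2024, 6, 15)] [datetime.date(2024, 5, 1)]
```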
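One common way to turn a baseline into a threshold is the mean plus a few standard deviations. This is an assumption for illustration, not necessarily how Orion calculates its baselines, and the response times below are made up.

```python
import statistics


def baseline_threshold(samples, k=3.0):
    """Baseline = mean of recent samples; threshold = mean + k * std dev."""
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    return mean, mean + k * stdev


# Hypothetical response times in milliseconds from a recent window.
history = [100, 110, 90, 105, 95, 100, 100]
baseline, threshold = baseline_threshold(history)
print(f"baseline={baseline} threshold={threshold:.1f}")

# Flag only new readings that fall outside the baseline band.
new_readings = [102, 180, 98]
print([ms for ms in new_readings if ms > threshold])  # [180]
```

Keeping a rolling window like `history` is what makes long-term detail unnecessary: the threshold only needs recent, representative data.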
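The rollup scheme described above can be sketched in Python, assuming (timestamp, value) samples. Detail rows collapse into hourly (average, min, max, count) summaries, and hourly summaries collapse into daily ones, weighting each average by its count so the daily figure matches what the raw detail would give. The function names and sample data are hypothetical; DPA and Orion have their own rollup jobs and retention settings.

```python
from collections import defaultdict
from datetime import datetime


def hour_bucket(ts):
    return ts.replace(minute=0, second=0, microsecond=0)


def day_bucket(ts):
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def roll_up_detail(rows, bucket):
    """Collapse (timestamp, value) rows into (avg, min, max, count) per bucket."""
    groups = defaultdict(list)
    for ts, value in rows:
        groups[bucket(ts)].append(value)
    return {key: (sum(v) / len(v), min(v), max(v), len(v))
            for key, v in sorted(groups.items())}


def merge(summaries):
    """Combine (avg, min, max, count) summaries, weighting averages by count."""
    count = sum(s[3] for s in summaries)
    total = sum(s[0] * s[3] for s in summaries)
    return (total / count, min(s[1] for s in summaries),
            max(s[2] for s in summaries), count)


def roll_up_summaries(summaries_by_key, bucket):
    """Collapse hourly summaries into coarser (e.g. daily) summaries."""
    groups = defaultdict(list)
    for key, summary in summaries_by_key.items():
        groups[bucket(key)].append(summary)
    return {key: merge(group) for key, group in sorted(groups.items())}


# Hypothetical detail: a response-time sample every 10 minutes for two hours.
detail = [(datetime(2024, 6, 1, 3 + m // 60, m % 60), 100 + m)
          for m in range(0, 120, 10)]
hourly = roll_up_detail(detail, hour_bucket)
daily = roll_up_summaries(hourly, day_bucket)
print(len(detail), len(hourly), len(daily))  # 12 2 1
print(daily[datetime(2024, 6, 1)])           # (155.0, 100, 210, 12)
```

Once the hourly rows exist, the detail rows can be dropped; once the daily rows exist, the hourly ones can age out too.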
But I totally could.
Oh, that's what a restore could do for you. Take a backup of your database and you could restore it, run a report, and get the data you need as well, and do the comparison. Would it take a little bit longer? Yes. But does it save you a lot in terms of the sin of gluttony? Absolutely. Faster queries, faster processing for the Orion data itself, for the real-time data that people usually need, more current than the stuff that's a little bit further out.
And it's the same with the rollup we do in the database management on our side: we roll up the data and do all of that for you. The problem is when you get the people who like to tweak those settings.
Our defaults are suggestions based on actual data from our users about what is most pertinent. So when you start making adjustments, it has an effect; there's always cause and effect. Anytime you increase the amount of detailed information, you increase the size and the load on the back end.
A company I'm familiar with had this archiving strategy: they didn't want to be gluttons about it, they just wanted a backup of the database every month. And I said, "That's great, I'll take a backup." "No, no, you've got to save it." In other words, take a backup, rename it, and leave it there. So every month I was taking a backup of the database, and I asked, "How long do I have to do this for?" The answer, of course, was seven years. So I got to the point where I had 84 databases, and I went back to the same people and said, "Great, I can get rid of that first one now." "No, no, you have to leave that." So now I had 85 databases of this system, because that was their archiving strategy. That's not an archiving strategy. That's data hoarding. As much as I'll tell you that you can take backups and do restores, that doesn't mean those databases have to be online all the time. You can go back and get your data later if you need it. All right, next sin: greed. We've gone through lust and gluttony: you're going after all the data, and you're keeping all of it and hoarding it. Guess what that leads to? You need somewhere to store all that data, and that leads to bigger and ‘badder’ data centers. And what do the more expensive data centers do? They price by the square foot. They add up the cost of the equipment and everything else, cooling and heating, divide by the square footage, and get a price per square foot. With all the lust and gluttony going on, that price just keeps going up and up. And it's great when you consolidate racks from two down to one, but you know they'll just expand back into that space. Nobody is really getting rid of any data, and it ends up costing more and more money, which is why cloud becomes such a viable option: it reduces that kind of on-premises data center cost.
But make no mistake about it: this is the end result of the lust for data and the gluttony for data. The end result is greed: "Oh, now I can go get the hardware that will solve all of our problems." And how do we see that represented in Orion?
That is a great question, because a lot of the time we want to go all out, especially on the applications side, because there are so many different templates we can use and we just want to turn everything on. Well, when you turn everything on, it affects the database, because you're adding new rows for every monitor you run. But it also uses your resources to gather that information. It increases the CPU and memory load on your pollers, because it's so intensive; some of these templates have 70-some components to monitor. What we find a lot of the time is that people turn on everything they have on their servers without understanding what each of those components is. If you don't understand what one is, we have great resources to help you with that, with experts on the application itself. But why turn something on just to turn it on? You won't recognize a problem if you don't know what the component is, and you won't know how to fix it if you don't know that area. So you really want to stay realistic and not get too greedy and turn everything on, because you'll create a problem you didn't need in the first place. Focus on what you know and what you actually need to monitor. You kind of have to get that tunnel vision.
What I have listed here is a little bullet point that says 27 different monitoring tools for three teams. Yes, I know of a company where that was the case. You have a tool, say System Center Operations Manager, and it can gather just about everything you would need. Then another team, say the network team, shows up one day and says, "We just bought a really great tool from this little software company in Austin." Then some other team says, "Oh, we need our own specialized tool." Now you have lots of overlap. You can actually hit a point where, if you do the math, you find out that the biggest consumer of IT resources is IT itself. You end up with more servers than people. That's the greed I'm talking about. They just keep going after more and more and say, "Well, we need all this special stuff." There's lots of overlap and no real collaboration. Everybody is going after the hardware resources they need for their own little ‘silo-ed’ piece of IT. You can see that represented very clearly in Orion just by looking at all the nodes that end up getting mapped and collected. Do I need every little piece of data from every little node? Maybe, maybe not. But that usually becomes the end result: IT becomes a bigger and bigger consumer of its own resources.
You also get less productive. I've been around a lot of shops that had different monitoring tools, and you get into a kind of he-said, she-said thing: this one says it's this, that one says it's that, and they're off just a little bit. You don't want to, but sometimes you have to break it down: well, they may be on different poll cycles, and when this one does that... But you're spending that time arguing over monitoring tools that summarize the same information. That doesn't help collaboration between teams, and it takes your eye off the ball.
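The "seven years of monthly backups" arithmetic is 7 × 12 = 84 files. Here is a minimal Python sketch of a retention rule that flags anything past that window, assuming one backup on the first of each month; the dates and function names are hypothetical. Flagged backups can be deleted or moved offline rather than kept as live databases.

```python
from datetime import date

RETENTION_MONTHS = 7 * 12  # seven years of monthly backups = 84


def months_between(older, newer):
    """Whole calendar months from `older` to `newer`."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def backups_to_prune(backup_dates, today, keep_months=RETENTION_MONTHS):
    """Return backups that fall outside the retention window (oldest first)."""
    return sorted(d for d in backup_dates
                  if months_between(d, today) >= keep_months)


# Hypothetical: 85 monthly backups, January 2017 through January 2024.
backups = [date(2017 + i // 12, i % 12 + 1, 1) for i in range(85)]
print(len(backups), backups_to_prune(backups, date(2024, 1, 1)))
# 85 [datetime.date(2017, 1, 1)]
```

With 85 backups on hand, exactly one, the oldest, falls outside the 84-month window.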
Oh yeah, a lot of SAN storage tools summarize data: here's what that metric was at this point in time. What they don't tell you is that it's an average over the last minute. Then you try to compare that to somebody else's tool that measures every second, and they have a different number. Why are the two numbers different? Because they're two different tools measuring in two different ways. All right, next sin: sloth. More data, as we've talked about, usually leads to slower queries. Maintenance takes longer. I have RPO and RTO listed here. RPO stands for recovery point objective: the point in time you can go back to when you do a restore. Let's say you're doing transaction log backups every 15 minutes; your RPO would be about 15 minutes of data loss. RTO is recovery time objective: how long it would take you to do the restore. That's essentially your downtime, because while you're doing the restore, people won't have access to the data. For sloth, I put in the idea that if we're lugging all this extra data around, there's a general slowness. But sloth is also the idea that you aren't able to get to the tasks you should be doing. That's the actual definition of sloth as a sin, I believe. It's not just that things are slow; it's that I'm slow to react, and there are lots of reasons why. I'm going to jump into a demo of DPA to give the best example I can of what I think sloth means.
So what I have here is the front page of DPA, looking at a test server where we've been running some workloads against Orion. This is all Orion data we're seeing. The first thing I want people to notice, on the left-hand side, is the amount of time in wait. We can see thousands of seconds. How many seconds are in an hour? 3,600. So that's many, many hours of wait in a particular day. This covers all executions; every color you see is a statement, and it's an aggregate amount of information. Look at this one. This statement is UPDATE APM_CurrentComponentStatus SET.
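A small illustration of why two tools can disagree, using hypothetical per-second IOPS readings for one minute: the same data reported as a one-minute average and as a per-second peak gives two very different numbers.

```python
# Hypothetical per-second IOPS readings for one minute: quiet, then a burst.
per_second = [100] * 50 + [1000] * 10


def one_minute_average(samples):
    """What a tool reporting a one-minute average would show."""
    return sum(samples) / len(samples)


def per_second_peak(samples):
    """What a tool sampling every second would show at its worst moment."""
    return max(samples)


print(one_minute_average(per_second), per_second_peak(per_second))  # 250.0 1000
```

Neither number is wrong; they answer different questions, which is exactly why teams comparing them end up arguing.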
You guys have probably seen this before with any kind of an activity monitor or something.
Right. It's setting the component status ID, and it's a fairly common statement inside of Orion. It runs a few times. In this case, it runs 30,000 times a day. As I usually ask people when we walk through this, how many times a day should it run?
And they're like, what do you say?
Right, well it kind of depends.
On average, it runs in 7/10ths of a second. That's probably good, right? It's less than a second. But in terms of aggregate wait, you can see here five hours, 57 minutes of resource consumption and wait time on the server for that query alone. The total wait time for the whole server, across all statements, is 15 hours. So that's 15 hours of wait in a one-hour time period, from 3:00 to 4:00 a.m. How can you have 15 hours of wait in a one-hour time frame? Workload, right? It's an aggregate across all executions, a summary of all the wait you're seeing. Now I can drill in a little more to get an idea of what was happening at, say, 3:00 a.m. Let's do that. Now you get an idea of the statements, and it really boils down to these. The top two statements are the ones with the most activity. There's the Pareto principle, where 80% of your waits are caused by 20% of your queries, and you'll find that often holds true when you look at DPA data. What I see here is that this statement, the update statement again, had two primary waits: one was an exclusive lock and one was an intent exclusive lock. I know that because I know what these wait types are. If I didn't, I could click and get a description, and I can see it's an exclusive lock used for data modification. So while this update, or an insert or delete, is happening, it has placed a lock on an object somewhere, probably that component table. That means that if the lock is at the table level, nobody else is going to be able to access that table for that 7/10ths of a second while this runs. So now the question becomes, how many people are trying to access that table while this one is running? And so on and so forth. What DPA is going to let me do is understand where the real issue is.
When queries are made to SQL Server, a query can be in one of three states: running, runnable, or suspended. Running means it's currently executing. Runnable means it's waiting for a scheduler to execute it. When it's suspended, it gets one of these wait events: a lock, a write log wait, a page latch (although that one is inside memory). So when you go through and look at the aggregate amount of waits, the idea is very simple: if you know what your query is waiting for, then you know how to solve it. Is it waiting on memory, CPU, network, or disk? Or is there locking and blocking? Let's go back out to May 11th. You can see at the 3:00 a.m. time frame, if I come down here to blocking, maybe not as much at 3:00, but look at what was happening over here. Here's another update statement. As a root blocker, it was only waiting about 38 seconds. But the wait time impact on all blocked sessions was nine minutes. So this particular update statement was causing nine minutes of wait, and not wait that had anything to do with memory, CPU, disk, or network. It was simply due to the transactional nature of what's happening inside the database. What you have to understand here is that there's no amount of hardware I can throw at this that's going to solve the problem. I can't throw more memory at what is essentially a transaction isolation problem. I can't...
That's the basis of what we all think.
And I say "we all" meaning the people who accidentally end up in this database role. Maybe we didn't plan for it. But literally, sometimes, even in my history with SolarWinds, it's been, let's throw more CPU, let's throw more memory at it. When I'm looking at this, I'm laughing on the inside, because I understand how SolarWinds works. Around that 1:00 a.m. mark is most likely when their maintenance runs, too.
It's going to be causing this. So I can't just throw...
Right, money at maintenance, because we have to have maintenance. So it's about understanding how the locking and blocking correlates.
Right, and the biggest takeaway I want people to have from this is that when they're experiencing issues with Orion, and maybe it's one of these deadly sins, they need that understanding. Can I scale? Can I throw hardware at the problem? Well, what's the problem, really? Is it locking and blocking, something logical inside the database? Or is it a physical hardware resource constraint: memory, disk, CPU, or network? In this case, I believe the bulk of the waits for this particular instance come from hitting a point where there was a lot more locking and blocking, and I can see that if I simply look at the top waits. Lock, lock, look at that. Far more than 80%, probably closer to 90-plus percent, is simply a couple of waits, exclusive lock and intent exclusive lock, before I even get to memory or CPU. So a faster disk isn't going to solve this. You can go buy all the flash arrays you want; you're still going to have contention for the object itself. This gives people an understanding of when it's the right time to scale Orion, and how. Can I scale with hardware, or do I have to think about things a little differently? Maybe I don't need to poll everything. Maybe that component table could be smaller because I don't need to track 50,000 nodes and all these details. So I really like getting people to understand that they can use a tool like DPA to look at Orion itself in order to make their installation a little bit better.
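A Python sketch of how aggregate wait can exceed wall-clock time: each statement's wait is executions multiplied by average wait, summed across every session, so overlapping executions add up. The UPDATE row uses the 30,000 executions and roughly 0.7 seconds quoted above (30,000 × 0.7 s = 21,000 s, about 5.8 hours); the other nine statements and their numbers are hypothetical. `top_share` checks the Pareto-style claim by measuring how much of the total the top 20% of statements account for.

```python
# (statement, executions per day, average wait in seconds).
# The first row uses the figures quoted in the talk; the rest are hypothetical.
stats = [
    ("UPDATE APM_CurrentComponentStatus", 30_000, 0.7),
    ("statement_02", 12_000, 0.4),
    ("statement_03", 1_000, 0.5),
    ("statement_04", 800, 0.5),
    ("statement_05", 600, 0.5),
    ("statement_06", 400, 0.5),
    ("statement_07", 300, 0.5),
    ("statement_08", 200, 0.5),
    ("statement_09", 100, 0.5),
    ("statement_10", 100, 0.5),
]


def total_wait(rows):
    """Aggregate wait per statement: executions x average wait (seconds)."""
    return {name: runs * avg for name, runs, avg in rows}


def top_share(waits, fraction=0.2):
    """Share of all wait caused by the top `fraction` of statements."""
    ranked = sorted(waits.values(), reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


waits = total_wait(stats)
print(f"{waits['UPDATE APM_CurrentComponentStatus'] / 3600:.1f} hours")  # 5.8 hours
print(f"top 20% of statements: {top_share(waits):.0%} of wait")
# top 20% of statements: 94% of wait
```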
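The blocking figures above (a root blocker that itself waited about 38 seconds but caused nine minutes of wait for the sessions behind it) can be modeled by following each blocked session's chain back to its root blocker and summing the waits. The session IDs and the split of the nine minutes across sessions are hypothetical; DPA calculates its own impact numbers.

```python
# Hypothetical snapshot: session id -> (id of the session blocking it, or None,
# seconds this session has waited).
sessions = {
    51: (None, 38),   # root blocker: the UPDATE itself waited about 38 s
    52: (51, 200),
    53: (51, 180),
    54: (53, 160),    # blocked by 53, which is in turn blocked by 51
    60: (None, 5),    # unrelated session, not blocked and not blocking
}


def root_of(sid, sessions):
    """Follow the blocked-by chain up to the session at its head."""
    seen = set()
    while sessions[sid][0] is not None and sid not in seen:
        seen.add(sid)
        sid = sessions[sid][0]
    return sid


def blocking_impact(sessions):
    """Sum the wait of every blocked session under its root blocker."""
    impact = {}
    for sid, (blocker, waited) in sessions.items():
        if blocker is not None:
            root = root_of(sid, sessions)
            impact[root] = impact.get(root, 0) + waited
    return impact


impact = blocking_impact(sessions)
print(impact)                   # {51: 540}
print(impact[51] / 60, "min")   # 9.0 min
```

The root blocker's own 38 seconds is small; the cost shows up in the sessions queued behind it, and no amount of extra hardware shortens that queue.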
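A rough Python sketch of the "logical or physical?" question: bucket wait types into categories and check whether locking dominates. The wait type names are real SQL Server wait types, but the prefix mapping is a simplification, and the sample totals and the 50% cutoff are hypothetical.

```python
# Simplified, illustrative mapping from SQL Server wait type prefixes to
# resource categories. Real wait analysis has many more types and nuances.
CATEGORIES = [
    ("LCK_M_", "locking"),
    ("PAGEIOLATCH_", "disk"),
    ("WRITELOG", "disk"),
    ("ASYNC_NETWORK_IO", "network"),
    ("RESOURCE_SEMAPHORE", "memory"),
    ("SOS_SCHEDULER_YIELD", "cpu"),
]


def category(wait_type):
    """Return the first category whose prefix matches, else 'other'."""
    for prefix, name in CATEGORIES:
        if wait_type.startswith(prefix):
            return name
    return "other"


def breakdown(waits):
    """Turn {wait_type: seconds} into {category: share of total wait}."""
    totals = {}
    for wait_type, seconds in waits.items():
        cat = category(wait_type)
        totals[cat] = totals.get(cat, 0) + seconds
    grand_total = sum(totals.values())
    return {cat: secs / grand_total for cat, secs in totals.items()}


def advice(shares, lock_cutoff=0.5):
    """Hypothetical rule of thumb: mostly locking means hardware won't help."""
    if shares.get("locking", 0) >= lock_cutoff:
        return "logical: look at locking/blocking and polling scope, not hardware"
    return "physical: check CPU, memory, disk, and network capacity"


# Hypothetical top waits (seconds) for one hour on an Orion database server.
waits = {"LCK_M_X": 30_000, "LCK_M_IX": 18_000, "PAGEIOLATCH_SH": 2_000,
         "SOS_SCHEDULER_YIELD": 1_500, "WRITELOG": 1_000}
shares = breakdown(waits)
print(f"locking share: {shares['locking']:.0%}")  # locking share: 91%
print(advice(shares))
```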
And something else that you can use this for, though, is that a lot of times, we need the documentation to backup why we need resources, right?
So when we can use this and say, "Memory and CPU were the number one waits, and they were at 80%," that gives us the go-ahead to request those resources and build out. If it's not, and it's on the locking and blocking side, then, if it's an in-house application, you can take that to a developer, and it actually helps them design better products.
So, that's pretty awesome in my eyes. It sounds like you're pretty much helping us to save money and wasted time.
So I wanted to take a little time to show a little bit about DPA, and how it shows the data. Especially to get an understanding of where to look for say, locking and blocking versus resource contention, and the usual wait time names that we see. And I mentioned here RPO and RTO no longer being met because that's important. If you see those wait times going up a little bit higher, you might start to not be able to meet your SLA. So, if somebody's running Orion, they want a report. And they want it, say, in 30 seconds. Maybe they need that report a little bit faster than you're able to give it. And that's information you can go look and say, "All right, why is it taking 30 seconds? Is it just, is there locking and blocking or is it waiting on some other resource?" It's going to help you kind of avoid that sloth because this is the sin we're talking about. Sloth, it's not just things getting slower but you're not able to get to the tasks you should be doing. And you can address a little bit more of that, how that relates to Orion itself.
And so this picture is kind of a funny play on alerts gone wild. [Laughs] So this is an inbox that is just full of actual emails that are going through there. But like you were saying earlier, if you're already getting all these and you're creating rules and getting— you're missing the action that's needed on the alerts. But if I'm constantly having to do this, I'm either going to do two things: One, I'm going to ignore these. Which, why are they there if I'm going to be able to ignore them? Or two, I'm not going to be able to see everything in there because my outside workload—because I'm diving into all of these alerts and I'm trying to fix all these little fires that are going out there. When my actual job of maintaining and viewing things is getting left behind.
And that's the thing that you don't want to do. And when you're managing your apps, but you're not managing your systems, it's kind of like you don't have the best of both worlds at that point. You can only do so much on one side before you become over-weighted. And everything you're doing here is kind of like just for the heck of it. Because it's not going to relay to the other side and balance things out. So we have to be able to have that, kind of those cycle speeds, and know where do we need to be at? Where are we going to stay at? So just that we're aware of everything that we're going to do and not get lost in our own creations of monsters.
Yeah, for someone like me, I had built my own in-house system. My team had put together our own little management system of monitoring, collecting data, things like that. And what happened over time was, I was more of managing that in-house monitoring system than I was actually administrating the data in those servers and being like a real DBA and tuning queries. I was just trying to keep up with the monitoring of all of it because I had done it in-house. That's when I realized, looking at a vendor solution made a lot more sense than me trying to build it all myself. But that's what led to the sloth, was that I wasn't able to do the tasks I should be doing. I should've been a DBA, not an application administrator. And of course, the cycle just repeats itself over and over. You fall into this trap, you think you're getting out of it, and you're really not. You're digging yourself a hole somewhere else. All right, next sin: wrath. Yeah. So change is hard, right?
Change is hard, kind of leads to frustration. You ask somebody, "I need you to do this." And maybe or maybe not, they have that look of despair like you see here. With wrath, of course, there's also when you're kind of being told, "Your system just isn't good enough." So, in my case of building my own in-house system, you get in a meeting and you find out whatever the root cause is, and they're like, "Well, why weren't you monitoring for that?" I didn't conceive to monitor for every possible thing. Oh, let me just start tracing for everything now.
Just in case. And this is how it all starts, because you have this wrath. You're like, "Don't tell me my stuff isn't good enough." "I can do this." And, but there's also this thing of, "Oh, maybe my stuff's not good enough, so now I've got to go buy something new or I've got to do things differently." And then that leads to that frustrating feeling that you're just not making any progress. So with Orion, that leads to a handful of— that shows itself on a handful of ways.
Yeah, so it's hard to build your own monitoring from the ground up. When you're not relying on things like we have, such as out of the box, or like industry standards ways of monitoring. So when you're trying to create that on your own, you can get lost. You can actually literally be working, like you said, on the application itself for weeks at time and you're not really getting anything out of it. And that is so frustrating for you. You just get mad and you just want to throw it away, and you want something different. I need to solve that pain point right now and my goodness, I'm going to find it right now. [Laughs] So you're always making changes, too. So, like custom properties. We love custom properties; they help you do so many things. But when you get to a point to where you're not doing the basic descriptions of them, and you just have all these custom properties that are out there, nobody's updating them when they add a node in, they're not actual data points, right?
You're not actually filling in your custom properties. You're not populating them every time. You're not doing your due diligence, so they're not being maintained, and they aren't as useful as you're hoping they'll be. And then there are overlapping alerts. That's where you have 10 alerts when one would actually do what you need it to do. You burn a lot of resources when you're throwing everything you've got into it. And so you dive into your monitoring until you've gone so far past reality that you're now looking for a way out of it.
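One way to spot the "10 alerts where one would do" problem is to group alert definitions by what they actually watch. The following is a minimal Python sketch under stated assumptions: `AlertDefinition` and `find_overlapping_alerts` are hypothetical names, and reducing an alert to a scope, metric, and threshold is a simplification, not Orion's alert schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    # Hypothetical, simplified alert record (not the Orion alert schema).
    name: str
    scope: str        # which group of nodes the alert applies to
    metric: str       # what the alert watches, e.g. "cpu_load"
    threshold: float  # value at or above which the alert fires


def find_overlapping_alerts(alerts):
    """Map each (scope, metric, threshold) watched by 2+ alerts to their names."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.scope, alert.metric, alert.threshold)].append(alert.name)
    # Only conditions covered by more than one alert are candidates to merge.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


alerts = [
    AlertDefinition("High CPU - Core", "core-routers", "cpu_load", 90.0),
    AlertDefinition("CPU Critical", "core-routers", "cpu_load", 90.0),
    AlertDefinition("Disk Full", "sql-servers", "disk_used_pct", 95.0),
]
for (scope, metric, threshold), names in find_overlapping_alerts(alerts).items():
    print(f"{scope} / {metric} >= {threshold}: {names}")
```

Real alerts have richer conditions than a single threshold, so a match here is a prompt to review, not proof that two alerts are redundant.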
And it's usually by being upset, right? You're upset, and you're like, "I'm going to find something to replace this with."
I like how you put that. I've overshot reality.
That's exactly it, definitely. And I've done it, too, especially having been here so long and helped other customers set things up. You're like, "We can do so much! I want to use every feature imaginable," right?
They didn't need every feature imaginable. Maybe they don't even have everything they say they do. You create noise. So learn to scale yourself back, to actually see reality and face it. I think that helps you a lot with anything in your monitoring and managing.
So the custom property management you've shown here can sometimes be a bit unwieldy.
Oh, yeah. It's kind of like a wild child. And that's what I'm saying. And that's like, I cannot right now. [Laughs]
No, I cannot.
I just can't. And it's because I've been there. I've helped people do this, and they'll sit there and try to use their custom properties with the best intentions. They're like, "Yes, we're totally going to do this!" And then you come back later and they're like, "Well, nothing's working. It's not catching this in my alert. It's not doing this." And you find out they've added a thousand-some nodes, and only one custom property got set up; the rest never got updated. When you're constantly managing that, it's a nightmare to go through. There are easier ways to do it. We do let you export and import the values, but you have to actually do it. It's one of those things where you just have to be able to maintain it.
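The maintenance problem described here, nodes added without their custom properties filled in, is easy to check for once you have an export of node properties. This is a minimal sketch, assuming the export has already been loaded into simple records; `Node`, `REQUIRED_PROPERTIES`, and `nodes_missing_properties` are hypothetical names, and the property names "Site", "Owner", and "Criticality" are placeholders for your own.

```python
from dataclasses import dataclass, field

# Assumed set of properties every node should carry; adjust to your own.
REQUIRED_PROPERTIES = ("Site", "Owner", "Criticality")


@dataclass
class Node:
    # Hypothetical node record built from an export; not an Orion API object.
    caption: str
    custom_properties: dict = field(default_factory=dict)


def is_blank(value):
    """Treat missing, None, and whitespace-only values as not filled in."""
    return value is None or str(value).strip() == ""


def nodes_missing_properties(nodes, required=REQUIRED_PROPERTIES):
    """Return {caption: [missing property names]} for incompletely tagged nodes."""
    report = {}
    for node in nodes:
        missing = [name for name in required
                   if is_blank(node.custom_properties.get(name))]
        if missing:
            report[node.caption] = missing
    return report


nodes = [
    Node("core-rtr-01", {"Site": "HQ", "Owner": "NetOps", "Criticality": "High"}),
    Node("edge-sw-14", {"Site": "Branch 7"}),
    Node("edge-sw-15", {"Site": " ", "Owner": "NetOps", "Criticality": "Low"}),
]
print(nodes_missing_properties(nodes))
```

Running a check like this after each batch of node additions turns "nobody updated them" into a short, concrete to-do list.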
So you said "frustrating," and that's the word that ties back to the sin of wrath, or anger. It's a feeling of frustration, and you're angry. And this is one of the ways you get there. All right, next sin: envy. Other tools always look better. You're like, "Oh, I really want what that tool's doing." And this will consume you, because you're never going to feel that whatever you've built, like the in-house thing, is finished. It's never going to be perfect. You can even buy a vendor product that allows a high degree of customization, custom properties, and all that, and there's still a feeling of, "Oh, I need it to do a little bit more." Or, "Wait, I just found this custom script on THWACK that this other guy was using. That looks really cool, so let me put that in, because I really want this other stuff." And it just snowballs from there. You're always comparing yourself, or your monitoring system, to what somebody else has. And you've really got to recognize that there's a difference. What they have is right for them. That doesn't mean it's right for you. And you can't have that envy of always saying, "Oh yeah, I've got that. Yeah, we monitor for that. Yeah, we do those same things." You really can't think that way. And this is how it mostly manifests itself with Orion.
Yeah. A lot of the time, you'll think your alerts just aren't adequate. So you're like, "But I've seen this," or, "I've read online that they have this stuff out of the box," or, "They're doing this." And you're like, "Yes! That's where I need to be!" And you feel you need more customizations, because you look over here and their charts look like this. Or you look over here and you're like, "Well, this is a pie chart and I want it to be a graph. Theirs is a graph. It makes more sense that way."
And so then you want to buy these additional tools, and you just keep branching out. But you start to spiral a little out of control. You can't just sit there and think, "I just need to buy everything." You could. But then you're back to what you were talking about, with the 27 different monitoring tools, and you're trying to figure out what is the right data and what is not. It all depends on what is going to work for you as the user. That's the only thing we can help with. And, well, I always say it as well: I don't care what you're using, as long as you're using something that works for you.
Yes, that's right.
So literally, all we're trying to convey here is: what is your pain point, and how can you actually solve it? That's what we're trying to focus on.
Absolutely. The point is, it doesn't matter what you're using, as long as you're using something, and as long as there's a way to get the help you need. That's really what the focus always is. All right, last sin. Number seven: pride.
Definitely. Putting your needs ahead of others'. Bragging about your systems: "Oh, my stuff does all that. Oh, I'm sorry your thing doesn't." But there's also a feeling of general hubris, right? That's pride. This idea that nothing's going to happen to us. Nothing's ever happened. I don't have to worry about this or that, because it's never happened before. It's never been a problem before. I think it really rears its head in a handful of different ways, not just in general data management but specifically in Orion. And you can talk to that.
Yeah, my biggest thing is the false sense of security. There are things we can automate within our product, such as automatic remediation. When you think about this, it scares some people, because they're like, "I don't want it to automatically do anything on my network." But when you calm it down a little, you're like, "Oh, but you can just start with banners, things of that nature," and you can slowly ease into it. But then there's that false sense of security. For instance, say you have an automatic remediation for traffic shapers or similar settings on your network that you want to make sure are applied. What if that gets out of control because you didn't get your customizations right? Maybe it's based on custom properties, or on something else that determines which nodes you're grabbing. You have to have confidence in what you're doing. You don't just set it and forget it; that's not the way to do any of your monitoring or managing. The more data, the more risk. I talk about this a lot, especially on the security side. When we store data, no matter whether it's hot or cold storage, the more information you keep, the more potential there is for somebody to want that information.
And you have to consider security on your backups just as much as on your active systems, because your information is in both places. So I always tell everybody: when you choose to accept more data, you need to accept the risk. You need to accept the resources it takes. You need to secure that data properly so it keeps its integrity. And then there's, "You've been lucky so far! Everything's just been fine!" [Laughs] You think. [Laughs]
We've never had a problem. We've never had a breach. Nothing's ever gone wrong. I've never lost data, right?
Exactly. Now, since we're talking about the actual automated remediation, I'm going to show you real quickly just the flow pattern of how NCM handles this.
All right, so Network Configuration Manager (NCM) helps you out with automatic remediation. But what I wanted to focus on is that false sense of security. I want you to understand how these work, so I'm going to dissect this a little and show you how we actually handle it within the client here.
So I'm going to go into the configs and then into compliance. You can see these are out-of-the-box reports we already have in here. They also help you as a basis for a security policy, because they're already built from Cisco guidance, the STIGs, PCI compliance, SOX, things of that nature. But to truly understand how automatic remediation comes into play: we're reacting to something that is or isn't there within a configuration. So when we go into, say, the security audit, I'm going to open up this report, and you'll see these at the top. These are your rules. When you think of these, think of a pyramid. I always like to take it back to basics a little, because it takes the fear of compliance reporting out of the equation. You have rules, which you can group up however you want, and each rule is for one specific thing. For example, I can click on this one, and it shows you that it's looking for whether "security authentication failure rate" between zero and three was not found. That is one specific thing it's looking for, as a rule.
You can take these and group them up, however many you want, into different policies. So if you're focused on password security, you can create a rule for each thing you want to look for, and then bring those rules into a policy. You can name that policy whatever you want for that group. Once you've done that grouping, you can pick and choose whichever policies you want and put those into your report.
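The rule, policy, and report "pyramid" can be modeled as three small data structures. The following Python sketch is only an illustration of that hierarchy, not NCM's internals or API: `Rule`, `Policy`, `Report`, and their fields are hypothetical, and the regular expressions are illustrative stand-ins for the kinds of checks mentioned in the demo.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One specific thing to look for in a device config (hypothetical model)."""
    name: str
    pattern: str             # regular expression, matched line by line
    must_exist: bool = True  # False means the pattern must NOT appear

    def passes(self, config: str) -> bool:
        found = re.search(self.pattern, config, re.MULTILINE) is not None
        return found == self.must_exist


@dataclass
class Policy:
    """A named group of rules, e.g. everything related to access security."""
    name: str
    rules: list = field(default_factory=list)


@dataclass
class Report:
    """A selection of policies evaluated against each node's config."""
    name: str
    policies: list = field(default_factory=list)

    def run(self, configs: dict) -> dict:
        """Return {node: [names of failed rules]} for every node checked."""
        return {
            node: [rule.name
                   for policy in self.policies
                   for rule in policy.rules
                   if not rule.passes(config)]
            for node, config in configs.items()
        }


auth_rule = Rule("Auth failure rate 0-3",
                 r"^security authentication failure rate [0-3]\b")
telnet_rule = Rule("No telnet on VTY", r"^[ \t]*transport input .*telnet",
                   must_exist=False)
report = Report("Security audit",
                [Policy("Access security", [auth_rule, telnet_rule])])

configs = {
    "rtr-01": "hostname rtr-01\nsecurity authentication failure rate 3 log\n",
    "rtr-02": "hostname rtr-02\nline vty 0 4\n transport input telnet ssh\n",
}
print(report.run(configs))
```

Keeping each rule to one check makes a failed report easy to read: you see exactly which rule failed on which node, instead of one opaque "non-compliant" flag.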
Now, the great thing is that we let you be that customizable, although, as we were saying, customization sometimes gets a little crazy. The most common thing I see with people who use compliance reports is that they get hung up thinking they have to create one massive report with all of these different config checks for it to be valid. You don't have to do that. You literally just go from rules into policies, and from policies into your actual reports. As you can see here, these are my individual rules, and up here I have the policies they sit under. Together, these make up the report. When you click into these, you can see that we can execute remediation on all of the nodes, or on this node in particular. Now here's something I like, tying back to when you were talking about locking and blocking. With automatic remediation, the way it actually works is that you run this report as a job. If the report has an automatic remediation, it needs to do something to the device, and then it's going to re-download the config. So if you're doing maintenance at night and it's trying to push configs back out at the same time, you need to be aware of the transactions your database, and especially your Orion database, is actually running. Know your maintenance windows, and don't schedule jobs and tasks inside them, because you'll cause yourself a problem with waits and suspended sessions. It's just something to keep in mind.
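The advice to keep scheduled jobs out of maintenance windows boils down to an interval-overlap check. The following is a minimal sketch, assuming you know each job's start time and a rough expected duration; the job names, times, and the `conflicting_jobs` helper are all hypothetical.

```python
from datetime import datetime, timedelta


def overlaps(start_a, end_a, start_b, end_b):
    """True if the half-open intervals [start_a, end_a) and [start_b, end_b) overlap."""
    return start_a < end_b and start_b < end_a


def conflicting_jobs(jobs, windows):
    """Return the names of jobs whose expected run time overlaps a maintenance window.

    jobs:    list of (name, start datetime, expected duration as timedelta)
    windows: list of (window start datetime, window end datetime)
    """
    conflicts = []
    for name, start, duration in jobs:
        end = start + duration
        if any(overlaps(start, end, w_start, w_end) for w_start, w_end in windows):
            conflicts.append(name)
    return conflicts


windows = [(datetime(2024, 5, 1, 1, 0), datetime(2024, 5, 1, 3, 0))]
jobs = [
    ("Nightly config backup", datetime(2024, 5, 1, 0, 0), timedelta(minutes=30)),
    ("Compliance report with remediation", datetime(2024, 5, 1, 2, 30), timedelta(minutes=45)),
    ("Daytime compliance report", datetime(2024, 5, 1, 10, 0), timedelta(minutes=45)),
]
print(conflicting_jobs(jobs, windows))
```

Using the expected duration, not just the start time, matters: a job that starts just before a window but runs into it still competes with maintenance for the database.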
So when we use these, we can go back through here and choose either all of the nodes or just one of them. And you have to think about it the way you were describing, that false sense of security: "I'm just going to do this. Nothing's going to happen."
Well, if you don't understand the nodes you're grabbing, what if you're sending a script to devices that don't need it? You want all the data, and you want everything that's going through there. However, you don't want to just click and say, "Execute it on all nodes," without verifying that those nodes actually need it. That's the false sense of security: you don't want to just rely on it and assume it's going to do exactly what you have in your head.
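One habit that counters "execute it on all nodes" is a dry run: compute which nodes actually fail the check, and review that list before anything is pushed. This Python sketch assumes you have each node's current config as text; `plan_remediation` and `has_password_encryption` are hypothetical helpers, and the node names and configs are made up. `service password-encryption` is a real Cisco IOS command, used here only as an example remediation.

```python
def has_password_encryption(config: str) -> bool:
    # Illustrative check: is the "service password-encryption" line present?
    return any(line.strip() == "service password-encryption"
               for line in config.splitlines())


def plan_remediation(configs, check, script):
    """Dry run: list which nodes fail `check` and would receive `script`."""
    targets = sorted(node for node, config in configs.items() if not check(config))
    skipped = sorted(set(configs) - set(targets))
    return {"script": script, "targets": targets, "skipped": skipped}


configs = {
    "rtr-01": "hostname rtr-01\nservice password-encryption\n",
    "rtr-02": "hostname rtr-02\n",
    "sw-03": "hostname sw-03\n",
}
plan = plan_remediation(configs, has_password_encryption,
                        "service password-encryption")
print("Would push:", plan["script"])
print("To:", plan["targets"])
print("Skipping (already compliant):", plan["skipped"])
```

Nothing in the sketch touches a device; the point is that a human looks at the target list before approving the real push.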
Because they'll always be perfect.
Oh, always. Definitely always. So that's the main point about remediation with our compliance reports. And the best way to run these, as I usually tell everybody, is to schedule your compliance reports for when you're actually at work, not at night, especially when they have an automatic remediation. With automatic remediation, you want to know what happened, right? Did it go through? What was going on? And you want to be there if something went wrong. So plan how you're going to maintain your network, and don't assume that setting it and forgetting it is the best solution, because sometimes that can land you in a bad spot.
Right, because if you figure, "Oh, I'll let the report run overnight," which is a common thing, and it fails, then the next day you're going to spend the time fixing it anyway. So you might as well run it while you're at work. It's a more efficient use of your time, and of the server's time in terms of the processing happening inside the database engine itself. So, thanks for the demo on remediation. Before we leave, I'd like to go through a few tips and tricks, not just for identifying the sins but for getting around them and making life a little better, if you recognize that you might be committing some of them. First, I always remind people: focus on data recovery. For DBAs, and even network admins, recovery is job number one. If you can't do that, you really can't keep your job. So think about things in terms of RPO (recovery point objective), RTO (recovery time objective), and MTTI, Leon's favorite: Mean Time to Innocence. If you find that you're falling short on any one of these goals, chances are one of the seven deadly sins is rearing its head. Treat every piece of data as if your business depends on it, because it does.
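RPO and RTO can be checked with simple arithmetic: the RPO is at risk when the newest good backup is older than the data loss you can tolerate, and the RTO is at risk when your last measured restore took longer than the downtime you can tolerate. This sketch assumes you track those two facts; the `recovery_gaps` helper and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_gaps(last_backup, measured_restore_time, now, rpo, rto):
    """Return which recovery objectives ("RPO", "RTO") are currently at risk."""
    gaps = []
    # RPO: a failure right now would lose everything written since last_backup.
    if now - last_backup > rpo:
        gaps.append("RPO")
    # RTO: the last measured restore must fit inside the allowed downtime.
    if measured_restore_time > rto:
        gaps.append("RTO")
    return gaps


print(recovery_gaps(
    last_backup=datetime(2024, 5, 1, 6, 0),
    measured_restore_time=timedelta(minutes=45),
    now=datetime(2024, 5, 1, 12, 0),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=1),
))
```

Here the backup is six hours old against a four-hour RPO, so only the RPO is flagged. The measured restore time only means something if you actually test restores.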
Without data, your business does not exist. It's that simple. Archive the data you don't use. Archiving doesn't mean you just put it into a different database on a different server. It means let it go. It's okay. Put it somewhere else, on tape, I don't care. Put it somewhere over there, and if you need it, you can recover it. You don't need every bit of data online all the time. Are there some businesses where that might be true? Yes, but that's rare; it's not the norm. For most people, especially people using Orion, you don't need five years of detailed data. You need a few months' worth of detail, and after that, the summary should be good enough. If you're that worried about it, you have backups stored somewhere, and you can go get them, recover the data, and run a report from there. Make sure your storage can keep up with the demands you're putting on it. You can't escape all of the sins all of the time. Every now and then, you fall into one. Yes, I do need to go and lust after that big piece of data, and you're going to go get it, and you're going to have to do your work with it. But make sure the storage can keep up with all the demands these sins put on it, because they crop up time and time again. It's really easy to fall behind. These days, when I look at customers' systems, I see the storage slowing down over time. And then, of course, they want to go buy flash, because that'll solve everything.
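"A few months of detail, then summaries" is a retention policy you can express directly. The following is a minimal sketch, assuming samples arrive as (timestamp, node, value) tuples. The 90-day cutoff, the daily-average rollup, and the `apply_retention` helper are illustrative assumptions, not Orion's actual retention settings.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def apply_retention(samples, now, detail_days=90):
    """Keep raw samples newer than the cutoff; roll older ones up to daily averages.

    samples: list of (timestamp datetime, node str, value float)
    Returns (detail, daily) where daily maps (node, date) -> mean value.
    """
    cutoff = now - timedelta(days=detail_days)
    detail = [s for s in samples if s[0] >= cutoff]
    buckets = defaultdict(list)
    for ts, node, value in samples:
        if ts < cutoff:
            buckets[(node, ts.date())].append(value)
    daily = {key: mean(values) for key, values in buckets.items()}
    return detail, daily


now = datetime(2024, 5, 1)
samples = [
    (datetime(2024, 1, 10, 8), "rtr-01", 40.0),
    (datetime(2024, 1, 10, 20), "rtr-01", 60.0),
    (datetime(2024, 4, 30, 12), "rtr-01", 55.0),
]
detail, daily = apply_retention(samples, now)
print(detail)
print(daily)
```

The two January samples collapse into one daily average of 50.0, while the April sample stays at full detail. The raw rows would still exist in a backup if you ever needed them.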
Of course, that costs money and there's greed.
To be fair, I asked you that one time.
I remember that. I literally texted him and said, "Hey, if I can get flash over here, wouldn't that speed everything up even if the queries are bad?" He was like, "Oh God." [Laughs]
Would flash help if I have locking and blocking? And the answer is no. So, you've always got to know when...
But now I know.
Now you know.
I didn't know that before.
And that leads to this: know when to throw hardware at the problem. Maybe flash could be the answer, if you know you have a disk bottleneck. You shouldn't just guess at that, though. You should look at the queries, see what's happening, see what the waits are, see whether it's locking and blocking, and then see if you can throw the right hardware, at the right time, at the right problem, to make things a little better for your Orion data. So, thanks for joining me, Destiny. This was fun, right? Talking about sins. And thank you to all the THWACKcamp viewers out there.
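Checking the waits before buying flash can start as a rough tally of where wait time goes. In SQL Server, `PAGEIOLATCH_*`, `WRITELOG`, and `IO_COMPLETION` waits point toward storage, while `LCK_M_*` waits point toward locking and blocking, which faster disks will not fix. The following is a simplified first-pass sketch, not a full wait-stats methodology. The numbers are made up; real figures could come from the `wait_type` and `wait_time_ms` columns of `sys.dm_os_wait_stats`. `classify_waits` and `hardware_hint` are hypothetical helpers.

```python
DISK_IO_WAITS = ("WRITELOG", "IO_COMPLETION")


def classify_waits(wait_ms):
    """Sum wait time (ms) into broad buckets by SQL Server wait type."""
    buckets = {"disk_io": 0, "locking": 0, "other": 0}
    for wait_type, ms in wait_ms.items():
        if wait_type.startswith("PAGEIOLATCH_") or wait_type in DISK_IO_WAITS:
            buckets["disk_io"] += ms
        elif wait_type.startswith("LCK_M_"):
            buckets["locking"] += ms
        else:
            buckets["other"] += ms
    return buckets


def hardware_hint(wait_ms):
    """Rough first-pass hint on whether faster storage is likely to help."""
    buckets = classify_waits(wait_ms)
    if sum(buckets.values()) == 0:
        return "No wait data: measure before buying anything."
    top = max(buckets, key=buckets.get)
    if top == "disk_io":
        return "Mostly disk I/O waits: faster storage such as flash may help."
    if top == "locking":
        return "Mostly lock waits: flash will not fix blocking; tune queries and transactions."
    return "Neither disk nor locking dominates: investigate further."


waits = {"LCK_M_X": 120_000, "LCK_M_S": 30_000,
         "PAGEIOLATCH_SH": 20_000, "SOS_SCHEDULER_YIELD": 5_000}
print(classify_waits(waits))
print(hardware_hint(waits))
```

With these sample numbers, lock waits dominate, which is exactly the "would flash help with locking and blocking?" case where the answer is no.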
Well yes, and we should do this again. But next time, can we please have a confessional?
Oh, I think so. We should have a budget to build a little bit extra into this set. You know, we can get that. And as a matter of fact, I think we already have that. Isn't it called our annual performance review?
That's a good point.
[Laughs] I'm Thomas LaRock, and thanks again for joining us today.
And I'm Destiny Bertucci. Thanks.