Hey, what the frill is that?
Oh, that's—I don't know, it just shows up at the same time every day.
Tom, it's an alert.
How long has it been there?
Maybe a month?
Tom, you can't ignore alerts. You have to find out what they want, you have to deal...
Look, look, you touch it, you own it, okay? You want to take it home? You want to cuddle it? You want to name it? You want to feed it? It's yours. If it's an alert, I don't need it. I've got enough to do with my job already. I can barely handle the ones I have now.
Oh, we need to deal with this right now. Hi, I'm Leon Adato.
And I'm Thomas LaRock. And welcome to SolarWinds Lab. And on today's episode, Leon will lecture me continuously on why ignoring alerts is bad.
Okay, not just lecture, we're also going to both share ways to make alerts more meaningful, so that they're less likely to be ignored in the first place.
And will you be using that laptop?
Yes, yes. So, there'll be plenty of hands-on. Which reminds me: you should have your hands on your keyboard right now and be talking with us in the chat box you see over there past the alert.
Yes, and if you don't see the chat box, that's okay. It just means you aren't watching this episode live. If you want to watch us live, head over to lab.solarwinds.com where you can sign up for reminders about upcoming episodes. You can also offer us suggestions on what we can cover in the future.
That's a bacon sandwich, isn't it?
So, to start off, before we get into the hands on, I want to just tee up: why do alerts get such a bad reputation? Why are so many organizations just labeling it all as noise?
Well, for me, one of the biggest reasons I would consider alerts to be more noise than anything else is when as a DBA, I buy say, a certain set of tools, and out of the box, you would get all the alerts. Because, I mean, the tool provider, they don't know what your systems are, so they want to make sure you are notified about things. That's why you bought the tool.
And I think some tool vendors also like to show off. And so they turn on some of the most complicated alerts that are so sensitive, but they want you to see what you're able to do. And so they turn them all on.
And then you spend a lot of time, once you get that system in place, the first thing you have to do is, figure out noise and signal and turn things off. You spend so much time turning things off or adjusting the thresholds.
Right, and that leads to, I think that the details here are that alerts are often considered noise because you get too much of the same alert. You know, it's the same one, and that can be because of what's called saw-toothing where it's up, it's down, it's up, it's down, it's up, it's down. Because the threshold is too sensitive or not tuned correctly or a few other things that we're going to get into in just a minute. I also think, despite the showing off of "look at how fancy our alert can be," many default alerts don't have enough detail in them. So you get server error at 5:15, or not even at 5:15, just server error and you just have your inbox filled with those.
Right, yeah, just so an alert that scrubs for an error log and says anytime an entry is written just send an alert, but sometimes error logs are also informational logs.
Right, and to me there's a big difference between an alert, which is something I consider requires action, versus just a piece of information.
Right, or again, going back to badly tuned, I did have one customer once who said, "I just need to know when the word error appears in this message." Not realizing that every five minutes, the application would report "no error found." [LAUGH] And so every five minutes, he'd be getting a ticket until we finally turned that one off.
I also think that, what I get a lot is, I'm working with somebody and I say okay, so this is a good ticket, this will work for you, and he says well, "I don't actually do anything when I get the first ticket. I wait for three of them to come in in half an hour, and then I know there's a problem."
Yeah, I've seen that as well. I've also seen where people use alerts in order to give themselves information about a particular system. So, if you have this entire, say, ETL or a batch load or process.
Well, once you hit the end of phase one, alert me that things are continuing to work.
Yeah, everything good, everything's still good.
No, that's not an alert.
Yeah. And that's a key thing that you want to know, as monitoring professionals, is there's no such thing as an FYI. I am diametrically opposed to the idea of FYI alerts. An alert has to have like you said, an actionable response. A human being actionable response. If there's no human response, there's not... I'm not saying that you can't send out informational messages, and when we did that episode on Slack, integrating SolarWinds in Slack, that's a wonderful channel to send an associated team.
You know, remote site goes down. The server team wants to know that their systems are unreachable. But they don't have responsibility. And I always tell people, it's like you're in a race. There's a baton. Only one person can have the baton and if the remote site is down, the network team has the baton. You don't have it. You're just on the sidelines watching. Fix it faster, take care of it. But you are allowed to know, but that's not a ticket, so it's important to know. So what I want to do now is, I want to get into some ways to avoid some of this stuff.
So let's go ahead and dive in.
So, the first thing I want to get into to resolve some of those issues is a time delay. It's a really simple way of introducing a little bit of lag. And we'll talk about why, but first I want to go into the how.
So, by time delay, you mean, don't send me an alert for another five minutes?
Well, not exactly, because it's not a "don't bother me." It's slightly different than that. So here, we're looking at a fairly straightforward alert: it's just a disk, a volume alert when the percentage used is greater than or equal to 90. The problem is when you have a disk that's just riding that 89 to 92 and it's just a little bit up and down. So you're going to start triggering alerts all the time. So I don't want to know immediately, I want to know when it's been that way for a period of time.
When it's been greater than 94.
Some of that.
A sustained amount for a period of time.
Right. So, you know, in Orion, you just check off that box. Condition must exist for more than 10 minutes. That sounds like a...
10 straight minutes?
10 straight minutes, that should be fine.
No, it's going to be really a problem and here is why. It seems logical, but what I happen to know is that Orion's collection for disk metrics is every 15 minutes. So what have I done?
Right, so it's going to poll a sample, say, every 15 minutes. So that ten minutes, the condition...
All I've done is I've said it's the same data, just wait ten minutes before telling me.
That's right, so ten minutes could exist in between those two polls.
It is almost by definition going to exist between those two polls.
It could be 14 minutes long that was sustained, and you would miss it.
Correct, exactly. So, here's the thing: you need to understand what the polling cycle is for the element that you're monitoring, because you're not really saying time. What you're really saying is, "How many samples have I collected?" "I want it to persist for that many samples."
You know what'd be great? Is if, when you try to set this, the application would come back with a message reminding you what the polling cycle is.
Now, for simple metrics that would work and I'm certain that our UX team has considered that or is considering it.
They are now.
However, when we have multiple conditions where they're on different polling cycles, then you'd have to take the greatest one. But that's a different issue. So here, knowing that it's 15 minutes, I'm going to set it for 16. I'm going to say that no matter what, every 16 minutes. Now that's good, and that will keep me from having this saw-toothing effect that we talked about earlier.
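The arithmetic behind choosing 16 minutes over 10 can be sketched in a few lines of Python. This is only an illustration: the function names are made up, it is not how Orion evaluates delays internally, and it assumes polls arrive at a fixed interval with the first bad poll at minute 0 and a strict "more than N minutes" delay.

```python
import math


def polls_covered(delay_minutes: int, poll_interval_minutes: int) -> int:
    """Number of polls that must all be bad before the delay is satisfied.

    Polls land at minutes 0, I, 2I, ... and only those strictly before the
    delay expires can contribute, which works out to ceil(delay / I).
    """
    return math.ceil(delay_minutes / poll_interval_minutes)


def min_delay_for(polls: int, poll_interval_minutes: int) -> int:
    """Smallest whole-minute delay that forces `polls` consecutive bad samples."""
    return (polls - 1) * poll_interval_minutes + 1
```

With a 15-minute polling cycle, a 10-minute delay is covered by a single sample, so it only postpones the same data point, while 16 minutes forces a second sample to agree.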
But how big of a drive is this?
So that's interesting, because this is a generic alert, the drive could be anything. It could be...
It could absolutely be terabytes.
So say a two terabyte disk, this would mean there's 200 gigs of unused space and you would be getting triggered on.
Yeah, but it's 90% used, isn't that...
200 gigs is still kind of a lot, don't you think?
I believe it is. So, a solution, and it's not a particularly difficult one, but you have to consider it, is to add another condition. So we're just going to add another condition here, simple one. We're going to say that the volume and obviously I've done this a couple of times so, I'm going to do the let's see, not the percent used. Capacity, there we go.
Oh, a little search for it? I love this search option. There's so many fields that you can pick from now, that being able to pick. There we go, available space.
Mm-hm, there you go.
There we go, is less than or equal to, you want to say 100 meg?
Sure, that seems very low though. Let's do 10 gigs.
10 gig, okay.
Sure, on a 2 terabyte drive, let me know when there's only 10 gigs left.
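As a rough sketch of the combined condition being built here, in Python rather than in the alert editor: the function name and defaults are illustrative assumptions, and gigabytes are counted in decimal bytes.

```python
GB = 10**9  # decimal gigabyte (assumption; the product may count differently)


def volume_alert(percent_used: float, bytes_free: int,
                 percent_threshold: float = 90.0,
                 free_threshold: int = 10 * GB) -> bool:
    """Trigger only when the volume is both proportionally full
    and short on absolute free space."""
    return percent_used >= percent_threshold and bytes_free <= free_threshold
```

A 2 TB volume at 90% used still has about 200 GB free, so it stays quiet; the same volume with 8 GB left would trigger.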
There we go, 10, wait, million, billion. So that would keep us from alerting when it's 90%. But we have plenty of space, so we're okay. That's good. Now, it's a little bit tricky to talk about with disks, although you can if you have an array or some sort of collection of disks. But it makes more sense when you're talking about clusters or multiple devices but there's an option under advanced.
Advanced options, not just condition must exist, but also, the alert can be triggered if more than, we'll say five objects, ten objects, of the same type have met the specified condition. Now again with disks, it's a little shaky, but imagine, I mean, the example I come up with from the network side is access points. If one access point in this big office complex is down, I really don't care. I certainly don't care enough to get a ticket, because I've made sure that I have enough overlapping spaces. But, if five access points in the same area have gone down, now I need to go look at it.
I don't know if you have anything from the database side to...
Well, in terms of that, you would think of things, like you mentioned, a cluster, you would think of nodes, things like that. But for database side in data, it's more about how critical is that particular system? So if you only had a two node cluster, you probably, it wouldn't matter what's on there. But you could have say, a three or four node, but if it's your trading system, you might want to know that one of those nodes is down.
So I don't think it's an equal comparison there, but yeah, there's certainly a lot of places where you might want to think to yourself, you need to know more than just one of those objects and now raise an alert.
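The "so many objects of the same type" idea can be sketched like this, using the access point example. The data shape, the function name, and the choice of "at least five" as the cutoff are assumptions for illustration only.

```python
from collections import Counter


def areas_to_alert(down_access_points: list[tuple[str, str]],
                   min_down: int = 5) -> list[str]:
    """Return the areas where at least `min_down` access points are down.

    `down_access_points` is a list of (area, access_point_name) pairs.
    """
    per_area = Counter(area for area, _name in down_access_points)
    return sorted(area for area, count in per_area.items() if count >= min_down)
```

One access point down in an area returns nothing; five down in the same area returns that area, which is the point where a ticket makes sense.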
Yeah, and this also can be used for different levels of criticality. You can have a basic alert that has a sev five ticket, but then if more than so many devices are affected in an area, you know that you have a wider spread issue. So that's another way to, again, reduce the noise, that's what we're talking about here is just focus in. So I'm going to take a step a little bit of a step to the left and you don't have to move, to take a look at a different way of aggregating. So here, I'm actually in SAM, I've got a SAM template. And the feature, there's two features I want to look at. One I just want to mention and the other one we're going to look at. The one I want to mention is that we have thresholds we can set for warning and critical. And that's going to be important later on, in a later demo. But I think that's important, being able to specify on a per device, per application, per instance level what critical and what warning is. That's nice. But here, when does it become critical or warning? Well, maybe a single poll, that's the default, but you could also say X consecutive polls. Now, that is equivalent to what we were doing before with the delay.
Which is, how many, whether you say minutes, and I know that it's a five-minute polling cycle. So, if I say 10 or 11, that's two polling cycles, here I can be very specific about it. But the one I love is the one below that.
Yeah, I just saw that one, too.
X out of Y.
Yup, X out of Y.
So, I want to know when it's 3 out of 5. Now that doesn't mean 3 consecutive, right? It could be 1, 3, and 5. But in that period of time, if I've lost, in that many out of the total, now I have an issue. And that's good if the issue you're looking for is intermittent but important.
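A minimal sketch of X-out-of-Y evaluation, assuming each poll boils down to a simple bad/not-bad result. The class name is hypothetical and this is not SAM's implementation, just the sliding-window idea.

```python
from collections import deque


class XOutOfY:
    """Track the last `y` poll results and report when at least `x` were bad."""

    def __init__(self, x: int, y: int) -> None:
        if not 0 < x <= y:
            raise ValueError("need 0 < x <= y")
        self.x = x
        self.window = deque(maxlen=y)  # oldest result drops off automatically

    def record(self, bad: bool) -> bool:
        """Add one poll result; return True if the threshold is now met."""
        self.window.append(bad)
        return sum(self.window) >= self.x
```

Feeding it bad polls at positions 1, 3, and 5 of a five-poll window trips a 3-out-of-5 detector even though no two failures were consecutive.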
So, that's the one that I really like. So, these are some ways to help reduce the noise. Now, I want to go back and talk about another issue in just a second. But for now, let's see if we can tune this up a little bit more. So now that we've gotten this tightened up and we have some stuff to show in just a minute, I want to step back and say that one of the soft skills for a monitoring professional is your ability to interview the requester.
Interview or interrogate?
Well okay, yeah, it's a little bit of both. [LAUGH] There's, what you need to do, remember that the person coming to us doesn't really know what we do. As much as when we go to them, if it's the database team or the network, people come with requests, they're coming because they need help. Otherwise they'd do it themselves.
People come to us, they don't know anything.
That's not true.
No, they don't know anything.
They know their thing.
They know their thing, but it's up to you to be able to understand everything about it. They just know the name of an application which may or may not mean anything to me. I know servers, I know instances, right? I don't necessarily know application x by name, but they do and that's all they know.
Right, and their job depends on it. But they know how to do their job, and they know what is it that they're trying to catch. They know the condition they're looking for. So one of the first things that I do is I ask the person, whether it's a, let's say a server admin or network engineer, I say, how do you know that there's a problem?
Oh, my phone rings.
And it has that name.
And that person...
That one, that...
Because nobody ever calls to say everything's working well. They only call when things are bad.
They don't call you on a Saturday morning and say, just want to let you know the batch job, it was so good.
Nope, it does not happen.
Right, and if you happen to have those kinds of calls, please let us know in chat because you are an anomaly. [LAUGH] But yeah, right. So that's how you know there's a problem happening, but what I mean is, you've gotten the call, now you jump on the system. How do you know that it's this problem? This is the issue right now. Oh, I run this command, or I look in this error log and I see this or whatever. Obviously, when they come to you for a request, they have a process. Even though they may not think that that process is important for us to know as the monitoring engineers. So you need to start to pull that out of them. So you get that call, and then you jump on the box, and then what do you do? That's the first thing.
Yeah, jumping on the box is always going to be the first thing. And then what do you do? And it's more of an investigation validation that the alert is actually whatever it was for, disk space or something like that. Let me go to the box and let me see what the disk looks like.
Right, and what you're doing there and the notes that you're taking is that you're going start to collect: oh, these are the metrics, these are the data points I should be monitoring on an ongoing basis. So that now that I have them, I can say okay, so you said you look here and here and here, and if I see that this 50 or this is not cleared, or whatever the message. Now you know what you're looking for is your trigger conditions. But then I ask another question. How do you know it's all better? [LAUGH] They stopped calling. [LAUGH]
Right, I was just going to say well, yeah, if your phone stops ringing, that's when you know things are better.
Right, so you want to know. It's important to know, especially for the next demo we're going to get into what all better looks like, because it may not be the problem is all gone. It may be like if this thing is at 50, it's bad. But it's not all better until it's 10.
It's not that it's not 50 anymore.
Or 60 or 70, or whatever it is. So how do you know it's all better? Again, you want to just ask them experientially. So you jump on the machine. And you look at this file. And when you see this, you know the issue has cleared up. And, unfortunately, sometimes they say, "I don't know, I just know when it's bad." So it's going to take a little bit of testing or time. Then, the last thing I ask is: so, what do you do about it? You've gotten on the box, it's a problem, it's an actual persistent issue, now what do you do?
Well, if you can, or if you have to, you will pass that baton to somebody else. [LAUGH] Or you take the action you need to do.
But I mean obviously, for me as a database administrator, there's only so much I would actually be responsible for. For example, if I knew the issue was at operating system level, that's not for me. That's for the server team to fix.
I manage a piece of software that sits on that operating system.
Right, but here the idea as a monitoring professional is that they come to you and you say, "Oh, that is an OS issue." So now I know that the...
So you shouldn't actually get the call at all. That's the whole point, is to reduce the noise. Not just the noise to the recipients who are valid recipients, but also the noise to non-valid recipients. It goes back to the issue of the remote site is down. The server team actually doesn't want to hear about it, they don't want to know at all. Because the people in that remote site, they can use the servers, they're okay, it's the network team that needs to know. So this goes back to what do you do about it? You are the logical recipient, so maybe if it's disk space, you clear the temp directory. If it's a service, you might try to restart the service.
I mean, the responses are many, and varied, and wonderful, and beautiful. But the point is, is that it leads to well, maybe I should do that. I got some scripting skills, I can do something, right? Maybe if your action is to clear the temp directory, how about I do that for you and then I only let you know if that didn't work?
Right, so the alert triggers an action.
To be done for you.
Right, so then at two o'clock in...
What do they call that?
They call that me staying in bed. [LAUGH]
Right, at two o' clock in the morning, I want the robot that noticed it to also fix it. And only if those two things don't work, then you get your most expensive resource.
Mm-hm. Which is the human, not copy paper, like Dilbert says. It's the most valuable resource, is your humans. You want to only get the humans moving when there's no other automatic actions to take. So let's take a look at how we can do some of that.
All right. So we're back in the same alert that we started off with.
This is our disk alert. We've got our two conditions here.
But actually, I'm going to move to the reset condition area. Now again, the issue here is that "all better" may look different from the trigger condition, and this also creates a lot of noise where the alert triggers but then it's back and then, again, you have the saw-toothing. So this is another way to reduce noise in a few different ways. Now, the reset condition tab has a few different options. They have changed with the web-based alert module since the Win32 version. So it's useful to go over and to know what your choices are. The first, and default, option is "reset this alert when the trigger condition is no longer true." If I have it set at 90, 89 is no longer true. That's good, but again, you're going to get that little wobble every time. Even with the delay that we inserted, now, okay, now your frequency is a little wider. But it's still up and down and up and down. So there's some other options. So for example, you can reset the alert automatically after a certain number of minutes, or hours, or days, or whatever you want. For this particular alert, I wouldn't recommend it. Although it would work well for say, a log file where there is no reversal. And if you know that you tend to get a cluster of messages in a tight time period, we get ten of these messages in five minutes, and it doesn't happen again, this is a way to avoid triggering multiples. You can have no reset condition, so it just triggers every time. And in certain conditions, you do want it to trigger each and every time. No reset action, meaning you have to manually say, this is all better. You have to manually do it. But the one that we want is a special reset condition, which actually isn't so different. It's just, you know, you want to set the conditions manually. I'm going to add a new condition. So I'm going to do a volume alert. Because this was a volume reset, rather, because this is a volume trigger.
And here, what I can do, is I can say the volume capacity, percent used is less than or equal to, so far so good. You've got to remember our trigger was 90. It's not all better until it is 75.
Now, what that means is that bad is 90%, but it is not good again until it's...
Until it gets less than 75.
75, so now you've specified that in between that range I don't want to hear about it. It is not fixed as far as I'm concerned.
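The separate trigger and reset thresholds amount to hysteresis. Here is a small sketch under the assumption that each poll delivers one percent-used value; the class and its return strings are invented for illustration and do not mirror the product's internals.

```python
class HysteresisAlert:
    """Alert that triggers at one threshold and resets at a lower one."""

    def __init__(self, trigger_at: float = 90.0, reset_at: float = 75.0) -> None:
        if reset_at > trigger_at:
            raise ValueError("reset threshold must not exceed trigger threshold")
        self.trigger_at = trigger_at
        self.reset_at = reset_at
        self.active = False

    def update(self, percent_used: float) -> str:
        """Feed one poll; return 'triggered', 'reset', or 'no change'."""
        if not self.active and percent_used >= self.trigger_at:
            self.active = True
            return "triggered"
        if self.active and percent_used <= self.reset_at:
            self.active = False
            return "reset"
        return "no change"
```

A volume wobbling between 89 and 92 triggers once and then stays quiet; nothing further happens until usage falls to 75 or lower, after which a new climb to 90 counts as a fresh alert.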
You could also add that time delay. So now, we said that we wanted it to be bad for 16 minutes, for two polling cycles, but I could say that I want it to be better for 31, which effectively is three polling cycles. Two 15 minute intervals plus one, so now I'm saying ‘bad’ for 2, 90% and you can keep on going, keep on adding. So that is another way to reduce the noise. I'm not sure about the database implications here.
Oh sure, everything you are talking about here, usually what you are talking about with alerts. When we talk about databases, it's almost always a query performance thing. And query performance is almost always going to be tied to some sort of physical resource bound, like memory, disk, CPU, network. And with SQL Server, you might want to say locking and blocking. But just in terms of databases, as I watched you go through this, the first thing that came to my mind was just a matter of, well, in query performance, maybe I'm having an issue because the plan I'm using right now has changed a little bit. Well, why has the plan changed? Well, maybe my plan cache is full and my plans are aging out much faster than what I was expecting. So, in this case, you might want to say all right, alert me if the oldest plan in cache is less than 15 minutes.
So, I flush through my cache fairly regularly, but don't reset this until the oldest plan in cache is more than an hour. And that way you get an idea of the workload and how things are going, so there's certainly ways that you could translate and use this type of alerting for strictly a database or a query performance environment.
Very good, and what we're going to take a look at later is the opportunity where you can keep re-alerting if the condition persists, also. Because one of the first things that, when you start doing this, the argument I hear is, "well, wait a minute, I sent a message and I created a ticket. The humans closed the ticket, but it wasn't all better."
So I never got another ticket because it was never actually fixed. So you want to make sure you understand, just like you have to understand your polling cycles, you have to understand the human interactions between the ticket system. So that if a ticket is closed and then the problem persists that is flagged in some way, and there are some automation tricks to do it.
There's also the case where a person fixes something here, knowing that this other thing might break. But that's okay, because when that breaks, then they'll go fix that, which will then break the first thing again.
And they keep going back and forth. But boy, they're fixing things all day long.
That's great.
Yes, they are their own little temporal loop.
[LAUGH] Right, they're just spinning around. That's because IT pros, I never want to do that. Okay, good, and the next part, monitoring, will help highlight that, so we can help that person, we can get them therapy so they don't have to do that. All right, so I mentioned automation earlier and I wanted to go on, so the next thing after you set up this delay, so now you have a high trigger, a lower reset, so they're all better, is the last of those questions we talked about earlier, which is, what are you going to do about it? So, the trigger action, of course the first trigger action that everybody thinks of is sending an email or cutting a ticket.
But, again, that's why you ask about automation. What are you going to do about it? When this alert occurs, what do you do? Oh, well, I reset the service, I clear the cache, I reset the IIS application pool, I do whatever. So how about I do that? So here, I set up this alert. I'm just going to jump into it. This is an action, okay? And execute a script.
A VB script.
Yeah, [LAUGH] well okay, good.
You can execute other ones, we're going to look at that in a minute.
I have already built this tempclear.vbs that clears the temp directory on the remote machine. I'm passing it the name of the machine there. So that's the only parameter the script takes. So this is the action I'm going to execute. So, again, 2 o'clock in the morning, disk is full, what do I want to do? I want the robot to execute the clear-the-temp-directory script. And if that fixes it, it's all quiet, it's all good. Just to go over that step by step, you add an action. We have lots of actions, there's a huge amount. And depending on what modules of SolarWinds you have, there are different actions there. I want to point out that Execute an external program is right next to Execute a VB script. This is where you would do, say, PowerShell, which is definitely the way to go. But the server team already had the clear script VB there. Why would I not want to use it if they already wrote it? They own it, they're responsible for it, that's the way you want it to work. There's other things that you can do here in terms of managing VMs, restarting, remember I said that IIS application pool, you restart...
I was wondering about restarting IIS, because that's a common thing.
What do you do? The websites are stuck, restart IIS, and yeah, solves everything.
And it solves it, and it's not just it solves it for a minute or it puts off the problem, it solves it. So why get a person involved at all? But what happens if it doesn't solve it?
Okay, so we've set up our action, let me cancel out of here, we've set up our action.
Well, you can add another item here, wait 16 minutes, and then escalate.
So then I'm going to do my alert the team, whether that's creating a ticket or sending an email, or whatever. And that's important. You get that by adding an escalation level, and you can add multiple levels of escalation. So do this action, clear the temp directory. Wait 16 minutes. What's that? Two polling cycles for my, in this case, for our volume alert.
So it's two polling cycles. If the condition persists, it's going to do the escalation. If the condition has resolved itself, if the problem is okay, we never do that.
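The "automate first, escalate only if it persists" flow can be sketched as a list of levels with delays. Everything named here is hypothetical, including the assumption that each delay is counted from the moment the alert triggered; the actions are just labels, not real integrations.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationLevel:
    after_minutes: int  # minutes since the alert first triggered (assumed meaning)
    action: str         # label only; a real system would run or send something


# Hypothetical plan: let the automation try first, then involve people.
PLAN = [
    EscalationLevel(0, "run temp-clear script"),
    EscalationLevel(16, "alert the team"),
]


def actions_fired(plan: list[EscalationLevel],
                  cleared_at_minute: Optional[int] = None) -> list[str]:
    """List the actions that fire before the condition clears.

    `cleared_at_minute` is when the reset condition was met, counted from
    the trigger; None means the problem never cleared.
    """
    return [
        level.action
        for level in plan
        if cleared_at_minute is None or level.after_minutes < cleared_at_minute
    ]
```

If the script fixes the volume within ten minutes, only the script runs and nobody is woken up; if the condition never clears, the team is alerted at the 16-minute mark. Further levels could be appended to the same list.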
So that is really the way that you can, and remember how I mentioned just a little bit ago, that if the problem is persisting and it's not resolved, even if the ticket gets closed, what do you do? Oh, so we've cleared the temp directory, we alerted the team, we've created a ticket. They get in there, they close the ticket, because they've paid somebody, they paid my 15 year old an extra 20 bucks to just, when you get this ticket buddy, just close the ticket. That's all we want you to do, and he thinks he's making a mint, right? Oh yeah, so we're going to add another escalation level, this time after ten minutes, because the problem isn't resolved. No, no, no, now we're going to send a ticket to the boss.
This still hasn't resolved. Then you go back and say, well, how come you closed the ticket at 3:05? Right. Although I know it sounds like I'm trashing the people who we're sending tickets to. I see this a lot. I get mass close, quick close, all of that.
But yeah, you're flooded with alerts. I mean, we're not just talking a dozen. We're not even talking 100. It can be thousands of these alerts on a daily basis. You're doing your best to manage all of it. And sometimes, things get closed. Why'd that get closed? Actually, I don't know. It says I closed it. Maybe I thought I was closing something else. Accidents happen.
They do. When you're dealing with a volume of things like this, it can.
But it's a symptom of not having accurate alerts. Once this is fixed, that shouldn't occur in the first place. And actually one of our, Kevin Sparenburg has said, if you find that you have any email rule to manage your incoming alert messages, you've already lost.
Every alert needs to be immediately actionable. Every single one should be an absolute call to action, jump from your desk, run from the room, take care of it. That's the nature of what these alerts should be. And I'll tell you, when you've tuned them, that's what they are. The teams start to take them seriously. But I've been at companies where they generate 12,000 messages a month. Nobody answers 12,000 tickets a month. All right, so the next thing I want to do is take a look at the kinds of things that we need to gather. So let's tighten this up a little bit and then we'll jump into that.
So, this next thing I want to talk about has nothing to do with technology. And it has everything to do with just improving the rigor and the professionalism of the monitoring process.
Some of these are hard-won lessons that I want to go over for both of us.
Some of it's review, so what's your process of setting up an alert? Well, we've covered some of them. Step 1, gather what the trigger information is. How many elements make up an actual actionable alert, not just, disk is over 90% but 90 less than that, right? We did that. For how long, if there's an aggregation, x out of y polls, or for how many polling cycles. That's the first thing. What does all better look like? I got to know what all better looks like so that I can set a reset that is either equal or less than. We covered that. What are you going to do about it?
Email the boss.
Email the boss. No, [LAUGH] you're going to set up some automation so that hopefully the computers can do computer stuff and then humans can do human stuff. Those are things that we all have already covered, but what's next? Next, no, I just turn on the alert. No, there's next. So the next thing is, and I know it sounds like I'm being a real jerk about it, I want to see a knowledge base.
Oh yeah, like a runbook-type thing where you share information. Say it went off, I've done this, I've made these changes.
Right, because the people who are requesting the alert, frequently aren't the only people who are dealing with the alerts.
There's usually an after-hours operation center, sometimes the application team never intended to get the alert, they just were going to hand it off to the ops group.
And that is not fair. I don't get to wake you up at 2 o'clock in the morning unless it's April first.
So I call this, because it could be yourself, you could be alert. This is a note to future you.
This is what that knowledge base is for, this is you at 2 AM going, why did I do this again? You want to send yourself that note forward in time, so that you help yourself or anybody else that needs to have information about what is happening and why this got triggered.
I will say that when I ask for a knowledge base it can be very terse, it can say, call Bob.
Fine, if that's what you want the NOC to do at 2 AM is call Bob, and Bob is okay with that. I'm okay, as a monitoring engineer, I'm okay with that. But, I need to see an actual knowledge base article that is going to be associated with this. Also because, I can put the link to that in the alert message. So, when the NOC gets the alert, they look and it says, see knowledge base such and such, click here, and they go straight to that. So, I'm trying to save them time because they're the ones who are on the hook for it. Then, the next thing is: the requester has to make the error happen on purpose.
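Putting the knowledge base link into the alert message can be as simple as refusing to build the message without one. A minimal Python sketch follows; the function name, the message wording, and the kb.example.com URL are all hypothetical.

```python
def build_alert_message(node, metric, value, kb_url):
    """Compose an alert body that ends with the knowledge base link for the NOC."""
    if not kb_url:
        # No knowledge base article, no alert: the article is a prerequisite.
        raise ValueError("an alert needs a knowledge base article before it goes live")
    return (
        f"{metric} on {node} is at {value}.\n"
        f"What to do: see knowledge base article {kb_url}"
    )


message = build_alert_message(
    node="db-01",
    metric="CPU load",
    value="97%",
    kb_url="https://kb.example.com/KB-1234",  # hypothetical article; it may just say "call Bob"
)
```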
Oh, as part of a test.
Right, because I want to know, because the number of times I've gone in and said, yeah, I'm looking for when it's 50. And then they trigger it and it's 47.
Mm-hm. Oh, it was only 47, that's a problem. No, no, no, that was bad, well no, you said 50, so you have to make it happen on purpose.
With one fairly glaring exception. I was asked to set up alerts for the actual data center and we were doing a temperature trigger to know when the center was on fire. [LAUGH]
So you had to set the center on fire.
We had to set it on fire. Light her up, here we go. No, we did not do that. So in that case, or in cases where you can't make it happen on purpose, I always tell people to do reversals. So instead of above 90, look for below 90. Or below something that isn't absolute, you can do this for CPU cores, it's a really good one.
So, I needed to look when CPU was over 95%. Okay, what are you going to do? I'm going to run that crashme.exe. No, you don't want to do that on your production server.
You just look for when it's below, and you might want to narrow down the scope. You don't want an alert going off for every server that is below 90% CPU, but that's one way to get around the set-the-data-center-on-fire thing.
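The reversal trick amounts to two changes to the rule: flip the comparison, and narrow the scope to a single test machine so the flipped rule doesn't fire everywhere. Here is a small Python sketch of that idea; the rule structure, the "*" wildcard, and the node names are assumptions made up for illustration.

```python
import operator
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlertRule:
    metric: str
    compare: str      # ">" or "<"
    threshold: float
    scope: str        # "*" means every node (hypothetical convention)


OPS = {">": operator.gt, "<": operator.lt}
FLIP = {">": "<", "<": ">"}


def reversed_for_test(rule, test_node):
    """Copy the rule with the comparison flipped and the scope cut to one node."""
    return replace(rule, compare=FLIP[rule.compare], scope=test_node)


def fires(rule, node, value):
    """True when the node is in scope and the value satisfies the comparison."""
    in_scope = rule.scope in ("*", node)
    return in_scope and OPS[rule.compare](value, rule.threshold)


production = AlertRule(metric="cpu_percent", compare=">", threshold=95, scope="*")
test_rule = reversed_for_test(production, test_node="test-web-01")
```

An idle test box at 12% CPU trips `test_rule` right away, which exercises the whole notification path (message, routing, any automation) without loading a production server.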
Or you could just set the data center on fire.
You could do that. That's another option. I'm not trying to take anybody's choices off the table, but I didn't tell you to set the data center on fire. Legal wants me to make sure...
But we could alert you for that.
We can definitely tell you, and then the last thing, and the thing that people miss a lot. Before I will turn your alert into a ticket, I need to see one in the wild. I need to see one actually happen.
So, there are people that want alerts based on things that may never really happen?
Or never happen again.
Or never happen again.
So those people who've read the book, this is a black swan.
It is that major crash that got everyone's attention and possibly cost a lot of money, but is so rare, so bizarre. It's the World Cup two years ago, like no one could have predicted that this would have happened. We don't have enough data, but now we're going to set up an alert for it. So yeah, and I want to burn hundreds of hours.
A knee-jerk reaction to something that probably will never happen again.
Yeah, so I need to see one of these happen in the wild before I create a ticket.
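One way to check whether a condition has ever shown up "in the wild" is to replay the proposed threshold against data already collected and count how many separate times it would have fired. A rough Python sketch, assuming the history is just a list of polled values; the function name and numbers are illustrative.

```python
def count_episodes(history, threshold):
    """Count separate runs of consecutive samples above the threshold."""
    episodes = 0
    above = False
    for value in history:
        if value > threshold and not above:
            episodes += 1  # a new run just started
        above = value > threshold
    return episodes


# Two separate excursions above 90 in this made-up history.
seen = count_episodes([40, 95, 97, 50, 96, 30], threshold=90)
```

A count of zero over months of history is a hint that the request may be a black swan rather than something worth turning into tickets.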
Because at the end, I don't get paid by the bushel of alerts. I really would prefer to have as few...
You don't have a quota?
No, I don't have a quota of anything. So now we want to get into the last piece.
So there's some really cool stuff, just to wrap this up, because there's some interesting ways to threshold things that aren't sort of hard and fast right now.
All right, let's do it.
So let's dive in now. So the last demo that I want to get into talks about setting thresholds that aren't really thresholds. Part of the problem with using fixed numbers is that they're never generic enough for the entire environment. You have hundreds of servers or thousands of applications, or whatever you have. And no single number is ever going to really satisfy, so how do you get around that?
I thought 42 was the answer.
42 is always the answer.
Always. [LAUGH] But it may not be the right CPU threshold, unfortunately. So, what we've got here, we're back at the template that we looked at earlier, this aggregation template. Remember, I had mentioned that you have a warning and a critical value. Well, I want to just repeat that, that you can set on a per application or even a per instance, as you assign this to different machines.
You can say that warning for this box is at 5, critical for this box is at 50, or whatever. That means that it's going to have a status of critical when, in this case, it's at 10.
What that means is that my alert, and I've got it started over here. So my application name is the Aggregation Example, that's the name of the actual template. And the status is not 5, not 10, not 12, not whatever, it's critical.
What's critical? Whatever it is, you can actually set up application alerts that are very generic. It doesn't have to be even for that particular template. If you do this across your environment, you can say alert me when the applications become critical. Now that might be a little more generic than you really want. It's possible with custom properties and having different things that you can send the messages here and there. But you can keep yourself from having 300 separate alerts by using this. And this is important also beyond the realm of SAM, because SAM was the first place to have that. Going to jump over here in the Node Detail screen, now this is Node Details for NPM, this is just an NPM thing. Recently, a couple of versions back, we added, look at that, Warning and Critical for CPU, Memory, Response Time, Packet Loss.
Depending if you're looking at an interface, you're going to have interface-specific metrics, you can have different things. If I want to override the general thresholds of 80 and 90, I can put in whatever number, and this is on a node-by-node basis.
Right, so this node, right, so if you know it's running hot...
Then you can change that.
So you can say, instead of saying, when CPU is over 90, you can say when CPU is critical. And now you can have one alert instead of the 12 for, heavy duty web server CPU alert. And then the sort of quiet, not-used-much database CPU alert. You don't have to have all these variations of your alert types. But that's not all, it gets better.
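The "alert on status, not on a number" idea can be sketched like this in Python: the numbers live with each node as overrides of the general thresholds, and the single alert only asks whether the status is critical. The 80 and 90 defaults come from the demo; the node names, override values, and function names are hypothetical.

```python
DEFAULTS = {"warning": 80, "critical": 90}  # general CPU thresholds from the demo

# Per-node overrides (hypothetical nodes and values).
OVERRIDES = {
    "web-heavy-01": {"warning": 90, "critical": 97},  # known to run hot
    "db-quiet-01": {"warning": 40, "critical": 60},   # normally near idle
}


def cpu_status(node, cpu_percent):
    """Resolve this node's thresholds, then map the reading to a status."""
    limits = {**DEFAULTS, **OVERRIDES.get(node, {})}
    if cpu_percent >= limits["critical"]:
        return "critical"
    if cpu_percent >= limits["warning"]:
        return "warning"
    return "up"


def should_alert(node, cpu_percent):
    """The one generic alert: fire when status is critical, whatever the number is."""
    return cpu_status(node, cpu_percent) == "critical"
```

With this split, 92% on the hot web server is only a warning, while 65% on the quiet database box is critical, and both are covered by the same alert definition.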
There's baselines.
Yeah, there's more. Remember: we're collecting data.
In fact, I always say to people: monitoring is not a ticket, it's not an alert, it's not a poke in the shoulder, it's not any of those things. Monitoring is the regular ongoing steady collection of data. Alerting is the happy byproduct that you get out of monitoring. So, I already have the data, so I can say use my dynamic thresholds. Now it just told me that for this particular machine, critical is at 47.
How does it know that?
How does it know that? Well, because we're collecting the data and I can look at the latest baseline details.
Oh, look at that.
Look at that! It tells me where my baselines are over time, and I can actually do metrics over time. It says, "This is how it's actually been running over the last seven days."
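To make the idea of a threshold derived from collected data concrete, here is a small Python sketch. The demo only shows that the thresholds come from the last seven days of data; the specific rule used here (mean plus two and three standard deviations) is an assumption for illustration, not a statement of how the product calculates its baselines.

```python
from statistics import mean, pstdev


def baseline_thresholds(samples, warn_sigmas=2.0, crit_sigmas=3.0):
    """Derive warning/critical values from history (assumed rule: mean + k * std dev)."""
    avg = mean(samples)
    spread = pstdev(samples)
    return {
        "warning": avg + warn_sigmas * spread,
        "critical": avg + crit_sigmas * spread,
    }


# Hypothetical CPU readings standing in for a week of polls on one machine.
week_of_cpu = [30, 32, 34, 36, 38]
limits = baseline_thresholds(week_of_cpu)
```

A machine that normally idles in the 30s ends up with a critical value in the 40s, which is how a number as low as 47 can be a legitimate critical threshold for one particular box.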
Okay, so forget CPU and memory. You know what, I want this on a SQL query basis.
On a query-by-query basis, some queries run hot...
The baseline for this query...
And some queries, right?
Most critical query in the corporation is this particular query and show me when it runs hot or not.
Right, and so that would be something to put into the idea exchange. And when you see things like this and you say, "Oh my gosh, it'd be even better if we could do whatever," on thwack.com...
Idea exchange, putting your ideas...
Sorry, yeah, no, no, content exchange is where you share templates and things.
Which is also incredibly useful, but idea exchange, the idea area for each module, is where you can add an idea, like, I'd really love it if it...
And then up-vote them, right. And also our user experience team, which is really who you were just talking to, our user experience team looks at those and says oh, this is a feature that we could really have. So this is a lot of ways, if you follow these steps, your alerts will be much less noisy. They will be much more useful. So, I just want to tighten this one up a little bit more and then we can wrap it up.
See, so much better.
Agreed, and it's not just a case of, say, of having 49 fewer interruptions throughout your day. But having alerts that leverage automation and they only trigger when there's an actionable problem. Well, that will save your business some beaucoup dollars.
Thank you so much for not saying save your bacon.
Well that just wouldn't be kosher.
Tell me about it.
Hey, is that a DPA hat?
Of course it is. You know what else wouldn't be kosher? Not joining us live for the next episode of Lab. So to do that go to lab.solarwinds.com and register...
Calling all units, the CEO is in the data center!
We got to go.
This is not a drill!
My leg is starting to cramp up.
Why couldn't they just use a freeze frame?
How long do you think we have to hold this?
I don't know, are the credits still rolling?