That was a really good session.
It was, it was, I just-
We had so much more to cover.
Yeah well, I mean there's a lot, and we were speaking in such generalities that you couldn't really do a deep dive on anything and well, you and I have talked, we both like to get in the nitty-gritty and really mess with stuff.
We should get some tacos.
We should get tacos. [Laughing]
Breakfast. Breakfast tacos, it's a whole weird thing here.
It's a whole, yeah thing.
So yeah, we had tons and tons, so. Still think it's worthwhile to cover it.
You want to do it now?
You got time?
So no tacos?
Well we get tacos after.
We get tacos after.
Well, you'll get tacos after.
I'm going to skip tacos. Yeah, so okay.
Let's just run through it, and actually what we can do is we can make this, you know, a SolarWinds THWACKcamp on-call session. We can publish it after.
Want to do that?
Yeah, we should totally do that. Because there's always, whenever I left the sessions there's always like, oh but one more thing, I just want a little more, so.
So, where did we leave off?
Let's see. We went through and we kind of hit the generalities, and we didn't really deep dive into how we could do any of the stuff in Orion. So, we do have that though, because you and I have talked a couple times, and I know that you have actually expressed-- You gave me a lot of feedback on this about muting as opposed to un-managing or deleting or something like that. And it's--I've always thought it's critical because I love the fact that Orion had the un-manage. When I had like an entire office, that was going down for a scheduled power outage. Yeah, just tell. I don't care. It's going to come back on. Everything's going to be fine. But if I was doing maintenance on something, and I did this a lot with my Exchange servers, is I would actually put this in place. And then as I was doing the patching, I could watch the services stop and start in the web console, make sure I was keeping all that information, and then when it rebooted, I didn't get any alerts for all that stuff going back and forth. It was great. But for us, it's just flipping; we just built a custom property. And this is the way I did it, and I've done it for years this way.
Yeah. Just to be clear, the mute concept, the mute option is a shut up, you know, feature that---
Yeah, really. Yeah. So, you know, we put it in there because unmanage stopped monitoring entirely, I'm not sure if there's, you know, a similar thing in your neck of the woods.
Yeah, absolutely, and it's-- I like that there's a custom attribute for it here, because it also gives you the capability at some point to make a metric out of it. So, we can actually track how many things are muted right now.
And then we can ask that question, right, like on a weekly basis, like why do we have 2,700 things muted.
Yeah, we'll talk about some of the supporting features that go around this mute concept.
I actually went and had that as one of my lists on my main page, it's like, these are the things you're not going to hear about today. So yeah, this is how we do it in Orion: you just go in; this is just a Boolean field you put in. You can assign it to nodes and interfaces and LUNs and basically anything. Nodes is kind of the traditional one, it's the general device or server or hypervisor, or it's kind of like the parent entity, and underneath that parent it has disks and it has interfaces and it has CPU and memory specs. So, I always started with the nodes. Put it on the node; literally just give it a real simple name, I always like-- Because it's a question, I did it as "isMuted." Everyone has mute or n_mute for node but it doesn't really matter, that's all about flavor. I liked this; this is the camelCase from doing too much programming for too many years, probably. But the beauty of it for Orion is it's literally one thing you change. If you want to do this you can go into any of your existing alerts and just change the scope of the alert, which is the first thing we're going to ask, and say instead of firing for everything unilaterally, fire for everything unless that node is flagged as muted. And then if it is, it's ignored. The entire alert logic just stops, and also you don't worry about extra CPU cycles on the monitoring system because they're not going any further in the processing.
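[Show note] The scoping idea described here can be sketched in a few lines of Python. This is only an illustration of the logic, not how Orion evaluates alerts internally; the `Node` class, the `is_muted` field (standing in for the Boolean custom property), and the function names are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    status: str              # "Up" or "Down"
    is_muted: bool = False   # hypothetical stand-in for the Boolean custom property


def nodes_in_scope(nodes):
    """Scope step: muted nodes are dropped before any trigger logic runs."""
    return [n for n in nodes if not n.is_muted]


def evaluate_node_down_alert(nodes):
    """Trigger step: only nodes that survived the scope step are checked."""
    return [n.name for n in nodes_in_scope(nodes) if n.status == "Down"]


nodes = [
    Node("core-sw-01", "Down"),
    Node("exch-01", "Down", is_muted=True),  # being patched, so muted
    Node("web-01", "Up"),
]
fired = evaluate_node_down_alert(nodes)
print(fired)
```

Because the muted check sits in the scope step, a muted node never reaches the trigger condition at all, which is the "alert logic just stops" behavior described above.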
Right, the good part is that muting is one custom property at least muting the entire box is one custom property. The, I will say, downside, but the work of it is that every single alert you code has to take muting into account. Now you have to add that as part of the alert logic, you know, somewhere just to make sure it goes. But once you've done that step it becomes very, you know, easy to just keep rolling it in, and again very effective.
And the beauty of it, something like this, you can add the custom properties to like the manage nodes view and just add that as a one and just say, filter only for these, or yes, and you're like, oh I need to turn all those off, turn them back on, in just a couple of seconds, done.
And also, because it's a custom property it will respond, we have in Orion, we have an alert action, which is change custom property. So you can actually mute or unmute certain nodes based on other events that are occurring, you know, as part of the whole process.
And that all comes back to that runbook thing: it's like if this, then that, if else then, just have yourself a big case or select statement or whatever your programming background tells you to use. And then I always like to extend it. And now, I didn't do this too frequently but I used to do this all the time for my interfaces, but I didn't use this muted. For my interfaces I cared about uplinks, so that's from a switch, like a floor switch, up to the core switch in an office, and I cared about the WAN links. So I cared about the WAN, the actual hard connection, and the sub-interfaces that went out to my WAN providers, and I cared about PRIs for my phone, and I cared about the internet pipes. Other than that, I didn't care if an interface went up and down. I mean if it went up and down to a server, typically people are rebooting that server, so I'm already going to get an alert about the server being rebooted. And I didn't ever monitor downstream. I always monitored upstream. So I didn't get-- so, if I was talking about a floor switch and a core switch, the floor got checked and the core did not. That way, I knew if one of these dropped, because of the drop on this side the whole child node would just disappear, so it was just easier.
Right, if you didn't have topology, you know, dependencies, or parent child, or anything like that, right. So, I like this and I like having the sub-components. You can go way far with this, I've seen people on THWACK who use CPU mute and RAM mute and other things. The way that I personally like to do it is that the node mute is the one mute to rule them all. You know that if node mute is on, nothing gets sent. And then, everything below that, all the other components are each individual, that aren't dependent. And the one thing I want to remind people is that applications have custom properties. They were added as of version 11.0? Maybe back before that.
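[Show note] A rough Python sketch of the "one mute to rule them all" arrangement: the node-level flag overrides everything, and each component-level flag only silences its own component. The property names (`node_mute`, `cpu_mute`, `app_mute`) are illustrative, not names that ship with any product.

```python
def should_send(node_props, component_props, component_flag):
    """Return True if an alert for this component should go out.

    node_props / component_props are plain dicts standing in for
    custom properties; the key names are hypothetical.
    """
    # The node-level mute wins over everything beneath it.
    if node_props.get("node_mute", False):
        return False
    # Otherwise each component is judged only on its own flag.
    return not component_props.get(component_flag, False)


node = {"node_mute": False}
cpu = {"cpu_mute": True}
app = {"app_mute": False}

cpu_alert = should_send(node, cpu, "cpu_mute")   # CPU is muted on its own
app_alert = should_send(node, app, "app_mute")   # application still alerts
all_quiet = should_send({"node_mute": True}, app, "app_mute")
print(cpu_alert, app_alert, all_quiet)
```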
Applications? SAM 6.0
SAM 6.0. Thank you.
I think it was 6.0.
Right. And applications have custom properties. They're a little bit tricky: you have to go into the template and open the advanced properties, and that's where you're going to find the custom properties, and you can add application mute. Now that's really important because your system support people think that certain things are important or not, but the application support people are going to say-- When an application person says, "I don't want to hear anything about this server, you know, for the weekend because we're doing maintenance," they actually don't mean I don't want to hear about the server, they mean they don't want to hear about their application going down. And the NOC is still on the hook for that box, maybe. So you have the ability to have the granularity of saying, "Oh sure, your little app, that will be muted, but the server, for which other people are responsible, you know--"
Maybe they need to know, because most data centers, at least the way I've worked with them recently, there are no local hands. There are no people, there's no one there, so your monitoring system takes the place of someone walking up and down the aisles checking and making sure things are green.
Right. And to your point earlier, additional fields that should be added with any sort of muting option is the mute reason, you know? Don't mute it if there's-- If it's muted and the mute reason is blank that is immediately suspect. That goes on to, not an alert--it goes into a report that gets generated every day or whatever. Right. Exactly. To say, you know, what's going on, go to the owner. This is muted. Did you ask for it? No? Then we're turning the mute off. And the other one is mute expiration date, just a date field so that when somebody requests it that it's muted and that way you know that this has a limited shelf life. And once again, automation, whether it's through the Orion SDK or whatever, can read those and say, "Oh, we've passed my expiration date I'm now going to programmatically unmute this."
Or even if you want to take the simpler approach is--I want a report of everything that has that in the past. So now, just ship me that report, in the morning I know I need to unmute all of these.
Right, if you want to go manual.
Yeah, if you want to go manual. But let's use the SDK.
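[Show note] The mute reason and mute expiration ideas can be sketched as follows. This is a minimal Python illustration over in-memory records, assuming three hypothetical fields (`is_muted`, `mute_reason`, `mute_expires`); a real version would read and write those values through whatever SDK or API the monitoring system offers.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MuteRecord:
    node: str
    is_muted: bool
    mute_reason: str = ""                     # hypothetical text field
    mute_expires: Optional[datetime] = None   # hypothetical date field


def audit_mutes(records, now):
    """Daily-report view: muted with no reason, and muted past expiration."""
    muted = [r for r in records if r.is_muted]
    suspect = [r.node for r in muted if not r.mute_reason.strip()]
    expired = [r.node for r in muted
               if r.mute_expires is not None and r.mute_expires <= now]
    return suspect, expired


def unmute_expired(records, now):
    """Automated option: clear the mute on anything past its expiration."""
    changed = []
    for r in records:
        if r.is_muted and r.mute_expires is not None and r.mute_expires <= now:
            r.is_muted, r.mute_reason, r.mute_expires = False, "", None
            changed.append(r.node)
    return changed


now = datetime(2024, 1, 8, 7, 0)
records = [
    MuteRecord("exch-01", True, "weekend patching", datetime(2024, 1, 7, 23, 0)),
    MuteRecord("app-02", True, ""),   # muted, but nobody said why
    MuteRecord("web-01", False),
]
suspect, expired = audit_mutes(records, now)
muted_count = sum(r.is_muted for r in records)   # the "how many are muted" metric
unmuted = unmute_expired(records, now)
print(suspect, expired, muted_count, unmuted)
```

The report function covers the manual approach (ship the list every morning), and `unmute_expired` covers the automated one; either can be used on its own.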
I'm curious. Do you ever automate the mute state?
I can imagine in an application context, if I'm doing something like choosing a leader from a zookeeper cluster, like hey, who's the leader? Mute everyone else for this set of metrics.
I imagine you could also do the same sort of thing like, you know, BGP link states or, like HSRP, if you have more than one and you only care about the one.
Right, so you can say if the, you know, the primary circuit is-- If the primary circuit is up, mute the backup circuit, which is still a valid circuit, but you don't want any alerts about it. It might be fluctuating, and it's not important to me right now. And vice versa, you can do it programmatically if the backup circuit, however you determine that, is up. I also did something where we would alert on the cell circuit. That was your backup-backup circuit, right? When that one came up then, first of all, we would alert the cell circuit was up because automatically something else was a problem, you know. But then there were also some programmatic things you could do with that. So, that's another case, but it really is one of those ideas that gives people a plethora of, you know, ideas.
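[Show note] One way to read the "mute everyone but the leader" and primary/backup circuit examples is as a single rule: find the highest-priority member that is up, leave it unmuted, and mute the rest. The Python sketch below assumes that interpretation; the circuit names, the priority order, and the choice to unmute everything when nothing is up are assumptions made for the example.

```python
def plan_mutes(link_up, priority):
    """Return (active, mutes), where mutes maps each name to True if muted.

    link_up: dict of name -> bool (is it currently up?)
    priority: names ordered from most to least preferred
    """
    active = next((name for name in priority if link_up.get(name, False)), None)
    if active is None:
        # Nothing is up: mute nothing, so every outage is still reported.
        return None, {name: False for name in priority}
    return active, {name: name != active for name in priority}


priority = ["primary", "backup", "cell"]

# Normal day: primary is carrying traffic, so backup and cell stay quiet.
active, mutes = plan_mutes(
    {"primary": True, "backup": True, "cell": False}, priority)

# Bad day: only the cell circuit is up. That fact is itself worth an alert.
failover_active, failover_mutes = plan_mutes(
    {"primary": False, "backup": False, "cell": True}, priority)
cell_alert = failover_active == "cell"
print(active, mutes, failover_active, cell_alert)
```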
This is kind of like this is your first step and wherever you take this, that's entirely up to you. For me one of the things that we didn't actually even put in our deck, is that a lot of things about the Orion system that really appeal to me especially in the past like five or six versions was the way it did automatic dependency mapping. Huge, huge help. Still a couple places where it doesn't work but there's a reason it doesn't work there. Like the company I came from, the WAN links. Now they didn't have dual WAN links on a router they wanted separation, number of failure domains. Cut it down. Two separate routers. So if one of the routers went down, everyone still communicates with the data center, if that second one went down then you've got a problem. But if either, you had to have one, so I had to do a dependency, which was these are all my WAN routers that's a group, everything behind those is a different group. And build the dependency set. If both of these are down, you don't have to send me the 800 alerts for all the things behind them, but you’ve got to make sure both of these are down.
You'll also see that with switches. With cross-connected switches.
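[Show note] The two-router dependency described above comes down to one condition: suppress the downstream alerts only when every member of the parent group is down. A small Python sketch of that rule, with made-up router and node names:

```python
def alerts_to_send(parent_status, downstream_down):
    """Collapse downstream alerts only if ALL parents in the group are down.

    parent_status: dict of parent name -> "Up" or "Down"
    downstream_down: names of nodes behind the parents that look down
    """
    all_parents_down = bool(parent_status) and all(
        status == "Down" for status in parent_status.values())
    if all_parents_down:
        # One alert for the group instead of one per unreachable child.
        return ["Site unreachable: all WAN routers down ("
                + ", ".join(sorted(parent_status)) + ")"]
    return [f"{name} is down" for name in downstream_down]


children = ["floor-sw-01", "floor-sw-02", "print-01"]

one_router_down = alerts_to_send(
    {"wan-rtr-a": "Down", "wan-rtr-b": "Up"}, ["print-01"])
both_routers_down = alerts_to_send(
    {"wan-rtr-a": "Down", "wan-rtr-b": "Down"}, children)
print(one_router_down)
print(both_routers_down)
```

With one router still up, a downstream failure is a real failure and is reported individually; with both down, the single group alert replaces the flood.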
Do you ever, I'm curious, I don't mean to go too off topic, but I'm curious if you actually ever have to answer the why didn't I get an alert question, with it was muted.
Yeah and that's the reason you have mute reason. And typically, if you can do this and if you allow the access to people that aren't necessarily just your NOC people. Like, allow them to do some stuff, like in their own stuff, you forcibly tell them that you need to put a reason in here. And you find out who logged in, and who did not put a reason in, and then you say, "This is why you did not receive an alert. This is the person whose audit log matches that time, you need to ask them."
I was going to say, we added it in v11.0 I think, audit capabilities.
And that audit is unbelievably great, especially if you use something like Active Directory.
Because then that just matches directly back. This one, I've actually gotten burned by this. I did not get burned horribly, this was not like the write three letters situation, but I did get burned pretty bad with it because I didn't realize my polling engine was overworked. I was taken in front of some people that far outranked me and asked why, why, why. So, this was the monitoring system breaking down. One of the questions we talked about but, it wasn't really the monitoring system breaking down, it's that it was scaling itself back on its polling cycles. Because it knew it was being overloaded, so it's--
It's a feature, not a bug.
Yeah, well I mean, it is intelligent because you get to a point, especially if you deal with the unlimited licenses, is that you just want to add and add and add and add and add. But there's a threshold, there's only so much information that any one of these machines can handle at one time. So, we intelligently just scale back. I don't want to say the time; it's actually the frequency we scale out the time between polling cycles. I didn't know this at the time and I ended up getting burned. Because I had something that was down, it was literally down for 11 minutes before someone had to come up and tell me that it was down for 11 minutes.
Right, and I saw this in a previous situation where the SNMP agent is running but it's not responding.
You know, so you have lights are on, nobody's home situation. You can also get that with ping. Actually, the box is up but it's not, or it's responding to ping but it's not doing anything else. So, that was one of the reasons why we've-- I've looked into this a lot, is you just want to make sure that you're getting fresh data. And within Orion we do have the last collection date, the last poll date, so we're able to go back and say that. So, you can look at those things and say-- This goes back to how I as the monitoring engineer don't hate my tools, you know, because they get me into those bad conversations.
I'll mention also for the executives watching that blameless post mortems are a thing and a very important thing. [Laughing]
What is this thing I hear?
Yeah, really. Blameless post mortems, I like it. I like the sound of it.
I know. I'm going to actually--we need a t-shirt. It's like, I attend only blameless post mortems.
Here we go.
Not to be sold at the underwriter's conferences. So the stale data-- [laughing] and the big thing for that is, once I know there's stale data and this is a couple years ago when I implemented something like this. Once I know there's stale data I know I need to take some corrective action. At the time, there was no corrective action I could take. I could literally go to the details, hit poll now, hope something came back. Or restart services, things like that.
Now, the SDK gives a little bit more than that, we actually have the ability now. Here we're using Orion because that's what we use. Here I'm using PowerShell because that's pretty much what I use; the SDK is in I don't even know how many languages now. I know there's a Python one working, or--
We'll have to ask Patrick, or we'll fill it in on the chat or whatever.
It's right here.
It's over there.
But this is after hours, so maybe not, but whatever.
So it's all dark and closed.
The SDK actually has this pre-built verb in here as part of the SolarWinds Information Service called Poll Now. So literally, you can say, I want to invoke this verb against this particular end point and do a poll now, and it'll actually kick it off.
And see what happens.
Yeah. So for this, I actually built this one, and I did test it; it's actually really hard to fake bad polling times, but I was able to do it after a lot of tweaking. And it actually would trigger an immediate poll. So this is like, try to poll it, and if you still didn't get anything, the logic here is wait 10 minutes and see if the poll time's changed, and if it hasn't, then obviously the polling didn't work, so let somebody know. And the way that our logic works now with the escalation layers, knowing it's critical so I need to let, you know, day shift know. Well, no one's touched it for five minutes, I need to let their bosses know, or I need to let someone remote know, and just having that kind of logic in there is really helpful. It's PowerShell. I got lazy; I wrote it as a function. Because I can reuse it over and over again. So, literally, this was something-- I remember them putting this in and they used it for VBScripts. And I remember so long ago that I used to have to write, when I wanted to write PowerShell I had to write a VBScript that would generate the PowerShell and then the VBScript would call the PowerShell command line. Thankfully that's gone.
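[Show note] The script discussed in the session was PowerShell and is not reproduced here. The Python sketch below only illustrates the same flow: record the last poll time, request an immediate poll, wait, and notify someone if the poll time has not moved. The four callables (`get_last_poll`, `poll_now`, `wait`, `notify`) are hypothetical hooks; in practice they would wrap whatever SDK call, sleep, and alerting mechanism is in use.

```python
from datetime import datetime, timedelta


def recheck_stale_node(node_id, get_last_poll, poll_now, wait, notify,
                       wait_seconds=600):
    """Force a poll, wait, and report whether fresh data arrived."""
    before = get_last_poll(node_id)
    poll_now(node_id)        # ask the monitoring system for an immediate poll
    wait(wait_seconds)       # give it time (10 minutes by default)
    after = get_last_poll(node_id)
    if after is None or (before is not None and after <= before):
        notify(f"Node {node_id}: poll time unchanged after forced poll")
        return False
    return True


class FakePoller:
    """In-memory stand-in so the sketch can run without a real system."""

    def __init__(self, last_poll, responds):
        self.last_poll = last_poll
        self.responds = responds

    def get_last_poll(self, node_id):
        return self.last_poll

    def poll_now(self, node_id):
        if self.responds:
            self.last_poll = self.last_poll + timedelta(minutes=1)


messages = []
stuck = FakePoller(datetime(2024, 1, 1, 8, 0), responds=False)
healthy = FakePoller(datetime(2024, 1, 1, 8, 0), responds=True)

stuck_ok = recheck_stale_node(1, stuck.get_last_poll, stuck.poll_now,
                              wait=lambda s: None, notify=messages.append)
healthy_ok = recheck_stale_node(2, healthy.get_last_poll, healthy.poll_now,
                                wait=lambda s: None, notify=messages.append)
print(stuck_ok, healthy_ok, messages)
```

Passing the wait and notify steps in as functions keeps the check reusable and makes it possible to test without actually sleeping for ten minutes.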
All right, all right, all right, all right, all right.
I would, yeah, crazy. Crazy. Now it's cake. Just sent it right through. Except it looks like, nope, okay I thought I missed a quote. [Laughter]
Yeah. And then, since we're talking about runbooks, this is kind of like my real simple-- Because this is me, for me, as the monitoring person, it's still nice to have.
Runbooks should be for you too.
Yeah, it should be for everybody. And this way, in case someone else becomes, because I was a team of one for the Orion, whereas other ones there's entire NOC teams. So, if I happen to be out, like I don't know, go on vacation, then somebody else needs to know this.
And why it's there.
Yeah. What else did we do? Oh, HTML, so great especially for, so great and the emails are actually useful. I mean maybe it's me, but when I first got a smartphone and I got email alerts, they would come in and they'd be plain text and everything would run together, and I'm talking early, the old, old, old smartphones that were like single color, so everything ran together. But that's the way all alerts were, they looked like traps as far as I was concerned. It was, like, raw information. And it was like not useful for me. All of a sudden, you can now do this? It's like, wait a minute, I can have an alert that not only looks good but will actually render decently on a mobile device, or on a tablet, or anything like that? And I can build in as many direct links as I want and not have one master, you’ve got to go here, and then know where your navigation is. I can say this is the exact thing that's firing, this is exactly why, here is the alert, it's actually on so you can go to the alert definition. Here is the parent node, here is the hypervisor, here is--
Here is the knowledge base article.
They gave me everything.
Link to the ticket.
Yeah, this is growing really quickly, the notion of actually taking and putting a formatting language around, right. And now we can embed, like even visualizations or graphs or you know, like whatever else. But, I mean, larger teams like Etsy has an entire service that just does this, like they send every alert through this thing, all of their services are called something-izer. So this is, I don't know, context-izer?
Exactly, yeah, so important. I would make the point though that like send multi-part messages, because there are people I guarantee you in your shop who will not like this. [Laughing] They're still using Vim and Mutt and they will be greatly confused, so.
What they can't read native HTML and process it? [Laughing]
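[Show note] For the multi-part suggestion, here is a minimal sketch using Python's standard `email.message.EmailMessage`. It builds one message with a plain-text part and an HTML alternative, so a text-only mail client shows the first and everything else renders the second. The addresses and the example.com links are placeholders.

```python
from email.message import EmailMessage


def build_alert_email(subject, sender, recipient, text_body, html_body):
    """Build a multipart/alternative message: plain text first, HTML second."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(text_body)                      # text/plain part
    msg.add_alternative(html_body, subtype="html")  # text/html part
    return msg


alert = build_alert_email(
    subject="Node down: core-sw-01",
    sender="monitoring@example.com",
    recipient="noc@example.com",
    text_body=(
        "core-sw-01 is down.\n"
        "Node details: https://monitoring.example.com/node/1\n"
        "Runbook: https://wiki.example.com/runbooks/node-down\n"
    ),
    html_body=(
        "<h2>core-sw-01 is down</h2>"
        '<p><a href="https://monitoring.example.com/node/1">Node details</a> | '
        '<a href="https://wiki.example.com/runbooks/node-down">Runbook</a></p>'
    ),
)
part_types = [part.get_content_type() for part in alert.iter_parts()]
print(alert.get_content_type(), part_types)
```

Sending is left out on purpose; the message object can be handed to `smtplib` or any other delivery path.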
And when they added this, there was, I remember when this first came in and you had the old engine, you had the report, or the alert manager, was it called? Advanced alert manager.
The Win32 app. I remember going in there and doing this and getting so happy, but then actually getting to see one come through my email was a nightmare because I would actually have to wait until it triggered. And if I was just changing it, like all I wanted to do was turn it from a plain text to HTML, I basically spun up a second alert, copied everything over, did all this stuff, then changed thresholds. It was--now we actually have the ability. There's the simulate and the execute button. But, and you called me on this, execute is not truly--
Test doesn't mean test.
It means try to fake it a little bit. And that's a conversation that we've had on Lab a number of times, but I think actually we're probably going to get kicked out of here in a minute. They’ve got to use this set for the next record, so this has been great.
Yeah, plenty of information, and yeah, we talked about a couple things here I think we'll just put some, show notes with this and we'll be done.
Yeah, and thanks everyone for hanging out with us for a little bit of extra.
Yup, thanks again, bye all.
Take it easy.