It was a dark and stormy night.
An alert rang out.
A CPU slammed to 100.
Suddenly, a net flow appeared on the horizon.
Meanwhile, on a small farm in Oklahoma, a head geek was growing up. [Laughing]
Okay. Working in IT sometimes feels like being stuck in the middle of a detective novel. Every day--
Sometimes every hour.
Okay, every hour, you are desperately trying to figure out whodunit.
Yeah, before that chalk outline on the ground is yours.
Right. So today I wanted to talk about troubleshooting. Actually a little bit of talk, mostly hands-on, and how your monitoring solution makes you the resident detective. With me today are some friends who also happen to be crackerjack IT troubleshooters. First, there's the inimitable Kate Asaff. Very good, nice reveal. She is one of our Program Managers here at SolarWinds, and also a recognized tech support genius. Also with me is my former partner in crime, Josh Biggley, who wrangles the monitoring solutions for about 10,000 systems at Cardinal Health and is also a THWACK MVP. So thank you so much for joining us on this segment.
Thanks for having us.
Thanks for having us.
Okay. So before we dig into the hands-on, and it's mostly hands-on, I want to talk a little bit about theory, because I think that good monitoring starts with good processes. And I know that when people are thrown into IT, they sometimes think that troubleshooting is really just a matter of pulling things apart until you see what's broken, and then wandering around in circles or whatever it takes. But there actually is a process to it. I'm curious about your take on that. How do you approach an issue?
I think the first thing you have to do, and I think everyone knows this, is figure out what's actually going on. When end users walk into your office, or when your manager comes storming in telling you that something is broken, you have to figure out what that something is. Some people might say, 'Oh, the internet is down.' Well, that's code for: maybe a router's down, maybe there's been a change to the config, maybe someone put a backhoe through the fiber outside your building. You just never know. So you really have to understand what's going on and distill it down to its fine points. It requires you to listen a whole lot more than you talk. You have to ask those leading questions, but in the end, you've got to get down to just the good stuff.
I think Josh's point is really important: get an idea of the full scope of the issue, because what seems like a minute detail to a user may actually be a critical piece of information that leads you down a totally different path.
Okay, so really good questioning. As the other Head Geek, Tom LaRock, says, 'good interrogation skills' often help. Okay, great. So you've asked as many questions as you can; now what? What do you do next?
Hey, look, I love to try to repeat what the issue is supposed to be. If someone says, 'I click on this and this thing happens,' okay, click on that thing. Now, I'm sure that most people watching will understand, as sysadmins and as troubleshooters, that we'll show up and say, 'Hey, show me what you told me over the phone is wrong.' Inevitably, they're going to respond with, 'Oh, it's working now.' Apparently, I scare IT systems into just working by my sheer presence, and I understand that, but really you want to be able to replicate the allegation against the system. If you can replicate it, now you can start to do some troubleshooting on your own and really suss out some of that more complex evidence that you might see. You can look into event logs. You can look into application monitors that you might have set up. You can look in places that end users, or people experiencing the symptoms, just can't see.
Right. And just to be clear, though, I think that if a problem isn't repeatable, isn't immediately repeatable every time, that's still evidence, right? I mean, that just tells you it's a different situation.
That's one of the nice things about having a really good monitoring solution with SolarWinds: you have a history built up with your monitoring, so when somebody can't necessarily repeat the issue on demand, you have the option to see, okay, well, what was happening when that was going on?
Right, okay. So we've got good questioning, or interrogation skills; see if it's repeatable; and what are we looking to do after that?
Build a solid hypothesis using the experience that we have with the infrastructure. And sometimes it's just, well, I know that I changed something and suddenly people started screaming at me, so maybe I should go look at that thing that I changed. Not that we ever want to admit to it. But really you have to build a working hypothesis. You've asked the questions. You've replicated in a laboratory environment what the outcome was and what the symptoms look like. And now you've got to put your hypothesis to the test. You've got to build it and, just like a good scientist, start to experiment with it.
Okay, so just to recap: you've got asking good questions; repeating the problem, if it's repeatable; and building your theory or hypothesis, but whatever it is, starting to do your testing and basically getting to the work of fixing it. Great. But we've been dancing around a concept, which is monitoring. You mentioned it, Kate, a minute ago. I think that having good monitoring in place really helps jumpstart that process. I know there are lots of other troubleshooting processes out there, the seven-step process, the ten-step process. I think monitoring can really create shortcuts for those because it's done some of the steps for us. But what does good monitoring mean? What does that really entail?
Hey, I've got a great story and it actually involves you, Leon.
Uh oh. Hopefully nobody has it on tape.
No, it's true. So when I started working at Cardinal Health two years ago, I was familiar with setting up SolarWinds products. Of course, everyone knows that they're easy to install. They're easy to get up and running. You can start collecting a whole bunch of statistics, but one statistic that I'd never given much thought to was CPU queue length. Of course, CPU queue length and CPU usage, you mash the two of them together and it gives you a great view of how your servers respond to the load that's being placed on them. Well, your Ultimate CPU Alert that's posted to THWACK gave me that insight. And of course, being the way we are at Cardinal Health, we've got this large, broad environment we have to monitor, and we realized we wanted to dig into that a little more. So we made the Ultimate CPU Alert Reloaded and also the Ultimate CPU Alert for Large Environments. But it was that seed of understanding, that I wasn't monitoring CPU queue length, that made me realize I always have to be morphing my monitoring. I can't ever be happy with the way that things are today. I have to be looking for better ways to deliver, not just the same service but a better service, to both myself and the consumers of the services that I offer.
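Josh's point about mashing CPU usage and queue length together can be sketched in a few lines. This is only an illustration of the idea, not the actual logic of the THWACK alerts he mentions; the thresholds, the sample format, and the function name are all made up:

```python
# Sketch of a combined CPU alert: fire only when BOTH utilization and
# run-queue length stay high across several consecutive polls.
# Thresholds and sample format are illustrative, not the THWACK alert's.

CPU_PCT_THRESHOLD = 90      # sustained utilization, percent
QUEUE_THRESHOLD = 2         # runnable threads waiting per core
CONSECUTIVE_POLLS = 3       # how many polls in a row must breach

def should_alert(samples, cores):
    """samples: list of (cpu_percent, queue_length) tuples, oldest first."""
    recent = samples[-CONSECUTIVE_POLLS:]
    if len(recent) < CONSECUTIVE_POLLS:
        return False
    return all(
        cpu >= CPU_PCT_THRESHOLD and queue / cores >= QUEUE_THRESHOLD
        for cpu, queue in recent
    )

# A busy-but-healthy box: high CPU but an empty run queue -> no alert.
print(should_alert([(95, 0), (97, 1), (96, 0)], cores=4))    # False
# High CPU *and* a deep run queue for three polls -> alert.
print(should_alert([(95, 9), (97, 10), (96, 12)], cores=4))  # True
```

The point of the combination is the first case: a box pinned at 100 percent CPU with nothing waiting in the queue is busy, not overloaded.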
Excellent. And I think, Kate, you've run into this also, where people take the out-of-the-box stuff and then they're upset that it's not working. I think there's a reason behind that.
Well, that's what I tell people all the time: the out-of-the-box stuff is there as a jumping-off point. But there's so much information available for any network device or server that you monitor. You have to figure out what's important for you to pay attention to and what's a little less important, so that you can focus on what's really going to be critical for your business.
Right. So like any recipe, we're giving you these base recipes, but you need to salt to taste. You need to take those basics and then run with them and make sure that they're working in your specific situation. Okay, great. So we've set a nice foundation, I think, of the structure and the process. Now what I'd like to do is actually go through a particular scenario and just walk through troubleshooting it, okay? So let's go ahead and dive in here.
Let's run it from the top. What are our symptoms?
Well, the users started telling us that the application is slow.
Well not all the users. Some of them say it's just fine.
Okay, well, can we identify which users? Is it on a floor of the building? Is it a specific department? Are they all users who have the same role?
Well external users say the app is always slow. Internal users are a mixed bag.
Okay so what does our monitoring data tell us about the application?
Well the data that we have says the application is fine.
What do you mean the data we have?
Well, there are some gaps. For some reason, it just doesn't respond sometimes, so we have chunks of data, then a gap, and then some more chunks of data.
Oh, and one more thing. Users are reporting that sometimes they get an application error that says, 'Application Timeout.'
Okay, so we've run down all the symptoms. We've collected them like we talked about before. Have you figured out the culprit? Well, we're going to keep troubleshooting and see what comes up.
Alright, so when we get that alert, remember we talked about how we have to build the appropriate application monitor. In this case, we've got an application monitor in place for Exchange performance counters, specifically RPC requests. Now remember, when you send an alert out, you want to give people a place to go to, something to look at. In this case, we're looking at these RPC requests. We come down here to our RPC Requests Outstanding widget and, oh, we're way past the warning threshold. Things are getting a little crazy for us. So we know that something with the application isn't quite right, but we're not sure what. Have you guys ever seen anything like this before? Do you know what it might be?
Not yet. Okay. Well, all right, let's keep digging.
But we have seen it. I think a lot of us, if we monitor Exchange at all, have seen this on occasion, or weekly, or whatever. It's a pretty common thing.
Well, it's certainly not a smoking gun for us, though, just another symptom, another clue in our book. So let's jump into the node itself, the Exchange server. Now, of course, we're all concerned that maybe this particular Exchange server is a little overworked. That sometimes happens. I get a little overworked sometimes. I don't send up any sort of alerts or notifications when that happens, but sometimes our boxes do. So when we start to dig into the node, we want to scroll down real quickly and look at CPU and memory performance. Of course, we know that every good sysadmin out there wants to make sure that they have enough resources for their boxes. In this case, we see that the CPU is running right around 50 percent. Not bad. It's got a little bit of headroom; we can grow into it, but certainly not the smoking gun again. So we have to keep digging. We're on the vital stats tab of this node, and we're going to dig into some of the statistics that we can get from either our physical or our virtual server. We can see, of course, there's no stress on the box now. Memory at 98 percent for an Exchange box is pretty normal. Top CPUs by percent load: when we look at the overall statistics across all CPUs, it looks pretty normal to me, 50 percent utilization, give or take. Seems like this box has a fair bit of headroom to work with. We're going to flip over and take a look at the memory use and open up the history behind the memory used.
And our memory utilization chart for today is looking not too bad: 9.4 gigs used out of total memory, right around that 98 percent mark. I don't see any spikes here, so to me this says probably not the smoking gun, again.
All right, so let's hop into one of my all-time favorite views, AppStack. I know it's been around for a little while, and I know we've probably talked about it a few times, but the folks at Cardinal Health hear me talk about AppStack all the time. I'm super excited about it, because it allows me to see the application all the way through the infrastructure, all the way down to the storage infrastructure. In fact, we use it to try to map storage lineage, to connect the applications to the actual storage that they're running on. But it's a great way for us to see, in one nice view, what we're after. So, applications: nothing going on at the application level. Some issues with logins; that's a little weird. Outlook Web Access is having some problems. Get all the way down here: virtual clusters are throwing some warnings, but nothing that I would consider overly scary. Another red exclamation point. Now, the datastore status says that it's online, but there's something going on. We need to dig into that. But before we get too far into that, let's make sure that the network isn't the problem. We know how much people like to blame the network, so let's hop into that real quick. So we take a look at Quality of Experience. Now, I view Quality of Experience as a compass that points in only two directions: it either points to the application or it points to the network.
All right, so Top 10 Applications by time to first byte; this is what we want to take a look at. We'll drill in just a little bit here. And we're looking at CIFS at five seconds, but that's probably not Exchange-related in this case.
Not too bad.
Yeah, not too bad, right? Network response time, the TCP handshake: this is where we would look to see, is it actually the network that's causing problems?
We've got a bit of an outlier here, right? SSL. So we're going to drill in and take a look at our network response time. We want to understand: is this a network issue? Now, the applications on this node are showing that the SSL QoE application has an average response somewhere above 190 milliseconds, but something's obviously gone terribly wrong, because our network response time is 2.2 seconds. So now we need to go figure out what's gone wrong in our network. It's time to drill in and figure out: did something change?
Right. Now, one of the things about QoE that we like to joke about, especially when we're at conventions, is that the QoE page is basically helping resolve the MTTI: the mean time to innocence. It really lets you point the finger, like you said, one way or the other. Which one is it? So now we know, or at least we have a suspicion, that it's more network-centric than application-centric. We're still not sure, but that's the route we're going to take.
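The "compass" metaphor boils down to comparing two numbers: the pure network round trip (the TCP handshake) and the application's response time on top of it. Here is a minimal sketch of that decision; the function name and the thresholds are hypothetical, not SolarWinds' actual QoE logic:

```python
# Sketch of the QoE "compass": if the TCP handshake (a pure network
# round trip) is slow, suspect the network; if the handshake is fast
# but the application's response is slow, suspect the application.
# Function name and thresholds are illustrative only.

def point_the_compass(handshake_ms, app_response_ms,
                      net_threshold_ms=100, app_threshold_ms=500):
    if handshake_ms > net_threshold_ms:
        return "network"
    if app_response_ms > app_threshold_ms:
        return "application"
    return "healthy"

# The scenario from the demo: a 2.2-second network response time dwarfs
# the ~190 ms application response, so the network is the suspect.
print(point_the_compass(handshake_ms=2200, app_response_ms=190))  # network
```

The reason the handshake makes a good network proxy is that no application code runs during it; it is handled entirely by the TCP stacks and the path between them.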
Yeah, absolutely. Now at least we know we can set aside the application. We know that it's probably not Exchange; Exchange is just experiencing the symptoms of something else that's acting upon it.
So let's go ahead and drill into that. All right, so, the beloved NetPath. We couldn't possibly do troubleshooting, at least in my opinion, without NetPath. We've heard a lot of talk about NetPath; I've done a lot of talking about NetPath. People that I've worked with have probably heard me stand on my soapbox and preach about it enough, but I love it. I'm not going to dig too much into the details, but right away we see this great flag: config change. Now, config change, of course, requires that you have NCM installed. So if you don't have NCM installed, hint, hint to some of my colleagues, we should probably get it installed and use it. Right away, we see a degradation in the path. We see a maximum response time of 566 milliseconds. We see some issues. Well, let's hop in and click on config change and see what happens.
See what happens.
See what changed. So when we click on the config change, we get this great comparison. We see the current config and the last config, and we're going to hop back in time just a little bit and see what happened yesterday. Configs are pulled up; we'll scroll through, seeing a line-by-line comparison. Maybe we'll find something that's changed. Ooh, look at that. Somebody has been playing with the configuration of the router. So it looks like someone has forced some artificial latency into our environment. Obviously, there are nefarious forces at work here.
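The line-by-line comparison NCM shows here is, conceptually, a plain unified diff of two config snapshots. A rough sketch using Python's standard difflib; the router configuration below is invented for illustration, not the demo's actual config:

```python
import difflib

# Sketch of a line-by-line config comparison: diff yesterday's known
# good config against the current one. The interface config is made up.
last_known_good = """interface GigabitEthernet0/1
 ip address 10.0.0.1 255.255.255.0
 no shutdown
""".splitlines()

current = """interface GigabitEthernet0/1
 ip address 10.0.0.1 255.255.255.0
 traffic-shape rate 64000
 no shutdown
""".splitlines()

# Lines prefixed '+' were added since the last known good config;
# lines prefixed '-' were removed.
for line in difflib.unified_diff(last_known_good, current,
                                 fromfile="config-last-known-good",
                                 tofile="config-current", lineterm=""):
    print(line)
```

The added shaping command jumps out of the diff the same way the rogue change jumped out of the NCM comparison: the question stops being "what is different?" and becomes "who added this, and why?"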
This is definitely a smoking gun. Now, obviously, in our lab we've set up a few things just to illustrate what you might find. You probably won't find these particular commands; if you do, it's really time to figure out who made those config changes and maybe have an HR conversation with them. But…
We call those resume generating events.
So we should probably fix this.
I think that's a great idea.
Yeah, so let me jump in and show you how easy it is to actually take care of this if you have NCM installed. If you just come over to the node details page, you have configs already loaded up. Come down here and pick a last known good config; we knew that on the 20th it was good. So we will pick that and we will hit upload. Then the job will kick off, and we will see the traffic begin to flow the way we want it to.
Perfect. All right so, we've got this all wrapped up. We just clean up a little bit on the screen and we'll be good to go.
Perfect. [Clapping sound]
Detective Kojak, what are you doing here?
I'm looking for a razor. I never had a beard in my life. Yeah. Actually, you're very smart. You know, I couldn't have done it better myself, except if I had, I wouldn't be wrong.
Are you saying we're wrong?
But that doesn't make sense. We found the error. We fixed the error and the problem went away.
Well, sure, in this case, all right, but everybody out there knows those symptoms could've gone down totally differently. So I'm resetting the clock. Consider this mystery unsolved. Who loves ya, baby?
So Detective Kojak is right. The symptoms that we gave you could have pointed to a host of issues. So we're going to re-run our troubleshooting now as if there were a different root cause. So if at the beginning you said, 'Oh, it's this other thing,' you might actually have your answer right here. Okay, so walk us through this.
So the first thing we're going to take a look at is NetFlow, which is one of my favorite modules, because it really gives you good insight into not just the amount of traffic, but what that traffic is that's flowing between the app server and the database server. So if we look here, we see spiky traffic is normal. Nothing really jumps out at me as something new or unusual over the last little bit. So we'll come over here and take a look at the QoE again, and see, we've got some monitors in red. We've got CIFS and SSL, and that's really pointing towards an application issue. NetPath confirms there's nothing really wrong with the path. Everything is green, which we like to see. So let's take a look at the performance counters. These are so useful to monitor in SAM, because even if you look at them in Windows, they're only displayed in real time, so you don't really get any kind of historical background on what you're seeing. And this is looking like we've got some spikes towards the end of our monitoring period here.
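Kate's point about Windows counters being real-time-only is worth a sketch: what a history-keeping monitor adds is really just timestamped retention, so you can find the spike after it happened. Everything here is illustrative; the record() helper stands in for whatever poller actually collects the counter value (WMI, SNMP, an agent, and so on):

```python
from collections import deque
from datetime import datetime, timezone

# Sketch of counter-history retention: perfmon shows these values live,
# but keeping timestamped samples lets you look back at a spike later.

HISTORY_LEN = 288  # e.g., 24 hours of 5-minute polls

history = deque(maxlen=HISTORY_LEN)  # old samples age out automatically

def record(value):
    """Stand-in for a poller: store one timestamped counter sample."""
    history.append((datetime.now(timezone.utc), value))

def spikes(threshold):
    """Return the timestamps where the counter breached the threshold."""
    return [ts for ts, v in history if v > threshold]

for v in [12, 15, 11, 240, 260, 14]:   # simulated counter samples
    record(v)
print(len(spikes(threshold=100)))      # 2 breaches retained for review
```

With only a live view, the two breaches in the middle would have vanished the moment the counter dropped back to normal; with retention, they are still there when the ticket arrives.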
It does not look happy.
I mean, I know the colors are sort of biasing, but the fact is…
I am very partial to the light pink; however, I know that that is not what we want to see. So we're going to take a look at the node details page, and I see, oh, the memory, that's not what you want to see on your host. That looks pretty bad, actually. So let's see if we can dig into that in AppStack, and oh, look, our host is red and definitely showing us some problems. So if we come over here into the IOPS and the latency, I see a huge spike in the IOPS, and that tells me that there is something going on there: some kind of runaway process, or something that is just chewing up a lot of my resources.
Right, so the issue is on the host now. The virtual machine is fine. The network is fine. But now the host either has a noisy neighbor or a runaway process or something, and that's actually impacting the storage.
All right, so we fixed the problem again.
We solved it again, and as frustrating as it was to have that happen…
You know, the arch nemesis, our arch nemesis, who came in and reset everything and made us do this again, probably helped us. I find it so important that you don't get tunnel vision when you're troubleshooting. You really need to look at everything that's in your environment. You need to have a very broad but also very consolidated view.
You have to make sure you don't get locked into one hypothesis.
Absolutely, absolutely. I've seen it time and time again, and probably because I'm the one who was doing it, but you drill in and you think, I've seen that before, I'm going to go do that same thing again to solve it, and it's not that. That's actually why I love the AppStack view. This particular view pulls together data from multiple modules and lays it out in a nice, clean format. You take SAM through to WPM, through to NPM, to your VMAN integration, to your SRM integration, and pull all that data together, nice and clean. Hey, I know what's failing, and you can follow the breadcrumbs until you get to that root cause. It's no longer elusive.
Right, we've made the root cause not hard to find. Now, the thing about AppStack that I love telling people when we're showing things is you don't have to buy a box of AppStack. You have it. If you have NPM, you've got it. The question is which modules you have feeding it. Now, I'm a big believer, after 20 years of doing monitoring, I'm a big believer in heterogeneous environments. I like to have a sanity check here and there to know what's going on. But I was talking with Patrick Hubbard the other day, and he coined a term I love: swivel chair integration. This is where you have five or six tools, you have a bunch of monitors, and your swivel chair just keeps swiveling back and forth, and that's how you integrate your environment. And that can be tiresome, because the tools aren't talking to each other. So this is the counterpoint to that: you have everything feeding into a single view. Another point is that AppStack is one of those places you go when there's a problem. It may not be something that you have up on a data center screen, although it might be; we could. But it's not something that you necessarily have eyeballs staring at all the time. It's a great thing when something's going on: let me check that. I have a feeling that NetPath will become another one of those. Oh, I think something's going on with the network; let me go there. That's my quick sanity check. Let me just double-check that, and that will help me drill in. I think AppStack is the same. Something's going wrong with this application? Jump on that page, take a look at it, and just walk it top to bottom. Start at the bottom, it's usually your worst problem, and work your way up. I don't know if that matches what your experience is with helping customers.
Definitely. That's the best thing about AppStack: it has everything. It's a quick, easy glance at where to start.
Okay. Very good. All right, so just to sort of recap: we were talking about good interrogation skills, and then trying to repeat the problem, and within those two things, narrowing down the culprits. But in those two things, don't get locked in. So you have your idea. You narrow it down. You start to hunt something down, but then keep on asking yourself, is this all? Because it's really easy, especially in a THWACKcamp demo, to say, 'Well look, it was just that simple.' It's not really just that simple. Problems are multiphasic. Sometimes there are extra things. It could be the network and the application. So you want to keep on iterating around those three things, making sure that you've eliminated all the other things and marked them off the list. There's also a process called the half-split, where you've got a problem and you remove half, 50 percent, of the elements and you focus in. Do I still have the problem? Do I not have the problem? If you don't have the problem, the fault is in everything you threw out, over there. But if you do, then you know that you've removed all those extraneous things, and you keep on drilling in, 50 percent at a time. Another idea here is that we could have been troubleshooting the same issue; it's just that it had a network component and an application component.
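The half-split described above is just a binary search over the chain of elements between the user and the service, under the assumption that everything downstream of the fault fails your test while everything upstream passes. A sketch, with a made-up path and a stand-in health check:

```python
# Sketch of the half-split method: treat the path from client to server
# as an ordered chain, and bisect to find the first broken element.
# is_healthy() is a stand-in for whatever test you run at each hop
# (a ping, a port check, a login attempt...).

def half_split(elements, is_healthy):
    """Return the first element for which is_healthy() fails,
    assuming everything upstream of the fault tests healthy."""
    lo, hi = 0, len(elements) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy(elements[mid]):
            lo = mid + 1      # this half is clean; fault is downstream
        else:
            hi = mid          # fault is here or further upstream
    return elements[lo]

path = ["client NIC", "access switch", "distribution switch",
        "core router", "firewall", "load balancer", "app server"]
# Pretend the firewall (index 4) and everything past it fails the test.
print(half_split(path, lambda e: path.index(e) < 4))  # firewall
```

The payoff is the same as in any binary search: seven elements take at most three tests instead of seven, and a few hundred take fewer than ten.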
All right, well I appreciate you guys coming out. Thanks to everyone for watching and good luck with all of your troubleshooting.