Monitoring Like a Sysadmin When You're a Network Engineer
Maintaining network uptime and bandwidth for the business is, well, what we do! However, there are times when we're presented with performance issues that are hard to figure out through the lens of our network monitors alone.
Join Head Geek™ Destiny Bertucci and Senior Product Managers Steven Hunt and Chris O’Brien to learn how you can apply system monitors to cover your business-critical applications and be proactive about any network issues. You’ll also learn how to verify systems and applications performance after network upgrades or features have been applied. Finally, the panel will discuss ways to break down silos and engage with your systems teams to better monitor your network.
Welcome to our session: Monitoring Like a Sysadmin When You're a Network Engineer. I'm Chris O'Brien, Product Manager for Network Performance Monitor. Joining me are Steven Hunt, Product Manager for Server and Application Monitor, and Head Geek, Destiny Bertucci.
Hey guys! So my question is, what's the benefit of taking a sysadmin mentality? Is it really that different from networks? I've been monitoring for a very long time, and I try to stay pretty disciplined no matter what I'm watching.
Well, two things come to mind. First, the application is what we're paid to deliver at the end of the day. So being more app-centric just makes sense. Second, application importance varies, right? And knowing that importance helps us make smarter decisions about what to work on, what to respond to more quickly, and how deeply we want to do the monitoring. So systems guys are one big step closer to apps and end users, so you're really ahead of us on both of these things.
So Steven, what about the opposite side of the view? As a systems administrator, do you need to put on a different hat to understand, basically, the network side versus the monitoring side?
Yeah, definitely. If you know no issues exist on the network side, it helps you quickly focus on the application systems they reside on. The network can provide contextual data with tools like PerfStack, which can rule out network issues. Kind of important. [Chuckles] Once you know that, you can focus on the elements of the application and the infrastructure stack, and not waste time in the network, where systems administrators don't need to be.
Right. Well, I've always had the mindset to think of the users themselves and their environment while they're monitoring. I feel it helps you to set up your critical areas of concern and emphasis to the business critical applications as well as the infrastructure. And users, who've learned how to configure PerfStack and NetPath views, use visualization to see where the issue lies, and that helps them to pinpoint their troubleshooting more efficiently.
Yeah, so let's try that with an example scenario we often see in customer environments-- monitoring an IIS server. Let's build out the networking monitor first but using some sysadmin thinking. Next, we'll connect together the solid system monitoring that's required there and step back to look at the end-to-end picture. [Swoosh]
Okay. So let's build out the monitoring for that IIS server. As a network engineer, I'm responsible for the transit. I'm not responsible for the application itself, but I am responsible for a piece of the application experience that the user feels. So I want to make sure the network is providing what it needs to, to enable a good user experience. One of the tools we have to do that in NPM is NetPath. If we jump over to 'My Dashboards,' then 'Network,' and 'NetPath Services,' we'll see all the services we're monitoring in NetPath.

As you build out these monitors, you specify the destination in terms of a URL, hostname, or IP, and you also specify the application port. This is super important, because it means we're sending our synthetic probing traffic pretending to be the application, so we can see the network's performance for that application. It mirrors the user experience, rather than some diagnostic tool that only does ping or something like that.

So let's take a look at what that looks like here. We'll go Intranet as our service; that's what's running on the IIS server. Then we'll go from East Data Center; this is where I'm monitoring it from. That's another thing I want to point out: wherever you have big groups of users, particularly from a network perspective, when they're geographically dispersed, you want to monitor from those different locations. Network service is not about point A to point B; it's usually about many point As to your point B, or a couple of point Bs. So you want to put these NetPath probes wherever you have big groups of users and monitor that service.

Now that we're on this screen, we can see over on the far left we've got our East Agent. This is the one pretending to be our user, sending that synthetic probe traffic toward our application.
All the way on the right, we've resolved that application URL, this intranet.demo.lab over to the server, West S-H-P-N-O-1-V and there's a 'T' somewhere as well. So now, we know what machine that's running on. And we know all the hops in the middle on the network side that are responsible for delivering transit to that machine.
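The core of what the probe does here, completing a real TCP connection on the application port rather than just pinging, can be sketched in a few lines of Python. This is only an illustration of the idea, not NetPath's actual implementation; the host and port are whatever service you want to test.

```python
import socket
import statistics
import time

def probe_latency(host: str, port: int, count: int = 5, timeout: float = 2.0) -> float:
    """Measure the median TCP connect (three-way handshake) time in milliseconds.

    Connecting on the real application port means the probe traverses the same
    firewall and load-balancer policies the application traffic does, which a
    plain ICMP ping would not.
    """
    samples = []
    for _ in range(count):
        start = time.perf_counter()
        # create_connection performs the full handshake before returning.
        with socket.create_connection((host, port), timeout=timeout):
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Running this from each location where users sit, against something like `intranet.demo.lab:80`, gives you the per-source latency picture the panel is describing.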
So Chris, when you're looking at traffic between an origination point and the destination, in this case the IIS server, what's important to understand in between? As a systems administrator, I care that users are getting there, not necessarily how the traffic flows along the way, right? So what's important for you to understand about the path of that traffic?
Yeah, I mean, that mapping's really important, right? If you look at the very bottom of the screen, you see end-to-end performance. This accounts for all of the paths you're going through, how much of your traffic is going through each path, how their latency differs, and so forth. So this is the end-to-end view; this is what the user experiences. And the nodes in the middle are the components, the variables that go into that formula, right? So this picture tells me that, around 9:10 a.m., my latency is a little higher than normal: 12 milliseconds, not really bad. If I click on that, it loads the graph for that time period and I can see each one of these individual components. Starting from the extreme left-hand side, I see my probe reach its default gateway, and I know it's a 4506; I recognize that, right? That's one of my machines. I know what that is. It's also got a little router symbol here, so it's doing routing for me. Then it jumps over to my East 2821 WAN router, so this is my sort of internal segment, right there in that chunk.
So you know that area right there? That is very, very familiar to you.
This is my stuff.
But then once you go past the WAN...
Yep. I start getting into internet stuff. In our lab, these are sort of generic internet-looking IPs, just 10-dot addresses, so we don't have all of the real internet networks, but you would see those start to populate here. And you definitely get the performance hop to hop. Eventually, you reach the destination network; here I've got one of my devices again. And eventually, I reach this end node. Now, I didn't put in that end node's host name, right? When I'm talking to you, I'm talking about intranet.demo.lab; as far as a network destination, that's what I'm seeing. But I can tell you the actual server name, because NetPath did that mapping for me.

The other thing you notice here is that the latency went up a little bit, to 12, which is not really a problem, particularly over a WAN. But because each component has delay, and we can see the imposed performance penalty for each and every hop, we can see that the delay came right at the last node. So that looks like maybe a server issue. All of the transit up to the last network node looks fine. So something's going on between West 3850, which is a multi-layer switch, you can see that from the icon, and this server. Could be my egress port. Could be your device. It's not packet loss; it's latency, so it's probably not the cable, but something's going on right in that last hop. I know which devices in here are mine, which devices I control, and which devices I pay someone else to control, my WAN. And then there are other devices where I may need to talk to you, because I don't know what that West S-H-P thing is.
Right, that's my domain. Right, that's the application server. That's where I live and breathe. But I think what's really important, for a lot of you out there, is to be able to understand the traffic, wherever it potentially could be coming from, right? I, as an application administrator, may have users everywhere. Right? They could be within the local data center network. They could be across, you know, the region. They could be across the globe. All trying to access, maybe, that particular website. So how can NetPath help them understand the difference between traffic from Site A or traffic from Site B, coming down into that particular-- they're all accessing one singular URL, right?
Right, one URL, so at some point they'll converge, at the server at the latest, maybe earlier. And one of the pages we often overlook is this NetPath Services page. It's a really powerful summary, because comparison is extremely helpful in troubleshooting. When we look at the intranet service, we see that from East Data Center it's fine. From the New York office, fine. From Los Angeles, fine. But from the West Data Center, it's not. Already, with that very small set of data, we know it's probably not an application issue, because only one source location is unhealthy. The source location of the network traffic is what's deciding whether the service looks healthy or not. So that's probably where the problem is.
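The comparison Chris makes by eye, three healthy probe locations and one unhealthy one, can be expressed as a small outlier check. A rough sketch, assuming you already have latency samples per probe location; the site names below are just the ones from this scenario:

```python
import statistics

def suspect_sources(samples_by_site: dict, factor: float = 3.0) -> list:
    """Flag probe locations whose median latency is an outlier versus the group.

    If only one source location is unhealthy, the problem is probably in that
    site's network path, not in the shared application at the destination.
    """
    medians = {site: statistics.median(s) for site, s in samples_by_site.items()}
    baseline = statistics.median(medians.values())  # typical healthy latency
    return [site for site, m in medians.items() if m > factor * baseline]
```

With the numbers from a scenario like this one, only the West Data Center would be flagged, pointing at a path problem rather than an application problem.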
Something I'd like to add is that, just as you were saying, the visualization on that page lets you see the comparison, right? That helps people know where things are at, like whether it's a remote office. And for the sysadmin, just like he said, when it's red like that, it's very visual for anybody, network or sysadmin, to see. Then when you click into one of them and it shows the actual graph of the latency at the bottom, that's a visualization I really enjoy from a network and a security standpoint, because I can actually see the flow, know if there are things happening during the day that could contribute to it, then go back, look at the path, and see if there were any changes or anything else coming across there.
Right, and it's really important for me to understand where the traffic's coming from. When someone's saying, I'm having a problem with the website itself, I need to know where that user originated because it could be related to a particular network path, right? It could be they're at home, trying to access it over the VPN. It could be they're sitting at their desk with everyone else, and there's more users in that area having that problem. So it's very, very important to understand where the origination of the traffic comes from.
Yeah, yeah, I think so. I also think that we talk a lot about blame and finger pointing. But it's--so there's always this negative connotation. The other piece of it is troubleshooting quickly requires accurate isolation. It requires that you figure out whose problem this is, so the right expertise can work on that problem right away. So, blame or isolation is super important and we want to do it well. So if I can take two minutes and isolate to you, or you can take two minutes and isolate to me, that's great. I don't want you working on network stuff. You're no good at it.
Right. [Chuckles] And I don't want you wasting your time. I want to work on it, right? And so isolation's super important, but there's always this sort of undertone of blame. And I think one of the things NetPath does for you is it helps to show due diligence has been done here. I know what of my infrastructure is involved in delivering this transit. I know the server I'm going to. I know the application I'm reaching. And I know the components of that performance and sort of intent. I know the whole picture here. So now, we can have the more intelligent conversation about how to move forward. Another thing you can do here is you can click on any of these nodes, of course. But if you click on the last node here, we will actually map this out to your Orion nodes. So we'll start bringing in things like the interfaces on this node, as well as CPU and RAM. So you can get quick indicators of if there's a RAM problem, for instance. You can also click on the node and we will go to 'Node Details' for that node. So all of the data set that you're used to for starting to troubleshoot a node-level problem.
And as a network administrator, you can get a sense when going to that node, and see which applications are actually being monitored on that because when you have the combination of NPM and SAM, now you've got all of the information with regards to the network aspect of it, as well as the systems administration aspect of it.
Yeah, so we've had a lot of ways to tie that data together in the past, but it's always been kind of kludgy and takes a lot of time. So I'm going to show you one more piece on the network side, then start tying it over to the sysadmin side and see what you have to show.

If you think about types of monitoring, this is very clearly synthetic monitoring: we generate new packets that look like application traffic, then probe and measure the response. The other big category, when you're really trying to focus on the end-user experience, the app experience, rather than test-traffic experience, is real user monitoring. That's a whole other category with all sorts of technologies in it, but one we introduced a little while ago is Quality of Experience. You find that under 'My Dashboards,' 'Home,' and then 'Quality of Experience,' of course. It's a feature of NPM at no extra cost; it just works for you, and there's a bunch of information available on how and when to deploy it.

The main thing I want to focus on here is that, when you're trying to measure application performance, this is a powerful tool. It lets you see the end-to-end response time as gleaned from real packet data moving across your network or captured at your server. It even breaks that down into average application response time versus network response time. Network response time is how long my TCP three-way handshake takes, whereas average application response time is time to first byte. So you can see that same performance metric, but this time tied to real user traffic.
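The split Chris describes, handshake time versus time to first byte, is easy to reproduce for a single plain-HTTP request. This is a minimal sketch of the concept, not how QoE actually works; QoE passively observes real traffic rather than generating a request of its own.

```python
import socket
import time

def split_response_time(host: str, port: int = 80, timeout: float = 5.0):
    """Split one HTTP request into 'network' vs. 'application' time.

    network_ms ~ TCP three-way handshake (connect time)
    app_ms     ~ time from sending the request to the first response byte
    """
    start = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    network_ms = (time.perf_counter() - start) * 1000.0
    try:
        request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode())
        sent = time.perf_counter()
        sock.recv(1)  # block until the first byte of the response arrives
        app_ms = (time.perf_counter() - sent) * 1000.0
    finally:
        sock.close()
    return network_ms, app_ms
```

A high `network_ms` points at transit; a high `app_ms` points at the server or the application behind it, which is exactly the distinction the QoE dashboard draws.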
So you can actually determine if the HTTP traffic, or the HTTPS traffic on a particular web server, is having a slow response, based on the actual traffic that's happening on the server.
Yep, yep. So real user gets you sort of this super-high accuracy, where synthetic gets you sort of the more consistent user-usage model, and shows you that data even when users aren't on it. So it's like two pieces of the puzzle. And you can see here all of my application response times are horrible, and my network response times are great. That indicates that I built the demo. That's how all my demos look. Okay, so a couple really big tools in NPM to focus on the application, focus on the end user, and sort of tie that together. But the next thing is how do we connect that together with the sysadmin side? So one of the things we introduced recently is PerfStack. So I'm going to go to 'Home,' and 'Performance Analysis.' We saw in the NetPath that we weren't getting a lot of latency. We weren't having a big variation in the end-to-end latency; we saw that in the graph at the bottom. I'm just going to take a quick look at that server that NetPath found for me over on the right hand side. I'll type that in here. West. Here it is, All right. I don't know what you guys are doing with your naming convention, but whatever, that's up to you.
It makes sense to us, as the systems administrators.
That's great. So, I've got my node here; I've got some data. But I know I want to look at all of the data, so I'm going to click this 'add related entities.' This'll bring in data sets from all sorts-- like the nodes interfaces, not just the node, NetFlow, all sorts of information about that-- like everything we know about that node.
So, being on the network side of things, you're using this entity view in a way that's intuitive for you. It pulls in the information and the metrics for you, so you don't have to be on the sysadmin's side to look at this stuff. Even when the nodes belong to the systems side of your network, when you're trying to figure out what's going on, this helps you immediately grab the related entities and the performance data from there.
Like I see the CPU Memory here. If I want to take a look at that, I can. I don't have to go and do extra work to find that machine or find that data. All of the data about that machine is here. So as the network guy, I'm responsible for the end-to-end response time. I'm going to pull that in. We saw that from a NetPath perspective, which didn't vary very much from the regular response time here, so I'm going to show you that. And this allows me to correlate versus the other data sets that I pull in. So pull that in.
And this specifically is the response time of the server.
Yes, yes. From the poller, in this case. I'll pull in percent loss as well. That looks fine. My latency is varying a little bit here, which is a little weird. But it doesn't map to sort of the extreme slowness that we're having in our scenario. So what other information do I have here? A bunch of virtual machines.
One of the things you can look at is WPM transactions. So we were talking about the type of traffic that's coming to the server itself. WPM can take a look at that, record/playback real-user interaction, and then show you the response time associated with that. So that's another response time metric that you can use, as a network administrator, to determine what is the actual response time of the user interaction, not just a synthetic probe of the TCP port, but what is the actual user doing in the web pages, and see what that response time is. So you can grab that as well.
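WPM drives a real browser through recorded steps, which no short script reproduces. But the simplest single-step version of the idea, timing a full page fetch including the body rather than just a TCP probe, can be sketched like this; the URL you point it at is up to you.

```python
import time
import urllib.request

def fetch_duration_ms(url: str, timeout: float = 30.0) -> float:
    """Time a complete HTTP GET, response body included.

    A crude stand-in for a one-step web transaction check: unlike a bare
    TCP-port probe, it includes server processing and content transfer time.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # pull the whole body, like a browser would
    return (time.perf_counter() - start) * 1000.0
```

A real WPM transaction adds page rendering, multi-step navigation, and clicks on top of this, which is why its durations reflect what the user actually experiences.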
Okay, so with WPM you guys go all the way down to the steps: pulling the pages, clicking on buttons, all that sort of stuff. So even further down the user-experience track. I see East Data Center; that seems to be related to what I've got. So I'll see what metrics, or response time, okay, for this whole operation. I'm going to bring in the average duration here: 22K, 39K, oh, in milliseconds, okay. So otherwise known as 22 to 39 seconds. There seems to be some delay here. [Laughs]
Some users wouldn't be happy about that.
Yeah, yeah, definitely. So at this point, the network's looking pretty clean. We saw that little bit of extra delay in the last hop in terms of the TCP port, but there's something going on here on the server. So I'm going to jump over, we'll grab this URL. And I'll IM that over to you and you can take a look.
So I can take that URL that you copied and sent to me, paste it into the browser, and now I can see exactly what you were looking at. I'm looking at the same exact information: the response time from the server itself, the WPM transactions. I can see what you see, and there's no trying to save and load anything. It's literally just pass a URL and we're off and running.
Yeah, like even if you select a specific time period or any, like the ordering, everything's kept the same.
Right, right. So the other thing that some people may notice is the metric palette was full of a lot of information. But when you share that URL, you're focusing in on the entities and the metrics that were associated with what you actually had mapped.
Right, exactly. But I can still take that, the 'related entities' button and I can go find all the exact same information that you were looking at. So we were talking about that a little bit earlier, around how do we get this information, right? This set of relationships--whether it's from the interfaces that are running on the server-- you can start to see the switches that are mapped associated with that, as well as on the application side, right? I can look and see what servers, what WPM transactions, what virtual hosts, what databases, what storage layer, right? I can grab all of that information, and I can start to pick apart different metrics to understand where the problem exists.
I think the thing I like most about it, though, is that he can hand you where he's honed in, where he assumes the problem could be on your side. You can take it to the next level with PerfStack, because he's saying, these are the things I feel we should monitor and see, and maybe it's the server versus the network. But then you, as the server guy, can actually drill into the transactions. You know what to look for in more depth, to pull that insight out of PerfStack.
Right, so he's been able to tell me, hey, from a user perspective, right, through the connection to the server itself, there's a problem with response time. The users are having an issue and it's not related to the network infrastructure itself.
Yeah, 39 thousand is not a good number.
It's not a very good number. Usually customers get a little bit upset when they get above like four to 10 seconds of response. We're way over that, at this point. So you know, as a systems administrator, now there's potentially a lot of different areas that I could go to. It can be a needle in the haystack, and the intention here is to be able to identify where that needle is. Through breadcrumbs. So I can take a look at different aspects of the server itself, because when I click that 'related entities,' I know that my server is here. I know which virtual host that it is actually running on. And I can start to investigate maybe noisy neighbor issues within my virtual environment. I can see all the way down into the LUN and the storage array where the VM is actually running. And again, there could be noisy neighbor issues right there-- maybe another virtual machine that is taking up a lot of IO in the storage layer, that could be causing a problem. All of these things lead to, ultimately, a response-time increase and the response-time increase is the symptom of the problem, right? We're trying to find the root cause, or where the problem might be.
Yeah, so again it's relationships with NetPath. It's the relationship of the hops that deliver that packet flows through. Whereas for you, it's sort of like layers of technology-- whether it's VMs or storage or database, all that sort of stuff. It's still relationships.
Absolutely. And that's where it's really important to understand where those relationships exist. If we don't know where those relationships are, we don't know where to start digging into to find the problem, right? I know it's on the server; I know that it's server-related. So I can look at the application but if I don't find it there, where else do I look unless I know the relationships? So what I'm going to do is, I'm going to start digging into the application itself. This is one of the most common steps that I, as a systems administrator, or most out there, would probably do. Is they know that there's an issue with response time. They've been told and given evidence that there's not actually a network infrastructure problem. There's nothing getting tripped up between the WANs, there's nothing happening where...
The WANs and LANs, they're looking great.
They're all fantastic. The world of WANs and LANs, which I know nothing about. I need to focus on my area; you focus on your area. So if I dig into the application, I can start to understand all the metrics that are in here. And we've got a lot. And that's kind of important, right? To be able to collect as much information as you can about the application, so you can start to dig in and see where a potential problem exists. Is it the actual application? The performance of the application itself? Or is there an issue somewhere else within the infrastructure stack?
The application itself can be running perfectly fine, but if there's an underlying LUN or an array, or something that's going wrong that's out there, then that can be actually increasing the response time. But the application itself, such as IIS, is running, functioning correctly. So it's good to actually see the relationship so that you can go all the way down, as well as get all that insight from the details.
Well, that actually brings up a really good point. So, one of the most common things that people want to look at is how's my CPU doing, right? And if my CPU is characteristically higher than normal, then that may be the problem. But just looking at CPU isn't an immediate way to diagnose a problem. So just looking at those basic server metrics doesn't necessarily tell us where we need to be. Again, it's that needle in the haystack. We have to start looking for those breadcrumbs. But you have to somewhat know a bit about your application to know what it is you're looking for. We know that this is a website. We know this is an IIS server. So there's certain elements we can start looking at to understand where that exists, instead of just kind of CPU, memory-- we want to dig a little bit further into application performance. So if I look at the application that's being monitored, we can look at a whole bunch of different types of metrics. As you can see, there's a lot here. You can start to become overwhelmed if you're not terribly familiar with this.
Yeah, I don't know what many of these—like, I know what average percent CPU is but many of the other ones in this long list, like Application Host Helper, I don't know what that is. Windows Process--I don't know what that is.
And so that's the great thing. When you're the systems administrator and you focus on your realm, then you start to know. But there's an aspect to Server and Application Monitor that can help you understand a little bit more about that. So real quick, I'm going to go into 'My Summary.' I'm going to take a look at IIS running on one of these servers, and just show you quickly how you can understand what these metrics mean. So if you're not an application expert, which many of them aren't, but they've been kind of thrust into this need to manage and monitor these applications. So we want to try to give them a little bit more context of where problems, or what problems are and what they could be. So when we look at AppInsight for IIS, we can start to dig into all of these different metrics and understand, if there's a problem with that metric, what does that problem mean? If you start clicking through metrics, we'll look at total connection attempts. And we give some expert knowledge in terms of what this metric really means. Again for those that aren't application experts, they can take a look at these different metrics and they can understand if there's something out of the ordinary--if this is uncharacteristically high-- what could that mean to the performance of the application?
Okay, so I'm actually amazed by that because I didn't even realize that that was there. So this would actually help me if I was looking into this and seeing an issue. As I was just saying earlier, when it said 90% memory, I'm like, ah, it's the network group. Oh, there's your problem. [Laughs] When you can drill into here and actually see the information and know how to react if it's-- you know, is that supposed to be normal? What do you do when this happens, and we actually give you the guide? I think that's very intuitive and very helpful for the users.
Boy, and that's the important part; if you don't necessarily know the application, especially a complex application, there's a lot of information to understand, and we can help provide that information.

So, let me go back to our PerfStack screen. All right. Since there's a whole bunch of metrics here we can dig through, we're going to search for one that we know might point to an actual problem: request execution time. We'll take a look at that. From the expert knowledge, if we weren't familiar with this metric, we'd learn that any uncharacteristic spike here means there's a problem with the web page's response to a request coming from the end user. And what we see here is that there are some spikes happening.

We can go back into the metric inside of AppInsight for IIS; there's our expert knowledge again. What we see there is the average baseline, which SAM can compute, and then our responses that are uncharacteristic. What that likely means is that somewhere within the application itself, within the web code, something's broken. We can look into our actual transactions and dig into the steps. So, we knew the WPM transaction was long. We've got information from the request-execution metric we saw within PerfStack, and we've looked at the expert knowledge, so we know that's a problem. Now we can dig into the individual steps associated with this transaction, and we can see one, like an authentication step.
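The "uncharacteristic spike versus baseline" idea Steven leans on can be approximated with simple statistics. SAM's baselining is more sophisticated than this; the sketch below just flags points well above the mean of the series.

```python
import statistics

def spikes(series: list, k: float = 3.0) -> list:
    """Return indices of samples more than k standard deviations above the mean.

    A crude baseline check: with a mostly steady series, only genuinely
    uncharacteristic values (like a slow request-execution time) are flagged.
    """
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)
    return [i for i, v in enumerate(series) if v > mean + k * sd]
```

Fed a series of request-execution times that is flat except for one slow sample, this flags just that sample, which is the cue to go look at the transaction steps behind it.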
Yeah. I'm no sysadmin, but that 6.94 is red.
Right, that's so--
And in bold.
Exactly. So, we know there's a problem with the authentication step happening within the website, and that's likely where our problem exists. Now we can go to the SharePoint administrator, to the web administrator, and say, hey, we found a problem within this particular step. We know the users are having a problem with response time. We need you to take a look at the actual code and determine where that problem exists, instead of us trying to figure out where, within this whole stack, the problem is.

We looked at the user response from the network side. We found there was a response issue, but no issues within the network. We looked at the application itself, found a performance problem within the application, which led us to understand there's a problem inside the web code itself, and we can find the particular page it lives in. So, through analysis of the network data and the systems data, we've narrowed this all the way down to an application problem. It wasn't the network this time. It actually happened to be the application, and we know where it exists. Now we can go solve the problem instead of finger-pointing over whether it's the application or the network. No, we know exactly where it exists.
Well, and this data not only helps you as the server admin and the network guy to actually understand it's not on his side this time, you know where it is on yours. But they can actually go through here and say to the developer and people, and pinpoint it down into the actual authentication, so it saves them time too.
Right, right. The whole point is to find where the problem exists so we can fix it, not just to prove the problem's not on my side. It's about identifying where it exists and actually solving it. That's the key. Because ultimately, we're impacting users, and no one wants to do that.
Yeah. We all just want users to be happy so we can go back to drinking our coffee, having lunch. These are the important things in the IT life.
I mean the important things.
An uneventful day is a great day.
Okay. So those were both great, and I'm glad you could segment it down by side: using NetPath from a systems point of view, and PerfStack from a networker's point of view. Now I'm going to use PerfStack the way we've seen at some of the SWUG events, where users gave their feedback on how they would use it for network or security work. So if we look at the demo, we're in PerfStack right now, looking at average response times and average waits; you have your transmit, your ingress, your egress. Why this is important to me as a networker is that when we have change requests, or a maintenance, or something like that, a lot of times now they want a statement of work beforehand. What are you trying to accomplish, right? After you make these changes or upgrades, what are you trying to accomplish? Or, if we have to roll out patches or update IOS versions, they want to verify it's not causing an issue on your network. So to use PerfStack, I come in here, and if we're doing a maintenance, say, as you can see here, right around 8:30, and then all of a sudden our average response times start to go up, up, up and away, right? For me, this is bad. This is not what we wanted to happen. The normalcy is not there anymore. We're seeing an increase in response time. What we implemented is showing me that visually, automatically, and I can now take this link and give it to my manager, or to somebody else along with my statement of work, and say, this is why we need to roll back. It's not a guessing game anymore. It's not, you know, oh gee, I don't know if this is the problem.
It's more that I have something I can give to them as a report now, something to back up whether the change went well or went badly.
Yeah. So whether it's, I totally screwed it up, here's the evidence, or someone else screwed it up, here's the evidence: that pre-and-post picture of a broad set of metrics after you do a change is really valuable.
So that's kind of important. You highlighted the time that you made the change, and you can see the immediate impact to the network, and potentially to end-user performance: where that change started and the issue it created. And if you roll it back, you immediately know how the metric changes as well.
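[Editor's note: the pre/post comparison and rollback decision being described can be sketched as a simple check. This is a hypothetical illustration, not a SolarWinds API; the function name, sample values, and 20% threshold are all assumptions chosen for the example.]

```python
# Illustrative sketch: compare average response time before and after
# a maintenance window and decide whether a rollback is warranted.
from statistics import mean

def rollback_needed(pre_ms, post_ms, max_increase_pct=20.0):
    """True if post-change response time rose more than the allowed %."""
    before, after = mean(pre_ms), mean(post_ms)
    increase_pct = (after - before) / before * 100
    return increase_pct > max_increase_pct

# Response times (ms) sampled before and after an 8:30 maintenance.
pre  = [42, 40, 45, 41, 43]
post = [95, 102, 98, 110, 97]
print(rollback_needed(pre, post))  # → True: responses roughly doubled
```

This is essentially the statement-of-work evidence being discussed: instead of "I think it got worse," you have before-and-after numbers and an explicit acceptance threshold agreed on ahead of the change.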
Yeah, and one nice thing is that this auto-updates, right? So you can just have it on your right monitor, keep it going while you work through your changes, you're doing step five of fifty, and keep your eye on it. I love having that constant indicator of, is it going how I think it's going?
Exactly. Especially when shapers are coming out as well. We use PerfStack to monitor, say, the HTTP traffic going across, or our SharePoint, things of that nature, while I'm making a network change. If I see that response time is going up, and I'm also looking at my NetPath and starting to see response times climb, I can look at PerfStack and see how I'm affecting your application. It helps me to have things up on here so I can keep, kind of, a healthy heartbeat, if you would, on your network when I'm making changes or security implementations. We all know that when we make security changes, we're trying not to hinder things, but sometimes they do make a difference in the network, and it needs to be documented. Because to validate the reason or the cause, you have to showcase what happened and verify it. You can't just say, oh well, the response time went up and it was because of this. You need to know 100%, and it has to be documented. And this is a quick and easy way that our users, you guys, showed us you're using some of this information. It's something I like to use, especially with security changes and maintenances, to validate what I've done and to prove that it accomplished what it needed to do.
Well, you highlighted something important. You're potentially making a change for security reasons. You're closing a security loop. There's been a notification of something, and you've got to make a change to mitigate that issue. But it's really important to understand: does that have an impact? Is it going to create a performance impact to the environment? In here, you immediately see it. So now, to your point, you have evidence to say, yes, this is a security issue we need to mitigate, but the resolution we made has a negative impact on our end users and our environment, and we have to roll it back now because we can't have that impact. That's the most important thing. Then you can proceed with figuring out how to mitigate the security risk in a way that doesn't impact the users.
Definitely. And this also helps with the statement of work, as far as how long you're supposed to be down during your maintenance. We can showcase this, show that it was down for X amount of time, and show the whole history of your maintenance window. I think that's vital, especially when people are getting into the documentation, and especially if you're learning or just starting off in networking: it helps you focus in, and now you have a history track too. You know that this actually took X amount of time on these devices. You can put these in groups of however many devices you want to compare, and keep them all across there. So when you can see these and showcase the maintenance windows, you can use that to say, hey, when I'm doing my next project, I know it took this long. This is what I'm basing my next statement of work off of. This is what we're hoping to have, and we can showcase them together.
This is also powerful for looking at redundancy. As you build basic connectivity without a lot of redundancy, you go into maintenance windows and you can have a lot of impact to your users, and you can see that here. But as you justify adding redundancy, and you do the work and the business pays the money and all those sorts of things, you start to close that window of user impact. And eventually, if you're doing your job correctly and you can get some of those redundancy pieces put in place, there comes a day where you say, I'm going to start my window, I'm watching all of the metrics that tell me end-user performance, and there's no change. And that's a beautiful thing. It also helps you justify: we spent this money so we could accomplish this. If I had to do this midday? I've done it ten times, I know there's no user impact, I could do it midday. It's very powerful, and it makes you feel like your environment is flexible and works with you, that it's not fragile. Because no one wants to feel like their environment's fragile.
No, no. And it helps to give you that confidence because you have history there that can actually showcase what you're doing. And it gives you a kind of leeway into, you know, hey, everything's functioning the way that it's designed, and I know that. And it's very powerful to have that confidence behind you.
Well I really appreciate both of you for being here today with me, and all the THWACKcamp viewers. The days of finger pointing seem to be over for the most part. Now, the focus is on how we can quickly meet those internal and external SLAs.
I'm not sure I agree the fight is over, but I think we're taking steps in the right direction. Casting blame is bad, but accurate problem isolation is very important. And it's really a fine line between them. I think application-centric analysis and simple visualizations definitely help with accurate isolation. For SolarWinds, NetPath and PerfStack are our big bets to help with that.
I agree with Chris. But it's important to break down the barriers. Not only are you able to verify the potential of network issues, you can also verify the application issues and performance in the same context. So if there have been any changes, network and systems administrators can work together to resolve the issues quickly, to the benefit of the end users.
Well thank you all for viewing today's Monitoring Like a SysAdmin When You're a Network Engineer. I hope it was as informative for you as for all of us in the monitoring trenches. Hopefully, you can start conversations with your peers, and create more efficient monitoring solutions.