When DevOps Says "Monitor"
IT professionals who work with "traditional" operations-centric monitoring solutions often find an ideological disconnect when interacting with their DevOps-oriented colleagues. Is it because monitoring in DevOps is somehow unlike monitoring standard on-premises equipment? Or because systems created and maintained using DevOps techniques are fundamentally different?
In this panel discussion, our guests will break down the terminology, expectations—and yes, even bad (monitoring) habits—in the DevOps world so that a traditional monitoring engineer can feel right at home.
Good day. Welcome to When DevOps Says Monitoring. That's kind of an interesting title, and the reason why, where it comes out of, is that when I began my journey into the world of DevOps about two years ago, I was going to conventions and meetups, things like DevOps Days and Monitorama, and also watching a lot of talks online. And what became abundantly clear to me very early on was that the person speaking would say the word "monitoring," a word that I enjoy, and then they'd say a string of words that made no sense to me whatsoever. Which was a little disconcerting, because I've been doing monitoring for a little while; I thought I kind of understood the topic. And after a number of conversations, both sidebar conversations and listening to these talks, it became clear to me that this is a conversation that we need to have. We in the monitoring community need to understand the differences of monitoring in a DevOps context versus a traditional IT operations context. So that's the talk that I want to have today: when DevOps says monitoring, what does that mean, and how, if at all, is it different? So I brought a few friends from those talks that I've been attending with me today. Today, I've got Michael Cote; he is the Director of Technical Marketing at Pivotal, welcome. Also, we have Nathen Harvey, who is the VP of Community at Chef. And finally, we have Clinton Wolfe, who is a longtime DevOps practitioner and is seeking his next new adventure in the exciting world of DevOps. Welcome, thank you so much for coming.
By way of introductions, I thought it would be good if you told us a little bit about your place in the DevOps community. Your journey to it. But also how monitoring has intersected with that. Your view of monitoring in the DevOps world. So I think we'll start right here with Nathen.
Sure, so some of my history: I started as, you know, a software developer, a sysadmin. And really, in one of my early career positions I recognized a need for more server capacity. So I went to my boss and I said, hey, I need more capacity for this app. And he said, no problem, fill out this form, we'll get you another server. [Laughing]
18 months later, I could log into that box and it was amazing, it was so brilliant. And then we had to add it to the monitoring system and whatever, but in the very next job, the very next job when I needed new capacity, I called an API. And I had a box that I would never see. I could never touch. And I had full access to that box. And I could spin it up; spin it down whenever I wanted. And so to me, like that was my beginning journey into this world of DevOps. This world of automation, this world of the cloud right? And so monitoring really began to change. Really, we had to think about things in a completely different way. So I think that's kind of my journey, and how I ended up here.
Fantastic, well welcome again.
Clinton, how'd you get here?
Gosh, well I think I'll tell you about my... So I have a long history as a web application developer. One of the first web applications that I developed was at Indiana University, working in the NOC. This was about 2001, and the Indiana University system at that time had approximately 8,000 layer 2 switches spread out across its eight campuses. We had a system called MRTG, if that rings any bells, that we were using to monitor it, and we had to feed in the control addresses of every single one of those 8,000 devices into the system. This was far before the cloud, of course, so we had to use discovery tools and do a daily run to try to find new things and discover how they were connected. It was an absolute nightmare. We had 10 or 11 people dedicated just to this task. And my job was to write the web application that let you add new things manually or do firmware updates, that sort of thing. Fast forward 15 years or so: I'm at OmniTI, a company that was doing a lot of DevOps-related work, especially in consulting. Theo Schlossnagle was our CEO at the time, and he later split off to found Circonus, a major Software-as-a-Service monitoring service. I also walked by Jason Dixon's desk a lot and said hi to him; I can't really say that I knew him that well. [Laughing] But anyway, yeah. So much work today, even, is around discovering things and identifying what is supposed to be monitored, and that continues to be a huge problem, even in the cloud.
So, relatively ancient history: I worked at BMC Software on a little product called Patrol. And you know, we could go into whether it was the terrible Patrol or the better one, but never mind that. [Laughing] And I worked on everything from the UI down to the little agent that would actually go out and monitor things. So I was causing a lot of the problems of monitoring by selecting what was monitored, and logs and stuff. And then, just to briefly go through it, I was an industry analyst at a couple of places and worked on strategy and M&A somewhere else. And now I work at Pivotal, and mostly what I do is, when pretty much larger organizations want to switch over to improve how they do their software, which some people would call DevOps or cloud native or whatever, depending on what's in your marketing PowerPoints, they often have questions, kind of at the management level and above, like: that all sounds wonderful, but how do we actually do that? And so that's what I spend most of my time nowadays doing: assuring them that it's not just some sort of celebrity fad of how you manage IT, but that common, normal people can do it and succeed with it. And I think the second part of what you're asking is kind of, how monitoring and DevOps mix together a little bit, if I remember. And yeah, not to dismiss the mystery of it all, but the first thing is, well, it's still computers and software, so you've got to figure out what you need to monitor to fix it. There's nothing magical about containers; it's not some mysterious thing. Things still go wrong and you've got all the same stuff. And I think why it gets a little weird is that, at these conferences you're mentioning, the people who go to DevOps shows are more interested in the human side of things, and the software and the business process. And I think, kind of just like I did, their assumption is: "Well, sure, if you need to monitor CPU, you can still do that; you still need to do that. But that's kind of boring. Let's talk about human problems and monitoring, like, business-IT alignment and all this other stuff that's new and possible to monitor."
So that's actually a really good place to jump in. I want to cut straight to the chase. What is it that you mean when you as a DevOps practitioner say monitoring? What are you talking about?
Well, just to be fair I'm not really a practitioner, I just make slides. But I'll pretend like I'm one.
Well, I think... Like, you know, I was trying to remember: in the Google SRE book, there's a section on monitoring. And SRE is Site Reliability Engineering, for those who don't know the initialism. But more importantly, it's the people who make sure that their stuff stays up and running. And they also have a whole philosophy of: if I'm going to guarantee I can keep Gmail up and running, you have to do these things. Which is an interesting approach; I think for most sysadmins, the relationship is the opposite way. They're told to keep this up and running, and don't go home until you've got good uptime. But anyway, they have kind of an overview of the first tier of metrics. Which, if I remember, is basically: you want to monitor your latency, which is the time it takes to do a request. Which makes sense: how fast are things going? Then you want to monitor traffic, which is the rate of requests; for a web application, how many requests per second do you have? Then they've got this fancy little term, saturation, which you might think of as capacity, right? Like, how much is the pipe full? And then you've got the error rate that's happening, right? And this is where it goes into the mysticism of DevOps. We have, and I used to program these back when I was making agents, we have these assumptions of: here are all the attributes a computer has, so we should monitor all of them and we'll make sense of it later. But the approach with those four things that you see DevOps people taking is more like: well, what do we actually need to monitor? When this system goes down, how are we going to recover from it? And we should only monitor those things. Maybe we diagnose stuff somewhere else, but let's only monitor the things we need to monitor. So that's sort of why it's a little bit weird next to the classic monitoring stuff: there's not this assumption of, let's just monitor these 500 things because that's what we do.
It's more like: let's study the actual workload or application that's running and profile it. And then, as I was saying, the other part is that a huge part of DevOps is that this software should be useful and usable to the user, right? It should accomplish the goals that you have. Which I'm obviously saying in a comical way, but the way that you assure software is accomplishing the goals you have is you come up with some other metrics and you monitor, I don't know what the kids call it nowadays in the monitoring space, but the business metrics, right? And classically, I think it's Werner Herzog, or not Werner Herzog, he's a film guy, the other Werner in our industry, who basically said the only metric you really need to monitor is cash coming in the door, right? And that's a little bit of an exaggeration, or a lot of one, but the point is: also have these metrics that are telling you the, I don't know, the non-hard metrics. That could be developer throughput to get releases out, or cash coming in the door, or whatever. And I think it's that second bucket of things that you oftentimes encounter DevOps people talking about. And you're like, yes, but what about RAM, right? You want to know about these more classical things.
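The four signals Cote recalls from the Google SRE book (latency, traffic, errors, saturation) can be sketched as a small calculation over a window of request records. This is an illustrative sketch, not anything from the book: the field names, the window, and the capacity figure are all assumptions for the example.

```python
# A minimal sketch of the "four golden signals" computed from a window
# of request records. All field names and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float   # how long the request took (latency)
    status: int          # HTTP status code (errors)

def golden_signals(requests, window_seconds, capacity_rps):
    """Summarize latency, traffic, errors, and saturation for one window."""
    traffic = len(requests) / window_seconds              # requests per second
    durations = sorted(r.duration_ms for r in requests)   # for a percentile
    p95 = durations[int(0.95 * (len(durations) - 1))]
    errors = sum(1 for r in requests if r.status >= 500) / len(requests)
    saturation = traffic / capacity_rps                   # how full is the pipe?
    return {"latency_p95_ms": p95, "traffic_rps": traffic,
            "error_rate": errors, "saturation": saturation}

reqs = [Request(120, 200), Request(340, 200), Request(95, 500), Request(210, 200)]
print(golden_signals(reqs, window_seconds=2, capacity_rps=10))
```

The point of the exercise is the one Cote makes: these four numbers describe the workload itself, rather than enumerating every attribute the machine happens to expose.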
Right, okay so yeah.
To add onto that, I've seen it presented as sort of a three-layer cake. The idea is that the top layer is exactly all of those business goals. So if you're selling shoes, how many shoes are we selling? If we're a mobile app and we depend on in-app purchases, how many of those are going through? What's the volume on that? And those are really maybe the only things that you actually alert on, because if those things are failing, you have a business problem and there needs to be a response to that. Behind that, though, if one of those things is failing, then you should also be monitoring the services and applications that make up those things. There you're watching things like error rates, and you're starting to get into more of the technical things. You're watching for availability of, say, the storefront, or the error rate on the storefront, or the decline rate on the transaction-handling system, or something like that. If any one of those things spikes, that's going to be very useful as a diagnostic to inform why your business is suddenly not making money. Below that, though, is the third layer: the systems-level CPU, RAM, and disk, the sort of holy trinity of monitoring. Those become decreasingly useful. Today, in a DevOps world, and DevOps especially is about flexibility, agility, automation, and elasticity to a large degree, you can use those sorts of metrics to drive elasticity. So that if you're seeing CPU spikes, your response might not be, "Oh gosh, let's wake some people up and go look at it," but rather, "Let's simply provision some new servers and have them up and running to absorb the load." So you use that bottom layer of system metrics simply to drive elasticity.
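Clinton's bottom-layer point, system metrics feeding a scaling decision rather than a page, might be sketched like this. The function name, thresholds, and scaling steps are all invented for illustration; real autoscalers also add cooldowns and sustained-window checks.

```python
# A sketch of using a system-level metric (CPU) to drive elasticity
# instead of waking people up. Thresholds and step sizes are made up.
def desired_instances(current: int, avg_cpu_percent: float,
                      low: float = 30.0, high: float = 75.0) -> int:
    """Scale out on high average CPU, scale in when mostly idle."""
    if avg_cpu_percent > high:
        return current + max(1, current // 2)   # absorb the load
    if avg_cpu_percent < low and current > 1:
        return current - 1                      # shrink back down
    return current                              # steady state: no alert, no action

print(desired_instances(4, avg_cpu_percent=90))  # CPU spike: add capacity
print(desired_instances(4, avg_cpu_percent=10))  # idle: give capacity back
```

The design choice mirrors the three-layer cake: the CPU number never reaches a human; it only reaches the provisioning loop.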
Got it, Nathen go ahead.
Yeah, I think Clinton touched on something right at the top there where he said the word alerting. A lot of times with DevOps, there's a lot of hatred for monitoring, right? "Monitoring sucks." And I think when people say those words, we don't really mean that monitoring sucks, and we don't mean that traditional monitoring sucks. What we mean is alerting sucks. I hate to be woken up at 3 a.m. to be told about something that doesn't matter, that I have no control over, or don't have the capability to fix, or even the necessity to fix, right? And so, with DevOps, as Cote said, we think about the people first. We really put people first. In IT, we're really terrible about putting our people first. And we know this; anyone who's carried a pager or been on call at any time knows that we're terrible at putting our people first. And DevOps really seeks to change that: let's put our people first, and make them a top-line consideration. But then, coming back to it, we have to separate monitoring and alerting, right? They're connected topics for sure, but I think they're different things. And then, of course, you want to start with that: "Is it up? Can my customers pay me money?" That's where you start. And from there, you will discover what other things we need to start monitoring.
Great, so I'm going to push back a little bit, because what you're talking about, monitoring being frustrating because the alerting is challenging or sucks or whatever, that isn't anything new to folks who have been doing traditional operations-based monitoring. And one of my soapboxes has been for a long time that monitoring is not an alert. It's not a screen; it's not a blinking light. It's not a poke in the shoulder. It's not a whoop-whoop noise coming off of a machine in the data center. That's not monitoring. Monitoring is simply the regular, ongoing collection of metrics from a set of devices. That's monitoring; everything else, the whoop-whoop noise and all that, is a happy byproduct that you get if you do the first thing called monitoring. So if your alerting stinks, that can be tuned, as long as you have good monitoring coming in. So I don't think that that's surprising. But I think where there's a point of departure for the DevOps world that I've seen was actually summarized by somebody. Charity Majors wrote about it on July 28th; she has an article, "Ops: It's Everyone's Job Now," and I'll have a link to that in the show notes. And she said, "Compared to the old monoliths," I feel so good about that, thanks, Charity, "…that we could manage using monitoring and automation, the new systems require new assumptions: that distributed systems are never up. They exist in a constant state of partially degraded service. Accept failure, design for resiliency, protect and shrink the critical path." I think already people who are doing traditional monitoring may have their heads exploding just from that concept. "You also can't hold the entire system in your head." This is another hard one for folks from traditional ops. "You cannot hold the entire system in your head, or in your inventory, or reason about it. You'll live or die by the thoroughness of your instrumentation and observability tooling. You need robust service registration, discovery, load balancing, and backpressure between every combination of components. You need to learn to integrate third-party services; many core functions will be outsourced. And you have to test in production." I said it out loud: "You have to test in production. You have to do so safely. You cannot spin up a staging copy of a large distributed system." This is a mindset, again, that I encounter time after time after time, and it just knocks me off my chair, because I'm thinking, from a traditional operations-based monitoring perspective: no, I inventory my environment. I discover the devices; now I know what they are. Now I can figure out what kinds of things they are, and what sub-elements I'm going to monitor. That's my world. So what is DevOps thinking about that? And what is DevOps thinking is more important than that? And I'll take it back the other way again. So Nathen, you're on the hot seat.
Sure, so I think that first, what you're talking about there is that we have to care about the purpose of our applications. What are our applications here to do? Who are the business users or customers that they're meant to serve? And what are those customers trying to achieve, right? And with a distributed service, yes, the components within it are always going to exist in a degraded state, right? But the service as a whole, we want to make sure that that's working properly. And when that starts to fail, then we need to understand: why is it failing? Which components are failing? And we also need to push back on the application developers and ensure that they are building the appropriate telemetry into their applications, telemetry that gives the operators of those applications observability: insight into what's happening in that application. Is this particular instance of this application in this distributed system healthy? And if it's not healthy, what are the actions I should take? Should I just discard it and provision a new one in its place? Does it matter that it's not healthy? Why isn't it healthy? These are the questions that we need to start asking.
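The kind of telemetry Nathen is asking developers to build in often takes the shape of a structured health report an operator or orchestrator can act on. This is a hypothetical sketch; the field names and checks are invented, and a real application would expose this over an HTTP endpoint.

```python
# A sketch of application-built telemetry: a health report that answers
# the operator's questions ("is it healthy, and if not, why not?").
# The checks and field names here are hypothetical.
import json
import time

START_TIME = time.time()

def health_report(queue_depth: int, db_ok: bool) -> str:
    """Report overall status plus the per-component checks behind it."""
    checks = {
        "database_reachable": db_ok,
        "queue_not_backed_up": queue_depth < 1000,
    }
    healthy = all(checks.values())
    return json.dumps({
        "status": "ok" if healthy else "degraded",
        "uptime_seconds": round(time.time() - START_TIME),
        "checks": checks,   # tells the operator *which* component is failing
    })

print(health_report(queue_depth=12, db_ok=True))
```

Exposing the per-component checks, not just an up/down bit, is what lets an operator decide between "discard and replace" and "investigate."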
Okay, Clinton, anything to add?
Yeah, absolutely. So, building on that, around the idea of what matters: what matters, healthy or not, is the service. You might look and find that, say, in a container-based service, you have one container type that keeps living for a certain amount of time, has a memory leak, and then suddenly gets killed by its orchestrator because it's exhausted its allocated resources. Do you care about that? You really don't, if the service is running. And developers make these sorts of calls all the time as to whether that memory leak is worth fixing from a business sense, because that's time that has to go into fixing it, and the system as a whole is going around keeping the service operational, keeping it working. So in a sense, that's one common example of having a somewhat misbehaving service that's degrading itself constantly, while the service itself is still always running. Another example I had once: this organization had a spike sequence, in which they would see a large audience in the Eastern time zone, then in the Pacific time zone, and then around the Korea and Singapore areas. There were three major spikes across the day. And so their load, which was built to be elastic so that it would ramp up and ramp back down, would ramp up in the evening in the Eastern time zone, then in the evening in the Pacific, and then in Korea. And their original monitoring system was going crazy, because it was saying, "Oh my gosh, we don't have nearly enough instances running, because, hey, it's three in the afternoon in the Eastern time zone and I expect there to be 200 instances running." And the fact is, you don't need that many. You don't need them until 6 p.m. So you have to have tools that understand things like the time of day, and that understand that your target is moving.
And that your service should be working in the place it is looking, and that's really a sophisticated thing to ask of a monitoring system. So, as time goes on, as we work with more flexible workloads and we're able to make those workloads more dynamic, and to shrink our costs by putting the workloads exactly where they need to be and nowhere else, our monitoring has to become smart enough, and the alerting tools attached to it have to be flexible and advanced enough, that we're able to write that kind of query.
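The moving-target alerting Clinton describes, comparing what is running against what this hour of the day usually needs, might look like the following sketch. Every number here is invented for illustration; a production system would learn the baseline from historical data rather than hard-code it.

```python
# A sketch of "alerting that understands the time of day": compare the
# running instance count to an expected baseline keyed on the hour (UTC).
# The baseline numbers are invented for illustration.
EXPECTED_INSTANCES = {hour: 20 for hour in range(24)}   # quiet default
EXPECTED_INSTANCES.update({22: 200, 23: 200})           # Eastern evening spike
EXPECTED_INSTANCES.update({1: 180, 2: 180})             # Pacific evening spike
EXPECTED_INSTANCES.update({10: 150, 11: 150})           # Korea/Singapore spike

def should_alert(running: int, hour_utc: int, tolerance: float = 0.5) -> bool:
    """Alert only if we're far below what this hour of day usually needs."""
    expected = EXPECTED_INSTANCES[hour_utc]
    return running < expected * (1 - tolerance)

print(should_alert(30, hour_utc=15))   # mid-afternoon: 30 instances is plenty
print(should_alert(30, hour_utc=22))   # evening spike: 30 of ~200 is a problem
```

A static threshold ("always expect 200 instances") produces exactly the false alarms in Clinton's story; keying the expectation on the hour removes them.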
Well, I think if we were to pull a drop of blood from all the sysadmins out there and make this sort of friendly homunculus, we'd see in its head, and I guess it could be a he or a she, we'd see in its head this grinding of gears like you're portraying, so that we're not just projecting on you as the friendly homunculus. A big first step, especially for what Charity was getting into, is: let's narrow the scope to the workload we're talking about. And generally, and there's a novel-sized footnote about this that you could get into, but generally what we're talking about is custom-written software. Software that this organization is writing and running on their own. We're not talking about managing SharePoint, right? Now, like I said, there's a bunch of 'footnotey' stuff you could talk about here. But in that case, when you're writing your own software, and you're releasing it every week or every day, on a cloud infrastructure, with the distributed nature of it all, then yeah, you're basically dealing with this unknown beast, like a mad homunculus that you've got to run around and monitor. And in that case, everything she said applies, right? You've got this big unknown ball of craziness, and so the way that you approach it is much different. Now, what most sysadmins, or whatever they call themselves nowadays, are used to is: I have a relatively static workload that I understand, historically, and also because I've done a lot of thinking ahead of time about what I'll need to monitor. So I know that when this set of things happens, bad stuff is going to occur. But if, let's say you're at a big bank, and I've got 300 teams of developers releasing new functionality on a weekly basis: who knows what, right? And that's why, like when I was alluding to the SRE stuff, there is almost this inversion of, "Hey, developers. You must run in this contained thing so that we don't burn down the house." But it also brings in all of this kind of unknown-craziness style of monitoring that you need to do instead, because you have kind of no idea what you're monitoring on a week-to-week basis.
Got it. Okay, so that starts to help us get our arms around the beast. One thing, when we were having conversations to prepare for this, that I thought was interesting and useful to go over is: what do you three think is the reason for this disconnect? Traditional IT, traditional operations, is expecting this fixed environment, and that's the world that they, in many cases, are still in. I know I still was. And yet DevOps has sort of raced over, I won't say ahead or behind, but raced to the side, to this mad homunculus who's running around, this crazy beast I can't ever put my arms around. Why is there a disconnect? Why wasn't it just a natural progression? How did we get to the idea that there's this divide that has to be crossed in the first place? So, how did we get there? And I'll keep on going back and forth like that.
Yeah, yeah. I mean, I think it's because, largely, the practice of systems management, the phrase we still use, was never concerned with all the stuff that DevOps is. It was concerned with running, not to be all history-professor guy, but it's probably a reaction to migrating to Microsoft desktops and running ERP systems, which are very fixed, long-release things that are static and stay the same. And there's also network traffic and things like that. But the ops people in IT departments were not so much concerned with running the custom-written software that organizations had. And then there's the whole 2000s, and then you have to write your own software, and so forth and so on. DevOps really comes from the consumer world, where they were all writing their own custom software, and they were a bunch of young people in apartments eating ramen who couldn't care less about ERP. And they came up with this whole new science of things: DevOps. And at the same time, because cloud stuff is cheaper, and because of a lot of the work that Chef and other people did to make it easier to configure things, the risk and cost bar of making your own custom-written software is a lot lower. So in some sort of Jevons-paradox thing, all sorts of people do it nowadays. And then on the business side, there's this, I don't know if this is top pressure or back pressure, I don't know those metaphors, but there's this pressure of: Uber is going to come and run us out of business, so we should learn how to customize our business and do custom-written software. And it's not only Uber; all these other companies are coming in and trying to disrupt. And so you need to write your own software to differentiate your business, and be able to change your business model on a weekly basis. To steal an old Chef line and transmogrify it, you've got to have a programmable business, and be able to change the business around based on the custom software you have. And there are still all those ERP systems. And I think, with the sysadmins, it's like all the Eloi who are running all this are going down to the Morlocks and saying, "Hey, Morlocks. Stop eating us and help us out." And I think the Morlocks, I mean, I don't know how human flesh tastes, but they're kind of into it, and they need to change their diet, I guess.
So, are we saying that the sysadmins are Morlocks?
Okay, you've been warned.
I mean, I mean if you're going to--
The Sci-Fi references in these, it's rich.
As I recall, from the original text and the interpretations, if you're going to choose one of the two, that's the one you would want to choose, right? I think they do much better.
Okay. All right. Fair enough. All right, Clinton. If you can, anything to add onto that?
Well, yeah. I shall come up with a really bloody, graphic metaphor at some point, I swear. But, so, yes. I think the biggest thing is ephemerality. This idea that you've got, as you mentioned, a fairly fixed set of systems. Even today, I'll encounter groups that say, "Well, we start off with our inventory, and our inventory is this. And we get a notification when the inventory changes." And then they become upset when they don't get notified. You know, that model just falls down whenever you have an ephemeral system. And just to level-set here: when we think of DevOps, one of the big innovations that happened was that people found they could create and destroy entire compute environments or storage environments at will through an API. Infrastructure as a Service, as they call it. And then you can put it behind code, Infrastructure as Code, so that you can have these things created and destroyed under programmatic control.
And just to clarify for people who aren't familiar. I mean, we're talking about on a daily or even hourly basis that people would bring up and destroy their entire application infrastructure.
And with container systems, it can even get down to seconds. It's very, very, very fast. So if your inventory is changing on a second-by-second basis, you have to come up with a new model. A lot of the new monitoring tools... So, take Nagios, for example. You'd have an inventory; you can put some dynamic discovery into it, but its job is to know what's out there and then reach out to those things and ask, "What are your stats on this? What's your stat on that?" It's pulling in information. Instead of that model, a push model has started to evolve, in which the systems that wish to be monitored, as they come up, they know: "Hey! I'm a thing that should be monitored, and I know where to find the monitoring service. So I'm just going to start spewing metrics at it. And I'll send along some tags as well, saying: I am this application. I am this tier; I'm a web service. I've got this going on with me. I'm this version; I have this set of feature flags enabled." All of that goes out to it. The monitoring service sees that coming in and says, "Oh, I guess I should start monitoring this thing." And whether I have one of these or 20 of these, maybe that's an 'alertable' thing or maybe it isn't, but the monitoring system automatically picks up on what's out there and what isn't. So the inventory problem can be solved in that way. But there are so many steps there of, "Oh, gosh. Compute is ephemeral. Storage is ephemeral." All these things are ephemeral. If you're a Morlock working on a big mass of systems that don't move often, that's not your world at all. If you're working on a massive Oracle database that's five petabytes and you're not going to move it for anything.
This is not a world that you're familiar with.
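The push model Clinton describes, where an instance announces itself with tags and the inventory builds itself, can be sketched in a few lines. The registry below is an in-memory stand-in for a real metrics service, and the instance ID, tags, and metric names are all hypothetical.

```python
# A sketch of push-based monitoring: instances self-register by sending
# tagged metrics, so the monitoring service's inventory builds itself.
# The registry here is an in-memory stand-in for a real service.
class MonitoringService:
    def __init__(self):
        self.known = {}    # instance_id -> tags; no pre-loaded inventory

    def receive(self, instance_id, tags, metrics):
        if instance_id not in self.known:
            # "Oh, I guess I should start monitoring this thing."
            self.known[instance_id] = tags
        # ...here a real service would store metrics and evaluate alert rules

svc = MonitoringService()
svc.receive("i-0abc123",
            tags={"app": "storefront", "tier": "web", "version": "2.4.1"},
            metrics={"requests_per_sec": 118.0})
print(sorted(svc.known))   # the service discovered the instance on its own
```

Contrast with the pull model: nothing here ever had to be told the instance existed, which is exactly what makes it survive second-by-second churn.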
Right, and just to bring in a conversation that you've had before. You were talking about a particular CIO who was very upset because he wanted a list of IP addresses.
Tell me more about that.
Yeah, so it was a migration into AWS. They had about 400 services spread across about 4,000 physical machines, and it was a migration into AWS. And he really, really insisted that, in his configuration management system, he have a list of every IP address that they owned at that moment. In AWS, for those not familiar, a public IP address is basically something you can rent. It's called an Elastic IP, and you pay for it by the second or whatever. And they can come and go almost immediately. There's no sense in tracking this; there's no meaning to having one over another. Being able to track them was not a particularly valuable capability for the company: they didn't have any compliance needs around it or anything like that. If there were compliance needs, it would have made sense, but it was just a thing that he was very familiar with having. And so moving to this new world, where this particular resource came and went as it needed to and was no longer worth tracking, was very strange. Eventually we got him to let go of that and say, "Hey! Watch the DNS stuff. The DNS and the load balancers, that's the interesting stuff." Because that was the actual business piece that he was looking for.
We got him on that.
And I think, going back to your question about the disconnect, where is that disconnect and why does it exist? I think it goes back to some of what Cote was saying about the focus of the engineers in particular organizations. In a more traditional organization, I'm an engineer who is a sysadmin, and it is my job to rack and stack servers. And I know I did a good job because I racked and stacked more servers this year than I did last year, right? What I care about is the racking and stacking, and maybe the configuration and management of those servers. That's the scope of my job. But then, when you move into these more consumer-oriented places, web-based companies that are writing all of their own software and wanting to deliver that to customers, that starts to change how you think about things. And then, in those large organizations where you do have, here's the network team, here's the, sorry, the infrastructure team, here's the database team, where you have those different layers and they're just providing service to the business, the business gets frustrated because it moves slowly. And so what happens is we spin up shadow IT, or the business becomes a software-driven organization, right? The business goes to the cloud, adopts these DevOps practices, and ends up hiring people outside of that traditional model who can accomplish the business function.
So just to give a sense of the differentiation: a number of surveys have gone out recently, and there's an interesting overlap. There was a set of data from Monitorama, where they polled a bunch of people, and then similar questions were asked at Cisco Live. This was all in 2017. There were two questions. One was, what's your biggest concern as far as monitoring goes? And the other was, what's the volume of alerts you deal with? We're back to alerts again. And I thought those two bits of information from these surveys revealed a lot. At Cisco Live, the biggest concern was, we need a single pane of glass. What's the volume of alerts they deal with? About half said they deal with millions of alerts per month. Per month, millions of alerts. A quarter said they deal with thousands, and the last ‘quarter-ish’ was hundreds. At Monitorama, very ‘DevOps-ey,’ the biggest concern was alert noise. That was the biggest issue. And let's say half were dealing with hundreds of alerts, a quarter with thousands, and the last little bit was in the millions. So, almost turning that on its head. Here we have people who are very operationally focused at Cisco Live, dealing with millions of alerts per month, and their biggest concern is not the noise of the alerts. Their concern is, "I need to be able to see it. I need to put my hands on it. I need to see what it is." Meanwhile, in the DevOps mindset: I'm dealing with hundreds of alerts! [Whining] And they're so noisy. And I feel like if I could get everyone together and have a drink together, I'd say, wait, wait, wait, you're dealing with, wait, what? That's your problem? So that again exemplifies the difference in mindset. Which leads me to ask, how do we bridge the gap?
Should we bridge the gap? As folks who are in a traditional operations-based monitoring world, DevOps is happening in your company. It is, whether you happen to have lunch with those folks or not. DevOps is happening all around you. Should we be trying to find ways to bridge that gap? Or is it better that we keep them Morlocks and Eloi, without the flesh-eating part? That we keep them separate, because never the twain shall meet?
Well, first I think it would be very easy on the surface to dismiss what you just laid out for us in terms of the metrics, right? If you think about the audience of Cisco Live and you think about the audience of Monitorama, it might be very easy to say, "Well, Cisco Live, that's the big enterprises. Monitorama, that's the smaller, leaner, more agile organizations." So maybe the sheer size, and even potentially the age, of the companies attending those two conferences is what contributes most to the differences in those metrics. But the truth of the matter is, as you just said, DevOps is happening in your organization. The business cares and is driving the way that software gets developed, the way that software gets delivered, the way that software operates and serves your customers. That's happening already. So, how do we bridge the gap? Well, if we want to practice some DevOps principles, the first and best way to bridge the gap, honestly, is to find the people that are doing that DevOps and go have lunch with them. Sit down and get to know them as humans. Talk to them: what are the things that you're doing? Why are you doing these things? What keeps you up at night? What are the challenges that you're facing? And share the challenges that you're facing as well. Making that connection is the first and best place to start, in my opinion.
Okay great. Where from there?
Yeah, that's fantastic. One of the best pieces of advice I ever received was to never eat alone. And also never eat with the same people all the time. So, I think that's fantastic advice. Another thing to mention, going back to the three-layer cake idea, or the idea of monitoring business needs first: if you're seeing a million alerts... I mean, you don't have to adopt DevOps to do the three-layer cake thing. In a traditional IT or bimodal IT environment you can monitor the business results first. You can totally do that, and alert just on those. And then perhaps still go ahead and alert on service-level things as well. That's fairly sensible because that's a far smaller scale. Places that are receiving millions of alerts are probably getting alerts at the systems level, and those are almost never actionable. So, apply the three-layer cake to it. Alert on what matters to the business most. Do go ahead and monitor those system-level things, because they are very important diagnostic tools. But you needn't alert on them most of the time.
And at a million alerts per month. You may as well just turn off all of the alerts.
Turn off all of the alerts. Just do that.
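Clinton's three-layer-cake suggestion, alert only on business-level signals while still recording service- and system-level metrics for diagnostics, can be sketched as a simple routing policy. This is an illustrative sketch only; the layer names, metric names, and the idea of a `route` function are hypothetical, not from any particular tool discussed here:

```python
# Hypothetical three-layer alert policy: only the business layer pages a human.
# Layer names, metric names, and thresholds are illustrative.

PAGE_LAYERS = {"business"}             # breaches here wake someone up
RECORD_LAYERS = {"service", "system"}  # kept for diagnostics, never page

def route(alert):
    """Decide what to do with an alert: 'page', 'record', or 'drop'."""
    layer = alert.get("layer")
    if layer in PAGE_LAYERS and alert.get("breached", False):
        return "page"
    if layer in RECORD_LAYERS:
        # Still monitored -- useful for capacity planning and diagnosis,
        # but no human gets woken up for it.
        return "record"
    return "drop"

alerts = [
    {"layer": "business", "name": "checkout_success_rate", "breached": True},
    {"layer": "service",  "name": "api_p99_latency",       "breached": True},
    {"layer": "system",   "name": "cpu_utilization",       "breached": True},
]
decisions = [route(a) for a in alerts]
```

The point of the sketch is the asymmetry: all three layers are collected, but only the business-layer breach results in a page.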
Right, so just to pick up on one thing you said, about systems-level alerts almost never being actionable. It's a very interesting mindset, again. Because I think there are a lot of folks watching who would say, wait, wait, wait. This is what I do; this is what I built. And while at SolarWinds Lab we've had lots of conversations about alerts that are not sophisticated enough to tell you something actionable, there are actionable system-level alerts. However, I heard a quote recently that I really liked, and it explains this mindset: "You can't monitor blood pressure and heart rate and find out if you're giving a good speech." That's not the metric that is going to help you. And in this world of elastic compute, when your entire environment is migratory and possibly changes from second to second, monitoring CPU and RAM the way that we're used to thinking about it in traditional IT isn't going to tell you anything meaningful.
Well, I would pause there. I wouldn't say something quite that extreme. I would say that it's not ‘alertable.’ It does tell you meaningful things, especially about trending and capacity planning. And it may identify things like memory leaks, or opportunities for optimization. Storage is a little bit different of the three. Storage especially can identify either architectural problems or actual trends. Like, "Oh, yes. We are growing storage and we do care about this data." Or, "We're creating storage on this machine and we don't know why, and it's stuff we don't care about. So we have some design flaw where something's dumping something to disk and it shouldn't be." So, it is valuable, but it's not a wake-up-at-three-in-the-morning, get-in-the-war-room-and-fix-this thing.
Got it. Michael?
Sure, I mean, I think the first step is figuring out, certainly the people that get woken up care, but figuring out if the people who pay them care. Which is to say, if you're getting a million events, or let's just say a thousand, right? A thousand events that you have to respond to, spread across 30 or 31 days, or 28 or 29 in February. And if it has been that way for years, then the business people either are ignorant, or that's actually efficient, or you're going out of business and you should find a new job. Right? So it's sort of like, "Well, maybe that's a fine way to operate and you just get the short end of the stick." But let's assume that you're not short-sticking people. And instead, and we've covered this a lot, if you're getting that many alerts and you're getting woken up, then you should probably start tracking if it was meaningful, right? If it was meaningful, and it helped you solve a problem, then just plan on being woken up a lot. But if it's just nonsense that didn't go anywhere, then to take another page from the SRE stuff, the first thing you should do, and this is why it's important to see if the people who pay you care, is introduce a new process that says, "We should automate the remediation of this, and take time out of our time and budget to actually fix it." Right? So that it remediates. And then also, to the point that you're making, if I get an event that doesn't tell me what's wrong and how to fix it, then we should go back and add that in there. That should cut down on the number of events you have, if it's possible. And then again, if there really are a thousand or a million things going wrong that need to be addressed within five to 60 minutes, and your business is still profitable, then I guess that's your life, right?
But if your business is doing poorly then it's time to like, either just let yourself go out of business unless you're in the government, which is an issue. But you've got to let yourself go out of business. I mean, you've got to take some budget to go fix that problem. So that you can make--
Not let yourself go out of business.
Yeah, yeah, yeah. I mean, that's sort of the point. This is why too many events is so frustrating: the people who deal with them know that it's absurd, but for whatever reason they haven't convinced the upper people to do something about it. And, as I started off saying, it could very well be the case that that's perfectly fine, right? To use a weird analogy: if you have a leaky faucet in your house, and your water bill only goes up like 10 more cents a month, and you're not afraid about killing the earth, why would you pay a plumber $300 to come fix it, right? Just let it leak. And so, maybe it keeps you up at night, like a page, but you'd rather keep your $300.
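Cote's suggestion to "start tracking if it was meaningful" can be as simple as logging, for each page, whether it led to a human actually doing something, and then reviewing the ratio per alert rule. A minimal sketch of that bookkeeping; the rule names and the 10% actionability cutoff are hypothetical:

```python
from collections import defaultdict

# Hypothetical actionability log: for each alert rule, count how often a
# page actually led to a human taking action.
history = defaultdict(lambda: {"fired": 0, "actionable": 0})

def record_page(rule, was_actionable):
    """Log one page and whether it turned out to be actionable."""
    history[rule]["fired"] += 1
    if was_actionable:
        history[rule]["actionable"] += 1

def noisy_rules(threshold=0.1):
    """Rules whose pages were actionable less than `threshold` of the time.
    These are candidates for automation, tuning, or deletion."""
    return sorted(
        rule for rule, h in history.items()
        if h["fired"] and h["actionable"] / h["fired"] < threshold
    )

# Illustrative history: one useful page, three useless ones.
record_page("disk_full", True)
record_page("cpu_high", False)
record_page("cpu_high", False)
record_page("cpu_high", False)
```

Reviewed regularly, a log like this turns "we get too many alerts" from a complaint into a prioritized list of rules to fix.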
And I think a good way out of that situation, and you know, you talked about automating the remediation, I think before you automate the remediation you should stop for a second and figure out, "Why is this happening? Does it like, are there ways that we can orchestrate or architect around this?"
So that we don't have to automate the terribleness. [Laughing] Like, let's not automate something that's terrible just so that no human has to interact with it. That doesn't make any sense.
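The balance the panel is describing, automate the repeatable restart but force a human to look at the root cause once it keeps recurring, could look something like this sketch. The window size, threshold, and callback names are hypothetical choices for illustration:

```python
import time
from collections import deque

# Hypothetical auto-remediation wrapper: restart the service automatically,
# but escalate to a human once it has recurred too often in a window, so the
# automation doesn't just paper over a recurring root cause.
WINDOW_SECONDS = 8 * 3600   # roughly one night; illustrative
MAX_AUTO_RESTARTS = 5       # illustrative threshold

restarts = deque()          # timestamps of recent automated restarts

def handle_service_down(now=None, restart=lambda: None, escalate=lambda: None):
    """Remediate a service-down event, escalating past the threshold."""
    now = now if now is not None else time.time()
    # Drop restarts that fell outside the window.
    while restarts and now - restarts[0] > WINDOW_SECONDS:
        restarts.popleft()
    if len(restarts) >= MAX_AUTO_RESTARTS:
        escalate()          # stop automating the terribleness; find the cause
        return "escalated"
    restarts.append(now)
    restart()
    return "restarted"

# Simulate the service dying every 100 seconds, seven times in a row.
results = [handle_service_down(now=t) for t in range(0, 700, 100)]
```

The design choice worth noting is the sliding window: the automation handles the occasional flap, but a pattern of 10, 15, 20 restarts a night can never stay invisible to humans.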
Yeah, and I think maybe that is getting to the bridge-the-gap thing. Those are lessons that everyone can take from DevOps. And I think it was Nathen that was bringing this up: think about the humans, for everyone's sake. Something that DevOps assumes is that we have no idea what we're doing. So, we're going to bake into the process figuring out what we should be doing, right? Which is to say, when we get these metaphorical or literal millions of events, let's put into our process figuring out how to not do that. Let's not just say it's written down in a Word doc somewhere that no one ever reads, that we don't have the time to do it. Let's bake in some amount of time to say, "We've got to go figure this out. We've got to solve this problem." And I think that's a characteristic of DevOps-think that I never really encountered in a practical way in system administration before DevOps.
So it's interesting, because recently we put out an ebook called Automation Not Art. You know, the idea that automation is not an act of interpretive dance; it's just automation, you shouldn't go crazy. And there's a whole section there about the elephant in the room. The book is about adding automation to your alert responses. If the alert is actionable, and the action is repeatable and predictable, then your alerting tool should be able to do, if nothing else, the first steps. But you have to consider the elephant in the room, which is, why is this happening? It's great that the service shuts down and it's great that it automatically restarts, but if that occurs 10, 15, 20 times in a night, you have to build in time, like you said, to go back and ask why it is happening in the first place. I have to fix the problem, not just resolve the symptom. I have to actually fix the underlying problems. So there is that conversation in the traditional space, but I understand that a lot of people get stuck, never quite getting there. So I want to hit the lightning round. This is the wrap-up. In one minute, each of you give me your thoughts on DevOps and data center monitoring. Better together, or do strong fences make good neighbors? Do we keep them separate, or should we, as monitoring people, whether interested in monitoring from the DevOps side or the systems side, work to bridge that gap?
Well, DevOps tells us that we're better together, for sure. So, we have to bring everyone together.
Okay, easy and simple.
I'll invoke Conway's Law and say that if your organization is built so that the data center people are over here and the cloud or DevOps people are over here. They're probably going to pick separate tools and be happier with their separate tooling that's optimized for their particular roles. Ideally, you would have feeds that go together. But they're going to fight you every step of the way.
Okay, got it.
So, a minute. Sixty seconds, right? I'll try my best.
Go for it.
But yeah, I mean, I don't disagree with that at all. But I would suggest that there is a cut line, especially since you said data center. Like, do you really need to talk to the facilities people? And do you really need to talk to the HVAC people? And do you really need to talk to all the wiring people? You just go up and up until you find the area where it does make sense to be more collaborative with someone. If you're running on anything vaguely like a true cloud infrastructure, it's not like you pick up the phone and call the Amazon, Microsoft, or Google people, right? This is the platform you get. And if it doesn't work for you, there are two other people out there, right? And sure, if you're a big enough account, you can call, and all of that. But the point is that when you establish whatever you think of as your data center or your infrastructure or your platform, the benefit of that platform is that you don't mess with the people underneath it, and vice versa. They just provide the service for you, and that allows you to act a lot faster than if you've got to go to CABs and change review meetings and stuff that involves all of them.
Got it. All right. So, to summarize some of the things that I've picked up from the conversation: it's something that I've been saying, and we at SolarWinds have been saying, for a while. It really benefits everybody, traditional monitoring and DevOps-oriented alike, to pay attention to the business. As long-time pundit Bob Lewis has said, "There are no IT projects. There are only business projects with an IT component." And I think that is truer now, in this world of DevOps and cloud, than ever. The more focus we can give in our monitoring tools to the business pressures, the better we'll all be, and the better chance we have of merging those things together. So, I'd like to thank you for spending a little bit of time with me, helping to figure out where that disconnect is today. And I want to thank all of you for joining us on this session, When DevOps Says Monitoring. Now hopefully we know a little bit more about what it means. For THWACKcamp, I'm Leon Adato.