Hello, everybody. I'm Bob Plankers, and I'm one of the thwack Ambassadors for the month of March. I'm a system administrator, virtualization architect, ex-programmer, network aficionado, and recovering storage admin. I've lurked around thwack for a long time, and I'm hoping my saying something here, finally, generates some good discussion. I've got a few topics that seem relevant and could be a good start to a conversation. Please feel free to comment, especially to tell me I'm wrong. 🙂
So, is it possible to have too much monitoring?
Part of me says no. The more you know about a system the better off you are, even if you use that data only infrequently, or never. Instrumenting absolutely every part of a system for status & performance data gathering is crucial to success. Since you can't know what data or alarms you will need in the future, wouldn't it be best to have them all now?
Part of me says yes. A friend of mine once said that people shouldn't test for conditions they're not prepared to handle. And if you test tons of things, not only do you need to write & maintain all those tests, but you'll also need to maintain epic amounts of documentation about what to do when each alarm goes off. Isn't that what you pay sysadmins for -- to do the complex things? Why can't we just test to see if a service is up and then leave it to a sysadmin to figure out the rest? Besides, all that monitoring consumes system resources that could be used more productively, like servicing user requests.
As with all things in IT, the answer is partly "it depends." Is there some formula for computing what the right amount of monitoring is? How do I describe the right amount of monitoring to a beginning sysadmin?
One view, posted by @parkercloud on Twitter, is that there's never too much monitoring:
"The answer, of course, is to monitor everything. Everything means everything -- all router ports, all switch ports, every server, every service, every Internet connection, and every storage system."
I think that's a recipe for ignoring your monitoring system. Think "boy who cried wolf."
What about false alarms? Why do you care about everything? Perhaps he means logging. Logging is good. Alarming on things you ultimately won't do anything about, or that are part of routine maintenance, is bad. Alarms need to be timely and relevant, and if an alarm demands your attention it should be worthy of that attention, not just waste your time as you delete the alert email.
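To make the logging-versus-alarming distinction concrete, here's a minimal Python sketch. The `Event` fields and the `handle` function are made up for illustration and don't correspond to any particular monitoring product: everything gets logged, but an alert only goes out when someone is expected to act and the event isn't part of planned maintenance.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str
    message: str
    actionable: bool              # assumption: someone is expected to act on this
    in_maintenance: bool = False  # assumption: raised during a planned maintenance window


def handle(event, log, alerts):
    """Always log the event; alert only when a person should act on it."""
    log.append(event)
    if event.actionable and not event.in_maintenance:
        alerts.append(event)


log, alerts = [], []
handle(Event("db01", "disk 95% full", actionable=True), log, alerts)
handle(Event("sw12", "user port down", actionable=False), log, alerts)
handle(Event("web03", "service restart", actionable=True, in_maintenance=True), log, alerts)
print(len(log), len(alerts))  # 3 1
```

All three events end up in the log for later analysis, but only the disk event would land in someone's inbox.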
One approach that has worked for me is to get with each team (sysadmins, network engineers, DB admins, etc.) and find out what is important to each of them. Since I'm none of those, it's a good idea to find out what they want to be alerted on, and go from there. It helps when building alerts, web views, etc., and it allows each team to have a customized login and alert structure. In working with our network team, we found they weren't worried about monitoring user switch ports, but they do need up-to-date info on ports that are connected to servers, storage, etc. If you know what everyone wants out of the NMS, you can build a tighter system that functions properly.
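As a rough sketch of that per-team approach, the Python below maps each team to the kinds of ports it has said it cares about. The team names, port roles, and the `teams_to_alert` helper are all hypothetical; a real NMS would express this in its own alert and view configuration.

```python
# Hypothetical port roles each team asked to be alerted on (illustration only).
TEAM_INTERESTS = {
    "network": {"server", "storage"},
    "sysadmin": {"server"},
}


def teams_to_alert(port_role):
    """Return the teams that asked for alerts on ports with this role."""
    return sorted(team for team, roles in TEAM_INTERESTS.items() if port_role in roles)


print(teams_to_alert("server"))  # ['network', 'sysadmin']
print(teams_to_alert("user"))    # []
```

A user switch port going down alerts nobody here, which matches what the network team asked for.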
I think you just nailed it, BTW. For me it's all about false alarms. A "boy who cried wolf" situation is the last thing you want for a monitoring system.
I have a piece of art on my office wall that says "Let's make better mistakes tomorrow." If you don't collect data on everything it's likely you won't have the data you need when you make a new mistake, or see a new problem, tomorrow. 🙂
Filtering of the alarms is a must. The setup of monitoring is not done until there are no false alarms. This will take time.
I suggest monitoring everything one can and setting the alarms to "reasonable" conditions. As false alarms come in, filter them as close to the alarm generator as possible and as specifically as is reasonable.
The goal is for you to be the first to know that a "bad thing" happened. Either before your boss and coworkers know, or before the "bad thing" gets worse.
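One way to picture "reasonable" conditions and specific filtering is the Python sketch below. It is only an illustration under assumptions: the rule of three consecutive samples, the host names, and the `SUPPRESSED` set are invented here. The alarm fires only on a sustained breach, and a known false alarm is filtered by its exact host and check rather than by a broad rule.

```python
def should_alarm(samples, threshold, consecutive=3):
    """True only if the last `consecutive` samples all exceed the threshold."""
    recent = samples[-consecutive:]
    return len(recent) == consecutive and all(s > threshold for s in recent)


# Narrow suppressions for known false alarms: exact (host, check) pairs (hypothetical).
SUPPRESSED = {("backup01", "cpu")}


def evaluate(host, check, samples, threshold):
    """Apply the specific suppression first, then the sustained-breach rule."""
    if (host, check) in SUPPRESSED:
        return False
    return should_alarm(samples, threshold)


print(evaluate("web01", "cpu", [50, 95, 96, 97], 90))   # True: three breaches in a row
print(evaluate("web01", "cpu", [95, 50, 96, 97], 90))   # False: a dip inside the window
print(evaluate("backup01", "cpu", [99, 99, 99], 90))    # False: specifically suppressed
```

Keeping the suppression tied to one host and one check means the same condition still alarms everywhere else.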
We have multiple environments that we deal with, but in our most critical environment it was dictated that "everything" be "monitored"... Of course, this was from someone who didn't understand the difference between monitor and alert. He wants everything alerted on... so there are no "Monitor Only" resources in this environment.
We have found that as we add more monitors, the overall number of alerts has gone down. This is in large part because sysadmins and DBAs are now more aware of the behavior of their servers/applications. I'm sure some of the cooperation that we've had with other teams is the sole result of flooding their inboxes with alerts until they get on board with setting up appropriate thresholds and just fixing their stuff in general.
So far it has been a very modular approach...
We've had the best gains when our team (monitoring) works directly with a SME on the admin team for Windows, *nix, DBA, Application, etc. to build a baseline "template" for monitoring.
Servers are monitored for CPU, memory, and interface/volume utilization... Basic server health APM monitors are added as well. We stack DB monitors, application monitors, etc. on top of this, as well as a healthy dose of HTTP monitors for just about everything. We've recently had the purchase go through for SEUM, and this should hopefully round out our monitoring environment.
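The stacking described above could look something like this in Python. The monitor names and the `build_monitors` helper are placeholders for illustration, not actual APM template names: every server starts from the baseline, and each role adds its own layer on top.

```python
# Baseline applied to every server (placeholder names for illustration).
BASE_SERVER = ["cpu", "memory", "interface_utilization", "volume_utilization"]

# Extra monitors layered on per role (hypothetical roles and monitor names).
LAYERS = {
    "database": ["db_health"],
    "application": ["app_health"],
    "web": ["http_check"],
}


def build_monitors(roles):
    """Start from the baseline template and add each role's monitors once."""
    monitors = list(BASE_SERVER)
    for role in roles:
        for monitor in LAYERS.get(role, []):
            if monitor not in monitors:
                monitors.append(monitor)
    return monitors


print(build_monitors(["database", "web"]))
```

A server with no special roles still gets the four baseline monitors, so nothing ships unmonitored.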
We are lucky in that we have round the clock coverage on the monitoring team as well as in most of the admin groups that we have to deal with.
With all that said, I would opt for fewer alerts (especially once SEUM is up and running) but keep almost everything monitored. Unfortunately, I think the general finger-pointing/blame-game nature of having multiple groups responsible for the overall end user experience precludes us from removing any alerts in the near future.
We've used the SME approach as well, to build a template that's generally useful. That's worked pretty well, as we can stamp out a system that's got a base level of monitoring and then go from there.
You've got a good point in that alarming on everything can be helpful in teaching people what their systems are doing. It also sounds like you have an environment with people in it who want to run things well. The flooding of inboxes sometimes has the opposite effect, where people build a mail filter and stop caring. Then you need management involvement, and that just gets messy. For me, reduction of false alarms is one of the biggest goals I have when configuring monitoring, if only because I, and my team members, value sleep. 🙂