Large and small companies alike have it easy when it comes to network monitoring. Large companies can afford the biggest and best solution available, and an army of people to monitor every little twitch in the network.
Small companies, on the other hand, either don't monitor at all (this is the much-vaunted "Hey… the Internet is down" approach, also called "canary" monitoring), or only have a handful of devices to keep an eye on. What happens if you're somewhere in the middle? What if you're big enough to have a staff, but not big enough to dedicate one person to sitting around watching for failures?
If you are like me, you buy a great monitoring solution like SolarWinds' Network Performance Monitor (NPM). You do the initial installation, run through some wizards to get basic monitoring up and going, and then you start playing with the "nerd knobs" in the software, and boy does NPM have those in spades. NPM monitors everything from port states to power supplies, link errors to wireless authentications, and everything in between. You can then send alerts to a distribution list your team tracks, assign responsibilities for monitoring and escalation, and everyone on the team now has visibility into failures and problems anywhere on the network.
Nirvana. Bliss. You have now become the Network Whisperer, a beacon of light in the darkness. Except, now you have 150 email messages about a printer because that printer has been offline for a few hours waiting for a part and someone forgot to kill the monitoring for that device. Easily fixed. Ooh… 300 more emails… someone rebooted a server.
I exaggerate a bit, but you see the problem. Pretty soon you set up mail rules, you shove your monitoring messages into a folder, and your entire monitoring solution is reduced to an in-house spam-generation machine. You find you don't actually know when stuff breaks because you're so used to the mail folder filling up, you ignore it all. The only thing you've accomplished is the creation of another after-the-fact analysis tool. Someone from accounting tells you that so-and-so system is down, you can quickly see that, yes, it is down. Well, go you.
I'll talk about how we work to solve this problem in my next post, but I'm curious how everyone here deals with these very real issues:
* What do you monitor and why?
* How do you avoid the "monitoring-spam" problem?
Yeah, and I purposely avoided drawing too many conclusions from the service provider world. You've got a whole different set of issues (SLA, multi-tenant, etc.) to deal with on that side of the fence. I can imagine that monitoring there is not only important in the same way as in the enterprise, but also for billing purposes.
I pay attention to every alert that gets e-mailed. It can be a little tedious to tweak thresholds and dependencies, especially if the tool you are using doesn't correlate for you, but it's possible with the right set of tools. It's just as easy to get lost in the spam as it is to collect the infinite amount of data you'll never use. And it's just as easy to narrow down the spam as it is to aggregate older data, you just need to know what you're looking at and what you want to get out of it.
The biggest problem I see most people struggling with is getting started. Some network devices can send out thousands of messages per hour, and you may have dozens of them. Then there are servers (we all know how chatty the event logs can be), plus all of the different monitors. It can get seriously overwhelming very quickly when you're first implementing a monitoring solution, or heck, even after you've had it for a while.
I tell people to take it slow and remember: it all starts with these two questions for each alert, and once the ball gets rolling, the rest follows.
1. Do I need to know about this?
2. Do I need to know about this at 2am?
We monitor everything related to our deliverables: routers, switches, all servers (physical and virtual), the hosts for those servers, storage, UPS, cooling, and application performance for apps, SQL, Oracle, etc.
We avoid spam thusly: we set up our NPM environment a year ago, and I was the sole recipient of email alerting during our pilot. I got it to a level I thought appropriate, added my boss and our help desk manager, and let that cook for a few days, after which we tuned cycles before notification, etc., as well as the appropriate granularity of our polling. It's an ongoing thing and will continue to be. When our partners in EMEA and Asia asked me to add tons of things, I performed the same process in an accelerated manner, with the added latency of overseas transit factored in. Also, dependencies are great and make things elegant.
We monitor everything. For network gear, we have a specific template per device (i.e., we only monitor certain interfaces per router, switch, firewall, etc.) that we poll with SNMP. For servers and storage arrays, we only use ICMP monitoring, as we use a different solution for in-depth server monitoring.
As far as how we avoid the flood: we only alert on items after XX polling cycles, and we have a 24/7/365 alert management team that parses alerts from our entire enterprise and assigns them to support queues as needed. (I guess we fall into the "really big company" category.)
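The "alert only after XX polling cycles" idea can be sketched as a simple debounce: a node has to fail some number of consecutive polls before a single alert fires, and a successful poll resets the counter. This is a minimal illustration, not SolarWinds' implementation; the threshold value and poll results are made up.

```python
# Minimal sketch of "alert only after N consecutive failed polls".
# ALERT_AFTER_CYCLES is a placeholder; tune it to your environment.
ALERT_AFTER_CYCLES = 3

class NodeState:
    def __init__(self):
        self.failed_polls = 0
        self.alerted = False

    def record_poll(self, is_up: bool) -> bool:
        """Return True only on the one poll that should trigger an alert."""
        if is_up:
            # Recovery resets both the counter and the alert latch.
            self.failed_polls = 0
            self.alerted = False
            return False
        self.failed_polls += 1
        if self.failed_polls >= ALERT_AFTER_CYCLES and not self.alerted:
            self.alerted = True  # latch so we alert once, not every cycle
            return True
        return False
```

A transient one-cycle blip never alerts, and a sustained outage produces exactly one alert instead of one per polling cycle.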
Did it take some trial-and-error to find the right number of poll cycles to wait before alerting? Do you find that "down" events are slower to alert because of this, or does the team catch the real problems quickly enough?
My team specifically handles the architecture, design, and maintenance of SolarWinds NPM and a few other monitoring tools. As such, the alert definitions were created with the guidance of the network architecture team. As a company, we provide managed hosting solutions, among a lot of other things, so our monitoring is heavily focused on SLAs and metrics around time to remediation for events.
Our alerts are what I would consider "living", in that they are always available for change at the discretion of the customer(s). However, our basic system of alerting can have an engineer looking at an issue within 5 minutes of an outage (which is pretty good for tens of thousands of devices). There are a few false positives, but there are "safeties" built into our systems and processes that dramatically decrease returning alerts on servers that are not in production or a switch that was decommissioned, etc.
We monitor servers, switches, routers, access points, virtual machines, and more. Depending on how important said nodes are, we vary our monitoring of other things like disk usage. All of them get at least up/down. Some of them get their interfaces monitored.
Right now we have other means of alerting and aren't using alerts except for a few very important things. We are working on dependencies so we can configure alerts to be reasonable then configure notifications from there. If we didn't do it that way, our notifications would be the most annoying, un-useful tornado of emails ever.
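The dependency idea mentioned above boils down to suppressing alerts for nodes whose upstream device is already down, so one outage yields one notification instead of a tornado. A hedged sketch, with an invented topology and node names purely for illustration:

```python
# Hypothetical parent/child topology: a node depends on its parent being up.
parents = {
    "switch-1": "router-1",
    "server-a": "switch-1",
    "server-b": "switch-1",
}

def should_notify(node: str, down: set) -> bool:
    """Notify for a down node only if no ancestor is also down."""
    if node not in down:
        return False
    ancestor = parents.get(node)
    while ancestor is not None:
        if ancestor in down:
            return False  # the root cause is upstream; stay quiet
        ancestor = parents.get(ancestor)
    return True
```

If `router-1` dies and takes `switch-1` and `server-a` unreachable with it, only the router generates a notification.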
Yup. We also have "other" monitoring solutions (some home-grown, some off-the-shelf), mostly in cases where the functionality is needed from one product and just doesn't exist in another. For instance, we log everything to syslog and Splunk for sorting and after-the-fact analysis, but don't alert from there. It gets even more important to control the monitoring-spam when you get multiple systems all wanting to churn out alerts.
Thanks for the reply.
A couple items I've had to deal with to keep the event stream from being flooded:
The NetApp snmpd agent (some versions) constantly shuffles the volume IDs around on NAS devices with many volumes. This creates hundreds (or thousands, if you have as many volumes as I do) of volume-changed, volume-appeared, and volume-disappeared messages. I've had to manually edit the table in the database so these events are not logged to the database and likewise don't end up in my event stream.
I've noticed that thresholds set to defaults for physical servers DON'T apply to virtual servers. This means virtualized SQL servers (and some others) end up with most of their SAM thresholds wrong out of the box.
The new hardware monitoring is nice... but some devices for some reason will pump out hardware up green messages ALL DAY LONG every polling cycle. For devices that do this I turn hardware monitoring OFF.
These are just three examples of my worst offenders, which, if allowed, will generate so many events that you can't see the forest for the trees. <--- This is here for a reason
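One generic way to handle offenders like these, when the tool itself won't let you mute them, is a pre-filter that drops events matching known-noise patterns before they reach the event stream. A minimal sketch, with patterns and message formats invented to loosely resemble the examples above:

```python
import re

# Hypothetical noise patterns, loosely modeled on the offenders above.
NOISE_PATTERNS = [
    # NetApp volume churn: "volume <id> appeared/disappeared/changed"
    re.compile(r"volume \w+ (appeared|disappeared|changed)"),
    # Chatty "hardware ... up" green-status messages every polling cycle
    re.compile(r"hardware .* up", re.IGNORECASE),
]

def keep_event(message: str) -> bool:
    """Return False for events that should never reach the event stream."""
    return not any(p.search(message) for p in NOISE_PATTERNS)
```

Filtering at ingest beats hand-editing database tables, and the pattern list doubles as documentation of which sources you've deliberately silenced.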
I have seen the virtual vs. physical monitoring problems here as well. It makes sense that you look for different things, but out of the box the virtual settings don't tend to work as well as you might think. Or, they seem to be somewhat counter-intuitive (what are you monitoring when you monitor "physical memory" for instance). Good points.
1. Network / server up/down
* Disk space, either a percentage or a specific size, based on criteria
* PDU voltage and amperage
* Deep application monitoring (for example, we give our app dev team a real-time breakdown of what smart device, OS, and OS version is hitting our mobile web portal and what the first click is)
2. Work with the LOBs to make sure we are only monitoring valuable data.
* Clear, well-defined rules for both alerting and alert actions.
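The "percentage or specific size" disk-space rule above can be expressed as a per-volume check that supports either floor, whichever criterion the volume is configured with. This is an illustrative sketch; the thresholds and volume sizes are made-up examples.

```python
GB = 1024 ** 3  # bytes per gibibyte

def disk_alert(free_bytes, total_bytes, min_pct=None, min_bytes=None):
    """True if free space is below the configured floor (percent or absolute).

    A percentage floor suits small volumes; an absolute floor suits huge
    volumes where "10% free" of a multi-TB array is still plenty of space.
    """
    if min_pct is not None and (free_bytes / total_bytes) * 100 < min_pct:
        return True
    if min_bytes is not None and free_bytes < min_bytes:
        return True
    return False
```

Choosing per-volume criteria avoids the classic trap where a single global percentage fires constantly on large arrays and too late on small system drives.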
2nd question first: How to avoid the "monitoring-spam" problem?
We only set up alerts (email, text message, flashing lights and sounds, etc.) for the events that really are worthy of that type of notification. We also make sure the duration an event must persist before the alert actually triggers is something realistic. We don't set up alerts for things like high CPU or memory utilization unless there is something very specific we are looking for, because these are regularly occurring events and are considered normal in our environment (SQL and Exchange servers regularly consume most of the available memory, so it would be pointless to alert on high memory utilization for those servers). That said, we do still have some "monitoring-spam," but we work to keep it to a minimum by constantly adjusting alerts as needed.
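The "realistic duration before the alert triggers" idea is the time-based cousin of counting poll cycles: a threshold breach only counts once it has persisted for a configured interval, and a recovery resets the clock. A minimal sketch with illustrative numbers, not any product's actual logic:

```python
# Hedged sketch of "condition must persist for N seconds before alerting".
class SustainedThreshold:
    def __init__(self, limit: float, sustain_seconds: float):
        self.limit = limit
        self.sustain = sustain_seconds
        self.breach_started = None  # timestamp when the breach began

    def sample(self, value: float, now: float) -> bool:
        """Return True once the value has exceeded the limit long enough."""
        if value < self.limit:
            self.breach_started = None  # recovered; reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now
        return (now - self.breach_started) >= self.sustain
```

A momentary CPU spike during a backup never alerts, while a server pegged for the full interval does.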
Back to the first question: What do we monitor and why?
We try to keep it relatively simple and only alert on the things it is absolutely imperative we know about right away. A lot of other things are monitored and logged to the Orion event log for us to find if we're trying to track down a problem we did not receive an alert about.
It really depends on the event being monitored ... for low disk space issues, for example, we send those to the team responsible for the application that the server/device was created for. Most of the alerts, however, always go to our Network Admin team (four of us). We sit in close proximity and communicate with each other so that everyone knows who is handling the alert. It is the Network Admin team's job to ensure that the alerts get a response, even if it's not one of us responding.