Large and small companies alike have it easy when it comes to network monitoring. Large companies can afford the biggest and best solution available, and an army of people to monitor every little twitch in the network.
Small companies, on the other hand, either don't monitor at all (this is the much-vaunted "Hey… the Internet is down" approach, also called "canary" monitoring) or have only a very small number of devices to keep an eye on. What happens if you're somewhere in the middle? What happens if you're big enough to have a staff, but not big enough to dedicate one person to sitting around watching for failures?
If you are like me, you buy a great monitoring solution like SolarWinds' Network Performance Monitor (NPM). You do the initial installation, run through some wizards to get basic monitoring up and running, then you start playing with the "nerd knobs" in the software, and boy, does NPM have those in spades. NPM monitors everything from port states to power supplies, link errors to wireless authentications, and everything in between. You can then send alerts to a distribution list your team tracks, assign responsibilities for monitoring and escalation, and everyone on the team now has visibility into failures and problems anywhere on the network.
Nirvana. Bliss. You have now become the Network Whisperer, a beacon of light in the darkness. Except, now you have 150 email messages about a printer because that printer has been offline for a few hours waiting for a part and someone forgot to kill the monitoring for that device. Easily fixed. Ooh… 300 more emails… someone rebooted a server.
I exaggerate a bit, but you see the problem. Pretty soon you set up mail rules, you shove your monitoring messages into a folder, and your entire monitoring solution is reduced to an in-house spam-generation machine. You find you don't actually know when stuff breaks, because you're so used to the mail folder filling up that you ignore it all. The only thing you've accomplished is the creation of another after-the-fact analysis tool. Someone from accounting tells you that so-and-so system is down, and you can quickly see that, yes, it is down. Well, go you.
I'll talk about how we work to solve this problem in my next post, but I'm curious how everyone here deals with these very real issues:
* What do you monitor and why?
* How do you avoid the "monitoring-spam" problem?
Second question first: how do we avoid the "monitoring-spam" problem?
We only set up alerts (email, text message, flashing lights and sounds, etc.) for events that are truly worthy of that type of notification. We also make sure that the duration an event must persist before the alert actually triggers is set to something realistic. We don't set up alerts for things like high CPU or memory utilization unless there is something very specific we are looking for, because these are regularly occurring events and are considered normal in our environment (SQL and Exchange servers routinely consume most of the available memory, so it would be pointless to alert on high memory utilization for those servers). That said, we do still have some "monitoring spam," but we work to keep it to a minimum by constantly adjusting alerts as needed.
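The "realistic duration" idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (not NPM's actual alert engine): a condition must remain continuously true for a hold window before the alert fires once, so brief spikes and reboots never generate mail.

```python
import time

class DurationGatedAlert:
    """Fire once only after a condition persists for hold_seconds."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds  # how long the condition must persist
        self.first_seen = None            # when the condition first became true
        self.fired = False                # suppress repeat notifications

    def check(self, condition_active, now=None):
        """Return True exactly once, when the condition has been
        continuously active for at least hold_seconds."""
        now = time.time() if now is None else now
        if not condition_active:
            self.first_seen = None        # condition cleared: reset the timer
            self.fired = False
            return False
        if self.first_seen is None:
            self.first_seen = now
        if not self.fired and now - self.first_seen >= self.hold_seconds:
            self.fired = True
            return True
        return False

# A 10-minute hold: a 2-minute blip never alerts; 10+ minutes does.
alert = DurationGatedAlert(hold_seconds=600)
print(alert.check(True, now=0))    # False: condition just started
print(alert.check(True, now=120))  # False: only 2 minutes in
print(alert.check(True, now=600))  # True: sustained for 10 minutes
print(alert.check(True, now=700))  # False: already fired once
```

The key design point is the reset on any clear reading: a flapping interface restarts the clock instead of accumulating toward an alert.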
Back to the first question: What do we monitor and why?
We try to keep it relatively simple and only monitor the things that are absolutely imperative for us to know about right away. Many other things are monitored and logged to the Orion event log for us to find if we are trying to track down a problem we did not receive an alert about.
It really depends on the event being monitored. For low disk space issues, for example, we send those to the team responsible for the application the server/device was created for. Most of the alerts, however, go to our Network Admin team (four of us). We sit in close proximity and communicate with each other, so everyone knows who is handling an alert. It is the Network Admin team's job to ensure that alerts get a response, even if it's not one of us responding.
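The routing rule described above (specific categories to owning teams, everything else to the admin team) boils down to a lookup with a default. A minimal sketch; the category names and addresses are made up for illustration:

```python
# Hypothetical routing table: each alert category maps to the team that
# owns it; anything not explicitly routed falls through to the network
# admin team, which is responsible for making sure it gets a response.
ROUTES = {
    "disk_space": "app-team@example.com",   # owners of the affected application
    "node_down":  "netadmins@example.com",
    "pdu_power":  "netadmins@example.com",
}
DEFAULT_ROUTE = "netadmins@example.com"

def route_alert(category):
    """Return the distribution list that should receive this alert."""
    return ROUTES.get(category, DEFAULT_ROUTE)

print(route_alert("disk_space"))  # app-team@example.com
print(route_alert("wireless"))    # netadmins@example.com (default)
```

The explicit default is what keeps alerts from silently vanishing when nobody has claimed a category yet.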
1. What we monitor:
   * Network/server up/down
   * Disk space, either by percentage or a specific size, based on criteria
   * PDU voltage and amperage
   * Deep application monitoring (for example, we give our app dev team a real-time breakdown of which smart device, OS, and OS version is hitting our mobile web portal, and what the first click is)
2. How we avoid monitoring spam:
   * Work with the LOBs to make sure we are only monitoring valuable data
   * Clear, well-defined rules for both alerting and alert actions
A couple items I've had to deal with to keep the event stream from being flooded:
Some versions of the NetApp snmpd agent constantly shuffle the volume IDs around on NAS units with many volumes. This creates hundreds or thousands (if you have as many volumes as I do) of volume-changed, volume-appeared, and volume-disappeared messages. I've had to manually edit the table in the database so these events are neither logged to the database nor put into my event stream.
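The same suppression idea can be shown generically. This sketch assumes events arrive as plain message strings; the patterns mirror the volume-churn noise described above, but the exact message text is illustrative, not verbatim NetApp output, and this is not how Orion's database table edit actually works:

```python
import re

# Messages matching any of these patterns are dropped before logging.
SUPPRESS_PATTERNS = [
    re.compile(r"volume .* (appeared|disappeared|changed)", re.IGNORECASE),
]

def should_log(message):
    """Return False for events we never want in the event stream."""
    return not any(p.search(message) for p in SUPPRESS_PATTERNS)

events = [
    "Volume vol0042 disappeared",
    "Volume vol0042 appeared",
    "Interface Gi0/1 down",
]
kept = [e for e in events if should_log(e)]
print(kept)  # ['Interface Gi0/1 down']
```

Filtering at ingest like this keeps the noise out of the stream entirely, which is usually better than trying to hide it later with mail rules.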
I've noticed that thresholds set to defaults for physical servers DON'T apply to virtual servers. This means virtualized SQL servers, and some others, end up having most of the SAM thresholds wrong out of the box.
The new hardware monitoring is nice, but some devices, for some reason, will pump out hardware-up/green messages ALL DAY LONG, every polling cycle. For devices that do this, I turn hardware monitoring OFF.
These are just three examples of my worst offenders, which, if allowed, will generate so many events that you can't see the forest for the trees. <--- This is here for a reason
I have seen the virtual vs. physical monitoring problems here as well. It makes sense that you look for different things, but out of the box, the virtual settings don't tend to work as well as you might think. Or they seem somewhat counter-intuitive (what are you monitoring when you monitor "physical memory," for instance?). Good points.