Large and small companies alike have it easy when it comes to network monitoring. Large companies can afford the biggest and best solution available, and an army of people to monitor every little twitch in the network.
Small companies, on the other hand, either don't monitor at all (this is the much-vaunted, "Hey… the Internet is down" approach--also called "Canary" monitoring), or only have a very small number of devices to keep an eye on. What happens if you're in the middle somewhere? What happens if you're big enough to have a staff, but not big enough to have one dedicated person assigned to sitting around watching for failures?
If you are like me, you buy a great monitoring solution like SolarWinds Network Performance Monitor (NPM). You do the initial installation, run through some wizards to get some basic monitoring up and going, then you start playing with the "nerd knobs" in the software, and boy does NPM have those in spades. NPM monitors everything from port states to power supplies, link errors to wireless authentications, and everything in between. You can then send alerts to a distribution list that your team tracks, assign responsibilities for monitoring and escalation, and everyone on the team now has visibility into failures and problems anywhere on the network.
Nirvana. Bliss. You have now become the Network Whisperer, a beacon of light in the darkness. Except, now you have 150 email messages about a printer because that printer has been offline for a few hours waiting for a part and someone forgot to kill the monitoring for that device. Easily fixed. Ooh… 300 more emails… someone rebooted a server.
I exaggerate a bit, but you see the problem. Pretty soon you set up mail rules, you shove your monitoring messages into a folder, and your entire monitoring solution is reduced to an in-house spam-generation machine. You find you don't actually know when stuff breaks because you're so used to the mail folder filling up that you ignore it all. The only thing you've accomplished is the creation of another after-the-fact analysis tool. When someone from accounting tells you that so-and-so system is down, you can quickly see that, yes, it is down. Well, go you.
I'll talk about how we work to solve this problem in my next post, but I'm curious how everyone here deals with these very real issues:
* What do you monitor and why?
* How do you avoid the "monitoring-spam" problem?
At a previous job we monitored every system just for Up/Down status. As the IT staff, we were used to the alerts, and if (more like when) something went down we knew right away whether it was a false alert: we waited a couple of minutes and received the "back online" alert. Our problem was not the system periodically spamming us with false positives. It was the micromanaging boss, who also wanted the alerts for every division within IT and would constantly ask us to look at systems that were other divisions' issues. What we really needed was a spam filter for him. If we changed the settings for which alerts he received and he did not get an alert someone else got, he would blow a gasket.
Dividing who gets what alerts is an essential part of managing the IT beast. Staying within the boundaries that are assigned to each division within IT is also an important part of avoiding the monitoring spam.
Yeah, the boss filter can be a problem depending on the structure and size of the organization.
I'm definitely a fan of more granularity than just up/down in monitoring, but every organization is going to be different in that regard. Sounds like you had a system for everything *but* the boss. Maybe that's a good feature request.
Monitoring is the easy part; alerting efficiently is difficult.
We use the "knee-jerk" approach to setting up alerts, which takes months and sometimes years to cover all critical outages.
After setting basic alerts, we wait until there is some critical degradation or outage.
Management asks why they weren't alerted, and a new custom alert is born.
Also here are two methods used to control who gets spammed.
This allows users to add/remove themselves from alerts using the edit/manage nodes GUI in Orion (self serve).
Interestingly (for me at least), I was reading a detailed postmortem of an outage recently - a link to which I have been totally unable to find now so I could share it - where one of the extenders for the outage was that the manager was receiving alert emails (hundreds of them as I recall) saying there was a problem, but had set up a rule that pushed all the alerts into a folder, and he therefore did not notice them until it was already many hours past when he needed to know...
Our approach sounds pretty much the same as that mentioned by RandyBrown above.
We only configure alerts for items we want to get an alert for, and that alert is configured to send to the group that wants it.
The first thing we did when we started configuring alerts was to disable all of them.
There are some items we want to know about but we don't want an alert generated. For these items we have a report that is available for review every morning. This report is a link on our default summary view.
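The split described above (disable everything, alert only on what a group asked for, and push the "nice to know" items into a morning report) can be sketched roughly like this. This is only an illustration of the idea, not how NPM implements it; the event type names and the `triage` function are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical event types. Anything not listed in either set is ignored,
# which mirrors starting with every alert disabled.
ALERT_NOW = {"node_down", "circuit_down"}
REPORT_ONLY = {"high_disk_usage", "ups_battery_aging"}

Event = Dict[str, str]


def triage(events: List[Event]) -> Tuple[List[Event], List[Event]]:
    """Split events into (send an alert now, hold for the morning report)."""
    alerts: List[Event] = []
    report: List[Event] = []
    for event in events:
        kind = event["type"]
        if kind in ALERT_NOW:
            alerts.append(event)
        elif kind in REPORT_ONLY:
            report.append(event)
        # Everything else is dropped on purpose: no alert, no report line.
    return alerts, report


events = [
    {"type": "node_down", "node": "core-sw-1"},
    {"type": "high_disk_usage", "node": "file-srv-2"},
    {"type": "desktop_reboot", "node": "visitor-pc"},
]
alerts, report = triage(events)
print(len(alerts), len(report))  # one alert, one report item
```

The point of the explicit allow-lists is that a new event type generates nothing until someone decides which bucket it belongs in.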
Initially, I like to have NPM discover EVERYTHING. When I'm trying to wrap my brain around a new infrastructure, I need to know what's out there. Servers, switches, storage, hypervisors, even desktops and printers. This way I get a sense of scale.
Desktops are the first to go, though. I don't want to know how many times the visitor center PC reboots in a day. Printers usually go, too. I'll let them send traps if they run into trouble, but I certainly don't need to monitor physical memory utilization of a printer on a constant basis. In the end, it's routers, switches, servers, and storage that I care about.
On monitoring-spam: I agree with earlier posts that tuning your thresholds is a great way to reduce duplicate alerts from any NMS. A 5-second spike in CPU shouldn't trip any alarms. But 95% CPU utilization for 15 minutes might warrant some diagnostics. In these cases, I let NPM run with the defaults for a month or two to establish a baseline for performance, then go back and start tweaking alert thresholds.
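To make the "spike vs. sustained" distinction concrete, here is a minimal sketch of a check that only fires when a metric stays over a threshold for a whole window. It is a generic illustration, not NPM's alert engine; the class name `SustainedThreshold` and its parameters are made up for the example, with the 95% and 15-minute figures taken from the paragraph above.

```python
from typing import Optional


class SustainedThreshold:
    """Fire only when a metric stays at or above a threshold for a full window."""

    def __init__(self, threshold: float = 95.0, window_seconds: float = 15 * 60):
        self.threshold = threshold            # e.g. 95% CPU
        self.window_seconds = window_seconds  # e.g. 15 minutes
        self._breach_started: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample; return True if the alert condition is met."""
        if value < self.threshold:
            # Metric recovered, so forget any breach in progress.
            self._breach_started = None
            return False
        if self._breach_started is None:
            # First sample over the threshold: start the clock.
            self._breach_started = timestamp
        return timestamp - self._breach_started >= self.window_seconds


cpu = SustainedThreshold()
print(cpu.observe(0, 99.0))    # short spike begins
print(cpu.observe(5, 40.0))    # spike is over after 5 seconds
print(cpu.observe(60, 96.0))   # sustained load begins
print(cpu.observe(960, 97.0))  # still high 15 minutes later
```

Only the last call returns `True`: the 5-second spike resets the clock, and the alert condition is met once the value has stayed high for the full 900 seconds.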
Good stuff. For me, I actually find the "out of the box" defaults way too overbearing and tone it down even before deployment. I can, however, see your point in letting it all go to get a sense of scale. I approach it a little differently, but my way isn't necessarily any better.
We do not monitor access switch interfaces, other than uplinks to our distribution boxes. Any new alert gets just me in the email line until the alert is fine-tuned. Some alerts are for devices that other departments use and plug into the network, so just they and I get the alert. Network-related alerts go to a lot of people, all the way up (they like to know). But remedial items, like UPS batteries and some redundant hardware items, go to the Operations team to be handled. We use the heck out of UnDPs for status.
Implementation of trap and syslog alerts has been ramped up to cover all ends of a situation. At the same time, triggers in the syslog highlight specific events on nodes and send alerts to just me and a few folks. Sometimes a really granular syslog or trap alert helps me to maintain the 'Big Brother' status and keeps people looking over their shoulder.
Not a bad strategy at all: moving the alerts out of "IT" at the macro level and compartmentalizing. Do you do that to the level where folks outside of IT (say, the financial controller) get alerts? ERP system has a problem, for instance, so the finance team gets a courtesy notice as well as the server team?
It is usually a specific request, or someone complaining about the connectivity for their devices that communicate back to a server that they use for status. When it doesn't work, seeing an email from Orion saying it is not working keeps them informed. Be aware!!! This does NOT prevent phone calls; it causes more, and emails too. In one case, we have key lock boxes that connect to the network and are managed (PW and key status) through software. Cheap Chinese electronics caused the devices to dump their TCP/IP stacks on a large multicast network. Even once isolated, the devices have issues if you do not restart the box, plug it in, and reconnect at the server in the 'correct' way. So they got a lot of disconnects, but now (after several months of calls and emails) they understand the alert, and know to check their systems if a disconnect shows and my alert did not trigger. Now I hardly get any calls unless a box moves or a new one goes in (but it took several months of taking those calls and re-explaining to get there).
Another case is our Telehealth group. They want monitoring on projectors and the digital control consoles for each room. That, too, will be a specific alert to their group, with a CC to me and my partner in crime so we are informed. Alerts may be as minor as "a projector bulb has XXXXXXXX life hours and needs to be replaced," or as serious as "Room Control Console XXX in Room ABC.1234 is not connected (or unreachable)." Of course, with this type of stuff it depends how robust the MIB table is and how much information I can give them beyond up/down.
Even more so, access layer type stuff goes mainly to our operations team, whereas a distribution or core switch issue will go to everyone (which includes the engineers).
I have another NOC-type view that highlights how we separate these items, using two Ajax views... I will see if I can post that on the NOC view request page before this time tomorrow.
Just an addendum to watching green change to yellow and red and knowing whether or not there is an alert that you or the Network team should even care about.
I am working with our Rx Group to fill in the app and device monitoring that our 'Other' alerting system, run by the DC people, really isn't watching. Also, the key points of contact and a proper, non-cryptic email suit the suits better than a bunch of #'s and characters that even our team has to decode. But of course most of the Rx alerts will be going to that group.
**** With this, on a critical system, I will create an alert for our Help/Service Desk with mild information that they can use to inform callers about an outage seconds to minutes after it happens. * This gives us an extra minute or two to get that Network Event Notification email out to all the groups, which has the full outage details and the tech deployment info for fixing the issue.
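As a rough sketch of that kind of routing, the layer of the affected device can pick the audience, with an optional extra group for department-owned gear. The group names and the `recipients_for` helper are hypothetical, and the fallback for unknown layers is an assumption. Only the access vs. distribution/core split comes from the description above.

```python
from typing import Dict, List

# Hypothetical group names for illustration.
EVERYONE = ["operations", "engineering", "management"]
ROUTES: Dict[str, List[str]] = {
    "access": ["operations"],  # access layer goes mainly to operations
    "distribution": EVERYONE,  # distribution and core go to everyone
    "core": EVERYONE,
}


def recipients_for(layer: str, owner_group: str = "") -> List[str]:
    """Return the groups to notify for an alert on the given network layer."""
    # Assumption: unknown layers fall back to operations so nothing is lost.
    groups = list(ROUTES.get(layer, ["operations"]))
    # Department-owned devices also notify their owners.
    if owner_group and owner_group not in groups:
        groups.append(owner_group)
    return groups


print(recipients_for("access"))
print(recipients_for("core"))
print(recipients_for("access", owner_group="telehealth"))
```

Keeping the table in one place makes it easier to answer "why did I get this alert?" than a pile of per-alert recipient lists.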
we monitor everything... as in, we monitor the monitor that monitors the monitor that is monitoring those monitors...
but we do not alert on everything...
we are just now getting around to cleaning things up, removing interfaces/nodes as they alarm.
as for the spam problem... well, at least it's not for cheap online prescriptions...
We monitor everything that is going on with our network except for desktop computers. All servers, network equipment, infrastructure, power, etc. are monitored on a constant basis. We have alerts sent out when circuits go down, servers go down, or any component within a device has a problem. The alerts are sent to the specific groups that deal with the particular item in question. We are not a big organization, but with a lot of the SolarWinds tools that we have, we are able to monitor like a big company or organization.
We do not really have an overarching group that monitors all alerts. We have isolated our alerting to the affected admins. We have worked with each admin, or group, to determine what alerts are useful.
Monitoring needs a deep knowledge base and really good cross-relations.
Users need services, but operators and technical people need to know which system is affected.
IMHO for each environment you need specific tools (or specific "adapters") in order to provide a better monitorability.
I think this is a thin line to walk. We are a service provider, so we need to monitor a lot of things on a lot of gear as part of the service that we provide. What I have found is that if you aren't getting any spam from the monitoring system, then you probably aren't monitoring enough and you are probably missing things. If you get too much spam, your NOC guys start ignoring everything, assuming it's spam. If your environment is a good size and has any rate of change, then keeping the monitoring system on that thin line is a constant effort no matter what software you are using, and that needs to be considered and expected as part of the solution.