Large and small companies alike have it easy when it comes to network monitoring. Large companies can afford the biggest and best solution available, and an army of people to monitor every little twitch in the network.
Small companies, on the other hand, either don't monitor at all (this is the much-vaunted, "Hey… the Internet is down" approach--also called "Canary" monitoring), or only have a very small number of devices to keep an eye on. What happens if you're in the middle somewhere? What happens if you're big enough to have a staff, but not big enough to have one dedicated person assigned to sitting around watching for failures?
If you are like me, you buy a great monitoring solution like SolarWinds's Network Performance Monitor (NPM). You do the initial installation, run through some wizards to get some basic monitoring up and going, then you start playing with the "nerd knobs" in the software, and boy does NPM have those in spades. NPM monitors everything from port states to power supplies, link errors to wireless authentications, and everything in between. You can then send alerts to a distribution list that your team tracks, assign responsibilities for monitoring and escalation, and everyone on the team now has visibility into failures and problems anywhere on the network.
Nirvana. Bliss. You have now become the Network Whisperer, a beacon of light in the darkness. Except, now you have 150 email messages about a printer because that printer has been offline for a few hours waiting for a part and someone forgot to kill the monitoring for that device. Easily fixed. Ooh… 300 more emails… someone rebooted a server.
I exaggerate a bit, but you see the problem. Pretty soon you set up mail rules, you shove your monitoring messages into a folder, and your entire monitoring solution is reduced to an in-house spam-generation machine. You find you don't actually know when stuff breaks because you're so used to the mail folder filling up that you ignore it all. The only thing you've accomplished is the creation of another after-the-fact analysis tool. When someone from accounting tells you that so-and-so system is down, you can quickly see that, yes, it is down. Well, go you.
I'll talk about how we work to solve this problem in my next post, but I'm curious how everyone here deals with these very real issues:
* What do you monitor and why?
* How do you avoid the "monitoring-spam" problem?
We monitor all switches, routers and wireless devices. Basic monitoring consists of up/down status. As mentioned, groups can reduce the number of emails we get for one particular event. From there, I have to fine-tune the alerts. Sometimes it's a case of something happened, but I was not alerted. Other times, I can see the need to be alerted for a particular event. SolarWinds keeps expanding what we can alert on, making life much easier. For example, temp monitoring used to be done either via syslogs or custom pollers, but now it is built in. Every now and then I go through the alerts and clean them out just so it is easier to see what is being alerted on.
I am sure it was, but 4 years ago I was not here...and Thwack was the sound made when my girlfriend got mad and hit me across the head. Things are much better now.
I have a very informative and awesome community/help/OLSM *cringe* ; and the headaches have stopped.
So, beat the dead horse so my head doesn't take the damage!
One thing I have seen (with varying degrees of success) is the use of event aggregators that attempt to do root cause analysis and suppress downstream events.
e.g. if you lose a hub WAN link and are thus unable to reach 10 sites over that connection, you don't really want to receive the notifications that the WAN is down and 10 sites are also apparently down; better to receive a single notification saying that the WAN is down, and this is affecting reachability to 10 sites. Similar capabilities were tied in to the NOC systems so that when a particular element failed, the NOC would know which customers were impacted, and thus could proactively notify them of the issue.
There are a number of approaches to automated root cause out there, and it can be complex, but if you can get it right and tie it into other data sources to pull added intelligence for your alerts, you can make great steps towards minimizing the number of alerts hitting the users who get notified.
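To make the hub WAN example concrete, here is a minimal sketch of downstream suppression. It assumes a hand-maintained map in which every monitored node has exactly one upstream node; the function name `roll_up_alerts` and the node names are made up for illustration and are not part of NPM or any other product.

```python
from collections import defaultdict


def roll_up_alerts(parent, down):
    """Collapse a set of down nodes into root causes.

    parent: dict mapping node -> its single upstream node (None at the top).
    down:   set of nodes currently reported as unreachable.
    Returns a dict: root cause -> sorted list of down nodes behind it.
    """
    def root_cause(node):
        # Walk upstream for as long as the next hop up is also down.
        cause, seen = node, {node}
        while parent.get(cause) in down and parent[cause] not in seen:
            cause = parent[cause]
            seen.add(cause)
        return cause

    summary = defaultdict(list)
    for node in sorted(down):
        cause = root_cause(node)
        if cause == node:
            summary[cause]  # register the root cause even if nothing is behind it
        else:
            summary[cause].append(node)
    return dict(summary)


# Hypothetical topology: ten sites behind one hub WAN link.
sites = [f"site-{i:02d}" for i in range(1, 11)]
parent = {"hub-wan": None, **{s: "hub-wan" for s in sites}}
down = {"hub-wan", *sites}

for cause, affected in roll_up_alerts(parent, down).items():
    print(f"{cause} is down, affecting reachability to {len(affected)} nodes")
```

With this input the eleven raw events collapse into one notification for `hub-wan`. The hard part in practice is not the walk itself but keeping the `parent` map accurate as the network changes.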
That's a really good idea for a feature addition to NPM. Create sites, tie the sites together, create rules around site alerting. I would think it might be easy to define a "site" structure and have all alerting inside that entity follow a set of rules.
Mostly I'm thinking out loud here as to what would be useful to the customer base of SolarWinds. A lot of the really cool features and things like you describe above are relegated to the larger enterprises who can afford to roll custom solutions, or programmatically enhance existing extensible products. SolarWinds could add some great functionality and saleability to NPM by giving some of that to everyone.
Absolutely. I don't know how much capability SW NPM has in terms of grouping alerts or rolling them up in some way, so I threw it out there anyway. Even on a site basis though, if you can't reach the WAN router for a site (e.g. you know the link has gone down), it would be really neat if you could tell NPM that (some list of monitored objects) all sit behind that single WAN router, and if the router's down, implicitly everything behind it will be too, and to generate one big alert rather than 101 little ones for the 100 things behind the router.
Once you go beyond that level, or you have multiple paths to a site, it gets much more complex - now you have to check if, say, BOTH WAN routers are down, and only then suppress alerts behind that failed 'edge'. Tricky, and hard to maintain manually over time.
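The dual-router case can be sketched the same way: an alert is suppressed only when every upstream gateway for that node is down. This is an illustration under assumptions, not how NPM dependencies are implemented; the `upstreams` map and the function names are hypothetical, and only the directly attached gateways are checked.

```python
def suppressed_nodes(upstreams, down):
    """Return the down nodes whose upstream gateways are ALL down.

    upstreams: dict mapping node -> set of its directly attached gateways.
    down:      set of nodes currently reported as unreachable.
    """
    return {
        node
        for node in down
        if upstreams.get(node) and upstreams[node] <= down
    }


def alerts_to_send(upstreams, down):
    """Down nodes that still deserve their own alert."""
    return sorted(down - suppressed_nodes(upstreams, down))


# Hypothetical site with two WAN routers in front of two switches.
upstreams = {
    "sw-1": {"wan-a", "wan-b"},
    "sw-2": {"wan-a", "wan-b"},
}

# One router down: the site is still reachable, so a dead switch is real news.
print(alerts_to_send(upstreams, {"wan-a", "sw-1"}))

# Both routers down: only the routers alert, the switches are implied.
print(alerts_to_send(upstreams, {"wan-a", "wan-b", "sw-1", "sw-2"}))
```

The logic is short; the maintenance burden sits in the `upstreams` map, which has to be updated by hand whenever a path is added or removed.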
Dependencies 'sort of' produce this behavior (one alert instead of a ton when a site gateway/router becomes unresponsive), but maybe we're talking about a different kind of desired config/behavior here.
The key word is indeed that these things need to be done manually. For example, you've got dual WAN routers and build your dependencies by putting the two routers in their own little group and building the dependency on the availability of the group...great. If you have to do that a thousand times, though, I can see where it gets chunk-style over time.
Dependencies are great, but if you are like us, then Distribution, Core and Routers have a higher-capacity UPS than our access layer. So power goes out and the access layer fails first because its UPS batteries are depleted. It still fires those access layer alerts (mainly node down in this case) and then triggers the Uplink/Interface Errors from the Distro.
And if the distro is in the same room; wait a few minutes and that node down alert will go off.
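One possible way to handle that ordering problem is a hold-down window: park each node-down alert for a few minutes and drop it if the upstream device also goes down before the window expires. The sketch below only illustrates the idea. The function name, the event format and the 300-second window are assumptions, not an NPM feature, and the obvious cost is that every genuine alert is delayed by the length of the window.

```python
def alerts_after_hold(events, parent, hold_seconds):
    """Filter node-down events using a hold-down window.

    events:       iterable of (timestamp_seconds, node) node-down events.
    parent:       dict mapping node -> its upstream node (hypothetical map).
    hold_seconds: how long to wait for the upstream node to fail too.
    Returns the nodes that should still alert, ordered by failure time.
    """
    # Remember the earliest time each node was seen down.
    down_at = {}
    for ts, node in events:
        down_at[node] = min(ts, down_at.get(node, ts))

    alerts = []
    for node, ts in sorted(down_at.items(), key=lambda item: item[1]):
        upstream = parent.get(node)
        # Suppress if the upstream was already down, or went down in the window.
        if upstream in down_at and down_at[upstream] <= ts + hold_seconds:
            continue
        alerts.append(node)
    return alerts


# Hypothetical power outage: access switches die first, the distro 4 minutes later.
parent = {"access-1": "distro-1", "access-2": "distro-1", "distro-1": None}
events = [(0, "access-1"), (0, "access-2"), (240, "distro-1")]

print(alerts_after_hold(events, parent, hold_seconds=300))  # window covers the gap
print(alerts_after_hold(events, parent, hold_seconds=60))   # window too short
```

With a window longer than the gap between the two UPS runtimes, only the distro alert survives; with a shorter one, the access layer alerts fire as they do today.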
All of our datacenters are fully redundant for power, cooling, etc., so that's not much of an issue. If all outside power fails, the batteries hold for the 15 seconds or so that it takes the gentrans to kick power over to the generator. That said, I can see where different environments aren't going to be as resilient and choices have to be made.
Servers in the DC. We are so big that not every building has its own BU Gen. Access layer interface monitoring only happens in special cases for statistics, or a setup where they have a PC running a robot, or machine or other critical device (but this is monitoring; I rarely set up alerts for access interfaces). Our naming convention and hierarchy allow for interface alerting on specific devices very easily and leave the access layer off. The access layer changes so dang much in our case that alerting on it would be useless and generate too many questions about triggered alerts, and things showing red. Most of my monitoring occurs outside our data centers, but the tool those guys use is so dang cryptic that I get requests from other department service providers to give them something understandable. DCs have full power TO THE MAX!.... Power issues reside in our MDR/IDRs mainly, and we are underway to get better power redundancy in place. Most places have normal and an Emergency power setup....most.
Great info, and thank you for sharing it! It's good to know there's at least some level of ability to wrap things up in groups, for example. I agree though about the issue of manual configuration and the pain that goes with that. For example, one place I worked tied in to a database that tracked all the devices that major applications should be traversing, and would alert not only on a fault, but could also tell you which applications might be adversely impacted by that failure. The problem was that every routing change then required you to check that paths had not changed in that database, and to maintain them as the architecture expanded. Great idea, but tiresome to keep accurate...
We have recently switched our monitoring toolset to SolarWinds and are hoping to use the trending to help us adjust our alerting to more closely match our environment.
We are also going to try out Alert Central to tweak the outbound alerts.
There are also a couple of specific monitoring apps we have that insist on using alert storms based on their polling - most annoying.
Love the Satellite5 reference - not enough "Who" around here