The network monitoring piece of network management can be a frightening proposition. Ensuring that we have the information we need is an important step, but it's only the first of many. There's a lot of information out there and the picking and choosing of it can be an exercise in frustration.
I remember the first time I installed an intrusion detection system (IDS) on a network. I had the usual expectations of a first-time user. I would begin with shining a spotlight on denizens of the seedier side of the network as they came to my front door. I would observe with an all-seeing eye and revel in my newfound awareness. They would attempt to access my precious network and I would smite their feeble attempts with.... Well, you get the idea.
It turns out there was a lot more to it than I expected and I had to reevaluate my position. Awareness without education doesn't help much. My education began when I realized that I had failed to trim down the signatures that the IDS was using. The floodgates had opened, and my logs were filling up with everything that had even a remote possibility of being a security problem. Data was flowing faster than I could make sense of it. I had the information I needed and a whole lot more, but no more understanding of my situation than I had before. I won't even get into how I felt once I considered that this data was only a single device's worth.
After a time, I learned to tune things so that I was only watching for the things I was most concerned about. This isn't an unusual scenario when we're just getting started with monitoring. It's our first jump into the pool and we often go straight to the deep end, not realizing how easy it is to get in over our heads. We only realize later on that we need to start with the important bits and work our way up.
Most of us are facing the task of monitoring larger interconnected systems. We get data from many sources and attempt to divine meaning out of the deluge. Sometimes the importance of what we're receiving is obvious and relevant. eg. A message with critical priority telling us that a device is almost out of memory. In other cases, the applicability of the information isn't as obvious. It just becomes useful material when we find out about a problem through other channels.
That obvious and relevant information is the best place to start. When the network is on the verge of a complete meltdown, those messages are almost always going to show up first. The trick is in getting them in time to do something about them.
Most network monitoring installations begin with polling devices for data. This may start with pinging the device to make sure it's accessible. Next, comes testing the connections to the services on the device to make sure that none of them have failed. Querying the device's well-being with Simple Network Management Protocol (SNMP) usually accompanies this too. What do these have in common? The management station is asking the network devices, usually at five minute intervals, how things are going. This is essential for collecting data for analysis and getting a picture of how things are going when everything is running. For critical problems, something more is needed.
This is where syslog and SNMP traps come into play. This data is actively sent from the monitored devices as events occur. There is no waiting for five minute intervals to find out that the processor has suddenly spiked to 100% or that a critical interface has gone down. The downside is that there is usually a lot more information presented than is immediately necessary. This is the same kind of floodgate scenario I ran into in my earlier example. Configuring syslog to send messages at the "error" level and above is an important sanity-saving measure. SNMP traps are somewhat better for this as they report on actual events instead of every little thing that happens on the device.
The Whisper in the Wires
Ultimately, network monitoring is about two things:
- Knowing where the problems are before anyone else knows they're there and being able to fix them.
- Having all of the trend data to understand where problems are likely to be in the future. This provides the necessary justification to restructure or add capacity before they become a problem.
The first of these is the most urgent one. When we're building our monitoring systems, we need to focus on the critical things that will take our networks down first. We don't need to get sidetracked by the pretty pictures and graphs... at least not until that first bit is done. Once that's covered, we can worry about the long view.
The first truth of RFC 1925 "The 12 Networking Truths" is that it has to work. If we start with that as our beginning point, we're off to a good start.