Monitoring 101: The Fundamental Concepts of System Monitoring

The answer to troubleshooting network challenges lies in effectively monitoring your environment. But saying “let’s monitor our network” presumes you know what you should be looking for, how to find it, and how to get it without affecting the system you’re monitoring. You’re also expected to know where to store the values, what thresholds indicate a problem situation, and how to let people know about a problem in a timely fashion.

Establishing the “What”

Here’s the bottom line: to build an effective monitoring solution, the true starting point is learning the underlying concepts. You have to know what monitoring is before you can set up what monitoring does.

Regardless of the software, protocol, or technique you use, a few fundamental aspects of a monitoring system exist across the board:

  • Element: This is a single aspect of the device you’re monitoring.
  • Acquisition: How do you get your information? Does your monitoring routine wait for the device to send you a status update (push), or does it proactively go out and poll the device (pull)?
  • Frequency: How often do you receive information? Does the device send a “heartbeat” every few minutes? Does it send data only when there’s a problem?
  • Data retention: Monitoring is data intensive. At its simplest level, data retention determines whether statistics are 1) collected, evaluated, acted upon, and forgotten, or 2) kept in a datastore of some sort.
  • Data aggregation: Retained data is usually condensed as it ages so the datastore stays manageable. For example, you might collect statistics every five minutes. After a week, those five-minute values are aggregated to an hourly average; after a month, those hourly values are further aggregated to a daily average.
  • Threshold: The idea of fault monitoring is to collect a statistic and see whether it crosses a line of some kind—a threshold. It can be a simple line (is the server on or off?) or it can be more complex.
  • Reset: Reset marks the point where a device is considered “back to normal.”
  • Response: The response defines what happens when a threshold is breached. A response could be to send an email, play a sound file, or run a predefined script.
  • Alert noise: Alert configuration can be as much an art as it is a science. On the one hand, you want to be alerted when an issue occurs. On the other hand, you don’t want to create alert rules capable of drowning you in noise and ultimately masking real issues. Machine learning shows promise in solving this problem.

Understanding the “How”

Now we know the terms necessary for a foundational understanding of monitoring—the “what.” The “how” is just as important.

There are various monitoring techniques, from classic pinging and using the Simple Network Management Protocol (SNMP) to vendor-specific methods. Additionally, some offerings use agents for monitoring while others use agentless technology. None of these is inherently right or wrong; choose based on the demands of your own systems and agency.
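
As a rough illustration of agentless, pull-style acquisition at a fixed frequency, the Python sketch below polls a device on a schedule. Note the substitution: it uses a TCP connection attempt as a stand-in for a classic ICMP ping, because raw ICMP usually needs elevated privileges. The function names, the address 192.0.2.10 (a documentation-only address), the port, and the interval are all assumptions made for the example.

```python
import socket
import time


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def poll(check, interval_seconds, rounds, sleep=time.sleep):
    """Run check() `rounds` times, pausing interval_seconds between runs.

    Returns a list of (timestamp, result) pairs. The sleep function is a
    parameter so the loop can be exercised without actually waiting.
    """
    results = []
    for i in range(rounds):
        results.append((time.time(), check()))
        if i < rounds - 1:
            sleep(interval_seconds)
    return results


if __name__ == "__main__":
    # Hypothetical target: check port 22 on a documentation address, 3 times.
    history = poll(lambda: tcp_reachable("192.0.2.10", 22, timeout=1.0),
                   interval_seconds=5, rounds=3)
    for timestamp, ok in history:
        print(timestamp, "up" if ok else "down")
```

A push-style design would invert this: the device or an agent on it would send status to the monitor, and the monitor would only need to notice when an expected update fails to arrive.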

At the end of the day, these are the four most important things to consider when strengthening your monitoring process:

  1. Ease of deployment, configuration, and maintenance
  2. Flexibility
  3. Availability of the data (to external systems and other modules within the solution) once it’s collected
  4. Intelligent filtering of alert noise

Monitoring may not be the sexiest discipline for the government IT pro, but it’s critical in ensuring systems are optimized and the mission is uninterrupted. Download the SolarWinds Monitoring 101 Whitepaper to learn more about the philosophy, theory, and fundamental concepts involved in systems monitoring.

Find the full article on our partner Carahsoft’s Community blog.

