Zen And The Increasingly Complex Art of Infrastructure Monitoring

[Image: squirrel.jpg]

Have you ever had the experience where you start looking into something, and every time you turn a corner you realize you've uncovered yet another thing you need to dig into before you can fully understand the problem? In my workplace we refer to this as chasing squirrels: being diverted from the straight path and taking a large number of detours along the way.

Managing infrastructure seems that way sometimes; no matter how much time you put into the monitoring and alerting systems, there's always something else to do. I've looked at some of these issues in my last two posts (Zen And The Art Of Infrastructure Monitoring and Zen And The Ongoing Art Of Infrastructure Monitoring), and in this post I'm chasing yet another squirrel: the mythical baseline.

BASELINING ALL THE THINGS

If we can have an Internet of Things, I think we can also have a Baseline of Things, can't we? What is it we look for when we monitor our devices and services? Well, for example:

  • Thresholds vs Capacities: e.g. a link exceeds 90% utilization, or average RAM utilization exceeds 80%. Monitoring tools can detect this and raise an alert (see the sketch after this list).
  • Events: something specific occurs that the end system deems worthy of an alert. You may or may not agree with the end system.
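
To make the stateless nature of these checks concrete, here's a minimal sketch of what a threshold check boils down to; the metric names and limits are invented for illustration and don't come from any particular tool:

```python
# Minimal sketch of a stateless threshold check (illustrative metrics and limits).
# Each check needs nothing more than the current reading and a fixed limit.

THRESHOLDS = {
    "link_utilization_pct": 90,   # alert when a link exceeds 90% utilization
    "ram_utilization_pct": 80,    # alert when average RAM utilization exceeds 80%
}

def check_thresholds(readings: dict) -> list:
    """Return an alert string for every metric that breaches its limit."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = readings.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric} at {value}% exceeds {limit}% threshold")
    return alerts

print(check_thresholds({"link_utilization_pct": 93, "ram_utilization_pct": 62}))
# ['link_utilization_pct at 93% exceeds 90% threshold']
```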

These things are almost RESTful inasmuch as they are kind of stateless: absolute values are at play, and it's possible to trigger a capacity threshold alert, for example, without any significant understanding of the element's previous utilization history. There are two other kinds of things I might look at:

  • Forecasting: Detecting that the utilization trend over time will lead to a threshold event unless something is done before then. This requires historical data and faith in the curve-fitting abilities of your capacity management tool.
  • Something Changed: By way of example, if I use IP SLA to monitor ping times between two data centers, what happens if the latency suddenly doubles? The absolute value may not be high enough to trigger timeouts, or to cross a maximum allowable value threshold, but the fact that the latency doubled is a problem. Identifying it requires historical data, again, plus the statistical smarts to determine when a change is abnormal compared to the usual jitter and busy-hour curves (see the sketch after this list).
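
As a rough illustration of the "Something Changed" idea, the sketch below compares the newest latency sample against a rolling baseline and flags it when it drifts well outside the usual jitter. The sample data and the three-sigma rule are assumptions made for illustration; a real capacity management tool would apply far more sophisticated statistics.

```python
import statistics

def latency_changed(history, current, min_samples=30, sigmas=3.0):
    """Flag a latency sample that sits well outside the recent baseline.

    history -- recent latency samples in ms (the baseline window)
    current -- the newest sample in ms
    Returns True when the new sample deviates from the baseline mean
    by more than `sigmas` standard deviations.
    """
    if len(history) < min_samples:
        return False  # not enough data to call anything abnormal yet
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9  # avoid divide-by-zero on flat data
    return abs(current - mean) / stdev > sigmas

# Illustrative values: DC-to-DC ping normally around 20 ms with a little jitter.
baseline = [20 + (i % 5) * 0.4 for i in range(60)]
print(latency_changed(baseline, 21.0))  # False - within normal jitter
print(latency_changed(baseline, 40.0))  # True  - latency roughly doubled
```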

This last item - Something Changed - is of interest to me because it offers up valuable information to take into account when a troubleshooting scenario occurs. For example, if I monitor the path traffic takes from my HQ site to, say, an Office 365 site over the Internet, and a major Internet path change takes place, then when I get a call saying that performance has gone down the pan, I have something to compare against. How many of us have been on a troubleshooting call where you trace the path between points A and B, but it's hard to know whether that's the problem because nobody knows what path the traffic normally takes when things are seemingly going well? Without having some kind of baseline, some idea of what NASA would call 'nominal', it's very hard to know whether what you see when a problem occurs is actually a problem or not, and it's possible to spend hours chasing squirrels when the evidence was right there from the get-go.
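
Recording the "nominal" path ahead of time doesn't need to be elaborate. As a hedged sketch, assuming you can collect ordered hop lists from traceroute or a path-monitoring tool (the hop addresses below are made up for the HQ-to-Office-365 example), comparing today's path against a stored baseline might look like this:

```python
def compare_paths(baseline_hops, current_hops):
    """Report where the observed path diverges from the recorded baseline.

    baseline_hops / current_hops -- ordered lists of hop identifiers
    (IP addresses or hostnames) captured when things were known-good
    and right now, respectively.
    """
    if baseline_hops == current_hops:
        return "Path matches baseline"
    for i, (expected, seen) in enumerate(zip(baseline_hops, current_hops), start=1):
        if expected != seen:
            return f"Path diverges at hop {i}: expected {expected}, saw {seen}"
    return (f"Path length changed: baseline has {len(baseline_hops)} hops, "
            f"current has {len(current_hops)}")

# Hypothetical hop lists captured on a good day versus during the incident.
known_good = ["10.0.0.1", "203.0.113.1", "198.51.100.7", "131.107.0.89"]
observed   = ["10.0.0.1", "203.0.113.1", "192.0.2.45", "131.107.0.89"]
print(compare_paths(known_good, observed))
# Path diverges at hop 3: expected 198.51.100.7, saw 192.0.2.45
```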

Many monitoring systems I see are not configured to alert on a change in behavior that's still within thresholds, but it's something I like to have when possible. As with so much of infrastructure monitoring, triggering alerts like this can be plagued with statistical nightmares: you have to distinguish a system that's idle overnight and sees its utilization climb when users connect in the morning from a system that usually handles 300 connections per second at peak suddenly seeing 600 cps instead. Nonetheless, it's a goal to strive for, and even if you are only able to look at the historical data to confirm that the network path has not changed, having that data to hand is valuable.
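
One common way to sidestep that particular statistical nightmare is to compare a reading against the history for the same hour of day, rather than against an all-hours average, so the normal morning ramp-up doesn't look anomalous. The sketch below is a minimal illustration of that idea; the 1.5x multiplier and the sample numbers are assumptions, not recommendations.

```python
from collections import defaultdict
import statistics

class HourlyBaseline:
    """Track per-hour-of-day history so diurnal patterns aren't flagged as anomalies."""

    def __init__(self, multiplier=1.5, min_samples=7):
        self.history = defaultdict(list)   # hour of day (0-23) -> past readings
        self.multiplier = multiplier       # how far above the hourly norm counts as abnormal
        self.min_samples = min_samples

    def record(self, hour, value):
        self.history[hour].append(value)

    def is_abnormal(self, hour, value):
        samples = self.history[hour]
        if len(samples) < self.min_samples:
            return False  # not enough history for this hour yet
        return value > statistics.mean(samples) * self.multiplier

# Illustrative: roughly 300 connections/sec is normal at the 09:00 peak.
baseline = HourlyBaseline()
for day in range(14):
    baseline.record(9, 300 + day)       # two weeks of 09:00 peak readings
    baseline.record(3, 20 + day % 3)    # overnight lull

print(baseline.is_abnormal(9, 320))  # False - a normal morning peak
print(baseline.is_abnormal(9, 600))  # True  - peak traffic roughly doubled
print(baseline.is_abnormal(3, 45))   # True  - overnight load out of character
```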

KILLING THE DEAD THINGS

Moving in a different but related direction, knowing whether what you're monitoring is actually active would be nice, don't you think? My experience is that virtualization, while fantastic, is also an automated way to ensure that the company has a graveyard of abandoned VMs that nobody remembered to decommission once they were no longer needed. This happens with non-virtualized servers too, of course, but they are easier to spot: the entire server does one thing, and if it stops doing it, the server goes quiet. Virtual machines are trickier because one abandoned VM among ten deployed on a host can't be detected simply by checking network throughput or the like.

Knowing what's active helps minimize the number of squirrels that distract us in a troubleshooting scenario, so it's important to be able to tidy up and only monitor things that matter. In the case of VMs, SolarWinds' Virtualization Manager has a Sprawl Management dashboard which helps identify VMs that have been powered down for over 30 days, as well as those which appear to be idle (and presumably no longer needed). In addition, if there are VMs running at 100% CPU, for example (and most likely triggering alerts), those are identified as under-provisioned, so there's a chance to clean up those alerts in one place (referred to as VM right-sizing). Similarly for network ports, SolarWinds' User Device Tracker can identify unused ports so they can be shut down and don't become the source of a problem. This also allows better capacity planning, because unused ports are identified as such and can then be ignored when looking at port utilization on a switch.
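
The logic behind that kind of sprawl report is easy to picture. The sketch below is purely illustrative: it uses invented inventory records and invented cut-offs rather than any SolarWinds API, and just shows the sort of classification the dashboard is doing on your behalf.

```python
from datetime import datetime, timedelta

# Invented inventory records; a real tool would pull these from the hypervisor.
VMS = [
    {"name": "web01",    "powered_on": True,  "avg_cpu_pct": 45, "powered_off_at": None},
    {"name": "test-old", "powered_on": False, "avg_cpu_pct": 0,  "powered_off_at": datetime.now() - timedelta(days=90)},
    {"name": "batch07",  "powered_on": True,  "avg_cpu_pct": 1,  "powered_off_at": None},
    {"name": "db02",     "powered_on": True,  "avg_cpu_pct": 99, "powered_off_at": None},
]

def classify(vm, stale_days=30, idle_cpu_pct=3, hot_cpu_pct=95):
    """Bucket a VM as stale, idle, under-provisioned, or healthy (illustrative rules)."""
    if not vm["powered_on"]:
        if datetime.now() - vm["powered_off_at"] > timedelta(days=stale_days):
            return "powered off for more than 30 days - candidate for decommission"
        return "recently powered off"
    if vm["avg_cpu_pct"] <= idle_cpu_pct:
        return "idle - possibly abandoned"
    if vm["avg_cpu_pct"] >= hot_cpu_pct:
        return "running hot - likely under-provisioned"
    return "healthy"

for vm in VMS:
    print(f"{vm['name']}: {classify(vm)}")
```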

PULLING THE THINGS TOGETHER

Looking at the list of things I want my monitoring and alerting systems to do, it seems that maybe no one system will ever provide everything I need in order to get that holistic view of the network that I'd like. Still, one thing SolarWinds has going for it is that Orion provides a common platform for a number of specialized tools, and the more SolarWinds uses data from multiple modules to generate intelligent alerting and diagnosis, the more powerful it can be as a tool for managing a broad infrastructure. Having a list of specific element managers is great for the engineering and operations teams responsible for those products, but having a more unified view is crucial to provide better service management and alerting.

Do you feel that Orion helps you to look across the silos and get better visibility? Has that ever saved your bacon, so to speak?
