
Network Monitoring Overload and How to Survive It

Level 11

By Paul Parker, SolarWinds Federal & National Government Chief Technologist

It’s always good to have a periodic reminder to consider what we’re monitoring and why. Here's an applicable article from my colleague Joe Kim, in which he offers some tips on avoiding alert overload.

If you’re receiving so much monitoring information that you don’t see the bigger-picture implications, then you’re missing the value that information can provide. Federal IT pros have a seeming overabundance of tools available for network monitoring. Today, they can monitor everything from bandwidth to security systems to implementation data to high-level operational metrics.

Many federal IT pros are tempted to use them all to get as much information as possible, to ensure that they don’t miss even a bit of data that can help optimize network performance.

That is not the best idea.

First, getting too much monitoring information can cause monitoring overload. Why is this bad? Monitoring overload can lead to overly complex systems that, in turn, may create conflicting data. Conflicting data can then lead to management conflicts, which are counter-productive on multiple levels.

Second, many of these tools do not work together, creating a greater likelihood of conflicting data, a greater chance that something important will be missed, and an even greater challenge in seeing the bigger picture.

The solution is simpler than it may seem: get back to basics. Start by asking these three simple questions:

  1. For whom am I collecting this data?
  2. What metrics do I really need?
  3. What is my monitoring goal?

Federal IT pros should start by looking specifically at the audience for the data being collected. Which group is using the metrics—the operations team, the project manager, or agency management? Understand that the operations team will have its own wide audience and equally wide array of needs, so be as specific as possible in gathering “audience” information.

Once the federal IT pro has determined the audience, it will be much easier to determine exactly which metrics the audience requires to ensure optimal network performance—without drowning in alerts and data. Identify the most valuable metrics and focus on ensuring those get the highest priority.
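One way to make this audience-first approach concrete is to encode which metrics each audience actually needs, and collect only the union of those. The sketch below is illustrative only; the audience names and metric names are invented examples, not a definitive list.

```python
# Hypothetical mapping of audiences to the metrics they need.
# Names here are examples for illustration, not a recommended catalog.
AUDIENCE_METRICS = {
    "operations":      {"interface_errors", "cpu_load", "bandwidth_util"},
    "project_manager": {"uptime_pct", "ticket_backlog"},
    "agency_mgmt":     {"uptime_pct", "mission_slas_met"},
}

def metrics_to_collect(audiences):
    """Return the union of metrics needed by the chosen audiences."""
    needed = set()
    for audience in audiences:
        needed |= AUDIENCE_METRICS.get(audience, set())
    return needed

print(sorted(metrics_to_collect(["operations", "agency_mgmt"])))
```

Anything outside that union is a candidate for dropping, which is the point: the audience list, not the tool's capabilities, drives what gets collected.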

The third question is the kicker, and should bring everything together.

Remember, monitoring is a means to an end. The point of monitoring is to inform and enhance operational decisions based on collected data. If a federal IT pro has a series of disconnected monitoring products, there is no way to understand the bigger picture; one cannot enhance operational decisions based on collected data if there is no consolidation. Opt for an aggregation solution, something that brings together information from multiple tools through a single interface that provides a single view.
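The consolidation idea can be sketched in a few lines: merge per-tool status feeds into a single host-to-worst-status view, so no one has to mentally reconcile three dashboards. This is a minimal illustration with invented tool names and status fields, not any particular product's API.

```python
# Minimal sketch of aggregation: merge status feeds from several
# (hypothetical) tools into one consolidated worst-status-per-host view.
def consolidate(feeds):
    """Merge {tool: {host: status}} dicts, keeping the worst status per host."""
    severity = {"ok": 0, "warning": 1, "critical": 2}
    view = {}
    for tool, statuses in feeds.items():
        for host, status in statuses.items():
            current = view.get(host, "ok")
            if severity[status] > severity[current]:
                view[host] = status
            else:
                view.setdefault(host, current)
    return view

feeds = {
    "bandwidth_tool": {"router1": "ok", "router2": "warning"},
    "security_tool":  {"router1": "critical", "server1": "ok"},
}
print(consolidate(feeds))
```

Even this toy version shows the payoff: router1 looks fine to the bandwidth tool but critical to the security tool, and only the consolidated view surfaces that.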

Network monitoring and network optimization are getting more and more complex. Couple this with an increasing demand for a more digital government, and it becomes clear that gathering succinct insight at the infrastructure and application levels of the agency's IT operations is critical.

The most effective course of action is to get back to the basics. Focus on the audience and the agency’s specific needs. This will ensure a more streamlined monitoring solution that will help more effectively drive mission success.

Find the full article on Federal Technology Insider.

15 Comments

I agree that keeping it simple keeps it simple.

Yet I like having all the data available, so my team can more quickly associate symptoms with causes and then work on correcting those causes.

I'm most interested in your statement "Monitoring overload can lead to overly complex systems that, in turn, may create conflicting data."

I don't disagree with that statement. But I'd enjoy seeing some examples of it in practice. What kind of monitoring causes overload, which creates conflicting data? Are you thinking of monitoring a switch, router, or server so frequently that the monitoring itself impacts memory and CPU on the target hardware?

Certainly WAN Killer can fill a WAN pipe up while one is testing to verify a carrier is providing the contracted bandwidth.

I'd like to learn what other monitoring choices come to your mind that can "create conflicting data."

Level 20

The most important point in this whole piece I think is... "Too much information is just as bad as no information."

MVP

I am as confused as rschroeder​ is.

Granted, monitoring can be turned up to the point that it impacts that which it is watching, so it should be done with care.

I get the impression the author is advocating monitoring with only a single tool, which has its own pitfalls: it leaves you with only one perspective into the environment. The gotcha is the forest-and-the-trees problem.

Other tools watching the environment from a different perspective allow you to see around the trees, and therefore to see the forest.

A good example is a load-balanced web site. If you just watch the VIP on the F5 (or whatever load balancer), the load balancer becomes the tree that blocks the forest of the web server farm. You will almost always get a good return... everything is good... until everything is bad. You also need to watch the web servers in the farm so you know when one or more falls out; then, if you start getting intermittent issues at the load balancer, you know where to go. Looking at the web servers lets you mitigate individual issues without impacting the environment as a whole. It also gives you a gauge of the health of that web server environment that you can't see from the VIP on the load balancer.
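The VIP-plus-backends idea can be sketched as a simple health check that probes both the load balancer's virtual IP and each web server behind it. The hostnames and `/health` endpoint below are hypothetical, purely for illustration.

```python
# Illustrative sketch: probe the (hypothetical) VIP *and* each backend
# web server, since a healthy VIP can mask a degraded farm.
import urllib.request

def check(url, timeout=3):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

targets = {
    "vip":  "http://vip.example.internal/health",   # the tree
    "web1": "http://web1.example.internal/health",  # the forest
    "web2": "http://web2.example.internal/health",
}
for name, url in targets.items():
    print(name, "up" if check(url) else "DOWN")
```

The useful signal is the difference between the two views: VIP up with one backend down is exactly the intermittent-issue situation described above.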


Also, who watches the watcher? You need an out-of-band tool watching your monitoring tool to make sure it is up and operational.

The tool cannot watch itself, as it affects the environment in which it lives... if it is impacted, it may not be able to tell anyone.
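A common shape for this out-of-band check is a heartbeat file: the monitoring tool writes a timestamp on each cycle, and an independent watchdog job alarms when the heartbeat goes stale. This is a minimal sketch; the file path and staleness threshold are assumptions for illustration.

```python
# Watchdog sketch (hypothetical path and threshold): the monitoring tool
# calls write_heartbeat() each cycle; a separate out-of-band job calls
# heartbeat_is_stale() and alarms if the heartbeat stops updating.
import time

HEARTBEAT_FILE = "/tmp/monitor.heartbeat"  # assumed location
MAX_AGE = 300  # seconds of silence before the watchdog alarms

def write_heartbeat(path=HEARTBEAT_FILE):
    with open(path, "w") as f:
        f.write(str(time.time()))

def heartbeat_is_stale(path=HEARTBEAT_FILE, max_age=MAX_AGE):
    try:
        with open(path) as f:
            last = float(f.read())
    except (OSError, ValueError):
        return True  # missing or unreadable heartbeat counts as stale
    return time.time() - last > max_age

write_heartbeat()
print("stale" if heartbeat_is_stale() else "healthy")
```

The key design point is that the watchdog runs somewhere the monitoring tool's own failure can't silence it, which is exactly the "cannot watch itself" argument above.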

MVP

Good article

Level 14

I monitor as much as I can. It is the alerting that is key. I try to alert only when something needs to be fixed, and I alert only the people who can fix the problem or the managers who NEED to know about the issue. That way we have all the information gathered, but there is no blizzard of useless information.
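That routing rule, collect everything, page only actionable alerts to the people who can act, can be sketched as a small lookup. The alert names and team names below are made up for illustration.

```python
# Sketch of the routing rule described above (names are invented):
# page only on actionable alerts, and only the fixers plus, where
# warranted, management.
ROUTES = {
    "disk_full":   {"fixers": ["storage-team"], "notify_mgmt": False},
    "site_outage": {"fixers": ["web-team"],     "notify_mgmt": True},
}

def recipients(alert_name, actionable):
    """Return who gets paged; non-actionable data is kept but pages no one."""
    if not actionable:
        return []  # still gathered and stored, just not alerted on
    route = ROUTES.get(alert_name, {"fixers": [], "notify_mgmt": False})
    people = list(route["fixers"])
    if route["notify_mgmt"]:
        people.append("management")
    return people

print(recipients("site_outage", actionable=True))
```

Separating "what we record" from "whom we page" is what keeps the full data set without the blizzard.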

Level 13

Agreed, monitoring and alerting are key. Simplicity helps, but we already live in a complex world of IT. If it were simple, they wouldn't need us.

TRANSLATION: "Tell them what they need to know. Not what they want to know."

I'm a manager. And sometimes I am okay with that...

MVP

Sometimes they don't know what they need to know... sometimes you don't either.

Counsel strongly against "I want everything" (all the metrics) and "tell me every time this happens" when they think it rarely happens, without researching to see whether it really happens all the time. In other words, try to steer them away from the firehose. There may be times when you have to give it to them to let them know they are ID 10 T's.

Level 13

So true...... Overload

Level 14

Monitor only what YOU need.... Report only the important... Less is more!!

Level 11

rschroeder​ excellent points to bring up. I tend to think of monitoring as a symphony and not a science. As you mentioned, there are a lot of different things: monitoring the monitor (this feels like a Watchmen reference), monitoring overload (where there's so much data it's impossible to determine anything), and a fun one, monitoring different parts of the network (a.k.a. "that's my job, not yours") with different tools that can't play well together (so there's literally no way to correlate)... I should think about writing a mini-series on 101 Sins of Monitoring.

One of the most frequent sins I've come across in the Government is auditing/monitoring everything (with no purpose) and sending it to a virtual garbage bin (I mean SIEM) with no real intent. It's not purposeful monitoring; it's compliance- and mandate-based. Too often the SIEM team isn't properly equipped to deal with so many distributed data sources, or doesn't have the time/personnel to write the complex reports needed. Couple that with things like auditing all AD Object events but not auditing Question/Response from DNS (which, more often than not, is also AD-based), and you lose a large part of the picture.

That's probably too much random thinking on my part, but does that help offer up a better explanation? I'm interested in your thoughts (and another cup of coffee).

Level 11

Jfrazier​ I apologize for the delay. I've been Globe-Trotting. That definitely isn't a basketball reference, as I'm most likely to play for the Washington Generals. Also, whether intentional or not... 2 Bonus Nerd Points for a Watchmen reference.

Monitoring (like security) is a strategy and not a product. Since I'm bringing up a lot of references this morning: there isn't "one tool to rule them all." As much as we'd love SolarWinds to be the perfect product for everybody in every situation, there needs to be purposeful intent in what you do. While we can provide insight, nobody here knows your environment (and challenges) better than you do. If you look toward the bottom of the article, Joe specifically calls out the use of aggregated solutions in a unified view. I think this article speaks more toward choosing the right combination of products for an integrated environment.

MVP

Your 3 simple questions are a great place to start. I think a lot of monitoring happens just because. There are cases where you legitimately need to monitor every port on a switch, but in many cases you only need to monitor the uplinks and a few key interfaces. Neither case is right or wrong; it depends on the needs of the organization and where the data actually adds value. Much of my SolarWinds time has been cleanup of what others have done, just because: "we have an unlimited license, so we should monitor every port on every device and every metric on every polling item." Again, it depends, but I don't know that having SNMP polling look at the UPSes in our environment several times a day to give me a count of battery packs in the chassis is something we actually need.

Level 14

I once had a data centre UPS which kept failing. It turned out that the monitoring solution was set to poll the UPS management card every minute, which eventually caused the card to panic and crash. That in turn produced the "failed" alert. We set the monitoring to every 15 minutes and all was well.

MVP

Are we there yet?

Are we there yet?

Are we there yet?

About the Author
Paul Parker is a 25-year information technology industry veteran and an expert in government IT. He leads SolarWinds' efforts to help public sector customers manage the security and performance of their systems by using technology. Parker most recently served as vice president of engineering at Infoblox's federal division. Before that, he served in C-level or senior management positions at Ward Solutions, Eagle Alliance and Dynamics Research Corp.