
Monitoring your Servers from the Real World (part 1)

Level 10

Is your server monitoring really effective? Or has it been configured in a way that produces white noise you have come to ignore?  Server monitoring is imperative to keeping your organization functioning optimally and to minimizing the number of unanticipated outages.

 

Monitoring Project Gone Wrong

Many years ago, when I started with a new company, I was handed a corporate “flip phone”.  This phone was also my pager. The first time I was on call, I expected to be alerted only when there was an issue.  WRONG!  I was alerted for every little thing, day and night.  When I wasn’t the primary point person on call, I quickly learned to ignore my device, and when I was on call, I was guaranteed to come down with some form of illness before the end of the week.  I was worn down from checking every little message on my pager all night long. As the new member of the team, I observed at first, but soon enough was enough.  Something had to change, so we met as a team to figure out what we could do.  We were all ready for some real and useful alerting.

Corrective Measures

When server monitoring has gone wrong and needs to change, what can be done?  After that experience, it became very important to pull together a small team to spearhead the initiative and get the job done right.

Here is a set of recommendations for turning badly configured monitoring into monitoring done right.


  • Determine which areas of server monitoring are most important to infrastructure success, and then remove the remaining unnecessary monitoring.  For example, key areas to monitor would be free disk space, CPU, memory, network traffic, and core server services.
  • Evaluate your thresholds in the areas defined as primary, and decide how they need to change for your environment.  The defaults set up in monitoring tools can often be used as guidelines, but they usually need modification for your infrastructure.  Even whether a server is physical or virtual can change the thresholds required for monitoring.
  • Once the evaluation is complete, adjust the thresholds for these settings according to the needs of your organization.
  • Stop and evaluate what is left after these settings have been adjusted.
  • Repeat the process until alerting is clean and alerts occur only when action is genuinely necessary.
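
To make the threshold steps above a little more concrete, here is a minimal Python sketch. It is not tied to any particular monitoring product: the metric names, the default limits, and the lower CPU limit for virtual servers are all made-up values for illustration, and your own numbers will differ. It shows tool-style defaults being overridden per platform, and a simple count of which rules fire most often so the next review pass knows where to look.

```python
from collections import Counter

# Hypothetical defaults (percent), standing in for a tool's out-of-the-box values.
DEFAULT_LIMITS = {"disk_used_pct": 90.0, "cpu_pct": 95.0, "memory_pct": 90.0}

# Hypothetical per-platform adjustments; the numbers are illustrative only.
PLATFORM_OVERRIDES = {"virtual": {"cpu_pct": 85.0}}


def limits_for(platform):
    """Start from the defaults, then apply any overrides for this platform."""
    limits = dict(DEFAULT_LIMITS)
    limits.update(PLATFORM_OVERRIDES.get(platform, {}))
    return limits


def breaches(metrics, platform):
    """Return the sorted names of metrics at or above their limit."""
    limits = limits_for(platform)
    return sorted(
        name
        for name, value in metrics.items()
        if name in limits and value >= limits[name]
    )


def noisiest_rules(alert_log, top=3):
    """Count how often each rule fired, most frequent first."""
    return Counter(alert_log).most_common(top)


if __name__ == "__main__":
    sample = {"cpu_pct": 90.0, "disk_used_pct": 50.0}
    print(breaches(sample, "virtual"))
    print(breaches(sample, "physical"))
    print(noisiest_rules(["cpu_pct", "cpu_pct", "disk_used_pct"], top=1))
```

Running `noisiest_rules` against a week of alert history is one simple way to do the "stop and evaluate what is left" step: the rules at the top of the list are the first candidates for a threshold change or removal.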

As the process is repeated, the exceptions will stand out more and can be handled more easily.  Exceptions can take the form of resources spiking during overnight backups, applications that inherently need exceptions because of how they use memory (e.g. SQL or Microsoft Exchange), or something as simple as monitoring different server services depending on the installed application.  Continual refinement and repetition of the process ensure that your 3am infrastructure pages are real and require attention.
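
The kinds of exceptions described above could be expressed along these lines. This is only a sketch under assumptions: the backup window times, the metric names, the "database" role, and its raised memory limit are hypothetical, and a real monitoring tool will have its own way of defining maintenance windows and per-application rules.

```python
from datetime import time

# Hypothetical overnight backup window (23:00 to 04:00, wrapping past midnight).
BACKUP_WINDOW = (time(23, 0), time(4, 0))

# Hypothetical metrics expected to spike while backups run.
SUPPRESSED_DURING_BACKUP = {"cpu_pct", "network_pct"}

# Hypothetical per-role limits, e.g. a database server that holds memory by design.
ROLE_OVERRIDES = {"database": {"memory_pct": 98.0}}


def in_window(now, window):
    """True if `now` falls inside the window, including windows that cross midnight."""
    start, end = window
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_page(metric, value, limit, role, now):
    """Decide whether a reading deserves a page after exceptions are applied."""
    limit = ROLE_OVERRIDES.get(role, {}).get(metric, limit)
    if value < limit:
        return False
    if metric in SUPPRESSED_DURING_BACKUP and in_window(now, BACKUP_WINDOW):
        return False
    return True


if __name__ == "__main__":
    print(should_page("cpu_pct", 99.0, 90.0, "web", time(2, 0)))
    print(should_page("cpu_pct", 99.0, 90.0, "web", time(12, 0)))
    print(should_page("memory_pct", 95.0, 90.0, "database", time(12, 0)))
```

Keeping the exceptions in small, named tables like these makes each one easy to review on the next pass, so an exception added for one backup job does not quietly hide a real problem later.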


Concluding Thoughts

Server monitoring isn’t one-size-fits-all, and these projects are often large and time-consuming.  Environment stability is critical to business success.  Poorly implemented server monitoring damages the reputation of IT, so the time spent getting it right and ensuring the stability of your infrastructure is priceless.

9 Comments
MVP

That sounds similar to what I did years ago when taking on monitoring of unix systems syslogs.

It is an appropriate methodology to steer a shop towards management by exception...reduce all that is noise to nothing and all that remains should be of interest.

Even then, with server monitoring, be wary of the manager, director, or even VP who wants you to monitor everything, or worse, alert on "everything".

Part of our job is to provide guidance and suggest better or more appropriate alerting schemes.

To your comment on thresholds, another stakeholder in our organization purchased Splunk.  A fine product, I have no doubt, but when they configured the product, they set thresholds so low that Business Units receive literally hundreds of email notifications a day.  Well, those emails have become junk mail, to say the least.

Shame on them, right?

Unfortunately, their spamming of our customers has a residual effect and monitoring emails can get lost in the noise.

Enter adamlboyd, who leveraged the power of HTML and variables to create email notifications that are backed by properly designed alerts and present strong information in a visually easy-to-grasp format, so they catch the eye of customers and give them the data they need quickly and cleanly.

Level 12

I am going through this process right now, thanks for the pointers, it has really helped!

Level 14

Great write up.  This sounds like my coworker speaking.  Our sysadmins were delighted when he showed them all the metrics that could be monitored.

Level 17

Very nice pointers. I will be starting to bring in more servers to monitor their health and status, along with some apps and services. These considerations will be strong on my mind!

Level 8

yes nice one.

MVP

yep - lots of alerts equals no alerts get actioned

Level 8

We are heading in this direction in our shop.   Thanks for the tips.  It is appreciated.

Yes, as we move to a centralized solution, dealing with the Layer 8/9 issues will be more of a challenge than Layers 1-7.

About the Author
I have been in IT since the dawn of time, even before IBM made personal computers. I worked on CPM in the 80's and have been learning something new every day since then. I now run a Virtualization, Server, Network, Storage, Application and Collaboration support team in a modern, forward-looking corporate world.