Is your server monitoring really effective? Or has it been configured in a way that produces only white noise you have come to ignore? Server monitoring is imperative to ensuring that your organization functions optimally and to minimizing the number of unanticipated outages.

 

Monitoring Project Gone Wrong

Many years ago, when I started with a new company, I was handed a corporate “flip phone”. This phone was also my pager. When I was on call for the first time, I expected to be alerted only when there was an issue. WRONG! I was alerted for every little thing, day and night. When I wasn’t the primary point person on call, I quickly learned to ignore my device, and when I was on call, I was guaranteed to come down with some form of illness before the end of the week. I was worn down from checking every little message on my pager all night long. Being the new member of the team, I first observed, but soon enough was enough. Something had to change, so we met as a team to figure out what we could do. We were all ready for some real and useful alerting.

 

Corrective Measures

When monitoring has gone wrong and needs to change, what can be done? That experience showed how important it is to pull together a small team to spearhead the initiative and get the job done right.

 

Here is a set of recommendations for turning monitoring configured wrong into monitoring done right.


  • Determine which areas of server monitoring are most important to infrastructure success, and then remove the remaining unnecessary monitoring. For example, key areas to monitor would be free disk space, CPU, memory, network traffic, and core server services.
  • Evaluate your thresholds in those areas defined as primary, and decide how they need to change for your environment. Oftentimes the defaults set up in monitoring tools can be used as guidelines, but they usually need modification for your infrastructure. Even the fact that a server is physical or virtual can change the thresholds required for monitoring.
  • Once evaluation is complete, adjust the thresholds for these settings according to the needs of your organization.
  • Stop and evaluate what is left after these settings were adjusted.
  • Repeat the process until alerting is clean and only occurs when something is deemed necessary.
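
The first few steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names, the threshold values, and the `virtual` override are all hypothetical and would need to be replaced with numbers that fit your own environment and monitoring tool.

```python
# Hypothetical default thresholds (percent in use). Treat tool defaults like
# these as guidelines, not as values that suit every environment.
DEFAULT_THRESHOLDS = {
    "disk_used_pct": 90.0,
    "cpu_pct": 95.0,
    "memory_pct": 90.0,
}

# Hypothetical per-server-type overrides: here a virtual server is assumed
# to need a lower CPU threshold than a physical one.
OVERRIDES = {
    "virtual": {"cpu_pct": 85.0},
}


def thresholds_for(server_type):
    """Merge the defaults with any overrides for this server type."""
    merged = dict(DEFAULT_THRESHOLDS)
    merged.update(OVERRIDES.get(server_type, {}))
    return merged


def evaluate(server_type, sample):
    """Return the sorted names of monitored metrics at or above threshold.

    Metrics without a configured threshold are ignored, which mirrors the
    step of removing monitoring that is not considered important.
    """
    limits = thresholds_for(server_type)
    return sorted(
        name
        for name, value in sample.items()
        if name in limits and value >= limits[name]
    )
```

With these assumed values, a sample of `{"cpu_pct": 88.0, "disk_used_pct": 50.0}` would raise a CPU alert on a virtual server but nothing on a physical one, which is the kind of difference the threshold review is meant to surface.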

 

As the process is repeated, the exceptions will stand out more and can be implemented more easily. Exceptions can take the form of resources spiking during overnight backups, applications that need their own rules because of how they use memory (e.g. SQL or Microsoft Exchange), or something as simple as monitoring different server services depending on the installed application. Continual refinement and repetition of the process ensure that your 3am infrastructure pages are real and require attention.
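
As a rough sketch of how such exceptions might be encoded, the Python below suppresses pages for selected metrics during a backup window and for selected applications. Everything in it is an assumption made for illustration: the window times, the metric names, the application labels, and the choice to suppress a metric outright rather than raise its threshold.

```python
from datetime import time

# Hypothetical overnight backup window (start inclusive, end exclusive).
BACKUP_WINDOW = (time(1, 0), time(4, 0))

# Hypothetical metrics expected to spike while backups run.
SUPPRESSED_DURING_BACKUP = {"cpu_pct", "network_mbps"}

# Hypothetical per-application exceptions: applications assumed to hold
# large amounts of memory by design, so memory alerts carry little signal.
APP_EXCEPTIONS = {
    "sql": {"memory_pct"},
    "exchange": {"memory_pct"},
}


def in_window(now, start, end):
    """Return True if `now` falls in [start, end), allowing midnight wrap."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def should_page(metric, now, app=None):
    """Decide whether a breached metric should page the on-call person."""
    if metric in SUPPRESSED_DURING_BACKUP and in_window(now, *BACKUP_WINDOW):
        return False
    if metric in APP_EXCEPTIONS.get(app, set()):
        return False
    return True
```

Keeping the exceptions in plain data like this makes each one visible and easy to revisit on the next pass through the process, rather than burying it inside alert logic.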


Concluding Thoughts

Server monitoring isn’t one-size-fits-all, and these projects are often large and time-consuming. Environment stability is critical to business success. Poorly implemented server monitoring damages the reputation of IT, so the time spent ensuring the stability of your infrastructure is well worth it.