Can server monitoring be configured so that it is actually effective? Is there such a thing as a monitoring project gone right? In my experience it is rare for a team to get what it wants out of its monitoring solution, but rest assured it is possible with the right level of staffing and effort.

  

Monitoring Project Gone Right

As many of us know, server monitoring is essential to ensuring that our business systems do not fail and that our users are able to do their jobs whenever they need to. When we are supporting hundreds or even thousands of servers in our enterprises, it would be impossible to do this manually. The right underlying system is the key to success. When we are handed a pager (yes, there was a time when we all had pagers), we want to know that the information that comes through is real and actionable. Throughout my entire career, I have worked at only one place that I feel did monitoring really well. There, I did not fall ill from being worn down and woken up by pages that were not actionable when I was on call. I could actually be certain that if my pager went off in the middle of the night, it was for a real reason.


Steps to Success

So what is the recipe for successful monitoring of your servers? Let’s take a look at how this can be done.


  • Make sure this is a real project with dedicated infrastructure resources.  This will not only allow the team to develop its skill sets, it will also ensure that the project is completed on schedule.
  • Put together a playbook which serves multiple purposes:
    • Provides a detailed list of the server monitoring thresholds and commitments for your servers
      • Documents any exceptions to the standard thresholds defined
    • Limits the number of core application services monitored to reduce complexity
    • Allows your application owners to determine which software “services” they want monitored
    • Allows the application owner to decide what action should be taken if a service fails (e.g., page the application owner, restart the service, or page during business hours only)
  • Make sure you are transparent and work with ALL of IT.  This project requires input from all application owners to ensure that the server monitoring team puts it together properly.
  • Revisit the playbook at a predefined interval to ensure that the correct system monitoring and actionable responses are still in place.
  • Refer to “Server Monitoring from the Real World Part 1” for some additional thoughts on this topic.


This may sound like a lot of work, but ensuring that every service and threshold monitored has an actionable response is imperative for success in the long term.  In the end, this approach will significantly reduce the amount of effort and resources required to ensure that monitoring is everything your business needs to run smoothly.
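
To make the playbook idea a little more concrete, here is a minimal sketch in Python of how its entries could be recorded and checked. Everything in it is an assumption made for illustration: the class and function names, the threshold names and values, and the 9:00 to 17:00 business hours are hypothetical and do not come from any particular monitoring product. It shows standard thresholds with documented per-server exceptions, a failure action chosen by the application owner, and a check that flags any monitored service that has no actionable response.

```python
from dataclasses import dataclass, field
from datetime import time
from enum import Enum
from typing import Dict, List, Tuple


class Action(Enum):
    """Failure responses an application owner can choose (illustrative set)."""
    PAGE_OWNER = "page application owner"
    RESTART_SERVICE = "restart service"
    PAGE_BUSINESS_HOURS = "page during business hours only"


# Hypothetical standard thresholds; real values belong in your own playbook.
STANDARD_THRESHOLDS: Dict[str, float] = {"cpu_percent": 90.0, "disk_percent": 85.0}


@dataclass
class ServiceEntry:
    """One monitored service, as agreed with its application owner."""
    name: str
    owner: str
    on_failure: List[Action] = field(default_factory=list)


@dataclass
class ServerEntry:
    """One server, with any documented exceptions to the standard thresholds."""
    hostname: str
    threshold_exceptions: Dict[str, float] = field(default_factory=dict)
    services: List[ServiceEntry] = field(default_factory=list)


def effective_thresholds(server: ServerEntry) -> Dict[str, float]:
    """Standard thresholds, overridden by the server's documented exceptions."""
    return {**STANDARD_THRESHOLDS, **server.threshold_exceptions}


def services_without_action(playbook: List[ServerEntry]) -> List[Tuple[str, str]]:
    """List (hostname, service) pairs that have no actionable response defined."""
    return [
        (server.hostname, service.name)
        for server in playbook
        for service in server.services
        if not service.on_failure
    ]


def pages_now(action: Action, now: time,
              start: time = time(9, 0), end: time = time(17, 0)) -> bool:
    """Decide whether an action should page someone at the given time of day."""
    if action is Action.PAGE_OWNER:
        return True
    if action is Action.PAGE_BUSINESS_HOURS:
        return start <= now < end  # assumed business hours, adjust as needed
    return False  # restarting a service does not page anyone
```

Running `services_without_action` as part of each scheduled playbook review is one way to catch entries that would generate pages nobody can act on.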

 

Concluding Thoughts

System monitoring done correctly is important for both the business and the engineers on your team.  When it is set up correctly with actionable responses, your team will not “tune out” their pages, and the quality of service provided to the business will be stellar.  Server and application uptime will also be at their best.