Avoid Delivery Failure with Monitoring Implementation Standards

Look, almost all of us have been there: you’re slogging through Monday morning after staying up late watching The Walking Dead and Game of Thrones.

That new hot shot application owner is guaranteeing that his new application solution is the best thing since sliced bread, and of course it’s more bulletproof than anything you’ve ever heard of.

As monitoring geeks, we know deep down that we can help make that application everything someone else wants it to be. We’re going to monitor it, not just for uptime and utilization, but for application performance and reliability. We have the capability and responsibility to help the business deliver on those promises.

So, how can one even hope to perform this dark magic I’m suggesting?


Standing up for standards


Elementary! As monitoring geeks, we have a bevy of tools in the box, but just as important as those tools, we have standards: standards that we adhere to, advocate, and answer to.


Setting Monitoring Implementation Standards for hardware and application platforms yields a standardized process tailored to each individual environment. This affords a consistent, streamlined monitoring experience: even if the platforms and applications are diverse, the process can remain the same.


Eliminating the impossible


For starters, we’re going to eliminate some of the guesswork by closely examining our scheduled discovery results, keeping an eye out for any wayward hardware platforms that require further inquiry and ensuring they’re assigned the appropriate custom properties. Then we start asking some critical questions to narrow our focus, which may include:

  • What does the app do and who utilizes it?
  • What is it running on and where?
  • What OS does it require?
  • What database does it use?
  • What languages is the application written in?
  • What processes and services are needed to make it function?
  • Does it have a Web portal and ports that need to be monitored?
  • Who needs to know when there is a problem?
  • What alerts are needed? Up and down? Do you want to know when specific components go into warning or critical states?

Consistently asking these standard questions at the onset of every monitoring activity will help build your customized standard, allowing you to target the unknowns quicker and start amassing data points.
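One way to make these questions a repeatable standard is to keep them as a fixed intake checklist and flag whatever is still unanswered before monitoring goes live. The sketch below is a minimal illustration, assuming answers are collected into a simple dictionary; the question keys and the `unanswered` helper are hypothetical names, not part of any SolarWinds product.

```python
# Hypothetical intake checklist mirroring the standard questions above.
STANDARD_QUESTIONS = {
    "purpose": "What does the app do and who utilizes it?",
    "platform": "What is it running on and where?",
    "os": "What OS does it require?",
    "database": "What database does it use?",
    "languages": "Which programming languages does it use?",
    "processes": "What processes and services are needed to make it function?",
    "web": "Does it have a Web portal and ports that need to be monitored?",
    "contacts": "Who needs to know when there is a problem?",
    "alerts": "What alerts are needed?",
}


def unanswered(answers: dict[str, str]) -> list[str]:
    """Return the checklist questions that have no non-blank answer yet."""
    return [
        question
        for key, question in STANDARD_QUESTIONS.items()
        if not answers.get(key, "").strip()
    ]


if __name__ == "__main__":
    # Example intake for a new application; most answers are still missing.
    intake = {
        "purpose": "Order entry for the sales team",
        "os": "Windows Server",
        "database": "   ",  # a blank answer still counts as an unknown
    }
    for question in unanswered(intake):
        print("Still unknown:", question)
```

Keeping the questions in one shared structure means every new application gets the same intake, and the unanswered list becomes the to-do list for the application owner.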


As your monitoring system continues to amass those data points, riddle me this, hero: what good is the data if we never look at it?

The devil is in the data

Scheduled data review is incredibly important for trend detection and data integrity. Start digging into that mountain of data you’ve been collecting with canned or custom reports, then schedule them to be sent straight to your inbox. Review them weekly, monthly, and quarterly. You might be surprised at what you find (or what you don’t!).

After consistently reviewing the information, you will be able to start sorting and collating that data into quantifiable metrics to show, for example, the ridiculous availability and uptime of the hot shot’s new application.
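To turn raw poll results into an availability figure for a report, a common approach is to divide the number of successful polls by the total number of polls in the reporting period. This sketch assumes each poll is recorded as a simple up/down boolean at a fixed interval; the function names are hypothetical, and your monitoring platform’s own availability calculation may weight or exclude polls differently.

```python
def availability_percent(poll_results: list[bool]) -> float:
    """Percentage of polls in which the application responded as up."""
    if not poll_results:
        raise ValueError("no poll data for this period")
    # True counts as 1 and False as 0, so sum() gives the number of successful polls.
    return 100.0 * sum(poll_results) / len(poll_results)


def downtime_minutes(poll_results: list[bool], interval_minutes: float) -> float:
    """Approximate downtime, assuming each failed poll represents one full interval."""
    return poll_results.count(False) * interval_minutes


if __name__ == "__main__":
    # Hypothetical day of 5-minute polls (288 total) with 3 failed polls.
    day = [True] * 285 + [False] * 3
    print(f"Availability: {availability_percent(day):.3f}%")
    print(f"Downtime: {downtime_minutes(day, 5):.0f} minutes")
```

Counting each failed poll as a full interval is a deliberate simplification; it keeps the report easy to explain to decision makers, at the cost of some precision for very short outages.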

This charted data is now powerful business intelligence for decision makers when budgets get tighter, or just a good measurement for regulatory reporting.

By standardizing the appropriate level of hardware and application monitoring, scheduling automated reports, and reviewing the data you ensure the business’ applications and services are delivered reliably time and time again.

What monitoring standards do you utilize to consistently deliver applications and services to your constituents?


13 Comments
MVP

A regular review of the monitoring requirements of our internal customers:

To be sure nothing has fallen out or through the cracks.

To allow for adjustments, or even decommissioning of items that no longer need to be monitored.

To identify things that may need to be added.

To make sure expectations are being met and to ensure everyone is on the same page.

Level 14
  1. Monthly audits/comparison of what's monitored by the various network tools (not just SolarWinds). This ensures that all nodes in production are monitored and nothing has fallen through the cracks.
  2. Monthly review of Policy Compliance rules -- to make sure the rules are all up-to-date and relevant.
  3. Going after SNMP strings -- especially enforcing removal of public and private default SNMP strings.
  4. Shutting down Telnet and standardizing on more secure protocols (e.g., SSH).
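The monthly audit in item 1 and the SNMP clean-up in item 3 both boil down to simple set comparisons. A minimal sketch, assuming you can export node lists from your production inventory and from each monitoring tool as sets of hostnames, plus a mapping of each node to its configured SNMP community strings; all node names, strings, and helper names here are hypothetical.

```python
# Hypothetical audit helpers; the node names and exports are illustrative only.
DEFAULT_COMMUNITIES = {"public", "private"}  # default SNMP strings to eliminate


def unmonitored_nodes(production: set[str], *tool_exports: set[str]) -> set[str]:
    """Production nodes that none of the monitoring tools report as monitored."""
    monitored: set[str] = set().union(*tool_exports)
    return production - monitored


def nodes_with_default_snmp(communities: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each node to any default community strings it still has configured."""
    flagged: dict[str, set[str]] = {}
    for node, configured in communities.items():
        found = configured & DEFAULT_COMMUNITIES
        if found:
            flagged[node] = found
    return flagged


if __name__ == "__main__":
    production = {"web01", "web02", "db01", "app01"}
    solarwinds_export = {"web01", "db01"}
    other_tool_export = {"web02"}
    print("Unmonitored:", sorted(unmonitored_nodes(production, solarwinds_export, other_tool_export)))

    snmp_config = {
        "web01": {"public", "ro-string-1"},
        "db01": {"ro-string-2"},
        "sw-core": {"private"},
    }
    for node, found in sorted(nodes_with_default_snmp(snmp_config).items()):
        print(f"{node}: remove {sorted(found)}")
```

SNMP community strings are case-sensitive, so this exact-match check only catches the literal defaults; anything else that looks guessable still deserves a manual look during the review.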
Level 15

Precisely the steps taken here. Additionally, we do a lot of review of port 80 and port 443 traffic egressing the network. We try to group applications with like requirements together for monitoring and management.

Level 17

Awesome set of questions! I can say that I am guilty of not asking for all of these details... along with the idea of a regular audit or review of what you have to make sure the node/server still responds as well as the application.

  - I had a server admin revoke my monitoring service account's access to a server... now I have no clue what it's doing, and if they ever ask for details I'm going to have to remind them about removing the account.

Level 14

I wish I could say that I figured things out all on my own and w/o pain. In fact, the opposite is true. I am very fortunate that I work alongside very smart folks at my firm. Also, I've learned from my own mistakes as well as those of others I've met or worked with. And I've learned to document processes and prepare plans that my teammates and I check and double-check. As a result, we run a pretty well-maintained environment, without as much pain as we endured in our younger years. This reminded me of another wise quote, this one from George Santayana... He said...

"Those who do not learn history are doomed to repeat it."

Mr. Santayana stated the above over 100 years ago!!! That's the beauty of wise counsel, it is timeless... 

Level 15

I humbly offer this quote from Ben Franklin, which seems like wise words here:

“Tell me and I forget, teach me and I may remember, involve me and I learn.”

Level 12

Great set of questions, and they would definitely help streamline and organize monitoring and alerting, if one can get the answers to them. I am sometimes stuck making the decision myself and sending the alert to the party I feel should be responsible. When your team is not responsive, oh boy!

SNMP vs WMI?

To monitor or to log?

Retention? How long must you keep data? How long is reasonable if you have no retention requirements?

Baselines need to be captured while everything is functioning normally so that you will be able to spot anomalies.
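A simple way to act on that baseline advice is to record a metric while everything is healthy, summarize it as a mean and standard deviation, and flag later readings that fall too far outside it. This is a minimal sketch using Python's standard `statistics` module; the three-standard-deviation threshold, the function names, and the sample response times are illustrative assumptions, not a recommended setting.

```python
import statistics


def build_baseline(normal_samples: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of readings taken during normal operation."""
    if len(normal_samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    return statistics.mean(normal_samples), statistics.stdev(normal_samples)


def anomalies(readings: list[float], baseline: tuple[float, float], k: float = 3.0) -> list[float]:
    """Readings more than k standard deviations away from the baseline mean."""
    mean, stdev = baseline
    return [r for r in readings if abs(r - mean) > k * stdev]


if __name__ == "__main__":
    # Hypothetical response times (ms) captured while the app was healthy.
    normal = [120, 118, 125, 122, 119, 121, 124, 117]
    baseline = build_baseline(normal)
    # Flags both the slow outlier (180) and the suspiciously fast one (95).
    print(anomalies([123, 119, 180, 121, 95], baseline))
```

A mean-and-deviation baseline assumes the metric is fairly stable; metrics with strong daily or weekly cycles usually need a separate baseline per time window.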

Level 14

I just thought about another monitoring standard, although I have mixed feelings about it. 

In the ideal world, all critical tasks related to monitoring should have two or more persons who know how to perform them. That way, if one of the persons is on vacation, sick -- or stepped off the wrong curb -- the monitoring tasks won't cease. This is even more important for those few tasks that require 24 x 7 work.

Why do I have mixed feelings about the above?  On one hand, I can relax a bit while on vacation or if I am sick.  On the other hand, the firm may consider me (or you) as someone they could do without one day and save themselves some $$$.  Or, they could decide to find someone much younger and w/less experience and still save $$$.  So, while having a backup person (or team) is a good practice, it is not without its risk to our employment -- especially in this day and age. 

Level 9

jkump, very nice quote. BTW, @jeremy, very interesting post. Very enlightening.

Level 17

It could go that way with whatever you do. If your shop is small enough that you're the "only guy" who can do something, they are toast when you hit that brick wall coming back from Friday lunch. Either documented SOPs or "the other guy" needs to be there so they can at least turn the knobs and keep the reactor from overheating.

Level 13

Thanks, this is very helpful. If you haven't clearly defined the scope of what you've set out to do, it's very difficult to know if you're doing it right.

Bless you for starting this thread, and those of you who have contributed to it after the fact. I have been struggling with this for years at my company. I've been conceptualizing this since forever... but I was never able to put pen to paper. This helps tremendously.

What a time saver it is to have these excellent Thwack members sharing ideas!  The above blog and comments are excellent resources to take away and adopt or implement--thank you all for the deep thought you've obviously put into the topic, and for taking the time to share your ideas.