
IT Monitoring and the Five Stages of Grief

Level 12

It’s a fact that things can go wrong in IT. But with the advent of IT monitoring and automation, the future seems a little brighter.

After over a decade of implementing monitoring systems, I’ve become all too familiar with what might be called monitoring grief. It involves a series of behaviors I’ve grouped into five stages.

While agencies often go through these stages when rolling out monitoring for the first time, they can also occur when a group or department starts to seriously implement an existing solution, or when new capabilities are added to a current monitoring suite.

Stage One: Monitor Everything

In this initial non-decision to “monitor everything,” the assumption is that all the data will be useful and that it can be “tuned up” later. Everyone is in denial that there’s about to be an alert storm.

Stage Two: The Prozac Moment

“All these things can’t possibly be going wrong!” This ignores the fact that a computer only defines “going wrong” the way it has been told to. So you ratchet things down, but “too much” is still showing red and the reaction remains the same.

Monitoring is catching all the stuff that’s been going up and down for weeks, months, or years, but that nobody noticed. It’s at this moment you might have to ask the system owner to take a breath and realize that knowing about outages is the first step to avoiding them.

Stage Three: Painting the Roses Green

The next stage occurs when too many things are still showing as “down” and no amount of tweaking is making them show “up” because, ahem, they are down.

System owners may ask you to change alert thresholds to impossible levels or to disable alerts entirely. I can understand the pressure to adjust reporting to senior management, but let’s not defeat the purpose of monitoring, especially on critical systems.

What makes this stage even more embarrassing is that the work involved in adjusting alerts is often greater than the work required to actually fix the issues causing them.

Stage Four: An Inconvenient Truth

If issues are suppressed for weeks or months, they will reach a point when there’s a critical error that can’t be glossed over. At that point, everything is analyzed, checked, and restarted in real time. For a system owner who has been avoiding dealing with the real issues, there is nowhere left to run or hide.

Stage Five: Finding the Right Balance

Assuming the system owner has survived through stage four with their job intact, stage five involves trying to get it right. Agencies need to make the investment to get their alerting thresholds set correctly and vary them based on the criticality of the systems. There’s also a lot that smart tools can do to correlate alerts and reduce the number of alerts the IT team has to manage. You’ll just have to migrate some of your unreliable systems and fix the issues that are causing network or systems management problems as time and budget allow.
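As a rough illustration, here is a minimal sketch of criticality-based thresholds plus simple alert correlation. The tier names, threshold numbers, and helper functions are illustrative assumptions for this sketch, not any particular product's API.

```python
# Sketch: vary alert thresholds by system criticality, then collapse duplicates.
# All names and numbers here are illustrative assumptions, not a product API.
from dataclasses import dataclass

# CPU-utilization warning/critical thresholds, varied by system criticality.
THRESHOLDS = {
    "critical": {"warn": 70, "crit": 85},   # tight limits for core systems
    "standard": {"warn": 85, "crit": 95},   # looser limits for everything else
    "lab":      {"warn": 95, "crit": 99},   # lab gear rarely needs a page
}

@dataclass
class Alert:
    node: str
    metric: str
    value: float
    severity: str

def evaluate(node, tier, cpu_pct):
    """Return an alert only if the value crosses the tier's threshold."""
    t = THRESHOLDS[tier]
    if cpu_pct >= t["crit"]:
        return Alert(node, "cpu", cpu_pct, "critical")
    if cpu_pct >= t["warn"]:
        return Alert(node, "cpu", cpu_pct, "warning")
    return None

def correlate(alerts):
    """Collapse duplicate (node, metric) pairs, keeping the worst severity."""
    worst = {}
    rank = {"warning": 1, "critical": 2}
    for a in alerts:
        key = (a.node, a.metric)
        if key not in worst or rank[a.severity] > rank[worst[key].severity]:
            worst[key] = a
    return list(worst.values())

if __name__ == "__main__":
    raw = [a for a in (
        evaluate("db01", "critical", 88),
        evaluate("db01", "critical", 72),
        evaluate("lab07", "lab", 96),
    ) if a]
    for alert in correlate(raw):
        print(alert)
```

The point is simply that the same metric can page for a core database while staying quiet for a lab box, and that duplicates get rolled up before anyone is notified.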

Find the full article on Federal Technology Insider.

54 Comments
Level 13

great article...unfortunately, I have yet to see anyone go from 1-5 directly. The steps in-between are inevitable.

I've seen this exact scenario several times; it's astute of you to point it out and get people thinking about what's right for their environment versus what's possible to monitor and alert on.

Your observations about too much information, then too little, then missing important info, and finally tweaking monitors and alerts to what's appropriate (done by trial and error, best guess and discovery and real-world observations) is also something I've personally experienced.

If you're familiar with electrical circuit design, the process you've described is very similar to electrical damping.  Think of an HVAC environment that has a poorly damped environmental controller or thermostat.  Temperature drops when winter comes, and the thermostat turns on the furnace.  Ideally it keeps things at a nice, steady temperature when people are present, and allows temps to drop to safe energy-saving levels when people are not present.

Without good damping, temperatures (alerts) exceed the acceptable level.  Underdamping means the furnace will remain off until things become too cold (too few alerts), and then the furnace starts up and gets it warmer--hopefully not TOO warm this time.  Then it shuts off, the temp drops, and the process repeats itself.  Overdamping (disabling too many alerts/monitors) results in a temperature that's too cold (or too few network alerts).  The goal is to have your electrical circuit critically damped, so you get the right temperature all the time (and the right amount of alerts).


Something that's critically damped ends up with a true report and the right amount of alerts.
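To tie the damping analogy back to alerting, here's a minimal sketch of one way to "critically damp" a noisy metric: exponential smoothing plus separate trigger and clear thresholds (hysteresis) so an alert doesn't flap. The class name, thresholds, and smoothing factor are made-up assumptions, not anything from a real monitoring product.

```python
# Sketch of "damped" alerting: an exponentially smoothed reading plus separate
# trigger/clear thresholds (hysteresis) so a noisy metric doesn't flap.
# The numbers and the smoothing factor are illustrative only.

class DampedAlert:
    def __init__(self, trigger, clear, alpha=0.3):
        self.trigger = trigger      # alert when the smoothed value rises past this
        self.clear = clear          # clear only when it falls back below this
        self.alpha = alpha          # smoothing factor: lower = more damping
        self.smoothed = None
        self.active = False

    def update(self, raw):
        # Exponential smoothing damps short spikes, like thermal mass in HVAC.
        self.smoothed = raw if self.smoothed is None else (
            self.alpha * raw + (1 - self.alpha) * self.smoothed)
        if not self.active and self.smoothed >= self.trigger:
            self.active = True
            return "ALERT"
        if self.active and self.smoothed <= self.clear:
            self.active = False
            return "CLEAR"
        return None   # no state change; no noise for the on-call engineer

if __name__ == "__main__":
    monitor = DampedAlert(trigger=80, clear=70)
    for reading in [60, 95, 65, 96, 97, 98, 50, 45]:
        event = monitor.update(reading)
        if event:
            print(f"{event} at smoothed value {monitor.smoothed:.1f}")
```

Run against that noisy series, it raises one ALERT and one CLEAR instead of flapping on every spike, which is roughly the "critically damped" behavior described above.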


Thanks for sharing this topic!

Level 13

The electrical circuit reference is an interesting analogy.  Having had a former career as an Electronics Technician and having worked with Electrical Engineers on this specific example of critical damping, it hits home for me.

Level 13

I have seen a lot of these 5 steps as well.  I try to guide people to step 5, but sometimes it seems they have to see the fallout from the other steps first before they will work to get to step 5.

Level 14

Fun read.  The smile never left my face the entire time.

MVP

It can take years for some companies to make the transition to stage 5.

Usually stage 1 is an overzealous manager or tech who thinks you need to monitor everything.

This is epically true.

Level 10

Great post!  I live steps 1-5 daily.  The Prozac moment really spoke to me.

Level 9

So right

Level 13

Oh so true...  Monitor everything, then after a while it's "nope, don't need that information."

MVP

Very good article. To mtgilmore1's point, I tend to over-monitor and then back down. However, I do the opposite with alerting, as over-alerting leads to "eh, ignore that"-itis.

Level 16

Nice read

Is it possible to experience all 5 at the same time and just take the Prozac?  HA HA HA

Level 14

That sounds much more efficient.

Level 14

Great article & so true.

Level 11

Here is what I try to educate people (read: managers) on as far as monitoring. I record all the monitoring data available from the system, but I don't alert on 90% of it until there is a commitment by the SME of that service to what is meaningful. I also tend to make a ton of it self-service, since different groups have different needs, which SolarWinds NPM and SAM handle fairly well.

This usually breaks out into two main items: up/down alerts (and defining those correctly) and user experience monitoring.
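As a rough sketch of what "defining up/down correctly" can mean, one common pattern is requiring several consecutive failed polls before declaring a node down. The example below is a hypothetical stand-alone illustration (Linux-style ping, made-up parameters), not how NPM or SAM actually implement it.

```python
# Sketch: only declare a node "down" after N consecutive failed polls,
# instead of paging on the first missed ping. Parameters are illustrative.
import subprocess
import time

def poll(host):
    """One ICMP probe (Linux-style ping); True if the host answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        capture_output=True,
    )
    return result.returncode == 0

def is_down(host, required_failures=3, retry_delay=5):
    """Declare 'down' only after the required number of consecutive failures."""
    for attempt in range(required_failures):
        if poll(host):
            return False              # any success means the node is up
        if attempt < required_failures - 1:
            time.sleep(retry_delay)   # give a transient blip time to clear
    return True

if __name__ == "__main__":
    host = "192.0.2.10"   # placeholder address (TEST-NET-1), not a real node
    print(f"{host} down: {is_down(host, retry_delay=2)}")
```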

I think we will see a bigger trend of collecting monitoring data and putting it into a repository for later "use".

I think what the five stages of monitoring grief really point out is how good or bad your operations team is, and the politics in your organization (we all have this).

Level 21

This is so absolutely true.  Years ago, when we were using HP OpenView (I still have PTSD over that), I had a manager who, when asked what we should be monitoring, responded "everything".

Level 14

HP OpenView.  Oh, the horrors.  Thank goodness we replaced that about five years ago with our beloved SolarWinds.

Level 13

Still in the process of replacing HP OpenView here.  We are stuck on a couple of complex alerts they are somehow doing, but most of the conversion is done now.

Level 12

Great article.  I have seen this happen so many times over my career (since 1978). Sometimes getting the people (read: managers) (nod to slackerman19) to move on to Stage Five is difficult at best.

MVP

I never used HP OpenView, but I did get exposure to several of the Computer Associates applications. While they were probably good if you were managing data centers all across the globe, I was trying to use them in a single data center - talk about unnecessarily overcomplicated. I've got to say that with our company agreement we got any application for $1, and most of the time they were worth every penny.

Level 16

I was able to overcome my PTSD from OV and SCOM. I solved it in one evening by creating a RESUME, and life has never been better.

Was this article written with the mindset that there was a Beginning - Middle - End? I see Monitoring as a lifecycle. The lifecycle could be an enterprise, a datacenter, a VM Host, or an application. But it never ends. It never, ever, ever... ends!

Level 13

I don't think so.  It isn't talking about the monitoring life cycle, but the monitoring and alerting grief when someone, whether team, management or 'other' just says 'turn it all on'.  Monitoring is a life cycle, but I hope going through all of these grief stages only has to happen once.  Or at the very least can skip a few earlier grief stages.  I don't want to keep repeating grief stages 1-2 (with the same customers).  The rest is an adjustment process.

Been there, done HPOV.  It had a sweet implementation back in the 1990's that was built for 100% 3Com networks.  I bought and used it in a school district with great success; it was, in its own '90s way, better than NPM.  "Better" by being simpler, less expensive, more primitive, and requiring less hardware and fewer personnel resources.  And it was something one person could use across a network of 33 WAN sites and 14,000 customers.

When I changed jobs and found no network discovery and no monitoring, I bought HPOV and tried to implement it in a Nortel / IBM-AIX / Windows world and promptly began drowning for lack of personnel resources.  It demanded a constant and  overwhelming amount of tweaking and personal support.  After weeks of professional training and too many 70 or 80 hour weeks of trying to get it to work, I finally gave up on it as a bad product for my organization's needs and support capabilities.  I could have thrown three full-time Unix admins at it and still not gotten the needed information and performance and alerts out of it.  And I could've saved a lot of money if I'd known how bad it would be, and how good the alternative was.

I turned to the Internet and IT peer support groups and shopped & tested and compared; eventually I ended up with Solarwinds NPM and it made the difference between failure with depression and success with recognition for achievement.

Level 11

I feel like this article was written about me and what happened when I set SAM up for the first time....did you guys install secret cameras in my office and if so, I need to monitor everything about them!!

I think it's all a part of the learning process and growing pains of monitoring... over time we get better at it, get a better idea of what's needed... and hopefully we manage to get to step 5.

MVP

Level 13

the cameras are not so secret...it's part of SAM!!!

🙂

Level 13

Some days I wonder if there should be a 6th stage..."why oh why did I ever go into I.T."

MVP

For Sure

Level 12

To prove man's superiority over machines? 

Level 13

until the rise of the machines...

Level 14

somewhere between stage 2 and 3 beckons some artificial paradise! (Sysadmin's choice of course!)

Great read..... Strive for 5!!!!

Level 13

balance is the key...the pendulum will swing wildly until then...

Level 8

Monitoring grief is real... Too many folks I know think otherwise.

=^.^= Thanks for the read.

Level 20

Maybe we need a new 12 Step program for Network Monitoring Engineers!

Level 13

LoL...

Hi, My name is Steven, and I'm a monitoring addict.

Level 13

12 steps or infinite life cycle of 12 steps?

MVP

10 step 1
20 step 2
30 step 3
40 step 4
50 step 5
60 step 6
70 step 7
80 step 8
90 step 9
100 step 10
110 step 11
120 step 12
130 goto 10

Ahhh!

MVP

very basic....

Level 13

is it? or did you mean BASIC

Level 12

That sort of depends.  Basically there are many flavors (it puts Baskin-Robbins to shame), and some were BASIC and some were Basic.  For an interesting list see:  http://www.nicholson.com/rhn/basic/names.basic.txt

MVP

sorry..meant BASIC

Level 13

I prefer Dairy Queen

WhatsUp Gold killed our HP OpenView install in 2003.  WUG was simple, easy, and effective.

Of course, WUG was also buggy, crashed, and would sometimes cry wolf when it marked everything down.  Hey, someone reboot WUG!!!

RT

The best way to clean up monitoring issues is to attack them head on.  If it is a new setup, build it in sections and clean up as you go.  If it is an existing system, for goodness' sake don't let it get out of hand.

We upgrade Orion 2 to 4 times a year.  

The last upgrade we replaced the app server with an updated one.  That happens every three to five years.

Complacency is a career killer.

RT

On alerts: build it, test it, and watch it yourself for a few weeks before sending it to the masses.

RT

Level 14

Absolutely correct 

Build the alert, then test the alert (to yourself, probably with completely skewed thresholds or trigger settings so that the alert actually fires). Fix the alert (because it's probably not exactly what you intended) and re-test it (still to yourself). Then reset all the skewed thresholds and let it trigger for real (to yourself), and only when you are really happy with the alert content and the number and frequency of alerts, change the MailTo field to the real recipients. You can always BCC yourself as well and just rule those messages into a folder to keep another check on delivery.
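For anyone who wants to make that workflow explicit, here's a minimal sketch of the same idea: an alert definition that starts in a test stage (skewed threshold, author-only delivery) and is only promoted to the real recipients once it behaves. The names and addresses are made up for the example and are not SolarWinds alert objects.

```python
# Sketch of a staged alert rollout: test-only recipients first, then production
# recipients with the author kept on copy. Names here are illustrative only.
from dataclasses import dataclass, field
from enum import Enum

class Stage(Enum):
    TESTING = "testing"          # skewed thresholds, mails only the author
    PRODUCTION = "production"    # real thresholds, mails the real recipients

@dataclass
class AlertDefinition:
    name: str
    threshold: float
    author: str
    recipients: list = field(default_factory=list)
    stage: Stage = Stage.TESTING

    def mail_to(self):
        if self.stage is Stage.TESTING:
            return [self.author]                    # nobody else gets the noise
        return self.recipients + [self.author]      # keep yourself on copy

def promote(alert, real_threshold):
    """Reset the skewed test threshold and switch delivery to the real list."""
    alert.threshold = real_threshold
    alert.stage = Stage.PRODUCTION
    return alert

if __name__ == "__main__":
    cpu_alert = AlertDefinition(
        name="High CPU",
        threshold=1.0,                 # deliberately skewed so it fires in test
        author="me@example.com",
        recipients=["noc@example.com"],
    )
    print("test recipients:", cpu_alert.mail_to())
    promote(cpu_alert, real_threshold=90.0)
    print("prod recipients:", cpu_alert.mail_to())
```

The point is just that promotion to the real recipients is a deliberate step, not the default.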

Level 15

Interesting thoughts.  Trying to get other departments to trust the results of the monitoring is interesting.  They can't believe that they have that many issues, or it's "geez, we always just blamed the network when the interfaces went down, but now that I see it is not the network, it's time to work with the vendor to ACTUALLY fix the issue."

Level 11

Beautifully told article about monitoring... I am going to use it at my management meeting.

thanks for the info.

About the Author
Joseph is a software executive with a track record of successfully running strategic and execution-focused organizations with multi-million dollar budgets and globally distributed teams. He has demonstrated the ability to bring together disparate organizations through his leadership, vision and technical expertise to deliver on common business objectives. As an expert in process and technology standards and various industry verticals, Joseph brings a unique 360-degree perspective to help the business create successful strategies and connect the “Big Picture” to execution. Currently, Joseph serves as the EVP, Engineering and Global CTO for SolarWinds and is responsible for the technology strategy, direction and execution for SolarWinds products and systems. Working directly for the CEO and partnering across the executive staff in product strategy, marketing and sales, he and his team are tasked to provide overall technology strategy, product architecture, platform advancement and engineering execution for Core IT, Cloud and MSP business units. Joseph is also responsible for leading the internal business application and information technology activities to ensure that all SolarWinds functions, such as HR, Marketing, Finance, Sales, Product, Support, Renewals, etc., are aligned from a systems perspective, and that the company's own products are used to continuously improve their functionality and performance, which ensures success and expansion for both SolarWinds and its customers.