32 Replies Latest reply on Aug 5, 2014 4:06 AM by esther

    Making room for failure

    subnetwork

      “To err is human…”  At least that was the famous declaration made by Alexander Pope.  Despite this bit of truth, many knowledge workers understand that our day-to-day roles leave little room for failure.  We are expected to be experts in our field, and experts don’t make mistakes. We don’t make mistakes because mistakes cost the business money, create security vulnerabilities, and destabilize our systems.

      But of course, we do make mistakes. If an engineer reviews their own two-year-old design and doesn’t cringe a little, that engineer hasn’t learned anything in the past two years. We all create less-than-perfect designs, or get left behind by best practices for a period of time.

      These realities raise the question -- how does your team handle the problems they discover? Does the blame game begin? Is there a witch hunt? Is the last person who left the team the scapegoat for everything wrong in your systems? Do you spend hours creating documentation to explain the root cause of all possible effects? Even worse, does the team treat the errors like unexploded ordnance -- ignoring them, afraid that one misdirected breath will set them off?

      As teams, we have to create room for errors when they occur. We must create space, a safety umbrella if you will, that acknowledges that we all occasionally make poor choices. Most importantly, we must focus on resolving the issue rather than creating a public display of punishment and shame. 

      How does your organization handle failure when it is discovered?

        • Re: Making room for failure
          RandyBrown

          I've worked for a company where mistakes were followed by days' worth of meetings, discussions, documentation and efforts to figure out why the mistake happened, whose fault it was, what could be done to avoid that mistake in the future, and admonitions to the individual(s) that made the mistake to be more careful.  This type of approach only leads to people becoming gun-shy, hesitant to make changes or even do regular tasks during the workday for fear of making a mistake.  Efforts at these types of companies usually create low morale and often result in high turnover.

           

          The company that I currently work for expects that we take precautions (change management) to try our best to avoid mistakes. However, when mistakes do occur, they are treated with an attitude that shows that everyone understands that mistakes are just a part of the world that we live/work in.  Nobody becomes the scapegoat, the mistake doesn't result in an inquisition ... once the problem that occurred as a result of the mistake is resolved, we put our heads together to determine why it happened, have a short conversation about how we can avoid the mistake in the future, then move on with our lives/work.  Nobody around here feels like they will be flogged/fired/shamed if they make a mistake.  That said, we are all very conscious of the fact that we need to exercise caution when doing our work.

            • Re: Making room for failure
              zackm

              My current work environment falls within the same scope as your second example. And, as an engineer, I can fully appreciate the accountability placed upon my shoulders and I have no issue with taking the blame for something that is my fault.

               

              However, I have also worked in the office where I have "8 bosses that come talk to me about my TTS reports". In my humble opinion, management can drain the very soul out of an employee with the scapegoat routine. A boss flexing his or her muscles just to prove that they can force their employees to learn from their mistakes is not a boss I will work for again. It's a TEAM effort. Accountability is mandatory, but scapegoating is completely unnecessary.

              • Re: Making room for failure
                subnetwork

                RandyBrown wrote:

                 

                  Efforts at these types of companies usually create low morale and often result in high turnover.

                I couldn't agree more. I've experienced the same thing in the past. One company I worked at had a process called "The 5 P's". It was a "why, why, why" type of process. I can't honestly remember what the "P" stood for, but I remember that we called it "The 5 Punishments".

              • Re: Making room for failure
                bsciencefiction.tv

                Where I work, finding someone to blame is the key, and if you can fix the problem, that is a bonus.

                 

                Okay, just kidding.  Unfortunately, it is often the last person who left the team who dropped the ball, not doing a dang thing during their notice or not tying up loose ends.  Our team is given the room to fall down, but:

                1.     You have to own it,

                2.     You better have tested it, first

                3.     You need to ensure you put in procedures or preventive means to keep it from happening again.

                • Re: Making room for failure
                  rharland2012

                  Almost all 'mistakes' made at my present workplace are not mistakes at all - but normal operating procedures breaking something that was kludged as a temporary fix months or years before. Lack of documentation was our biggest enemy when I came in about a year ago.

                  Since then, we've instituted a serviceable change management piece, established some decent documentation practices, and other things.

                   

                  When mistakes are made:

                   

                  1. We fix the problem.

                  2. The person who makes the mistake owns it, and no further rancor, scolding or contempt is needed.

                  3. We know something we didn't know before, and mistakes are not made twice.

                   

                  Like every other problem in IT, I've noticed - in my limited experience - that poor communication skills are the root cause of pretty much every screw-up I've ever witnessed or perpetrated. I've seen really smart people break things because they're too uncomfortable, awkward, and/or intimidated to have a proactive discussion with others in the shop. It makes me a little sick, but I see it way too much.

                    • Re: Making room for failure
                      Alen Geopfarth

                      THIS..... THIS 10000000000 times.

                       

                      After 2 years at my current job, I am still finding little nuggets of cringe-worthy gold in the network (2 load balancers that should have been in HA but were not, both online and booted but the second one was not configured. We didn't know that till the primary died and the secondary didn't kick in).

                       

                      I am scrambling to make sure that everything I touch has a paper trail. The reason for all of the chaos today is the lack of documentation presented to the network support staff 3-4 generations removed from the initial install. If you know you have to take a shortcut due to budget, time or training constraints, CALL IT OUT IN YOUR DOCUMENTATION.

                       

                      When my CIO asks "Why did this happen?" it is nice to be able to show a root cause that leads to more support rather than someone losing their job.

                      • Re: Making room for failure
                        subnetwork

                        rharland2012 wrote:

                        3. We know something we didn't know before, and mistakes are not made twice.

                        So you are saying that you learn from your mistakes? Surely not...  :-)

                      • Re: Making room for failure
                        byrona

                        I think we have handled the errors we find in just about all of the different ways you have mentioned at one point or another.  Our company has matured a lot since I first started working there somewhere around 12 years back.  At this point we are generally pretty good about working as a team to understand the problem and chart a course correction that moves things to our current standards and resolves the issues.

                         

                        What I personally find very frustrating is people that will deny any involvement when asked about an error that has been found instead of just owning it, acknowledging there is a better way, and helping the team move forward to a better solution.

                        • Re: Making room for failure
                          andrethegiant

                          **** appens

                          Most of the "disasters" are due to human errors or bad plans or bad implementation.

                          IMHO error handling is really important, but it can be done with a more general case management system.

                          The most important aspect is how to avoid (or at least reduce) unmanageable cases in production. Change tracking, pre-testing, and incident handling are a must.

                          • Re: Making room for failure
                            lwpeters

                            We do not have a large team, so sometimes these instances don't become widely publicized; if they are big enough, of course, they are.  But a statement has been pushed down to our team "fail fast and fail forward". We realize that failure is inevitable, so we must react, learn and move on.

                              • Re: Making room for failure
                                subnetwork

                                lwpeters wrote:

                                But a statement has been pushed down to our team "fail fast and fail forward". We realize that failure is inevitable, so we must react, learn and move on.

                                This is the right idea! It acknowledges that failure might happen, and when it does, we expect you to fix it quickly and then learn from the problem. I like it.

                                  • Re: Making room for failure
                                    jgherbert

                                    Agreed - I like that motto a lot. The key part for me is the LEARN bit. We can be accepting of failure, we can fix quickly and we can treat our people with respect. But if we don't LEARN from that mistake and think about how to make sure that type of error doesn't happen in the future, then we're just going to keep on making the same ******** errors ;-)

                                • Re: Making room for failure
                                  Aforsythe

                                  I like that motto, lwpeters. Mistakes are going to happen; the only thing you can do is try to plan ahead for them and learn from them. Any company that is providing any kind of IT related or dependent services for their customers should have at least some bare bones HA/DR systems in place and TEST them regularly.

                                  • Re: Making room for failure
                                    Kurt H

                                    With all the monitoring tools that we have in place, we do not leave room for failure. When we see anything that is starting to break, we proactively fix the problem before it turns into something major. Thanks to SolarWinds Orion and the alerts we get, we are able to fix things proactively instead of after the fact.

                                      • Re: Making room for failure
                                        jgherbert

                                        So kurtrh, what I'm hearing is that you do have failures, but you identify them quickly so they can be resolved. I would think that fixing a problem once you receive an alert about it is not proactive; it's reactive, no?

                                         

                                        Thumbs way up for having good monitoring in place in the first place though. That's definitely half the battle.

                                      • Re: Making room for failure
                                        jgherbert

                                        One outage I was involved in saw two conference bridges being opened at the same time. The first was a technical bridge where we were troubleshooting the outage that had occurred; the second was upper management discussing who to fire.

                                         

                                        I've worked in a wide variety of environments, each with their own spin on handling failures (those which translated to some form of partial or total outage). And at each level within the company, the reaction can be different. For example, after a big outage that impacts customers:

                                         

                                        Engineering line manager (e.g. my manager): Typically geeky enough to get that bad things sometimes happen; keen to understand the root cause, see how it came about, and ensure that it doesn't happen in the future. Should be protecting the engineer.

                                        v

                                        Their manager: Annoyed at the negative attention their team just got. Rants at ELM for not managing team properly. Often still close enough to understand that problems happen, but knows that it's going to be a bad day.

                                        v

                                        Their manager: Pulled into emergency exec meeting about the outage; hauled over the coals about reputational or financial impact of the outage. In turn tips the misery right back down the chain. Seeks blood in order to satiate exec team. Must be seen to be doing SOMETHING, no matter how irrelevant that action might be.

                                        v

                                        Their manager: Doesn't care about technical nature of the outage; outages are unacceptable. Somebody has to pay. Demands head on stick; really doesn't care whose it is.

                                         

                                        I'm not describing my company, of course! The point is though that where the rubber meets the road, we're often more interested in the technical aspects of the failure (root cause); at the top of the management chain, they're usually more focused on business aspects (e.g. financial impact) and that in turn drives their thinking. Throughout the company it's probably agreed that failure is unacceptable, and we should strive for zero failures. However, given that even with the best changes there can be unforeseen consequences, it will come down (as kurtrh implies above) to how quickly you find the problem and resolve it, which in turn means you need a very thorough testing and acceptance process even for "minor" changes. When there's an outage, we typically talk about root cause, yes, but you also have to look at extenders - i.e. what allowed the failure to last as long as it did? Why was it not identified and fixed faster? e.g. Did the testers skip steps? If so, they bear some responsibility for the outage too, and there's something that needs fixing there too.

                                         

                                        I've definitely worked in places where the attitude about failures was so negative that nobody would be completely honest about what they did, in case the blame was laid at their door. It makes doing the Root Cause Analysis (RCA) almost impossible, because you're effectively interviewing hostile witnesses who will not offer anything more than the bare minimum to answer your question. And nobody wants to step up and say "I know what happened, I screwed up." But that's actually exactly what you need. If you create an environment where everybody's scared, you simply extend the RCA process, encourage people to lie and cover up problems, and nothing really gets better.

                                         

                                        TLDR, I know. ;-)

                                        • Re: Making room for failure
                                          cahunt

                                          Goal : To comment and refrain from typing up an entire deposition.

                                           

                                          We are so big that it could go either way; it depends on who made the mistake, what team they are with, and what their super told them to respond with at that specific time. Normally the admission ends up coming from a super or manager and tends to be a robotic explanation of what was done. Our good technicians and engineers will send that info out to the team themselves. Our alerts are such that we highlight mistakes or missteps during changes if precursor work has not been done. It used to be that changes were massively reviewed, but politics dictate a different process these days, and being on the monitoring team lets me see it all. There is as much variance in how a tech or engineer responds to their mistake as there is in how far up it goes. Being in healthcare, any issue affecting our patient care is of course highlighted. As long as changes follow policy, we have a window to manage any mistake and even revert to a known working environment.

                                          • Re: Making room for failure
                                            rgeist

                                            I used to work for a company that made you feel like you were the worst person ever if you made any mistake whatsoever.  I remember one guy being called out during a company meeting and reprimanded in front of everyone by the boss. It was horrible.  Towards the end of working there I did not want to make any changes other than very small ones for fear of messing something up and getting yelled at in front of everyone.  Even though I was encouraged to learn and apply new things, I never would because if I made an error during the learning process, I would be chewed out, meetings would ensue, etc etc.

                                             

                                            My current company knows that mistakes will be made sometimes and they need to be fixed. Once they are fixed, learn from them.  My department keeps a log in our wiki of things we find and how they were resolved so we will have a running log of them.  I am much more relaxed here knowing that if I try something new and it doesn't work out, I won't be scolded and humiliated in front of the whole company.  When someone does drop the ball, the action that needs to be taken really depends on what it was.

                                            • Re: Making room for failure
                                              michael stump

                                              I've worked in the fed IT space for a long time. Most human errors lead to witch hunts and public shaming. All of the earlier posts about low morale are spot on. The result of this type of environment is that people are afraid to do anything, for fear of making a mistake. Productivity drops, and projects slow down.

                                               

                                              The reaction depends on the type of error, though. Was it a typo in a config file that blew away an application? Or a flawed design that introduced a security problem? Everyone makes typos; I'd hope management would at least empathize with that type of mistake. But if the error is in the design of a system, then maybe you hired the wrong engineer.

                                              • Re: Making room for failure
                                                fcpsolaradmin

                                                Usually we go down as a team; the primary of the system is the one who writes the 80-page incident report.

                                                 

                                                From the mistake, we learn and move forward. I have been in the spotlight a few times, with the question "Why wasn't that being monitored?" My reply is typically "Well, it is now."

                                                • Re: Making room for failure
                                                  rharland2012

                                                  Harping on documentation, I know  - but there's a reason that the best big teams tend to have runbooks and people to keep them accurate.

                                                  I'm not a genius, or even particularly smart; but I can follow directions with context. It helps when working on layered spin-down/up processes - such as invasive power utility changes and things where stuff actually has to be - gasp - turned off on purpose!

                                                  • Re: Making room for failure
                                                    802jr

                                                    Two trains of thought for me: what I have seen and what I think should happen.

                                                     

                                                    First, yes, we are held accountable for the mistakes we make and/or find in our designs, systems, etc. I have seen the blame game played, and the game never has a winner. The manager hunts down all the email and communications about said system or project, nitpicks at everything, and takes everything literally. Sure, this is a good way to find who is at fault. But what does it accomplish? Employees start to have bad feelings about their manager, communications start to break down, and nothing is ever the same. The "TEAM" may never again play well together, which is unfortunate because the members of the "TEAM" and most likely other departments of the organization depend on that "TEAM" to play well and produce quality systems.

                                                    In a perfect world we would all come to work, finish our tasks, and move the project along to completion. Begin the monitoring phase; if faults are found, analyze what the root cause is and begin to design a remedy for the system. Implement the new design and start the cycle over again.

                                                    • Re: Making room for failure
                                                      subnetwork

                                                      There has been a lot of great discussion on this thread. Check out the latest thread in the series here:

                                                      http://thwack.solarwinds.com/message/209951

                                                      • Re: Making room for failure
                                                        wbrown

                                                        What happens after the fact depends on the scope of the failure and what levels of management were made aware.

                                                        If it is as simple as some workstations were knocked offline because a switch crashed then we simply open a TAC case and try to get root cause.  If it is shown that there is a software bug we push patched firmware and continue on our way.

                                                        If the failure is on the scale of a facility or enterprise-wide app being down then root cause analysis is performed and documented.  The analysis is partly to figure out what corrective/preventative actions are required as well as to placate department/facility heads that are screaming for answers.  I only recall a single instance of someone being fired as a result but that was after the same individual created multiple identical failures.

                                                        • Re: Making room for failure
                                                          bluefunelemental

                                                          Being at the Engineering Line Manager level but reporting to the VP means I know how easy it is to make technical mistakes, and even get to participate in them sometimes, while at the same time needing to provide upper management with confidence in our projects and upgrades.

                                                          I have little concern about engineers making mistakes if they are caught quickly and resolved - more so with not testing ahead of time to learn from mistakes, and with moving on without notifying anyone or asking for help to button up a project or task.

                                                          Measure twice, cut once, then you know what, measure again and if need be re-cut.

                                                          Let's not wait till we have a team of people holding up the walls to find out that the joists are cut short.

                                                          • Re: Making room for failure
                                                            esther

                                                            "To err is human"  ..... that phrase rings a bell, but not in every case. When a mistake is made it should be handled and managed knowing that we are all humans and can make the same or even worse mistakes.

                                                            • Re: Making room for failure
                                                              network defender

                                                              I must agree that time needs to be allotted for failure.  Very rarely does a build happen with no problems.  Then there is documentation for your build so you can repeat it as needed.  Time needs to be allowed to redline documents and procedures.

                                                              • Re: Making room for failure
                                                                phyllip2004

                                                                bsciencefiction.tv

                                                                Definitely late on my part but super well said. Owning it and being proud of what you do.

                                                                • Re: Making room for failure
                                                                  bkyle

                                                                  I can accept failure, everyone fails at something. But I can't accept not trying.

                                                                   

                                                                  Michael Jordan



                                                                  • Re: Making room for failure
                                                                    esther

                                                                    True bkyle, you just have to try, try, try again till you succeed.