26 Replies Latest reply on Apr 3, 2014 12:39 PM by esther

    Improving From Failure

    subnetwork

      To thine own self be true, and it must follow, as the night the day, thou canst not then be false to any man.  -William Shakespeare

       

      Last week, I discussed the fact that we will all make mistakes. As engineers and administrators, it is easy to get caught up in our emotions after such an event. We worry about our job security, our upcoming review, the contract with the customer, and our co-workers' perception of our abilities.

       

      Despite our best efforts, we will make mistakes throughout our careers, and every employer has a different way of handling failure. Regardless of the requirements placed on us, we have a greater responsibility to ourselves. With each mistake, we need to swallow our pride, remind ourselves that we are not demi-gods, and that errors will occasionally happen. Once we have dealt with the shame and guilt, we can move on to analyzing the mistake, and learning from it.

       

      When we analyze a mistake for our own reasons, there are a few questions to ask:

      -What was the root cause?
      -What was the resolution?
      -What was my role in both the cause and resolution?
      -What can I do to prevent a re-occurrence?

       

      These are standard questions, generally asked of the entire team. When we direct these questions toward ourselves, without any pride or finger-pointing, a unique thing happens: we begin to discover areas that need improvement. We can create a list of learning topics, identify personal weaknesses like impatience, and ultimately ensure that each mistake moves us forward in our overall career instead of backwards.

       

      What lessons have you learned through your mistakes?

        • Re: Improving From Failure
          bsciencefiction.tv

          After you have analyzed the problem several times and still cannot see the mistake, especially in SQL queries, code, or rules, grab a peer and have them look over the code, query, or rule without giving them any information that may prejudice their review of it.  They will often immediately see what you have been overlooking.  So often we code the same thing over and over until our mind reads what should be there as opposed to what is there.  If we swallow our pride and get a second opinion, it is so very helpful.

            • Re: Improving From Failure
              subnetwork

              I couldn't agree more. Peer review is a quick and easy way of not only finding problems, but avoiding them to begin with. It requires trust within the team, but is extremely effective when it is used.

              • Re: Improving From Failure
                wbrown

                Get the 2nd pair of eyes to agree not only on what is there, but also on where "there" is.

                When looking through a number of different devices for an issue it is very easy to start confusing where the information was found.  For example, which router was I in when I found route X? or which switch was I in when I saw an incorrect spanning-tree root? or which firewall did I see the connection reach? etc....

                 

                Colleagues and I often bounce theories off each other when planning a change.  This helps identify assumptions that can then be verified or refuted before we step into any gotchas.  This process helps not only when planning, but also during troubleshooting and post-mortem.  Such discussion may start with "I saw symptom X, therefore I suspect possible issues 1, 2, or 3. Should be able to confirm by going to switch ABC to find symptom Y, ....".

                 

                It is important to make sure that the 2nd, 3rd, 4th pair of eyes are technical eyes.  I've been in situations where multiple layers of management are literally standing over a shoulder and insisting on some action being taken ASAFP.  Making a change too quickly can easily appear to fix the immediate issue but really just move the issue somewhere else.

              • Re: Improving From Failure
                cahunt

                Very good set of questions. Some days it is difficult not to take the inquisition personally and to remember that the investigation and the process are fact finding. Though not everyone approaches it in that fashion... ahem, managers. If you can remember, though, that you care about your job enough to stand up and have the integrity to admit the failures or issues that were projected to the world (at least at the manager/owner level), and to discover the cause and effect and also a good procedure/workaround/fix, I think we could put most managers out to pasture.

                • Re: Improving From Failure
                  SomeClown

                  I have found a couple of things from mistakes I have made:

                   

                  (1) The mistakes I have made which led to significant problems, downtime, etc., all are ones which to this day--even 20 years after the fact in some cases--I do not forget.  If you really sit down and analyse the root cause of a significant problem, and you honestly come to see the role you may have played, it's the kind of thing that stays with you.  As Mark Twain said: "A man who carries a cat by the tail learns something he can learn in no other way."

                   

                  (2) Most of the mistakes I have made are due to sloppiness and over-confidence.  Thoughts like "I've done this a million times, I know this system, I can skip pre-checks" are what tend to crop up as root causes repeatedly.  If you've worked with something for a long time, and have reached a fairly high level of skill, you can start to "believe your own press" and think you're really that good--a "demi-god" as you say.  The network will punish you for your transgressions.  This is one of the quickest paths to staying humble that I know.

                    • Re: Improving From Failure
                      subnetwork

                      SomeClown wrote:

                      (1) The mistakes I have made which led to significant problems, downtime, etc., all are ones which to this day--even 20 years after the fact in some cases--I do not forget. 

                      This is right on. We all make mistakes that become the ultimate lesson. I once forgot the "add" keyword when I thought I was extending a vlan to a new switch. Never made that mistake since.
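For anyone who has not hit this one: the post does not name the command, but assuming it is Cisco IOS's `switchport trunk allowed vlan`, leaving out `add` replaces the trunk's allowed list instead of extending it. Below is a minimal Python sketch of that difference. The function names and VLAN numbers are hypothetical, made up purely for illustration, and the allowed list is modelled as a plain set.

```python
# Hypothetical model: a trunk's allowed-VLAN list as a set of VLAN IDs.
# Function names and VLAN numbers are illustrative assumptions only.

def allowed_vlan_replace(current, vlans):
    """Models `switchport trunk allowed vlan <list>`: the old list is discarded."""
    return set(vlans)

def allowed_vlan_add(current, vlans):
    """Models `switchport trunk allowed vlan add <list>`: the old list is extended."""
    return set(current) | set(vlans)

trunk = {10, 20}                                # VLANs already carried on the trunk

without_add = allowed_vlan_replace(trunk, [30]) # what was typed
with_add = allowed_vlan_add(trunk, [30])        # what was intended
dropped = trunk - without_add                   # VLANs silently cut off

print(sorted(without_add))  # [30]
print(sorted(with_add))     # [10, 20, 30]
print(sorted(dropped))      # [10, 20]
```

The `dropped` set is the part that hurts: every VLAN that was already on the trunk stops passing the moment the command is entered.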

                      • Re: Improving From Failure
                        byrona

                        I also agree with this.  Implementing peer reviews has significantly reduced the number of errors and/or mistakes that get put into production.

                      • Re: Improving From Failure
                        zackm

                        I've got mistakes made in spades... But they are all unique and distinct in that I try very hard to not repeat mistakes at all. The mere act of committing a root cause to memory, in an effort to not repeat it, is a very strong tool to advance as a well rounded engineer (in my opinion). I think that we all can admit that, no matter what experience or tenure you possess, we get sloppy and tend to take shortcuts past standard practice in an effort for speed over quality assurance. It's human nature, but I think you will tend to find the best admins and engineers are the ones who take their mistakes a little personally. Ownership of issues is becoming harder to find for a lot of the reasons mentioned in the original post.

                        • Re: Improving From Failure
                          michael stump

                          Quoting from Hamlet is an auspicious and foreboding start to this thread.

                           

                          The last royal screw-up I've been solely responsible for was in preparation for a DR exercise. I vMotioned some hosts to non-replicated storage before the last replication run between prod and DR. And by some hosts, I mean the AD servers. We recovered everything except for AD, which made authentication... interesting. The worst part was that one of my coworkers had confirmed that the VMs were on the right storage just before I moved them.

                           

                          It set me back a bit, and made me realize I had become a bit overconfident. Like many others, I realized that I couldn't edit my own work, as it were. I ended up making peer review part of the change management process, and added peer validation to the change implementation process. As long as you're able to improve upon the mistake, you're growing professionally.

                          • Re: Improving From Failure
                            802jr

                            Almost every network administrator has had to get up from the desk, grab the keys to a car, and drive to a remote site, at least when it is relatively close by. Why do rookie network administrators have to drive, you ask? Well, I quickly learned, because I too made this mistake: changing a router configuration without being aware of an alternate path to that same router or remote device. It was a costly mistake to copy and paste the configuration onto the closest router first and then try to get to the opposite end when I had just broken the link. I would hate to have learned this lesson with a remote branch in a completely different city without having someone to call at the far end. Needless to say, I now always make sure that I have an alternate or redundant path to the remote side, even if it is a slow partial T1. Heck, I'll even take a dial-up just to have that reassurance there for me.

                              • Re: Improving From Failure
                                subnetwork

                                I think the other thing that is learned over time is how to recover from a failure in the most graceful manner. As beginning engineers, we often take off running (or driving) when something goes down. As we grow, we can think through various solutions to a problem remotely, from "reboot in X", to copying from startup back to running, to changing the local config to match the remote system so that we can bring the remote back online for the fix.

                                Ultimately, we also start working through the various problems before they happen, and how we can recover IF there is a problem.

                              • Re: Improving From Failure
                                byrona

                                One of my biggest challenges in becoming a system admin was the fear of making mistakes, especially on customer systems.  While I still struggle with it, I have had to accept that it happens and just be sure to have a backup in place for when I really mess things up. It isn't a matter of "IF", it's a matter of "WHEN", because it will eventually happen.

                                • Re: Improving From Failure
                                  Scott Sadlocha

                                  I have learned a few things from mistakes, some of which have been mentioned to some degree. But I will add my thoughts..

                                   

                                  1. If you make a mistake, own it. Don't ever try to hide it. Taking ownership and responsibility shows character. Don't play the blame game.

                                   

                                  2. If you make a large scale mistake and realize it right away, and you have a feeling that you might be in over your head, go with your gut and enlist help right away. Trying to fix it yourself when it is beyond your scope just causes the issue to grow and escalate. It is better to have a bunch of people know of your mistake and remedy it quickly than to have fewer people know and have an issue persist, especially when end users are involved.

                                   

                                  3. If you are making a large scale change that will affect multiple users, no matter how well you know the system, get a second set of eyes on the configuration changes before pressing the button to implement. If the second set of eyes isn't available, at the very least step away for a few minutes and then come back and give it a once over. I have found innumerable instances where everything looks good and then I notice a mistake when rechecking or having someone else check.

                                   

                                  4. Going with the above point, if you have already made a mistake and are trying to determine root cause but running into a wall, have a second set of eyes look at your code, query, or configuration changes. Again, if the second set is not available, step away a bit and come back. It is shocking how many times stepping away from a problem and coming back has worked for me.

                                   

                                  5. If you have made a mistake, analyze it and determine how you can grow to avoid it in the future. Try to determine controls you can put in place, both personal and in the system involved, to avoid future instances.

                                   

                                  Thank you for the great topic Jonathan.

                                  • Re: Improving From Failure
                                    RandyBrown

                                    Speaking from recent experience (last week we had a catastrophic failure of our core switch after an upgrade) ... Even when there is a team in place (extra eyes to oversee what is going on and put in their 2 cents if/when something out of the ordinary happens), when a problem occurs, it is best to slow down ... yes, slow down ... and evaluate the entire problem BEFORE jumping in to fix the problem.  In the case of last week's events, we jumped in as soon as the problem showed up and started trying to fix the one problem that we could see.  Had we stepped back, slowed down, and observed for a few minutes ... we might've/should've seen that there was a much bigger problem and the resolution likely would've come much faster than it did.

                                     

                                    It seems counter-intuitive to slow down when a big problem occurs ... my gut instinct is to kick it into gear and start trying to fix the problem ... but I think it is prudent.


                                      • Re: Improving From Failure
                                        subnetwork

                                        This is a great point!

                                        RandyBrown wrote:

                                        when a problem occurs, it is best to slow down ... yes, slow down ... and evaluate the entire  problem BEFORE jumping in to fix the problem.

                                         

                                         

                                        When things don't go as planned, we assume that we know what the problem is, and that we can fix it. That isn't always the case, and proceeding without understanding the whole problem can create a bigger headache.

                                         

                                        I know of a site that, after a planned power outage which shut down the entire MDF, turned the power back on to everything and allowed systems to begin booting. After 10 minutes, traffic still wasn't being passed, so the technician shut down the secondary core. Once he did so, everything began working normally, so he assumed that the Supervisor had gone bad. The REAL problems involved misconfigured HSRP and routing. He started a TAC case, and had the supervisor replaced. Since the primary was up, and fully stable, when he turned the secondary back on, all traffic passed as expected. So he "resolved" the problem, right?

                                        • Re: Improving From Failure
                                          michael stump

                                          Completely agree. Don't Panic. That usually leads to thrashing, conflicting reports of "I fixed it," followed by, "Wait, nevermind." Understand the problem before implementing the solution. Break-fixes are sometimes an exception, as long as there is follow-up to make sure the fix is in fact correct.

                                        • Re: Improving From Failure
                                          rharland2012

                                          I've made many mistakes, and will likely make more. Echoing other sentiments here, there are a couple that are etched into my memory.

                                          I'm not ashamed, though. I can be professionally embarrassed and feel a genuine responsibility for the people I may impact with a miscue, but these things don't own me.

                                          Since I was never a quick one, I've used these mistakes - and the things I learned from them - to establish methodologies for work that help.

                                           

                                          1. Change one thing at a time.

                                          2. Document what you're going to do before you do it.

                                          3. If something goes wrong, do the next right thing. Don't freeze - it doesn't help anyone.

                                          4. Realize that unless you're dealing with life-or-death issues, what's happening - however huge - is not life or death. Maybe you've fudged up really badly. Maybe you'll get fired. It happens every day. You'll work again. Plus, if you're working somewhere where human error is rewarded with termination on a one-strike count, it's probably a poisoned well anyway.

                                          5. Leverage the smart people around you. If your boss, colleague, or SME is a zero-documentation, keeps-everything-in-his-head type, architected everything in the shop, and doesn't have a stitch of docs, start talking, since memory and knowledge cannot be leveraged without context.

                                          6. Ask a million dumb questions - even though it will irritate him or her - and don't feel bad about it. Your job is about deliverables. You're showing that you want to assist your team by documenting. Documentation is demystification, plain and simple. Runbooks are our friends.

                                          • Re: Improving From Failure
                                            Chet Camlin

                                            Hero to Zero in one click. I’ve been on both sides of an IT infrastructure failure.  It is not a good feeling to be on either side.  Always accept responsibility for your actions even when it would be easy to blame someone or something else.  It is not enough to just accept responsibility. To recover from your Hero to Zero situation you must identify what caused the problem and implement a preventative action. If you don’t, you will be seen as someone who tries hard but can’t be relied upon to do the hard work to ensure it doesn’t happen again. 

                                             

                                            Remember your boss has a boss.  If you don’t give your boss truthful information then you risk him telling his boss the wrong thing.

                                              • Re: Improving From Failure
                                                superfly99

                                                Never make the same mistake twice!!

                                                 

                                                No one notices (or rarely notices) when you do good work, but everyone will remember the one mistake you made. We have a managed change process in place to help stop mistakes from being made. So if you make a mistake, the audience is much larger than just your boss. I always double check my work to minimise any issues.

                                                • Re: Improving From Failure
                                                  subnetwork

                                                  There has been a lot of great discussion on this thread. Check out the latest thread in the series here:

                                                  http://thwack.solarwinds.com/message/209951

                                                  • Re: Improving From Failure
                                                    esther

                                                    Mistakes are necessary for improvement.... they are blessings