6 Replies Latest reply on Aug 23, 2012 7:06 AM by Jeremy Stretch

    Staging an event "dry run"

    Jeremy Stretch

      Does your organization ever stage critical event "dry runs" to evaluate response processes in a controlled environment? I think it's a safe generalization that most NOCs are constantly busy, but it can certainly be worthwhile to stage a fire drill every now and then to gauge how well procedure is being followed (or, if it isn't, whether the procedure needs to be updated).

       

      This is an idea I've been toying with lately at work. There are a few obvious items to consider:

       

      Who to involve. Obviously, management needs to give the go-ahead for such an exercise, but how far beyond the network operations group (or equivalent body) should the test extend? Should upper-level support be in on it, or should they too be included in the evaluation? What about non-technical staff? We have to keep in mind that the more people who are brought into the loop, the greater the odds that the intended test subjects will get wind of the exercise.

       

      How to simulate the outage. Up/down monitoring is pretty straightforward, as one can simply tweak an access list or community strings so that SNMP polling fails. More complex outages, such as circuit overutilization or intermittent connectivity issues, would require a more sophisticated approach, however.
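
      For the simple up/down case, here's roughly what I have in mind: a minimal sketch (assuming Netmiko and a Cisco IOS target; the host, credentials, poller address, ACL name, and community string are all placeholders) that filters the poller's SNMP queries so the node appears down without actually touching the device itself.

      # Sketch: make SNMP polling fail for one device by filtering the poller's IP.
      # Assumes Netmiko and Cisco IOS; every value here is a placeholder.
      from netmiko import ConnectHandler

      device = {
          "device_type": "cisco_ios",
          "host": "192.0.2.10",     # device under "test"
          "username": "admin",
          "password": "secret",
      }

      # ACL that silently drops SNMP queries from the poller (e.g. the NMS server),
      # then re-applies the existing community with that ACL attached.
      drill_config = [
          "ip access-list standard SNMP-DRILL",
          " deny host 198.51.100.5",
          " permit any",
          "snmp-server community public RO SNMP-DRILL",
      ]

      # Pushing this list the same way afterwards puts things back as they were.
      restore_config = [
          "snmp-server community public RO",
          "no ip access-list standard SNMP-DRILL",
      ]

      conn = ConnectHandler(**device)
      conn.send_config_set(drill_config)
      conn.disconnect()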

       

      Preventing harmful response. The worst possible result of a staged outage is a real outage. Precautions must be taken to prevent test subjects from, for example, reloading a customer router "since it's down anyway." Drawing the line between experiment and reality is difficult, of course, because you want people to perform as closely as possible to how they would in a real situation in order to make a meaningful evaluation.

       

      Can anyone share their experience staging something like this? Lessons learned from successes and especially from failures are greatly appreciated!

        • Re: Staging an event "dry run"
          joelgarnick

          I've thought about doing something like this but have never been in a position to actually execute it.  Unfortunately, I'm more accustomed to actual outages hitting often enough that there's no opportunity to dig out... that being said, I'm very interested in the responses that come back!

           

          Who to involve:  If the escalation policy is outlined well, this might be self-governing... and depending on the scope of the test, maybe escalation is the first milestone where a decision is made to end the test and evaluate the response to the simulated outage (or the second escalation, or third).  If that's all spelled out, then you know exactly who needs to know it's a test and who doesn't... if you're testing levels 1-3, make sure escalation level 4 and above are aware?

           

          Preventing harmful response:  Depending on the device(s) involved and what you've got for authorization control, maybe you could just revoke access to execute certain commands that could be harmful?  Of course, that could be self-defeating... the tech being tested may see that they can't execute a command, think it's a symptom of the problem, walk over to the physical device (if they have access), and "manually power cycle" it.

          • Re: Staging an event "dry run"
            mdriskell

            In my former life as a NOC manager I would "test" my teams from time to time by injecting a false route into the local routing table on my SolarWinds server.  This would cause SolarWinds to believe a node was down (even though it wasn't) and allow me to watch my team's response.  I would stop them before they actually called the location, gauging how long they took to respond to the alert and handle the issue.
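
            For anyone who wants to script that route injection, here's a rough sketch of the idea (this assumes the SolarWinds poller runs Windows and you're in an elevated shell; the node IP and dead-end gateway are placeholders, not anything I actually used):

            import subprocess

            NODE_IP = "192.0.2.50"           # node you want to appear "down" (placeholder)
            BOGUS_GATEWAY = "203.0.113.254"  # unreachable next hop on the poller's subnet (placeholder)

            def start_drill():
                # A host route pointing at a dead-end gateway makes ICMP/SNMP polls for the
                # node fail on the poller itself, while the node stays completely untouched.
                subprocess.run(
                    ["route", "add", NODE_IP, "mask", "255.255.255.255", BOGUS_GATEWAY],
                    check=True,
                )

            def end_drill():
                # Remove the host route and polling recovers on the next cycle.
                subprocess.run(["route", "delete", NODE_IP], check=True)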

             

            I even had customers who would be onsite for a network upgrade and, instead of informing us, would down the equipment to see how long it took us to call... at that point they would inform us of the outage.

              • Re: Staging an event "dry run"
                Jeremy Stretch

                Sneaky customers. That can be annoying, but one does have to look at it from their point of view. They want to make sure they're getting the monitoring they're paying for. It's certainly better than the customer getting a call from the NOC half an hour after one of our engineers showed up for disruptive work because someone forgot a change control. Oops.

              • Re: Staging an event "dry run"
                bluefunelemental

                Widening the scope of this conversation, what about generating the response procedures in the first place?

                When I was an integrator, we had to include these in projects prior to "handing over the keys", but we usually just started with tabletop exercises where you walk through a mock issue and document it. These documents become the response procedures, and a few of them can be nominated for surprise dry runs.

                If people don't know what to do then you can only hope they do the right thing. 

                • Re: Staging an event "dry run"
                  byrona

                  When I managed our NOC we did "fire drills".  When we ran these we didn't really stage a true problem; instead we would send a message through the system indicating a problem, and the techs would respond to an email address in the message with the steps they would take to resolve it.  We would measure both their response time and the accuracy of their responses, and the results for all techs would be posted to try to promote a friendly competition.

                   

                  While this method is not as accurate as simulating a real problem, it's much easier to do without any potential negative side effects.

                  • Re: Staging an event "dry run"
                    Sohail Bhamani

                    During my time as a network engineer for a large engineering company in Austin, we created a pretty intensive lab environment mimicking all the various types of configurations we had out in the wild.  This lab included all the same routers and switches and was conveniently cabled to be a part of the actual corporate network.  We had the ability to route specific traffic through the lab when we wanted to test configurations before going fully live in production.  The lab included some open source traffic shaping and traffic generation software running on a host of Linux machines, so emulating traffic and internet connection types was possible.  We were also able to manipulate latency, packet loss, bandwidth, routing and so on, while also being able to simulate hardware failures.  As you can imagine, it wasn't cheap, but luckily management was fully on board.
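
                    Just to illustrate the kind of impairment we could dial in (this isn't our actual tooling, only a sketch assuming a Linux box with tc/netem sitting in the forwarding path; the interface name and numbers are placeholders):

                    import subprocess

                    IFACE = "eth1"  # lab-facing interface (placeholder)

                    def impair_link(delay_ms=100, loss_pct=1, rate_kbit=1024):
                        # Replace the root qdisc with netem: added latency, random loss, and a rate cap.
                        subprocess.run(
                            ["tc", "qdisc", "replace", "dev", IFACE, "root", "netem",
                             "delay", f"{delay_ms}ms",
                             "loss", f"{loss_pct}%",
                             "rate", f"{rate_kbit}kbit"],
                            check=True,
                        )

                    def clear_impairment():
                        # Drop the netem qdisc to restore normal forwarding.
                        subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)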

                     

                    Having this definitely made simulating events much simpler and thus allowed us to develop processes and documentation around handling them for the various levels of teams throughout IT.