(Okay! I'm regurgitating this one. Forgive me)
I've been at this company for 7 years. My primary objective has always been to build and continue to improve our DR strategy and Business Continuity Preparedness. I won't bore you with the sordid details but during the first 6 years we ripped out a lot of IT, and installed a lot MORE IT in its place. As a result we've made tremendous improvements and leadership brags about it. However... we had never tested a Full Outage scenario, only partial scenarios with plenty of assumptions in place.
2 years ago I finally convinced my company to make our corporate datacenter 'Dark' and fail over all Tier 0 through 3 services to our DR location, run through a weekend, and then fail back. I spent months planning, communicating, and preparing for this event. We implemented close to 40 operational improvements and SIPs across all of IT in preparation for the exercise, all while other projects and issues fought for priority. Not only that, I got everyone in IT psyched up for it. The excitement in the office leading up to 11pm on June 5th was electric.
So we began. Communications were sent out... we traveled around the dark side of the moon for a while... systems were shut down, data was preserved, and then cables were pulled! Our datacenter went dark. No connectivity to the outside world at all. Tora! Tora! Tora!
Fittingly enough, the movie playing on AMC that night (in what seemed to be an endless loop) was my all-time favorite crisis management movie: Apollo 13!
"Houston, we have a problem!" My InMage replication SME, Anthony, said something was wrong. (InMage had been acquired by Microsoft; the technology is now part of Azure Site Recovery.) He asked for a minute and we all waited on the phone with bated breath. Minutes felt like hours, and by 2:30am there were 15 people on the conference call. "InMage doesn't work. I can't fail over any server and bring it up in our DR location." People panicked! Others wanted to cancel the exercise right there. Service owners were calling my name on the con call. Around 3:15am we realized that our InMage replication setup was woefully misconfigured: the primary and DR agents needed to talk to each other, so with the primary site dark there was no way to orchestrate a failover. "That's not DR! That's not resilient!"
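A gap like this, DR tooling that silently depends on the primary site being reachable, is exactly the kind of thing a simple pre-exercise dependency audit can surface. Here's a minimal sketch of that idea; every site, component, and endpoint name below is hypothetical, not from our actual inventory:

```python
# Hypothetical pre-exercise audit: flag any DR-site component whose declared
# dependencies point back at the primary datacenter, which will be dark
# during a full-outage test. All names here are illustrative.

PRIMARY_SITE = "dc-primary"

# Each DR component lists the endpoints it must reach to perform a failover.
dr_dependencies = {
    "replication-agent": ["dc-primary.control-server", "dr-site.vault"],
    "dns-failover":      ["dr-site.dns"],
    "sap-restore-job":   ["dr-site.storage", "dc-primary.license-server"],
}

def audit(deps, primary=PRIMARY_SITE):
    """Return components that cannot operate when the primary site is dark."""
    return sorted(
        name
        for name, endpoints in deps.items()
        if any(ep.startswith(primary + ".") for ep in endpoints)
    )

if __name__ == "__main__":
    for component in audit(dr_dependencies):
        print(f"GAP: {component} depends on the primary site")
```

Run against this toy inventory it would flag `replication-agent` and `sap-restore-job` — which is roughly the conversation we ended up having at 3:15am instead.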
At this point I had not notified my leadership of this Grand Canyon-sized gap. (They were all asleep.) Now 20 people of all makes and models were calling out to me on the phone. "Rome is burning!" "The sky is falling!" "Abort mission!" "Run to the hills!" "Aiyee!" Everyone was giving me problems, failures, issues, roadblocks. That excitement had turned to equal parts exhaustion and anxiety.
So I looked at my TV for inspiration. And there it was... Ed Harris, as only Ed Harris can be!
https://youtu.be/A-qgcYKIQMI I calmed everyone on the phone down and asked for a moment to speak without interruption. I channeled Ed's confidence and reminded everyone that this is why we test. It's not just the technology that is tested, but our procedures, our ability to respond to a crisis, our flexibility and adaptability when things don't work the way they're supposed to. I challenged them to look past the immediate technological failures: 5,000 employees across the country rely on us to keep SAP 100% available, because if SAP is down for a week this company goes out of business and those employees are all out of jobs.
It was indeed our finest hour. We documented, we developed contingencies on the fly, we collaborated and problem-solved... and by 10am, two hours ahead of the scheduled UAT phase, we had the systems up and available. I gave my updates to leadership and we kept going.
The following week leadership was GLOWING with praise. For 3 years the company had been sleeping under a blanket of false security. InMage was replaced with VMware SRM, the vendor who implemented InMage was thrown out, and all of IT was commended for their dedication and commitment not only to the exercise but to the company.
Thanks Ed Harris!