steveng
Level 8

The Perfect Storm

            In our field, problems happen.  Some of them we can see coming; others catch us by surprise.  Usually, we have fail-safes in place to mitigate those problems: redundant data paths, UPS power, alerts via e-mail or text.  Sometimes those fail-safes don’t work.  Sometimes they all fail at the same time.

 

            Our building is a secure building.  We had 24/7 security guards and swipe badges.  We have key codes to get into secure areas such as the server room and communication closets.  We have redundant everything, from NAS to ESXi, plus dual-path switching and routing.  We have it pretty good.

 

            Due to cutbacks, we lost our 24/7 security guards.  The guard shack is now only manned during “normal” office hours, 7am – 4:30pm.  In an effort to make up for this lowering of the security posture, our management team limited access to critical areas, such as the server room.  Access was pared down to three people.  If you needed access, you would have to coordinate with one of these three people.  This sucked, but hey, we’re flexible.

 

            One evening, we had a gangbuster of a storm come through.  The building lost power, but hey, we have a UPS and a backup generator.  No problem.  The UPS kicked over and the server room remained powered up.  The UPS could handle the load for at least 12 hours.  However, the building that housed the generator had developed a leak above the main power distribution panel.  As the generator powered up, the main breaker feeding our building fried.  Oops.

 

            With the generator power offline, the air handlers for the air conditioning system had no power.  Even once power was restored to the building, the air handlers needed to be reset manually, and there was nobody in the building to do it.

 

            When the first network admin came in at 6am, he noticed heat emanating from the server room.  It actually burned when he touched the door.  The server room had an ambient temperature around 145 degrees.  Some of the gear was over 180.  As Murphy would have it, the network admin was not one of the three people who had access, so all he could do was make a call and wait.  Once access was provided to the server room, an orderly shutdown was initiated.  Well, as orderly as possible.  Some gear crashed after hitting a certain temp.  Some gear, such as routers and switches, could only be powered down by pulling the power cord.  One of our techs actually received second-degree burns to his hands while pulling cables.  Bad times.

 

            After the air handlers were brought online and the server room cooled down enough, the order was given to begin powering up the server room.  Would we be able to return to normal operation?  It took most of the day, but we made it.  There were some failures, but nothing critical.  We were able to track the actual time of the air handler failure and watch the temperatures rise using historical system health data from SolarWinds.  We could plot our server room’s demise: the temperature climbed slowly at first, then spiked rapidly before the equipment supporting our virtual infrastructure finally died, taking SolarWinds with it.  We have been replacing an abnormal number of NAS drives since the outage, but that is far better than it could have been.

 

In the end, lessons were learned, technicians were bandaged, and yes, there was a tomorrow.

 

10 Replies

Re: The Perfect Storm

Server room meltdown...oh no!!!!  Good read steven.goode

Re: The Perfect Storm

Good read.. leadership dodged a bullet, as their decision to limit access could have resulted in millions of dollars in damage..  glad somebody finally got their head out of their ass.

gfsutherland
Level 14

Re: The Perfect Storm

The NAS drive issue is always a "bonus" after things like this... been victimized a couple of times by the same type of event.

pseudocyber
Level 12

Re: The Perfect Storm

Did you add in any SolarWinds temperature alarms?  I've used a Sensaphone to do this in the past - there was also a circular chart, with a pen, which would make an analog record of temps.

Remote Room Temperature Monitor Systems | Humidity Monitoring | Sensaphone | www.sensaphone.com

steveng
Level 8

Re: The Perfect Storm

We actually have environmental sensors. Had them before the problem too. The trick is getting the information to the right people, and not letting the sensors go beep beep to an empty room.

d09h
Level 16

Re: The Perfect Storm

Believe the new NPM has heat map functionality (Wi-Fi Heat Map - Wireless Heat Map Software | SolarWinds).  Just saying...


Re: The Perfect Storm

Wi-Fi is a no-no on a classified network.

d09h
Level 16

Re: The Perfect Storm

I wasn't serious.  And yes, indeed it is.

Re: The Perfect Storm

Actually, I have advocated that a Faraday cage be built over the entire complex so we can hook up wireless.  That way, on nice days, we could work on the classified network from a lawn chair, in our flip flops, while drinking a beer.
