In our field, problems happen. Some we can see coming; others catch us by surprise. Usually, we have fail-safes in place to mitigate those problems: redundant data paths, UPS power, alerts via e-mail or text. Sometimes those fail-safes don't work. Sometimes they all fail at the same time.
Our building is a secure building. We had 24/7 security guards and swipe badges. We have key codes to get into secure areas such as the server room and communication closets. We have redundant everything, from NAS to ESXi, plus dual-path switching and routing. We have it pretty good.
Due to cutbacks, we lost our 24/7 security guards. The guard shack is now only manned during "normal" office hours, 7 a.m. – 4:30 p.m. To make up for this lowered security posture, our management team limited access to critical areas, such as the server room. Access was pared down to three people. If you needed access, you had to coordinate with one of those three. This sucked, but hey, we're flexible.
One evening, we had a gangbuster of a storm come through. The building lost power, but hey, we have a UPS and a backup generator. No problem. The UPS kicked over and the server room remained powered up; it could handle the load for at least 12 hours. However, the building that housed the generator had developed a leak above the main power distribution panel. As the generator powered up, the main breaker feeding our building fried. Oops.
With the generator power offline, the air handlers for the air conditioning system had no power. Even once power was restored to the building, the air handlers needed to be reset, and there was nobody in the building to do it.
When the first network admin came in at 6 a.m., he noticed heat emanating from the server room. The door actually burned to the touch. The server room had an ambient temperature around 145 degrees Fahrenheit; some of the gear was over 180. As Murphy would have it, that network admin was not one of the three people with access, so all he could do was make a call and wait. Once access was provided, an orderly shutdown was initiated. Well, as orderly as possible. Some gear crashed after hitting a certain temperature. Some gear, like routers and switches, could only be powered down by pulling the power cord. One of our techs actually received second-degree burns to his hands while pulling cables. Bad times.
After the air handlers were brought online and the server room cooled down enough, the order was given to begin powering the server room back up. Would we be able to return to normal operation? It took most of the day, but we made it. There were some failures, but nothing critical. Using historical system health data from SolarWinds, we were able to pinpoint the actual time of the air handler failure and watch the temperatures rise. We could plot our server room's demise: the temperature climbed slowly at first, then spiked rapidly before the equipment supporting our virtual infrastructure finally died, taking SolarWinds with it. We have been replacing an abnormal number of NAS drives since the outage, but that is far better than it could have been.
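For anyone who wants to pull the same kind of temperature history themselves, here's a minimal sketch using the official orionsdk Python package and a SWQL query against the SolarWinds Information Service. The hostname and credentials are placeholders, and the Hardware Health entity and column names are assumptions that vary by module and version, so verify them against your own SWIS schema first.

```python
# A minimal sketch of pulling temperature readings out of Orion with the
# official orionsdk package and a SWQL query. Hostname and credentials are
# placeholders; the Orion.HardwareHealth.HardwareItem entity and its columns
# are assumptions -- check your own SWIS schema before relying on this.
import requests
from orionsdk import SwisClient

requests.packages.urllib3.disable_warnings()  # Orion is often behind a self-signed cert

swis = SwisClient("orion.example.com", "monitor_user", "monitor_pass")

# Grab every hardware sensor whose name looks like a temperature reading.
results = swis.query(
    "SELECT NodeID, Name, Value, Unit "
    "FROM Orion.HardwareHealth.HardwareItem "
    "WHERE Name LIKE '%Temp%'"
)

for row in results["results"]:
    print(f"Node {row['NodeID']}: {row['Name']} = {row['Value']} {row['Unit']}")
```

Log readings like these on a schedule and you end up with the same kind of postmortem temperature curve we were able to read back out of SolarWinds.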
In the end, lessons were learned, technicians were bandaged, and yes, there was a tomorrow.
Good read. Leadership dodged a bullet; their decision to limit access could have resulted in millions of dollars in damage. Glad somebody finally got their head out of their ass.
Did you add in any SolarWinds temperature alarms? I've used a Sensaphone to do this in the past; there was also a circular chart recorder, with a pen, that would make an analog record of temps.
We actually have environmental sensors, and had them before the problem, too. The trick is getting the right people the information, and not letting the alerts go beep-beep in an empty room.
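For the curious, here's roughly what "getting the right people the information" can look like: a minimal sketch that polls an SNMP-capable environmental sensor and, past a threshold, mails the on-call phones through an email-to-SMS gateway instead of beeping at an empty room. The OID, hostnames, addresses, and threshold are all hypothetical placeholders; substitute the values from your sensor's MIB and your own mail setup.

```python
# A minimal sketch: poll an SNMP temperature sensor and page the on-call list
# via an email-to-SMS gateway once a threshold is crossed. The OID, hostnames,
# and addresses below are hypothetical placeholders.
import smtplib
from email.message import EmailMessage
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

TEMP_OID = "1.3.6.1.4.1.99999.1.1"   # hypothetical OID; use your sensor's MIB
THRESHOLD_F = 90                      # alert well before the gear starts cooking
ON_CALL = ["5555551234@vtext.com"]    # email-to-SMS addresses for on-call phones

def read_temp():
    # Single SNMP GET against the sensor; returns the temperature as a float.
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public"),
        UdpTransportTarget(("envsensor.example.com", 161)),
        ContextData(),
        ObjectType(ObjectIdentity(TEMP_OID)),
    ))
    if error_indication or error_status:
        raise RuntimeError(f"SNMP poll failed: {error_indication or error_status}")
    return float(var_binds[0][1])

def page_on_call(temp):
    # Email-to-SMS lands on a phone, not on a console in an empty room.
    msg = EmailMessage()
    msg["Subject"] = f"SERVER ROOM AT {temp:.0f}F"
    msg["From"] = "noc@example.com"
    msg["To"] = ", ".join(ON_CALL)
    msg.set_content(f"Server room temperature is {temp:.0f}F and climbing. Respond now.")
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    temp = read_temp()
    if temp >= THRESHOLD_F:
        page_on_call(temp)
```

Run something like that from cron every few minutes and the alert follows a human home, which is the whole point.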
Actually, I have advocated building a Faraday cage over the entire complex and hooking up wireless. That way, on nice days, we could work on the classified network from a lawn chair, in our flip-flops, while drinking a beer.