You probably heard about the GoDaddy outage in September 2012. Corrupted routing tables across the company’s network left millions of GoDaddy-hosted sites dark for many hours. To mitigate the damage, GoDaddy offered affected customers one month of service as compensation.
Had GoDaddy been publicly traded, an outage of that magnitude could easily have affected the company’s share price. And that brings us to the topic of this post: when public-facing web resources go down, it is not only an IT emergency; it can quickly become a public relations emergency as well. How big an emergency depends on the company’s size and primary business focus.
Last week, both Google and Facebook users experienced significant outages. Google announced that faulty software updates pushed to production load balancers in the company’s 9 datacenters around the world caused functioning systems to be seen as offline.
Within 5 minutes of the push, monitoring software picked up the associated problems, and within another 10 minutes the push was rolled back. During the 15-20 minutes of service disruption, Gmail users were the most affected: 40% of them could not send or receive mail.
Facebook reported that its service went offline due to a change in its DNS infrastructure. Its monitoring system detected the problem and IT teams resolved it quickly. However, for a service with over a billion users, even a short outage had a big impact, at least in terms of being noticed and requiring press statements.
The lesson in these cases seems to be that the bigger your service becomes, the more you depend on your monitoring systems to detect issues at least as quickly as your users. Along with network, storage, system and application monitoring views of your production site, you should have reliable user simulation agents to serve as your canaries in the bit mines.
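To make the "user simulation agent" idea concrete, here is a minimal sketch of a synthetic check in Python. It requests a page the way a user would, then verifies the status code, the presence of some expected text, and the response time. The names `synthetic_check`, `http_fetch` and `CheckResult`, the 2-second latency budget, and the example URL are all assumptions made for illustration, not part of any particular monitoring product.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CheckResult:
    ok: bool
    latency_s: float
    detail: str


def http_fetch(url: str) -> Tuple[int, bytes]:
    """Fetch a URL and return (status code, body). Raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=5.0) as resp:
        return resp.status, resp.read()


def synthetic_check(
    url: str,
    expected_text: str,
    fetch: Callable[[str], Tuple[int, bytes]] = http_fetch,
    max_latency_s: float = 2.0,  # hypothetical latency budget
) -> CheckResult:
    """Simulate one user request and judge it the way a user would."""
    start = time.monotonic()
    try:
        status, body = fetch(url)
    except Exception as exc:  # DNS failure, timeout, HTTP 4xx/5xx, ...
        return CheckResult(False, time.monotonic() - start, f"request failed: {exc}")
    latency = time.monotonic() - start

    if status != 200:
        return CheckResult(False, latency, f"unexpected status {status}")
    if expected_text.encode() not in body:
        return CheckResult(False, latency, "expected text missing from page")
    if latency > max_latency_s:
        return CheckResult(False, latency, "response too slow")
    return CheckResult(True, latency, "ok")


if __name__ == "__main__":
    # Hypothetical target; replace with a page your users actually depend on.
    print(synthetic_check("https://www.example.com/", "Example Domain"))
```

Run on a schedule from several locations outside your own network, a check like this fails in the same situations your users would notice: a DNS change that breaks resolution, or a load balancer that marks healthy servers as offline.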
Finally, in the auto-escalating alert workflow you put in place, bring an appropriate public relations expert into your incident management discussion as early as possible, especially if your company is publicly traded.
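As a rough illustration of what such a workflow might look like, the sketch below models an escalation policy as a list of steps, each naming who is notified once an alert has gone unacknowledged for a given number of minutes. The roles, the time thresholds, and the names `EscalationStep`, `POLICY` and `contacts_to_notify` are hypothetical; the right values depend on your organization.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    notify: str         # role to bring into the incident


# Hypothetical policy: the thresholds are illustrative, not recommendations.
POLICY = [
    EscalationStep(0, "on-call engineer"),
    EscalationStep(5, "engineering manager"),
    EscalationStep(10, "public relations lead"),
]


def contacts_to_notify(
    minutes_unacknowledged: int,
    policy: Sequence[EscalationStep] = POLICY,
) -> List[str]:
    """Return every role that should have been notified by now."""
    return [
        step.notify
        for step in policy
        if minutes_unacknowledged >= step.after_minutes
    ]


if __name__ == "__main__":
    for minutes in (0, 7, 12):
        print(minutes, contacts_to_notify(minutes))
```

The point of encoding public relations as an explicit step is that nobody has to remember to make that call in the middle of an incident; the workflow does it once the outage has lasted long enough to be noticed.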