Sour Notes in iTunes

On Monday, iTunes was down. But we all expected that because Apple was holding its “Spring Ahead” event, and was poised to announce a slate of new products.

Today, iTunes was down again (or at least parts of it) and this was very NOT expected.

The first report of the outage appeared on TheNextWeb.com. They noted that  iTunes connect was down, you could see music but not buy it, and several app pages were dead when you click them.

As is the case with most short-term outages (Apple responded and resolved it within an hour or two) we will likely never know what really happened. And that’s fine. I’m not on the iTunes internal support team so I don’t need the ugly details.

But it's always fun to guess, right? Armchair quarterbacking an outage is the closest to sports that some of us I.T. Pro's get.

First, I ruled out security. A simple DDOS or other targeted hack would have defaced the environment, taken out entire sections (or the whole site), and made a much larger mess of things.

Second, I took simple network issues off the list. Having specific apps, song purchasing, and individual pages die is not the profile of a failure in routing, bandwidth, or even load balancing.

My first choice was Storage – if the storage devices that contain the actual iTunes songs as well as app downloads were affected that would explain why we saw failures once we got past those initial pages. It could have explained why the failure is geographic (UK and US) but we didn't hear about failures in other parts of the world.

My runner-up vote went  to Database – corrupt records in the database that houses the CMS which undoubtedly drives the entire iTunes site. Having specific records corrupted would explain why some pages worked and others don’t.

Then CNBC published a statement from Apple apologizing for the outage and explaining it was an internal DNS problem.

Whatever the reason, this failure underscores why today’s complex, inter-connected, cloud and hybrid cloud environments need monitoring that is both specific and holistic.

Specific because it needs to pull detailed data about disk and memory IOPS, errored packets, application pool member status, critical service status (like DNS), synthetic tests against key elements (like customer purchase actions), and more.

Holistic because we now need a way to view the way write errors on a single disk in an array affects the application running on a VM that uses the array in its datastore. We need to see when a DNS resolution fails (before the customer tries it) and correlate that to the systems that depend on those name resoolutions.

That means monitoring that can take in the entire environment top to bottom.

Yes, I mean AppStack.

Hey, Apple internal support: If you want us to set up a demo for you, give us a call.

Thwack - Symbolize TM, R, and C