Sour Notes in iTunes

Level 17

On Monday, iTunes was down. But we all expected that because Apple was holding its “Spring Ahead” event, and was poised to announce a slate of new products.

Today, iTunes was down again (or at least parts of it) and this was very NOT expected.

The first report of the outage appeared on TheNextWeb.com. They noted that iTunes Connect was down, that you could see music but not buy it, and that several app pages were dead when you clicked them.

As is the case with most short-term outages (Apple responded and resolved it within an hour or two) we will likely never know what really happened. And that’s fine. I’m not on the iTunes internal support team so I don’t need the ugly details.

But it's always fun to guess, right? Armchair quarterbacking an outage is the closest to sports that some of us I.T. pros get.

First, I ruled out security. A simple DDoS would have taken out entire sections (or the whole site), and a targeted hack would likely have defaced the environment and made a much larger mess of things.

Second, I took simple network issues off the list. Having specific apps, song purchasing, and individual pages die is not the profile of a failure in routing, bandwidth, or even load balancing.

My first choice was Storage – if the storage devices that contain the actual iTunes songs as well as app downloads were affected, that would explain why we saw failures once we got past those initial pages. It could also have explained why the failure was geographic (UK and US) while we didn't hear about failures in other parts of the world.

My runner-up vote went to Database – corrupt records in the database that houses the CMS which undoubtedly drives the entire iTunes site. Having specific records corrupted would explain why some pages worked and others didn't.

Then CNBC published a statement from Apple apologizing for the outage and explaining it was an internal DNS problem.

Whatever the reason, this failure underscores why today’s complex, inter-connected, cloud and hybrid cloud environments need monitoring that is both specific and holistic.

Specific because it needs to pull detailed data about disk and memory IOPS, errored packets, application pool member status, critical service status (like DNS), synthetic tests against key elements (like customer purchase actions), and more.
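
To make the "critical service status (like DNS)" item concrete, here is a minimal Python sketch of a DNS resolution check. It is only an illustration, not how any particular monitoring product does it. The function name check_dns and the injectable resolve parameter are assumptions made for this example; the only real API it relies on is the standard library's socket.getaddrinfo.

```python
import socket
import time


def check_dns(hostnames, resolve=socket.getaddrinfo):
    """Try to resolve each hostname; record success, error text, and elapsed time.

    `resolve` defaults to the system resolver and can be swapped out for testing.
    """
    results = {}
    for name in hostnames:
        started = time.monotonic()
        try:
            resolve(name, None)  # port is irrelevant for a pure lookup
            results[name] = {"ok": True, "error": None}
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results[name] = {"ok": False, "error": str(exc)}
        results[name]["seconds"] = time.monotonic() - started
    return results
```

Calling check_dns(["example.com"]) on a schedule, with the names your own services depend on, would flag a failed lookup before a customer hits it. Alerting and history are left out of the sketch on purpose.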

Holistic because we now need a way to view how write errors on a single disk in an array affect the application running on a VM that uses the array in its datastore. We need to see when a DNS resolution fails (before the customer tries it) and correlate that to the systems that depend on those name resolutions.
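
The correlation half of that idea can be sketched in a few lines of Python. Everything here is hypothetical: the service names, the hostnames, and the impacted_services helper are invented for illustration, and a real environment would discover the dependency map rather than hard-code it.

```python
def impacted_services(dependencies, failed_names):
    """Map each affected service to the sorted list of its DNS names that failed."""
    failed = set(failed_names)
    impact = {}
    for service, names in dependencies.items():
        broken = sorted(failed.intersection(names))
        if broken:
            impact[service] = broken
    return impact


# Hypothetical dependency map: service -> DNS names it must be able to resolve.
DEPENDENCIES = {
    "storefront": ["catalog.internal.example", "payments.internal.example"],
    "downloads": ["cdn.internal.example"],
    "search": ["catalog.internal.example"],
}

print(impacted_services(DEPENDENCIES, ["payments.internal.example"]))
# prints {'storefront': ['payments.internal.example']}
```

The point of the sketch is the direction of the lookup: one failed name fans out to every system that leans on it, which is what turns a single DNS alert into a picture of which customer-facing pieces are about to break.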

That means monitoring that can take in the entire environment top to bottom.

Yes, I mean AppStack.

Hey, Apple internal support: If you want us to set up a demo for you, give us a call.

8 Comments
bluefunelemental
Level 15

Apply an application monitor for DNS, both on the DNS server and on critical services like, say, Isilon SmartConnect DNS names. We simply put a file named do_not_delete.txt on the share and call it up by UNC path \\smart_connect_dns\share\do_not_delete.txt.

Now when everything else is green this file monitor will be testing DNS and AD auth for us.

sqlrockstar
Level 17

I cannot believe you wasted even ONE brain cycle trying to blame the database for this outage. Have I taught you nothing? Wait, wait...don't tell me.

jaimeaux
Level 11

Hey, Apple internal support: If you want us to set up a demo for you, give us a call.

^^ This made me laugh. Amazing.

adatole
Level 17

Our educational conversations have been largely limited to your love of bacon.

cahunt
Level 17

Nice Sales pitch

sqlrockstar
Level 17

You say that like it's a bad thing.

shuth
Level 14

Agreed - nice way to start a Friday morning.

jkump
Level 15

The smallest details can bring down the largest system. 

About the Author
In my sordid career, I have been an actor, bug exterminator and wild-animal remover (nothing crazy like pumas or wildebeests. Just skunks and raccoons.), electrician, carpenter, stage-combat instructor, American Sign Language interpreter, and Sunday school teacher. Oh, and I work with computers. Since 1989 (when you got a free copy of Windows 286 on twelve 5¼” floppies when you bought a copy of Excel 1.0) I have worked as a classroom instructor, courseware designer, desktop support tech, server support engineer, and software distribution expert. Then about 14 years ago I got involved with systems monitoring. I've worked with a wide range of tools: Tivoli, Nagios, Patrol, ZenOss, OpenView, SiteScope, and of course SolarWinds. I've designed solutions for companies ranging from the extremely modest (~10 systems) to the mind-bogglingly large (250,000 systems in 5,000 locations). During that time, I've had the chance to learn about monitoring all types of systems – routers, switches, load-balancers, and SAN fabric as well as Windows, Linux, and Unix servers running on physical and virtual platforms.