In the first article of this series I discussed Facebook as a bellwether for an eventual convergence of face recognition technology and the social web as depicted in the film Minority Report. In this second part I want to elaborate on one big data implication of that convergence.


Big Clouds, Little Clouds

Within its overall footprint (180,000 non-virtual servers in multiple datacenters), Facebook reportedly has multiple clusters of over 6,000 non-virtual servers, each cluster interconnected with a single file system.


So what? You may have heard that Amazon’s AWS has over 450,000 servers in seven datacenters. Bear with me while I explain the important difference.


Yes, you can load up many datacenters with thousands of servers and leverage that impressive cloud-computing power to run, for example, a job that cracks previously uncrackable encryption schemes. Such a job uses servers concurrently, in parallel, to work through a very large iterative task that would run longer than your lifetime if done serially on a single machine. The speed of completing this kind of task scales directly with the number of servers you throw at it, because concurrent iterations within the overall task do not share any data. Cloud computing in the AWS model excels at meeting these ‘shared nothing’ data management challenges.
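The shared-nothing pattern can be sketched in a few lines of Python. The hash search below is a stand-in for the encryption-cracking example: the keyspace is partitioned up front, each chunk is searched independently, and no chunk ever needs data from another, so adding workers speeds the job up almost linearly. All names and numbers here are illustrative, not any real cracking tool.

```python
import hashlib
from multiprocessing import Pool

# A pretend "encrypted" secret: the SHA-256 digest of an unknown key.
TARGET = hashlib.sha256(b"4721").hexdigest()

def search_chunk(bounds):
    """Try every candidate key in [start, stop); needs no shared data."""
    start, stop = bounds
    for n in range(start, stop):
        if hashlib.sha256(str(n).encode()).hexdigest() == TARGET:
            return n
    return None

if __name__ == "__main__":
    # Split the keyspace into independent chunks, one per "server".
    chunks = [(i, i + 2500) for i in range(0, 10000, 2500)]
    with Pool(4) as pool:
        hits = [h for h in pool.map(search_chunk, chunks) if h is not None]
    print(hits)  # -> [4721]
```

Because the chunks share nothing, the same sketch works whether the workers are four local processes or thousands of rented AWS instances.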


That AWS-style cloud-computing efficiency does not extend to the real-time data processing required to manage the user experience of Facebook’s more than one billion users*. Instead, Facebook’s operations engineering organization needs to leverage cloud resources in a way that handles interrelated events for all concurrently active users. The trick is to give work to thousands of machines in parallel while respecting the cascading effects of data I/O: each event ripples across as many accounts as a given user has “friends”. Facebook’s production workflows present complex ‘shared something’ data management challenges.
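To make the ‘shared something’ contrast concrete, here is a toy in-memory model of that fan-out: one user’s event must be written into every friend’s feed, so the work cannot be cleanly partitioned the way the hash search could. The `friends` graph and `publish` function are hypothetical illustrations, not Facebook’s actual data model.

```python
from collections import defaultdict

# Hypothetical friend graph; in production this is billions of edges.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice"],
    "carol": ["alice", "bob"],
}

# Each user's feed accumulates events from their friends.
feeds = defaultdict(list)

def publish(user, event):
    """Fan one user's event out to every friend's feed; the write
    touches as many accounts as the user has friends."""
    for friend in friends[user]:
        feeds[friend].append((user, event))
    return len(friends[user])  # number of feeds touched

writes = publish("alice", "posted a photo")
print(writes)        # -> 2
print(feeds["bob"])  # -> [('alice', 'posted a photo')]
```

Every `publish` call writes into feeds that other users’ calls are also writing into, which is exactly where the waiting in line begins.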


Usually sharing data means the contending resources at some point must wait in a line; and the shape of the line is often called a—say it with me—bottleneck.


Meet Hadoop

Facebook meets these formidable shared-data challenges with an open source technology called Hadoop.


Hadoop provides a common file system across all federated servers. Master servers perform the MapReduce actions of dividing a task up across the servers in a federation and then reducing the provisional answers that come back into a single overall answer. As a result, thousands of servers function like one massive, coherent computing entity; and the millions of user Newsfeeds and Timelines happily interweave text and image data across Facebook’s social web.
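The divide-then-reduce rhythm can be sketched in plain Python with the classic word-count example. This is a toy, not the Hadoop API: the list comprehension stands in for the map phase that Hadoop would run in parallel across servers, and the final fold stands in for the master reducing provisional answers into one result.

```python
from collections import Counter
from functools import reduce

# Input split into shards, as a distributed file system would store it.
shards = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

def map_shard(text):
    """Per-server work: count words within one shard only."""
    return Counter(text.split())

def reduce_counts(a, b):
    """Master-side merge of two provisional answers."""
    return a + b

partials = [map_shard(s) for s in shards]  # Hadoop runs these in parallel
total = reduce(reduce_counts, partials, Counter())
print(total["the"])  # -> 3
```

The crucial property is that `map_shard` never looks outside its own shard, so the expensive phase parallelizes freely; only the cheap merge step needs to see everything.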


Facebook's great operations-engineering success is to scale the findability and changeability of big data so that users experience social interaction in real time. While Facebook applications pay attention chiefly to patterns in the relevant data sets, AWS applications pay attention to everything in the data set. It's the difference between finding a needle by looking at every piece of hay in the stack and finding it by looking at only the right hay.


*AWS could of course be leveraged for the purposes described here; in fact, support for the Simple Storage Service (S3) has already been developed within the ongoing Hadoop open source project. Facebook's advantage is in already having figured out how to scale up the largest Hadoop-based cloud-computing implementations.


More about Monitoring

In the next article of this series I will revisit the topics of image data use and data encryption in order to connect some dots related to Hadoop's data-processing acumen. Here I want to reiterate an earlier point: any growing storage solution, and especially one with real-time processing requirements, must as a matter of business case spread risk at procurement time. It is better to do business with multiple vendors of more or less equally reliable products than to depend on a single vendor that might pose future pricing risks and in any case probably will not survive long-term in its present form. All the more important, then, is a scalable storage monitoring system that accommodates both the addition and removal of different products made by different vendors.