Introducing Leaders in Big Data
The craft of generating useful insights from the compost heaps of big data is still in its infancy. The Google Tech Talks panel Leaders in Big Data explores the panelists' experience with big data as well as their ideas on what is needed to evolve into the next stages of this promising technology. The members of the panel are:
- Theo Vassilakis, Principal Engineer / Engineering Director at Google
- Gustav Horn, Senior Global Consulting Engineer, Hadoop at NetApp
- Charles Fan, Senior Vice President of Strategic R&D at VMware
Reviewing the 3Vs of Big Data
Understanding the 3Vs of big data (volume, velocity, and variety) is crucial when creating a big data strategy.
- Volume refers to the amount of digitized data that must be stored and secured before it is used.
- Velocity is the speed at which data is moved, transformed, analyzed, and reported.
- Variety speaks to the different types of data; planning for this variety is integral to an effective big data strategy.
The panel discussed how the 3Vs require a new IT structure to support big data in enterprise decision making. Big data strategies must account for both the current and future needs of the 3Vs.
Classic data is human-generated and record-based: it is created, read, updated, and deleted. Today, more and more data is machine-generated, written once and read many times, and rarely updated or deleted. Panel member Charles Fan states, "Big Data is C.R.A.P. data" because it is Created, Replicated, Appended, and Processed. Whoever can process C.R.A.P. data will be the big winner in big data.
Open source standards and solutions
Open source is the primary standard of big data. Big data is moving from a storage model where everything is relational to a more heterogeneous landscape with many data-store models and many ways to query them. Since big data strives to connect disparate data sets, big data solutions will likely continue to rely on open source tools and open source solutions. Open source standards and solutions include:
- Protocol Buffers is an open, language-neutral format for serializing structured data; schemas are defined in .proto files, and the encoded data is more compact than an equivalent XML representation.
- Apache Hadoop is an open source software library that provides a framework for distributed processing of large data sets across clusters of computers.
- VMware's open source Project Serengeti enables the rapid deployment of an Apache Hadoop cluster on a virtual platform.
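As a brief illustration of the Protocol Buffers schema language, the sketch below defines a hypothetical message for a machine-generated log record (the message and field names are invented for this example, not taken from any panelist's system):

```proto
// Hypothetical schema for a machine-generated log record (proto3 syntax).
syntax = "proto3";

message LogRecord {
  string service = 1;        // field numbers identify fields on the wire
  int64 timestamp_ms = 2;
  double latency_ms = 3;
  repeated string tags = 4;  // a repeated field holds zero or more values
}
```

A .proto file like this is compiled with protoc into generated classes for languages such as C++, Java, and Python, which then serialize records to a compact binary wire format.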
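The processing model Hadoop distributes across a cluster, MapReduce, can be sketched in plain single-process Python. This is only a toy stand-in to show the shape of the computation; the function names and sample lines are illustrative:

```python
from collections import defaultdict

def map_phase(records):
    """Map step: emit (word, 1) pairs, as a word-count mapper would."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["big data big insights", "data pipelines"]
print(reduce_phase(map_phase(lines)))
# → {'big': 2, 'data': 2, 'insights': 1, 'pipelines': 1}
```

Hadoop runs the same two phases in parallel over data spread across many machines, shuffling each key's pairs to a single reducer in between.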
Big Data's four layers of functionality
- Big data applications provide readable, consumable, and relevant information gleaned from the data.
- Big data analytics is the layer of machine learning and other algorithms.
- Big data management comprises the query engines used to query the data.
- Big data storage is the common sink for all the C.R.A.P. data (the big data store).
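A minimal sketch of how the four layers stack, assuming a toy append-only store and invented class and function names (this is illustrative only, not any vendor's actual stack):

```python
class BigDataStorage:
    """Storage layer: an append-only sink for raw records (the C.R.A.P. store)."""
    def __init__(self):
        self._records = []
    def append(self, record):
        self._records.append(record)   # created/appended, never updated in place
    def scan(self):
        return iter(self._records)

class BigDataManagement:
    """Management layer: a query engine over the storage layer."""
    def __init__(self, storage):
        self.storage = storage
    def query(self, predicate):
        return [r for r in self.storage.scan() if predicate(r)]

def analytics_mean(records, field):
    """Analytics layer: a simple aggregate standing in for ML and statistics."""
    values = [r[field] for r in records]
    return sum(values) / len(values)

def application_report(mean_latency):
    """Application layer: turn analysis into readable, consumable output."""
    return f"mean latency: {mean_latency:.1f} ms"

store = BigDataStorage()
for rec in [{"service": "web", "latency_ms": 120},
            {"service": "web", "latency_ms": 80},
            {"service": "db", "latency_ms": 40}]:
    store.append(rec)

web = BigDataManagement(store).query(lambda r: r["service"] == "web")
print(application_report(analytics_mean(web, "latency_ms")))
# → mean latency: 100.0 ms
```

Each layer consumes only the one below it, which is what would let the industry standardize the storage and management layers independently of the analytics and applications above them.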
Looking forward, the big data industry might be able to apply standards at the data storage and data management layers. Expect applications to be delivered as a service instead of in a software bundle.
Privacy and trust
Privacy is critical: if users don't trust you, they won't use your product or service. Privacy safeguards are key to successful outcomes. Think of your customers' data as money. If you want your customers to keep their data with you, you need to ensure it is protected; otherwise, your customers will leave their data at home (in the mattress) or will find a more secure service provider.