Monday, April 1, 2013

Big Data Demystified

To me, Big Data seems a bit like pack-ratting* (*[psych.] compulsive hoarding or pathological collecting).

You’re unwilling to throw anything away, even if it seems worthless at first glance, because you believe you might need it one day. Few people can afford (or want to pay for) an expensive mansion for these messy collections; more likely, you rent a number of cheap garages and put everything in cardboard boxes. And if you’re afraid something might get lost in a fire, you store duplicates in different garages.

Luckily, you numbered the boxes and compiled a list of all the objects. If you need something, you look it up in this list and ask one of the helpers stationed at each garage to pick it up for you. You even have helpers who sort things, applying any rule you define.
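The garage analogy maps neatly onto a distributed storage system: a central index (the metadata) records which box holds which object and which garages hold a copy of each box. A minimal Python sketch of such an index follows; all names here are illustrative for the analogy, not any real API:

```python
# Toy metadata index for the garage analogy: objects live in numbered
# boxes, and each box is replicated across several garages.
REPLICATION = 2  # keep copies in at least two different garages

index = {}  # object name -> box number
boxes = {}  # box number -> list of garages holding a copy

def store(obj, box, garages):
    """Record an object's box and replicate the box across garages."""
    assert len(garages) >= REPLICATION, "not enough copies to survive a fire"
    index[obj] = box
    boxes[box] = list(garages)

def locate(obj):
    """Look up which box an object is in and where copies of it live."""
    box = index[obj]
    return box, boxes[box]

store("grandma's lamp", box=7, garages=["garage A", "garage C"])
print(locate("grandma's lamp"))  # (7, ['garage A', 'garage C'])
```

Losing one garage (one node) costs nothing here: the index still points to a surviving copy, which is exactly the redundancy idea behind replicated distributed file systems.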

OK, we don’t need an intervention TV show here; we need a proper IT solution. That solution must be cheap, yet reliable, if we want to store objects of uncertain business value.

Let’s start by choosing proper hardware for our Big Data tasks. In general, go for widely used components with good value for money. We need lots of storage capacity: a few hundred 3 TB disks should do here. Next, we need computing power. We’re aiming for heavily parallel processing, so a high number of hardware threads is preferred over raw clock frequency; eight-core Xeon CPUs, giving maybe 2–3 threads per disk, should do fine. Finally, we need a network. Since we achieve reliability through redundancy and need to move big data blocks around for processing, spend some money on a redundant, low-latency, high-bandwidth InfiniBand network.
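To see how these rules of thumb interact, here is a back-of-the-envelope sizing calculation in Python. The per-node figures (18 nodes per rack, twelve 3 TB disks and two eight-core, hyper-threaded CPUs per node) are assumptions for illustration, not a recommendation:

```python
# Back-of-the-envelope cluster sizing (all figures are illustrative).
nodes = 18                 # servers in one rack
disks_per_node = 12        # 3 TB disks each
disk_tb = 3
cores_per_node = 2 * 8     # two eight-core Xeon CPUs
threads_per_core = 2       # hyper-threading

raw_tb = nodes * disks_per_node * disk_tb
threads = nodes * cores_per_node * threads_per_core
threads_per_disk = threads / (nodes * disks_per_node)

print(f"raw capacity: {raw_tb} TB")                 # 648 TB
print(f"hardware threads: {threads}")               # 576
print(f"threads per disk: {threads_per_disk:.1f}")  # 2.7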

Now for the software part: we need software that supports distributed storage, parallel processing, and analysis of unstructured data. Cloudera’s Hadoop distribution, in conjunction with Oracle’s R distribution for statistical analysis, is an easy-to-use option for the enterprise. A not-only-SQL database such as Oracle NoSQL Database, for fast access to application objects, rounds this out. And don’t forget about integration: we may want to incorporate data from more classical systems, and we need to feed results into databases, data warehouses, and ERP systems for further analysis and use. Oracle’s Big Data Connectors build a bridge towards SQL databases and (via Oracle Data Integrator) to virtually any IT system imaginable.
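To make “parallel processing of unstructured data” concrete, here is the classic MapReduce word count. The sketch runs the map and reduce phases locally in plain Python to show the idea; on a real cluster the same two functions would be distributed across many nodes by Hadoop (for example via Hadoop Streaming):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of raw text."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(pairs):
    """Reduce: sum up the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Locally, we just chain the phases; Hadoop would shuffle the pairs
# between distributed mappers and reducers instead.
lines = ["big data is big", "data about data"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))
# {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The point of the split is that mappers run independently on whatever chunk of data is stored locally on each node, so the work scales with the number of disks and cores rather than with any single machine.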

All this, both hardware and software, is available as a pre-assembled, pre-integrated, pre-installed, production-ready engineered system: the Oracle Big Data Appliance. Just plug in the power and network cables and you’re ready to go. Up to 18 racks can be combined to provide more than eleven petabytes of raw storage capacity and more than five thousand CPU cores for true enterprise-level, large-scale Big Data processing.
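The quoted totals are easy to sanity-check. Assuming 18 server nodes per rack, each with twelve 3 TB disks and two eight-core CPUs (plausible figures for that appliance generation, stated here as assumptions rather than taken from a spec sheet):

```python
# Sanity check of the multi-rack totals (per-node figures are assumptions).
racks = 18
nodes_per_rack = 18
disks_per_node, disk_tb = 12, 3
cores_per_node = 2 * 8  # two eight-core CPUs

raw_pb = racks * nodes_per_rack * disks_per_node * disk_tb / 1000
cores = racks * nodes_per_rack * cores_per_node

print(f"{raw_pb:.1f} PB raw, {cores} cores")  # 11.7 PB raw, 5184 cores
```

Both results agree with the “more than eleven petabytes” and “more than five thousand cores” claims above.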
