It is unsurprising, given its crudeness, that the phrase ‘big data’ has been co-opted by all manner of IT suppliers to market their wares.
But while the meaning of the phrase has been diluted, the technology it was originally coined to describe should not be ignored, says Forrester Research analyst James Kobielus.
“Big data means extremely scalable analytics,” he says. “It means analysing petabytes of structured and unstructured data at high velocity. That’s what everybody’s talking about.”
One sign of the significance of ‘extremely scalable analytics’ is the industry activity currently unfolding around Hadoop, one of the key big data technologies.
Hadoop is an open source software framework for implementing a method called MapReduce, which splits large analytical workloads into smaller jobs and runs them in parallel. Hadoop was originally developed by an engineer at web media giant Yahoo, and was donated to the Apache open source foundation in 2009.
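To make the split-and-run-in-parallel idea concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python, counting words across a set of text lines. It is an illustration only: Hadoop itself distributes jobs across many machines, and the function names used here (`split_into_chunks`, `map_chunk`, `reduce_counts`, `word_count`) are hypothetical, not part of any Hadoop API. A thread pool stands in for the cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, n_chunks):
    """Split the workload into roughly equal smaller jobs."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, n_workers=4):
    """Run the map step over all chunks concurrently, then reduce the results."""
    chunks = split_into_chunks(lines, n_workers)
    # The thread pool is a stand-in for a cluster of worker machines.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())
```

Because each chunk is processed without reference to the others, the map step can be spread over as many workers as are available; only the final merge needs to see every partial result.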
According to Kobielus, Hadoop can be compared to the Linux operating system, in that many alternative versions (distributions or ‘distros’) are vying for dominance. But while the enterprise IT industry took its time to get behind Linux, vendors are falling over one another to join the Hadoop distro wars.
Database veteran Oracle launched its campaign at the OpenWorld conference in October 2011. The company’s Hadoop play is an extension of its ‘engineered systems’ strategy, whereby pre-integrated hardware and software are sold together at a premium price.
The result is the Big Data Appliance, a stack of Sun hardware pre-loaded with what Oracle says is the Apache distribution of Hadoop, although Kobielus expects that it will have been modified by Oracle in some way. The appliance also contains a non-relational database called Oracle NoSQL, and various tools for managing the data.
As Kobielus explains, there is nothing Hadoop can do that other massively parallel processing (MPP) enterprise data warehouses (EDW), including Oracle’s own Exadata appliance, cannot. So why did Oracle bother?
For one thing, there is customer demand for Hadoop, due in part to its free and openly extensible nature. “I’m seeing interest, commitment and budget for Hadoop in many industries,” says Kobielus.
That makes it an important battlefront in the enterprise IT market. “Now Oracle can say to customers: you don’t need to look at the other suppliers – we’ve got a Hadoop product that will work with your existing Oracle investments.”
Another factor is that Hadoop was designed from day one to handle both structured (i.e. in rows and columns) and unstructured (e.g. text, video, audio) data. Although Oracle claims that Exadata can process unstructured data, Kobielus says “it was an afterthought, and I’ve never seen a customer use it for that.”
He says that many of Hadoop’s early adopters, including Yahoo, use the system to convert both structured and unstructured data into a form that can be analysed in a more conventional database.
“You can think of the data warehousing ecosystem as having three tiers,” Kobielus explains. “There is the front tier, which consists of data marts that are optimised for high performance queries. There’s the hub tier, where you manage the master datasets and where the governance takes place, and then you have the staging tier, where you do extract, transform and load (ETL).”
With the addition of the Big Data Appliance and Exalytics, a business intelligence appliance that Oracle also unveiled at OpenWorld, the company now has an ‘engineered system’ for every tier of the data warehouse ecosystem, he explains. “I bet you they’ll pitch the [Big Data Appliance] as a petascale ETL layer that sits behind Exadata.”
The Big Data Appliance is joining an already crowded battlefield. EMC’s Greenplum appliance is the direct competition, while Teradata recently integrated the (non-Hadoop-based) MapReduce functionality of recent acquisition Aster Data into its offering. IBM’s Netezza appliance supports some MapReduce analytical models but has yet to integrate Hadoop, although Kobielus thinks that may change soon.
Meanwhile, Microsoft has announced that it is working with Hortonworks, a spin-out from Yahoo, to develop its own Hadoop database, and a number of Hadoop-focused start-ups (such as