Two acquisitions by Google have made headlines already this year. First was Nest Labs, maker of smart thermostats, and then DeepMind, a small artificial intelligence company that does… well, nobody seems to know for sure.
The former acquisition was driven, in part, by the value of data on energy usage patterns from potentially billions of homes. And the latter, whatever it is DeepMind does, brought in valuable expertise and algorithms for learning from data.
The pairing of the two – lots of data, and learning from lots of data – is more than the sum of its parts. Together, the acquisitions were apparently worth $3.6bn, a huge investment even for Google.
The sudden attention to data and machine learning is not specific to Google – it has been a rapidly accelerating focus for the big data industry for several years – but why now?
We’ve always had data, and mined it for value, but there were three key obstacles. Specialist tools were difficult to use or proprietary. Sufficiently powerful servers cost millions of dollars. Data itself was scarce – in fact, much of modern statistics was born of the need to compensate for scant samples of data.
The ingredients are now cheap and plentiful. Enterprise-ready open source ecosystems like Apache Hadoop manage storage and computation across thousands of machines. Open source tools for learning like R, scikit-learn, GraphLab, and Apache Spark’s MLlib can integrate with Hadoop.
Hardware is cheaper than ever, or available in large quantities on demand through cloud services like Amazon AWS. And internet, mobile and machine-to-machine (M2M) devices continue to be virtually endless fountains of data.
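To see how low the barrier has become, here is a minimal sketch – with a hypothetical CSV file and column names, and scikit-learn standing in for any of the libraries above – of fitting a model from data on a single cheap machine:

```python
# A minimal sketch of how cheap learning from data has become: a few lines
# of open source Python. The CSV file and column names here are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Historical sensor readings, e.g. exported from a Hadoop cluster
data = pd.read_csv("energy_usage.csv")
X = data[["outdoor_temp", "hour_of_day", "occupancy"]]
y = data["heating_on"]

# Hold out a test set, fit a model, and check its accuracy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

Not long ago, the equivalent would have meant licensed statistical software and specialist hardware.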
Traditional data warehouses are now being augmented by the ‘enterprise data hub’ concept, which allows businesses to acquire, combine and query any amount or type of data, in its original form, in real time.
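As a rough illustration of that idea – using Spark SQL as just one of the engines that can run on such a hub, and with a hypothetical HDFS path and field names – raw files can be queried exactly as they landed, with no up-front warehouse schema:

```python
# A sketch of querying data "in its original form" on a Hadoop cluster,
# here with Spark SQL; the HDFS path and field names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-hub-sketch").getOrCreate()

# Read raw JSON event logs exactly as they landed in HDFS
events = spark.read.json("hdfs:///data/raw/clickstream/")
events.createOrReplaceTempView("events")

# Ad-hoc SQL over the original files
spark.sql("""
    SELECT page, COUNT(*) AS visits
    FROM events
    GROUP BY page
    ORDER BY visits DESC
    LIMIT 10
""").show()
```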
Suddenly, lots of machine learning projects are profitable for businesses, large and small, and there is a rush to grab the newly available value. It’s no wonder data is said to be the new oil.
The comparison to oil is actually apt. While digital data is not an exhaustible resource, it cannot necessarily be copied and shared in the way that open source software can be. It’s an ingredient that one organisation might uniquely own and protect. The same might be said of better software or algorithms.
This has prompted some to wonder uneasily about the implications of these two acquisitions. Google, already a powerhouse of software engineering, hardware infrastructure and data, is buying up more.
Does such a concentration of these three factors mean Google will be too powerful, in a future that will be dominated by software smarts, computing power, and above all data? Won’t they just keep grabbing more?
If so, these tech giants – Google, Yahoo, Facebook, LinkedIn, Twitter, Amazon – are doing a terrible job of hoarding some defensible advantages. They give away, through research papers or open source code, the blueprints to a lot of the software and ideas that enable their large-scale data storage, processing, and even machine learning systems. We owe the existence of much of Hadoop to this type of sharing: inspired by Google’s MapReduce / GFS papers, originally implemented by Yahoo, augmented by tools like Hive from Facebook, and so on.
But will Hadoop always be built on intellectual hand-me-downs? Not necessarily. The Yahoo, Google and VMware diasporas are well represented at companies, like ours, that continue to advance the Hadoop ecosystem. Ideas and engineers are not locked inside these companies forever.
Some point out that Google has moved on from MapReduce – but so has Hadoop, through integrations with platforms like Spark. And Cloudera’s Impala is arguably a remix of current-generation ideas in use within Google, from systems like Dremel and F1.
In at least one way, the problem the Hadoop community is solving is actually harder than the one any single tech giant faces. Facebook or Twitter can build exactly the technology and infrastructure required to solve their own problems, no more and no less, and can make assumptions that hold only for their own internal infrastructure and software. Hadoop, in contrast, must work for thousands of companies’ infrastructures and problems.
So Hadoop helps make the software and algorithms “portable” to companies beyond the big tech giants. The problem is that the final ingredient, the data, is not necessarily portable – and that is not a technical problem.
Rather, data sometimes can’t be shared because of its sensitive nature, or won’t be shared because of its value. For example, we probably agree that learning from data can be used to improve health care, but do we want to bring medical record data to the organisations with the software to learn from it?
Or would we rather bring the software to the organisations with the data? Hadoop developers will be familiar with its theme of “bringing computation to the data” rather than the other way around. This is exactly why Hadoop helps keep the playing field of the data-driven future level.
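A simplified picture of that principle, assuming a hypothetical tab-separated input with a postcode in the third column: a Hadoop Streaming job ships two small Python scripts to the nodes that already hold the HDFS blocks, so only tiny per-key counts ever cross the network.

```python
# mapper.py -- runs on each node holding a block of the input file,
# emitting one (postcode, 1) pair per record; the record layout is hypothetical.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 2:
        print("%s\t1" % fields[2])
```

```python
# reducer.py -- receives the mappers' output sorted by key and sums the
# counts, so only small aggregates ever leave the nodes holding the data.
import sys

current_key, count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))
```

Submitted with the standard Hadoop Streaming jar, the scripts travel to the data rather than the data travelling to the scripts.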
At the least, Hadoop lets any organisation pair the same software, and the same cheap commodity hardware, with its own data. For example, Google’s amazing work on “deep learning” that apparently discovered the idea of “cat” from an internet of images is “just” a distributed multi-layer neural network at heart.
A simple version of this can be built by any sufficiently motivated organisation, on Hadoop, for its own data. Hadoop has some way to go to match the sophistication of specialised research projects like this, and to package it for general use, but it is not in a different ballpark.
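As a toy illustration only – Google’s system was distributed and unsupervised, trained on millions of images, while this is a single-machine script using a small labelled dataset that ships with scikit-learn – the basic building block looks something like this:

```python
# A toy multi-layer neural network: the same basic building block, applied
# to scikit-learn's small bundled digits dataset rather than internet images.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()  # 1,797 tiny 8x8 greyscale digit images
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Two hidden layers; at Google's scale, training is spread across a cluster
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```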
We should be glad, then, that tech giants like Google are investing heavily in pioneering new technology. Instead of fearing the consequences of acquisitions like these, we should note that recent history suggests they are more likely to generate ideas, research, open source code and ex-employee start-ups that rapidly diffuse into the wider industry, to the benefit of everyone. Hadoop has a vital role in bringing all of these advances, especially in machine learning, to our data too.
Sean Owen is a data scientist at Cloudera.