It was sometime around 2010 that “big data” became a buzzword. Data was an untapped resource, and technology was going to unlock the riches held within it. Leading the hype was an open-source project called Hadoop. With Hadoop you could safely store and manipulate large amounts of data on commodity hardware; it was massively powerful and scalable, and a large community grew up around it to develop it further.
These days we don’t talk about commodity hardware so much. In fact, we don’t talk about hardware at all – by using the cloud, compute and storage have become things that you buy on demand. Analytics is a service, to be bought by the hour.
So, what’s happened to Hadoop? Does it have a place in the cloud? Why have so many companies abandoned their on-premise Hadoop installation in favour of the cloud?
Hadoop – the elephant that could
Hadoop’s origins can be traced to the Apache Nutch project in the early 2000s. Nutch was an open-source web crawler built to index the web, developed under the Apache Software Foundation – itself one of the pioneers of open-source software.
At the time, the Nutch project was struggling to parallelise its web crawler – it worked well on one machine but to get it handling millions of webpages – “web-scale” – was out of reach. In December 2004, Google released a paper called “MapReduce: Simplified Data Processing on Large Clusters” which described how Google had managed to index the rapidly growing volume of content on the web by spreading the workload across large clusters of commodity servers.
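To make the model concrete, here is a minimal, single-process sketch of the MapReduce idea, using word counting as the canonical example workload. The function names and sample data are illustrative assumptions only; in a real Hadoop cluster the map and reduce steps run as parallel tasks on many machines, and the framework handles the shuffle between them.

```python
# A local sketch of the MapReduce data flow: map -> shuffle -> reduce.
# In Hadoop, map and reduce tasks run in parallel across the cluster;
# here the three phases are simulated in one process to show the idea.
from collections import defaultdict

def map_phase(document):
    # map: emit an intermediate (key, value) pair for every word
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # reduce: combine all values that share the same key
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# shuffle: group intermediate pairs by key (the framework does this in Hadoop)
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)  # e.g. 'the' -> 3, 'fox' -> 2, every other word -> 1
```

Because each document can be mapped independently and each key can be reduced independently, the same pattern scales from one laptop to thousands of commodity servers.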
It was the perfect fit for Nutch’s problems, and by July 2005 its core team had integrated MapReduce into Nutch. Not long after, the novel filesystem and MapReduce software were spun out into their own project called Hadoop – famously named after the toy elephant that belonged to the project lead’s son.
The project accelerated in 2006 when Yahoo! used Hadoop to replace its search backend system. Soon after, it was adopted by Twitter, Facebook and LinkedIn too – in fact, it became the de facto way to work with web-scale data.
The technology was revolutionary at the time. Before Hadoop, storing large amounts of structured data was difficult and expensive. Most organisations just kept the most valuable data and discarded the rest. What Hadoop did was reduce the burden of data storage – for the first time it became cost-effective to store lots of data – “big” amounts of data.
Realisation – Hadoop is an ecosystem, not a solution
Lots of businesses, both large and small, set up Hadoop clusters and hoped to gain business insights or new data-based capabilities from their data. However, for many of them, the results were a disappointment.
More often than not, the Hadoop cluster was installed before the business had a clear use case for it. When they tried to execute on an idea – often business intelligence or analytics – they found Hadoop too slow for interactive queries.
What many people failed to realise is that Hadoop itself is more of a framework than a big data solution. And with its broad ecosystem of complementary open-source projects, Hadoop was too complicated for most businesses: fully leveraging it needed a level of configuration and programming knowledge that only a dedicated team could supply.
Even when there was a dedicated internal team, something extra was sometimes needed. For instance, one of Exasol’s clients, King Digital Entertainment, makers of the Candy Crush series of games, couldn’t get the most out of Hadoop: it wasn’t quick enough for the interactive BI queries its data science team demanded. They needed an accelerator on top of a multi-petabyte Hadoop cluster so that their data scientists could query the data interactively.
Hadoop in the cloud
The world of data warehousing has changed in recent years, and Hadoop has had to adapt. The IT infrastructure of 2009-2013, when Hadoop was at the peak of its fame, differs greatly from the IT infrastructure of today. The public cloud didn’t even exist when Hadoop was created in early 2006 – AWS only launched in March of that year. So, the IT landscape in which Hadoop had its formative years has changed immeasurably.
This has caused the way Hadoop is used to evolve. Most public cloud infrastructure providers now actively maintain and integrate a managed Hadoop platform. The most widely used example is AWS Elastic MapReduce (EMR), but Azure has HDInsight and Google Cloud Platform has Dataproc. These days the Hadoop-based cloud platform is most often used for machine learning, batch processing or ETL jobs.
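As a rough illustration of “Hadoop by the hour”, the sketch below uses boto3, the AWS SDK for Python, to request a small, transient EMR cluster that shuts itself down once its work is finished. The cluster name, region, release label, instance types and IAM role names are assumptions for the example, not a recommendation.

```python
# Hedged sketch: request a short-lived managed Hadoop cluster on AWS EMR.
# All values below (region, release label, sizes, role names) are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-etl",                        # hypothetical cluster name
    ReleaseLabel="emr-6.15.0",                 # assumed release; pick a current one
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the steps finish
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",         # assumed default EMR roles
    ServiceRole="EMR_DefaultRole",
)

print("Cluster requested:", response["JobFlowId"])
```

The point is less the specific parameters than the workflow: a cluster that once took weeks of on-premise setup can be requested, used for a batch or ETL job, and discarded, paying only for the hours it ran.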
Moving to the cloud has benefited Hadoop. The complicated set-up is taken care of, and it’s ready to be used immediately, on demand. But Hadoop has competition: it is no longer the only option for secure, robust, cheap data storage. So it finds itself used for particular workloads rather than being the centre of the data universe, as its usual on-premise incarnation was.
What’s the future for Hadoop?
For certain organisations, Hadoop is still a great on-premise solution. We still see strong demand for on-premise solutions in our installations, including those integrating Hadoop clusters, and that demand isn’t going away anytime soon. The simple fact is that if it’s working well, there’s often no need to change it – and Hadoop is relatively easy to scale, so it can grow with your business.
However, it seems the majority of businesses are now looking to run their own data warehouse using public cloud services. We just launched our enterprise-grade data warehouse on AWS, and this was entirely driven by customer demand – more and more businesses are asking for it. For most of these businesses, Hadoop is just another tool in the cloud toolbox: when you need to run a job at scale it’s a great option, and in the cloud Hadoop offers a level of ease of use it hasn’t enjoyed before.
So, what does the future hold for Hadoop?
Hadoop was designed as a tool for a job. Originally, it was the means of building a web crawler to index the web. These days it’s best suited for batch processing, data enrichment jobs, or ETL at scale. It’s a great on-premise solution for those businesses which understand its strengths and weaknesses and need to store large amounts of data on commodity hardware.
Many on-premise technologies are finding themselves demoted to legacy technology. Hadoop’s legacy, however, may well be in the cloud, where it has longevity and staying power. It’s a fantastic tool to have in your cloud toolbox when you need to run a batch job at scale, and the cloud gives Hadoop a level of ease of use it never had on-premise.
Written by Jens Graupmann, VP of product management at Exasol