Hadoop may be synonymous with big data, and it may be free to access and work with, but engineering teams globally will admit that behind every Hadoop undertaking lies a major technical delivery project.
Failures are so commonplace that even the experts don’t have great expectations for 2017: at the recent Gartner Data & Analytics Summit in Sydney, research director Nick Heudecker claimed that 70% of Hadoop deployments in 2017 will fail to deliver either their estimated cost savings or their predicted revenue.
It shouldn’t come as a surprise. Hadoop was designed for big data storage, but it wasn’t designed as an actual big data application. Hadoop and Spark are incredible enabling technologies.
However, for many, it has been notoriously challenging to implement big data solutions successfully on the Hadoop stack, due to a lack of available engineering skills and big data experience, accompanied by inflated expectations around time-to-value and cost savings.
So will 2017 see Heudecker’s prediction come true, or will companies break out of the vicious Hadoop failure loop and finally begin to extract consistent value and big data success from the elusive open source framework?
The Hadoop helpers: modern data lake platforms
While all organisations would agree that the business value potential of Hadoop is huge, getting to a point where that value can be realised has been difficult.
A key culprit is that many of these companies’ use cases are built around the ability to bring data together. Up to now, there has been a high barrier to entry in ingesting data from many sources into Hadoop, making the business value difficult to realise.
The challenge of efficiently and reliably ingesting data in a governed manner is familiar to any organisation with a data warehouse.
Enter the new modern data lake platforms. These platforms are working to remove the barriers to data ingestion and discovery. Popularity and awareness of them are growing: the newly debuted open source Kylo, with its entirely flexible model; collaborative data science platforms such as Dataiku; and commercial offerings from Podium and Zaloni, which speed up the solution development cycle on Hadoop in a more fixed, opinionated manner.
Though they vary in approach and flexibility, these next generation data lake platforms are enabling enterprise use cases and removing many of the risks of a custom-engineered approach, allowing companies to become quickly productive.
In addition, they are encouraging organisations to consider governance and best practices upfront to eliminate the common pitfalls of data lakes built on in-house developed solutions.
The key to modern data lake platform success
The Hadoop ecosystem is similar to a building foundation and a bag of useful carpentry tools, but it still requires a highly skilled construction team to actually build the house. Modern data lake platforms provide the house; companies just need to furnish it with data.
These new platforms are able to take on more complex enterprise use cases, providing a solution so that organisations can more easily exploit Hadoop and Spark for analytics. All of these solutions, just like the Hadoop distributors themselves, are focused on simplifying and speeding up the solution development cycle on Hadoop, whether for ingestion or analytics modelling.
Hadoop has some compelling advantages that modern data lake platforms exploit:
• Schema on read, inexpensive Hadoop storage and parallel processing mean that IT data modellers and their carefully designed normalised schemas aren’t needed. Modern data lake platforms can shift the effort of data ingest from IT to business users.
• Relational database management systems (RDBMSs) usually represent a stack of precious hardware resources with fixed capacity, carefully managed by IT. Classic ETL transformations are performed along the edge with proprietary software, and ETL tools had to deal with complex transform gyrations such as populating star schemas.
• Hadoop and Spark can transform data using inexpensive cluster resources. Spark data frames are well suited to the kind of complex data transformations any analytics team needs, and modern platforms allow organisations to exploit the transformational power of Spark without any programming skills (see the sketch after this list).
• All the same governance, security, and data confidence challenges exist with Hadoop that were solved over time with data warehouse projects. Modern data lake platforms bring all the capabilities needed to navigate these challenges.
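To make the schema-on-read and Spark points above concrete, here is a minimal PySpark sketch. It is illustrative only and not drawn from any of the platforms named in this article; the file paths, column names and cleansing rules are hypothetical assumptions.

```python
# Minimal schema-on-read sketch in PySpark (illustrative only; paths and
# column names are hypothetical). Raw files land in the lake as-is and the
# structure is applied at read time, not by an upfront warehouse schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-lake-ingest-sketch").getOrCreate()

# Schema on read: infer structure from the raw landing-zone files at query time.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("hdfs:///data/landing/orders/"))      # hypothetical landing path

# A typical DataFrame transformation step: standardise, cleanse and enrich
# using cluster resources rather than an edge ETL server.
cleansed = (raw
            .withColumn("order_ts", F.to_timestamp("order_ts"))
            .filter(F.col("amount") > 0)
            .withColumn("amount_gbp", F.round(F.col("amount") * 0.79, 2)))

# Write back to the lake in a columnar format for downstream analytics.
cleansed.write.mode("overwrite").parquet("hdfs:///data/curated/orders/")
```

The platforms discussed here aim to generate and manage this kind of pipeline through configuration and self-service interfaces, rather than asking business users to write the code by hand.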
How will the modern data lake platform help Hadoop succeed?
All these platforms target common data lake use cases: enabling self-service data ingest, data preparation, metadata management, security and governance.
Some provide a framework that lets IT design and manage custom pipelines to integrate with enterprise systems; others offer the application layer, with step-by-step governance requirements and enforced best practices at the build stage.
Companies heavily invested in data lakes are recognising the value of the above combined assets, and are comparing the relatively low price points of the modern data lake platforms to the ongoing cost of Hadoop-savvy software engineers, who are difficult to find.
Companies are finding that, in order to remain competitive and drive growth in the new big data-driven digital economy, these next generation data lake platforms are the future, and possibly just the helping hand they need to make a success of Hadoop.
Sourced by Matt Hutton, director of R&D, Think Big/Teradata