Data is the lifeblood of most modern organisations, enabling them to better understand and serve their customers, meet compliance regulations, increase business agility and improve ROI.
For this reason, more businesses are realising the value of mastering all of their data in a centralised data lake so it can be accessed and analysed quickly.
In some ways, the concept of a data lake is now commonplace. Yet how many organisations are really able to leverage all of the data assets in the lake to derive real, actionable insights?
The goals of a modern data architecture include turning raw data into insights in a timely manner while maintaining governance, compliance and security standards, and increasing nimbleness and efficiency across the business. To achieve these goals, organisations need to invest in populating the data lake and preventing it from turning into a data swamp.
One essential step is to liberate all enterprise data, especially the hard-to-access data in mainframes and other legacy data stores that typically house the most important transactional records, and use it to populate the data lake.
Another critical step, one that’s often overlooked, is to cleanse the data – ensuring it’s accurate and complete, delivering the right data at the right time.
Populate the data lake
The most common use cases for big data initiatives range from legacy optimisation to transformative projects. The common theme across all of them is a reliance on the availability of data.
Whether an organisation is optimising the customer experience by integrating data that originates in mobile banking with historical data hosted on the mainframe, or streaming online hotel reservation data from the cloud and combining it with on-premise data in the data warehouse to feed a visualisation tool, every use case depends on the data being available and accessible.
If an organisation excludes difficult-to-access legacy data stores, such as mainframes, as it builds the data lake, it misses a big opportunity.
New data sources, such as sensors or mobile devices, are easily captured in modern enterprise data hubs, but businesses also need to reference customer and transaction history data, stored on mainframes and in data warehouses, to make sense of these newer sources. Making these data assets available for predictive and advanced analytics opens up new business opportunities and significantly increases business agility.
Another essential consideration when populating the data lake is compliance. Businesses should first assess their regulatory needs and, when necessary, preserve a copy of their data in its original, unaltered format.
This is especially important in highly regulated industries like banking, insurance and healthcare, which must maintain data lineage for compliance purposes. Archiving legacy data sets on more affordable platforms such as Hadoop or the cloud helps accelerate IT optimisation projects and the adoption of a modern data architecture.
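To make that archiving pattern concrete, the sketch below is a minimal Python illustration, assuming local paths stand in for HDFS or cloud object storage and that the file, path and system names are hypothetical: the source extract is copied byte for byte into an immutable raw zone, and basic lineage metadata is recorded alongside it.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

def archive_raw(source_file: str, landing_zone: str, source_system: str) -> Path:
    """Copy a source extract, unaltered, into a raw zone and record
    simple lineage metadata (source system, load time, checksum)."""
    src = Path(source_file)
    load_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest_dir = Path(landing_zone) / source_system / load_ts
    dest_dir.mkdir(parents=True, exist_ok=True)

    dest = dest_dir / src.name
    shutil.copy2(src, dest)  # byte-for-byte copy; no parsing or transformation

    lineage = {
        "source_system": source_system,
        "original_file": src.name,
        "loaded_at_utc": load_ts,
        "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
    }
    (dest_dir / "_lineage.json").write_text(json.dumps(lineage, indent=2))
    return dest

# Hypothetical usage: archive a mainframe extract into the lake's raw zone
# archive_raw("customers_20240101.csv", "/data/lake/raw", "mainframe_core_banking")
```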
Getting the data in – a feat in itself – isn’t where the job of building a data lake ends. Data quality is often overlooked, and initial implementations of data lakes, or data hubs, are turning into data swamps.
Profiling and understanding the data as it is accessed and integrated into the data lake creates an opportunity to automate the creation of business rules to cleanse, validate and correct the data.
Data stewards are spending about 70% of their time cleansing and validating the data sets instead of understanding the data itself.
Any automation we can provide in cleansing and enriching these data sets is a big productivity improvement and helps ensure businesses are making decisions based on sound, complete data.
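The authors do not prescribe specific tooling, but a minimal Python sketch can illustrate the idea, assuming hypothetical field names and deliberately simplified rules: profile a column to understand its completeness and common values, then encode what the profile reveals as validation and correction rules that can run automatically as data lands in the lake.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile_column(values):
    """Summarise completeness and common values for one column."""
    total = len(values)
    non_null = [v for v in values if v not in (None, "", "NULL")]
    return {
        "total": total,
        "null_rate": 1 - len(non_null) / total if total else 0.0,
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

def validate_record(record):
    """Apply simple cleansing and validation rules derived from profiling."""
    issues = []
    if not record.get("customer_id"):
        issues.append("missing customer_id")
    email = (record.get("email") or "").strip().lower()
    if email and not EMAIL_RE.match(email):
        issues.append("malformed email")
    record["email"] = email  # standardise case and whitespace as a correction step
    return record, issues
```

In practice, rules like these would be generated and maintained by a data quality tool rather than hand-coded, but the principle is the same: let the profile drive the rules, and run them at ingestion rather than leaving them to the data stewards downstream.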
Purify the data lake
Organisations want to use data lakes to create a single, 360-degree view – whether for marketing purposes or otherwise. But common “dirty data” issues, like duplicate records or mismatched email addresses and contact information, detract from the effort and ROI of the entire data lake initiative.
Many underestimate the importance of a data quality initiative – and unverified or “bad” data, inevitable when populating a data lake from many siloed sources, can cost a company real dollars.
For example, the average cost of one email lead is $120 – by not verifying whether an email address is correct, an organisation loses at least that much in profit, and likely even more in wasted time.
To ensure data quality, organisations should first explore the data lake and catalogue what’s inside, creating business rules to validate, match and cleanse the data. In many cases, it’s valuable to enlist the help of third-party databases to find and add missing information, creating a complete picture of each customer.
The savviest businesses will even automate these steps so that data is cleansed as the data lake is populated – preventing the organisation from having to cleanse new data every time it’s moved to the data lake.
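As a sketch of what “validate, match and cleanse” might look like in code, the example below (Python, hypothetical field names, a deliberately naive exact match key) groups records that share a normalised email, or surname plus postcode, and merges each group into a single golden record. Production-grade matching would typically add fuzzy or probabilistic comparison and the third-party reference data mentioned above.

```python
from collections import defaultdict

def match_key(record):
    """Build a naive blocking key: normalised email, else surname + postcode."""
    email = (record.get("email") or "").strip().lower()
    if email:
        return ("email", email)
    surname = (record.get("last_name") or "").strip().lower()
    postcode = (record.get("postcode") or "").replace(" ", "").lower()
    if surname and postcode:
        return ("name_postcode", surname, postcode)
    return ("no_key", id(record))  # nothing to match on; keep the record as-is

def deduplicate(records):
    """Group records that share a match key and merge each group,
    preferring the most recently updated value for every field."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    merged = []
    for recs in groups.values():
        # assumes ISO-format timestamps; newest record first
        recs.sort(key=lambda r: r.get("updated_at", ""), reverse=True)
        golden = {}
        for rec in recs:  # older records only fill fields the newer ones lack
            for field, value in rec.items():
                golden.setdefault(field, value)
        merged.append(golden)
    return merged
```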
Reaping the benefits
Organisations that spend the necessary time and resources to ensure both that all critical data assets are moved into the data lake and that the data is of good quality are reaping the benefits. Doing so allows them to:
• Improve customer loyalty by having a complete view of their status and history, making it easier to provide personalised information to best serve customer needs.
• Gain insights into demographics, customers’ appetite for risk, preferred products and more, which lead to new sources of revenue through cross-selling and marketing programs that target the right customers with the right offers.
• Maintain and accelerate regulatory compliance and strengthen credibility with regulators by establishing sound data governance processes and by seamlessly accessing, validating and gaining insight into critical information from across the organisation.
• Extract valuable information in real-time for improved marketplace awareness and internal decision making.
With new sources of data emerging every day, businesses should master their data lake now so they’re prepared to take advantage of every piece of incoming customer data to uncover new insights and blow away the competition.
Sourced by Tendu Yogurtcu, general manager of big data at Syncsort, and Keith Kohl, VP of product management at Trillium