Data lakes provide a solution for businesses looking to harness the power of data. Stuart Wells, executive vice president, chief product and technology officer at FICO, discusses with Information Age how approaching data in this way can lead to better business decisions.
What is a data lake?
A data lake is a data storage system in which data from multiple sources across the organisation is replicated, typically in distributed, lower-cost storage such as the Hadoop file system or cloud storage, to enable ad-hoc data exploration and analytics.
Why should a business choose to adopt one – what are the benefits?
There are both business and technical benefits to adopting a data lake strategy. Businesses want a view of their customers' activities, such as financial transactions, account status, purchases, emails and support calls, as well as social media and website interactions.
This 360-degree view of the customer draws on many data sources. It enables more relevant marketing offers and better insights for fraud detection, loan approvals, collections and recovery, anti-money laundering and customer support, which in turn increases customer satisfaction and improves ROI for the business.
Current data systems in businesses are overloaded with operational requests and have inflexible data models, and expanding their capacity is both a technical and an economic challenge for the enterprise.
With the evolution of the low-cost distributed computing used in data lakes, data can be made available to analysts more easily and affordably, enabling innovative new models to be developed quickly.
Data lakes have been used for data ingestion, transformation, federation, batch processing and data discovery because they store data on inexpensive storage and provide schema flexibility.
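To make the idea of schema flexibility concrete, here is a minimal sketch in which raw records of different shapes are kept as-is and a schema is applied only when the data is read. The field names and the `read_with_schema` helper are hypothetical, and an in-memory list stands in for lake storage.

```python
import json

# Raw events from different sources, stored unchanged (hypothetical fields).
raw_lines = [
    '{"customer_id": "c1", "type": "purchase", "amount": 42.5}',
    '{"customer_id": "c2", "type": "support_call", "duration_s": 310}',
    '{"customer_id": "c1", "type": "email", "subject": "Invoice"}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto the requested fields.

    Fields a record does not carry come back as None, so records of
    different shapes can sit side by side in the same store.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# An analyst chooses the schema at read time, here for purchases only.
purchases = [
    row
    for row in read_with_schema(raw_lines, ["customer_id", "type", "amount"])
    if row["type"] == "purchase"
]
```

No record had to be reshaped or rejected on the way in; the cost is that each reader has to cope with missing fields.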
What are the challenges in implementing a data lake strategy?
Many enterprises are still struggling with what data to bring into the data lake. With multiple sources of data being merged into a single source, data quality and the lack of a metadata repository for the raw data are key issues that enterprises need to deal with.
Maintaining data lineage and a common ontology is also a challenge. With the entire company's data in the data lake, governance, security and access control become important and complex tasks.
Are there any scenarios where another data storage strategy is more beneficial?
Data lakes are useful where the benefits of combining multiple data sources for exploration and insight yield a high ROI on the setup of the data lake.
However, building out a robust metadata repository or data catalog can serve the enterprise as well or better when paired with a modern self-service analytics tool that pulls data directly from the operational stores, manipulates it in memory or in a data cache, and lets analysts explore the data to build new models.
Will a data lake help with regulations such as GDPR?
With the new European Regulation for data privacy coming into force in May 2018, enterprises need to quickly adopt a data strategy that is robust and scalable to meet the guidelines. A key requirement of GDPR is the deletion of customer data upon request.
Data lakes provide a mechanism to manage the data from all sources and get a global view of customer records held in multiple operational stores. Data cataloguing and identity resolution tools are essential for discovering all instances of a customer record. With data lakes, enterprises can maintain customer records more consistently, ensuring that updates and deletions occur throughout the organisation’s data stores in line with GDPR guidelines.
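As a rough illustration of discovering every instance of a customer record before a deletion request is honoured, the sketch below matches records across two stores on a normalised email address. This is a deliberately simple stand-in for real identity resolution. The `Store` class, the store names and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical stand-in for one operational data store."""
    name: str
    records: dict = field(default_factory=dict)  # record_id -> record


def normalize_email(email):
    # Very simple identity key: trimmed, lower-cased email address.
    return email.strip().lower()


def find_customer(stores, email):
    """Return (store_name, record_id) for every record matching the email."""
    key = normalize_email(email)
    return [
        (store.name, record_id)
        for store in stores
        for record_id, record in store.records.items()
        if normalize_email(record.get("email", "")) == key
    ]


def erase_customer(stores, email):
    """Delete every matching record and return what was removed."""
    hits = find_customer(stores, email)  # collected before any deletion
    by_name = {store.name: store for store in stores}
    for store_name, record_id in hits:
        del by_name[store_name].records[record_id]
    return hits


crm = Store("crm", {1: {"email": "Ann@Example.com"}, 2: {"email": "bob@example.com"}})
support = Store("support", {"t-9": {"email": " ann@example.com "}})
stores = [crm, support]

erased = erase_customer(stores, "ann@example.com")
```

The returned list doubles as a simple audit trail of which stores held the customer's data. Production identity resolution would normally match on more than one attribute.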