For some companies, data lakes may seem too disorderly to be a viable storage option, and their formless nature compared with data warehouses may put them off.
But a properly curated data lake can be perfectly sufficient for organisations that want to store both structured and unstructured data and scale on demand, whereas data warehouses accommodate only structured assets.
Unstructured data stored in a data lake can include important business documents, such as client email threads, and company files, such as contracts and presentations.
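A lake’s storage layer makes no distinction between the two. As a minimal sketch, assuming an S3-compatible object store as that layer (the bucket name and file paths here are hypothetical), structured and unstructured assets can land side by side:

```python
import boto3  # AWS SDK for Python; any S3-compatible store behaves similarly

s3 = boto3.client("s3")
BUCKET = "corp-data-lake"  # hypothetical bucket acting as the lake

# Structured asset: a CSV extract destined for downstream analytics.
s3.upload_file("exports/orders_2020.csv", BUCKET, "raw/structured/orders_2020.csv")

# Unstructured assets: the email threads, contracts and presentations
# mentioned above, stored as-is next to the structured data.
for doc in ("mail/client_thread.eml", "legal/contract.pdf", "sales/pitch.pptx"):
    s3.upload_file(doc, BUCKET, f"raw/unstructured/{doc}")
```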
For Informatica’s VP EMEA and LATAM, Greg Hanson, a key difference between a data warehouse and a data lake lies in the kinds of assets each is designed to hold.
“The aspiration there is for a data lake to be complete in the sense that it will have as many information assets as possible – whether they are structured or unstructured,” he explained. “Contrast that with a data warehouse, where it’s a very structured and limited set of data that has a high degree of accuracy.
“For data lakes, it’s about storing as much data as possible, because once you have as many information assets as possible within a lake, the value you can get is unparalleled insight.
“If you think about the priority areas for today’s organisations, particularly within customer experience, having a high-quality data lake is a fundamental building block in the new customer experience battleground that most companies will be a part of in 2020 and beyond.”
How to get the most out of data lakes
While the sheer capacity of data lakes can raise company data storage to a level that satisfies the business, Hanson said that a lake can only reach its vast potential if the data it holds is (see the sketch after this list):
- well-governed;
- stored in a complete data set, ‘so it can have as much data as possible’;
- accurate ‘at the point of use, and that means real-time’;
- high-quality; and
- available for all.
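As a minimal sketch of how the first four criteria might be enforced when data enters the lake (the thresholds, column names and function below are hypothetical; governance and organisation-wide availability are platform concerns that a single function cannot show):

```python
from datetime import timedelta

import pandas as pd

# Hypothetical thresholds; each organisation would tune its own.
REQUIRED_COLUMNS = {"customer_id", "event", "timestamp"}  # completeness
MAX_STALENESS = timedelta(minutes=5)   # accuracy "at the point of use", i.e. real-time
MAX_NULL_RATIO = 0.01                  # quality

def admit_to_lake(batch: pd.DataFrame) -> bool:
    """Gate an incoming batch against the criteria above before it lands."""
    # Completeness: every expected field is present.
    if not REQUIRED_COLUMNS.issubset(batch.columns):
        return False
    # Real-time accuracy: the newest record must be fresh at the point of use.
    newest = pd.to_datetime(batch["timestamp"], utc=True).max()
    if pd.Timestamp.now(tz="UTC") - newest > MAX_STALENESS:
        return False
    # Quality: reject batches riddled with nulls.
    if batch[list(REQUIRED_COLUMNS)].isna().mean().mean() > MAX_NULL_RATIO:
        return False
    return True
```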
“Another important thing I would say is that it needs to be automated,” Hanson continued. “This means having a platform approach to your analytics and your ingestion of data that has machine learning and AI built in, and I think many organisations now are starting to realise a data management platform’s value with data lakes.
“Many organisations now have not only a data science team, but they have a data engineering team, because they’ve realised their data scientists were just left to fish for data, and they couldn’t understand where the data existed in the first place.
“The data had massive quality issues with it, and they were left to solve these challenges themselves.”
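Hanson did not describe a specific implementation, but as a loose illustration of what “machine learning built in” to ingestion can mean, a pipeline might score incoming records with an off-the-shelf anomaly detector and flag outliers for the data engineering team rather than landing them silently (the function name and parameters below are assumptions for illustration):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def ingest_with_anomaly_flags(batch: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Score a batch during ingestion; -1 flags records the model deems anomalous,
    so engineers review them instead of data scientists discovering them later."""
    model = IsolationForest(contamination=0.01, random_state=0)
    scored = batch.copy()
    scored["quality_flag"] = model.fit_predict(scored[numeric_cols])
    return scored
```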
A need for catalogues
On top of a team to assess data quality, the Informatica VP identified another important component of any successful data lake infrastructure: a way for users to find exactly the data they are looking for.
“One of the key things we then need to add on top of quality is a catalogue of information,” he said. “There is a high requirement for a data catalogue and metadata catalogue storing all those data assets.
“Then there is building and maintaining that catalogue automatically, and providing an Amazon-like front end for people within the organisation to really find the assets appropriate for the analysis that they want to do.
“Early data lakes failed to democratise data, and that’s why a metadata catalogue is so imperative to make these projects successful moving forward.”
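As a minimal sketch of that idea (the entry fields and search behaviour here are hypothetical, not any particular catalogue product), a metadata catalogue pairs every asset in the lake with searchable descriptors, and the “Amazon-like front end” reduces, at its simplest, to keyword search over that metadata:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    path: str          # where the asset lives in the lake
    owner: str         # governance: who is accountable for the asset
    description: str   # human-readable summary
    tags: set[str] = field(default_factory=set)

# Tiny in-memory catalogue; real platforms persist this and harvest it automatically.
CATALOG = [
    CatalogEntry("raw/structured/orders_2020.csv", "sales-ops",
                 "Order extracts for 2020", {"orders", "structured"}),
    CatalogEntry("raw/unstructured/legal/contract.pdf", "legal",
                 "Signed client contract", {"contract", "unstructured"}),
]

def search(query: str) -> list[CatalogEntry]:
    """Keyword search over descriptions and tags: the 'front end' in miniature."""
    q = query.lower()
    return [e for e in CATALOG
            if q in e.description.lower() or q in {t.lower() for t in e.tags}]

print([e.path for e in search("contract")])  # -> ['raw/unstructured/legal/contract.pdf']
```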