The vast majority of enterprises are convinced that big data analytics is an important element of their organisation’s data management and business intelligence programs, yet only 30% of them have adopted a system and put it into production, while the rest are still evaluating the opportunity. This poses the question: why is adoption moving so slowly?
There are multiple reasons for this, but one of them is the amount of behind-the-scenes work needed to clean and select the most appropriate data before a company can reap any true benefit. This increases cost and manual effort, and lengthens the time-to-market of big data analytics projects.
In fact, according to a recent article in The New York Times, the work of cleaning up and wrangling data into a usable form is a critical barrier to obtaining insights. It highlighted that data scientists spend from 50% to 80% of their time mired in the mundane labour of collecting and preparing unruly digital data, before it can be explored.
The Wikibon Big Data Analytics Adoption Survey 2014-2015 supports this view, revealing that the difficulty of transforming data into a form suitable for analysis is the biggest technology-related barrier to realising the full value of big data analytics. Additional barriers stem from the difficulty of integrating big data with existing infrastructure and of merging disparate data sources.
> See also: How data quality analytics can help businesses 'follow the rabbit'
So, when deploying a data lake – a solution able to gather all these different data streams – the key question should be: 'Are my data streams healthy?'
Considering that information from inside an enterprise is the second most common source of data when building a data lake, businesses must take proper care of the information they already possess. By managing their internal processes and organising both structured and unstructured information, organisations would be well on their way to building clean and readily usable data streams for their data lakes.
Relevant, real-time information that is integrated with an organisation’s current systems would become much more compliant, secure and easily accessible. An additional benefit is that the organisation would also run its day-to-day business much more efficiently.
Keeping the same data lake analogy, the concept is simple: if a lake is filled with polluted water from its various streams, the focus should be on preventing the streams from becoming polluted in the first place, rather than on continuously cleaning the polluted lake.
A proper Information Management system – including healthy Enterprise Content Management (ECM) and modern archiving strategies – will help to manage and integrate existing data and content across the enterprise, and will limit or prevent the pollution of that information by ensuring chain of custody and compliant governance across the data’s entire lifecycle.
As an example, the utilities department of a Greater London airport has recently implemented smart meters to collect real-time data on water, gas and electricity consumption.
But the smart meters are assigned to the airport’s rental spaces, not to the tenants who have rented those areas to open their shops or kiosks. The utilities manager therefore doesn’t know whom to send the usage bills to, as only the leasing manager knows which spaces are rented by whom.
This is a typical example of data and information being maintained in separate silos and managed with inappropriate software tools (typically standard office tools). Such situations force manual operations like data cleansing and realignment in order to produce valuable output, exposing the process to inconsistencies, data loss or even data leaks.
> See also: The most common data quality problems holding back businesses, and how to solve them
In this instance, the 'solution' could be that the utilities manager and the leasing manager meet once a month at the pub and, over a beer, align the data sitting in their respective Excel files.
To make their lives easier, however, implementing a proper system to connect these two separate worlds would help manage all the leasing processes, contracts and documents on a daily basis.
Integrating this with other systems would not only make their daily operations more efficient and secure, but would also produce a usable stream of information for the data lake and the analytics engine that consumes it, as the sketch below illustrates.
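To make the point concrete, here is a minimal sketch in Python (using pandas) of how the two silos could be joined automatically once both are kept in structured systems rather than ad-hoc Excel files. All field names and figures are illustrative assumptions, not the airport’s actual data model: meter readings are keyed by a hypothetical space_id, and the leasing records map each space to its tenant, so usage can be attributed and billed without any monthly manual realignment.

```python
import pandas as pd

# Hypothetical smart-meter readings, keyed by rental space (not tenant).
meter_readings = pd.DataFrame({
    "space_id":  ["T1-014", "T1-014", "T2-031"],
    "utility":   ["water", "electricity", "electricity"],
    "usage":     [12.4, 310.0, 145.5],  # e.g. cubic metres or kWh
    "read_date": pd.to_datetime(["2015-03-31"] * 3),
})

# Hypothetical leasing records, maintained by the leasing manager,
# mapping each rented space to the tenant occupying it.
leases = pd.DataFrame({
    "space_id":    ["T1-014", "T2-031"],
    "tenant":      ["Coffee Kiosk Ltd", "Bookshop Plc"],
    "lease_start": pd.to_datetime(["2014-01-01", "2014-06-01"]),
    "lease_end":   pd.to_datetime(["2016-12-31", "2015-12-31"]),
})

# Join the two silos on their shared key, then keep only readings that
# fall inside the lease period, so every reading maps to one tenant.
billable = meter_readings.merge(leases, on="space_id", how="left")
billable = billable[
    billable["read_date"].between(billable["lease_start"],
                                  billable["lease_end"])
]

# Aggregate per tenant and utility: a clean, billable stream that can
# also feed the data lake instead of polluting it.
invoice_lines = billable.groupby(["tenant", "utility"])["usage"].sum()
print(invoice_lines)
```

The same join, run continuously inside an integrated system rather than once a month by hand, is what turns two polluted streams into a single clean one for the lake.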
Modern ECM and archiving systems provide the compliant, enterprise-grade functionality to meet these requirements, and enable organisations to start their journey into the new world of big data with the right approach and with clean data streams to fill their data lakes.
Sourced from Michele Vaccaro, EMC Information Intelligence Group EMEA Presales Director