Most CIOs know and understand the ‘four Vs’ of big data – veracity, variety, volume and velocity – which describe and define arguably the biggest shift to IT departments in recent years. But do they always apply?
Veracity (the accuracy and trustworthiness of the data) and velocity (the speed at which data is created) are often the most problematic. Against the backdrop of rapidly flowing and changing data, keeping data ‘clean’ and implementing efficient processes to keep ‘dirty data’ out can often be difficult to achieve, and even harder to sustain.
Meanwhile, variety (the numerous types of information that businesses create) rarely causes major issues as few companies are trying to derive value from integrating multiple data types, such as videos or MP3s.
But here’s the greatest surprise: despite the media hype, volume is rarely a problem. The exception is the company collecting terabytes of data every day, perhaps through IoT or social media, and how many companies can genuinely say that?
This is because most business systems generate, at most, gigabytes of data per month. In fact, the startling truth is that many business systems’ databases could be stored on a few portable disk drives.
Most corporate datasets do not run to billions of records; they typically number somewhere between 10 and 50 million, or fewer. While this still seems like a big number, for the data analysis tools in typical use it is simply not a volume that merits the moniker ‘big’.
However, that same dataset may well contain dozens of important fields per record and be housed across multiple sources, and this is where the real problem lies.
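To underline why sheer volume is not the obstacle, consider a back-of-the-envelope calculation; the record count, field count and bytes-per-field figures below are illustrative assumptions rather than figures from any particular business.

```python
# Back-of-the-envelope sizing of a 'large' corporate dataset.
# All three figures are illustrative assumptions, not measurements.
records = 50_000_000        # upper end of a typical corporate dataset
fields_per_record = 50      # 'dozens of important fields'
bytes_per_field = 20        # assumed average stored size of a field

total_gb = records * fields_per_record * bytes_per_field / 1e9
print(f"Approximate raw size: {total_gb:.0f} GB")  # roughly 50 GB
```

Tens of gigabytes sits comfortably on a single commodity server, or indeed a few portable drives; the difficulty lies elsewhere.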
The more immediate challenge is not volume but integrating data from different departments, locations or even third parties. When this is necessary, the issue is no longer one of ‘big data’ but one of ‘wide data’.
The diversity of sources and breadth of fields often mean that data is inconsistent in format, in turn making correlations impossible to draw without the investment needed to correct the discrepancies manually.
The traditional ‘four Vs’ are relevant for enterprises where volume truly applies. But an enterprise in the far more common position of dealing with less data, spread across multiple sources, is better served by prioritising the ‘three As’: accuracy, aggregation and automation.
The root problem of wide data and integrating different sources is one of accuracy.
This problem of poor data quality within a single source then snowballs into a far greater one at the point of aggregation, that is, when several substandard sources are integrated together.
Organisations are suddenly faced with various datasets, possibly from external sources or in different internal systems such as ERP, HR or sales, with no common identifier and no way to draw connections between them.
For example, if a company wanted to use its data to better understand its employees, perhaps to predict potential turnover, several small datasets would be required from a variety of systems – HR, payroll, recruitment, attendance logs etc.
But every system brings its own potential data problems: duplications, missing information and employees identified in different ways. In one system alone, John Smith could be recorded under his full name, as J Smith or as John_Smith, not to mention the potential for typos, which compound the issue.
Typical data analytics cannot solve these problems. Simple rule-based algorithms are too deterministic to deal with data that does not obey consistent rules. So while accuracy at the point of data entry is vital for a single data source to be valuable, once attempts are made to aggregate inaccurate datasets the result is catastrophic. And even when each dataset is accurate, aggregation remains a challenge if they are materially different, say, recorded in different languages.
Clearly, the individual data sources must be modelled and cleansed, back-filling missing information, correcting inaccuracies and eliminating duplications. On the face of it, this is an arduous, time-consuming and expensive task, but the problem of accuracy, compounded at the point of aggregation, is in fact solved by the third A, automation.
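To make the cleansing steps concrete, here is a minimal sketch in Python using pandas; the column names, values and rules are hypothetical, and in practice these are exactly the steps that the automation described next performs at scale.

```python
import pandas as pd

# Hypothetical HR extract showing the usual problems: inconsistent
# name formats, a missing department and a duplicate record.
hr = pd.DataFrame({
    "employee_id": [101, 101, 102],
    "name": ["John Smith", "John_Smith", "J Smith"],
    "department": ["Sales", None, "Finance"],
})

# Normalise the name field so the two spellings of employee 101 agree.
hr["name"] = (hr["name"]
              .str.replace("_", " ", regex=False)
              .str.strip()
              .str.lower())

# Back-fill the missing department from other records for the same
# employee, then eliminate the duplicate that normalisation has exposed.
hr["department"] = hr.groupby("employee_id")["department"].transform(
    lambda s: s.ffill().bfill())
hr = hr.drop_duplicates()
print(hr)
```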
The wide availability of machine learning technology means that even inaccurate data can be made fully understandable, and valuable once again. The ‘machine’ is built to expect the data-entry inaccuracies that would otherwise make aggregation impossible.
To resolve this, the machine will be equipped with algorithms that allow it to understand what to do when errors are discovered – and most importantly, enable it to adapt to new types of irregularities and learn new solutions.
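The article does not describe the underlying algorithms, so as a minimal sketch of the idea, the example below uses simple string-similarity scoring from Python’s standard library as a stand-in for a learned matching model; the similarity threshold is an assumption.

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Collapse common formatting differences before comparing."""
    return name.replace("_", " ").strip().lower()

def same_person(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two records as the same person if their names are similar enough.

    The fixed threshold is an assumption for illustration; a learning
    system would refine this decision as new irregularities appear.
    """
    score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return score >= threshold

for a, b in [("John Smith", "John_Smith"), ("John Smith", "J Smith")]:
    print(a, "|", b, "->", same_person(a, b))
```

In practice the hard-coded threshold would be replaced by a model trained on confirmed matches, which is the adaptive behaviour described above.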
Today, organisations know they have data quality issues and resort to spending a year or more on wholesale reworking of their data before any insights can be drawn, reluctantly assuming that there is no alternative.
This is hardly ideal when the board expects to be making decisions based on valuable insights from the organisation’s data on a monthly, weekly or even daily basis.
Without the automation that machine learning provides, that year or more is an unavoidable investment and a frustrating delay.
New auto-translation techniques combined with semantic analysis can now cope not only with the structure of the data but also with its meaning, even across multiple languages.
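As an illustration of the idea (the article names no specific tools), multilingual sentence embeddings are one widely used way to match values on meaning rather than spelling; the library and model below are assumptions, not the author’s tooling.

```python
# Illustrative sketch only: the sentence-transformers library and the
# multilingual model named here are assumptions, not the author's tooling.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# The same job title as it might be recorded in English, German and
# French systems; a character-level comparison finds little in common.
titles = [
    "Head of Human Resources",
    "Leiter Personalabteilung",
    "Responsable des ressources humaines",
]
embeddings = model.encode(titles, convert_to_tensor=True)

# High cosine similarity indicates the values mean the same thing,
# which is what allows them to be aggregated despite the language gap.
print(util.cos_sim(embeddings, embeddings))
```

Matching on meaning in this way is what makes aggregation across languages tractable without a manual translation exercise.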
Big data – if truly ‘big’ – is well managed by adhering to the ‘four Vs’. But few companies can honestly claim to deal with this sort of volume – the vast majority suffer the problem of ‘wide data’.
Boards expect to be able to identify correlations across the whole of the business’s activity and to base strategic decisions on the insights, but the data is just not up to scratch. With this realisation, the rules must change.
Sourced from Keesup Choe, CEO, Pi