If you have enough data, many claim, you can feed it into a black box with some pre-set algorithms, and out will pop correlations which provide valuable business insight. The more data you have, the more value it brings.
This view is flawed, and it has a history of producing misleading analyses on which poor decisions have been made. Nevertheless, IT departments continue to invest large sums of money in analytics platforms to support doing just that. The result is underperforming technology, and CFOs wondering when that big budget they granted the CIO is going to pay off.
Big data, or indeed any data, may hold huge value, but it’s often looked at in the wrong way. When we are looking at data – collected from different sources, to address different motivations, with an ever-changing context – we can’t fast-track every correlation into an actionable insight.
We have to understand where the data comes from, the factors limiting its reliability, its consistency when applied across different sub-groups, and where biases may be lurking. We need to carefully interrogate any correlation, before we can understand whether it represents a truth in the real world.
Correlation is not causation
Let’s take the simple example of using analytics on social media or search terms. This is a common technique used for everything from brand perception, to tracking political sentiment, to mapping the fallout of natural disasters.
An increase in Tweets is not necessarily good news. A spike in interest may be driven by something other than what you think you are measuring, and the prevailing sentiment may not match that of the subgroup you are interested in.
Critically, any behavioural model you formulate from measured online activity becomes badly dated very quickly. In technical terms, the real-life data being fed into your model no longer matches the test data used to calibrate it.
A classic example is Google Flu Trends, which claimed to be able to profile flu levels faster and more accurately than traditional methods, just by analysing related search terms. After a promising start, its predictions became increasingly erratic, overestimating actual flu levels.
The underlying problem was that the model could not adapt to all of the factors unrelated to flu that were changing search behaviour. The calibration inexorably slipped. Ironically, it’s a reasonable assumption that media coverage of Google Flu Trends itself triggered a significant shift in search behaviour.
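To make the calibration problem concrete, here is a minimal sketch of one general way to check for this kind of drift: compare the distribution of a feature in the data used to calibrate a model with the distribution seen in live data. The feature, the numbers and the 5% significance threshold are all hypothetical; this is not how Google Flu Trends worked, it simply illustrates the check.

```python
# Minimal sketch: flag distribution drift between calibration data and live data.
# The feature, the numbers and the 5% threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Feature values the model was calibrated on (hypothetical).
calibration_values = rng.normal(loc=100, scale=15, size=5_000)

# The same feature measured in live traffic after behaviour has shifted (hypothetical).
live_values = rng.normal(loc=120, scale=25, size=5_000)

# Two-sample Kolmogorov-Smirnov test: are both samples plausibly from the same distribution?
result = ks_2samp(calibration_values, live_values)

if result.pvalue < 0.05:
    print(f"Drift detected (KS statistic={result.statistic:.3f}): time to recalibrate.")
else:
    print("No evidence of drift in this feature.")
```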
Even when your data is clear, coherent and reliable, it’s all too easy to trip up in the analytics. A real-life example of Simpson’s Paradox occurred when assessing the success rates of two kidney stone treatments.
Looking at just the averages, the data indicated Treatment B was more effective. When, however, you took into account the different stone sizes treated, Treatment A was more effective every time.
Treatment A looked worse overall only because it was reserved for the most difficult cases, which dragged down its overall success rate; the two options were not being compared ‘like for like’.
So-called hidden variables, such as kidney stone size, lurk in any data set, waiting to trip you up if you just cream the immediate, high-level correlations off the top. Unless you find the hidden variables, you will misunderstand the real-life causes that drive the data you are collecting.
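The effect is easy to reproduce. The sketch below uses the success counts commonly quoted for the kidney stone example; treat the exact figures as illustrative rather than authoritative.

```python
# Simpson's Paradox: Treatment B wins on the overall average,
# but Treatment A wins within every stone-size group.
# Figures are those commonly quoted for this example; treat them as illustrative.
import pandas as pd

data = pd.DataFrame({
    "treatment":  ["A", "A", "B", "B"],
    "stone_size": ["small", "large", "small", "large"],
    "successes":  [81, 192, 234, 55],
    "patients":   [87, 263, 270, 80],
})

# Just the averages: Treatment B looks better (~83% vs ~78%).
overall = data.groupby("treatment")[["successes", "patients"]].sum()
overall["success_rate"] = overall["successes"] / overall["patients"]
print(overall)

# Stratified by the hidden variable, stone size: Treatment A wins in both groups
# (small: ~93% vs ~87%; large: ~73% vs ~69%).
by_size = data.set_index(["treatment", "stone_size"])
by_size["success_rate"] = by_size["successes"] / by_size["patients"]
print(by_size)
```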
These examples use fairly consistent data sets. The issues become even more complex, with far greater room for error, when your data arrives in different formats, from different sources, and involves large numbers of variables – for example in drug development or in mapping the spread of epidemics.
Modelling the future
One of the most common uses of data is to build models, based on previous experience, to estimate the costs of future projects and to identify likely sticking points. Again, blindly trusting the data could be setting yourself up to fail.
A straightforward analysis of historical data will spot factors that consistently cause cost overruns. But more sophisticated techniques and a bit of intuition can go much further – for example, you may find that short planning time shows little correlation with cost overruns in general, but is strongly correlated with overruns in projects over a certain size, as the sketch below illustrates.
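Here is a minimal sketch of that kind of check on synthetic data; the £10m size threshold and every figure are made up. The point is simply to show a correlation measured across all projects and then again within a subgroup.

```python
# Sketch: a relationship that is diluted overall can be strong within a subgroup.
# All data is synthetic and the 10 (£m) threshold is a made-up example.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

projects = pd.DataFrame({
    "budget_m":       rng.uniform(1, 20, n),   # project size in £m (synthetic)
    "planning_weeks": rng.uniform(2, 30, n),   # planning time (synthetic)
})

# Synthetic rule: short planning time only drives overruns on the larger projects.
large = projects["budget_m"] > 10
noise = rng.normal(0, 5, n)
projects["overrun_pct"] = np.where(large, 30 - projects["planning_weeks"], 5) + noise

# Correlation across all projects (diluted by the smaller projects)...
print("All projects:  ", projects["planning_weeks"].corr(projects["overrun_pct"]).round(2))

# ...versus the correlation within large projects only (much stronger).
print("Large projects:", projects.loc[large, "planning_weeks"]
                                  .corr(projects.loc[large, "overrun_pct"]).round(2))
```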
Most importantly, you need to understand why these relationships exist. If one factor consistently reduces costs, can you be confident it will continue to do so in a new market where conditions are different? If you don’t understand your data you can’t make such predictions.
Blind faith in analytics black boxes can lead to over-interpreting the correlations they churn out, only to find that plans don’t work out in new contexts.
Getting useful insight from data
All this shows the dangers of relying on correlations alone. Any inference you make from a data set has to take into account what drives and, above all, what constrains the generation and compilation of that data.
A lot of organisations have bought analytics platforms, assuming they have all the answers. But many problems are too complex and too subtle to completely automate. They need human expertise to frame the problem.
Data platforms don't tell you whether your original data is valid or even complete, and they struggle to provide meaningful results from data held in different formats and drawn from different sources.
In most cases there is an inescapable need to go beyond identifying relationships in the data, to understanding the mechanisms behind those relationships. This is vital in allowing you to understand whether your results are valid for making important financial, business or health and safety decisions.
Looking deeper into why the data appears as it does, and seeing what it’s really measuring, suddenly makes the world look very different.
By understanding not just what happens, but why it happens, organisations can make much better decisions on a wide range of complex topics, from R&D to market research to cost forecasting. Only then will they extract the promised value from their data analytics investments.
Sourced from Nick Clarke, head of analytics, Tessella