To the sceptics out there who don’t believe in creating another management position, what exactly does a chief data officer do and why is it important?
You’ve got to think of it more as a business-aligned service than an IT-aligned service. Of course it’s technology, but most of what we do today with anything is technology-based. The purpose of a chief data officer, especially in a large organisation, is first of all to understand the landscape of the company’s data, what it’s about, what can be and can’t be tapped, and what’s available in what form and how. They don’t have to become a developer or programmer, they just have to understand what they’ve got at their fingertips. Then they have to understand the questions that are key to the good running of the business, and whether there is any kind of analysis that could be run to enlighten or enhance those needs. So it’s a function which creates value based on being able to use the data within the organisation.
Some IT departments are not comfortable with the idea of losing control and governance of their companies’ data. Are they right to try to hold on?
I think if data is let loose without any control or management then it could be a disaster because if someone comes along and changes how the data is organised and classified, you’ve then got multiple versions of the truth. The CDO and the data scientists need to be very much aligned and working with IT, but they mustn’t be seen as working under IT. The CDO looks at how to generate value from data that will have an impact on the business, while IT makes sure they have the best tools to hand to keep the business operational.
Should it be a concern for the chief data officer how various data solutions and initiatives are housed in the organisation’s infrastructure?
To be honest, where the data is stored is only going to be a function of its structure, complexity and size. Certain data does very well in Hadoop, others work very well in MongoDB, some very well in relational databases. I don’t think the world should think it’s either Hadoop or nothing, or whatever it may be. The question shouldn’t be: what infrastructure have I got or what container is my data going in? That will be a function of the data. What you really want to focus on is: what is the problem I’m trying to solve, how can I solve it with the kit I have, and do I need a different approach? So I wouldn’t get hung up on the infrastructure layer. We have clients that use a combination of Hadoop, MongoDB, SQL, Azure Blob – we integrate all the data into a certain structure. As long as this stuff gets moved and is used, I’m not fussed how it gets there.
It’s critical for the people who need to store the data and run the algorithms, but as a chief data officer I’m looking at how I’m going to find value in the underlying data. You’ve got a data container level with all these buzzwords in it; you’ve then got a data integration-enrichment-cleansing level; then above that is the human-driven machine-learning element, which is generating value from the data with insights from the individuals; and then you’ve got the distribution or reporting of the data and development of applications. The hierarchy of all these processes needs to be in place, but the definition of exactly what those are is not the key issue.
How many organisations would you estimate have this hierarchy in place and are doing it well?
I think most of them have a problem with integration. I would expect the vast majority – 80% to 90% of businesses – will not have a seamless integration between those elements because they’re formed for the purpose of running a business, not for the purpose of data analytics. At the infrastructure layer, a lot of companies we come across do obviously have a data warehouse, but is every single piece of data that is of value in there? Usually the answer is no. But what they do is kind of import it in a special way or harmonise it somewhere else – and then use their old legacy system to do something else – and it ends up somehow in a document that is queried from an Excel spreadsheet.
That’s where most people are; they’re trying to do data analysis and science on a platform or structure that was never designed for that purpose. They’re using whatever they can to get the data out, then run a whole batch of statistical analysis and put it into a PowerPoint presentation or data visualisation tool to do representation analysis on what they found. That’s usually a one-time event. It can be repeated sometimes but most people are doing it as part of a one-time exercise that is not repeated or scalable. They are getting there and I think people see the value, but the vast majority do not have a platform that has been designed to be well suited for this type of analysis, which is one of the struggles that CDOs and data scientists have. They’ve got to make do with what is available, and that’s quite a challenge.