“Kedro can change the way data scientists and engineers work,” explains QuantumBlack’s product manager Yetunde Dada, “making it easier to manage large workflows and ensuring a consistent quality of code throughout a project.”
One small step
McKinsey has never before created a publicly available, open source tool.
“It represents a big step for the firm,” notes Jeremy Palmer, CEO of QuantumBlack, “as we continue to balance the value of proprietary assets with opportunities to engage as part of the developer community, and accelerate as well as share our learning.”
Kedro
The name Kedro derives from the Greek word meaning centre or core. The name gives some clue to its purpose: it is open source software that provides crucial code for ‘productionising’ advanced analytics projects. Essentially, it is a development workflow framework.
Kedro has two major benefits, according to today’s announcement.
- It allows teams to collaborate more easily by structuring analytics code in a uniform way so that it flows seamlessly through all stages of a project. This can include consolidating data sources, cleaning data, creating features and feeding the data into machine-learning models for explanatory or predictive analytics.
- Kedro also helps deliver code that is ‘production-ready,’ making it easier to integrate into a business process.
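The stages listed above (consolidating, cleaning, creating features, modelling) can be pictured as small functions wired together by the names of the datasets they read and write. The following is a minimal plain-Python sketch of that idea, not Kedro's actual API; the `Node` type, the `run_pipeline` helper and the stage functions are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    """One pipeline step: a function plus the named datasets it reads and writes."""
    func: Callable
    inputs: List[str]
    output: str


def run_pipeline(nodes: List[Node], data: Dict[str, object]) -> Dict[str, object]:
    """Repeatedly run every node whose inputs are available until none remain."""
    data = dict(data)  # work on a copy so the caller's inputs are untouched
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("Unresolvable inputs: check dataset names")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


# Hypothetical stages mirroring the ones named in the article.
def consolidate(sales, returns):
    return sales + returns


def clean(rows):
    return [r for r in rows if r is not None]


def make_features(rows):
    return [(r, r * r) for r in rows]


def fit_mean_model(features):
    return sum(x for x, _ in features) / len(features)


# Nodes are deliberately listed out of order: execution order comes from
# the dataset names, not from the position in the list.
pipeline = [
    Node(fit_mean_model, ["features"], "model"),
    Node(make_features, ["clean_rows"], "features"),
    Node(clean, ["all_rows"], "clean_rows"),
    Node(consolidate, ["sales", "returns"], "all_rows"),
]

results = run_pipeline(pipeline, {"sales": [1, None, 3], "returns": [2]})
```

Because each stage is an ordinary function with declared inputs and outputs, it can be unit-tested on its own and reused in another pipeline, which is the kind of uniform structure the announcement describes.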
“Data scientists are trained in mathematics, statistics and modelling — not necessarily in the software engineering principles required to write production code,” explains Dada. “Often, converting a pilot project into production code can add weeks to a timeline, a pain point with clients. Now, they can spend less time on the code, and more time focused on applying analytics to solving their clients’ problems.”
At a feature level, the open source tool can help teams build data pipelines that are ‘modular, tested, reproducible in any environment and versioned, allowing users to access previous data states,’ continues the announcement.
“More importantly, the same code can make the transition from a single developer’s laptop to an enterprise-level project using cloud computing,” explains Ivan Danov, Kedro’s technical lead. “And it is agnostic, working across industries, models, and data sources.”
Two years in the making — an age in the tech space
Two years in the making, Kedro was the brainchild of two QuantumBlack engineers, Nikolaos Tsaousis and Aris Valtazanos, and QB alumnus Peteris Erins, who created it to manage their numerous workstreams. It had started as a prototype library and was quickly being adopted by different teams when they brought it to QuantumBlack Labs, the technical innovation group led by Michele Battelli.
“Client teams can rotate into our lab and have the resources to convert a one-off piece of software or database [such as Kedro] into a viable product that can be used across industries, and that will be continually improved,” explains Battelli. “It is a powerful way of innovating; our tech teams can move faster, more efficiently, and make a lasting contribution.”
Tried and tested
McKinsey has used Kedro on more than 50 projects to date. According to Tsaousis, clients especially like its pipeline visualisation. He explains that Kedro makes conversations much easier, as clients can immediately see the different transformation stages and the types of models involved, and can backtrack outputs all the way to the raw data source.
“Kedro began as a proprietary program, but when a project was over, clients couldn’t access the tool anymore. We had created a technical debt,” Tsaousis said. “By converting Kedro into an open source tool, clients can use it after we leave a project – it is our way of giving back.”
“There is a lot of work ahead, but our hope and vision is that Kedro should help advance the standard for how data and modelling pipelines are built around the world, while enabling continuous and accelerated learning. There are huge opportunities for organisations to improve their performance and decision-making based on data, but capturing these opportunities at scale, and safely, is extremely complex and requires intense collaboration,” says Palmer. “We’re keenly interested to see what the community does with this and how we can work and learn faster together.”
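The backtracking that Tsaousis describes, following an output back to its raw data source, amounts to walking the pipeline's dependency graph. The sketch below illustrates the idea in plain Python under the assumption that the graph has no cycles; the `PRODUCED_FROM` mapping and the `raw_sources` function are hypothetical and are not part of Kedro.

```python
from typing import Dict, List, Set

# Hypothetical pipeline description: each produced dataset maps to the
# datasets that were used to create it.
PRODUCED_FROM: Dict[str, List[str]] = {
    "all_rows": ["sales", "returns"],
    "clean_rows": ["all_rows"],
    "features": ["clean_rows"],
    "model": ["features"],
}


def raw_sources(dataset: str, produced_from: Dict[str, List[str]]) -> Set[str]:
    """Return the raw datasets (those that no step produces) feeding `dataset`."""
    if dataset not in produced_from:
        # Nothing produces this dataset, so it is itself a raw source.
        return {dataset}
    sources: Set[str] = set()
    for parent in produced_from[dataset]:
        sources |= raw_sources(parent, produced_from)
    return sources


model_sources = raw_sources("model", PRODUCED_FROM)
```

A visualisation tool draws the same graph rather than printing it, but the underlying question ("which raw inputs does this output depend on?") is the one answered here.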
Features
The features provided by Kedro include:
• A standard and easy-to-use project template, allowing collaborators to spend less time understanding how you’ve set up your analytics project.
• Data abstraction, managing how you load and save data so that you don’t have to worry about the reproducibility of your code in different environments.
• Configuration management, helping you keep credentials out of your code base.
• Support for test-driven development and industry-standard code quality, decreasing operational risks for businesses.
• Modularity, allowing you to break large chunks of code into smaller self-contained and understandable logical units.
• Pipeline visualisation, making it easy to see how your data pipeline is constructed.
• Seamless packaging, allowing you to ship your projects to production, e.g. using Docker or Airflow.
• Versioning for your datasets and machine learning models whenever your pipeline runs.
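Two of the features above, data abstraction and dataset versioning, can be illustrated with a small catalog object that hides file paths behind dataset names and keeps every saved version. This is a rough plain-Python sketch rather than Kedro's own catalog: the `VersionedJSONCatalog` class is hypothetical, and it labels versions with a simple counter, which is an assumption made for brevity and not a description of how Kedro names versions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, List, Optional


class VersionedJSONCatalog:
    """Maps dataset names to files so pipeline code never hard-codes paths."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _versions(self, name: str) -> List[Path]:
        # Zero-padded file names sort in the order they were written.
        return sorted((self.root / name).glob("*.json"))

    def save(self, name: str, data: Any) -> str:
        """Write a new version of the dataset and return its version label."""
        folder = self.root / name
        folder.mkdir(parents=True, exist_ok=True)
        version = f"{len(self._versions(name)) + 1:04d}"
        (folder / f"{version}.json").write_text(json.dumps(data))
        return version

    def load(self, name: str, version: Optional[str] = None) -> Any:
        """Load a specific version, or the latest one when none is given."""
        versions = self._versions(name)
        if not versions:
            raise KeyError(f"No saved versions of {name!r}")
        path = self.root / name / f"{version}.json" if version else versions[-1]
        return json.loads(path.read_text())


# The storage location is the only environment-specific detail; a temporary
# directory stands in for it here.
catalog = VersionedJSONCatalog(Path(tempfile.mkdtemp()))
first = catalog.save("model_params", {"alpha": 0.1})
second = catalog.save("model_params", {"alpha": 0.2})

latest = catalog.load("model_params")
previous = catalog.load("model_params", version=first)
```

Moving the project to another environment then means changing the catalog's root location in configuration, while the code that calls `save` and `load` stays the same.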