When you’re creating and building analytic models, it’s important to realise just how much time, money and energy goes into them. The models ultimately generate the insights that enable teams to make better decisions, faster.

The scores (or data) a model generates must reach decision-makers quickly to be relevant, and they must stay accurate. Once a model has been validated and is ready for use, the process that consistently delivers its results needs to be flawless. From model creation to deployment, maintaining the accuracy and consistency of those results is one of the most important tasks you’ll undertake.

With one model, this may seem an easy task to accomplish. But as systems, problems and teams grow more complex, it can become paralysing and cost-prohibitive for an organisation. In fact, anything that runs continuously on a business’s computing infrastructure must be handled in a consistent, error-proofed manner.
Don’t wait until the moment has passed to get the information needed for a critical business decision; it’s not worth the risk of that information being wrong.

So, when you’re building your model, what can you do to prevent these critical missteps along the path from analytic creation through IT handoff to production?
The data science world is experimental and research-oriented, and solutions are often the result of a deep understanding of the problem domain and the available data. This experimental nature requires a modelling environment with the freedom to try new tools and libraries/packages (the model’s dependencies).
Once the data science team has designed, trained, evaluated and selected a candidate model for production, the model creator should just be able to give it to IT, right?
Unfortunately, this is not the case in most organisations. The data science and IT teams need solutions that bring them together every time a model is pushed into production, and the importance of this step can’t be overstated. The process is further complicated whenever the model, or the system around it, needs updating.
To ensure that the model will actually run, it must be able to access all of its dependencies while executing in the production environment. It must also receive the correct production data and send its scores to the right place.

From there, the model must pass testing on all fronts, and the system must be set up for monitoring and scalability so it can be improved over time.
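To make that concrete, here is a minimal sketch of what such a scoring entry point might look like in Python. The model artefact, field names and output shape are hypothetical, not any particular product’s convention; the point is that inputs, outputs and logging are explicit, so they can be tested and monitored.

```python
import json
import logging
import pickle
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scorer")

# Hypothetical artefact and field names; the real contract is whatever
# the data science and IT teams agree on.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

def score(record: dict) -> dict:
    """Score one production record and return the agreed output shape."""
    features = [[record["age"], record["balance"], record["tenure"]]]
    result = {"id": record["id"], "score": float(model.predict(features)[0])}
    log.info("scored record %s", record["id"])  # hook for monitoring
    return result

if __name__ == "__main__":
    # Newline-delimited JSON in, scores out, so the surrounding system
    # decides the transport (files, queues, HTTP) without touching the model.
    for line in sys.stdin:
        print(json.dumps(score(json.loads(line))))
```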
For the model to run properly in the production environment, it must have access to its required dependencies. If that is not already the case, the gap must be closed either in the model or in the production environment, and changing the production environment can be both risky and labour-intensive.
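On the model side, one defensive step is to verify dependencies at startup rather than assume the environment provides them. A minimal sketch using Python’s standard importlib.metadata; the packages and version pins here are purely illustrative:

```python
from importlib.metadata import PackageNotFoundError, version

# Illustrative pins: replace with the versions the model was trained against.
REQUIRED = {"scikit-learn": "1.3.2", "pandas": "2.1.4"}

def check_dependencies() -> None:
    """Fail fast, with a clear message, if the environment doesn't match."""
    problems = []
    for package, expected in REQUIRED.items():
        try:
            found = version(package)
        except PackageNotFoundError:
            problems.append(f"{package} is not installed (need {expected})")
            continue
        if found != expected:
            problems.append(f"{package}=={found}, but the model needs {expected}")
    if problems:
        raise RuntimeError("dependency mismatch: " + "; ".join(problems))

check_dependencies()
```

A check like this turns a silent mismatch into an immediate, diagnosable failure before any scores are produced.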
More often than not, changing the production environment requires approval. Approval processes take time, which leads many companies to change the model instead. In some cases the model is rewritten in a language the production environment already supports, such as from R to Java.

Other models are sent back to the data science team with instructions to rebuild them without certain libraries.

Either way, the data scientist is back at square one with even more constraints than before, and the process rarely succeeds on the first pass, resulting in an ongoing, time-intensive back and forth between IT and data science.
To correct this problem, data scientists need a way to test the execution of their models in an environment identical to production, without having to involve IT.
By nature, this environment must support an ever-growing list of the libraries and languages data scientists use, and do so in an isolated way so as not to disrupt other applications on the production servers.
One way to achieve this is with a Docker-based ecosystem, which lays the groundwork for such a solution: a containerised analytic engine provides a portable way to deploy, test and push models into production. Container portability means models can be validated early, by the data science team, reducing the back and forth between data science and IT.
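As a rough sketch of that workflow, using the Docker SDK for Python (pip install docker), the data science team can build and exercise the very image IT will later deploy. The image tag, Dockerfile location and sample file below are hypothetical:

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# Build the candidate image from the project's Dockerfile.
# Path and tag are placeholders for the team's own conventions.
image, build_log = client.images.build(path=".", tag="churn-model:candidate")

# Run the container the same way production will, scoring a sample file
# baked into the image, and capture what it writes to stdout.
output = client.containers.run(
    "churn-model:candidate",
    command='sh -c "python score.py < sample_input.jsonl"',
    remove=True,
)
print(output.decode())  # inspect (or assert on) the scores before handoff
```

Because the same image is what ships, a model that scores correctly here has already been validated against its production dependencies.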
With a plan for the solution in place, it’s time to lay the groundwork for shared operating guidelines. This comes down to communication. Problems arise between the IT and data science teams when one of them dominates putting the model into production; if the other isn’t involved in engineering discussions, you’re back to square one with incorrect data.
The handoff is an incredibly intricate phase of model deployment. You’ll need to ensure teams lay out clear expectations and testing practices for each person and component involved. This reduces the time to deploy, ensures quality at each step and provides a durable framework for deployment.
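One lightweight way to make those expectations concrete is to encode them as a test both teams run before sign-off. A minimal sketch, assuming the containerised engine exposes a hypothetical HTTP /score endpoint on port 8080 and the output fields from the earlier sketch:

```python
import requests  # pip install requests

SCORING_URL = "http://localhost:8080/score"  # hypothetical endpoint

def test_score_contract():
    """Pins the agreed input and output shapes in executable form."""
    record = {"id": 1, "age": 42, "balance": 1800.0, "tenure": 5}
    response = requests.post(SCORING_URL, json=record, timeout=5)
    assert response.status_code == 200
    body = response.json()
    assert set(body) == {"id", "score"}  # agreed output fields
    assert 0.0 <= body["score"] <= 1.0   # agreed score range
```

When a test like this lives in the repository, the handoff criteria are no longer tribal knowledge; they are checked on every deployment.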
A solution like this can be integrated incrementally into legacy systems, and it makes room to adopt the new technologies needed for top performance.

More importantly, it breaks away from traditional monolithic thinking: the result is more flexible, highly configurable and easy to implement, and it introduces best practices and standards.

This handoff should be repeatable over time, so the model remains a long-lived, durable asset within the organisation. It starts with understanding the respective responsibilities of IT and the data scientists, and opening lines of communication so the process runs without wasting valuable time or resources.
Sourced by Rehgan Avon, product manager at Open Data Group