With data touted as the oil of 21st-century digital transformation, organizations are increasingly looking to extract insights from their data by building and deploying custom ML models. In our previous article (MLOps – The CEO’s Guide to Productionization of Data, Part 1), we learned why and how embedding ML models in production via MLOps has become an integral part of companies’ digital roadmaps. However, the ML productionization journey is riddled with hurdles, such as compatibility, load balancing, and scalability, that arise when ML models move from the development stage to the production stage.
Extracting business value on your MLOps journey
Gartner, McKinsey, and others have articulated the challenges organizations face when they embark on the ML and MLOps journey. Here are a few recommendations for extracting business value from ML, based on the industry consensus and our experience:
Owning the ML solution process and outcomes
- Move to a new way of building systems. ML models and systems are probabilistic in design and operation. Internalizing the uncertainty in ML is critical for success.
- Accept that ML is NOT magic. Making ML work takes effort, often upfront. Increasing performance and accuracy is an iterative process requiring tools, experimentation, and processes that evolve with the context.
- Recognize new risks and opportunities. ML algorithms and data use bring organizations, directly or indirectly, under the purview of new privacy and algorithmic accountability laws. They also enable companies to build new data products at a pace and with a differentiation that wasn’t possible before.
Skilling for success
- Pick the right problems and approaches: A lot of time is wasted by pursuing problems that don’t have good RoI potential, or that cannot be realistically solved with existing data. Mature teams invest in good problem selection, evaluation metrics, development processes, and integration into products. Here, experience makes a difference.
- Build end-to-end discipline: ML is ultimately linear algebra or some other math. Correct operation of ML requires discipline in all phases of the lifecycle, from planning and data collection to model operations. Organizations tend to focus narrowly on the model, ignoring the rest. Even the modeling phase is chaotic. Developing and enforcing discipline is a must.
- Design for learning: All ML models degrade over time (in fact, the degradation starts from the moment the training is over), and we learn over time what matters – data quality, corner cases, etc. Continuous monitoring and improvement should be a core part of the design of any ML solution.
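One way to make the "design for learning" point concrete is to compare the data a model sees in production against the data it was trained on. The sketch below uses the Population Stability Index (PSI), a drift measure chosen here purely for illustration; the article does not prescribe a specific technique. The function name, the bin count, and the sample data are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a reference sample (e.g. training data) and a live sample
    of one numeric feature. Larger values indicate a larger distribution shift."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if the
    # reference sample is constant, to avoid dividing by zero.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_shares(expected)
    a = bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: the live feature has drifted upward by 5 units.
training_sample = [i / 100 for i in range(1000)]
live_sample = [x + 5 for x in training_sample]
drift_score = population_stability_index(training_sample, live_sample)
```

A monitoring job could compute such a score per feature on a schedule and raise a notification when it crosses a threshold. A commonly cited rule of thumb treats PSI values above roughly 0.25 as a significant shift, but the right threshold depends on the use case.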
Providing the right infrastructure
- Use tools for standardization and automation. ML development and operational processes are iterative, laborious, and error-prone. Cutting time and effort at every phase through standardization, simplification, validation, and automation helps.
- Provide checks and balances. The core value of ML is in the data and the algorithms. Risks to the organization include lost data, lost knowledge when staff leaves, and decisions that can’t be defended with clients/other stakeholders. Tools that provide checks and balances during all phases of ML are critical to protecting the value created by ML for the organization.
A sample journey could be as follows:
- Phase 1 (1 use case). Select and put basic infrastructure in place and identify one use case. Design from the get-go for continuous usage, along with data and process discipline. Achieve transparency (everyone knows what is happening), reproducibility (repeated execution), predictability (standardized outputs, locations, servers, etc.), monitoring (notifications, etc.), and consumption interfaces (APIs).
- Phase 2 (2-10 use cases). Generalize standards and processes by adding new use cases and evolving the compute and process to scale. Also, create reusable datasets, processes, and assets.
- Phase 3 (10+ use cases). Separate out teams to focus on specific phases of the ML lifecycle. Design APIs, integration mechanisms, monitoring mechanisms, etc.
There is an active debate on build-vs-buy across industries. For a long time, there was a strong preference for build, especially on the infrastructure side. What organizations are learning over time is that:
- The core value is in data ownership, good people, and end-to-end design. Therefore, organizations are freely discussing their solution design with no fear of losing a competitive edge. They are using transparency to attract good talent.
- Time is of the essence. Product development cycles are shrinking across the board. Organizations are stitching complex solutions with available resources and not waiting for the perfect product or approach.
- Emphasis on infrastructure has grown in recent years. However, building it is expensive and time-consuming, and only a few organizations, such as Uber and Google, have the budgets and direct access to data resources to do so. This motivates smaller companies to outsource their infrastructure needs and scale back their build approach over time.
- Complex algorithms will not be easily built or bought. The algorithm that won the Netflix recommendation prize was not put into production due to RoI considerations. At present, organizations are favoring simplicity and careful thinking behind ML models over complexity. As a result, explainability is taking precedence. Further, organizations are looking for models that require only minimal staff training.
The report on the interview study mentioned above also offers recommendations for addressing the challenges of putting ML models into production. Some of these are:
- For new models or model updates, organizations, especially those with a large customer base, follow a multi-step deployment procedure with progressive evaluation at each level. This procedure, known as staged deployment, delivers code through designated test clusters, then [stage 1] and [stage 2] clusters, and finally the global distribution cluster. The objective is to deploy more often along these clusters in order to detect issues before consumers are affected.
- Organizations use varying terminology (such as test, dev, canary, staging, shadow, and A/B) and have a varying number of deployment stages. The stages help weed out models that would perform badly in full production, particularly for brand-new or mission-critical pipelines.
- One of these stages is the shadow stage, which comes before deployment to a small percentage of live users. In this stage, predictions are made on live traffic but are not revealed to consumers, which makes it possible to assess the likely impact of new features without actually putting them into use.
- Shadow mode can be implemented by running the candidate model concurrently with the production model and having both produce predictions. ML engineers can then track the metrics for each model and compare them with ease. Shadow mode can also be used to persuade other stakeholders (like product managers and business analysts) that bringing a new model, or a modification to an existing model, into production is justified.
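The shadow stage described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular organization's setup: the `ShadowRouter` class, its method names, and the toy lambda models are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Assumption: a "model" is any callable that maps a feature dict to a score.
Model = Callable[[dict], float]


@dataclass
class ShadowRouter:
    """Serves the production model and records shadow predictions on the side."""

    production: Model
    shadow: Model
    log: List[Tuple[float, float]] = field(default_factory=list)

    def predict(self, features: dict) -> float:
        served = self.production(features)
        try:
            candidate = self.shadow(features)
            # Stored for offline comparison only; never returned to the caller.
            self.log.append((served, candidate))
        except Exception:
            # A failing shadow model must never affect the user-facing response.
            pass
        return served

    def mean_abs_difference(self) -> float:
        """Average gap between production and shadow predictions so far."""
        if not self.log:
            return 0.0
        return sum(abs(p - s) for p, s in self.log) / len(self.log)


# Hypothetical usage with two toy models.
router = ShadowRouter(
    production=lambda f: 0.5 * f["x"],
    shadow=lambda f: 0.6 * f["x"],
)
served_scores = [router.predict({"x": x}) for x in (1.0, 2.0, 3.0)]
gap = router.mean_abs_difference()
```

Callers only ever receive the production model's output, while the logged pairs give engineers and other stakeholders a side-by-side record to review before promoting the candidate to the next stage. A real deployment would typically run the shadow call asynchronously so that it adds no latency to the served response.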
Aligning ML evaluation metrics to product metrics:
- It’s crucial to align model performance with the company’s KPIs (key performance indicators), such as click-through rate and user churn. To ensure that the right metrics are identified, engineers should treat metric selection as an explicit phase of their process and work in tandem with other stakeholders.
- Before undertaking any new ML project, the first priority should be finding out what customers are genuinely interested in, or which metrics (features of ML model-based solutions or products) they care about. The product team must validate every model update made in production. For example, engineers working on machine learning can proceed with the deployment if a statistically significantly higher percentage of users subscribe to the product.
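To illustrate what "statistically higher" could mean in practice, the sketch below compares subscription rates between users served by the current model and users served by the updated model, using a one-sided two-proportion z-test. The choice of test, the function name, and the example counts are assumptions made for illustration; the article does not specify a method.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    conversions_a: int, users_a: int, conversions_b: int, users_b: int
) -> Tuple[float, float]:
    """One-sided test of whether group B converts at a higher rate than group A.

    Returns (z, p_value), where p_value = P(Z >= z) under the null hypothesis
    that both groups share the same underlying conversion rate.
    """
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    if std_err == 0:
        raise ValueError("Test is undefined when the pooled rate is 0 or 1.")
    z = (rate_b - rate_a) / std_err
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: current model (A) vs. updated model (B).
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
proceed_with_rollout = p_value < 0.05  # the significance level is an assumption
```

The normal approximation behind this test is reasonable for large samples like those above. For small samples or very low conversion rates, an exact test would be a safer choice, and the significance threshold should be agreed with the product team before the experiment starts.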
The best companies, at every scale, today have understood the need to have the right people, processes, and mechanisms by which they can reliably find ML use cases, build models, and use them in production deployments every day.
A thought-through approach (more time spent sharpening the axe than the actual chopping of wood) to the ML and MLOps lifecycle, including the internal processes, standards, and tool choices, will allow organizations that are getting on the ML journey to be that much more efficient and to build serious value internally as well as for their end-customers.