Let’s assume you want to leverage data to improve one of your processes, such as partner benchmarking. Even though it’s one of your top priorities for the year, you have limited resources to spend on the partner data collection, segregation, and overall data preparation that any analysis requires. And even if you find a way around that, the approach will probably have a short shelf life in an evolving business environment. It also carries an opportunity cost: your data science team’s limited bandwidth could have been spent on use cases with a significantly higher business impact.
What do you do in such a situation, where you have a process that needs improvement, but have limited resources?
Venkata Pingali, Co-Founder and CEO at Scribble Data, recently spoke about this and similar problems that a Sub-ML approach can address. He was speaking at Featurestore.org’s meetup with Jim Dowling, CEO of Logical Clocks and Associate Professor at KTH Royal Institute of Technology.
We’ve listed some of the key takeaways and actionable insights from the meetup here, but don’t forget to watch the complete on-demand webinar here to capture every minute of the insightful conversation!
In this meetup, Venkata spoke about:
Sub-ML and its significance from a system, data science, and business perspective;
What it means to design a feature store for Sub-ML; and
Observations and predictions about where we see Sub-ML headed.
Understanding Sub-ML from a system, data science, and business standpoint
Every complicated data science use case involves the application of feature engineering and models. We find that every organization has three buckets or categories of use cases it needs to choose from, based on their business impact:
“Priority” when the number of use cases is low, but the impact is high. For example, a search function for an eCommerce company is usually a priority project where many resources are deployed with a complex infrastructure.
“Maybe,” for when the impact is medium, but the number of use cases is higher. These are typically decision support problems such as fraud detection in a marketplace. These problems tend to be second or third on the list of priorities for an organization.
“Not Important,” for when the use cases have zero to minimal impact on the business and it is hard to justify the ROI.
According to Venkata, of all the data science use cases under consideration for an organization, 70% fall into the “Maybe” bucket and 30% into the “Priority” bucket. Projects that fall under the “Not Important” bucket are left out of this count because they are not worth considering from a cost structure standpoint.
The “Maybe” bucket consists largely of decision support problems and covers 70% of the use cases. What is exciting today is how these buckets have evolved over the past few years. The “Maybe” bucket has grown by almost 50% over the past three years, as many spreadsheet-based problem statements slowly scale and graduate into it. This has made it one of the fastest-growing subspaces of use cases, which we call Sub-ML.
Venkata mentioned, “The interesting thing is that the basic data science lifecycle seems to apply to all of these Sub-ML use cases. You still have a combination of feature engineering and modeling, and it’s also a continuous process with all the checks and balances, except that it is simpler than what we can think of as Big-ML, which consumes a lot of data.” He went on to illustrate Sub-ML with the following examples.