Over the past 3 years, we’ve heard a lot about Feature Stores. While they might not sound like much, over time, they’ve become table stakes for enterprises building their offerings on ML.
The rapid adoption of feature stores, where they’re starting to become mainstream instead of being a niche restricted to big-tech, can largely be credited to the efforts of forums like Featurestore.org — an international community of users and developers of Feature Store platforms for machine learning.
One of the initiatives of Featurestore.org is their annual summit, Feature Store Summit (FSS), catering to anyone interested in the discipline of feature engineering, working with feature stores, or just curious about all that’s new in the space.
With the evolution of MLOps, there’s a lot of emphasis on the productionization of data to achieve outcomes. In keeping with the times, this year’s event revolved around the theme of ‘Accelerating Production Machine Learning with Feature Stores’. Scribble Data had the good fortune of sharing the stage with an all-star lineup of speakers at FSS 2022 from companies like Uber, LinkedIn, Airbnb, DoorDash, Disney Streaming, and many more. It was great to hear all about the good, bad and ugly sides of feature stores and learn from the speakers’ experiences.
Our takeaways from the event:
Feature stores are no longer niche. EVERYONE knows about them
Big tech building their own feature stores isn’t exactly breaking news. For companies building in-house feature engineering capabilities, though, these are the components considered important: pipeline orchestration, feature and model engineering, the storage layer, model observability, and metrics.
That feature stores are now table stakes was clear from the agenda: of the 18 talks, speakers from 10 companies spoke about their proprietary feature stores (it was interesting to note that four of these companies are feature store vendors):
- Representative companies that have chosen to build their feature store in-house:
  - Uber
  - DoorDash
  - Airbnb
  - LinkedIn (on Azure)
  - Disney
  - Stitch Fix
- Feature store companies / vendors:
  - Hopsworks
  - Featureform
  - dotData
  - Scribble Data (Enrich)
There are different flavors of ML that suit different needs of users
FSS also acknowledged the different flavors of Machine Learning. With its widespread adoption, there are different types of personnel responsible for the productionization of data in organizations. Who they are depends on the needs of their organizations, as well as the volume and complexity of data.
- Artisanal ML: This is a creative and exploratory approach to Machine Learning, usually led by citizen scientists. The focus is on building experimental models before they are scaled / industrialized into something that’s more repeatable. Most of this work is done on the scientists’ laptops, in Jupyter notebooks.
- Analytical ML: This approach to ML is followed in companies which have teams of data scientists, but where MLOps isn’t exactly high on the priority list. There is modeling effort involved, but most models are “built offline and thrown over the wall” to ML engineers. At Scribble Data, we’ve observed that most organizations are unsure about whether they require a feature store at this stage.
- Operational ML: Operational ML uses ML models to autonomously make mission-critical business decisions. These models run “online” in production on a company’s operational data stack. Operational ML depends heavily on MLOps tools, which is why it requires serious infrastructure and ML talent. It’s impossible to get to this stage without a feature store.
- Operational ML (in real time): This approach builds on operational ML but adds complexity, since real-time constraints and guarantees need to be met. That’s why it becomes important to maintain parity between offline and online features.
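To make the offline/online parity point concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any of the feature stores mentioned at FSS work: the `Order` record, the average-order-value feature, the `OnlineStore` class, and the `parity_gaps` helper are all hypothetical names. The idea is that the batch (offline) path and the serving (online) path often compute the same feature in different ways, so a parity check compares the two and flags entities where they disagree.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Order:
    """Hypothetical raw event used to derive a feature."""
    user_id: str
    amount: float


def build_offline_features(history: List[Order]) -> Dict[str, float]:
    """Offline (batch) path: average order value per user from full history."""
    by_user: Dict[str, List[float]] = {}
    for order in history:
        by_user.setdefault(order.user_id, []).append(order.amount)
    return {user: sum(vals) / len(vals) for user, vals in by_user.items()}


class OnlineStore:
    """Online (serving) path: keeps running totals so lookups stay cheap."""

    def __init__(self) -> None:
        self._sum: Dict[str, float] = {}
        self._count: Dict[str, int] = {}

    def ingest(self, order: Order) -> None:
        self._sum[order.user_id] = self._sum.get(order.user_id, 0.0) + order.amount
        self._count[order.user_id] = self._count.get(order.user_id, 0) + 1

    def get(self, user_id: str) -> float:
        count = self._count.get(user_id, 0)
        return self._sum[user_id] / count if count else 0.0


def parity_gaps(offline: Dict[str, float], online: OnlineStore,
                tolerance: float = 1e-9) -> List[str]:
    """Return the users whose online value differs from the offline value."""
    return [user for user, value in offline.items()
            if abs(online.get(user) - value) > tolerance]


if __name__ == "__main__":
    history = [Order("a", 10.0), Order("a", 20.0), Order("b", 5.0)]
    offline = build_offline_features(history)

    store = OnlineStore()
    for order in history:
        store.ingest(order)
    print(parity_gaps(offline, store))  # no gaps while both paths saw the same data

    # A new event reaches the online path before the next batch run.
    store.ingest(Order("a", 30.0))
    print(parity_gaps(offline, store))  # user "a" now differs
```

In practice the gap in the second check is expected freshness drift rather than a bug; a real check would likely compare both paths as of the same point in time.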
While most of these approaches are categorized based on the type of modeling and the effort, at Scribble Data we believe that not all approaches to ML require modeling. To this end, Achint Thomas, Scribble Data’s Data Architect, spoke about Sub-ML (more on this in a bit) – which focuses on the long tail of use cases in an organization and can put data to work, irrespective of the size and complexity.
Machine Learning at Reasonable Scale, Postmodern stack, ML without the MLOps or Sub-ML …
… whatever you choose to call it, the long-tail use cases in ML are here to stay, much as we observed at the TLMS MLOps World Summit! While most of the talks focused on traditional Machine Learning, and on feature engineering specifically for ML model building and serving, we feel there’s a largely unaddressed set of users and use cases that can benefit from a feature engineering approach focused on faster outcomes. At Scribble Data, we call this Sub-ML.
One company we found particularly interesting, because it isn’t focusing solely on data engineers and data scientists, was AtScale. They’ve built a semantic layer to make data actionable for business insights, and they are more focused on business users. We’re looking forward to seeing how an increased focus on business users can help drive ML adoption in the enterprise.
Summary
Based on what we observed at FSS 2022, there are some interesting directions that feature stores might take. We’re looking forward to seeing how they work out in the near future. But one thing’s for certain – build or buy, feature stores are here to stay! We can’t wait for the next edition of FSS, and hope to showcase more updates on Sub-ML and our Enrich feature store.
If you’d like to view a recording of Achint’s talk on fast Sub-ML use case development using feature stores, you can now watch it on demand here (Slides available here). If you’d also like to watch his panel discussion on the challenges of making the feature store disappear and become part of the workflow of data science and data engineering, you can watch it here.