Thursday 26 May, 2020

The recent past has seen a slew of announcements from companies about feature stores they’re building. More are scheduled for later this year. The Feature Stores for ML resource tracks well-known feature stores developed and released by a number of companies. Scribble Data has built and operated feature stores for customers over the past few years, and in this document we share our framework for understanding and evaluating feature store designs.

Photo: Andi Whiskey @unsplash.com

Feature Stores

Machine Learning models require prepared input data, which often looks like a table. This table is called a feature matrix: one row per record and one column per feature. Features are recomputed frequently and are consumed by models through a standardized interface. The engineering behind each feature must therefore keep it updated at a frequency appropriate to its context. The system that computes, stores, manages, and serves features is a Feature Store.
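
As a minimal sketch of what a feature matrix looks like in practice, the snippet below turns a handful of raw order events into one row per customer with two aggregate features. The data, the feature names (order_count, avg_order_amount), and the build_feature_matrix helper are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical raw events: (customer_id, order_amount). Illustrative data only.
orders = [
    ("c1", 20.0),
    ("c1", 35.0),
    ("c2", 10.0),
]


def build_feature_matrix(events):
    """Return one row per customer, with one entry per feature."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for customer_id, amount in events:
        totals[customer_id] += amount
        counts[customer_id] += 1
    # Each key is a record (row); each inner key is a feature (column).
    return {
        cid: {
            "order_count": counts[cid],
            "avg_order_amount": totals[cid] / counts[cid],
        }
        for cid in sorted(totals)
    }


feature_matrix = build_feature_matrix(orders)
for customer_id, features in feature_matrix.items():
    print(customer_id, features)
```

Rerunning build_feature_matrix on fresh events is the simplest form of keeping features updated; a feature store adds the scheduling, storage, and serving around that step.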

As data science and its productionization accelerate, the importance of feature stores is growing:

“...we added a layer of data management, a feature store that allows teams to share, discover, and use a highly curated set of features for their machine learning problems.  We found that many modeling problems at Uber use identical or similar features, and there is substantial value in enabling teams to share features between their own projects and for teams in different organizations to share features with each other.” -  Uber’s Michelangelo Blog

Understanding Feature Stores

Traditionally, feature engineering and data preparation were done by data scientists together with ML engineers. Including a feature store in the ML stack helps streamline this work and make it more robust, but it is also one more subsystem inserted into an organization's overall data ecosystem, and therefore adds complexity. Feature stores emerged as a natural result of the growing number of data science use cases. Organizations need:

 

  1. Efficiency: A platform to enable asset reuse and speed up the development process

  2. Control: A way to achieve a high degree of control, trust, or availability due to regulatory, security, financial, or other reasons

  3. Controlled evolution: In fast moving environments with large numbers of continuously evolving models, a way to reduce confusion by imposing processes and standards. This is key for debugging, re-training, and also version improvements in the models

 

These objectives are met using a combination of technologies and processes, and are context-specific. Across industries, as data science teams mature, we find:

  1. Feature stores are becoming critical, and coming earlier in the data science journey of organizations

  2. Features range from simple and relatively passive (e.g. aggregates), to complex model outputs (e.g. penultimate activation layer weights in a deep neural network)

  3. Features evolve continuously, reflecting the evolution in understanding of the underlying data and approaches to modelling it

  4. Batch and realtime paths are separated in most cases, with reuse in feature computation code 

  5. Feature stores are not monolithic; rather, they comprise a number of smaller subsystems, each of non-trivial complexity and each with its own implementation considerations

  6. Large systems are more constrained and tend to be narrower in scope by design

  7. Mid-scale feature stores tend to be more integrated and broader 
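
The observation about separate batch and real-time paths that reuse feature computation code can be sketched as follows. This is an illustrative assumption about how such reuse might look, not a description of any particular feature store: the trip record fields, the trip_features function, and the dictionary standing in for an online store are all hypothetical.

```python
def trip_features(trip):
    """Shared feature logic, used unchanged by both paths."""
    duration_s = trip["duration_s"]
    return {
        "duration_min": duration_s / 60.0,
        # Guard against zero-length trips to avoid division by zero.
        "avg_speed_kmph": (
            trip["distance_km"] * 3600.0 / duration_s if duration_s > 0 else 0.0
        ),
    }


def batch_path(trips):
    """Batch path: recompute features for a whole set of historical records."""
    return {trip["trip_id"]: trip_features(trip) for trip in trips}


def realtime_path(trip, online_store):
    """Real-time path: compute features for one incoming record and publish them."""
    online_store[trip["trip_id"]] = trip_features(trip)
    return online_store[trip["trip_id"]]


history = [
    {"trip_id": "t1", "distance_km": 10.0, "duration_s": 1200},
    {"trip_id": "t2", "distance_km": 3.0, "duration_s": 600},
]

offline = batch_path(history)
online = {}  # stand-in for a low-latency key-value store
live = realtime_path(history[0], online)

# Both paths agree because they call the same feature function.
print(offline["t1"] == live)
```

Keeping the feature definition in one function means the values a model sees during training (batch) and during serving (real-time) are computed the same way, even though the surrounding infrastructure differs.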

Architecture of Feature Stores

Looking at various feature stores built by companies such as Uber and Airbnb, we find that they vary in their choice of technology components and in implementation specifics. This is a function of several factors, including their tech stack, scale of operations, and organization-specific needs. Further, the conceptual boundaries of feature stores vary across organizations. In some implementations, the feature store refers only to the storage and serving of features. In others, it includes transformations as well. We see at least eight design points addressing these diverse needs.
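
To make the boundary question concrete, here is a toy in-memory sketch in which storage and serving are always present and transformations are an optional extra. The class name MiniFeatureStore and its methods (register, ingest, materialize, serve) are hypothetical and do not correspond to the API of any of the systems named in this article.

```python
from typing import Callable, Dict, Iterable, List, Optional


class MiniFeatureStore:
    """Toy feature store: optional transforms, plus storage and serving."""

    def __init__(self) -> None:
        self._transforms: Dict[str, Callable[[dict], float]] = {}
        # entity_id -> feature name -> latest value
        self._values: Dict[str, Dict[str, float]] = {}

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        """Transformations: declare how a feature is computed from a raw record."""
        self._transforms[name] = fn

    def ingest(self, entity_id: str, features: Dict[str, float]) -> None:
        """Storage only: accept feature values computed elsewhere."""
        self._values.setdefault(entity_id, {}).update(features)

    def materialize(self, records: Iterable[dict], key: str) -> None:
        """Compute every registered feature for each record and store the result."""
        for record in records:
            computed = {name: fn(record) for name, fn in self._transforms.items()}
            self.ingest(record[key], computed)

    def serve(self, entity_id: str, names: List[str]) -> List[Optional[float]]:
        """Serving: return requested features in order, None where missing."""
        row = self._values.get(entity_id, {})
        return [row.get(name) for name in names]


store = MiniFeatureStore()
store.register("basket_value", lambda r: r["price"] * r["quantity"])
store.materialize([{"user_id": "u1", "price": 2.5, "quantity": 4}], key="user_id")
store.ingest("u1", {"days_since_signup": 30.0})  # computed outside the store

served = store.serve("u1", ["basket_value", "days_since_signup", "missing"])
print(served)  # [10.0, 30.0, None]
```

A store that exposes only ingest and serve matches the narrower definition; adding register and materialize pulls transformations inside the boundary.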

 

We abstract out the basic elements of a feature store in the picture below. They apply to both batch and real-time paths.

Examples

Palette is Uber’s feature store, a subsystem of Michelangelo. It operates at petabyte scale. Its core is built around highly scalable compute systems such as Hive and Spark, and it is designed for a high-performance modeling system involving thousands of models and hundreds of millions of users.

FEAST: GoJEK is an Uber competitor in Southeast Asia. FEAST is GoJEK’s feature store, and is focused on serving and monitoring features through a simple API. By standardizing dataflows, naming, and computations, FEAST leverages Beam and BigQuery to simplify the overall system.

SurveyMonkey is interesting because it operates at a different scale than Michelangelo and FEAST, with a couple of dozen ML use cases and data in the sub-petabyte range. Its ML engineering team has built a simpler, maintainable, fit-for-purpose system.

Enrich, Scribble’s customizable feature store product, operates in the mid-TB range and is designed for mid-market enterprises with a few dozen use cases. Its architecture emphasizes simplicity, auditability, and evolvability.

Selecting a Feature Store

There are two basic questions: 

  1. Does my organization need a feature store? 

  2. If so, what characteristics should it have? 

 

The first question is easier to answer than the second one. An organization needs a feature store if: 

  1. Duplication - Models have to be (re)trained frequently 

  2. Collaboration - You have three or more data scientists who have to understand and leverage each other’s work

  3. Coordination - The combination of models and their versions exceeds some threshold (10 is a good cutoff)

  4. Risk - The cost of an error (e.g., debugging, customer impact) is high, and you need a way to systematically build, monitor, and evaluate features
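
The four criteria above can be written down as a simple checklist. The sketch below is only an illustration: the OrgProfile fields and the feature_store_signals function are hypothetical names, and because the article does not say whether one criterion or all four must hold, the function simply reports which criteria are met.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OrgProfile:
    """Hypothetical summary of an organization's ML situation."""
    frequent_retraining: bool   # models have to be (re)trained frequently
    data_scientists: int        # people who must build on each other's work
    model_versions: int         # combined count of models and their versions
    high_cost_of_error: bool    # debugging or customer impact is expensive


def feature_store_signals(org: OrgProfile) -> List[str]:
    """Return the names of the criteria that this organization meets."""
    signals = []
    if org.frequent_retraining:
        signals.append("Duplication")
    if org.data_scientists >= 3:   # "three or more data scientists"
        signals.append("Collaboration")
    if org.model_versions > 10:    # "exceeds some threshold (10 is a good cutoff)"
        signals.append("Coordination")
    if org.high_cost_of_error:
        signals.append("Risk")
    return signals


example = OrgProfile(
    frequent_retraining=True,
    data_scientists=4,
    model_versions=6,
    high_cost_of_error=False,
)
print(feature_store_signals(example))  # ['Duplication', 'Collaboration']
```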

The main drivers of a feature store's design are:

  1. Scale of data - Terabytes or petabytes 

  2. Model Integration - Explicit (Machine Learning),  Implicit (Deep Learning), Federated

  3. Developer Resources - Low vs high

  4. Guarantees - Best-effort vs explicit

  5. Pace of Evolution - Fast vs Slow vs Not-specified

  6. Emphasis - Depth vs Breadth

  7. Domain Knowledge - Low vs High

 

By and large, feature stores are used only in the traditional Machine Learning context, but there are also Deep Learning experts who suggest that similar requirements exist in that context as well. We anticipate that there will be feature stores in federated ML as well. 

We can characterize the systems mentioned above:

 

These drivers are not independent of each other, and we don't expect 66 design points. We foresee at least eight design points mainly driven by the scale, model integration, and emphasis. The first drives the robustness requirements, the second determines the programming abstraction, and the last drives the scope of the platform.

A few observations: 

  1. Most of the available feature stores operate at high TB or PB scale. It is not an accident that ML-driven organizations need a feature store first. We expect that most future deployments of feature stores will be at the lower end of the spectrum due to the number of companies deploying ML models.

  2. Traditionally, performance was the single most important criterion for evaluating feature stores. Emerging considerations include provenance and developer support.

  3. We expect domain-specific feature stores in the future that provide out-of-the-box transforms, integrations, and workflows.

Summary

Feature stores are used to create, store, and serve feature data for ML models. This is a fast-emerging space that will see many more implementations to suit various organizational, data, and modeling requirements, especially as organizations look to standardize and streamline the deployment of models into production. Feature stores will also play a key role in debugging and retraining models and in enabling version improvements, all of which will be key to organizations looking to maintain their data edge over their competitors.


© 2020 by ScribbleData.