🏗🏪 Edge#77: How Feature Stores Were Started

Apr 06, 2021

In this issue:

we discuss what a Feature Store is;
we tell the story of how Uber Michelangelo began the Feature Store movement;
we explore the feature store market.

💡 ML Concept of the Day: What is a Feature Store

Feature stores are becoming one of the hottest buzzwords in the machine learning ecosystem. What started as a novel building block of Uber Michelangelo now seems to be one of the centerpieces of modern ML pipelines. Most MLOps platforms have started to incorporate feature storage and lifecycle management capabilities as first-class citizens. Despite the rise in popularity, the adoption of feature stores in real-world ML applications remains relatively low. In the early phases of ML projects, feature stores can be seen as overkill and are often ignored. However, as machine learning infrastructures become larger and more complex, it becomes advantageous to incorporate feature store capabilities.

The first step to facilitate the adoption of feature stores is to clearly understand where they fit in machine learning pipelines. From a functional standpoint, there are three main capabilities that should be present in any feature store solution:

Feature Transformation: Processes that extract features from raw datasets.
Feature Storage: Data storage infrastructure used to persist the state of features as well as their associated metadata. Features are not a static, point-in-time representation; they evolve with the lifecycle of a machine learning model. From that perspective, the storage component should maintain a record of historical versions of different features.
Feature Serving: APIs that can serve features for training and inference jobs in machine learning models. Typically, training feature-serving jobs operate over historical offline feature representations, while inference feature-serving jobs operate against real-time feature representations.
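To make the three capabilities above concrete, here is a minimal in-memory sketch in Python. It is not the API of Tecton, Feast, or any other real platform: the class `MiniFeatureStore`, its methods, and the `avg_order_value` feature are all hypothetical names chosen for illustration. Transformation is a registered function, storage keeps an append-only history per feature, and serving exposes a historical lookup for training and a latest-value lookup for inference.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (entity_id, feature_name)


@dataclass
class MiniFeatureStore:
    """Toy in-memory feature store; every name here is hypothetical."""
    # Feature Transformation: feature name -> function over a raw record.
    transforms: Dict[str, Callable[[dict], float]] = field(default_factory=dict)
    # Feature Storage (offline): append-only history of (timestamp, value).
    offline: Dict[Key, List[Tuple[int, float]]] = field(default_factory=dict)
    # Feature Storage (online): only the most recent (timestamp, value).
    online: Dict[Key, Tuple[int, float]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        self.transforms[name] = fn

    def ingest(self, entity_id: str, timestamp: int, raw: dict) -> None:
        """Run every registered transformation and persist the results."""
        for name, fn in self.transforms.items():
            value = fn(raw)
            key = (entity_id, name)
            self.offline.setdefault(key, []).append((timestamp, value))
            current = self.online.get(key)
            if current is None or timestamp >= current[0]:
                self.online[key] = (timestamp, value)

    def get_historical(self, entity_id: str, name: str, as_of: int) -> Optional[float]:
        """Feature Serving for training: value as it was known at `as_of`."""
        history = self.offline.get((entity_id, name), [])
        eligible = [(ts, v) for ts, v in history if ts <= as_of]
        return max(eligible, key=lambda p: p[0])[1] if eligible else None

    def get_online(self, entity_id: str, name: str) -> Optional[float]:
        """Feature Serving for inference: the latest value."""
        current = self.online.get((entity_id, name))
        return current[1] if current is not None else None


store = MiniFeatureStore()
store.register("avg_order_value", lambda raw: raw["total_spend"] / raw["orders"])
store.ingest("user_1", 100, {"total_spend": 50.0, "orders": 5})
store.ingest("user_1", 200, {"total_spend": 120.0, "orders": 8})

training_value = store.get_historical("user_1", "avg_order_value", as_of=150)
inference_value = store.get_online("user_1", "avg_order_value")
```

The historical lookup returns the value recorded before timestamp 150, while the online lookup returns the most recent one. Production systems split these two paths across different storage engines, which this sketch collapses into two dictionaries.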

In addition to these three key building blocks, feature store platforms enable all sorts of complementary capabilities such as feature versioning, usage tracking, lifecycle monitoring, and many others. Feature stores are relevant across the entire lifecycle of machine learning models, as illustrated in the following diagram:

Image credit: Tecton

“Feature Stores serve as the interface between your data and your models,” said Mike Del Balso, one of the architects of Uber Michelangelo, co-founder and CEO of Tecton. “To that end, Feature Stores should enable you to easily transform features from raw data or connect to sources of features you have already prepared, store features for serving, catalog features in a centralized registry, and serve features either online for inference or offline for model training.”

Now that we understand the key capabilities of feature stores and their relevance across the different stages of a machine learning pipeline, the remaining challenge is to determine the point at which it becomes viable to incorporate feature stores in your ML solutions. While using a feature store for a single ML model can create unnecessary friction from an infrastructure standpoint, retrofitting one into a large ML pipeline can be a nearly impossible task. The answer seems to be not in the middle but closer to the beginning 😉 In my opinion, incorporating feature stores in the early stages of machine learning pipelines can pay dividends in the form of higher productivity and more efficient lifecycles of machine learning models.

🔎 ML Research You Should Know: How Uber Michelangelo Started the Feature Store Movement

In the blog post “Meet Michelangelo: Uber’s Machine Learning Platform,” the Uber engineering team discussed the architecture behind its machine learning infrastructure, including an intriguing new concept: feature stores.

The objective: Review the key building blocks of large-scale ML infrastructures.

Why is it so important: Uber Michelangelo introduced many novel concepts of large-scale ML infrastructures, including the ideas that seeded the feature store movement.

“We built the Michelangelo platform to facilitate the growth of machine learning across teams and use cases at Uber,” said Mike Del Balso, one of the architects of Uber Michelangelo, co-founder and CEO of Tecton. “Michelangelo was really one of the first platforms of its kind. It was in the process of building the capabilities we needed to remove friction for teams at Uber to deploy ML to production that we discovered the need for and importance of Feature Stores.”

Diving deeper: Michelangelo is the centerpiece of the Uber machine learning stack. Conceptually, Michelangelo can be seen as an ML-as-a-Service platform for internal ML workloads at Uber. From a functional standpoint, Michelangelo automates different aspects of the ML model lifecycle, allowing different Uber engineering teams to build, deploy, monitor, and operate ML models at scale. Michelangelo powers hundreds of machine learning scenarios across different divisions at Uber. For instance, Uber Eats uses machine learning models running on Michelangelo to rank restaurant recommendations. Similarly, the remarkably accurate estimated times of arrival (ETAs) in the Uber app are calculated using sophisticated machine learning models running on Michelangelo that estimate ETAs segment by segment.

The architecture behind Michelangelo uses a modern but complex stack based on technologies such as HDFS, Spark, Samza, Cassandra, MLLib, XGBoost, and TensorFlow.

Image credit: Uber blog

There are many areas of innovation in the Michelangelo architecture, but feature management is in a category of its own. Based on the previous diagram, we can see that it provides feature stores for batch and online jobs powered by Hive and Cassandra, respectively. It is widely accepted that Michelangelo was the first machine learning architecture to make the concept of a feature store mainstream. However, the Michelangelo creators didn’t stop there. In addition to the centralized feature catalog, Michelangelo includes a domain-specific language (DSL), based on Scala, that abstracts feature transformations. This DSL allows data scientists to select, combine, and transform features served in the machine learning pipeline. Additionally, Michelangelo monitoring tools include detailed analysis of features and their behavior across different machine learning models.
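The idea of a DSL that selects, combines, and transforms features can be sketched with a few composable functions. The snippet below is written in Python rather than Scala and does not reproduce the syntax of Michelangelo's DSL. The helpers `select`, `combine`, `transform`, and `build_vector`, as well as the feature names in the example row, are assumptions made up for illustration.

```python
import math
from typing import Callable, Dict, Mapping

# A feature expression maps one row of stored feature values to a number.
FeatureFn = Callable[[Mapping[str, float]], float]


def select(name: str) -> FeatureFn:
    """Pick a stored feature by name."""
    return lambda row: row[name]


def combine(left: FeatureFn, right: FeatureFn,
            op: Callable[[float, float], float]) -> FeatureFn:
    """Merge two feature expressions with a binary operation."""
    return lambda row: op(left(row), right(row))


def transform(inner: FeatureFn, fn: Callable[[float], float]) -> FeatureFn:
    """Apply a unary function to a feature expression."""
    return lambda row: fn(inner(row))


def build_vector(spec: Dict[str, FeatureFn],
                 row: Mapping[str, float]) -> Dict[str, float]:
    """Evaluate every expression in the spec against one row."""
    return {name: fn(row) for name, fn in spec.items()}


# Hypothetical stored features for a single trip.
row = {"trip_distance_km": 12.0, "trip_duration_min": 30.0, "basket_total": 99.0}

spec = {
    "distance": select("trip_distance_km"),
    "speed_km_per_min": combine(
        select("trip_distance_km"),
        select("trip_duration_min"),
        lambda distance, duration: distance / duration,
    ),
    "log_basket": transform(select("basket_total"), math.log1p),
}

vector = build_vector(spec, row)
```

Because the spec is plain data, the same expressions can be evaluated against rows coming from a batch store during training and from an online store during inference, which is the kind of consistency a shared transformation layer is meant to provide.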

Features, their impact on the model, and their interactions can be explored through a feature report.

Image credit: Uber blog

Michelangelo continues powering machine learning workloads at Uber and has modernized quite a bit in the last few years. New versions of Michelangelo have incorporated various new capabilities, including integrations with modern open-source machine learning tools and frameworks. However, the impact of Michelangelo expands far beyond Uber. Among other things, Michelangelo is widely credited as the effort that started the feature store movement. Part of the team behind Michelangelo spun out of Uber to create Tecton, one of the most complete feature store platforms in the current machine learning market.

🤖 ML Technology to Follow: Five Feature Store Platforms You Should Know About

Why should I know about this: A quick overview that can help data scientists understand and evaluate the top feature store platforms in the current market.

What is it: The rapid growth of the feature store space brings tremendous levels of innovation to the ML space, but it can also be overwhelming for data science teams looking to incorporate those capabilities in their ML solutions. The market is getting to the point where it is difficult to separate signal from noise in the feature store landscape. In this section, we outline five platforms that have achieved relevant traction on their own merits and should be considered by data science teams when evaluating feature store capabilities.

Tecton

Tecton can be considered one of the pioneers of the feature store space and one of its most complete and innovative platforms. Created by the team who built Uber’s Michelangelo platform, Tecton has surrounded the key feature store components with several enterprise-grade capabilities that streamline the adoption of the platform in complex machine learning infrastructures. Tecton is mostly delivered as SaaS, which requires no infrastructure footprint to get started. As a company, Tecton is one of the best-funded startups in the feature store space.

Key Benefits: The main differentiators of the Tecton platform are its enterprise-grade capabilities such as feature monitoring, version management, collaboration, data quality, and several others. Additionally, the platform has proven adoption in several large-scale machine learning scenarios.
Supported MLOps Stacks: SageMaker, KubeFlow, Databricks.
Delivery Mode: SaaS
Early Adopters: Atlassian, Tide, Omdia …

Feast

Feast is a lightweight, open-source feature store that makes it incredibly simple to get started on incorporating these capabilities into machine learning pipelines. Feast was initially incubated by Google and transportation startup GoJek, but it counts several other companies as supporters. Feast can be used in both on-premise and cloud environments and is integrated with several MLOps platforms.

Key Benefits: Simplicity is by far the biggest value proposition of Feast. The platform provides a very lightweight model for capturing both offline and online features in machine learning pipelines. This capability has made Feast a favorite for enabling feature store capabilities in MLOps stacks. The project was recently accepted as part of the Linux Foundation for AI & Data.
Supported MLOps Stacks: Spark, KubeFlow.
Delivery Mode: On-premise and Cloud
Early Adopters: Google, GoJek, Zulily, Agoda …

AWS SageMaker Feature Store

AWS got into the feature store space with the launch of the SageMaker Feature Store. The new addition to the SageMaker platform enables the search, discovery, and management of features for machine learning pipelines. Even though the platform still has very obvious limitations in terms of capabilities, it compensates for that with seamless integration with other components of the SageMaker platform.

Key Benefits: The native integration with SageMaker and other AWS technologies is the main benefit of the SageMaker feature store stack. Even though the AWS feature store’s user experience still feels a bit rudimentary compared to that of other platforms, we should expect that to improve soon. For now, the SageMaker feature store is a very easy entry point for data science teams already working on the SageMaker platform.
Supported MLOps Stacks: SageMaker.
Delivery Mode: AWS
Early Adopters: Intuit, Experian, Care.com …

Molecula

Molecula is one of the new feature store platforms with a strong focus on the enterprise. The startup has attracted sizable funding from venture capitalists in order to capture a relevant market share in the nascent feature store space. Molecula seems to excel in enabling seamless feature extraction capabilities across many databases and streaming messaging platforms, which is always a plus for enterprise environments. Molecula is based on the open-source Pilosa project, which was designed for large-scale data processing.

Key Benefits: Integration seems to be the main differentiator of the Molecula feature store platform. Molecula introduces the concept of Data Taps to abstract specific data sources used to extract and calculate features. Building on that capability, Molecula also enables a simple way to model features based on SQL queries.
Delivery Mode: On-premise.
Early Adopters: Q2 eBanking.

Hopsworks

Hopsworks is an open-source feature store created by machine learning startup Logical Clocks. The platform’s current version includes key building blocks of feature lifecycle management such as feature discovery, analysis, and serving. The platform is based on a very modern open-source architecture and provides integration with several MLOps stacks as well as database and streaming platforms.

Key Benefits: Hopsworks provides a nice middle ground between Feast’s simplicity and the enterprise-grade feature store capabilities of Tecton. The feature store platform seamlessly integrates with a large number of data storage and MLOps platforms. Hopsworks is not only open source but can also be accessed as a managed cloud service.
Supported MLOps Stacks: Databricks, SageMaker, Azure ML
Delivery Mode: On-premise and cloud.
Early Adopters: Nvidia, AMD, SCANIA…

Conclusion

The feature store market is one of the most active areas of innovation in modern machine learning. From innovative startups to technology giants like AWS, the feature store market is becoming one of the most fascinating battlefields in the emerging field of MLOps technologies. Despite being a very nascent market, the feature store space is already showing signs of fragmentation, with many new platforms emerging and capturing small shares of the market. In the current state of the machine learning market, the feature store platforms outlined in this research have been able to capture important early momentum, enough to make feature stores an important component of modern machine learning solutions.

🧠 The Quiz

Now, to our regular quiz. After ten quizzes, we will reward the winners. The questions are the following:

Which of the following statements better describes the role of feature stores in machine learning pipelines?
What database platforms are used to power the feature store capabilities of the Uber Michelangelo platform?


That was fun! 👏 Thank you.
