🎙 Jim Dowling/CEO Logical Clocks: The future of feature stores

TheSequence interviews ML practitioners to immerse you in the real world of machine learning and artificial intelligence

Jan 29, 2021

There is nothing more inspiring than to learn from practitioners. Getting to know the experience gained by researchers, engineers and entrepreneurs doing real ML work can become a great source of insights and inspiration. We’d like to introduce to you TheSequence Chat – the interviews that bring you closer to real ML practitioners. Please share these interviews if you find them enriching. No subscription is needed.

👤 Quick bio / Jim Dowling

Tell us a bit about yourself. What is your background and current role, and how did you get started in machine learning?

Jim Dowling (JD): I come from a research background. My PhD, back in 2004, was on middleware for distributed reinforcement learning. After my PhD, I worked at MySQL for a couple of years, then as a researcher at RISE (Research Institutes of Sweden) and an Associate Professor at KTH. As part of my systems research, we built Hopsworks, an open-source data science platform that includes the first open-source feature store for machine learning.

🛠 ML Work

Feature stores have been gaining prominence in the last couple of years. Can you describe the value proposition of a feature store and why it is a necessary component of a machine learning pipeline?

JD: To serve models in production, you need to feed them with (often non-trivial) features. Those features are computed from input data, and the code that computes them should be the same for both training and serving. You should not re-implement feature engineering code for serving, because non-DRY feature engineering code increases the risk of subtle differences between the implementations, and those differences introduce bugs that are difficult to track down. A solution to this problem is to store computed features in a feature store and retrieve the same features when training and serving models. The feature store then becomes a centralized enterprise platform for managing data (features) for machine learning. Feature stores have the same role for ML that data warehouses have for analytics.
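To make the training/serving point concrete, here is a minimal Python sketch. It is not from the interview and is not the Hopsworks API: `Transaction`, `compute_features`, and `InMemoryFeatureStore` are hypothetical names chosen for illustration. It only shows the idea that features are computed once, by one implementation, and that both the training path and the serving path read the stored result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    # Hypothetical raw input record; not part of any real library.
    customer_id: str
    amount: float


def compute_features(transactions):
    """The single feature-engineering implementation (no second copy for serving)."""
    amounts = [t.amount for t in transactions]
    count = len(amounts)
    total = sum(amounts)
    return {
        "txn_count": count,
        "txn_total": total,
        "txn_mean": total / count if count else 0.0,
    }


class InMemoryFeatureStore:
    """Toy stand-in for a feature store: computed features keyed by entity id."""

    def __init__(self):
        self._rows = {}

    def ingest(self, entity_id, transactions):
        # Features are computed once, at ingestion time, and then stored.
        self._rows[entity_id] = compute_features(transactions)

    def get_training_row(self, entity_id):
        # Training reads the stored features.
        return dict(self._rows[entity_id])

    def get_serving_vector(self, entity_id):
        # Serving reads the same stored features, so the two cannot drift apart.
        return dict(self._rows[entity_id])


store = InMemoryFeatureStore()
store.ingest("c1", [Transaction("c1", 10.0), Transaction("c1", 30.0)])
training_row = store.get_training_row("c1")
serving_vector = store.get_serving_vector("c1")
```

In this sketch the training row and the serving vector are identical by construction, which is the property that a separately re-implemented serving pipeline cannot guarantee.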

What should be the three core capabilities of an enterprise-ready feature store?

JD:

(a) Feature stores should provide efficient access to large volumes of (potentially historical) features for training models on different data science platforms, and low-latency access to the latest values of features for model serving.
(b) Feature stores should be intuitive and easy to use for data scientists and data/ML engineers, for example by providing Python APIs that let them browse and understand available features, create training data, and create new features from either enterprise data sources or existing features.
(c) Features should be access-controlled, versioned (both schema and data versioning), governed, and easily discoverable.
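The split in (a) between historical access for training and latest-value access for serving can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the API of Hopsworks or any other feature store: `ToyFeatureGroup`, `read_offline`, and `read_online` are hypothetical names, timestamps are plain integers, and the `version` field only hints at the schema versioning mentioned in (c). Access control and governance are left out.

```python
from collections import defaultdict


class ToyFeatureGroup:
    """Hypothetical versioned feature group with an offline and an online view."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self._history = defaultdict(list)  # entity_id -> [(timestamp, features)]
        self._latest = {}                  # entity_id -> (timestamp, features)

    def insert(self, entity_id, timestamp, features):
        # Offline view: append every value so history is preserved.
        self._history[entity_id].append((timestamp, dict(features)))
        # Online view: keep only the most recent value per entity.
        current = self._latest.get(entity_id)
        if current is None or timestamp >= current[0]:
            self._latest[entity_id] = (timestamp, dict(features))

    def read_offline(self, start, end):
        """Historical rows with start <= timestamp < end, for building training data."""
        rows = []
        for entity_id, entries in self._history.items():
            for ts, feats in entries:
                if start <= ts < end:
                    rows.append({"entity_id": entity_id, "ts": ts, **feats})
        return sorted(rows, key=lambda r: (r["ts"], r["entity_id"]))

    def read_online(self, entity_id):
        """Latest feature values for one entity, as a single key lookup."""
        return dict(self._latest[entity_id][1])


fg = ToyFeatureGroup("customer_spend", version=1)
fg.insert("c1", 1, {"spend_7d": 40.0})
fg.insert("c1", 2, {"spend_7d": 55.0})
fg.insert("c2", 2, {"spend_7d": 12.5})

training_rows = fg.read_offline(start=1, end=3)  # all three historical rows
latest_c1 = fg.read_online("c1")                 # only the newest value for c1
```

A real system would typically back the two views with different storage engines (a scalable store for history, a key-value store for serving); the sketch keeps both in memory purely to show the two read paths side by side.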

In the long term, are feature stores a standalone product or a feature (interesting choice of words 😉 ) of broader ML platforms?

JD: I don’t think we have even answered the question of whether data warehouses are just part of larger analytics pipelines yet. Feature stores are much newer and will be standalone products for the next couple of years. But ML pipelines will benefit hugely from end-to-end provenance for debugging, governance, and reproducing models. The feature store will need to be tightly integrated into those ML pipelines and into the platforms used to develop and operate them.

How do techniques like representation learning, which can learn features from a given dataset, influence the future of feature stores?

JD: I don’t think they have a direct bearing on the system architecture of feature stores themselves. It is already the case that feature stores ingest ‘base’ features from which many derived features are created by data scientists. There may be value in automated feature engineering to reduce the manual effort in identifying and creating downstream features. However, deep learning shows us that a lot of feature engineering can be done in model training with appropriate model architectures, so I do not expect automated feature engineering will be the next big thing for feature stores.

Big technology platforms like AWS have recently entered the feature store space, which also includes well-funded startups like Tecton. How do you see the competitive landscape in the near future?

JD: The first feature stores, developed at Uber and Airbnb, used domain-specific languages (DSLs) to support feature engineering for constrained domains. Now, enterprise feature stores need to support a wider set of clients and use cases, and DSLs are not flexible enough. Python APIs are dominating, and most platforms are converging on a DataFrame API (Pandas and (Py)Spark) that we first introduced in Hopsworks. We expect that there will be one or two dominant open-source feature stores (Hopsworks and Feast, maybe) that will become more widely used as more models need to be put in production. We also expect there will be managed feature store platforms on every cloud provider this year. Currently, SageMaker Feature Store and Tecton are available on AWS. Hopsworks.ai is available on both AWS and Azure, and Google has announced that it will release a managed feature store soon. Databricks will also release a feature store in 2021.

💥 Miscellaneous – a set of rapid-fire questions

TensorFlow or PyTorch?

JD: It’s not 2017 anymore. In 2021, they are practically the same. If I have to choose, TensorFlow for its Enterprise capabilities.

Favorite math paradox?

JD: 75% of people think they are smarter/more-attractive than average.

Any book you would recommend to aspiring data scientists?

JD: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron.

Does P equal NP?

JD: The systems research adage doesn’t help much here: “don’t guess, measure”.
