Jim Dowling/CEO Logical Clocks: The future of feature stores
TheSequence interviews ML practitioners to immerse you in the real world of machine learning and artificial intelligence
There is nothing more inspiring than learning from practitioners. Getting to know the experience gained by researchers, engineers and entrepreneurs doing real ML work can become a great source of insights and inspiration. We'd like to introduce to you TheSequence Chat: the interviews that bring you closer to real ML practitioners. Please share these interviews if you find them enriching. No subscription is needed.
Quick bio / Jim Dowling
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Jim Dowling (JD): I come from a research background. My PhD was on middleware for distributed reinforcement learning, back in 2004. After my PhD, I worked at MySQL for a couple of years, then as a researcher at RISE (Research Institutes of Sweden) and an Associate Professor at KTH. As part of my systems research, we built Hopsworks as an open-source data science platform that includes the first open-source feature store for machine learning.
ML Work
Feature stores have been gaining prominence in the last couple of years. Can you describe the value proposition of a feature store and why they are a necessary component of a machine learning pipeline?
JD: In order to serve models in production, you need to feed them with (often non-trivial) features. Those features are computed from input data, and the code that computes the features should be the same for both training and serving. You should not re-implement feature engineering code for serving, as non-DRY feature engineering code increases the risk of subtle differences between the implementations that introduce hard-to-track-down bugs. A solution to this problem is to store computed features in a feature store, and retrieve the same features when training and serving models. The feature store then becomes a centralized, enterprise platform to manage data (features) for machine learning. Feature stores have the same role for ML that data warehouses have for analytics.
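To make the DRY point concrete, here is a minimal sketch in plain Python. It is not Hopsworks code: the `feature_store` dictionary, the `days_since_signup` feature, and the helper names are all hypothetical, chosen only to show one feature function feeding both the training and the serving path.

```python
from datetime import date

# Hypothetical feature: the single place where this logic is defined.
def days_since_signup(signup, as_of):
    return (as_of - signup).days

# Hypothetical in-memory stand-in for a feature store:
# features are computed once at ingestion and read back by both paths.
feature_store = {}

def ingest(customer_id, signup, as_of):
    feature_store[customer_id] = {
        "days_since_signup": days_since_signup(signup, as_of),
    }

def training_row(customer_id, label):
    # Training reads the stored feature values and attaches a label.
    return {**feature_store[customer_id], "label": label}

def serving_vector(customer_id):
    # Serving reads the same stored values; nothing is re-implemented.
    return [feature_store[customer_id]["days_since_signup"]]

ingest("c1", signup=date(2021, 1, 1), as_of=date(2021, 1, 31))
row = training_row("c1", label=1)
vector = serving_vector("c1")
```

Because both `training_row` and `serving_vector` read what `ingest` wrote, there is no second implementation of the feature that could drift from the first.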
What should be the three core capabilities of an enterprise-ready feature store?
JD:
(a) Feature stores should provide efficient access to large volumes of (potentially historical) features for training models on different data science platforms, and low-latency access to the latest values of features for model serving.
(b) Feature stores should be intuitive and easy to use by data scientists and data/ML engineers, for example by providing Python APIs that allow them to browse and understand available features, create training data, and create new features from either Enterprise data sources or existing features.
(c) Features should be access controlled, versioned (both schema and data versioning), governed, and easily discoverable.
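As a rough illustration of capabilities (a) and (c), the sketch below keeps a full history for training next to a latest-value view for serving, with feature groups keyed by name and schema version. `MiniFeatureStore` and its methods are hypothetical and in-memory only; they do not reflect any real feature store's API, and access control and governance are left out.

```python
from collections import defaultdict

class MiniFeatureStore:
    """Hypothetical toy store: offline history plus online latest values."""

    def __init__(self):
        # (group, version) -> list of (entity_id, event_time, features)
        self._offline = defaultdict(list)
        # (group, version) -> {entity_id: (event_time, features)}
        self._online = defaultdict(dict)

    def insert(self, group, version, entity_id, event_time, features):
        key = (group, version)
        # Offline store: append every row so history is available for training.
        self._offline[key].append((entity_id, event_time, dict(features)))
        # Online store: keep only the most recent row per entity for serving.
        current = self._online[key].get(entity_id)
        if current is None or event_time >= current[0]:
            self._online[key][entity_id] = (event_time, dict(features))

    def get_history(self, group, version):
        return list(self._offline[(group, version)])

    def get_latest(self, group, version, entity_id):
        return self._online[(group, version)][entity_id][1]

store = MiniFeatureStore()
store.insert("customer", 1, "c1", event_time=1, features={"spend": 10.0})
store.insert("customer", 1, "c1", event_time=2, features={"spend": 25.0})
# A schema change is written as a new version and leaves version 1 untouched.
store.insert("customer", 2, "c1", event_time=2, features={"spend_usd": 25.0})
```

Here `get_history("customer", 1)` returns both rows for building training data, while `get_latest("customer", 1, "c1")` returns only the newest values for serving.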
In the long term, are feature stores a standalone product or a feature (interesting choice of words) of broader ML platforms?
JD: I don't think we have even answered the question of whether data warehouses are just part of larger analytics pipelines yet. Feature stores are much newer and will be standalone products for the next couple of years. But ML pipelines will benefit hugely from end-to-end provenance for debugging, governance, and reproducing models. The feature store will need to be tightly integrated into those ML pipelines and the platforms used to develop and operate them.
How do techniques like representation learning, which can learn features from a given dataset, influence the future of feature stores?
JD: I don't think they have a direct bearing on the system architecture of feature stores themselves. It is already the case that feature stores ingest "base" features from which many derived features are created by data scientists. There may be value in automated feature engineering to reduce the manual effort in identifying and creating downstream features. However, deep learning shows us that a lot of feature engineering can be done in model training with appropriate model architectures, so I do not expect automated feature engineering will be the next big thing for feature stores.
Big technology platforms like AWS have recently entered the feature store space, which also includes well-funded startups like Tecton. How do you see the competitive landscape in the near future?
JD: The first feature stores, developed at Uber and Airbnb, used domain-specific languages (DSLs) to support feature engineering for constrained domains. Now, Enterprise feature stores need to support a wider set of clients and use cases, and DSLs are not flexible enough. Python APIs are dominating, and most platforms are converging on a DataFrame API (Pandas and (Py)Spark) that we first introduced in Hopsworks. We expect that there will be one or two dominant open-source feature stores (Hopsworks and Feast, maybe) that will become more widely used as more models need to be put in production. We also expect there will be managed feature store platforms on every cloud provider this year. Currently, SageMaker Feature Store and Tecton are available on AWS. Hopsworks.ai is available on both AWS and Azure, and Google announced that they would release a managed feature store soon. Databricks will also release a feature store in 2021.
Miscellaneous: a set of rapid-fire questions
TensorFlow or PyTorch?
JD: It's not 2017 anymore. In 2021, they are practically the same. If I have to choose, TensorFlow for its Enterprise capabilities.
Favorite math paradox?
JD: 75% of people think they are smarter/more attractive than average.
Any book you would recommend to aspiring data scientists?
JD: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron.
Does P equal NP?
JD: The systems research adage doesn't help much here: "don't guess, measure".