Jim Dowling/CEO Logical Clocks: The future of feature stores
TheSequence interviews ML practitioners to immerse you in the real world of machine learning and artificial intelligence
There is nothing more inspiring than to learn from practitioners. Getting to know the experience gained by researchers, engineers and entrepreneurs doing real ML work can become a great source of insights and inspiration. We'd like to introduce to you TheSequence Chat: the interviews that bring you closer to real ML practitioners. Please share these interviews if you find them enriching. No subscription is needed.
Quick bio / Jim Dowling
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Jim Dowling (JD): I come from a research background. My PhD, back in 2004, was on middleware for distributed reinforcement learning. After my PhD, I worked at MySQL for a couple of years, then as a researcher at RISE (Research Institutes of Sweden) and an Associate Professor at KTH. As part of my systems research, we built Hopsworks, an open-source data science platform that includes the first open-source feature store for machine learning.
ML Work
Feature stores have been gaining prominence in the last couple of years. Can you describe the value proposition of a feature store and why it is a necessary component of a machine learning pipeline?
JD: In order to serve models in production, you need to feed them with (often non-trivial) features. Those features are computed from input data, and the code that computes the features should be the same for both training and serving. You should not re-implement feature engineering code for serving, as non-DRY feature engineering code increases the risk of subtle differences between the implementations, and those differences introduce bugs that are difficult to track down. A solution to this problem is to store computed features in a feature store, and retrieve the same features when training and serving models. The feature store then becomes a centralized, enterprise platform to manage data (features) for machine learning. Feature stores have the same role for ML that data warehouses have for analytics.
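The idea of computing features once and reading the same values for training and serving can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `compute_features` and `ToyFeatureStore` are hypothetical names invented here, the store is a plain in-memory dictionary, and none of this is the Hopsworks API.

```python
from statistics import mean


def compute_features(transactions):
    """Single feature-engineering function, written once (hypothetical example)."""
    amounts = [t["amount"] for t in transactions]
    return {
        "txn_count": len(amounts),
        "avg_amount": mean(amounts) if amounts else 0.0,
        "max_amount": max(amounts, default=0.0),
    }


class ToyFeatureStore:
    """Hypothetical in-memory store keyed by entity id; not a real product API."""

    def __init__(self):
        self._rows = {}

    def put(self, entity_id, features):
        self._rows[entity_id] = dict(features)

    def get(self, entity_id):
        return dict(self._rows[entity_id])


# Made-up input data, used only to exercise the sketch.
raw = {
    "customer_1": [{"amount": 10.0}, {"amount": 30.0}],
    "customer_2": [{"amount": 5.0}],
}

# Features are computed once and written to the store.
store = ToyFeatureStore()
for customer_id, txns in raw.items():
    store.put(customer_id, compute_features(txns))

# Training and serving both read the stored values instead of recomputing them.
training_rows = [store.get(cid) for cid in sorted(raw)]
serving_row = store.get("customer_1")
```

Because both paths read from the store, there is no second implementation of the feature logic that could drift from the first.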
What should be the three core capabilities of an enterprise-ready feature store?
JD:
(a) Feature stores should provide efficient access to large volumes of (potentially historical) features for training models on different data science platforms, and low-latency access to the latest values of features for model serving.
(b) Feature stores should be intuitive and easy to use for data scientists and data/ML engineers, for example by providing Python APIs that allow them to browse and understand available features, create training data, and create new features from either enterprise data sources or existing features.
(c) Features should be access controlled, versioned (both schema versioning and data versioning), governed, and easily discoverable.
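A rough sketch of capabilities (a) and (c), assuming a toy in-memory design: every name below (`ToyVersionedStore`, `read_offline`, `read_online`, the `customer_spend` group) is hypothetical and does not come from any real feature store API. The integer version stands in for schema versioning, the timestamped history stands in for data versioning, and access control and governance are left out entirely.

```python
class ToyVersionedStore:
    """Hypothetical in-memory feature store with offline and online reads."""

    def __init__(self):
        # (group, version) -> list of (ts, entity_id, features): full history
        self._history = {}
        # (group, version, entity_id) -> (ts, features): latest value only
        self._latest = {}

    def insert(self, group, version, entity_id, ts, features):
        self._history.setdefault((group, version), []).append(
            (ts, entity_id, dict(features))
        )
        key = (group, version, entity_id)
        # Keep only the most recent row per entity for low-latency lookups.
        if key not in self._latest or ts >= self._latest[key][0]:
            self._latest[key] = (ts, dict(features))

    def read_offline(self, group, version, start_ts=None, end_ts=None):
        """Return historical rows, optionally within a time range, for training."""
        rows = self._history.get((group, version), [])
        return [
            {"ts": ts, "entity_id": eid, **feats}
            for ts, eid, feats in sorted(rows, key=lambda r: (r[0], r[1]))
            if (start_ts is None or ts >= start_ts)
            and (end_ts is None or ts <= end_ts)
        ]

    def read_online(self, group, version, entity_id):
        """Return the latest feature values for one entity, for serving."""
        return dict(self._latest[(group, version, entity_id)][1])


store = ToyVersionedStore()
store.insert("customer_spend", 1, "customer_1", ts=1, features={"avg_amount": 20.0})
store.insert("customer_spend", 1, "customer_1", ts=2, features={"avg_amount": 25.0})
# A new schema version lives alongside the old one instead of replacing it.
store.insert("customer_spend", 2, "customer_1", ts=2, features={"avg_amount_7d": 22.5})

history_v1 = store.read_offline("customer_spend", 1)
recent_v1 = store.read_offline("customer_spend", 1, start_ts=2)
latest_v1 = store.read_online("customer_spend", 1, "customer_1")
latest_v2 = store.read_online("customer_spend", 2, "customer_1")
```

A production system would typically back the two read paths with different storage engines; the sketch keeps both in one process only to show the two access patterns side by side.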
In the long term, are feature stores a standalone product or a feature (interesting choice of words) of broader ML platforms?
JD: I don't think we have even answered the question yet of whether data warehouses are just part of larger analytics pipelines. Feature stores are much newer and will be standalone products for the next couple of years. But ML pipelines will benefit hugely from end-to-end provenance for debugging, governance, and reproducing models. The feature store will need to be tightly integrated into those ML pipelines and the platforms used to develop and operate them.
How do techniques like representation learning, which can learn features from a given dataset, influence the future of feature stores?
JD: I don't think they have a direct bearing on the system architecture of feature stores themselves. It is already the case that feature stores ingest "base" features from which many derived features are created by data scientists. There may be value in automated feature engineering to reduce the manual effort in identifying and creating downstream features. However, deep learning shows us that a lot of feature engineering can be done in model training with appropriate model architectures, so I do not expect automated feature engineering to be the next big thing for feature stores.
Big technology platforms like AWS have recently entered the feature store space, which also includes well-funded startups like Tecton. How do you see the competitive landscape in the near future?
JD: The first feature stores, developed at Uber and Airbnb, used domain-specific languages (DSLs) to support feature engineering for constrained domains. Now, enterprise feature stores need to support a wider set of clients and use cases, and DSLs are not flexible enough. Python APIs are dominating, and most platforms are converging on a DataFrame API (Pandas and (Py)Spark) that we first introduced in Hopsworks. We expect that there will be one or two dominant open-source feature stores (Hopsworks and Feast, maybe) that will become more widely used as more models need to be put in production. We also expect there will be managed feature store platforms on every cloud provider this year. Currently, SageMaker Feature Store and Tecton are available on AWS. Hopsworks.ai is available on both AWS and Azure, and Google announced that they would release a managed feature store soon. Databricks will also release a feature store in 2021.
Miscellaneous: a set of rapid-fire questions
TensorFlow or PyTorch?
JD: It's not 2017 anymore. In 2021, they are practically the same. If I have to choose, TensorFlow for its Enterprise capabilities.
Favorite math paradox?
JD: 75% of people think they are smarter/more-attractive than average.
Any book you would recommend to aspiring data scientists?
JD: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron.
Does P equal NP?
JD: The systems research adage doesn't help much here: "don't guess, measure".
Interesting interview. Could feature stores evolve into self-enhancing AI?