🎙H.O. Maycotte/CEO of Molecula on shifting from “data as fuel” to “features as fuel”
There is nothing more inspiring than to learn from practitioners. Getting to know the experience gained by researchers, engineers and entrepreneurs doing real ML work can become a great source of insights and inspiration. Share this interview if you find it enriching. No subscription is needed.
👤 Quick bio / Higinio “H.O.” Maycotte
Tell us a bit about yourself. Your background, current role and how did you get started in machine learning?
Higinio “H.O.” Maycotte (H.O.): I am originally from Mexico and have always been obsessed with using technology, primarily data and AI, to help humans solve problems and evolve even faster than on Darwin’s trajectory. I believe that we are on the cusp of a super evolution. However, because these super evolutionary technologies are available to only a few Tech Titans, who spend billions annually on pure R&D, I am on a mission to ensure that these capabilities are accessible to everyone, everywhere.
🛠 ML Work
Feature stores have become a hot trend in modern machine learning stacks. Can you describe the value proposition of feature stores in general and how you are delivering that value proposition with Molecula?
H.O.: Let’s start with the basic value proposition of the traditionally defined feature store. It’s relatively simple – enterprises cannot efficiently or cost-effectively serve many models in real-time, at scale unless they have a feature store providing access to offline training data and online serving data. The feature store bridges the data silos within organizations, allowing features to be shared and reused, therefore saving on time and resources. Feature stores rapidly prove value in large organizations that are incorporating numerous predictive models, likely across multiple business units.
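To make the offline/online split concrete, here is a minimal sketch of a feature store kept entirely in memory. It is an illustration only and not Molecula's design or any vendor's API: the class `TinyFeatureStore`, its method names, and the `orders_30d` feature are all hypothetical. It assumes each feature value is written with a timestamp, keeps the full history for training, and keeps only the latest value for serving.

```python
from collections import defaultdict


class TinyFeatureStore:
    """Toy in-memory feature store (hypothetical names, illustration only)."""

    def __init__(self):
        # Offline view: full history of (timestamp, value) per entity/feature.
        self._history = defaultdict(list)
        # Online view: only the latest (timestamp, value) per entity/feature.
        self._latest = {}

    def write(self, entity_id, feature, value, ts):
        key = (entity_id, feature)
        self._history[key].append((ts, value))
        # Keep the online view pointing at the most recent timestamp.
        if key not in self._latest or ts >= self._latest[key][0]:
            self._latest[key] = (ts, value)

    def get_online(self, entity_id, features):
        """Look up the latest value of each feature, as a model server would."""
        return {
            f: self._latest.get((entity_id, f), (None, None))[1]
            for f in features
        }

    def get_training_rows(self, features, as_of):
        """Return one row per entity with feature values as of a timestamp."""
        entities = sorted({entity for (entity, _feature) in self._history})
        rows = []
        for entity in entities:
            row = {"entity_id": entity}
            for f in features:
                past = [
                    (ts, v)
                    for ts, v in self._history.get((entity, f), [])
                    if ts <= as_of
                ]
                row[f] = max(past, key=lambda p: p[0])[1] if past else None
            rows.append(row)
        return rows


store = TinyFeatureStore()
store.write("u1", "orders_30d", 3, ts=1)
store.write("u1", "orders_30d", 5, ts=2)
store.write("u2", "orders_30d", 1, ts=1)

online = store.get_online("u1", ["orders_30d"])
training = store.get_training_rows(["orders_30d"], as_of=1)
```

The same written values feed both paths, which is the reuse argument made above: `get_online` serves the freshest value, while `get_training_rows` reconstructs what was known at a past point in time so training data does not leak later values.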
At Molecula, we believe the world has been putting the ML/AI cart ahead of the data readiness horse. Part of the challenge in scaling modern data systems is that we continue to operate on information-era databases and formats, designed for human-centric data. Features, and specifically our purpose-built feature store, are primed to spur the evolution from row-based and column-based databases, to a machine-ready format designed specifically to store and retrieve values at an attribute level, thus making the data ready for computation.
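One way to picture storing and retrieving values at an attribute level, as opposed to by row or by column, is an index from each (attribute, value) pair to the set of records that carry it. The sketch below is an assumption made for illustration; the interview does not describe Molecula's actual storage format, and the function `build_attribute_index` and the sample `customers` data are hypothetical.

```python
from collections import defaultdict


def build_attribute_index(records):
    """Map each (attribute, value) pair to the set of record ids that have it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for attribute, value in record.items():
            index[(attribute, value)].add(record_id)
    return index


# Hypothetical row-oriented input data.
customers = {
    1: {"country": "MX", "plan": "pro"},
    2: {"country": "US", "plan": "pro"},
    3: {"country": "MX", "plan": "free"},
}

index = build_attribute_index(customers)

# A filter on two attributes becomes a set intersection, with no row scan.
mx_pro = index[("country", "MX")] & index[("plan", "pro")]
```

In this layout a question such as "which customers are in Mexico and on the pro plan" is answered by combining two precomputed sets, which is the sense in which the data is already arranged for computation.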
It is clear that the world is waking up to the power of features as the raw ingredient to fuel the entire ML Lifecycle, but we haven’t yet enabled a full mindset shift away from “data as fuel” to “features as fuel.” We continue to move data to laptops, manually extracting features, then fighting IT to try to figure out how to productionize our work. Worse, we are storing these features in data stores, causing unnecessary latency and inefficiency.
Molecula has introduced a platform designed to automatically convert all data into features and to be the intersection between Data Engineering and Data Science. We believe that Data Engineers are the key to aligning the workload imbalance that lies between our data and extracting value from our data. Data Engineers will be the modern corporate heroes when they can transition from deploying infrastructure for every single project and can instead spend their day delivering model-ready data to the business.
In recent months, AI incumbents such as AWS have entered the feature store space. How will this affect the competitive landscape? Will it make it hard for feature store platforms to remain standalone companies, or does it signal a consolidation in the market?
H.O.: First and foremost, our view and definition of a feature store are very different from the definition much of the market uses. Most of the reference architectures for feature stores that have been released by the Tech Titans, as well as the other “off the shelf” feature stores, are really model management/lifecycle tools that are storing features in data stores. To us, converting data into features as the last step leaves some of the biggest benefits that features offer on the table. When you automate extraction of features directly from raw data sources and store only the features in a purpose-built feature storage system, all of your most important data (think customers, patients, inventory, supply chain, parts, etc.) is already in an ML-optimized and performance-optimized form. Using this feature-first format as the basis for all workloads, you get huge cost, security, and performance benefits. Features are amazing and they absolutely will power the future of machine intelligence, nobody is disputing that, but when you power your features with a data format designed specifically for ML, the benefits are exponential. Because of this, I believe that feature stores, the ones that simply store and retrieve features like ours, will become a foundational component of the machine learning stack.
Many large technology companies such as Uber, Airbnb or Pinterest have decided to build proprietary feature store stacks on top of mainstream stacks like Redis. How should companies think about the balance between build vs. buy when it comes to feature stores?
H.O.: Great question. It is very validating for us to see that Tech Titans are building and releasing reference architectures for feature stores. Obviously, this is self-serving, but I think there is a tremendous opportunity to simplify these architectures by eliminating the storage and retrieval of features from/to various data stores. For example, Lyft recently published a great blog post on their feature store entitled “ML Feature Serving Infrastructure at Lyft.” I couldn’t help but notice that this architecture requires at least three different data stores, including DynamoDB, Hive, and Elasticsearch. These specific workloads can all be handled in a much simpler, less expensive, and more performant way if the features are persisted in a feature-first format.
In recent years, research in areas such as representation learning or AutoML has been pushing the boundaries of feature extraction/selection. What are some of the main areas of deep learning research that can influence the future of feature stores?
H.O.: I think that these are some of the most important and exciting areas of research in machine learning. However, just like any other model, they depend on having access to the data. Because of the manual nature of extracting and selecting features today, we tend to introduce a tremendous amount of bias into what goes into our models. Molecula believes that all data should be converted into coarse-grained features first, and then refined through feature selection, model training, etc.
When it comes to the impact deep learning research might have on the future of feature stores, I get really excited about data volumes. Just the other day, I was listening to a talk with Ali Ghodsi of Databricks and he shared that, while at UC Berkeley, he and his founding team were looking at early-stage Facebook, Uber, Airbnb, and more, all claiming to have success using “crappy” algorithms (his word, not mine!) from the ’70s and ’80s. At first, he didn’t believe it, but in looking more in-depth, he realized that they actually were having success with these algorithms that traditionally hadn’t worked because they were applying huge amounts of data to the models. This ties in well with the symbiosis between deep learning and the future of feature stores.
To date, a lot of the potential in the deep learning space has been hindered by the physical limitations surrounding big data (including legacy technologies as well as potential cost implications). Given massive data volumes and how intense the model training process can be (running a model hundreds and hundreds of times), we need better-optimized solutions or enterprises will bow out quickly from lack of tangible success. If model training is fed by a high-performance data format, like our feature-first format, it can cut compute costs drastically, thereby accelerating innovation and advancements within the space.
What is your most ambitious vision of where feature store platforms can go in the next three to five years?
H.O.: I am excited to think about a world where all of our data is automatically persisted in a model-ready state – once – so that access is instant without the need to pre-process data, where refined features can be easily reused, where queries, transformations and real-time JOINs are lightning fast, where sharing (of data) is commonplace, and where we are executing complex computational models constantly. When this vision is achieved, humans and machines will be able to make informed decisions in real-time, regardless of geography, wherever these decisions need to be made, without copying or moving data. To me, this has all of the makings of a mega computer.
💥 Miscellaneous – a set of rapid-fire questions
TensorFlow or PyTorch?
H.O.: Both of these deep learning frameworks are amazing and, to me, the choice really depends on the nature of your use case. For me, I lean towards PyTorch for R&D because you can modify your graph on-the-go, whereas with TensorFlow you must lock in a static graph. The dynamic approach to computational graphs can be useful when using variable-length inputs in an RNN. Obviously, everything has trade-offs – the more rigid approach can make TensorFlow better for production use cases and scale.
Favorite math paradox?
H.O.: I have always been obsessed with perpetual motion and parallel universes, so naturally Zeno’s paradox of motion would be one of my favorites, although the question of whether motion is an illusion tends to be more a paradox of physics than of math.
What book can you recommend to an aspiring ML engineer?
H.O.: As you know, I always want to align data science with business value, and as the world starts to develop momentum around operational AI it is important that practitioners focus on solving real business problems. It is not to say that pure research is not valuable, but by solving real business problems, this research can align all stakeholders in creating sustainable impact, financial or otherwise. There are many books on this subject, but one that I enjoyed is Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions by Matt Taddy.
Is the Turing Test still relevant? Any clever alternatives?
H.O.: I am not sure the Turing Test was ever all that relevant, maybe for a finite window in history, but as I think about the broader timeline of the future there are other ways to evaluate our progress towards making machines biological. I fundamentally believe that basic natural rules on a massive scale are simply systems that work in a circular form in which order ends in disorder and vice versa. Modern science and technology in my mind have laid this very foundation of basic physics and we are on the verge of creating a Holarchy of information systems that will result in totally unexpected outcomes. These outcomes are starting to happen all around us and will drive a super evolution we can’t even fathom. These super evolutionary sparks are evident in Netflix’s recently released documentary, The Social Dilemma. To me, these unintended consequences in the movie are the test and the positive proof that we are transcending into a new future.
Does P equal NP?
H.O.: I would hate to give away the answer to this question. If any of your readers think they have a definitive proof for this existential question, I will gladly pay $500k USD to the first person who sends me a verified proof, one way or the other. ;)