🔹◽️ Edge#96: Molecula is a Feature Extraction and Storage Platform Designed for Enterprise ML Workloads
This is an example of TheSequence Edge, a Premium newsletter that our subscribers receive every Tuesday and Thursday. On Thursdays, we do deep dives into one research paper, platform, or framework that is worth knowing about. Learn what might be useful for your work.
In this issue, we cover:
the challenges of choosing among feature-related platforms;
Pilosa-based Molecula platform and its four fundamental sets of capabilities;
Ingesters, PSQL, and Consumption Interfaces in Molecula’s architecture.
💥 What’s New in AI: Molecula is a Feature Extraction and Storage Platform Designed for Enterprise ML Workloads
Features are rapidly becoming one of the fastest-growing components in the relatively crowded machine learning space. The proliferation of feature-related platforms makes it increasingly challenging for data science teams to decide which stack is best suited for their machine learning workloads. At this early stage in the development of the Operational AI market, it is wise to rely on platforms that have achieved relevant customer traction, secured sufficient financial backing, and are building on a credible technological and research foundation. Molecula is one of the feature-based platforms that fit those criteria, and it has carved out a place for itself among the early leaders of the feature store ecosystem.
Molecula’s feature store capabilities are optimized for enterprise-scale machine learning pipelines. The platform is built on the open-source Pilosa format and extends it with enterprise-grade capabilities that streamline the lifecycle of features in machine learning applications. Scalability and feature computation performance are the areas in which Molecula excels. After all, Pilosa was built to address large-scale analytics workloads in enterprise environments. In the same way that other feature store platforms rely on data-centric platforms like Redis or Hive for their feature storage capabilities, Molecula decided to bank on the real-time computation and analytics capabilities of Pilosa’s feature-first format. The Molecula platform adapts those capabilities to compute and manage machine learning features across a large variety of streaming and historical datasets.
Molecula
From a functional standpoint, the Molecula platform enables four fundamental sets of capabilities:
Feature Store: The central element of the Molecula platform, the feature store abstracts the calculation, storage, and management of features in machine learning pipelines.
Extension Framework: Molecula was built with extensibility as a first-class citizen. The Extension Framework is a series of programmable modules that can be used to extend the core functionality of the Molecula platform.
Control Pane: Molecula’s Control Pane allows data science teams to manage the topology of a feature lifecycle management architecture, including cloud and on-premises environments, data store connections, and more.
Data Taps: Molecula’s Data Taps abstract the integration with different real-time or batch data sources in order to extract features relevant to machine learning programs, at the source.
Integration is one of the most differentiated capabilities of the Molecula platform. Through its Data Taps model, Molecula offers integration services for top data platforms and SaaS APIs that connect to the relevant datasets for feature extraction and computation. This robust set of integration capabilities, together with a world-class implementation team to deploy these connections, makes Molecula especially well suited for complex, heterogeneous enterprise data environments.
The Architecture
As mentioned previously, Molecula is based on the Pilosa open-source framework for big data analytics. You can think about Pilosa as a massive columnar store distributed across a large number of nodes. Unlike other columnar data stores, Pilosa splits each column into its distinct values so that, for every record, the presence of a given value can be represented as a single bit. This format works well for feature computations, as features are typically represented as unique values. It allows for granular scans at a feature-by-feature level, unlike conventional columnar or tabular data formats, which is what lets Molecula reach latencies that other systems have struggled to match.
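To make the one-bit-per-value idea concrete, here is a minimal, single-machine sketch in Python. It is not Pilosa's or Molecula's implementation: the function names and the sample `plan` and `region` columns are made up for illustration, and plain Python integers stand in for the compressed, distributed bitmaps a real system would use.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when record i holds it."""
    index = defaultdict(int)
    for record_id, value in enumerate(values):
        index[value] |= 1 << record_id
    return dict(index)


def count(bitmap):
    """Number of records whose bit is set in the bitmap."""
    return bin(bitmap).count("1")


# Two hypothetical columns describing the same six records.
plan = ["free", "pro", "free", "enterprise", "pro", "free"]
region = ["eu", "us", "us", "eu", "us", "eu"]

plan_idx = build_bitmap_index(plan)
region_idx = build_bitmap_index(region)

# "free AND eu" becomes one bitwise AND over two bitmaps,
# without scanning either column row by row.
free_eu = plan_idx["free"] & region_idx["eu"]
free_eu_ids = [i for i in range(len(plan)) if (free_eu >> i) & 1]

print(free_eu_ids)     # [0, 5]
print(count(free_eu))  # 2
```

Each feature value lives in its own bitmap, so a query touches only the values it names. That is the property the feature-by-feature scans described above depend on.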
Molecula extends the Pilosa architecture with a series of building blocks optimized for feature storage and lifecycle management. The following diagram provides a high-level overview of the architecture of the Molecula platform:
Image credit: Molecula
Let’s explore some of the components outlined in this diagram in more detail.
Ingesters
As their name suggests, Ingesters are responsible for collecting data from different data sources and transforming it into a Pilosa-compatible format. For instance, a SQL Server Ingester will connect to a SQL Server instance, execute specific queries, and transform the results into Pilosa’s bit-columnar format. Molecula’s Ingesters power its Data Taps capabilities.
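The query-then-transform flow of an Ingester can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Molecula's Ingester code: an in-memory SQLite database stands in for SQL Server, the `ingest` function and the `customers` table are hypothetical, and sets of record ids stand in for the bitmaps a real Ingester would produce.

```python
import sqlite3
from collections import defaultdict


def ingest(conn, query, id_column):
    """Run a query and return {field: {value: set of record ids}}."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    bitmaps = defaultdict(lambda: defaultdict(set))
    for row in cursor:
        record = dict(zip(columns, row))
        record_id = record.pop(id_column)
        # Every (field, value) pair gets its own set of record ids.
        for field, value in record.items():
            bitmaps[field][value].add(record_id)
    return {field: dict(values) for field, values in bitmaps.items()}


# Hypothetical source table; SQLite is only a stand-in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, plan TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "free", "eu"), (2, "pro", "us"), (3, "free", "us")],
)

features = ingest(conn, "SELECT id, plan, region FROM customers", id_column="id")
print(features["plan"]["free"])   # {1, 3}
print(features["region"]["us"])   # {2, 3}
```

The source-specific part is limited to the connection and the query; the reshaping into per-value sets is the same regardless of where the rows come from.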
PSQL
Molecula introduces its own query language for feature computation. PSQL is a SQL-based query language optimized for the Pilosa store format. Computing a feature for a machine learning model usually requires composing several functions to arrive at a final result. PSQL extends standard SQL syntax with function composition capabilities that make it easy to model feature computations.
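The snippet below illustrates what composing functions into a feature computation means. It is written in Python and is not PSQL syntax, which this issue does not reproduce; the `compose` helper and the `spend_tier` feature are hypothetical names chosen for the example.

```python
from functools import reduce


def compose(*steps):
    """Chain steps left to right: compose(f, g)(x) == g(f(x))."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def only_positive(amounts):
    """Drop refunds and other non-positive amounts."""
    return [amount for amount in amounts if amount > 0]


def total(amounts):
    """Sum the remaining amounts."""
    return sum(amounts)


def bucket(value):
    """Turn a numeric total into a categorical feature value."""
    return "high" if value >= 100 else "low"


# The feature is defined once as a pipeline of small, reusable steps.
spend_tier = compose(only_positive, total, bucket)

tier = spend_tier([40, -15, 75])
print(tier)  # high
```

A query language with built-in composition lets the same kind of pipeline be expressed in the query itself rather than in application code.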
Consumption Interfaces
Molecula includes a series of programmable interfaces to automate interactions with the platform. Among them are HTTP and gRPC APIs that third-party applications can use to interact with the platform. Additionally, Molecula includes a web interface for managing and monitoring its node topology as well as the state of the features computed for a specific dataset.
Image credit: Molecula
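As a rough idea of what calling an HTTP interface from a third-party application could look like, the sketch below builds a query request in the style of the open-source Pilosa HTTP API, which accepts a query string posted to `/index/<index>/query`. Whether Molecula's own endpoints match is an assumption; the host, the `customers` index, and the field names are placeholders. The request is only constructed, not sent, so the snippet runs without a server.

```python
import urllib.request


def build_query_request(base_url, index, query):
    """Build a POST request against a Pilosa-style /index/<index>/query endpoint."""
    url = f"{base_url.rstrip('/')}/index/{index}/query"
    return urllib.request.Request(url, data=query.encode("utf-8"), method="POST")


# Placeholder host, index, and fields; adjust to the actual deployment.
request = build_query_request(
    "http://localhost:10101",
    "customers",
    'Count(Intersect(Row(plan="free"), Row(region="eu")))',
)

print(request.get_method(), request.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```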
Putting It All Together
The feature-based foundation allows Molecula to provide a highly scalable platform for feature extraction, storage, and lifecycle management. The platform can be deployed in cloud runtimes such as AWS, Azure, and Google Cloud, as well as on container platforms. In the near future, we suggest that Molecula improve its integration with deep learning frameworks and platforms to streamline its adoption within MLOps pipelines. For now, Molecula has achieved meaningful customer traction and become a notable platform in the nascent feature store space.
Further Reading: More details about Molecula can be found at: https://www.molecula.com/products/