The Sequence Chat: Emmanuel Turlay – CEO, Sematic
Model orchestration, Airflow limitations in ML, and new ideas about MLOps.
👤 Quick bio
Tell us a bit about yourself: your background, current role, and how you got started in machine learning.
I’m Emmanuel, CEO and founder of Sematic.
I started my career in academia doing particle physics research on the Large Hadron Collider at CERN. After my post-doc I went to work for a string of small European startups before moving to the US in 2014 and joining Instacart where I led engineering teams dealing with payments and orders, and dabbled in MLOps.
In 2018, I joined Cruise and cofounded the ML Infrastructure team there. We built many critical platform systems that enabled the ML teams to develop and ship models much faster, which contributed to the commercial launch of robotaxis in San Francisco in 2022.
In May 2022, I started Sematic to bring my experience in ML infrastructure to the industry in an open-source manner.
🛠 ML Work
Your most recent project is Sematic, which focuses on enabling Python-based orchestration of ML pipelines. Could you please tell us about the vision and inspiration behind this project?
At Cruise, we noticed a wide gap between the complexity of cloud infrastructure and the needs of the ML workforce. ML engineers want to focus on writing Python logic and visualizing the impact of their changes quickly.
On the other hand, leadership at Cruise wanted to enable almost weekly retraining with newly labeled data to improve model performance very quickly (and beat Waymo to commercial launch). This required large end-to-end pipelines.
The vision for Sematic is to give all ML teams access to the type of orchestration platform previously only available to a few large organizations that built it in-house with large dedicated platform teams.
By abstracting away infrastructure and guaranteeing things like visualizations, traceability, and reproducibility out of the box, we have noticed an 80% speed-up in development time and retraining time.
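To make “Python-based orchestration” concrete, here is a minimal sketch in the spirit of Sematic’s decorator-based API. The `@sematic.func` decorator is Sematic’s documented entry point, but the step names, types, and the final `.resolve()` call are simplified assumptions rather than a verbatim example from the project.

```python
import sematic


# Each step is a plain Python function; the decorator turns it into a pipeline
# step whose inputs and outputs Sematic can serialize, track, and visualize.
@sematic.func
def load_dataset(path: str) -> list:
    # Placeholder loading logic, for illustration only.
    return [1.0, 2.0, 3.0]


@sematic.func
def train_model(dataset: list, learning_rate: float) -> dict:
    # Placeholder "training"; a real step would return an actual model artifact.
    return {"weights": sum(dataset) * learning_rate}


@sematic.func
def pipeline(path: str, learning_rate: float) -> dict:
    dataset = load_dataset(path)
    return train_model(dataset, learning_rate)


if __name__ == "__main__":
    # Executes the graph; newer Sematic versions use a Runner instead of
    # .resolve(), so treat this call as approximate.
    pipeline("s3://bucket/raw-data", 1e-3).resolve()
```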
What are the core capabilities of production-ready ML orchestration pipelines, and how are they reflected in Sematic?
Production-ready ML pipelines should have the following characteristics:
Traceability. All assets (data, code, configuration, resources used, etc.) should be tracked in a knowledge graph. We call this Lineage Tracking. The idea is that if a production model exhibits failures or issues, it should be straightforward to find out what job trained it, and what data, code, and configuration were used. This makes debugging much faster, and can even be a matter of legal compliance. Sematic persists and tracks all assets pertaining to all pipeline executions in a knowledge graph and surfaces them in its dashboard (see the sketch after this list for the kind of record involved).
Reproducibility. A model that cannot be reproduced from scratch (within stochastic variations) should not be used in production. The same way that application assets (binaries, images, etc.) can be built from source through a CI pipeline, models should be reproducible in order to enable debugging, performance investigations, and compliance. By tracking artifacts as sources of truth, and tracking all container images used, Sematic can rerun any pipeline at any time.
Observability. Without easy access to logs, failures, and performance metrics, investigating issues and optimizing resource usage is simply not possible. Sematic surfaces container logs, exceptions, failure diagnostics, and infrastructure traces directly in its dashboard to empower users to address issues in minutes.
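To make the Lineage Tracking idea above concrete, here is a hypothetical sketch of the kind of record a lineage tracker might persist for each pipeline step execution. This is not Sematic’s actual schema; all field and function names are illustrative.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """Hypothetical lineage entry for one pipeline step execution."""
    step_name: str
    code_version: str      # e.g. git commit SHA of the pipeline code
    container_image: str   # image digest the step ran in
    input_hashes: dict     # artifact name -> content hash
    output_hashes: dict    # artifact name -> content hash
    config: dict           # hyperparameters / configuration used
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def hash_artifact(payload: bytes) -> str:
    # Content-address artifacts so identical inputs map to identical hashes.
    return hashlib.sha256(payload).hexdigest()


record = LineageRecord(
    step_name="train_model",
    code_version="3f2a9c1",
    container_image="registry.example.com/train@sha256:abc123",
    input_hashes={"dataset": hash_artifact(b"...training data...")},
    output_hashes={"model": hash_artifact(b"...model weights...")},
    config={"learning_rate": 1e-3, "epochs": 10},
)
print(json.dumps(asdict(record), indent=2))
```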
When reading about Sematic, it is hard to avoid drawing comparisons to Airflow. What are the key differences between these two approaches to ML orchestration?
Airflow is a fantastic tool but it is not adapted for Machine Learning work for three reasons:
Airflow does not enable iterative development. Executing pipelines locally for development requires running an Airflow instance locally, and submitting a job to run at scale in a cloud cluster requires deploying the pipeline to the Airflow instance, which updates it for everyone. The strong coupling between Airflow code and user code prevents the type of iterative work that is common in ML teams. Sematic enables local execution without any deployment, and packages dependencies at runtime to ship them to the cluster. Multiple engineers can iterate on the same pipeline in parallel, with everything neatly isolated.
Airflow does not guarantee strong traceability of assets. The concept of Lineage Tracking mentioned above is simply not implemented. Users have to build their own layer on top of Airflow to track experiment metadata, inputs and outputs of pipeline steps, code, data, configuration, etc. Sematic guarantees this by default by serializing, persisting, and tracking all inputs and outputs of each pipeline step.
Airflow has very poor visualization capabilities. In order to visualize what went into a given pipeline step (e.g. configurations, data) or out of it (models, metrics, etc.), teams have to build their own visualization tools on top of it. Sematic surfaces all artifacts in the dashboard (e.g. dataframes, plots, metrics, configs) to make it straightforward for users to visualize what happened.
ML orchestration pipelines can be incredibly challenging to scale. Recently, Sematic enabled integration with Ray to address some of these challenges. What are some of the best practices for scaling ML orchestration pipelines?
The first thing to do is to leverage caching. When iterating on pipelines, it’s common that certain things do not change between executions. For example, when iterating on training, it is unnecessary to rerun data preparation. Sematic can hash inputs to detect changes and only run functions whose inputs are different, enabling fast iterations.
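As a rough sketch of the caching idea (a hypothetical decorator, not Sematic’s implementation): hash a step’s inputs and reuse a persisted result whenever that hash has been seen before.

```python
import hashlib
import json
import os
import pickle

CACHE_DIR = ".pipeline_cache"  # hypothetical local cache location


def cached_step(func):
    """Re-run a step only when its inputs change (content-hash memoization)."""
    def wrapper(*args, **kwargs):
        key_material = json.dumps(
            [func.__name__, repr(args), repr(sorted(kwargs.items()))]
        )
        key = hashlib.sha256(key_material.encode()).hexdigest()
        path = os.path.join(CACHE_DIR, f"{key}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)    # inputs unchanged: reuse prior output
        result = func(*args, **kwargs)   # inputs changed: recompute
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper


@cached_step
def prepare_data(raw_path: str, version: int) -> list:
    print("running expensive data preparation...")
    return [raw_path, version]  # stand-in for a real feature dataset
```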
Secondly, leverage heterogeneous compute. Not all pipeline steps need the same compute resources, and using the largest VMs possible for all tasks is not cost-effective. Sematic lets users specify for each pipeline step what resources are needed (e.g. high-memory for data processing, GPUs for train/eval, small VMs to extract reports, etc.), and will allocate them accordingly at runtime.
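As a hypothetical illustration of declaring resources per step (not Sematic’s actual API; the class and field names are made up), the point is to size each step independently rather than for the worst case:

```python
from dataclasses import dataclass


@dataclass
class StepResources:
    """Hypothetical per-step resource request, in the spirit of what an
    orchestrator translates into Kubernetes pod specs at runtime."""
    cpu: str = "1"
    memory: str = "2Gi"
    gpu_count: int = 0


# Declare resources per step instead of using the largest VM everywhere.
PIPELINE_RESOURCES = {
    "prepare_data": StepResources(cpu="8", memory="64Gi"),               # high-memory ETL
    "train_model": StepResources(cpu="8", memory="32Gi", gpu_count=4),   # GPU training
    "build_report": StepResources(cpu="1", memory="2Gi"),                # small VM is enough
}
```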
Thirdly, without dedicated attention, GPUs will often sit idle while data is being downloaded and loaded into memory. Optimizing data streaming into training frameworks (e.g. PyTorch dataloaders) is critical to making sure GPUs are maximally utilized and money can be used to scale instead of paying for idle resources.
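For PyTorch specifically, a common baseline is to parallelize and prefetch data loading so the GPU is never starved. A minimal sketch with a placeholder dataset:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ExampleDataset(Dataset):
    """Placeholder dataset; a real one would read and decode samples from storage."""

    def __len__(self) -> int:
        return 10_000

    def __getitem__(self, idx: int):
        return torch.randn(3, 224, 224), idx % 10


if __name__ == "__main__":
    loader = DataLoader(
        ExampleDataset(),
        batch_size=64,
        num_workers=8,            # decode batches in parallel CPU workers
        pin_memory=True,          # page-locked memory speeds host-to-GPU copies
        prefetch_factor=4,        # keep several batches queued ahead of the GPU
        persistent_workers=True,  # avoid re-forking workers every epoch
    )

    device = "cuda" if torch.cuda.is_available() else "cpu"
    for images, labels in loader:
        images = images.to(device, non_blocking=True)  # overlap copy with compute
        labels = labels.to(device, non_blocking=True)
        # ... forward/backward pass would go here ...
        break
```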
Finally, distributed compute can dramatically speed up your pipelines. Whether it is for data processing (e.g. map/reduce tasks) or training (distributed training), execution times can be cut by as many times as there are nodes available in your cluster. Sematic’s Ray integration enables spinning Ray clusters up and down at runtime with a couple of lines of Python code. This pattern also solves dependency packaging, which is clunky in Ray.
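Sematic’s integration manages the cluster lifecycle around a pipeline step; here is a plain-Ray sketch of the fan-out pattern it enables, with cluster provisioning details omitted:

```python
import ray

# Connect to an existing Ray cluster (or start a local one for development).
# With Sematic's integration, the cluster itself is provisioned and torn down
# around the pipeline step; this sketch only shows plain Ray usage.
ray.init()


@ray.remote
def process_shard(shard_id: int) -> int:
    # Stand-in for per-shard data processing or a per-shard training task.
    return shard_id * shard_id


# Fan out one task per shard and gather results; wall-clock time scales down
# roughly with the number of available nodes/cores.
futures = [process_shard.remote(i) for i in range(100)]
results = ray.get(futures)
print(sum(results))

ray.shutdown()
```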
How does ML orchestration differ in the following three scenarios: 1) traditional centralized supervised learning, 2) mobile/IoT in-device ML, and 3) foundation models or LLMs?
Great question.
For traditional supervised learning, batch processing is sufficient. Users can develop an end-to-end pipeline that reads raw data from a data warehouse, processes it into a feature dataset, then trains the model, evaluates it, extracts reports, and potentially deploys the model.
For edge models, the process can be fairly similar except that the model needs to be exported to the required hardware architecture, and potentially compressed and optimized to run on more constrained resources. This was a common process at Cruise called “model export/conversion”. We used things like TensorRT to convert models trained on large cloud VMs to run on the cars’ hardware. Now if models need to be trained or fine-tuned on the edge as well (as is sometimes necessary for privacy reasons), then it’s a whole other ballgame that requires very specialized tooling (C++ rather than Python, limited resources, limited battery, strict latency requirements, etc.).
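The export step is hardware-specific, but a common first stage is converting the trained PyTorch model to an exchange format such as ONNX before a compiler like TensorRT targets the edge device. A minimal sketch with a placeholder network:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for a trained perception model.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model.eval()

# Export to ONNX as an intermediate format; a converter such as TensorRT can
# then compile the ONNX graph for the target edge hardware.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["image"],
    output_names=["logits"],
    opset_version=17,
    dynamic_axes={"image": {0: "batch"}},  # allow variable batch size at inference
)
```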
The main difference with foundation models is the sheer scale. These models require thousands of GPUs and weeks of training. The most important thing here is fault tolerance. No training or orchestration framework can guarantee a flawless, successful training job across thousands of GPUs for many days. There are bound to be failures (e.g. network failures, out-of-memory issues, code bugs, etc.). Fault tolerance can be implemented thanks to frequent model checkpointing and warm restarts (the training job can restart from a saved checkpoint). Self-healing infrastructure is also necessary to make sure that crashed nodes reboot and rejoin the cluster.
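A minimal sketch of checkpoint-and-warm-restart in PyTorch; the model, optimizer, and checkpoint path are placeholders:

```python
import os

import torch
import torch.nn as nn

CHECKPOINT_PATH = "checkpoint.pt"  # in practice this lives on durable shared storage

model = nn.Linear(128, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# Warm restart: if a checkpoint exists (e.g. after a node crash), resume from it.
if os.path.exists(CHECKPOINT_PATH):
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... training steps for this epoch would go here ...

    # Checkpoint frequently so a failure only loses a bounded amount of work.
    torch.save(
        {
            "epoch": epoch,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        CHECKPOINT_PATH,
    )
```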
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of AI research?
I have been quite interested in so-called model collapse. The idea is that if large foundational models are trained on large amounts of public data from the internet, and if more and more online content is AI-generated, models will essentially train on their own outputs. Some studies have shown that this leads to a collapse of the long tail of rare events that is present in human-generated data (e.g. experimental art, ground-breaking concepts and opinions, marginal content, etc.), leading to more conformity and less innovation. I call this model inbreeding.
The recent generation of foundation models has introduced concepts like fine-tuning, memory, or knowledge augmentation that challenge more traditional MLOps pipelines. How do you envision MLOps architectures adapting to foundation model pipelines?
Unlike traditional supervised ML, training foundation models will not become mainstream, for sheer scale, cost, and expertise reasons. I can see how every Fortune 500 company in 5 years will do some amount of deep learning (e.g. YOLO for industrial quality control) or fine-tuning (e.g. fine-tune Llama on private data), but I don’t think they will all train their own Falcon from scratch. Therefore, I think Big AI will have their own tools dedicated to large-scale foundation model training, while the rest of the industry will still need traditional MLOps tools. However, what is going to emerge are tools around LLM orchestration. LangChain is the first of those, and more will come. Essentially, these create dynamic DAGs of operations such as prompt templating, model inference, ML-powered expert model selection, etc. But these will be real-time pipelines that will have to run within the milliseconds between user input and the expected feedback.
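As a toy sketch of the kind of chain described here, with a hypothetical `call_llm` standing in for a real model call:

```python
from string import Template


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (e.g. a hosted LLM API)."""
    return f"<response to: {prompt[:40]}...>"


# Prompt templating -> routing to an "expert" prompt -> inference, as a tiny
# dynamic chain; real systems add retrieval, memory, and strict latency budgets.
ROUTES = {
    "code": Template("You are a coding assistant. $question"),
    "general": Template("Answer concisely. $question"),
}


def answer(question: str) -> str:
    route = "code" if "python" in question.lower() else "general"  # toy router
    prompt = ROUTES[route].substitute(question=question)
    return call_llm(prompt)


print(answer("How do I reverse a list in Python?"))
```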
Why do traditional CI/CD practices fail when applied to ML solutions?
The idea of CI should be used in ML (e.g. regression testing), and CD as well (recurrent retraining with new data). However, the usual tools (e.g. Buildkite, CircleCI) are not suitable because they lack visualization and traceability of assets. These tools will essentially give you a log trace of the job, but will not enable outputting plots, dataframes, etc.
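As a toy sketch of what regression testing can look like in an ML-flavored CI job; `evaluate_model`, the baseline value, and the file paths are all hypothetical:

```python
# Hypothetical CI-style regression test: fail the build if a candidate model's
# evaluation metric drops below the current production baseline minus a tolerance.
BASELINE_ACCURACY = 0.91   # metric of the model currently in production
TOLERANCE = 0.01           # allowed regression before the pipeline fails


def evaluate_model(model_path: str, eval_set_path: str) -> float:
    """Stand-in for a real evaluation job returning accuracy on a frozen eval set."""
    return 0.92


def test_no_accuracy_regression():
    accuracy = evaluate_model("candidate_model.pt", "frozen_eval_set.parquet")
    assert accuracy >= BASELINE_ACCURACY - TOLERANCE, (
        f"Candidate accuracy {accuracy:.3f} regressed below "
        f"baseline {BASELINE_ACCURACY:.3f}"
    )
```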
How do you see the balance of ML orchestration between standalone platforms and the features incorporated into large platforms such as Amazon SageMaker or Azure ML?
In my experience, large platforms cater to large enterprise companies; that’s where the money is. The products are sometimes inferior, but they are extremely well marketed to reassure CIOs of Fortune 500 companies, especially when those companies have spend commitments with those platforms.
It’s a common pattern in tech that indie challengers build their businesses on a more avant-garde customer base, and as they grow, shift upmarket towards enterprise and sometimes get acquired by Big Tech. GitHub is a good example. Luckily Microsoft seems smart enough to “keep GitHub cool”, but for how long?