🔸◽️Edge#94: Determined AI Tackles the Monster Challenge of Distributed Training

No subscription is needed

Jun 03, 2021

This is an example of TheSequence Edge, a Premium newsletter that our subscribers receive every Tuesday and Thursday. Become smarter about ML and AI.

In this issue, we overview:

the challenges of training models at scale;
the core objective of the Determined platform and its capabilities;
what the master-agent architecture is and how it helps deliver great results.

💥 What’s New in AI: Determined AI Tackles the Monster Challenge of Distributed Training

Training is one of those aspects of machine learning applications that we tend to take for granted until we need to do it at scale. While training a single machine learning model in a lab environment seems simple enough, scaling that training infrastructure across a large number of models in production rapidly becomes a nightmare. GPU allocation, training workload distribution, and hyperparameter optimization are some of the key challenges that immediately surface at scale. Data scientists don’t always make the best infrastructure engineers and, similarly, infrastructure engineers often struggle to understand the key requirements of training data science models. Machine learning models have training infrastructure requirements that are very different from those of traditional distributed systems architectures. To achieve mainstream adoption, machine learning infrastructure should become as transparent and ubiquitous as web servers and databases became for mobile and web applications.

For data scientists working on real-world projects, being aware of machine learning infrastructure platforms is as relevant as, if not more relevant than, staying up to date with the latest deep learning framework. Unfortunately, machine learning infrastructure platforms do not receive the same level of attention as development stacks.

Determined is a deep learning training platform focused on streamlining the adoption of native AI-first infrastructure. While Determined covers many infrastructure building blocks, it certainly excels in the area of model training. Think of Determined as a consistent, enterprise-grade experience to leverage several of the most robust machine learning training frameworks in the market such as Horovod (created by Uber), Metaflow (created by Netflix) and many others. Large-scale machine learning training is not a problem solved by a single open-source framework or tool, but rather by a combination of different stacks. That is what the Determined platform tries to achieve. Let’s dive in.

The Determined Platform

The core objective of Determined is to abstract the core infrastructure building blocks required to train machine learning models at scale. Determined includes many capabilities relevant to data science teams, but most of them can be grouped into four fundamental areas:

Distributed Training: Determined enables data scientists to train models using many GPUs without needing changes in the underlying code.
Hyperparameter Tuning: Determined includes state-of-the-art hyperparameter search algorithms used to fine-tune models based on the training results.
Experiment Tracking: Determined provides a real-time dashboard to track the performance of different experiments covering code versions, metrics, checkpoints, and hyperparameters.
Access and Share GPU Resources: Determined allows data science teams to keep track of GPU resources and share them in an efficient way.

Other capabilities of Determined include visualization and debugging of machine learning models as well as different training accelerator methods. One of the things I find very convenient about Determined is the native SDKs for deep learning frameworks like TensorFlow, Keras and PyTorch, which allow data scientists to incorporate those sophisticated training infrastructure capabilities without having to make major modifications to their original model code.
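To make the distributed training capability more concrete, here is a minimal sketch of the data-parallel idea that frameworks such as Horovod build on: each worker computes a gradient on its own shard of the batch, and the per-worker gradients are then averaged. This is a toy, single-process illustration in plain Python, not Determined's or Horovod's API; the function names and the one-parameter linear model are assumptions made for the example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w: float, batch: List[Sample], num_workers: int) -> float:
    """Split the batch into equal shards, compute one gradient per worker, average them."""
    shard_size = len(batch) // num_workers
    if shard_size * num_workers != len(batch):
        raise ValueError("batch must split evenly across workers")
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # In a real cluster each shard would be processed on a different GPU.
    worker_grads = [gradient(w, shard) for shard in shards]
    # Averaging plays the role of the all-reduce step.
    return sum(worker_grads) / num_workers


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
single = gradient(0.5, batch)
parallel = data_parallel_gradient(0.5, batch, num_workers=2)
print(abs(single - parallel) < 1e-9)
```

With equally sized shards, the averaged gradient matches the full-batch gradient, which is why the model code itself does not need to change when more workers are added.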

Image credit: Determined AI

The Architecture

To deliver the features outlined in the previous section, Determined relies on a distributed architecture based on a master-agent model. In that architecture, the master node is responsible for coordinating the training workflow across different agents. More specifically, the master node has three core responsibilities in the Determined platform:

Storing experiment, trial, and workload metadata.
Scheduling and dispatching work to agents.
Advancing the experiment, trial, and workload state machines over time.

The agent nodes in the Determined architecture are responsible for executing ML workloads. Specifically, each agent manages a pool of computing resources in the form of CPUs or GPUs. These resource pools are known as slots. Each slot is used to execute a containerized task known as the trial runner, which can represent any computation required in the training of machine learning models.
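The division of labor between master and agents can be sketched in a few lines. The classes below are hypothetical and heavily simplified (they are not Determined's code or API): the master keeps trial metadata and state, and dispatches pending trials to free slots on agents.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Agent:
    """Hypothetical agent that manages a fixed pool of slots (CPUs or GPUs)."""
    name: str
    num_slots: int
    running: Dict[int, str] = field(default_factory=dict)  # slot index -> trial id

    def free_slots(self) -> List[int]:
        return [i for i in range(self.num_slots) if i not in self.running]


@dataclass
class Master:
    """Hypothetical master: stores trial state and dispatches work to agents."""
    agents: List[Agent]
    trial_state: Dict[str, str] = field(default_factory=dict)  # trial id -> state

    def submit(self, trial_id: str) -> None:
        self.trial_state[trial_id] = "PENDING"

    def schedule(self) -> None:
        """Assign each pending trial, in submission order, to the first free slot."""
        for trial_id in list(self.trial_state):
            if self.trial_state[trial_id] != "PENDING":
                continue
            for agent in self.agents:
                free = agent.free_slots()
                if free:
                    agent.running[free[0]] = trial_id
                    self.trial_state[trial_id] = "RUNNING"
                    break

    def complete(self, trial_id: str) -> None:
        """Advance a trial to COMPLETED and release the slot it occupied."""
        for agent in self.agents:
            for slot, running_id in list(agent.running.items()):
                if running_id == trial_id:
                    del agent.running[slot]
        self.trial_state[trial_id] = "COMPLETED"


master = Master(agents=[Agent("agent-a", 2), Agent("agent-b", 1)])
for i in range(4):
    master.submit(f"trial-{i}")
master.schedule()
print(master.trial_state)
```

In this sketch three slots are available for four trials, so the last trial stays pending until a slot is released and `schedule()` runs again. A real scheduler would also handle priorities, multi-slot trials, and failures.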

The workflow that coordinates master and agent nodes in the Determined architecture is relatively straightforward:

The master collects information on the topology of the agents in a cluster.
The master estimates the size of a target cluster and which agents should be part of it.
The master invokes the APIs of infrastructures such as AWS or GCP to provision or terminate specific agents.
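The steps above amount to a small planning problem: given the pending work and the current agents, decide how many agents to add and which idle ones to remove. The function below is a simplified sketch of that decision under assumed rules (fixed slots per agent, a hard cap on cluster size); the name `plan_cluster` and its parameters are hypothetical and do not correspond to Determined's implementation or to any cloud provider API.

```python
import math
from typing import Dict, List, Tuple


def plan_cluster(
    pending_slots: int,
    agents_busy: Dict[str, bool],
    slots_per_agent: int,
    max_agents: int,
) -> Tuple[int, List[str]]:
    """Return (number of agents to provision, names of idle agents to terminate)."""
    idle = sorted(name for name, busy in agents_busy.items() if not busy)
    idle_capacity = len(idle) * slots_per_agent

    if pending_slots > idle_capacity:
        # Not enough idle capacity: request more agents, up to the cluster cap.
        shortfall = pending_slots - idle_capacity
        wanted = math.ceil(shortfall / slots_per_agent)
        headroom = max(0, max_agents - len(agents_busy))
        return min(wanted, headroom), []

    # Enough idle capacity: keep only the idle agents the pending work needs.
    keep = math.ceil(pending_slots / slots_per_agent)
    return 0, idle[keep:]


to_provision, to_terminate = plan_cluster(
    pending_slots=10,
    agents_busy={"agent-a": True, "agent-b": False},
    slots_per_agent=4,
    max_agents=5,
)
print(to_provision, to_terminate)
```

The returned plan would then be handed to whatever layer actually talks to the infrastructure provider, which keeps the sizing logic independent of any specific cloud.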

Determined’s master-agent architecture is consistent across different infrastructures such as AWS, Google Cloud and Kubernetes, which facilitates its deployment in heterogeneous enterprise environments.

Image credit: Determined AI

Even though Determined excels in resource scheduling and training workloads, this is far from its only contribution to streamlining the lifecycle of machine learning models. Hyperparameter optimization is another key capability of Determined. The current version includes several hyperparameter tuning techniques ranging from basic random/grid search to more sophisticated population-based training and adaptive search methods. More importantly, Determined integrates its hyperparameter search capabilities with its job scheduler, making it a native building block of the lifecycle of machine learning models.
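To illustrate the difference between basic random search and adaptive methods, here is a generic sketch of successive halving, one well-known family of adaptive search: sample configurations at random, train them all with a small budget, keep the better half, and double the budget for the survivors. This is not Determined's searcher; the search space, the function names, and the `toy_loss` stand-in for real training are assumptions made for the example.

```python
import math
import random
from typing import Callable, Dict, List


def sample_config(rng: random.Random) -> Dict[str, float]:
    """Draw one random configuration from a hypothetical search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "dropout": rng.uniform(0.0, 0.5),
    }


def successive_halving(
    evaluate: Callable[[Dict[str, float], int], float],
    num_configs: int,
    min_budget: int,
    rng: random.Random,
) -> Dict[str, float]:
    """Random sampling plus early stopping: keep the better half each round."""
    configs: List[Dict[str, float]] = [sample_config(rng) for _ in range(num_configs)]
    budget = min_budget
    while len(configs) > 1:
        # Lower score is better (for example, validation loss).
        ranked = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = ranked[: max(1, len(configs) // 2)]
        budget *= 2  # survivors get more training budget in the next round
    return configs[0]


def toy_loss(config: Dict[str, float], budget: int) -> float:
    """Stand-in for training: loss improves with budget and is lowest near lr = 0.01."""
    lr_penalty = abs(math.log10(config["learning_rate"]) + 2)
    return lr_penalty + config["dropout"] + 1.0 / budget


best = successive_halving(toy_loss, num_configs=8, min_budget=1, rng=random.Random(0))
print(best)
```

Because weak configurations are dropped early, most of the compute goes to promising ones. That is also why tying the search to the job scheduler matters: freed slots can be reassigned as soon as a trial is stopped.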

Experiment lifecycle management is another area of Determined’s contribution. Using Determined’s user interface, data science teams can submit experiments using different configurations. Behind the scenes, Determined distributes the different experiments across the underlying master-agent topology and tracks the results.

Image credit: Determined AI

The final capability of Determined that is worth highlighting is its developer experience. Machine learning infrastructure is really difficult and, quite often, there is a mismatch between the deep learning libraries used in the implementation of a model and the infrastructure used to deploy it and run it. Determined addresses this challenge by providing a consistent programming model across different frameworks, such as TensorFlow, Keras and PyTorch, that incorporates its infrastructure capabilities as a first-class building block.

Infrastructure management is one of the most challenging aspects of real-world machine learning solutions. Even though the platforms in this space are still nascent, they can deliver a lot of value by streamlining the infrastructure requirements of machine learning models. Determined is one of the early innovators in the machine learning infrastructure management space that has achieved meaningful market traction, and it delivers a modern set of capabilities that are quite helpful across the lifecycle of machine learning solutions. The current version of Determined is open-source and can be provisioned in different cloud and container environments.

Further Reading: More details about Determined can be found on GitHub.
