🔸◽️Edge#94: Determined AI Tackles the Monster Challenge of Distributed Training

Jun 03, 2021

In this issue, we overview:

the challenges of training models at scale;
the core objective of the Determined platform and its capabilities;
what the master-agent architecture is and how it helps deliver great results.

💥 What’s New in AI: Determined AI Tackles the Monster Challenge of Distributed Training

Training is one of those aspects of machine learning applications that we tend to take for granted. Until we need to do it at scale. While training a single machine learning model in a lab environment seems easy enough, scaling that training infrastructure across a large number of models in production rapidly becomes a nightmare. GPU allocation, training workload distribution, and hyperparameter optimization are some of the key challenges that immediately surface at scale. Data scientists don’t always make the best infrastructure engineers and, similarly, infrastructure engineers often struggle to understand the key requirements of training data science models. Machine learning models have training infrastructure requirements that are very different from those of traditional distributed systems architectures. In order to achieve mainstream adoption, machine learning infrastructure should become as transparent and ubiquitous as web servers and databases became for mobile and web applications.

For data scientists working on real-world projects, awareness of machine learning infrastructure platforms is as relevant as, if not more relevant than, staying up to date with the latest deep learning framework. Unfortunately, machine learning infrastructure platforms do not receive the same level of attention as development stacks.

Determined is a deep learning training platform focused on streamlining the adoption of native AI-first infrastructure. While Determined covers many infrastructure building blocks, it certainly excels in the area of model training. Think of Determined as a consistent, enterprise-grade experience to leverage several of the most robust machine learning training frameworks in the market such as Horovod (created by Uber), Metaflow (created by Netflix) and many others. Large-scale machine learning training is not a problem solved by a single open-source framework or tool, but rather by a combination of different stacks. That is what the Determined platform tries to achieve. Let’s dive in.

The Determined Platform

The core objective of Determined is to abstract the infrastructure building blocks required to train machine learning models at scale. Determined includes many capabilities relevant to data science teams, but most of them can be grouped into four fundamental areas:

Distributed Training: Determined enables data scientists to train models using many GPUs without needing changes in the underlying code.
Hyperparameter Tuning: Determined includes state-of-the-art hyperparameter search algorithms used to tune models based on the training results.
Experiment Tracking: Determined provides a real-time dashboard to track the performance of different experiments covering code versions, metrics, checkpoints, and hyperparameters.
Access and Share GPU Resources: Determined allows data science teams to keep track of GPU resources and share them in an efficient way.
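To make the hyperparameter tuning idea more concrete, here is a minimal sketch of successive halving, a common early-stopping search strategy. It is a generic illustration and not Determined's API or its actual search implementation. The `validation_loss` function is a hypothetical toy objective standing in for a real training run, and the names `successive_halving`, `min_steps` and `eta` are assumptions made for this example.

```python
import math
import random


def validation_loss(learning_rate, steps):
    """Toy stand-in for training a model for `steps` steps (hypothetical).

    The loss is lowest when learning_rate is 0.1 and shrinks as steps grow.
    A real search would train and evaluate an actual model here.
    """
    return (math.log10(learning_rate) + 1.0) ** 2 + 1.0 / steps


def successive_halving(candidates, min_steps=1, eta=2):
    """Keep the best 1/eta of the candidates each round.

    Survivors get eta times more training steps in the next round, so most
    of the budget is spent on the most promising configurations.
    """
    steps = min_steps
    survivors = list(candidates)
    while len(survivors) > 1:
        # Rank the remaining configurations at the current training budget.
        ranked = sorted(survivors, key=lambda lr: validation_loss(lr, steps))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        steps *= eta
    return survivors[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Sample 16 learning rates log-uniformly between 1e-4 and 1.
    candidates = [10 ** rng.uniform(-4, 0) for _ in range(16)]
    best = successive_halving(candidates)
    print(f"best learning rate: {best:.4f}")
```

In this toy setup the ranking of candidates does not change as the budget grows, so the search simply converges on the learning rate closest to 0.1. With real models, early rankings are noisy, which is why production search algorithms add safeguards on top of this basic idea.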

Other capabilities of Determined include visualization and debugging of machine learning models as well as different training accelerator methods. One of the things I find very convenient about Determined is the native SDKs for deep learning frameworks like TensorFlow, Keras and PyTorch, which allow data scientists to incorporate those sophisticated training infrastructure capabilities without having to make major modifications to their original model code.