👯♀️👯 Edge#74: How Uber, Google, DeepMind and Microsoft Train Models at Scale
What’s New in AI, a deep dive into one of the freshest research papers or technology frameworks that are worth your attention. Our goal is to keep you up to date with new developments in AI in a way that complements the concepts we are debating in other editions of our newsletter.
💥 What’s New in AI: How Uber, Google, DeepMind and Microsoft Train Models at Scale
Large-scale training is one of the most challenging aspects of building deep learning solutions in the real world. As the old proverb says, your greatest strength can become your biggest weakness, and that certainly applies to deep learning models. The entire deep learning space became possible partly because of the ability of deep neural networks to scale across GPU topologies. However, that same ability to scale resulted in computationally intensive programs that are operationally challenging for most organizations. From training to optimization, the lifecycle of deep learning programs requires robust infrastructure building blocks to parallelize and scale computation workloads. While deep learning frameworks are evolving at a rapid pace, the corresponding infrastructure models remain relatively nascent. Over the last few years, technology giants such as Google, Microsoft, Uber and DeepMind have regularly unveiled separate efforts to enable the parallelization of deep learning models across large GPU infrastructures.
The principle of distributing and parallelizing computations is relevant across almost any stage of the lifecycle of a deep learning program. Training a deep learning model can be an incredibly expensive exercise and so is its execution. The obvious answer is to leverage large GPU networks to distribute the workloads of deep learning programs, but that’s far from being an easy endeavor. Concurrent and parallel programming is notoriously complex and even more so when applied to large neural networks. Large technology companies face these challenges every day as they operate incredibly complex deep neural networks for mission-critical applications. Today, I would like to review some of the top architectures used at Google, DeepMind, Microsoft and Uber to parallelize the training of large-scale deep learning models. Specifically, I would like to discuss the following projects:
Google GPipe
Uber Horovod
DeepMind TF-Replicator
Microsoft DeepSpeed
Google’s GPipe
GPipe focuses on scaling training workloads for deep learning programs. The complexity of the training process, from an infrastructure standpoint, is an often-overlooked aspect of deep learning models. Training datasets are getting larger and more complex. For instance, in the healthcare space, it is not uncommon to encounter models that need to be trained using millions of high-resolution images. As a result, training processes often take a long time to complete and become incredibly expensive in terms of memory and CPU consumption.
An effective way to think about the parallelism of deep learning models is to divide it into data parallelism and model parallelism. The data parallelism approach employs large clusters of machines to split the input data across them. Model parallelism moves the model itself to accelerators, such as GPUs or TPUs, which have special hardware to accelerate model training. At a high level, almost all training datasets can be parallelized following certain logic, but the same can't be said about models. For instance, some deep learning models are composed of parallel branches that can be trained independently. In that case, a classic strategy is to divide the computation into partitions and assign different partitions to different branches. However, that strategy falls short for deep learning models that stack layers sequentially, which makes it challenging to parallelize the computation efficiently.
GPipe combines both data and model parallelism by leveraging a technique called pipelining. Conceptually, GPipe is a distributed machine learning library that uses synchronous stochastic gradient descent and pipeline parallelism for training, applicable to any DNN that consists of multiple sequential layers. GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. This model allows GPipe’s accelerators to operate in parallel, maximizing the scalability of the training process.
The following figure illustrates the GPipe model, in which a neural network with sequential layers is partitioned across four accelerators. Fk is the composite forward computation function of the k-th partition and Bk is the corresponding backpropagation function. Bk depends on both Bk+1 from the upper layer and the intermediate activations of Fk. In the top diagram, we can see how the sequential nature of the network leads to the underutilization of resources. The bottom diagram shows the GPipe approach, in which the input mini-batch is divided into smaller micro-batches that can be processed by the accelerators at the same time.
Image credit: The original paper
Google open-sourced an implementation of GPipe as part of the TensorFlow project.
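To make the pipelining idea concrete, here is a purely illustrative Python sketch (not the GPipe API) that prints the forward-pass schedule across four hypothetical accelerators: with a single batch, three of the four partitions sit idle at any given time, while splitting the batch into micro-batches keeps the pipeline mostly full.

```python
def forward_schedule(num_partitions, num_micro_batches):
    """Illustrative only: print which micro-batch each partition (accelerator)
    processes at every clock tick of the forward pass."""
    total_ticks = num_partitions + num_micro_batches - 1
    for t in range(total_ticks):
        slots = []
        for k in range(num_partitions):
            mb = t - k  # micro-batch mb reaches partition k after k ticks
            slots.append(f"F{k}(mb{mb})" if 0 <= mb < num_micro_batches else "idle")
        print(f"t={t}: " + "  ".join(slots))

# One un-split batch leaves 3 of 4 partitions idle at every tick;
# splitting it into 4 micro-batches fills the pipeline after a short ramp-up.
forward_schedule(num_partitions=4, num_micro_batches=1)
forward_schedule(num_partitions=4, num_micro_batches=4)
```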
Uber Horovod
Horovod is one of Uber's ML stacks. It has become extremely popular within the community and has been adopted by research teams at AI powerhouses like DeepMind and OpenAI. Conceptually, Horovod is a framework for running distributed deep learning training jobs at scale.
Horovod leverages Message Passing Interface (MPI) stacks such as Open MPI so that a training job can run on a highly parallel and distributed infrastructure without major modifications. Running a distributed TensorFlow training job with Horovod is accomplished in four simple steps:
hvd.init() initializes Horovod.
config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns a GPU to each of the TensorFlow processes.
opt = hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with a Horovod optimizer, which takes care of averaging gradients using ring-allreduce.
hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes to ensure consistent initialization.
You can see these four steps in the following code sample, a template for a basic TensorFlow training job:
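The sketch below follows the classic Horovod template for TensorFlow 1.x; model_fn, features and labels are placeholders for your own model code.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Step 1: initialize Horovod.
hvd.init()

# Step 2: pin one GPU to each process, based on the process's local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build the model and loss as usual; model_fn, features and labels are
# placeholders for your own code.
loss = model_fn(features, labels)

# Step 3: wrap the regular optimizer with the Horovod optimizer, which
# averages gradients across processes using ring-allreduce. Scaling the
# learning rate by the number of workers is a common convention.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Step 4: broadcast initial variable states from rank 0 to all other processes.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Only rank 0 writes checkpoints, to avoid conflicts between workers.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks,
                                       config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

The job is then launched across processes with horovodrun, for example horovodrun -np 4 python train.py.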
DeepMind’s TF-Replicator
TF-Replicator focuses on a different aspect of scalability: how TensorFlow programs leverage Tensor Processing Units (TPUs). Considered among the most advanced AI chips, TPUs provide native scalability for machine learning workloads. However, using TPUs in TensorFlow programs requires specialized APIs, which creates portability issues and barriers to adoption for data scientists who are not familiar with the underlying hardware model. DeepMind's TF-Replicator addresses this challenge by providing a simpler, developer-friendly programming model for leveraging TPUs in TensorFlow programs.
The magic of TF-Replicator relies on an "in-graph replication" pattern, in which the computation for each device is replicated in the same TensorFlow graph. Communication between devices is achieved by connecting nodes from the devices' corresponding sub-graphs. To achieve that level of parallelization, TF-Replicator leverages TensorFlow's graph-rewriting model to insert native communication between devices in the graph. When presented with a TensorFlow graph, TF-Replicator first builds the computation for each device independently and leaves placeholders where cross-device computation has been specified by the user. Once the sub-graphs for all devices have been built, TF-Replicator connects them by replacing the placeholders with actual cross-device computation.
Image credit: DeepMind
From the programming model standpoint, code written using TF-Replicator looks similar to the native TensorFlow code authored for a single device. The user simply needs to define (1) an input function that exposes a Dataset, and (2) a step function that defines the logic of their model (e.g. a single step of gradient descent). The following code snippet shows a simple TF-Replicator program:
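A sketch modeled on the example in DeepMind's announcement follows; the import path, the TpuReplicator constructor arguments, and the helper names (resnet_model, step_fn, input_fn, num_train_steps) are assumptions standing in for user code rather than verified API details.

```python
import tensorflow as tf
# Assumption: DeepMind's TF-Replicator package is importable as tf_replicator.
import tf_replicator

# Replicate the computation across the 8 cores of a single TPU worker.
repl = tf_replicator.TpuReplicator(num_workers=1, num_tpu_cores_per_worker=8)

with repl.context():
    model = resnet_model()                           # user-defined model (placeholder)
    base_optimizer = tf.train.AdamOptimizer()
    optimizer = repl.wrap_optimizer(base_optimizer)  # wrapped for cross-replica averaging

# input_fn exposes a Dataset; step_fn performs one step of gradient descent.
per_replica_loss = repl.run(step_fn, input_fn)
train_op = tf.reduce_mean(per_replica_loss)

with tf.train.MonitoredSession() as session:
    repl.init(session)
    for _ in range(num_train_steps):                 # num_train_steps: placeholder
        session.run(train_op)
    repl.shutdown(session)
```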
To optimize the communication between different devices, TF-Replicator leverages state-of-the-art MPI interfaces. In some experiments, DeepMind was able to train the famous BigGAN model on batches of size 2048 across up to 512 cores of a TPUv3 pod. At the moment, TF-Replicator is the main programming interface for TPUs at DeepMind.
Microsoft DeepSpeed
Microsoft’s DeepSpeed is a new open-source framework focused on optimizing the training of massively large deep learning models. The current release includes the first implementation of ZeRO, as well as other optimization methods. From a programming standpoint, DeepSpeed is built on top of PyTorch and provides a simple API that allows engineers to leverage training parallelization techniques with just a few lines of code. DeepSpeed abstracts away the difficult aspects of large-scale training, such as parallelization, mixed precision, gradient accumulation, and checkpointing, allowing developers to focus on the construction of their models.
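As a rough sketch of what that looks like (MyModel, cmd_args and train_loader are placeholders, and the model is assumed to return its loss directly):

```python
import deepspeed
import torch

# MyModel, train_loader and cmd_args are placeholders for your own model,
# data pipeline and parsed command-line arguments, which point DeepSpeed to
# its JSON config (batch sizes, ZeRO stage, fp16 settings, etc.).
model = MyModel()

# deepspeed.initialize wraps the model in an engine that handles distributed
# data parallelism, mixed precision, gradient accumulation and checkpointing.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=cmd_args,
    model=model,
    model_parameters=model.parameters())

for inputs, labels in train_loader:
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)

    loss = model_engine(inputs, labels)   # forward pass (model assumed to return its loss)
    model_engine.backward(loss)           # backward pass, with loss scaling if fp16 is enabled
    model_engine.step()                   # optimizer and learning-rate scheduler step
```

Most of the optimization behavior is driven by the JSON configuration file passed on the command line rather than by code changes, which is what keeps the required edits to a few lines.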
From a functional standpoint, DeepSpeed excels at four key aspects:
Scale: DeepSpeed provides system support to run models with up to 100 billion parameters, which represents a 10x improvement over other training optimization frameworks.
Speed: In the initial tests, DeepSpeed showed 4x-5x higher throughput than that of other libraries.
Cost: Models can be trained with DeepSpeed at roughly one-third the cost of alternatives.
Usability: DeepSpeed does not require refactoring PyTorch models and can be used with just a few lines of code.
Image credit: Microsoft Research Blog
Conclusion
Parallelizing the training of deep learning models is a very complex exercise that falls beyond the expertise of most machine learning teams. Leveraging the frameworks and architectures created by technology companies like Google, Microsoft, Uber or DeepMind can certainly streamline these efforts. In the near future, we expect to see some versions of these frameworks included in mainstream deep learning stacks, democratizing access for the broader deep learning community.
🧠 The Quiz
Every ten quizzes we reward two random people. Participate! The question is the following:
Which of the following frameworks is the best fit for parallelizing training when building a PyTorch-based machine learning model?