⚙️ Edge#183: Data vs Model Parallelism in Distributed Training
In this issue:
we explore data vs model parallelism in distributed training;
we discuss how AI training scales;
we review Microsoft DeepSpeed, a training framework powering some of the largest neural networks in the world.
Enjoy the learning!
💡 ML Concept of the Day: Data vs. Model Parallelism
The core principle of distributed training is to split the training workload across multiple nodes. From an architectural standpoint, there are two fundamental approaches: data parallelism, in which each node works on a different portion of the data, and model parallelism, in which each node hosts a different portion of the model.
The main idea of data parallelism is to replicate the same model on different nodes and train each replica on a different portion of the dataset in parallel. The dataset is divided into partitions, one per node. Each node downloads a copy of the model, runs forward and backward passes on its partition, and computes local gradients via backpropagation. Finally, those gradients are aggregated, typically by averaging, and the synchronized update is applied to every replica so all nodes continue with the same version of the model.
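The loop above can be sketched in a few lines. The following is a minimal NumPy toy, not a real distributed runtime: the "nodes" are just array shards in one process, the model is a linear regressor, and `local_gradient` is a hypothetical helper standing in for each node's backward pass. The averaging step plays the role of the all-reduce a real framework would perform.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full dataset: 8 samples, 3 features
y = rng.normal(size=8)
w = np.zeros(3)               # model weights, replicated on every "node"

def local_gradient(w, X_shard, y_shard):
    """Mean-squared-error gradient computed on one node's data shard."""
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(y_shard)

# Partition the dataset across 4 nodes; each computes a local gradient.
shards = zip(np.split(X, 4), np.split(y, 4))
grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]

# Aggregation step (an all-reduce in a real system): average the local
# gradients and apply the same update on every replica.
g = np.mean(grads, axis=0)
w -= 0.1 * g

# With equal shard sizes, the averaged gradient equals the full-batch
# gradient, so every replica stays in sync with single-node training.
assert np.allclose(g, local_gradient(np.zeros(3), X, y))
```

The closing assertion captures why this scheme works: averaging per-shard gradients over equal-sized partitions reproduces the full-batch gradient, so adding nodes changes throughput, not the mathematics of the update.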