TheSequence

⚙️ Edge#183: Data vs Model Parallelism in Distributed Training

Apr 19, 2022
In this issue: 

  • we explore data vs model parallelism in distributed training; 

  • we discuss how AI training scales;

  • we provide an overview of Microsoft DeepSpeed, a training framework powering some of the largest neural networks in the world.  

Enjoy the learning!  

💡 ML Concept of the Day: Data vs. Model Parallelism  

Distributed training of ML models relies on dividing the workload across multiple nodes that perform different parts of the training task. From an architectural standpoint, there are two fundamental approaches to distributed training: data parallelism and model parallelism.  

  1. Data Parallelism 

    The main idea of data parallelism in distributed training is to replicate the model across different nodes and train each replica on a different portion of the dataset. The dataset is divided into a number of partitions, one per node. Each node downloads a copy of the model, trains it on its assigned subset of the data, and performs backpropagation to compute gradients. Finally, the gradients from all nodes are aggregated to produce a new version of the model, as sketched in the code example after the figure below.  

    [Figure: Data parallelism. Image credit: Frank Denneman]
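
A minimal, self-contained sketch of the idea using PyTorch's DistributedDataParallel. PyTorch, the toy linear model, the random dataset, and the script name are illustrative assumptions on our part; the issue itself does not prescribe a framework.

```python
# Minimal data-parallel training sketch (assumed: PyTorch + DistributedDataParallel).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train():
    # Each process joins the same process group; torchrun supplies
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT via environment variables.
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Every node holds a full replica of the (toy) model.
    model = torch.nn.Linear(10, 1)
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    # DistributedSampler hands each node a disjoint partition of the dataset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()    # DDP averages gradients across all replicas here
        optimizer.step()   # every replica applies the same aggregated update

    dist.destroy_process_group()


if __name__ == "__main__":
    train()
```

Run with one process per node (or per GPU), for example `torchrun --nproc_per_node=4 data_parallel_sketch.py` (the script name is hypothetical). The sketch mirrors the paragraph above: each process sees a different data partition, and gradient averaging inside `backward()` keeps all replicas identical after every step.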

  2. Model Parallelism 

This post is for paid subscribers
