📨 Edge#191: MPI – the Fundamental Enabler of Distributed Training
In this issue:
we discuss the fundamental enabler of distributed training: message passing interface (MPI);
we overview Google’s paper about General and Scalable Parallelization for ML Computation Graphs;
we share the most relevant technology stacks that enable distributed training in TensorFlow applications.
Enjoy the learning!
💡 ML Concept of the Day: MPI: The Enabler of Distributed Training
During this series about distributed training, we have covered some of the main methods that enable scaling training across large clusters of nodes. However, one question on everyone’s mind when learning about distributed training is which technologies make this possible. To conclude this series, we would like to discuss what many consider the fundamental enabler of distributed training: message passing interface (MPI).
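To give a flavor of what MPI programs look like in practice, here is a minimal sketch (our illustration, not code from the newsletter) using mpi4py and NumPy. It performs an allreduce over a gradient-like array, the collective operation that data-parallel training ultimately relies on to synchronize gradients across workers. The file name and the "gradient" values are purely hypothetical.

```python
# Minimal sketch, assuming mpi4py and NumPy are installed.
# Run with, e.g.:  mpirun -n 4 python allreduce_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator spanning all launched processes
rank = comm.Get_rank()     # this worker's id
size = comm.Get_size()     # total number of workers

# Pretend each worker computed a local gradient on its own data shard.
local_grad = np.full(4, float(rank), dtype=np.float64)

# Sum the local gradients across all workers; every rank receives the result.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

# Average to obtain the synchronized gradient used for the weight update.
global_grad /= size

if rank == 0:
    print("averaged gradient:", global_grad)
```

This allreduce pattern is the conceptual core of data-parallel training: each worker contributes its local gradients, and all workers end up with the same averaged result before applying the update.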
MPI has become one of the most widely adopted standards for high-performance computing (HPC) architectures, powering computing systems from companies such as Intel, IBM, and NVIDIA. Not surprisingly, it has also been adopted by most distributed training frameworks in machine learning. Functionally,