Efficient Pipeline Planning for Expedited Distributed DNN Training

Luo, Ziyue; Yi, Xiaodong; Long, Guoping; Fan, Shiqing; Wu, Chuan; Yang, Jun; Lin, Wei

doi:10.1109/INFOCOM48880.2022.9796787

Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2204.10562 (cs)

[Submitted on 22 Apr 2022]

Abstract: Pipeline parallelism has recently emerged for training modern large DNN models: it distributes the model across GPUs and lets different devices process different microbatches in a pipelined fashion. Earlier pipeline designs allow multiple versions of the model parameters to co-exist (similar to asynchronous training), and cannot guarantee the same convergence and accuracy as training without pipelining. Synchronous pipelining, proposed more recently, preserves model performance by enforcing a synchronization barrier between training iterations. Nonetheless, the synchronization barrier requires waiting for gradient aggregation from all microbatches and thus delays training progress. Minimizing this wait, and hence the training time, requires optimized pipeline planning, which has not been well studied in the literature. This paper designs efficient, near-optimal algorithms for expediting synchronous pipeline-parallel training of modern large DNNs over arbitrary inter-GPU connectivity. Our algorithm framework comprises two components: a pipeline partition and device mapping algorithm, and a pipeline scheduler that decides the processing order of microbatches over the partitions; together they minimize the per-iteration training time. We conduct thorough theoretical analysis, extensive testbed experiments, and trace-driven simulation, and demonstrate that our scheme can accelerate training by up to 157% compared with state-of-the-art designs.
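To make the cost of the synchronization barrier concrete, the following is a minimal sketch of how per-iteration time could be estimated for a synchronous pipeline. It is not the paper's algorithm: it assumes a simple schedule in which every stage runs all forward passes before any backward pass, one stage per GPU, and zero communication cost. The function name `sync_pipeline_iteration_time` and the per-stage time lists are hypothetical, introduced only for illustration.

```python
from typing import List


def sync_pipeline_iteration_time(
    fwd: List[float], bwd: List[float], num_microbatches: int
) -> float:
    """Estimate the time of one synchronous training iteration.

    Assumptions (illustrative only, not taken from the paper):
    - fwd[s] / bwd[s] is the forward / backward time of one microbatch on stage s;
    - each stage runs all of its forward passes before any backward pass;
    - inter-stage communication is free.
    """
    num_stages = len(fwd)
    stage_free = [0.0] * num_stages  # time at which each stage becomes idle
    f_end = [[0.0] * num_microbatches for _ in range(num_stages)]
    b_end = [[0.0] * num_microbatches for _ in range(num_stages)]

    # Forward passes flow from the first stage to the last.
    for m in range(num_microbatches):
        for s in range(num_stages):
            ready = f_end[s - 1][m] if s > 0 else 0.0
            start = max(ready, stage_free[s])
            f_end[s][m] = start + fwd[s]
            stage_free[s] = f_end[s][m]

    # Backward passes flow from the last stage to the first.
    for m in range(num_microbatches):
        for s in reversed(range(num_stages)):
            ready = b_end[s + 1][m] if s < num_stages - 1 else f_end[s][m]
            start = max(ready, stage_free[s])
            b_end[s][m] = start + bwd[s]
            stage_free[s] = b_end[s][m]

    # The barrier: the iteration ends only when every stage has finished
    # the backward pass of every microbatch.
    return max(stage_free)


if __name__ == "__main__":
    # Same total work per microbatch, split evenly vs. unevenly over two stages.
    balanced = sync_pipeline_iteration_time([2.0, 2.0], [2.0, 2.0], 2)
    unbalanced = sync_pipeline_iteration_time([1.0, 3.0], [1.0, 3.0], 2)
    print(balanced, unbalanced)  # 12.0 14.0
```

Under these assumptions, the unevenly partitioned pipeline takes longer per iteration even though the total work is identical, which illustrates why the partition and the microbatch order both affect the wait at the barrier.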

Comments:	INFOCOM 2022
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2204.10562 [cs.DC]
	(or arXiv:2204.10562v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2204.10562
Related DOI:	https://doi.org/10.1109/INFOCOM48880.2022.9796787

Submission history

From: Ziyue Luo
[v1] Fri, 22 Apr 2022 08:18:41 UTC (4,556 KB)
