This paper introduces Ravnest, an asynchronous decentralized training system for large deep learning models. As these models grow in size and complexity, traditional centralized training runs into memory constraints. Ravnest sidesteps these limits by pooling the compute of ordinary, resource-constrained PCs connected over the internet. It organizes these nodes into clusters with similar data transfer rates and computing capabilities; each cluster performs Zero-Bubble Asynchronous Model Parallel training, while a Parallel Multi-Ring All-Reduce method handles global parameter averaging across all clusters. The paper also discusses the linear speedup with respect to the number of participating clusters and a bound on the staleness parameter.
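
To make the parameter-averaging step concrete, here is a minimal sketch of a single simulated ring all-reduce that averages one parameter vector across clusters. It is not the paper's implementation: Ravnest's Parallel Multi-Ring All-Reduce runs several such rings concurrently over disjoint parameter shards and across real network links, whereas the function name `ring_allreduce_average` and the in-memory "cluster" lists below are hypothetical stand-ins used purely for illustration.

```python
import numpy as np


def ring_allreduce_average(params):
    """Average one parameter vector across n simulated clusters via a ring all-reduce."""
    n = len(params)
    # Each cluster splits its vector into n chunks, one per ring position.
    chunks = [np.array_split(p.astype(np.float64).copy(), n) for p in params]

    # Reduce-scatter: after n-1 steps, cluster i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            send_idx = (i - step) % n   # chunk that cluster i passes along the ring
            dst = (i + 1) % n           # right-hand neighbour in the ring
            chunks[dst][send_idx] = chunks[dst][send_idx] + chunks[i][send_idx]

    # All-gather: circulate the reduced chunks so every cluster ends up with all of them.
    for step in range(n - 1):
        for i in range(n):
            send_idx = (i + 1 - step) % n   # chunk that is already fully reduced here
            dst = (i + 1) % n
            chunks[dst][send_idx] = chunks[i][send_idx]

    # Divide by the number of clusters to turn the global sum into an average.
    return [np.concatenate(c) / n for c in chunks]


# Example with three simulated clusters: every cluster ends with the element-wise mean.
clusters = [np.arange(8.0), np.arange(8.0) * 2, np.arange(8.0) * 3]
averaged = ring_allreduce_average(clusters)
assert all(np.allclose(a, np.arange(8.0) * 2) for a in averaged)
```

The appeal of the ring topology is that each participant only ever communicates with its two neighbours and transfers roughly 2(n-1)/n of the parameter size per all-reduce, which keeps per-cluster bandwidth requirements flat as more clusters join; running multiple rings in parallel over different parameter shards, as Ravnest does, further overlaps communication.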


Publication date: 3 Jan 2024
Project Page: https://arxiv.org/abs/2401.01728
Paper: https://arxiv.org/pdf/2401.01728