This paper introduces Ravnest, an asynchronous decentralized training system for large deep learning models. As these models grow in size and complexity, traditional centralized training runs into memory constraints on a single machine. Ravnest addresses this by pooling the compute of ordinary, resource-limited PCs connected over the internet. Participating nodes are grouped into clusters with similar data transfer rates and computing capabilities; each cluster performs Zero-Bubble Asynchronous Model Parallel training, while a Parallel Multi-Ring All-Reduce method averages parameters globally across clusters. The paper also discusses linear speedup with respect to the number of participating clusters and derives a bound on the staleness parameter.
Publication date: 3 Jan 2024
Project Page: https://arxiv.org/abs/2401.01728
Paper: https://arxiv.org/pdf/2401.01728
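
The summary above mentions a Parallel Multi-Ring All-Reduce step for global parameter averaging. As a rough illustration of the ring all-reduce primitive such a step builds on (not the paper's actual implementation), the sketch below simulates, in a single NumPy process, one parameter vector per participant being averaged via the standard reduce-scatter and all-gather schedule. The function name `ring_allreduce_average` and the in-memory, single-ring setup are assumptions made purely for illustration.

```python
import numpy as np

def ring_allreduce_average(node_params):
    """Single-process simulation of ring all-reduce averaging.

    `node_params` is a list of equal-length 1-D arrays, one per participant
    (standing in here for one per cluster). Returns the global element-wise
    average as seen by every participant after reduce-scatter + all-gather.
    NOTE: illustrative sketch only, not Ravnest's implementation.
    """
    n = len(node_params)
    # Each participant splits its vector into n chunks; chunk boundaries are
    # identical across participants because the vectors have equal length.
    chunks = [list(np.array_split(p.astype(float), n)) for p in node_params]

    # Reduce-scatter: after n-1 steps, participant i holds the fully
    # summed chunk with index (i + 1) % n.
    for step in range(n - 1):
        sends = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n  # successor on the ring
            chunks[dst][(i - step) % n] = chunks[dst][(i - step) % n] + sends[i]

    # All-gather: circulate the fully reduced chunks so every participant
    # ends up with the complete summed vector.
    for step in range(n - 1):
        sends = [chunks[i][(i + 1 - step) % n].copy() for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            chunks[dst][(i + 1 - step) % n] = sends[i]

    # Divide the sums by the number of participants to obtain the average.
    return [np.concatenate(c) / n for c in chunks]

if __name__ == "__main__":
    params = [np.array([1.0, 2.0, 3.0, 4.0]),
              np.array([5.0, 6.0, 7.0, 8.0]),
              np.array([9.0, 10.0, 11.0, 12.0])]
    for avg in ring_allreduce_average(params):
        print(avg)  # every participant ends up with [5. 6. 7. 8.]
```

Each participant only ever exchanges one chunk per step with its ring neighbour, which is why the communication cost per participant stays roughly constant as more participants join; the paper's parallel multi-ring variant runs several such rings concurrently across clusters.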