Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks, but duplicating network layers into many expert copies leads to high memory usage and redundancy among experts. This paper introduces M-SMoE, a merging algorithm for SMoE that uses routing statistics to guide expert merging. It first aligns experts by neuron permutation, then forms groups around dominant experts according to the routing policy, and finally merges each group into a single expert, weighting every member by its activation frequency. Building on the merged model, the full MC-SMoE pipeline (merge, then compress) further decomposes the merged experts into low-rank and structurally sparse alternatives, reducing memory usage by up to 80% and FLOPs by 20% with virtually no loss in performance.
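
To make the final merging step concrete, below is a minimal sketch (not the authors' released code) of frequency-weighted expert merging with a simple permutation-alignment pass. It assumes two-layer feed-forward experts and hypothetical helpers `align_to_anchor` and `merge_expert_group`; the alignment here is solved with `scipy.optimize.linear_sum_assignment` over cosine similarities of hidden neurons, standing in for whatever alignment procedure the paper uses.

```python
# Hedged sketch of frequency-weighted expert merging for SMoE.
# Assumptions (not from the paper): experts are two-layer FFNs with
# w_in of shape [d_ff, d_model] and w_out of shape [d_model, d_ff];
# `freqs` holds each expert's routing activation frequency.
import torch
from scipy.optimize import linear_sum_assignment


def align_to_anchor(anchor_w_in, w_in, w_out):
    """Permute one expert's hidden neurons to line up with the anchor's.

    Neurons are matched by cosine similarity of their input-projection rows,
    solved as a linear assignment problem.
    """
    a = torch.nn.functional.normalize(anchor_w_in, dim=1)
    b = torch.nn.functional.normalize(w_in, dim=1)
    cost = -(a @ b.T).numpy()            # negate: assignment minimizes cost
    _, col = linear_sum_assignment(cost)
    perm = torch.as_tensor(col)
    # Permute rows of w_in and the matching columns of w_out.
    return w_in[perm], w_out[:, perm]


def merge_expert_group(w_ins, w_outs, freqs):
    """Merge a group of experts into one, weighted by activation frequency."""
    weights = torch.tensor(freqs, dtype=torch.float32)
    weights = weights / weights.sum()
    # Treat the most frequently routed expert as the (dominant) alignment anchor.
    anchor = int(torch.argmax(weights))
    merged_in = torch.zeros_like(w_ins[0])
    merged_out = torch.zeros_like(w_outs[0])
    for i, (w_in, w_out) in enumerate(zip(w_ins, w_outs)):
        if i != anchor:
            w_in, w_out = align_to_anchor(w_ins[anchor], w_in, w_out)
        merged_in += weights[i] * w_in
        merged_out += weights[i] * w_out
    return merged_in, merged_out


if __name__ == "__main__":
    d_model, d_ff, n_experts = 16, 64, 4
    w_ins = [torch.randn(d_ff, d_model) for _ in range(n_experts)]
    w_outs = [torch.randn(d_model, d_ff) for _ in range(n_experts)]
    freqs = [0.5, 0.2, 0.2, 0.1]         # hypothetical routing frequencies
    merged_in, merged_out = merge_expert_group(w_ins, w_outs, freqs)
    print(merged_in.shape, merged_out.shape)
```

The group of merged experts then replaces its members, so the router only needs to dispatch tokens to one expert per group, which is where the memory and FLOPs savings come from.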


Publication date: 2 Oct 2023
Project Page: https://github.com/UNITES-Lab/MC-SMoE
Paper: https://arxiv.org/pdf/2310.01334