Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. However, it suffers from problems such as high memory usage and redundancy among experts, which arise from duplication…
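For context, a minimal sketch of the kind of top-k routing an SMoE layer typically uses (this is a generic illustration, not the paper's method; the names `TopKRouter`, `num_experts`, and `top_k` are illustrative assumptions):

```python
# Generic sketch of top-k expert routing in an SMoE layer.
# Not the paper's algorithm; a common baseline routing policy.
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class TopKRouter:
    """Routes each token to its top-k experts via a learned linear gate."""
    def __init__(self, d_model, num_experts, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.w_gate = rng.normal(0, 0.02, size=(d_model, num_experts))
        self.top_k = top_k

    def __call__(self, tokens):
        # tokens: (n_tokens, d_model) -> routing distribution over experts
        logits = tokens @ self.w_gate                   # (n, num_experts)
        probs = softmax(logits)
        top_idx = np.argsort(-probs, axis=-1)[:, :self.top_k]
        top_p = np.take_along_axis(probs, top_idx, axis=-1)
        top_p /= top_p.sum(axis=-1, keepdims=True)      # renormalize over chosen experts
        return top_idx, top_p

# Toy experts: in a real SMoE these are duplicated FFN layers,
# which is the source of the memory overhead the abstract mentions.
experts = [lambda x, s=s: x * (s + 1) for s in range(8)]
router = TopKRouter(d_model=16, num_experts=8)
x = np.random.default_rng(1).normal(size=(4, 16))
idx, w = router(x)
# Each token's output is the weighted sum of its k selected experts.
y = np.stack([
    sum(w[t, j] * experts[idx[t, j]](x[t]) for j in range(idx.shape[1]))
    for t in range(x.shape[0])
])
```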