Tutel: Adaptive Mixture-of-Experts at Scale

1. Paper Overview

Attribute	Content
Title	Tutel: Adaptive Mixture-of-Experts at Scale
arXiv	2206.03382
Affiliations	UC Berkeley, Microsoft
Code	https://github.com/microsoft/tutel

Tutel supports flexible combinations of EP (Expert Parallelism):

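During training, if an expert is assigned no tokens in a batch, that expert's computation can be skipped for that step. Tutel exploits this property to design a dynamic Drop Tolerance strategy: once a token-coverage threshold is reached, it proactively skips the remaining experts.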
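The skipping rule described above can be illustrated with a minimal sketch. This is not Tutel's API or implementation: the function name `select_experts`, the parameter `coverage_threshold`, the top-1 routing, and the choice to visit experts from most to least loaded are all assumptions made for illustration.

```python
from collections import Counter


def select_experts(assignments, num_experts, coverage_threshold=0.9):
    """Split expert ids into (active, skipped) for one batch.

    assignments: one expert id per token (top-1 routing is assumed here).
    coverage_threshold: fraction of tokens that must be covered before
        the remaining experts are skipped (hypothetical parameter).
    """
    counts = Counter(assignments)
    total = len(assignments)

    # Assumption of this sketch: visit experts from most to least loaded,
    # breaking ties by expert id so the result is deterministic.
    order = sorted(range(num_experts), key=lambda e: (-counts.get(e, 0), e))

    active, skipped, covered = [], [], 0
    for expert in order:
        n_tokens = counts.get(expert, 0)
        # Skip experts that received no tokens, and every expert visited
        # after the coverage threshold has already been reached.
        if n_tokens == 0 or covered / total >= coverage_threshold:
            skipped.append(expert)
            continue
        active.append(expert)
        covered += n_tokens
    return active, skipped


if __name__ == "__main__":
    routing = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]  # 10 tokens, 6 experts
    print(select_experts(routing, num_experts=6, coverage_threshold=0.8))
```

In this toy batch, experts 4 and 5 receive no tokens and are always skipped. With a threshold of 0.8, expert 3 is skipped as well, because experts 0 to 2 already cover 9 of the 10 tokens; its single token would be dropped, which is the trade-off the threshold controls.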