Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Links: https://arxiv.org/abs/2101.03961 "SWITCH TRANSFORMERS: SCALING TO TRILLION PARAMETER MODELS WITH SIMPLE AND EFFICIENT SPARSITY"

The paper proposes a network that can scale to a trillion parameters, with two major innovations: building on the Transformer MoE architecture, it simplifies the MoE routing mechanism to reduce routing computation; and it further combines data parallelism, model parallelism, and expert parallelism to reduce training communication and improve training performance.

Model

Simplifying Sparse Routing

Mixture-of-Experts routing "takes as an input a token representation x and then routes this to the best determined top-k experts."

Switch Routing: route each token to only a single expert. This simplification preserves model quality, reduces routing computation, and performs better.

Sparse routing uses the router weights Wr to compute a softmax distribution over the N experts and, for each input token, selects the top-k experts with the highest probability; this is the gating mechanism in MoE. Because each token only activates a few experts, compute demand does not grow in step with parameter count, which makes the model much easier to train. (A sketch of both routing variants follows at the end of this section.)

EFFICIENT SPARSE ROUTING

Parallel Switch implementation

The tensor shapes are statically determined at compilation time, but the computation is dynamic due to the routing decisions at training and inference. The paper reconciles the two through the expert capacity: each expert gets a fixed-size buffer of (tokens per batch / number of experts) × capacity factor slots, and tokens routed to an expert whose buffer is already full are dropped, their representations passed directly to the next layer through the residual connection.
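To make the routing concrete, here is a minimal NumPy sketch of the router described above (function and variable names are illustrative, not from the paper's code): Wr produces a softmax distribution over N experts, classic MoE keeps the top-k experts per token, and Switch routing keeps only the argmax and uses its probability as the gate value.

```python
import numpy as np

def router_probs(x, w_r):
    # Router logits h(x) = x @ Wr, normalized with a softmax over the N experts.
    logits = x @ w_r                                      # [tokens, num_experts]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def moe_top_k(probs, k=2):
    # Classic MoE gating: keep the k highest-probability experts per token.
    return np.argsort(probs, axis=-1)[:, -k:]             # [tokens, k]

def switch_top_1(probs):
    # Switch routing: exactly one expert per token; that expert's router
    # probability becomes the gate value that scales the expert's output.
    expert_index = probs.argmax(axis=-1)                  # [tokens]
    gate = probs[np.arange(probs.shape[0]), expert_index]
    return expert_index, gate

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))   # 8 tokens, d_model = 16 (toy sizes)
w_r = rng.normal(size=(16, 4))      # router weights Wr for N = 4 experts
probs = router_probs(tokens, w_r)
print(moe_top_k(probs, k=2))        # MoE: top-2 experts per token
print(switch_top_1(probs))          # Switch: single expert + gate value
```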
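The expert-capacity mechanism that keeps tensor shapes static despite dynamic routing can be sketched the same way. The capacity formula is the paper's ((tokens per batch / number of experts) × capacity factor); the greedy dispatch loop and all names below are an illustrative assumption, not the Mesh-TensorFlow implementation.

```python
import math
import numpy as np

def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    # Paper's formula; capacity_factor > 1.0 leaves headroom for imbalanced
    # routing (1.25 here is just an example value).
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch(expert_index, num_experts, capacity):
    # Greedily assign each token a slot in its chosen expert's fixed-size
    # buffer; tokens that overflow are dropped (in the real model their
    # representations pass through unchanged via the residual connection).
    used = np.zeros(num_experts, dtype=int)
    kept, dropped = [], []
    for token, expert in enumerate(expert_index):
        if used[expert] < capacity:
            kept.append((token, expert, used[expert]))  # (token, expert, slot)
            used[expert] += 1
        else:
            dropped.append(token)
    return kept, dropped

cap = expert_capacity(tokens_per_batch=8, num_experts=4, capacity_factor=1.0)
routes = np.array([0, 0, 0, 1, 2, 2, 3, 1])  # toy Switch routing decisions
kept, dropped = dispatch(routes, num_experts=4, capacity=cap)
print(cap, kept, dropped)  # capacity 2 -> the third token sent to expert 0 is dropped
```

Because every expert buffer has exactly `capacity` slots regardless of how routing turns out, the compiler sees fixed tensor shapes, which is what makes the data/model/expert-parallel implementation efficient.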