Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Links: https://arxiv.org/abs/2101.03961 “SWITCH TRANSFORMERS: SCALING TO TRILLION PARAMETER MODELS WITH SIMPLE AND EFFICIENT SPARSITY”. The paper proposes a network that scales to a trillion parameters, with two main innovations on top of the Transformer MoE architecture: it simplifies the MoE routing mechanism to reduce computation, and it further lowers training communication and improves training throughput by combining data parallelism, model parallelism, and expert parallelism. Model — Simplifying Sparse Routing: Mixture-of-Experts routing takes a token representation x as input and routes it to the best determined top-k experts. Switch Routing routes each token to only a single expert; this simplification preserves model quality, reduces routing computation, and performs better. Sparse routing uses the router parameters Wr to compute a softmax distribution over the N experts and, for each input token, selects the top-k experts with the highest probability; this corresponds to the gating mechanism in MOE. As a result, the compute requirement does not grow sharply with the parameter count, which makes the model easier to train. EFFICIENT SPARSE ROUTING — parallel Switch implementation: tensor shapes are statically determined at compilation time, while computation is dynamic due to the routing decisions at training and inference....
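
As a minimal sketch of the top-1 (Switch) routing idea described above: a router matrix Wr produces one logit per expert, a softmax gives the routing distribution, and each token is dispatched to its single highest-probability expert, whose output is later scaled by the gate value. The class name `SwitchRouter` and the dimensions are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchRouter(nn.Module):
    """Top-1 (Switch) routing sketch: each token is sent to a single expert."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        # W_r: router weights producing one logit per expert
        self.w_r = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        logits = self.w_r(x)                       # [num_tokens, num_experts]
        probs = F.softmax(logits, dim=-1)          # router distribution over N experts
        gate, expert_index = probs.max(dim=-1)     # top-1: best expert per token
        # Downstream, the chosen expert's output is scaled by its gate value:
        # y = gate * expert[expert_index](x)
        return gate, expert_index


# Usage: route 8 tokens of width 16 across 4 experts (illustrative sizes).
router = SwitchRouter(d_model=16, num_experts=4)
tokens = torch.randn(8, 16)
gate, idx = router(tokens)
print(idx)   # expert assignment per token
print(gate)  # router probability used to scale each expert's output
```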

2021-07-10 · 4 min · Cong Chan

Mixture of Experts (MOE)

Mixture of Experts (MOE) is an ensemble method built on a divide-and-conquer idea: decompose a complex modeling task into several relatively simple subtasks and train a dedicated model for each, which involves subtask decomposition or clustering; a gating model is then needed to decide, based on the input, how to combine the outputs of the expert models. Mixture of experts aims at increasing the accuracy of a function approximation by replacing a single global model with a weighted sum of local models (experts). It is based on a partition of the problem domain into several subdomains via clustering algorithms, followed by training a local expert on each subdomain. Local Models & Global Models: Hinton's lecture slides describe two extreme ways of fitting a distribution. Very local models: use many highly localized models, e....
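
A minimal sketch of the classic (dense) MoE combination described above: a gating network produces softmax weights over the experts, and the output is the gate-weighted sum of all expert outputs. The class name `DenseMoE`, the linear experts, and the dimensions are illustrative assumptions rather than a specific published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    """Classic MoE: output is a gate-weighted sum of all experts' outputs."""

    def __init__(self, d_in: int, d_out: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_in, d_out) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(d_in, num_experts)  # gating network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, d_in]
        weights = F.softmax(self.gate(x), dim=-1)                # [batch, num_experts]
        outputs = torch.stack([e(x) for e in self.experts], 1)   # [batch, num_experts, d_out]
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # weighted sum of local models


# Usage: combine 4 local experts on a batch of 32 inputs (illustrative sizes).
moe = DenseMoE(d_in=16, d_out=8, num_experts=4)
y = moe(torch.randn(32, 16))
print(y.shape)  # torch.Size([32, 8])
```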

2021-07-03 · 3 min · Cong Chan