Transformers

Links: https://arxiv.org/abs/2101.03961 “SWITCH TRANSFORMERS: SCALING TO TRILLION PARAMETER MODELS WITH SIMPLE AND EFFICIENT SPARSITY”，提出了一种可以扩展到万亿参数的网络，有两个比较大的创新，基于Transformer MoE网络结构，简化了MoE的routing机制，降低了计算量；进一步通过数据并行+模型并行+expert并行的方式降低了训练通信量，提升训练性能。模型 Simplifying Sparse Routing Mixture of Expert Routing which takes as an input a token representation x and then routes this to the best deter- mined top-k experts Switch Routing: route to only a single expert, this simplification preserves model quality, reduces routing computation and performs better. Sparse routing通过参数Wr计算出一个在N个experts上的softmax分布，对每个token输入筛选概率最高的 top k 个 experts，对应的是MOE中的门控机制。这样对算力的需求并没有随着参数量的增加而大幅增长，使得这个模型更加容易训练。 ...