Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

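Links: https://arxiv.org/abs/2101.03961 “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity” proposes a network that can scale to a trillion parameters. It has two main innovations: building on the Transformer MoE architecture, it simplifies the MoE routing mechanism and reduces computation; it further combines data parallelism, model parallelism, and expert parallelism to reduce training communication and improve training performance. Model. Simplifying Sparse Routing. Mixture of Expert Routing: takes as input a token representation x and routes it to the best determined top-k experts. Switch Routing: route to only a single expert; this simplification preserves model quality, reduces routing computation, and performs better. Sparse routing uses the router parameters Wr to compute a softmax distribution over the N experts and, for each input token, selects the top-k experts with the highest probability; this corresponds to the gating mechanism in MoE. As a result, the compute requirement does not grow substantially as the parameter count increases, which makes the model easier to train. Efficient Sparse Routing. Parallel Switch implementation: tensor shapes are statically determined at compilation time, but computation is dynamic due to the routing decisions at training and inference....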
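
The routing step described in this excerpt can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the names `router_probs`, `top_k_route`, and `switch_layer`, the toy experts, and the weight values are all assumptions chosen for illustration. It computes a softmax over N experts from router weights, picks the top-k experts, and shows the Switch case (k = 1), where the chosen expert's output is scaled by its router probability.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def router_probs(x, w_r):
    # w_r holds one weight row per expert (hypothetical layout: N rows of len(x)).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_r]
    return softmax(logits)

def top_k_route(probs, k):
    # Indices of the k experts with the highest router probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

def switch_layer(x, w_r, experts):
    # Switch routing: send the token to a single expert (k = 1) and
    # scale that expert's output by its router probability.
    probs = router_probs(x, w_r)
    idx = top_k_route(probs, 1)[0]
    y = experts[idx](x)
    return idx, [probs[idx] * v for v in y]

# Toy setup (all values are illustrative assumptions).
x = [1.0, 2.0]
w_r = [[0.1, 0.0], [0.0, 1.0], [-1.0, 0.0]]
experts = [
    lambda t: [2.0 * v for v in t],
    lambda t: [v + 1.0 for v in t],
    lambda t: list(t),
]
chosen, output = switch_layer(x, w_r, experts)
```

Only one expert function is evaluated per token here, which is the sense in which compute stays roughly flat as more experts (parameters) are added.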

2021-07-10 · 4 min · Cong Chan

Mixture of Experts (MOE)

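Mixture of Experts (MOE). MOE is one of the Ensemble Methods and follows a divide-and-conquer idea: decompose a complex modeling task into several relatively simple subtasks and train a dedicated model for each subtask, which involves subtask decomposition or clustering. It also needs a gating model that, based on the input data, chooses how to combine the results of the expert models. Mixture of experts aims at increasing the accuracy of a function approximation by replacing a single global model by a weighted sum of local models (experts). It is based on a partition of the problem domain into several subdomains via clustering algorithms followed by a local expert training on each subdomain. Local Models & Global Models. Hinton's course slides describe two extreme ways for models to fit a distribution: Very local models: use many highly localized models, e....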
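
The "weighted sum of local models" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the linear gate, the function names `gate` and `mixture_of_experts`, and the two toy experts are hypothetical and not taken from the post. Each expert covers one subdomain (x < 0 and x > 0), and a softmax gate decides how much each expert contributes for a given input.

```python
import math

def gate(x, gate_params):
    # Hypothetical linear gate: one (a, b) pair per expert, logit = a * x + b,
    # followed by a softmax so the mixing weights sum to 1.
    logits = [a * x + b for a, b in gate_params]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(x, experts, gate_params):
    # Output is the gate-weighted sum of the local experts' predictions.
    weights = gate(x, gate_params)
    outputs = [f(x) for f in experts]
    return sum(w * y for w, y in zip(weights, outputs)), weights

# Two local linear experts; together they approximate |x| away from 0.
experts = [lambda x: -x, lambda x: x]
gate_params = [(-5.0, 0.0), (5.0, 0.0)]

y_pos, w_pos = mixture_of_experts(3.0, experts, gate_params)
y_neg, w_neg = mixture_of_experts(-3.0, experts, gate_params)
```

Away from the boundary the gate puts almost all weight on one expert, so the mixture behaves like a partition of the input domain; near x = 0 the two experts are blended.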

2021-07-03 · 3 min · Cong Chan

Survey - Pre-Trained Models - Past, Present and Future

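Links: https://arxiv.org/abs/2106.07139 A quick tour of a newly released survey of Pre-Trained Models. First, the definitions of some terms used in the survey. Transfer learning: a method for coping with the data-hungry problem in machine learning; it is supervised. Self-Supervised Learning: also used to cope with the data-hungry problem, especially for completely unlabeled data; it learns by using the data itself as labels in some way (e.g. language modeling), so it has much in common with unsupervised learning. Unsupervised learning usually refers to pattern-recognition problems such as clustering, community discovery, and anomaly detection, whereas self-supervised learning remains within the scope of supervised learning and focuses on problems such as classification and generation. Pre-trained models (PTMs): pre-training is a specific training scheme that can use either transfer learning or Self-Supervised Learning. 2 Background. Overview map. Pre-training falls into two broad categories: 2.1 Transfer Learning and Supervised Pre-Training, which can be further divided into feature transfer and parameter transfer; 2.2 Self-Supervised Learning and Self-Supervised Pre-Training. Transfer learning can be divided into four sub-categories: inductive transfer learning (Lawrence and Platt, 2004; Mihalkova et al., 2007; Evgeniou and Pontil, 2007), transductive transfer learning (Shimodaira, 2000; Zadrozny, 2004; Daume III and Marcu, 2006), self-taught learning (Raina et al....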

2021-06-19 · 10 min · Cong Chan

CorefQA - Coreference resolution as query-based span prediction

2020, ACL data: CoNLL-2012, GAP task: Coreference Resolution Handles coreference as a QA problem: a query is generated for each candidate mention using its surrounding context, and a span prediction module is employed to extract the text spans of the coreferences within the document using the generated query. Recent approaches consider all text spans in a document as potential mentions and learn to find an antecedent for each possible mention. Drawbacks of such methods, which rely only on comparing mentions: At the task formalization level: because current datasets have many omitted mentions, mentions left out at the mention proposal stage can never be recovered since the downstream module only operates on the proposed mentions....

2021-05-11 · 2 min · Cong Chan

Early Rumour Detection

2019, NAACL data: TWITTER, WEIBO links: https://www.aclweb.org/anthology/N19-1163, https://github.com/DeepBrainAI/ERD task: Rumour Detection This paper encodes the stream of social media posts with a GRU, which serves as the state representation of the environment, and trains a classifier that takes the GRU state output as input and performs binary classification of whether the text is a rumour. An agent is trained with DQN to decide, based on the state, whether to trigger the rumour classifier, and is rewarded or penalized according to whether the classification is correct. The goal is to predict whether social media posts are a rumour as accurately and as early as possible. Focuses on the task of rumour detection; particularly, we are interested in understanding how early we can detect them. Our model treats social media posts (e.g. tweets) as a data stream and integrates reinforcement learning to learn the minimum number of posts required before we classify an event as a rumour. Let $E$ denote an event, and it consists of a series of relevant posts $x_i$, where $x_0$ denotes the source message and $x_T$ the last relevant message....

2021-05-01 · 3 min · Cong Chan

Matching the Blanks - Distributional Similarity for Relation Learning

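2019, ACL data: KBP37, SemEval 2010 Task 8, TACRED task: Entity and Relation Extraction Build task agnostic relation representations solely from entity-linked text. Weaknesses: the paper assumes that, in web text, identical entity pairs generally refer to the same relation, and it constructs negative samples from pairs whose entities differ. Within a single document this is probably true most of the time. But two entity pairs not being fully identical does not mean their relations differ, so using them as negative samples essentially captures entity identity rather than relations. A better approach would also take into account pairs with different entities but the same relation. Method: Define Relation Statement. We define a relation statement to be a block of text containing two marked entities. From this, we create training data that contains relation statements in which the entities have been replaced with a special [BLANK]...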
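
As a rough illustration of building such training data, the sketch below replaces marked entity spans in a tokenized relation statement with a [BLANK] symbol. It is an assumption-laden example, not the paper's code: the function name `blank_entities`, the `(start, end)` span format, and the `blank_prob` parameter (the chance that a given entity is blanked) are all hypothetical.

```python
import random

BLANK = "[BLANK]"

def blank_entities(tokens, spans, blank_prob=1.0, rng=None):
    # tokens: list of str; spans: non-overlapping (start, end) token ranges,
    # end exclusive, one per marked entity (hypothetical input format).
    # blank_prob: assumed probability of replacing each entity with BLANK.
    if rng is None:
        rng = random.Random()
    span_end = {start: end for start, end in spans}
    out = []
    i = 0
    while i < len(tokens):
        if i in span_end:
            end = span_end[i]
            if rng.random() < blank_prob:
                out.append(BLANK)            # hide the entity surface form
            else:
                out.extend(tokens[i:end])    # keep the entity as written
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "Marie Curie was born in Warsaw".split()
spans = [(0, 2), (5, 6)]
blanked = blank_entities(tokens, spans, blank_prob=1.0)
kept = blank_entities(tokens, spans, blank_prob=0.0)
```

With the entities hidden, a model has to rely on the surrounding context rather than on memorized entity names when comparing two relation statements.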

2021-04-21 · 3 min · Cong Chan

A Frustratingly Easy Approach for Joint Entity and Relation Extraction

2020, NAACL data: ACE 04, ACE 05, SciERC links: https://github.com/princeton-nlp/PURE task: Entity and Relation Extraction Proposes a simple but effective pipeline approach: builds on two independent pre-trained encoders and merely uses the entity model to provide input features for the relation model. The experiments validate the importance of learning distinct contextual representations for entities and relations, fusing entity information at the input layer of the relation model, and incorporating global context. Judging from the results, the larger gain seems to come from the cross-sentence context. Method: Input: a sentence X consisting of n tokens x1, ....

2021-04-20 · 2 min · Cong Chan

Two are Better than One - Joint Entity and Relation Extraction with Table-Sequence Encoders

2020, EMNLP data: ACE 04, ACE 05, ADE, CoNLL04 links: https://github.com/LorrinWWW/two-are-better-than-one. task: Entity and Relation Extraction In this work, we propose the novel table-sequence encoders where two different encoders (a table encoder and a sequence encoder) are designed to help each other in the representation learning process. This EMNLP 2020 paper argues that previous joint learning methods focus on learning a single encoder (usually learning representation in the form of a table) to capture information required for both tasks within the same space....

2021-03-27 · 2 min · Cong Chan

Improving Event Detection via Open-domain Trigger Knowledge

2020, ACL data: ACE 05 task: Event Detection Propose a novel Enrichment Knowledge Distillation (EKD) model to efficiently distill external open-domain trigger knowledge to reduce the in-built biases to frequent trigger words in annotations. Leverage the wealth of the open-domain trigger knowledge to improve ED. Propose a novel teacher-student model (EKD) that can learn from both labeled and unlabeled data. Weaknesses: it only handles the common case, i.e. generic trigger words, but a trigger word is not a trigger in every context. Method: empower the model with external knowledge called Open-Domain Trigger Knowledge, defined as a prior that specifies which words can trigger events without being subject to pre-defined event types and the domain of texts....

2021-03-25 · 3 min · Cong Chan

DeepPath - A Reinforcement Learning Method for Knowledge Graph Reasoning

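2017, EMNLP data: FB15K-237, FB15K task: Knowledge Graph Reasoning Use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path. Method: The RL system consists of two parts. The first part is the external environment, which specifies the dynamics of the interaction between the agent and the knowledge graph; the environment is modeled as a Markov decision process. The second part, the RL agent, is represented as a policy network that maps the state vector to a stochastic policy. The network parameters are updated with stochastic gradient descent. Compared with DQN, policy-based RL methods are better suited to this knowledge graph setting. One reason is that in path finding over a knowledge graph, the action space can be very large because of the complexity of the relation graph, which may lead to poor convergence for DQN. In addition, the policy network can learn a stochastic policy that prevents the agent from getting stuck at an intermediate state, avoiding the problems that value-based methods such as DQN run into when learning a policy. Reinforcement learning for relation reasoning. Actions: given some entity pairs and a relation, we want the agent to find the most informative paths linking these entity pairs. Starting from the source entity, the agent uses the policy network to pick the most promising relation and extends its path one step at a time until it reaches the target entity. To keep the output dimension of the policy network consistent, the action space is defined as all relations in the knowledge graph. States: the entities and relations in a knowledge graph are naturally discrete atomic symbols. Existing practical knowledge graphs such as Freebase and NELL usually contain a huge number of triples, so it is impossible to model all atomic symbols directly as states. To capture the semantic information of these symbols, we use translation-based embedding methods such as TransE and TransH to represent entities and relations. These embeddings map all symbols into a low-dimensional vector space. In this framework, each state captures the agent's position in the knowledge graph. After taking an action, the agent moves from one entity to another. Two states are linked by the action (relation) the agent has just taken. The state vector at step t:...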

2020-03-11 · 2 min · Cong Chan