Posts

DQN, Double DQN, Dueling DoubleQN, Rainbow DQN

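The evolution of deep reinforcement learning from DQN and Natural DQN through Double DQN, Dueling DoubleQN and Rainbow DQN, with the must-read papers.

Overestimation in DQN: DQN is built on Q-learning. The Q-learning target contains a max over Q values, and that max causes the Q target to be overestimated. Double DQN was designed to address this overestimation. In practice, if you print the Q values of your DQN, you may find that they are all extremely large; that is overestimation at work. The neural network part of DQN can be viewed as a latest network plus an older network: they share the same architecture, but their parameters are updated with a time lag. The Q target is:

$$Y_t^\text{DQN} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)$$

Overestimation refers to the fact that taking the maximum of a set of numbers and then averaging usually gives a value greater than or equal to averaging first and then taking the maximum:

$$E(\max(X_1, X_2, ...)) \ge \max(E(X_1), E(X_2), ...)$$

In general, the overestimation in Q-learning is attributed to its update rule:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)(r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t))$$

The optimization carried out by this update is as follows ...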
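The inequality $E(\max) \ge \max(E)$ can be checked numerically. The sketch below is an illustration, not code from the post: it assumes five actions whose true Q value is 0 and adds zero-mean Gaussian noise to every estimate (the function names, the noise model and all constants are assumptions). It also contrasts the DQN target with the Double DQN target, which picks the action with the online network and evaluates it with the target network.

```python
import random


def mean_of_max(n_actions, n_trials, noise_std, rng):
    """Average of max_a over noisy estimates whose true value is 0 for every action."""
    total = 0.0
    for _ in range(n_trials):
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(noisy_q)
    return total / n_trials


def dqn_target(reward, gamma, q_target_next):
    """R + gamma * max_a Q(S', a; theta^-)."""
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    """R + gamma * Q(S', argmax_a Q(S', a; theta); theta^-)."""
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


rng = random.Random(0)
# The true values are all 0, so max_a E[Q] = 0, yet the mean of the max is clearly positive.
bias = mean_of_max(n_actions=5, n_trials=10000, noise_std=1.0, rng=rng)
print(f"max of means = 0.0, mean of max = {bias:.3f}")

# Hypothetical next-state Q values from the online and the target network.
q_online = [1.0, 2.0, 0.5]
q_target = [1.2, 1.5, 3.0]
y_dqn = dqn_target(1.0, 0.9, q_target)
y_double = double_dqn_target(1.0, 0.9, q_online, q_target)
print(y_dqn, y_double)
```

Because the Double DQN target evaluates one particular action instead of taking the maximum, it can never exceed the DQN target computed from the same target-network values.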

DeepPath - A Reinforcement Learning Method for Knowledge Graph Reasoning

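2017, EMNLP. Data: FB15K-237, FB15K. Task: Knowledge Graph Reasoning.

Use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path.

Method: The RL system consists of two parts. The first is the external environment, which specifies the dynamics of the interaction between the agent and the knowledge graph; it is modeled as a Markov decision process. The second is the RL agent, represented as a policy network that maps the state vector to a stochastic policy. The network parameters are updated by stochastic gradient descent. Compared with DQN, policy-based RL methods are better suited to this knowledge graph setting. One reason is that in knowledge graph path finding the action space can be very large because of the complexity of the relation graph, which can make DQN converge poorly. In addition, the policy network can learn a stochastic policy, which keeps the agent from getting stuck in an intermediate state, instead of the greedy policy that value-based methods such as DQN typically learn.

Reinforcement learning for relation reasoning.

Actions: Given a set of entity pairs and a relation, we want the agent to find the most informative paths connecting these entity pairs. Starting from the source entity, the agent uses the policy network to pick the most promising relation and extends its path one step at a time until it reaches the target entity. To keep the output dimension of the policy network consistent, the action space is defined as all relations in the knowledge graph.

States: Entities and relations in a knowledge graph are naturally discrete atomic symbols. Practical knowledge graphs such as Freebase and NELL usually contain a huge number of triples, so it is impossible to model all atomic symbols directly as states. To capture the semantics of these symbols, we use translation-based embeddings such as TransE and TransH to represent entities and relations. These embeddings map all symbols into a low-dimensional vector space. In this framework, each state captures the agent's position in the knowledge graph. After taking an action, the agent moves from one entity to another. The two states are linked by the action (relation) the agent has just taken. The state vector at step t: ...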

The Translate Family of Knowledge Graph Embeddings (TransE, TransH, TransR, TransD)

Data: WN18, WN11, FB15K, FB13, FB40K. Task: Knowledge Graph Embedding.

TransE: Translating Embeddings for Modeling Multi-relational Data (2013) https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf

This is the first work in the translation model family. The basic idea is to make the sum of the head vector and the relation vector as close as possible to the tail vector. Closeness is measured with the L1 or L2 norm. The loss function is

$\mathrm{L}(h, r, t)=\max \left(0, d_{\text {pos }}-d_{\text {neg }}+\text { margin }\right)$

Minimizing the loss only requires that the gap between the two scores exceed the margin (a value we set, usually 1). However, this model can only handle one-to-one relations and is not suited to one-to-many or many-to-one relations. For example, take the two facts (skytree, location, tokyo) and (gundam, location, tokyo). After training, the "sky tree" entity vector will end up very close to the "gundam" entity vector, although in reality the two are not similar in that way.

    # Excerpt from a TensorFlow 1.x model class; margin, config and the
    # pos_*/neg_* placeholders are defined outside this excerpt.
    with tf.name_scope("embedding"):
        self.ent_embeddings = tf.get_variable(name = "ent_embedding", shape = [entity_total, size], initializer = tf.contrib.layers.xavier_initializer(uniform = False))
        self.rel_embeddings = tf.get_variable(name = "rel_embedding", shape = [relation_total, size], initializer = tf.contrib.layers.xavier_initializer(uniform = False))
        # Look up embeddings for the positive and the corrupted (negative) triples.
        pos_h_e = tf.nn.embedding_lookup(self.ent_embeddings, self.pos_h)
        pos_t_e = tf.nn.embedding_lookup(self.ent_embeddings, self.pos_t)
        pos_r_e = tf.nn.embedding_lookup(self.rel_embeddings, self.pos_r)
        neg_h_e = tf.nn.embedding_lookup(self.ent_embeddings, self.neg_h)
        neg_t_e = tf.nn.embedding_lookup(self.ent_embeddings, self.neg_t)
        neg_r_e = tf.nn.embedding_lookup(self.rel_embeddings, self.neg_r)

    if config.L1_flag:
        # L1 distance of h + r - t
        pos = tf.reduce_sum(abs(pos_h_e + pos_r_e - pos_t_e), 1, keep_dims = True)
        neg = tf.reduce_sum(abs(neg_h_e + neg_r_e - neg_t_e), 1, keep_dims = True)
        self.predict = pos
    else:
        # Squared L2 distance of h + r - t
        pos = tf.reduce_sum((pos_h_e + pos_r_e - pos_t_e) ** 2, 1, keep_dims = True)
        neg = tf.reduce_sum((neg_h_e + neg_r_e - neg_t_e) ** 2, 1, keep_dims = True)
        self.predict = pos

    with tf.name_scope("output"):
        # Margin ranking loss summed over the batch.
        self.loss = tf.reduce_sum(tf.maximum(pos - neg + margin, 0))

TransH: Knowledge Graph Embedding by Translating on Hyperplanes (2014) ...
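The TransE margin loss and the one-to-one limitation described above can be worked through on toy vectors. The following is a minimal pure-Python sketch; the 2-dimensional embeddings, the corrupted tail `osaka` and the helper names are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def l1_distance(h, r, t):
    """d(h, r, t) = ||h + r - t||_1; a smaller value means a more plausible triple."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def margin_loss(d_pos, d_neg, margin=1.0):
    """max(0, d_pos - d_neg + margin), as in the loss formula above."""
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical embeddings (assumptions, not trained values).
location = [0.0, 1.0]
tokyo = [1.0, 1.0]
skytree = [1.0, 0.0]
# A perfect fit of (gundam, location, tokyo) forces gundam = tokyo - location,
# which is exactly the point already occupied by skytree.
gundam = [t - r for t, r in zip(tokyo, location)]
osaka = [1.5, 1.0]  # hypothetical corrupted tail used as the negative sample

d_pos = l1_distance(skytree, location, tokyo)
d_neg = l1_distance(skytree, location, osaka)
loss = margin_loss(d_pos, d_neg)
print(d_pos, d_neg, loss)
print(gundam == skytree)
```

The negative triple here is only 0.5 away from the positive one, which is less than the margin of 1, so the loss stays positive until training pushes the two further apart.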

Survey: A Survey on Knowledge Graphs - Representation, Acquisition and Applications

Survey: https://arxiv.org/abs/2002.00388v4

A knowledge graph is a structured representation of facts, consisting of entities, relationships and semantic descriptions. Entities can be real-world objects and abstract concepts. Relationships represent the relation between entities. Semantic descriptions of entities and their relationships contain types and properties with a well-defined meaning.

Notation:
G: a knowledge graph
F: a set of facts
(h, r, t): a triple of head, relation and tail
$(\mathbf{h}, \mathbf{r}, \mathbf{t})$: embedding of head, relation and tail ...

Open-Domain Targeted Sentiment Analysis via Span-Based Extraction and Classification

2019, ACL. Data: SemEval 2014, SemEval 2014 ABSA, SemEval 2015, SemEval 2016. Task: ABSA.

Proposes a span-based extract-then-classify framework, where multiple opinion targets are directly extracted from the sentence under the supervision of target span boundaries, and corresponding polarities are then classified using their span representations.

Advantages: selecting targets with a pointer network avoids the overly large search space of sequence tagging; annotating span boundaries plus polarity handles targets that carry multiple polarities.

Method: Input: sentence x = (x1, ..., xn) with length n. Target list T = {t1, ..., tm}: each target ti is annotated with its start position, its end position, and its sentiment polarity ...

A Lite BERT (ALBERT): Principles and Source Code Walkthrough

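A Lite BERT

BERT (Devlin et al., 2019) has a great many parameters: the model is large, memory consumption is high, and communication overhead in distributed training is heavy. Yet the marginal return on BERT's high memory consumption is low: if the hidden size of a large model such as BERT-large is increased further, performance drops instead of improving. To address these problems, and inspired by MobileNet, ALBERT uses two parameter-reduction techniques to shrink the model and speed up training: factorized embedding parameterization and cross-layer parameter sharing. With these designs, the marginal return of adding parameters is far greater for ALBERT than for BERT. In addition, for sentence-relation modeling it drops BERT's NSP task in favor of the SOP task. Overall, ALBERT brings together the ideas of many current BERT-family models; its approach is worth studying and its code is clearly written. Let us go through it carefully.

Factorized embedding parameterization: In BERT and its successors such as XLNet (Yang et al., 2019) and RoBERTa (Liu et al., 2019), the WordPiece embedding size E is tied to the hidden size H. WordPiece embeddings are meant to learn context-independent representations, whereas the hidden layers are meant to learn context-dependent representations. Untying E from H makes more efficient use of the total parameters needed for modeling. From a practical point of view, it also reduces the effect of vocabulary size on model size. Vocabularies in NLP are usually very large, so the gain from untying is substantial. Concretely, the embedding is factorized: the very large word embedding matrix is decomposed into two smaller matrices, so O(V × H) becomes O(V × E + E × H), which significantly reduces the number of word embedding parameters. A similar idea appears in the latent variable models discussed in the topic models post.

Cross-layer parameter sharing: All parameters are shared across the transformer blocks, so the parameter count no longer grows as the model gets deeper. ...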
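The saving from going from O(V × H) to O(V × E + E × H) is easy to quantify. The sketch below just evaluates the two expressions; the sizes V = 30000, H = 768 and E = 128 are assumed for illustration and are not taken from the post, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count word-embedding parameters.

    With embedding_size=None the embedding is tied to the hidden size (V * H).
    Otherwise it is factorized into a V x E lookup table and an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


# Assumed sizes, for illustration only.
V, H, E = 30000, 768, 128

tied = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"tied: {tied:,}  factorized: {factorized:,}  ratio: {tied / factorized:.2f}")
```

The factorization only pays off when E is much smaller than H; as E approaches H the extra E × H projection makes the factorized variant larger than the tied one.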

Entity Linking

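Entity Linking

Knowledge Graph: a semantic network that aims to describe the concepts and entities of the objective world and the relations between them; it is sometimes also called a Knowledge Base (KB). A graph is made up of triples, either <entity 1, relation, entity 2> or <entity, attribute, attribute value>, for example <Yao Ming, plays-in, NBA> and <Yao Ming, height, 2.29m>. Common KBs include Wikidata, DBpedia and YAGO.
Entity: the basic unit of a knowledge graph, and an important linguistic unit that carries information in text.
Mention: a span of natural-language text that refers to an entity.

Applications:
Question Answering: EL is a hard requirement for KBQA; the graph database can only be queried once the mention has been linked to an entity.
Content Analysis: public opinion analysis, content recommendation, reading enhancement.
Information Retrieval: search engines based on semantic entities; when you search for certain entities on Google, a Wikipedia panel appears on the right.
Knowledge Base Population: expanding the knowledge base and updating entities and relations.

Candidate entities and disambiguation: An entity linking system consists of two components.
Candidate Entity Generation: starting from a mention, find all possible entities in the KB to form the candidate set (candidate entities).
Entity Disambiguation: from the candidate entities, select the most likely entity as the prediction.
Entity Disambiguation (ED) is the most important part.

Features:
Context-Independent Features: LinkCount, written #(m->e), the number of times a mention m points to entity e in the knowledge base; Entity Attributes such as Popularity and Type.
Context-Dependent Features: Textual Context (BOW, Concept Vector); Coherence Between Entities (WLM, PMI, Jaccard Distance).

Context-Independent Features: the LinkCount from mention to entity, and attributes of the entity itself (such as popularity and type). As prior knowledge, LinkCount is often very useful for disambiguation.

Context-Dependent Features: disambiguating all entities globally is in fact an NP-hard problem, so the core question is how to exploit coherence features more quickly and effectively ...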
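The LinkCount prior described above can be sketched in a few lines: candidate generation is a dictionary lookup from mention to counted entities, and a prior-only disambiguator picks the entity with the highest #(m->e). The counts, entity names and function names below are hypothetical; a real system would combine this prior with the context-dependent features.

```python
from collections import Counter


def link_prior(link_counts, mention):
    """Estimate P(e | m) from #(m -> e) counts; returns {} for an unseen mention."""
    counts = link_counts.get(mention, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {entity: n / total for entity, n in counts.items()}


def disambiguate_by_prior(link_counts, mention):
    """Pick the candidate with the highest prior, ignoring context entirely."""
    prior = link_prior(link_counts, mention)
    return max(prior, key=prior.get) if prior else None


# Hypothetical anchor-text statistics: mention -> how often it links to each entity.
link_counts = {
    "Jordan": Counter({
        "Michael_Jordan": 60,
        "Jordan_(country)": 35,
        "Michael_I._Jordan": 5,
    }),
}

prior = link_prior(link_counts, "Jordan")
best = disambiguate_by_prior(link_counts, "Jordan")
print(prior)
print(best)
```

A prior-only choice always returns the most popular sense, which is why the context-dependent features are still needed for mentions whose intended entity is a minority sense.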

Knowledge Graph Completion

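Knowledge graph completion

Methods based on knowledge representation: Knowledge representation learning learns low-dimensional embeddings for the entities and relations in a knowledge graph. The common methods are centered on TransE and its variants, with improvements aimed at scenarios such as space projection. Missing triples are predicted from the entity and relation representations; using entity descriptions makes it possible to handle open-domain entity completion.

Methods based on path finding: Path-finding methods can be used for multi-step reasoning problems. The traditional approach is PRA (Path Ranking Algorithm), but on larger knowledge graphs the number of paths explodes and the feature space balloons. One remedy is to represent relations as embeddings, generalize over relations, and model completion on that basis, which eases the feature-space blow-up caused by too many paths. The procedure: given a set of entity pairs, use PRA to find a certain number of paths; add entity type information during path computation (to reduce the effect of long-tail entities); encode each path as a vector with an RNN run along the path; share the RNN parameters across relations; complete relations by comparing how closely the path vector matches the vector of the relation to be predicted.

Methods based on reinforcement learning: The two approaches above still have several problems. Paths have to be found by random walk; random walk operates in a discrete space, which makes it hard to assess similar entities and relations in the knowledge graph; and supernodes can slow the random walk down. Reinforcement learning methods search for paths in a continuous space, and introducing several reward functions makes path finding more flexible and controllable.

DeepPath: DeepPath: A Reinforcement Learning Method for Knowledge Graph Reasoning (xwhan/DeepPath). Task: find the relation between Band of Brothers and English. Path start: Band of Brothers. State: the entity embeddings. Actions: the relations in the graph. Rewards: binary (whether the end point is reached), path length, path diversity. Policy network: a fully connected network. DeepPath still has some shortcomings: the incompleteness of the knowledge graph itself may well affect path finding. ...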

Deep Q Networks

Combining reinforcement learning and deep neural networks at scale. The algorithm was developed by enhancing a classic RL algorithm called Q-Learning with deep neural networks and a technique called experience replay.

Q-Learning: Q-Learning is based on the notion of a Q-function. The Q-function (a.k.a. the state-action value function) of a policy $\pi$, $Q^{\pi}(s, a)$, measures the expected return, or discounted sum of rewards, obtained from state $s$ by taking action $a$ first and following policy $\pi$ thereafter. ...
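The "discounted sum of rewards" in the definition above can be made concrete for a single trajectory. The sketch below is a minimal illustration with a hypothetical helper name and made-up rewards; $Q^{\pi}(s, a)$ is the expectation of this quantity over trajectories that start with action $a$ in state $s$ and follow $\pi$ afterwards.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one trajectory."""
    g = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Made-up rewards observed after taking action a in state s and then following pi.
rewards = [1.0, 1.0, 1.0]
g = discounted_return(rewards, gamma=0.5)
print(g)  # 1 + 0.5 + 0.25 = 1.75
```

Averaging this value over many sampled trajectories from the same starting pair gives a Monte Carlo estimate of the Q-function at that pair.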

Adam Weight Decay in BERT

Adam Weight Decay in BERT

While reading the optimizer implementation in the BERT (Devlin et al., 2019) source code, I came across this comment:

    # Just adding the square of the weights to the loss function is *not*
    # the correct way of using L2 regularization/weight decay with Adam,
    # since that will interact with the m and v parameters in strange ways.
    #
    # Instead we want ot decay the weights in a manner that doesn't interact
    # with the m/v parameters. This is equivalent to adding the square
    # of the weights to the loss with plain (non-momentum) SGD.

It points out specifically that some conventional implementations of weight decay with Adam are wrong. ...
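The difference the comment describes can be seen on a single scalar weight. The sketch below is a simplified illustration, not BERT's optimizer code: it assumes one Adam step without bias correction, a zero loss gradient, and made-up hyperparameters, and `adam_step` is a hypothetical helper. In the L2 variant the decay term enters through the gradient and is therefore rescaled by m and v; in the decoupled variant it is added after the Adam normalization.

```python
import math


def adam_step(param, grad, m, v, lr=0.01, beta1=0.9, beta2=0.999,
              eps=1e-8, decoupled_decay=0.0):
    """One simplified Adam step (no bias correction) for a single scalar weight."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    update = m / (math.sqrt(v) + eps)
    # Decoupled weight decay: applied after the m/v normalization.
    update += decoupled_decay * param
    return param - lr * update, m, v


weight, loss_grad, decay = 1.0, 0.0, 0.1  # assumed values for illustration

# Variant 1: L2 penalty in the loss, so its gradient decay * weight passes through m and v.
l2_weight, _, _ = adam_step(weight, loss_grad + decay * weight, 0.0, 0.0)

# Variant 2: decoupled decay, so m and v only ever see the loss gradient.
decoupled_weight, _, _ = adam_step(weight, loss_grad, 0.0, 0.0, decoupled_decay=decay)

print(l2_weight, decoupled_weight)
```

With a zero loss gradient the decoupled step shrinks the weight by exactly lr × decay × weight, whereas the size of the L2 step is set by the ratio m / sqrt(v) and is much larger here than the nominal decay rate would suggest.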