Value-based Reinforcement Learning
Sequential decision making. Take the classic Atari games as an example: at time t the agent observes a video clip of M frames, $s_t = (x_{t-M+1}, ..., x_t) \in S$, and then makes a decision by choosing an action $a_t \in A = \{ 1, ..., |A| \}$ (where A is the discrete set of available actions); this action earns the agent a reward $r_t$. This is the sequential decision process, a general framework that can model many sequential decision problems, such as games and robotics. The agent observes the environment and responds with an action according to its policy $\pi\left(a_{t} \mid s_{t}\right)$, where $s_{t}$ is the current observation of the environment (the observation is the part of the environment state visible to the agent). The action yields a new reward $r_{t+1}$ and new environment feedback $s_{t+1}$.

Note: it is important to distinguish between the state of the environment and the observation, which is the part of the environment state that the agent can see. For example, in a poker game the environment state consists of the cards belonging to all the players plus the community cards, but the agent can observe only its own cards and a few of the community cards. In most of the literature these terms are used interchangeably, and the observation is also denoted as $s_t$.
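The interaction loop described above (observe a stack of M frames, choose an action, receive a reward, see the next frame) can be sketched as follows. This is a minimal toy sketch: the environment, the reward rule, and the uniform-random policy are placeholders of my own, not any real Atari setup.

```python
import random
from collections import deque

M = 4            # number of stacked frames per observation (assumed)
NUM_ACTIONS = 6  # |A|: size of the discrete action space (assumed)

def dummy_env_step(action):
    """Toy environment: returns the next frame x_{t+1} and reward r_t.
    A single float stands in for one Atari screen image."""
    next_frame = random.random()          # placeholder for an image frame
    reward = 1.0 if action == 0 else 0.0  # arbitrary toy reward rule
    return next_frame, reward

def random_policy(state):
    """pi(a_t | s_t): here simply uniform over the action set."""
    return random.randrange(NUM_ACTIONS)

# Frame stack: s_t = (x_{t-M+1}, ..., x_t)
frames = deque([0.0] * M, maxlen=M)       # pad with blank frames at t = 0
rewards = []

for t in range(10):
    s_t = tuple(frames)                   # observation = last M frames
    a_t = random_policy(s_t)              # agent chooses an action
    x_next, r_t = dummy_env_step(a_t)     # environment responds
    frames.append(x_next)                 # oldest frame drops out of s_t
    rewards.append(r_t)

print(len(rewards))
```

Because `maxlen=M` is set on the deque, appending the newest frame automatically discards the oldest one, which is exactly the sliding-window observation $s_t$ used in the Atari setting.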