Cong's Log

Inf Course Note - Accelerated Natural Language Processing

爱丁堡大学信息学院课程笔记 Accelerated Natural Language Processing, Informatics, University of Edinburgh References: Accelerated natural language processing ANLP revision guide Lecture Slides from the Stanford Coursera course Natural Language Processing, by Dan Jurafsky and Christopher Manning 概率模型 Probability Model 概率模型是随机现象的数学表示，由样本空间，样本空间内的事件以及与每个事件相关的概率定义。目标是模拟给一个事件发生的概率估算概率（Probability Estimation）一般使用最大似然估计（MLE，相关频率）： $$p(x_i) = \frac{Count(x_i)}{\sum_{i=0}^nCount(x_i)}$$平滑Smoothing 一般用于处理0概率的问题，比如在训练集中看不到, 但出现在测试集中的词。 Language modeling To compute the probability of sentence /sequence of words $P(w_1, w_2, w_3...)$, or to predict upcomming words $P(w|w_1, w_2, w_3...)$… a language model is also a probability model. ...

Inf Course Note - Natural Language Understanding

爱丁堡大学信息学院课程笔记 Natural Language Understanding, Informatics, University of Edinburgh References: Natural language understanding CS224n: Natural Language Processing with Deep Learning Lecture Slides from the Stanford Coursera course Natural Language Processing, by Dan Jurafsky and Christopher Manning Meaning representations 意思的表达有很多方法。一种有效的表示单词的含义的方法是 distributional semantic. Semantics (from Ancient Greek: σημαντικός sēmantikos, “significant”) is the linguistic and philosophical study of meaning, in language, programming languages, formal logics, and semiotics. 语义学 Semantics 在语言学中的研究目的在于找出语义表达的规律性、内在解释、不同语言在语义表达方面的个性以及共性；与计算机科学相关的语义学研究在于机器对自然语言的理解。 Tradition solution of usable meaning in a computer: Use e.g. WordNet, a resource containing lists of synonym sets and hypernyms. ...

Inf Course Note - Parallel Programming Language and Systems

爱丁堡大学信息学院课程笔记 Parallel Programming Language and Systems, Informatics, University of Edinburgh Reference: http://www.inf.ed.ac.uk/teaching/courses/ppls/ CMU 15213: Introduction to Computer Systems (ICS) Computer Systems: A Programmer’s Perspective A Comprehensive MPI Tutorial Resource A chapter on MPI from Ian Foster’s online Book Designing and Building Parallel Programs Introduction to parallel computer architecture Covering some of the nasty issues presented by the shared memory model, including weak consistency models and false sharing in the cache, and some architectural issues for the multicomputer model. ...

Inf Course Note - Software Architecture, Process, and Management

爱丁堡大学信息学院课程笔记 Software Architecture, Process, and Management, Informatics, University of Edinburgh Reference: microsoft IBM Software Architecture in Practice (3rd edition), Bass, Clements, and Kazman What is Software Architecture? Software architecture is often described as the organization or structure of a system, where the system represents a collection of components that accomplish a specific function or set of functions. grouping components into areas of concern (layers): For example, the UI, business processing, and data access. focus on interaction between the components and how different components work together. 在书中的定义： ...

Inf Course Note - Software Testing

爱丁堡大学信息学院课程笔记 Software Testing, Informatics, University of Edinburgh Reference: http://www.inf.ed.ac.uk/teaching/courses/st/2017-18/index.html Pezze and Young, Software Testing and Analysis: Process, Principles and Techniques, Wiley, 2007. Why Software Testing? 1, 软件的漏洞, 错误和失效 Software Faults, Errors & Failures The problem start with Faults, Fault(BUG): latent error, mistakes in programming. e.g add(x, y) = x * y. With the Faults in programs, if and only if executing add(x, y) = x * y, the fault being activated, and generate an Errors. ...

深入理解word2vec

Word2vec Mikolov et al. How to represent meanings? 如何在数学上表达词义？ Vector space models (VSMs) 表示把单词映射到(嵌入)连续的矢量空间, 而且理论上语义相似的单词会映射到空间中临近的位置。VSMs是一个历史悠久的NLP理论，但所有实现方法都不同程度依赖于Distributional Hypothesis, 即出现在相同（相似）的上下文中的单词具有相同（相似）的语义意义。利用此原则的方法大致可以分为两类: Count-based methods (例如, Latent Semantic Analysis))和Predictive models(例如 neural net language models (NNLM))。具体的区别详见Baroni et al.. 但总的来说，Count-based methods 统计词汇间的共现频率，然后把co-occurs matrix 映射到向量空间中；而Predictive models直接通过上下文预测单词的方式来学习向量空间（也就是模型参数空间）。 Word2vec 是一种计算特别高效的predictive model, 用于从文本中学习word embeddings。它有两种方案, Continuous Bag-of-Words model (CBOW) 和 Skip-Gram model (Section 3.1 and 3.2 in Mikolov et al.). 从算法上讲, 两种方案是相似的, 只不过 CBOW 会从source context-words ('the cat sits on the')预测目标单词(例如"mat"); 而skip-gram则相反, 预测目标单词的source context-words。Skip-gram这种做法可能看起来有点随意. 但从统计上看, CBOW 会平滑大量分布信息(通过将整个上下文视为一个观测值), 在大多数情况下, 这对较小的数据集是很有用的。但是, Skip-gram将每个context-target pair视为新的观测值, 当数据集较大时, 这往往带来更好的效果。 ...

循环神经网络

循环神经网络当人类阅读时，会根据对之前单词的理解和记忆来辅助理解当前看到的每个单词。也就是人能够很好地处理语言的长距离依赖特性（long-term dependency）。在自然语言处理任务中，很多传统的模型无法做到这一点，比如前馈神经网络；而传统的n-gram模型固然可以通过把把n系数增大来捕捉长距离依赖，但带来的非常巨大的内存消耗。循环神经网络（Recurrent Neural Networks, RNNs)可以看做是多个共享参数的前馈神经网络不断叠加的结果 ![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png “A recurrent neural network and the unfolding in time of the computation involved in its forward computation. “image from: http://colah.github.io”) 这里的核心是想办法解码历史信息, 即通过递归方程$s_i = R(x_i, s_{i−1})$让$s_i$解码序列$x_{1:n}$. 比如把所有历史信息累加就是一种非常简单粗暴的方式, 这样得到的是连续词袋模型(continuous-bag-of-words model)$s_i = R_{CBOW}(x_i, s_{i-1}) = x_i + s_{i−1}$, 虽然简单，但这种RNN其实忽略了数据的时序性质。一般意义上的RNN是指Elman Network or Simple-RNN (S-RNN)(Elman [1990]), $s_i = R_{SRNN}(x_i, s_{i-1}) = g(x_iW^x + s_{i−1}W^s + b)$, 也就是把历史信息先进行线性变换(乘以矩阵), 再和bias加起来, 再通过一个非线性激活函数(tanh或ReLU). 添加了线性变换再进行非线性激活, 使网络对输入的顺序变得敏感。在使用时, 给定输入序列（单词序列或语音）得出输出序列的过程如下： ...

Python Digest

What you will get from this Python digest: 1, Learn advanced python programming. 2, Learn new concepts, patterns, and methods that will expand your programming abilities, helping move you from a novice to an expert programmer. 3, Practice going from a problem description to a solution, using a series of assignments. Operator Emulating numeric types In-place operation: One modifies the data-structure itself object.__iadd__(self, other) object.__isub__(self, other) object.__imul__(self, other) object.__imatmul__(self, other) object.__itruediv__(self, other) object.__ifloordiv__(self, other) object.__imod__(self, other) object.__ipow__(self, other[, modulo]) object.__ilshift__(self, other) object.__irshift__(self, other) object.__iand__(self, other) object.__ixor__(self, other)¶ object.__ior__(self, other) These methods are called to implement the augmented arithmetic assignments. These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self). If x is an instance of a class with an __iadd__() method, x += y is equivalent to x = operator.iadd(x, y) ...

Computer Systems - A Programmer's Perspective (CSAPP) - CMU 15213

CSAPP 非常巧妙的把程序设计及优化、数字电路基础、指令集体系、汇编语言、存储器体系结构、链接与装载、进程、虚存等来自不同学科的核心知识点和在一起，并以程序员的视角呈现; 告诉我们作为一个程序员，究竟需要对计算机的硬件了解到什么程度？本笔记是 CMU CSAPP 的学习笔记, 使用 CMU 15-213, UW CSE351 的课程视频, lab, 作业, project 辅助练习. Computer Systems: A Programmer’s Perspective (csapp), 豆瓣-深入理解计算机系统卡内基梅隆大学 CMU 15-213 Introduction to Computer Systems (ICS) 华盛顿大学 UW CSE351: The Hardware/Software Interface 信息的表达与操作 Information is Bits + Context. Study systems by tracing the lifetime of the hello program, from the time it is created by a programmer, until it runs on a system, prints its simple message, and terminates. ...

神经网络用于文本分类

文本分类文本分类是很多业务问题中广泛使用到的NLP/监督机器学习（ML）。文本分类的目标是自动将文本/文档分类为一个或多个预定义类别。目前的成熟思路是用词向量解码文本，然后使用传统机器学习模型或者深度神经网络模型来做分类。文本分类是学术界和工业界非常活跃的研究领域。本文主要介绍用于文本分类的几种神经网络模型方法，并比较它们的性能，代码实现主要基于Keras。文中代码都在这个DeepTextGitHub项目中. 文本分类的一些示例包括：从社交媒体中了解受众情绪（😁 😐 😥）检测垃圾邮件和非垃圾邮件自动标记客户查询将新闻文章📰分类为预定义主题端到端文本分类流水线端到端文本分类流水线由以下组件组成：训练文本：输入文本，有监督模型能够通过已标注数据来学习和预测所需的类。特征向量：特征向量是用于解码输入数据特征的信息的向量。标签：预定义的类别/类，作为模型预测的目标。算法模型：能够处理文本分类的算法（在我们的例子中：CNN，RNN，HAN, Fasttext）预测：已经在历史数据集上训练过的模型，可以用于执行标签预测。这里使用汽车消费者的评测数据集，在tsv文件中, 第一列是序号对我们没用, 第二列是label(0, 1)，分别代表（消极，积极）评价，第三列是文本. 1 操控性舒服、油耗低，性价比高 0 动力的确有点点让我相信了up的确是个代步车而已! 1 1。车的外观很喜欢。2。省油，现在磨合期7.3，相信以后还会下降。 1 内饰的做工和用料同级别同价位最厚道的 0 减震系统太硬！数据处理使用的类，具体见代码链接 class DataProcessor(object): """ Base class for data converters for sequence classification data sets. helper funcitons [read_tsv, read_text, read_json] """ ... class SampleProcessor(DataProcessor): """ Sample processor for the classification data set. Tranform the text to tensor for training if use pre-train model, need vocabulary file usage: process data files >>> processer = SampleProcessor(config, ) provide your own data in list format [train_X, train_Y, test_X, test_Y] >>> processer = SampleProcessor(config, data) """ 词向量使用包含外部知识的embedding表达字词是目前的主流方法，经典的如word2vec，GLoVe，较新进的 ELMo，BERT，等预训练向量，集成了关于单词的新信息（词汇和语义），这些信息已经在非常大的数据集上进行了训练和提炼。 ...