John Schulman and Yoav Goldberg on Behavior Cloning (BC), RL, and Truthfulness

John Schulman recently shared his views on BC, RLHF, and truthfulness at Berkeley[1]; Yoav Goldberg summarized and extended those views[2], and Prof. Yang Yu of Nanjing University also weighed in on the BC-vs-RL comparison[3]. There are three core points. (1) Behavior cloning (BC; learning from demonstrations, i.e. SFT) is the most effective method, and RLHF itself uses BC heavily, both for the cold start and for reward-model training. Yet although BC is more effective and easier to get working than RL, it has inherent problems it cannot solve: the better BC generalizes, the more the LLM hallucinates and lies. We want to encourage the LLM to answer from its internal knowledge, but we do not know what that internal knowledge contains, so RLHF is needed to teach the LLM which questions lie beyond its knowledge (to make the model know what it does not know). (2) RL additionally allows negative feedback, and negative feedback is much more powerful; ranking-based reward learning is imperfect, but far easier to run in practice. (3) Future direction: when an LLM knows that it does not know, today it mostly refuses honestly with "I don't know"; OpenAI's direction is to have the LLM search external knowledge and generate more trustworthy answers with cited sources, i.e. to evolve from honesty to truthfulness. See the detailed ChatGPT Browsing notes below, by John Schulman: Why is there hallucination? Is "whether a model knows something" a meaningful question? RL is the correct way. Long-form QA (LFQA) is much more difficult than short QA...
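As a concrete illustration of the "ranking-based reward learning" mentioned above, here is a minimal sketch, assuming a PyTorch setup, of the pairwise loss typically used to train a reward model from human preference rankings (all names are illustrative, not from the talk):

```python
import torch
import torch.nn.functional as F

def ranking_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a reward model.

    r_chosen / r_rejected: scalar rewards the model assigns to the
    human-preferred and dispreferred responses for the same prompt.
    Minimizing -log(sigmoid(r_chosen - r_rejected)) pushes the model
    to rank the preferred response higher.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: rewards for a batch of 3 preference pairs.
chosen = torch.tensor([1.2, 0.3, -0.5])
rejected = torch.tensor([0.7, 0.9, -1.0])
print(ranking_reward_loss(chosen, rejected))  # a single scalar loss
```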

2023-04-30 · 2 min · Cong Chan

Paper Reading - Complexity-Based Prompting for Multi-Step Reasoning

Tags: 2023, ICLR Links: https://github.com/FranxYao/chain-of-thought-hub Paper: Fu, Yao, et al. Complexity-Based Prompting for Multi-Step Reasoning. arXiv:2210.00720, arXiv, 30 Jan. 2023. arXiv.org, http://arxiv.org/abs/2210.00720. Motivation Example selection is a central problem in the prompting literature. For CoT prompting, example selection is further tied to annotation efficiency, since CoT requires manually annotated reasoning chains. Which reasoning examples make the most effective prompts? The paper proposes complexity-based prompting, a simple and effective example-selection scheme for multi-step reasoning....
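A minimal sketch of the core selection idea, under the assumption that the number of reasoning steps can be approximated by counting non-empty lines in an annotated chain (function names are hypothetical):

```python
def count_steps(chain_of_thought: str) -> int:
    """Proxy for reasoning complexity: number of non-empty reasoning lines."""
    return sum(1 for line in chain_of_thought.splitlines() if line.strip())

def select_complex_examples(annotated, k=8):
    """Pick the k examples whose reasoning chains have the most steps.

    `annotated` is a list of (question, chain_of_thought, answer) triples.
    """
    return sorted(annotated, key=lambda ex: count_steps(ex[1]), reverse=True)[:k]
```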

2023-04-19 · 2 min · Cong

CoT on BBH - Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

CoT on BBH: M. Suzgun et al., 'Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them'. arXiv, Oct. 17, 2022. Available: http://arxiv.org/abs/2210.09261 Method Apply chain-of-thought (CoT) prompting to BIG-Bench Hard tasks; evaluate few-shot performance via standard "answer-only" prompting and chain-of-thought prompting on BIG-Bench Hard. Benchmark Results/Analysis/Findings Benchmark: BIG-Bench Hard (BBH), the tasks for which prior language-model evaluations did not outperform the average human rater; many tasks in BBH require multi-step reasoning...
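For reference, a sketch of the two prompt formats being compared; the exact wording of the shots is an assumption, not the paper's verbatim template:

```python
def answer_only_prompt(examples, question):
    """Standard few-shot prompt: Q/A pairs with no intermediate reasoning."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, _, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

def cot_prompt(examples, question):
    """Chain-of-thought prompt: each shot includes its reasoning chain."""
    shots = "\n\n".join(
        f"Q: {q}\nA: Let's think step by step. {cot} So the answer is {a}."
        for q, cot, a in examples
    )
    return f"{shots}\n\nQ: {question}\nA: Let's think step by step."
```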

2022-11-13 · 4 min · Cong Chan

Efficient Training of Language Models to Fill in the Middle

Bavarian, Mohammad, et al. Efficient Training of Language Models to Fill in the Middle. arXiv:2207.14255, arXiv, 28 July 2022. arXiv.org, http://arxiv.org/abs/2207.14255. data: https://www.github.com/openai/human-eval-infilling TL;DR Autoregressive language models can effectively learn to infill text by moving a span of text from the middle of a document to its end, without harming the original generative capability. Training models with this technique, called fill-in-the-middle (FIM), is useful, simple, and efficient, and it should be used by default in future autoregressive language models....
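A minimal sketch of the middle-to-end move described above; the sentinel strings are illustrative placeholders rather than the paper's actual special tokens:

```python
import random

# Placeholder sentinels; the paper uses dedicated vocabulary tokens.
PRE, MID, SUF = "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>"

def fim_transform(document: str, rng: random.Random) -> str:
    """Move a random middle span to the end of the document.

    The model is then trained with the ordinary left-to-right objective
    on the transformed text, so it learns to infill the middle given
    prefix + suffix. Assumes a non-empty document.
    """
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return PRE + prefix + SUF + suffix + MID + middle

print(fim_transform("def add(a, b):\n    return a + b\n", random.Random(0)))
```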

2022-11-11 · 8 min · Cong Chan

The Curious Case of Neural Text Degeneration

Holtzman, Ari, et al. The Curious Case of Neural Text Degeneration. arXiv:1904.09751, arXiv, 14 Feb. 2020. arXiv.org, http://arxiv.org/abs/1904.09751. Introduction What the best decoding strategy is for generating text from a language model (e.g., story generation) remains an open question. The counter-intuitive empirical observation is that even though likelihood as a training objective yields high-quality models for a broad range of language-understanding tasks, maximization-based decoding (e.g., beam search) leads to degeneration: output text that is bland, incoherent, or stuck in repetition loops. Decoding strategies for text generation fall into two broad families: Argmax decoding: beam search, class-factored softmax, etc. Stochastic decoding: temperature sampling, top-k sampling, etc. To address degeneration, the paper proposes Nucleus Sampling (top-p sampling), a simple but effective method that draws higher-quality text from neural language models than previous decoding strategies. The key idea is to use the shape of the probability distribution to determine the set of tokens to be sampled from. Method Avoid degeneration by truncating the unreliable tail of the probability distribution and sampling from the dynamic nucleus of tokens that holds the vast majority of the probability mass. Results/Analysis/Findings To properly examine current maximization-based and stochastic decoding methods, generations from each method are compared against human text along several axes, such as likelihood, diversity, and repetition. The results show that (1) maximization is not a suitable decoding objective for open-ended text generation, (2) the probability distributions of the best current language models have an unreliable long tail that needs to be truncated during generation, and (3) Nucleus Sampling is currently the best decoding strategy for generating long-form text that is high quality, as measured by human evaluation, and as diverse as human-written text. Further reading: https://zhuanlan.zhihu.com/p/68383015
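A minimal NumPy sketch of the method as described above: truncate the distribution to the smallest set of tokens whose cumulative mass reaches p, renormalize, and sample:

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Top-p (nucleus) sampling over a vector of token logits."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    cdf = np.cumsum(probs[order])
    cutoff = np.searchsorted(cdf, p) + 1   # smallest nucleus covering mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy usage: a 5-token vocabulary with one unreliable low-probability tail.
print(nucleus_sample(np.array([3.0, 2.5, 1.0, -2.0, -3.0]), p=0.9))
```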

2021-12-23 · 1 min · Cong Chan

Codex - Evaluating Large Language Models Trained on Code

Codex: M. Chen et al., 'Evaluating Large Language Models Trained on Code'. arXiv, Jul. 14, 2021. Available: http://arxiv.org/abs/2107.03374 Intro Codex, a GPT language model fine-tuned on publicly available code from GitHub. Task: docstring-conditional code generation. Method Codex: fine-tune GPT-3 models containing up to 12B parameters on code to produce Codex. Codex-S: further fine-tune Codex on standalone, correctly implemented functions. Inference: assemble each HumanEval problem into a prompt consisting of a header, a signature, and a docstring....
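HumanEval results in the paper are reported as pass@k; a short sketch of the unbiased, numerically stable estimator the paper gives (n samples per problem, c of which pass the unit tests):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), in product form
    to avoid overflow in the binomial coefficients."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Toy usage: 10 samples, 3 pass -> pass@5 ~ 0.917.
print(pass_at_k(n=10, c=3, k=5))
```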

2021-12-20 · 2 min · Cong Chan

Scaling Laws for Neural Language Models

Kaplan, Jared, et al. 'Scaling Laws for Neural Language Models'. arXiv:2001.08361 [Cs, Stat], Jan. 2020. arXiv.org, http://arxiv.org/abs/2001.08361. TL;DR Key findings for Transformer language models are as follows: Performance depends strongly on scale, weakly on model shape: model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training....
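For reference, the paper's headline power laws, with N, D, and the compute budget as defined above (exponent values quoted approximately from the paper):

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
```

with roughly $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C^{\min} \approx 0.050$, each law holding when the other factors are not bottlenecks.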

2021-12-19 · 3 min · Cong Chan

Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

Links: https://arxiv.org/abs/2101.03961 "SWITCH TRANSFORMERS: SCALING TO TRILLION PARAMETER MODELS WITH SIMPLE AND EFFICIENT SPARSITY" proposes a network that scales to a trillion parameters, with two major innovations: building on the Transformer MoE architecture, it simplifies MoE's routing mechanism to cut computation, and it further reduces training communication and improves training throughput by combining data, model, and expert parallelism. Model Simplifying Sparse Routing Mixture-of-experts routing takes a token representation x as input and routes it to the best-determined top-k experts. Switch Routing: route to only a single expert; this simplification preserves model quality, reduces routing computation, and performs better. Sparse routing uses the parameters W_r to compute a softmax distribution over the N experts and, for each input token, selects the top-k experts with the highest probability, corresponding to the gating mechanism in MoE. Compute demand therefore does not grow sharply with parameter count, making the model much easier to train. EFFICIENT SPARSE ROUTING Parallel Switch implementation: tensor shapes are statically determined at compilation time, while computation is dynamic due to the routing decisions at training and inference....
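A minimal sketch of the top-1 Switch routing decision, assuming PyTorch; w_router stands for the W_r mentioned above, and the load-balancing loss and capacity limits of the real implementation are omitted:

```python
import torch
import torch.nn.functional as F

def switch_route(x: torch.Tensor, w_router: torch.Tensor):
    """Switch routing: each token goes to exactly one expert (top-1).

    x: [tokens, d_model] token representations.
    w_router: [d_model, n_experts] router weights (W_r).
    Returns the chosen expert index per token and the gate value
    used to scale that expert's output.
    """
    logits = x @ w_router               # [tokens, n_experts]
    probs = F.softmax(logits, dim=-1)   # router distribution over experts
    gate, expert = probs.max(dim=-1)    # top-1 probability and expert index
    return expert, gate

# Toy usage: 4 tokens, d_model=8, 3 experts.
expert, gate = switch_route(torch.randn(4, 8), torch.randn(8, 3))
print(expert, gate)
```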

2021-07-10 · 4 min · Cong Chan

Mixture of Experts (MOE)

Mixture of Experts (MOE) MoE is one of the ensemble methods and follows a divide-and-conquer idea: decompose a complex modeling task into several relatively simple subtasks and train a specialized model for each, which involves subtask decomposition or clustering; a gating model is then needed to decide, based on the input, how to combine the experts' outputs. Mixture of experts aims at increasing the accuracy of a function approximation by replacing a single global model by a weighted sum of local models (experts). It is based on a partition of the problem domain into several subdomains via clustering algorithms followed by a local expert training on each subdomain. Local Models & Global Models Hinton's lecture slides describe two extreme ways a model can fit a distribution: Very local models: use many highly localized models, e....
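A minimal NumPy sketch of the weighted-sum-of-local-experts idea in the quoted definition; the expert and gate callables are placeholders:

```python
import numpy as np

def moe_predict(x, experts, gate):
    """Mixture of experts: gate-weighted sum of local expert outputs.

    experts: list of callables, each a local model f_i(x).
    gate: callable returning mixture weights g(x) that sum to 1.
    """
    weights = gate(x)                            # [n_experts]
    outputs = np.stack([f(x) for f in experts])  # [n_experts, ...]
    return np.tensordot(weights, outputs, axes=1)

# Toy usage: two experts combined with fixed gate weights.
experts = [lambda x: 2 * x, lambda x: x + 1.0]
gate = lambda x: np.array([0.3, 0.7])
print(moe_predict(np.array([1.0, 2.0]), experts, gate))  # [2.0, 3.3]
```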

2021-07-03 · 3 min · Cong Chan

Survey - Pre-Trained Models - Past, Present and Future

Links: https://arxiv.org/abs/2106.07139 A quick tour of the freshly released survey of Pre-Trained Models. First, the definitions of a few terms used in the survey: Transfer learning: a supervised technique for coping with machine learning's data-hungry problem. Self-Supervised Learning: also targets the data-hungry problem, especially fully unlabeled data, by learning from labels derived from the data itself (e.g., language modeling), so it shares much with unsupervised learning. Generally, "unsupervised" refers mainly to pattern-recognition problems such as clustering, community discovery, and anomaly detection, whereas self-supervised learning still lies within the supervised paradigm, focusing on problems such as classification and generation. Pre-trained models (PTMs): pre-training is a concrete training scheme that can employ either transfer learning or self-supervised learning. 2 Background Roadmap: pre-training splits into two broad classes. 2.1 Transfer Learning and Supervised Pre-Training, further subdivided into feature transfer and parameter transfer. 2.2 Self-Supervised Learning and Self-Supervised Pre-Training. Transfer learning subdivides into four subclasses: inductive transfer learning (Lawrence and Platt, 2004; Mihalkova et al., 2007; Evgeniou and Pontil, 2007), transductive transfer learning (Shimodaira, 2000; Zadrozny, 2004; Daume III and Marcu, 2006), self-taught learning (Raina et al....

2021-06-19 · 10 min · Cong Chan