John Schulman and Yoav Goldberg on Behavior Cloning (BC), RL, and Truthfulness

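Cong Chen University of Edinburgh John Schulman recently gave a talk at Berkeley sharing his views on BC, RLHF, and truthfulness[1]. Yoav Goldberg summarized and extended Schulman's points[2], and Professor Yang Yu of Nanjing University also shared his views on how BC and RL compare[3]. There are three core takeaways: Behavior Cloning (BC, learning from demonstrations, or SFT) is the most effective method. RLHF relies heavily on BC: both the cold start and reward model training use it. Although BC is more effective and easier to get working than RL, it has inherent limitations that it cannot overcome: The core problem is that the more BC training generalizes, the more the LLM hallucinates and lies. We want to encourage the LLM to answer from its internal knowledge, but we do not know what that internal knowledge contains, so RLHF is needed to teach the LLM which questions fall outside its knowledge (so that the model knows what it does not know). In addition, RL allows negative feedback, and negative feedback is much more powerful. Ranking-based reward learning is not good enough, but it is easier to put into practice. Future direction: when an LLM knows that it does not know, today it mostly declines honestly with "I don't know"; OpenAI's direction is to have the LLM search external knowledge and generate more credible answers with cited sources, that is, to evolve from honesty to truthfulness. See ChatGPT Browsing below. Detailed talk notes - by John Schulman Why there is hallucination Is "does a model know something" a meaningful question? RL is the correct way Long-form QA (LFQA) is much more difficult than short QA...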

2023-04-30 · 2 min · Cong Chan

Boosting Large Language Models Alignment - A Data-Driven Bootstrap Flywheel

Cong Chen University of Edinburgh InstructGPT[1], ChatGPT[2], and GPT-4[3] are cutting-edge Large Language Models (LLMs) that have astounded the world. With their ability to follow human instructions and align with human preferences, they can act as chatbots or helpful assistants. Although they have impressed people for some time, their development lifecycles have not yet been thoroughly described. In this post, I share my observations and thoughts based on my recent experience with large language model training and alignment....

2023-03-30 · 6 min · Cong Chan

A Better Practice to Define Reward Model with HuggingFace's transformers

Cong Chen University of Edinburgh There are various implementations of reward modeling in RLHF (reinforcement learning from human feedback), each with different pros and cons. Inspired by some open-source work on reward modeling, I would like to share one of the best practices for it. For those who are not familiar with reward modeling and RLHF, I recommend taking a look at the Hugging Face RLHF blog[1] or the OpenAI RLHF paper[2]....

2023-03-25 · 7 min · Cong Chan

Paper Reading - Let’s Verify Step by Step

TLDR In order to train more dependable models, there are two known options: outcome supervision, which gives feedback on the final result, and process supervision, which provides feedback on each intermediate reasoning step. This paper presents two findings: The use of process supervision yields significantly better results than outcome supervision when training models to solve problems from the challenging MATH dataset. The efficacy of process supervision is significantly improved by active learning....

2023-06-18 · 9 min · Cong

Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

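Links: https://arxiv.org/abs/2101.03961 "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" proposes a network that can scale to trillions of parameters. It has two major innovations: building on the Transformer MoE architecture, it simplifies the MoE routing mechanism and reduces computation; it further reduces training communication and improves training performance by combining data parallelism, model parallelism, and expert parallelism. Model Simplifying Sparse Routing Mixture of Expert Routing, which takes as an input a token representation x and then routes this to the best determined top-k experts. Switch Routing: route to only a single expert; this simplification preserves model quality, reduces routing computation and performs better. Sparse routing uses the parameters $W_r$ to compute a softmax distribution over the N experts and, for each input token, selects the top-k experts with the highest probability; this corresponds to the gating mechanism in MoE. As a result, the compute requirement does not grow substantially as the parameter count increases, which makes the model easier to train. Efficient Sparse Routing Parallel Switch implementation: tensor shapes are statically determined at compilation time, but computation is dynamic due to the routing decisions at training and inference....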
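The difference between top-k routing and Switch (top-1) routing described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the paper's implementation: the function names, the toy router matrix `W_r`, and the example token are all assumptions made up for the example.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(token, router_weights, k=1):
    """Return [(expert_index, gate_probability), ...] for the top-k experts.

    router_weights holds one weight vector per expert (the rows of W_r).
    k=1 corresponds to Switch routing; k>1 to classic top-k MoE routing.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Hypothetical router for 3 experts over a 2-dimensional token representation.
W_r = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
token = [2.0, 0.1]

switch_choice = route(token, W_r, k=1)  # a single expert per token
top2_choice = route(token, W_r, k=2)    # two experts per token
```

With `k=1` only one expert's feed-forward block runs per token, which is where the saving in routing computation comes from.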

2021-07-10 · 4 min · Cong Chan

Mixture of Experts (MOE)

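Mixture of Experts (MOE) MOE is an ensemble method that follows a divide-and-conquer approach: Decompose a complex modeling task into several relatively simple subtasks and train a dedicated model for each subtask; this involves subtask decomposition, or clustering. A gating model is needed to choose, based on the input data, how to combine the outputs of the expert models. Mixture of experts aims at increasing the accuracy of a function approximation by replacing a single global model by a weighted sum of local models (experts). It is based on a partition of the problem domain into several subdomains via clustering algorithms followed by a local expert training on each subdomain. Local Models & Global Models Hinton's lecture slides describe two extreme ways for a model to fit a distribution: Very local models: use many highly localized models, e....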
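As a rough illustration of "a weighted sum of local models" combined by a gating model, the sketch below mixes two hand-written local experts to approximate $|x|$. Everything here is a made-up assumption for the example: the experts, the cluster `centers`, the distance-based gate, and the `sharpness` parameter; in practice both the experts and the gate would be learned.

```python
import math

def gate(x, centers, sharpness=1.0):
    """Softmax gate: experts whose subdomain center is closer to x get more weight."""
    scores = [-sharpness * (x - c) ** 2 for c in centers]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical local experts, each accurate on one subdomain:
# y = -x on the negative side, y = x on the positive side.
experts = [lambda x: -x, lambda x: x]
centers = [-1.0, 1.0]  # assumed subdomain centers, e.g. from clustering

def mixture(x):
    # Output is the gate-weighted sum of the experts' predictions.
    weights = gate(x, centers, sharpness=4.0)
    return sum(w * f(x) for w, f in zip(weights, experts))
```

Far from the boundary between subdomains the gate puts almost all of its weight on one expert, so `mixture(2.0)` and `mixture(-2.0)` are both very close to 2.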

2021-07-03 · 3 min · Cong Chan

Survey - Pre-Trained Models - Past, Present and Future

Links: https://arxiv.org/abs/2106.07139 A quick look at the newly released survey on pre-trained models. First, the definitions of some terms used in the survey. Transfer learning: a method for addressing the data-hungry problem in machine learning; it is supervised. Self-Supervised Learning: also addresses the data-hungry problem, especially for completely unlabeled data, by learning with the data itself as the labels in some way (for example, language modeling). It therefore has much in common with unsupervised learning. Unsupervised learning usually refers to pattern-recognition problems such as clustering, community discovery, and anomaly detection, whereas self-supervised learning remains within the supervised paradigm and focuses on problems such as classification and generation. Pre-trained models (PTMs): pre-training is a specific training scheme that can use either transfer learning or self-supervised learning. 2 Background Roadmap Pre-training falls into two broad categories: 2.1 Transfer Learning and Supervised Pre-Training, which can be further divided into feature transfer and parameter transfer. 2.2 Self-Supervised Learning and Self-Supervised Pre-Training Transfer learning can be divided into four subcategories: inductive transfer learning (Lawrence and Platt, 2004; Mihalkova et al., 2007; Evgeniou and Pontil, 2007), transductive transfer learning (Shimodaira, 2000; Zadrozny, 2004; Daume III and Marcu, 2006), self-taught learning (Raina et al....

2021-06-19 · 10 min · Cong Chan

CorefQA - Coreference resolution as query-based span prediction

2020, ACL data: CoNLL-2012, GAP task: Coreference Resolution This paper handles coreference in a QA fashion: a query is generated for each candidate mention using its surrounding context, and a span prediction module is employed to extract the text spans of the coreferences within the document using the generated query. Recent methods consider all text spans in a document as potential mentions and learn to find an antecedent for each possible mention. Drawbacks of such methods, which rely only on comparing mentions: At the task formalization level: because current datasets contain many omitted mentions, mentions left out at the mention proposal stage can never be recovered since the downstream module only operates on the proposed mentions....

2021-05-11 · 2 min · Cong Chan

Addressing Sample Imbalance at the Loss Level

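Besides statistical methods such as up-sampling, down-sampling, and adjusting sample weights, sample imbalance can also be addressed through the design of the loss function. For multi-class classification (choose 1 of n), softmax is generally used; for multi-label classification (choose k of n), the problem is generally converted into n sigmoid binary classification problems. ...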
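To make the two baseline formulations concrete, here is a minimal sketch of the losses they imply: softmax cross-entropy for the 1-of-n case and a sum of n independent sigmoid binary cross-entropies for the k-of-n case. The function names are chosen for this example only, and the sketch shows the plain losses, not any imbalance-aware variant.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Multi-class (1 of n): -log softmax(logits)[target_index]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

def sigmoid_binary_cross_entropy(logits, targets):
    """Multi-label (k of n): sum of n independent binary cross-entropies."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total

multi_class_loss = softmax_cross_entropy([0.0, 0.0], target_index=0)
multi_label_loss = sigmoid_binary_cross_entropy([0.0, 0.0], targets=[1, 0])
```

In the multi-label form each of the n labels contributes its own term, so a label that is rarely positive contributes mostly negative-class terms; that per-label structure is what loss-level designs for imbalance have to work with.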

2021-05-07 · 3 min · Cong Chan

Early Rumour Detection

2019, ACL data: TWITTER, WEIBO links: https://www.aclweb.org/anthology/N19-1163, https://github.com/DeepBrainAI/ERD task: Rumour Detection This paper uses a GRU to encode the stream of social media posts as the state representation of the environment, and trains a classifier that takes the GRU's state output as input and makes a binary decision on whether the text is a rumour. An agent is trained with DQN to decide, based on the state, whether to trigger the rumour classifier, and it is rewarded or penalized according to whether the classification is correct. The goal is to predict whether a stream of social media posts is a rumour as accurately and as early as possible. Focuses on the task of rumour detection; particularly, we are interested in understanding how early we can detect them. Our model treats social media posts (e.g. tweets) as a data stream and integrates reinforcement learning to learn the minimum number of posts required before we classify an event as a rumour. Let $E$ denote an event, and it consists of a series of relevant posts $x_i$, where $x_0$ denotes the source message and $x_T$ the last relevant message....

2021-05-01 · 3 min · Cong Chan