Cong's Log

Paper Reading - Constitutional AI

Bai, Yuntao, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, arXiv, 15 Dec. 2022. arXiv.org, http://arxiv.org/abs/2212.08073. The paper introduces Constitutional AI (CAI), a method to train helpful and harmless AI assistants without human labels for harmful outputs, relying instead on a set of guiding principles. Here is a structured summary. 1. Objective: Train AI systems to be helpful, honest, and harmless using AI feedback for supervision, reducing reliance on human labels. The approach aims to address the tension between helpfulness and harmlessness (where prior models often became evasive) and improve transparency through explicit principles. ...

Paper Reading - Let’s Verify Step by Step

TL;DR: To train more reliable models, there are two known options: outcome supervision, which gives feedback on the final result, and process supervision, which gives feedback on each intermediate reasoning step. This paper reports two findings: (1) process supervision yields significantly better results than outcome supervision when training models to solve problems from the challenging MATH dataset; (2) active learning significantly improves the efficacy of process supervision. The paper focuses exclusively on how to train the most reliable reward model. ...
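To make the two kinds of feedback concrete, here is a minimal sketch of what each one labels. The `Solution` class, the toy problem, and the `flagged` set standing in for an annotator are all hypothetical; the paper's actual data format is not shown in this excerpt.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    steps: list[str]      # intermediate reasoning steps
    final_answer: str     # the result the solution ends on


def outcome_label(solution: Solution, reference_answer: str) -> int:
    """Outcome supervision: a single label, based only on the final result."""
    return int(solution.final_answer.strip() == reference_answer.strip())


def process_labels(solution: Solution, step_is_correct) -> list[int]:
    """Process supervision: one label per intermediate reasoning step."""
    return [int(step_is_correct(step)) for step in solution.steps]


# Toy problem (hypothetical): "x + 3 = 7. What is 2x?" The reference answer is 8.
sol = Solution(
    steps=["x + 3 = 7, so x = 5", "2 * x = 2 * 5 = 8"],
    final_answer="8",
)
# Stand-in for a human annotator: the set of steps judged to be wrong.
flagged = {"x + 3 = 7, so x = 5", "2 * x = 2 * 5 = 8"}

print(outcome_label(sol, "8"))                           # final answer matches
print(process_labels(sol, lambda s: s not in flagged))   # both steps are wrong
```

In this toy case the final answer is right even though both steps are wrong, so the outcome label cannot tell the solution apart from a sound one, while the per-step labels can.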

John Schulman and Yoav Goldberg's Views on Behavior Cloning (BC), RL, and Truthfulness

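Cong Chen, University of Edinburgh. John Schulman recently gave a talk at Berkeley sharing his views on BC, RLHF, and truthfulness [1]. Yoav Goldberg summarized and extended Schulman's points [2], and Professor Yang Yu of Nanjing University also shared his views on BC versus RL [3]. Three core points emerge. (1) Behavior cloning (BC, i.e. learning from demonstrations, or SFT) is the most effective method. RLHF itself relies heavily on BC: both the cold start and reward model training use it. Although BC is more effective and easier to get working than RL, it has inherent limitations that it cannot resolve. The core problem is that the more BC training generalizes, the more the LLM hallucinates and lies. We want to encourage the LLM to answer from its internal knowledge, but we do not know what that internal knowledge contains, so RLHF is needed to teach the LLM which questions fall outside its knowledge (that is, to let the model know what it does not know). In addition, RL allows negative feedback, and negative feedback is much more powerful. (2) Ranking-based reward learning is not good enough, but it is easier to put into practice. (3) Future direction: when the LLM knows that it does not know, today it mostly declines by honestly saying "I don't know"; OpenAI's direction is to have the LLM search external knowledge and generate more trustworthy answers with cited sources, i.e. to move from honesty to truthfulness. See ChatGPT Browsing below. Detailed notes on the talk by John Schulman: Why there is hallucination. Is "whether a model knows something" a meaningful question? RL is the correct way. Long-form QA (LFQA) is much more difficult than short QA. ...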

Paper Reading - Complexity-Based Prompting for Multi-Step Reasoning

Tags: 2023, ICLR Links: https://github.com/FranxYao/chain-of-thought-hub Paper: Fu, Yao, et al. Complexity-Based Prompting for Multi-Step Reasoning. arXiv:2210.00720, arXiv, 30 Jan. 2023. arXiv.org, http://arxiv.org/abs/2210.00720. Motivation: Example selection is a central problem in the prompting literature. For CoT prompting, example selection is further related to annotation efficiency, as CoT requires manually annotated reasoning chains. Which reasoning examples make the most effective prompts? The paper proposes complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning. It shows that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on multi-step reasoning tasks than strong baselines. ...
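The selection scheme described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: complexity is approximated by counting non-empty lines in a chain, and the example pool, field names, and prompt layout are hypothetical.

```python
def count_steps(chain: str) -> int:
    """Complexity proxy (an assumption): one reasoning step per non-empty line."""
    return sum(1 for line in chain.splitlines() if line.strip())


def select_complex_examples(examples: list[dict], k: int) -> list[dict]:
    """Keep the k annotated examples whose reasoning chains have the most steps."""
    return sorted(examples, key=lambda ex: count_steps(ex["chain"]), reverse=True)[:k]


def build_prompt(examples: list[dict], question: str) -> str:
    """Concatenate the selected examples and append the new question."""
    parts = [
        f"Question: {ex['question']}\n{ex['chain']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


# Hypothetical annotated pool; only the number of steps differs.
pool = [
    {"question": "q1", "chain": "step 1", "answer": "a1"},
    {"question": "q2", "chain": "step 1\nstep 2\nstep 3", "answer": "a2"},
    {"question": "q3", "chain": "step 1\nstep 2", "answer": "a3"},
]
chosen = select_complex_examples(pool, k=2)
print([ex["question"] for ex in chosen])
print(build_prompt(chosen, "new question"))
```

Counting lines is the simplest proxy for "more reasoning steps"; any other step counter can be swapped in through `count_steps` without touching the selection logic.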

CoT on BBH - Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

CoT on BBH: M. Suzgun et al., ‘Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them’. arXiv, Oct. 17, 2022. Available: http://arxiv.org/abs/2210.09261 Method: Apply chain-of-thought (CoT) prompting to BIG-Bench Hard tasks. Evaluate few-shot performance with standard “answer-only” prompting and with chain-of-thought prompting on the BIG-Bench Hard benchmark. Results/Analysis/Findings: Benchmark: BIG-Bench Hard (BBH). These are the tasks on which prior language model evaluations did not outperform the average human rater. Many tasks in BBH require multi-step reasoning ...

Efficient Training of Language Models to Fill in the Middle

Bavarian, Mohammad, et al. Efficient Training of Language Models to Fill in the Middle. arXiv:2207.14255, arXiv, 28 July 2022. arXiv.org, http://arxiv.org/abs/2207.14255. Data: https://www.github.com/openai/human-eval-infilling TL;DR: Autoregressive language models can effectively learn to infill text by moving a span of text from the middle of a document to its end, without harming the original generative capability. Training models with this technique, called fill-in-the-middle (FIM), is useful, simple, and efficient, and should be the default for future autoregressive language models. The study provides best practices and strong default settings for training FIM models and releases infilling benchmarks to aid future research. ...
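The data transformation behind FIM, moving a middle span to the end of the document, can be illustrated as follows. This is a minimal character-level sketch: the sentinel strings `<PRE>`, `<SUF>`, and `<MID>` are hypothetical placeholders, and the uniform choice of cut points is an assumption, not a setting taken from the paper.

```python
import random

# Hypothetical sentinel strings marking the three pieces of the document.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(document: str, rng: random.Random) -> str:
    """Cut the document at two random positions and move the middle span to the end."""
    i, j = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # An ordinary left-to-right model trained on this string learns to
    # generate `middle` conditioned on both `prefix` and `suffix`.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def fim_inverse(example: str) -> str:
    """Rebuild the original document (assumes it contains no sentinel strings)."""
    body = example[len(PRE):]
    prefix, rest = body.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


doc = "def add(a, b):\n    return a + b\n"
example = fim_transform(doc, random.Random(0))
print(example)
print(fim_inverse(example) == doc)
```

Because the transform only reorders text, it can be applied to a fraction of the training documents while the rest stay in their usual left-to-right form.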

The Curious Case of Neural Text Degeneration

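Holtzman, Ari, et al. The Curious Case of Neural Text Degeneration. arXiv:1904.09751, arXiv, 14 Feb. 2020. arXiv.org, http://arxiv.org/abs/1904.09751. Introduction: What the best decoding strategy is for generating text from a language model (for example, generating a story) remains an open question. The counter-intuitive empirical observation is that, even though using likelihood as the training objective yields high-quality models for a broad range of language understanding tasks, maximization-based decoding methods (such as beam search) lead to degeneration: output text that is bland, incoherent, or stuck in repetitive loops. Decoding strategies for text generation fall into two main categories: argmax decoding, which mainly includes beam search, class-factored softmax, and so on; and stochastic decoding, which mainly includes temperature sampling, top-k sampling, and so on. To address this problem, the paper proposes Nucleus Sampling (top-p sampling), a simple but effective method that draws higher-quality text from neural language models than previous decoding strategies. The key idea is to use the shape of the probability distribution to determine the set of tokens to be sampled from. Method: Avoid text degeneration by truncating the unreliable tail of the probability distribution and sampling from the dynamic nucleus of tokens that contains the vast majority of the probability mass. Results/Analysis/Findings: To properly examine current maximization-based and stochastic decoding methods, the generations from each method are compared with the distribution of human text along several axes, such as likelihood, diversity, and repetition. ...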
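The truncate-then-sample idea can be sketched with the standard library alone. The toy vocabulary and probabilities below are made up for illustration; a real decoder would apply the same steps to the model's next-token distribution at every position.

```python
import random


def nucleus(probs: dict[str, float], p: float) -> dict[str, float]:
    """Smallest set of highest-probability tokens whose total mass reaches p, renormalized."""
    kept: dict[str, float] = {}
    mass = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= p:          # the nucleus is complete; drop the remaining tail
            break
    return {token: prob / mass for token, prob in kept.items()}


def sample_top_p(probs: dict[str, float], p: float, rng: random.Random) -> str:
    """Draw one token from the renormalized nucleus."""
    nuc = nucleus(probs, p)
    tokens = list(nuc)
    return rng.choices(tokens, weights=[nuc[t] for t in tokens], k=1)[0]


# Made-up next-token distribution with a low-probability tail.
next_token = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
print(sorted(nucleus(next_token, p=0.9)))     # the tail token "zzz" is excluded
print(sample_top_p(next_token, 0.9, random.Random(0)))
```

Unlike top-k, the number of kept tokens is not fixed: a peaked distribution yields a small nucleus and a flat one yields a large nucleus, which is what "dynamic" refers to.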

Codex - Evaluating Large Language Models Trained on Code

Codex: M. Chen et al., ‘Evaluating Large Language Models Trained on Code’. arXiv, Jul. 14, 2021. Available: http://arxiv.org/abs/2107.03374 Intro: Codex is a GPT language model fine-tuned on publicly available code from GitHub. Task: docstring-conditional code generation. Method: Codex: fine-tune GPT-3 models containing up to 12B parameters on code to produce Codex. Codex-S: fine-tune Codex on standalone, correctly implemented functions. Inference: assemble each HumanEval problem into a prompt consisting of a header, a signature, and a docstring. We use nucleus sampling (Holtzman et al., 2020) with top-p = 0.95 for all sampling evaluation in this work ...

Scaling Laws for Neural Language Models

Kaplan, Jared, et al. ‘Scaling Laws for Neural Language Models’. arXiv:2001.08361 [Cs, Stat], Jan. 2020. arXiv.org, http://arxiv.org/abs/2001.08361. TL;DR: The key findings for Transformer language models are as follows: Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3) Smooth power laws: Performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude (see Figure 1). We observe no signs of deviation from these trends on the upper end, though performance must flatten out eventually before reaching zero loss. (Section 3) ...
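A power law of the form L(x) = a * x^(-alpha) is a straight line on log-log axes, so its exponent can be recovered with an ordinary least-squares fit. The sketch below shows that step only; the data points and the constants `a = 5.0` and `alpha = 0.08` are synthetic values chosen for illustration, not results from the paper.

```python
import math


def fit_power_law(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = a * x ** (-alpha) by least squares on (log x, log y). Returns (a, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (
        sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
        / sum((u - mean_x) ** 2 for u in lx)
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic (made-up) losses that follow an exact power law in parameter count.
param_counts = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in param_counts]

a, alpha = fit_power_law(param_counts, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

On real measurements the points only follow the line where the other two scale factors are not the bottleneck, so the fit should be restricted to that regime.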

Switch Transformers - Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

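Links: https://arxiv.org/abs/2101.03961 “SWITCH TRANSFORMERS: SCALING TO TRILLION PARAMETER MODELS WITH SIMPLE AND EFFICIENT SPARSITY” proposes a network that can scale to a trillion parameters, with two main innovations. First, building on the Transformer MoE architecture, it simplifies the MoE routing mechanism and reduces computation. Second, it reduces training communication and improves training performance by combining data parallelism, model parallelism, and expert parallelism. Model: Simplifying Sparse Routing. Mixture of Expert Routing: takes as input a token representation x and routes it to the best determined top-k experts. Switch Routing: route to only a single expert; this simplification preserves model quality, reduces routing computation, and performs better. Sparse routing uses the parameter W_r to compute a softmax distribution over the N experts and, for each input token, selects the k experts with the highest probability; this corresponds to the gating mechanism in MoE. As a result, compute requirements do not grow substantially as the parameter count increases, which makes the model easier to train. ...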
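The routing step described above can be sketched in plain Python: compute a softmax over the N experts from the router weights, then keep the top k experts (MoE) or only the single best one (Switch). The weights and the token vector are made-up numbers, and the function names are hypothetical; load balancing and expert capacity are left out.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_probs(x: list[float], w_r: list[list[float]]) -> list[float]:
    """Softmax distribution over N experts; w_r holds one weight row per expert."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in w_r]
    return softmax(logits)


def top_k_route(probs: list[float], k: int) -> list[tuple[int, float]]:
    """MoE routing: (expert index, gate value) for the k most probable experts."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


def switch_route(probs: list[float]) -> tuple[int, float]:
    """Switch routing: the k = 1 case, so each token visits a single expert."""
    return top_k_route(probs, 1)[0]


# Made-up token representation and router weights for N = 3 experts.
x = [1.0, 0.0]
w_r = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0]]
probs = router_probs(x, w_r)
print(top_k_route(probs, k=2))   # experts 1 and 2, in that order
print(switch_route(probs))       # expert 1 only
```

With k = 1 each token runs through one expert's feed-forward computation, which is why adding experts grows the parameter count without a matching growth in per-token compute.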