Paper Reading - The Instruction Hierarchy - Training LLMs to Prioritize Privileged Instructions

Summary of “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions”. Wallace, Eric, et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208, arXiv, 19 Apr. 2024, http://arxiv.org/abs/2404.13208. 1. Problem Statement: Modern large language models (LLMs) are vulnerable to attacks such as prompt injections and jailbreaks because they treat system prompts, user messages, and third-party inputs (e.g., tool outputs) as having equal priority. This lets adversaries override the intended instructions, leading to risks such as data exfiltration or unauthorized actions....

<span title='2024-04-20 00:00:00 +0000 UTC'>2024-04-20</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;Cong

Paper Reading - Weak-to-Strong Generalization - Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision. https://arxiv.org/abs/2312.09390. Research Context and Objectives: The paper addresses a critical challenge in aligning superhuman AI models: when human supervision becomes insufficient because of the models’ complex behaviors, can weak supervision (e.g., from weaker models) effectively elicit the full capabilities of stronger models?...

<span title='2023-12-20 00:00:00 +0000 UTC'>2023-12-20</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;Cong

Paper Reading - Constitutional AI

Bai, Yuntao, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, arXiv, 15 Dec. 2022, http://arxiv.org/abs/2212.08073. The paper introduces Constitutional AI (CAI), a method for training helpful and harmless AI assistants without human labels for harmful outputs, relying instead on a set of guiding principles. Here’s a structured summary: 1. Objective: Train AI systems to be helpful, honest, and harmless using AI feedback for supervision, reducing reliance on human labels....

<span title='2023-08-10 00:00:00 +0000 UTC'>2023-08-10</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;Cong

John Schulman and Yoav Goldberg on Behavior Cloning (BC), RL, and Truthfulness

Cong Chen, University of Edinburgh. John Schulman recently shared his views on BC, RLHF, and truthfulness at Berkeley1; Yoav Goldberg summarized and extended John Schulman’s points2; and Professor Yang Yu of Nanjing University also shared his perspective on the comparison between BC and RL3. There are three core takeaways: Behavior Cloning (BC, learning from demonstrations, i.e. SFT) is the most effective method. RLHF itself makes heavy use of BC: both the cold start and reward-model training rely on it. Although BC is more effective and easier to make work than RL, it has inherent limitations it cannot overcome: the core problem is that the better BC generalizes, the more the LLM hallucinates and lies. We want to encourage the LLM to answer from its internal knowledge, but we do not know what that internal knowledge contains, so RLHF is needed to teach the LLM which questions exceed its knowledge (making the model know what it doesn’t know). Beyond that, RL also allows negative feedback, and negative feedback is much more powerful. Ranking-based reward learning is not ideal, but it is easier to put into practice. Future direction: when an LLM knows that it doesn’t know, the current practice is mostly to honestly refuse with “I don’t know”; OpenAI’s direction is to have the LLM search external knowledge and generate more credible answers with cited sources, i.e., evolving from honesty to truthfulness. See the ChatGPT Browsing discussion below. Detailed talk - by John Schulman: Why there is hallucination; Is “whether a model knows something” a meaningful question?; RL is the correct way; Long-form QA (LFQA) is much more difficult than short QA...

<span title='2023-04-30 00:00:00 +0000 UTC'>2023-04-30</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;Cong Chan