Readings

Paper Reading - Inference-Time Scaling for Generalist Reward Modeling

Liu, Zijun, et al. Inference-Time Scaling for Generalist Reward Modeling. arXiv:2504.02495, arXiv, 5 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.02495. Problem Statement Reinforcement Learning (RL) has become pivotal in post-training large language models (LLMs), but generating accurate reward signals for diverse domains remains challenging. Existing reward models (RMs) often rely on human-designed rules or verifiable tasks, struggling with generalizability and inference-time scalability. This paper addresses how to improve RM effectiveness through increased inference compute and adaptive learning methods for general queries. ...

Paper Reading - The Instruction Hierarchy - Training LLMs to Prioritize Privileged Instructions

Summary of “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions” Wallace, Eric, et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208, arXiv, 19 Apr. 2024. arXiv.org, http://arxiv.org/abs/2404.13208. 1. Problem Statement Modern large language models (LLMs) are vulnerable to attacks like prompt injections and jailbreaks because they treat system prompts, user messages, and third-party inputs (e.g., tool outputs) as equal in priority. This allows adversaries to override intended instructions, leading to risks such as data exfiltration or unauthorized actions . ...

Paper Reading - Weak-to-Strong Generalization - Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. WEAK-TO-STRONG GENERALIZATION: ELICITING STRONG CAPABILITIES WITH WEAK SUPERVISION. https://arxiv.org/abs/2312.09390 Research Context and Objectives The paper addresses a critical challenge in aligning superhuman AI models: when human supervision becomes insufficient due to the models’ complex behaviors, can weak supervision (e.g., from weaker models) effectively elicit the full capabilities of stronger models? The authors from OpenAI explore this through empirical experiments, aiming to bridge the gap between current alignment techniques (like RLHF) and the needs for superhuman model alignment. ...

Paper Reading - Constitutional AI

Bai, Yuntao, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, arXiv, 15 Dec. 2022. arXiv.org, http://arxiv.org/abs/2212.08073. The paper introduces Constitutional AI (CAI), a method to train helpful and harmless AI assistants without human labels for harmful outputs, relying instead on a set of guiding principles. Here’s a structured summary: 1. Objective Train AI systems to be helpful, honest, and harmless using AI feedback for supervision, reducing reliance on human labels. The approach aims to address the tension between helpfulness and harmlessness (where prior models often became evasive) and improve transparency through explicit principles. ...

Paper Reading - Let’s Verify Step by Step

TLDR In order to train more dependable models, there are two known options: outcome supervision, which gives feedback on the final result, and process supervision, which provides feedback on each intermediate reasoning step. This papers provides two finding: The use of process supervision yields significantly better results than outcome supervision when training models to solve problems from the challenging MATH dataset. The efficacy of process supervision is significantly improved by active learning. The exclusive focus of this paper is to provide insights on training the most reliable reward model. ...