Awesome Large Language Model (LLM) Post-training - [2025 Update]

In the race to build truly helpful AI assistants, we’ve discovered a fundamental truth: raw intelligence isn’t enough. A model that masters calculus but can’t refuse harmful requests is like a library with no librarian - overflowing with knowledge but dangerously uncurated. This is the alignment problem: how do we transform raw language models into trustworthy collaborators? For years, Reinforcement Learning from Human Feedback (RLHF) reigned supreme. Its PPO-based approach taught ChatGPT to decline malicious requests and helped Claude write harmless poetry. But beneath the surface, RLHF’s complexity was showing: ...

2025-05-30 · 46 min · Cong

Multi-token Prediction

Multi-token prediction vs. next-token prediction: Next-token prediction is the standard training objective for most large language models (LLMs), where the model learns to predict the subsequent token in a sequence given all preceding tokens. The model is trained to maximize the probability of the next token \( x_{t+1} \) given the context \( x_{1:t} \) (all tokens up to position \( t \)). The cross-entropy loss for next-token prediction is defined as: ...
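For reference, a standard way to write these objectives (a sketch; the post's exact notation may differ): the next-token loss sums the negative log-likelihood of each token given its prefix, while a multi-token variant with \( n \) prediction heads additionally sums over \( n \) future offsets.

\[
\mathcal{L}_{\text{NTP}} = -\sum_{t} \log P_\theta\!\left(x_{t+1} \mid x_{1:t}\right), \qquad
\mathcal{L}_{\text{MTP}} = -\sum_{t} \sum_{k=1}^{n} \log P_\theta\!\left(x_{t+k} \mid x_{1:t}\right)
\]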

2025-06-29 · 7 min · Cong

The Evolution of Reward Modeling - From Human Feedback to Generative Inference-Time Scaling

Reward modeling (RM) has emerged as a cornerstone of large language model (LLM) alignment, guiding models to align with human values and perform complex tasks. Early approaches relied heavily on Reinforcement Learning from Human Feedback (RLHF), but recent research has shifted toward more scalable, efficient, and generalizable RM frameworks. This post traces the developmental arc of RM, connecting four seminal papers that have shaped the field: from Constitutional AI and self-evaluation mechanisms to inference-time scaling for generalist RM. ...

2025-05-25 · 9 min · Cong

Paper Reading - Inference-Time Scaling for Generalist Reward Modeling

Liu, Zijun, et al. Inference-Time Scaling for Generalist Reward Modeling. arXiv:2504.02495, arXiv, 5 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.02495. Problem Statement: Reinforcement Learning (RL) has become pivotal in post-training large language models (LLMs), but generating accurate reward signals for diverse domains remains challenging. Existing reward models (RMs) often rely on human-designed rules or verifiable tasks, struggling with generalizability and inference-time scalability. This paper addresses how to improve RM effectiveness through increased inference compute and adaptive learning methods for general queries. ...
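A minimal sketch of the general idea, independent of the paper's specific method: sample several reward judgments for the same query-response pair and aggregate them, so that spending more inference compute (larger `k`) yields a more reliable reward signal. The `sample_judgment` callable is a hypothetical stand-in for one call to a generative reward model.

```python
import random
from collections import Counter
from typing import Callable

def scaled_reward(
    sample_judgment: Callable[[str, str], int],  # (query, response) -> discrete score
    query: str,
    response: str,
    k: int = 8,
) -> float:
    """Aggregate k independently sampled reward judgments by majority vote."""
    scores = [sample_judgment(query, response) for _ in range(k)]
    counts = Counter(scores)
    # Pick the most frequent score; break ties toward the higher score.
    best_score, _ = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return float(best_score)

# Toy usage with a noisy mock judge (assumption: scores on a 1-5 scale).
mock_judge = lambda q, r: random.choice([3, 4, 4, 5])
print(scaled_reward(mock_judge, "some query", "some response", k=8))
```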

2025-05-05 · 2 min · Cong

Paper Reading - Weak-to-Strong Generalization - Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision. https://arxiv.org/abs/2312.09390. Research Context and Objectives: The paper addresses a critical challenge in aligning superhuman AI models: when human supervision becomes insufficient due to the models' complex behaviors, can weak supervision (e.g., from weaker models) effectively elicit the full capabilities of stronger models? The authors from OpenAI explore this question through empirical experiments, aiming to bridge the gap between current alignment techniques (like RLHF) and the needs of superhuman model alignment. ...

2023-12-20 · 2 min · Cong

Paper Reading - Let’s Verify Step by Step

TLDR: To train more dependable models, there are two known options: outcome supervision, which gives feedback on the final result, and process supervision, which provides feedback on each intermediate reasoning step. This paper presents two findings: (1) process supervision yields significantly better results than outcome supervision when training models to solve problems from the challenging MATH dataset, and (2) the efficacy of process supervision is significantly improved by active learning. The paper focuses exclusively on training the most reliable reward model. ...
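To make the contrast concrete, here is an illustrative sketch (not the paper's implementation) in which hypothetical callables stand in for an outcome check and a trained process reward model:

```python
from typing import Callable, List

def outcome_score(
    final_answer_correct: Callable[[str], float],  # hypothetical: P(final answer is correct)
    solution_steps: List[str],
) -> float:
    """Outcome supervision: only the final result receives feedback."""
    return final_answer_correct(solution_steps[-1])

def process_score(
    step_correct_prob: Callable[[str], float],  # hypothetical: P(this step is correct)
    solution_steps: List[str],
) -> float:
    """Process supervision: every intermediate step receives feedback.
    Here the solution-level score is the product of per-step probabilities,
    one common way to aggregate step scores into a single reward."""
    score = 1.0
    for step in solution_steps:
        score *= step_correct_prob(step)
    return score
```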

2023-06-18 · 9 min · Cong

Paper Reading - Complexity-Based Prompting for Multi-Step Reasoning

Tags: 2023, ICLR. Links: https://github.com/FranxYao/chain-of-thought-hub. Paper: Fu, Yao, et al. Complexity-Based Prompting for Multi-Step Reasoning. arXiv:2210.00720, arXiv, 30 Jan. 2023. arXiv.org, http://arxiv.org/abs/2210.00720. Motivation: Example selection is a central problem in the prompting literature. For CoT prompting, example selection is further tied to annotation efficiency, since CoT requires manually annotated reasoning chains. Which reasoning examples make the most effective prompts? The authors propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning, and show that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on multi-step reasoning tasks over strong baselines. ...
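A minimal sketch of the selection scheme, assuming each annotated chain encodes one reasoning step per line (the step-counting heuristic is illustrative, not the paper's exact procedure):

```python
from typing import List, Tuple

def select_complex_exemplars(
    annotated_pool: List[Tuple[str, str]],  # (question, annotated reasoning chain)
    num_exemplars: int = 8,
) -> List[Tuple[str, str]]:
    """Keep the exemplars whose reasoning chains have the most steps."""
    def num_steps(chain: str) -> int:
        # Assumption: one reasoning step per non-empty line of the chain.
        return sum(1 for line in chain.splitlines() if line.strip())
    # Rank by reasoning complexity (more steps first) and keep the top-k.
    ranked = sorted(annotated_pool, key=lambda qa: num_steps(qa[1]), reverse=True)
    return ranked[:num_exemplars]
```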

2023-04-19 · 2 min · Cong