An overview of Post-training algorithms for large language model (LLM)

In the race to build truly helpful AI assistants, we’ve discovered a fundamental truth: raw intelligence isn’t enough. A model that masters calculus but can’t refuse harmful requests is like a library with no librarian - overflowing with knowledge but dangerously uncurated. This is the alignment problem: how do we transform raw language models into trustworthy collaborators? For years, Reinforcement Learning from Human Feedback (RLHF) reigned supreme. Its PPO-based approach taught ChatGPT to decline malicious requests and helped Claude write harmless poetry....

<span title='2025-05-30 00:00:00 +0000 UTC'>2025-05-30</span>&nbsp;·&nbsp;15 min&nbsp;·&nbsp;Cong

The Evolution of Reward Modeling - From Human Feedback to Generative Inference-Time Scaling

An Overview: The Critical Role of Reward Modeling in LLM Alignment Reward modeling (RM) has emerged as a cornerstone of large language model (LLM) alignment, guiding models to align with human values and perform complex tasks. Early approaches relied heavily on Reinforcement Learning from Human Feedback (RLHF), but recent research has shifted toward more scalable, efficient, and generalizable RM frameworks. This blog explores the developmental arc of RM, connecting four seminal papers that have shaped the field: from Constitutional AI and self-evaluation mechanisms to inference-time scaling for generalist RM....

<span title='2025-05-25 00:00:00 +0000 UTC'>2025-05-25</span>&nbsp;·&nbsp;9 min&nbsp;·&nbsp;Cong

Paper Reading - Inference-Time Scaling for Generalist Reward Modeling

Liu, Zijun, et al. Inference-Time Scaling for Generalist Reward Modeling. arXiv:2504.02495, arXiv, 5 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.02495. Problem Statement Reinforcement Learning (RL) has become pivotal in post-training large language models (LLMs), but generating accurate reward signals for diverse domains remains challenging. Existing reward models (RMs) often rely on human-designed rules or verifiable tasks, struggling with generalizability and inference-time scalability. This paper addresses how to improve RM effectiveness through increased inference compute and adaptive learning methods for general queries....

<span title='2025-05-05 00:00:00 +0000 UTC'>2025-05-05</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;Cong

DeepSeek-R1

DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, arXiv, 22 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12948. Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Large language models (LLMs) have made remarkable strides in mimicking human-like cognition, but their ability to reason through complex problems—from math proofs to coding challenges—remains a frontier. In a recent breakthrough, DeepSeek-AI introduces DeepSeek-R1, a family of reasoning-focused models that leverages reinforcement learning (RL) to unlock advanced reasoning capabilities, without relying on traditional supervised fine-tuning (SFT) as a crutch....

<span title='2025-01-25 00:00:00 +0000 UTC'>2025-01-25</span>&nbsp;·&nbsp;4 min&nbsp;·&nbsp;Cong Chan

Paper Reading - The Instruction Hierarchy - Training LLMs to Prioritize Privileged Instructions

Summary of “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions” Wallace, Eric, et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208, arXiv, 19 Apr. 2024. arXiv.org, http://arxiv.org/abs/2404.13208. 1. Problem Statement Modern large language models (LLMs) are vulnerable to attacks like prompt injections and jailbreaks because they treat system prompts, user messages, and third-party inputs (e.g., tool outputs) as equal in priority. This allows adversaries to override intended instructions, leading to risks such as data exfiltration or unauthorized actions ....

<span title='2024-04-20 00:00:00 +0000 UTC'>2024-04-20</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;Cong

Paper Reading - Weak-to-Strong Generalization - Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. WEAK-TO-STRONG GENERALIZATION: ELICITING STRONG CAPABILITIES WITH WEAK SUPERVISION. https://arxiv.org/abs/2312.09390 Research Context and Objectives The paper addresses a critical challenge in aligning superhuman AI models: when human supervision becomes insufficient due to the models’ complex behaviors, can weak supervision (e.g., from weaker models) effectively elicit the full capabilities of stronger models?...

<span title='2023-12-20 00:00:00 +0000 UTC'>2023-12-20</span>&nbsp;·&nbsp;2 min&nbsp;·&nbsp;Cong

A Better Practice to Define Reward Model with HuggingFace's transformers

Cong Chen University of Edinburgh There are various implementation of reward modeling in RLHF(reinforcement learning with human feedback), each has different pros and cons. Inspired by some open-sourced works about reward modeling, I would like to share one of the best practice for reward modeling. For those who are not familiar with reward modeling and RLHF, I recommend take a look at the Huggingface rlhf blog1 or OpenAI rlhf paper2....

<span title='2023-09-25 00:00:00 +0000 UTC'>2023-09-25</span>&nbsp;·&nbsp;7 min&nbsp;·&nbsp;Cong Chan

Boosting Large Language Models Alignment - A Data-Driven Bootstrap Flywheel

Cong Chen University of Edinburgh InstructGPT1, ChatGPT2, and GPT-43 are cutting-edge Large Language Models (LLMs) that have astounded the world. With their ability to follow human instructions and align with human preferences, they can act as chatbots or helpful assistants. Despite impressing people for a while, their development lifecycles have not yet been thoroughly elaborated. In this blog, I will provide my observations and thoughts based on my recent experience with large language model training and alignment....

<span title='2023-08-21 00:00:00 +0000 UTC'>2023-08-21</span>&nbsp;·&nbsp;6 min&nbsp;·&nbsp;Cong Chan

Paper Reading - Constitutional AI

Bai, Yuntao, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073, arXiv, 15 Dec. 2022. arXiv.org, http://arxiv.org/abs/2212.08073. The paper introduces Constitutional AI (CAI), a method to train helpful and harmless AI assistants without human labels for harmful outputs, relying instead on a set of guiding principles. Here’s a structured summary: 1. Objective Train AI systems to be helpful, honest, and harmless using AI feedback for supervision, reducing reliance on human labels....

<span title='2023-08-10 00:00:00 +0000 UTC'>2023-08-10</span>&nbsp;·&nbsp;3 min&nbsp;·&nbsp;Cong

Paper Reading - Let’s Verify Step by Step

TLDR In order to train more dependable models, there are two known options: outcome supervision, which gives feedback on the final result, and process supervision, which provides feedback on each intermediate reasoning step. This papers provides two finding: The use of process supervision yields significantly better results than outcome supervision when training models to solve problems from the challenging MATH dataset. The efficacy of process supervision is significantly improved by active learning....

<span title='2023-06-18 00:00:00 +0000 UTC'>2023-06-18</span>&nbsp;·&nbsp;9 min&nbsp;·&nbsp;Cong