An Overview of Post-Training Algorithms for Large Language Models (LLMs)

In the race to build truly helpful AI assistants, we’ve discovered a fundamental truth: raw intelligence isn’t enough. A model that masters calculus but can’t refuse harmful requests is like a library with no librarian - overflowing with knowledge but dangerously uncurated. This is the alignment problem: how do we transform raw language models into trustworthy collaborators? For years, Reinforcement Learning from Human Feedback (RLHF) reigned supreme. Its PPO-based approach taught ChatGPT to decline malicious requests and helped Claude write harmless poetry....

2025-05-30 · 15 min · Cong

The Evolution of Reward Modeling - From Human Feedback to Generative Inference-Time Scaling

An Overview: The Critical Role of Reward Modeling in LLM Alignment

Reward modeling (RM) has emerged as a cornerstone of large language model (LLM) alignment, guiding models to align with human values and perform complex tasks. Early approaches relied heavily on Reinforcement Learning from Human Feedback (RLHF), but recent research has shifted toward more scalable, efficient, and generalizable RM frameworks. This post traces the developmental arc of RM, connecting four seminal papers that have shaped the field: from Constitutional AI and self-evaluation mechanisms to inference-time scaling for generalist RM....

2025-05-25 · 9 min · Cong

Paper Reading - Inference-Time Scaling for Generalist Reward Modeling

Liu, Zijun, et al. Inference-Time Scaling for Generalist Reward Modeling. arXiv:2504.02495, arXiv, 5 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.02495.

Problem Statement

Reinforcement Learning (RL) has become pivotal in post-training large language models (LLMs), but generating accurate reward signals across diverse domains remains challenging. Existing reward models (RMs) often rely on human-designed rules or verifiable tasks, struggling with generalizability and inference-time scalability. This paper addresses how to improve RM effectiveness through increased inference compute and adaptive learning methods for general queries....

2025-05-05 · 2 min · Cong

DeepSeek-R1

DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, arXiv, 22 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12948.

Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Large language models (LLMs) have made remarkable strides in mimicking human-like cognition, but their ability to reason through complex problems, from math proofs to coding challenges, remains a frontier. In a recent breakthrough, DeepSeek-AI introduces DeepSeek-R1, a family of reasoning-focused models that leverages reinforcement learning (RL) to unlock advanced reasoning capabilities, without relying on traditional supervised fine-tuning (SFT) as a crutch....

2025-01-25 · 4 min · Cong Chan