An Overview of Post-Training Algorithms for Large Language Models (LLMs)

In the race to build truly helpful AI assistants, we’ve discovered a fundamental truth: raw intelligence isn’t enough. A model that masters calculus but can’t refuse harmful requests is like a library with no librarian - overflowing with knowledge but dangerously uncurated. This is the alignment problem: how do we transform raw language models into trustworthy collaborators? For years, Reinforcement Learning from Human Feedback (RLHF) reigned supreme. Its PPO-based approach taught ChatGPT to decline malicious requests and helped Claude write harmless poetry....

2025-05-30 · 15 min · Cong

The Evolution of Reward Modeling - From Human Feedback to Generative Inference-Time Scaling

An Overview: The Critical Role of Reward Modeling in LLM Alignment

Reward modeling (RM) has emerged as a cornerstone of large language model (LLM) alignment, guiding models to align with human values and perform complex tasks. Early approaches relied heavily on Reinforcement Learning from Human Feedback (RLHF), but recent research has shifted toward more scalable, efficient, and generalizable RM frameworks. This post traces the developmental arc of RM, connecting four seminal papers that have shaped the field: from Constitutional AI and self-evaluation mechanisms to inference-time scaling for generalist RM....

2025-05-25 · 9 min · Cong

Paper Reading - Inference-Time Scaling for Generalist Reward Modeling

Liu, Zijun, et al. Inference-Time Scaling for Generalist Reward Modeling. arXiv:2504.02495, arXiv, 5 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.02495.

Problem Statement

Reinforcement Learning (RL) has become pivotal in post-training large language models (LLMs), but generating accurate reward signals across diverse domains remains challenging. Existing reward models (RMs) often rely on human-designed rules or verifiable tasks, struggling with generalizability and inference-time scalability. This paper addresses how to improve RM effectiveness through increased inference compute and adaptive learning methods for general queries....

2025-05-05 · 2 min · Cong

DeepSeek-R1

DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, arXiv, 22 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12948.

Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Large language models (LLMs) have made remarkable strides in mimicking human-like cognition, but their ability to reason through complex problems, from math proofs to coding challenges, remains a frontier. In a recent breakthrough, DeepSeek-AI introduces DeepSeek-R1, a family of reasoning-focused models that leverages reinforcement learning (RL) to unlock advanced reasoning capabilities, without relying on traditional supervised fine-tuning (SFT) as a crutch....

2025-01-25 · 4 min · Cong Chan