Welcome to Cong’s Log

Hi, this is Cong. I’m documenting my learning notes in this blog.

Agentic Search & Deep Research

Overview of Agentic Search and Deep Research Agentic Search and Deep Research serve two very different masteries: one is about finding, and the other is about synthesizing. But in practice, they are often intertwined. Agentic Search is the “engine” that powers Deep Research, while Deep Research is the “full-service agency” (the complete project from plan to final report). Internally, it uses Large Language Models (LLMs) to interleve reasoning and tool use action, allowing it to “think” about what it needs to find, then “act” by calling tools, and “reflect” on the results to refine its next actions. ...

Matrix Stability - Manifold-Constrained Hyper-Connections & Muon Optimizer

1. The Dichotomy of Matrices In modern architectures like Transformers, stability is not a monolithic property. The properties of these matrices – their sizes, entries, and eigenvalues – crucially affect how information (and gradients) flow. For example, if a weight matrix has very large or very small singular values, it can amplify or attenuate signals and gradients. Orthonormal matrices (whose rows and columns are orthogonal unit vectors) preserve vector norms and avoid distortion, serving as a gold standard for stable signal propagation. In general, keeping matrix products well-conditioned (e.g. with moderate spectral norms) is key to avoiding exploding or vanishing signals in deep nets. ...

Awesome Large Language Model (LLM) Post-training - [2025 Update]

In the race to build truly helpful AI assistants, we’ve discovered a fundamental truth: raw intelligence isn’t enough. A model that masters calculus but can’t refuse harmful requests is like a library with no librarian - overflowing with knowledge but dangerously uncurated. This is the alignment problem: how do we transform raw language models into trustworthy collaborators? For years, Reinforcement Learning from Human Feedback (RLHF) reigned supreme. Its PPO-based approach taught ChatGPT to decline malicious requests and helped Claude write harmless poetry. But beneath the surface, RLHF’s complexity was showing: ...

Connection Between Imitation Learning and RLHF

There have been many questions about whether DPO is a form of imitation learning or (offline) reinforcement learning. The more I observe the distributions of DPO’s chosen and rejection losses, the stronger the feeling becomes that DPO is more like a form of imitation learning. The paper, Xiao, Teng, et al. On a Connection Between Imitation Learning and RLHF. arXiv:2503.050791 also expresses that DPO is a form of imitation learning. ...

Multi-token Prediction

Multi-token prediction vs Next-token prediction Next-token prediction is the standard training objective for most large language models (LLMs), where the model learns to predict the subsequent token in a sequence given all preceding tokens. The model is trained to maximize the probability of the next token \( x_{t+1} \) given the context \( x_{1:t} \) (all tokens up to position \( t \)). The cross-entropy loss for next-token prediction is defined as: ...

The Evolution of Reward Modeling - From Human Feedback to Generative Inference-Time Scaling

Reward modeling (RM) has emerged as a cornerstone of large language model (LLM) alignment, guiding models to align with human values and perform complex tasks. Early approaches relied heavily on Reinforcement Learning from Human Feedback (RLHF), but recent research has shifted toward more scalable, efficient, and generalizable RM frameworks. This blog explores the developmental arc of RM, connecting four seminal papers that have shaped the field: from human labeled preference to AI feedback, from Generative Reward Models(GRM) to inference-time scaling for GRM. ...

Paper Reading - Inference-Time Scaling for Generalist Reward Modeling

Liu, Zijun, et al. Inference-Time Scaling for Generalist Reward Modeling. arXiv:2504.02495, arXiv, 5 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.02495. Problem Statement Reinforcement Learning (RL) has become pivotal in post-training large language models (LLMs), but generating accurate reward signals for diverse domains remains challenging. Existing reward models (RMs) often rely on human-designed rules or verifiable tasks, struggling with generalizability and inference-time scalability. This paper addresses how to improve RM effectiveness through increased inference compute and adaptive learning methods for general queries. ...

DeepSeek-R1

DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, arXiv, 22 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12948. Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Large language models (LLMs) have made remarkable strides in mimicking human-like cognition, but their ability to reason through complex problems—from math proofs to coding challenges—remains a frontier. In a recent breakthrough, DeepSeek-AI introduces DeepSeek-R1, a family of reasoning-focused models that leverages reinforcement learning (RL) to unlock advanced reasoning capabilities, without relying on traditional supervised fine-tuning (SFT) as a crutch. The paper “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” unveils a paradigm shift in how we train LLMs to think critically, with implications for both research and real-world applications. ...

Paper Reading - The Instruction Hierarchy - Training LLMs to Prioritize Privileged Instructions

Summary of “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions” Wallace, Eric, et al. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208, arXiv, 19 Apr. 2024. arXiv.org, http://arxiv.org/abs/2404.13208. 1. Problem Statement Modern large language models (LLMs) are vulnerable to attacks like prompt injections and jailbreaks because they treat system prompts, user messages, and third-party inputs (e.g., tool outputs) as equal in priority. This allows adversaries to override intended instructions, leading to risks such as data exfiltration or unauthorized actions . ...

Paper Reading - Weak-to-Strong Generalization - Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. WEAK-TO-STRONG GENERALIZATION: ELICITING STRONG CAPABILITIES WITH WEAK SUPERVISION. https://arxiv.org/abs/2312.09390 Research Context and Objectives The paper addresses a critical challenge in aligning superhuman AI models: when human supervision becomes insufficient due to the models’ complex behaviors, can weak supervision (e.g., from weaker models) effectively elicit the full capabilities of stronger models? The authors from OpenAI explore this through empirical experiments, aiming to bridge the gap between current alignment techniques (like RLHF) and the needs for superhuman model alignment. ...