RLHF | Cong's Log

Connection Between Imitation Learning and RLHF

There have been many questions about whether DPO is a form of imitation learning or (offline) reinforcement learning. The more I observe the distributions of DPO’s chosen and rejection losses, the stronger the feeling becomes that DPO is more like a form of imitation learning. The paper, Xiao, Teng, et al. On a Connection Between Imitation Learning and RLHF. arXiv:2503.050791 also expresses that DPO is a form of imitation learning. ...

A Better Practice to Define Reward Model with HuggingFace's transformers

Cong Chen University of Edinburgh There are various implementation of reward modeling in RLHF(reinforcement learning with human feedback), each has different pros and cons. Inspired by some open-sourced works about reward modeling, I would like to share one of the best practice for reward modeling. For those who are not familiar with reward modeling and RLHF, I recommend take a look at the Huggingface rlhf blog1 or OpenAI rlhf paper2. ...

Boosting Large Language Models Alignment - A Data-Driven Bootstrap Flywheel

Cong Chen University of Edinburgh InstructGPT1, ChatGPT2, and GPT-43 are cutting-edge Large Language Models (LLMs) that have astounded the world. With their ability to follow human instructions and align with human preferences, they can act as chatbots or helpful assistants. Despite impressing people for a while, their development lifecycles have not yet been thoroughly elaborated. In this blog, I will provide my observations and thoughts based on my recent experience with large language model training and alignment. Instead of introducing these LLMs and highlighting their impressive performance, I will focus on the bootstrap flywheel that continuously improving these models. The bootstrap flywheel is showed in below graph with further illustration in this post. ...

John Schulman和Yoav Goldberg关于Behavior Cloning(BC)、RL and Truthfulness的观点

Cong Chen University of Edinburgh John Schulman最近在Berkeley分享了关于BC、RLHF and Truthfulness的观点1，Yoav Goldberg也针对John Schulman的观点进行了总结和扩展2，同时南大的俞扬教授也对BC和RL的对比进行了观点分享3。归纳的核心观点有三个： Behavior Cloning（BC, learning from demonstrations, or SFT）是最Effective的方法。RLHF过程中重度使用了BC，包括冷启动和奖励模型训练都用了BC。虽然BC更有效，相比RL也更容易work，但BC因为自身局限性，有一些固有的问题无法解决：核心问题是，BC训练越泛化意味着LLM越会Hallucination和撒谎；而我们想鼓励LLM根据它的内部知识来回答，问题是我们不知道其内部知识包含什么，所以要利用RLHF让LLM知道什么问题是超过自己的知识范围的（让模型知道自己不知道）。除此之外，RL还允许负反馈，而 negative feedback is much more powerful 基于 Ranking 的 Reward学习虽然不够好，但是实践起来更容易未来优化方向：当LLM知道自己不知道时，目前更多的是诚实地表达“I dont know”来拒识，OpenAI的方向是让LLM尝试去搜索外部知识，生成更可信、带citing source的回答，也就是从Honest进化到Truthfulness。参考下面的 ChatGPT Browsing 详细分享 - by John Schulman Why there is Hallucination Is “if a model know something” a meaningful question? RL is the correct ways Long form QA (LFQA) is much difficult that short QA ...