Connection Between Imitation Learning and RLHF
There have been many questions about whether DPO is a form of imitation learning or (offline) reinforcement learning. The more I observe the distributions of DPO’s chosen and rejected losses, the more strongly I feel that DPO is closer to imitation learning. The paper by Xiao, Teng, et al., "On a Connection Between Imitation Learning and RLHF" (arXiv:2503.05079), also argues that DPO is a form of imitation learning. ...
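For readers who want to look at the same distributions, here is a minimal sketch of the per-example quantities in the standard DPO objective. It assumes that the "chosen" and "rejected" terms mean the implicit rewards, that is, beta times the log-probability ratio between the policy and the reference model for each response. The function name `dpo_terms`, its argument names, and the default `beta=0.1` are hypothetical choices for illustration and are not taken from the paper or from any particular library.

```python
import math


def dpo_terms(policy_chosen_logp, policy_rejected_logp,
              ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (chosen_reward, rejected_reward, loss) for one preference pair.

    Inputs are sequence-level log-probabilities (summed over tokens).
    All names here are illustrative assumptions, not a library API.
    """
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # DPO loss: -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable softplus form.
    margin = chosen_reward - rejected_reward
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return chosen_reward, rejected_reward, loss


# Example pair: the policy raised the chosen log-prob and lowered the rejected one.
chosen_r, rejected_r, loss = dpo_terms(-10.0, -15.0, -12.0, -11.0)
```

Collecting `chosen_reward` and `rejected_reward` over a batch and plotting them separately is one way to see whether training mostly pushes the chosen responses up, mostly pushes the rejected ones down, or both.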