There have been many questions about whether DPO is a form of imitation learning or (offline) reinforcement learning. The more I observe the distributions of DPO’s losses on chosen and rejected responses, the stronger my feeling becomes that DPO is closer to a form of imitation learning.

The paper by Xiao, Teng, et al., On a Connection Between Imitation Learning and RLHF (arXiv:2503.05079) [1], likewise argues that DPO is a form of imitation learning.

Direct Imitation Learning (DIL)

This study explores the alignment problem between large language models and preference data from the perspective of imitation learning. Researchers establish a close theoretical connection between Reinforcement Learning from Human Feedback (RLHF) and Imitation Learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Based on this connection, Direct Imitation Learning (DIL) is proposed, a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective for the alignment problem, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By linking IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments show that DIL outperforms existing methods on various challenging benchmarks.

Background

Aligning Large Language Models (LLMs) with human preferences is crucial to ensuring that the responses generated by LLMs meet human expectations.

The Proposal of RLHF and Its Issues

In recent years, Reinforcement Learning from Human Feedback (RLHF) has emerged as a widely adopted framework for fine-tuning language models based on human preference data.

DPO Addressing RLHF’s Issues and Its Own Limitations

RLHF relies on a two-stage pipeline (reward modeling followed by RL fine-tuning), which leads to problems such as low computational efficiency and instability during training. To alleviate these limitations, researchers have proposed one-stage alternatives, such as Direct Preference Optimization (DPO) and its variants. These methods replace the RL step with a supervised objective, eliminating the need for explicit reward modeling: they directly define an implicit reward based on the likelihood of the preference data, significantly improving efficiency while maintaining competitive performance.

Problems with DPO and Triggered Reflections

Although DPO theoretically aims to find the same optimal policy as RLHF, it and its variants essentially still follow a reward maximization objective, determined by parametric models (e.g., the Bradley-Terry (BT) model). This makes them prone to overfitting, leading to suboptimal alignment with preference data. This raises a fundamental and open research question: Can we understand and design effective preference optimization algorithms from a new perspective?

Significance of This Study

This paper re-examines RLHF from the perspective of imitation learning. Specifically, researchers show that RLHF is a special case of a general imitation learning problem, expressed solely through pairwise preferences. They theoretically demonstrate that alignment with RLHF is highly similar to imitation learning and implicitly optimizes the same objective. Leveraging this insight, they design DIL, a general framework for effective alignment based on density ratio reward estimation.

Key Contributions

  1. It is proven that RLHF for alignment is essentially an imitation learning problem, providing a novel analysis that offers clear guidance for the design of alignment algorithms.
  2. DIL, a simple and general imitation learning alignment framework, is proposed. DIL unifies imitation learning on preference data and bridges the gap between density ratio estimation and preference alignment.
  3. Empirically, the effectiveness of DIL is verified on widely used benchmarks, demonstrating its superiority over previous alignment methods.

Theoretical Derivations

Preliminary Knowledge

  1. Problem Setup
    Let \( x = [x_1, x_2, \ldots] \) be the input prompt, \( y_w = [y_1, y_2, \ldots] \) be the positive sample (preferred response), and \( y_l \) be the negative sample (non-preferred response). These two samples are typically drawn from the same reference policy \( \pi_{\text{ref}}(y|x) \). Meanwhile, \( y_w \succ y_l | x \) indicates that for the same input \( x \), \( y_w \) is more in line with human preferences than \( y_l \). Thus, the preference distribution is generally expressed as:

    \[ p(y_w \succ y_l | x) = g(r(x, y_w) - r(x, y_l)) \tag{1} \]


    Here, \( g \) refers to the sigmoid function \( \sigma(x) = \frac{1}{1+e^{-x}} \), a form that follows from the Bradley-Terry model (divide the numerator and denominator by \( \exp(r(x, y_w)) \) to obtain the sigmoid, as shown below). Given a preference dataset \( \mathcal{D} \) containing feedback, where each data entry is formatted as \( (x, y_w, y_l) \), our alignment goal is to learn an LLM policy \( \pi(y|x) \) based on the preference data.
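
    Written out explicitly, this is just the Bradley-Terry definition, nothing beyond Equation (1):

    \[ p(y_w \succ y_l | x) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))} = \frac{1}{1 + \exp\big(-(r(x, y_w) - r(x, y_l))\big)} = \sigma\big(r(x, y_w) - r(x, y_l)\big) \]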

  2. Reinforcement Learning from Human Feedback (RLHF)
    Given an estimated reward function \( r(x, y) \), RLHF fine-tunes the policy \( \pi_\theta \) according to human preferences through the following optimization objective:

    \[ \mathop{\max}\limits_{\pi_\theta}\mathbb{E}_{y \sim \pi_\theta(y|x)}[r(x, y)] - \beta\mathbb{D}_{KL}[\pi_\theta(y|x)||\pi_{\text{ref}}(y|x)] \tag{2} \]


    The core idea of this formula is to maximize the reward signal of human preferences while preventing the model from deviating too far from the original pre-trained distribution (to avoid collapse). Here, \( \mathbb{D}_{KL} = \mathbb{E}_{y \sim \pi_{\theta}} \left[ \log \frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)} \right] \) is the Kullback-Leibler divergence between the model’s policy and the reference policy, and \( \beta > 0 \) is the KL penalty coefficient. This objective is typically optimized with RL methods such as Proximal Policy Optimization (PPO).

  3. Reward Modeling
    Standard reward modeling uses the BT preference model in Equation (1) to fit a reward function \( r_\phi(x, y) \). Specifically, the reward function can be estimated by maximizing the log-likelihood of preference feedback \( (x, y_w, y_l) \):

    \[ \mathcal{L}_{\text{RM}}(\phi;\mathcal{D})=\mathbb{E}_{(x, y_w,y_l)\sim\mathcal{D}}[-\log\sigma(r_{\phi}(x,y_w)-r_{\phi}(x,y_l))] \tag{3} \]
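
    A minimal sketch of Equation (3) in code (assuming `r_w` and `r_l` are the reward model’s scalar scores for a batch of chosen and rejected responses; the names are illustrative):

    ```python
    import torch.nn.functional as F

    def reward_model_loss(r_w, r_l):
        """Eq. (3): -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch."""
        return -F.logsigmoid(r_w - r_l).mean()
    ```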
  4. Supervised Fine-Tuning (SFT)
    Given a demonstration dataset \( \mathcal{D} \), the goal of SFT is to minimize the negative log-likelihood of the model on the demonstration dataset:

    \[ \mathcal{L}_{\text{SFT}}(\theta;\mathcal{D})=-\mathbb{E}_{(x,y)\sim\mathcal{D}}[\log\pi_{\theta}(y|x)] \tag{4} \]


    SFT is equivalent to Behavior Cloning (BC), a classic offline imitation learning method. Its goal is to minimize the forward KL divergence between the learned policy \( \pi_{\theta} \) and the data policy \( \pi_{\text{data}} \):

    \[ \mathop{\min}\limits_{\theta}KL(\pi_{\text{data}}(y|x)||\pi_\theta(y|x))=-\mathbb{E}_{\pi_{\text{data}}(y|x)}[\log\pi_{\theta}(y|x)] \tag{5} \]


    Since the entropy of \( \pi_{\text{data}} \) does not depend on \( \theta \), Equation (5) differs from Equation (4) only by a constant, so SFT and BC share the same optimal solution.

  5. Direct Preference Optimization (DPO)
    To simplify RLHF’s optimization process, DPO uses the log-likelihood of the learned policy to implicitly represent the reward function:

    \[ r_\theta(x,y)=\beta[\log\pi_\theta(y|x)-\log\pi_{\text{ref}}(y|x)] + \beta\log Z_\theta(x) \tag{6} \]


    Here, \( Z_\theta(x)=\sum_y\pi_{\text{ref}}(y|x)\exp(r_\theta(x,y)/\beta) \) is the partition function.

    Concept Supplement: Partition Function
    The partition function ensures that the probability distribution is normalized, i.e., the sum of probabilities of all possible states equals 1. Specifically, in a probabilistic model, given an input \( x \), the probability of output \( y \) can be expressed as:

    \[ p(y|x) = \frac{1}{Z(x)} \exp(-E(y, x)) \]


    where \( E(y, x) \) is the energy function, measuring the “mismatch” or “cost” of a specific output \( y \) for a given input \( x \). The partition function \( Z(x) \) is defined as the sum of energy exponents over all possible outputs \( y \):

    \[ Z(x) = \sum_y \exp(-E(y, x)) \]


    For continuous variables, the sum is replaced by an integral:

    \[ Z(x) = \int \exp(-E(y, x)) dy \]


    Does this principle resemble the softmax function? Indeed, softmax is an application of the partition function.
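
    For instance, a tiny numerical check in plain NumPy (unrelated to any particular model):

    ```python
    import numpy as np

    scores = np.array([2.0, 0.5, -1.0])   # unnormalized scores, i.e. negative energies
    Z = np.sum(np.exp(scores))            # partition function: sum of exponentials
    probs = np.exp(scores) / Z            # softmax = exp(score) / Z
    print(probs, probs.sum())             # the probabilities sum to 1
    ```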

    Returning to the reward function, rearranging terms gives:

    \[ r_\theta(x,y)+\beta\log\pi_{\text{ref}}(y|x)-\beta\log Z_\theta(x)=\beta\log\pi_\theta(y|x) \]


    Dividing both sides by \( \beta \) and applying the natural exponential function:

    \[ \frac{\pi_{\text{ref}}(y|x)\exp(r_\theta(x,y)/\beta)}{Z_\theta(x)}=\pi_\theta(y|x) \]


    Substituting \( Z_\theta(x)=\sum_y\pi_{\text{ref}}(y|x)\exp(r_\theta(x,y)/\beta) \) yields:

    \[ \pi_\theta(y|x)=\frac{\pi_{\text{ref}}(y|x)\exp(r_\theta(x,y)/\beta)}{\sum_y\pi_{\text{ref}}(y|x)\exp(r_\theta(x,y)/\beta)} \]


    Intuitively, the higher the reward \( r_\theta(x,y) \) assigned to a response, the more probability mass the policy \( \pi_\theta(y|x) \) places on it, and the partition function normalizes \( \pi_\theta(\cdot|x) \) so that its probabilities sum to 1. This is the essence of the implicit reward parameterization.

    By incorporating this reward into the BT model in Equation (1) and simplifying, DPO’s objective promotes the comparison and differentiation of preferred and non-preferred data:

    \[ \mathcal{L}_{\text{DPO}}(\theta;\mathcal{D})=\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[-\log\sigma\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}-\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right] \tag{7} \]
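
    A minimal sketch of Equation (7) in code (assuming sequence-level log-probabilities of the chosen and rejected responses under the policy and the frozen reference model are already computed; names are illustrative):

    ```python
    import torch.nn.functional as F

    def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
        """Eq. (7): -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
        chosen_logratio = logp_w - ref_logp_w
        rejected_logratio = logp_l - ref_logp_l
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
    ```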
  6. Energy-based Models (EBMs)
    EBMs define distributions through energy functions. For \( y\in \mathbb{R}^D \), the probability density can be expressed as:

    \[ p_\theta(y)=\frac{\exp(-E_\theta(y))}{Z_\theta} \tag{8} \]


    where \( E_\theta(y):\mathbb{R}^D\rightarrow\mathbb{R} \) is the energy function, mapping \( y \) to a scalar, and \( Z_\theta = \sum_y \exp(-E_\theta(y)) \) is the normalization constant (the partition function introduced in the supplement above), which is generally intractable.

Core Derivations

The derivations below mainly follow this blog post [2].

1 RLHF as a Form of Imitation Learning

Researchers demonstrate that RLHF is a special case of imitation learning that minimizes a reverse KL divergence to the distribution of chosen responses.

The specific proof is as follows:

First, define the following policy based on energy-based models (EBMs):

\[ \pi_\phi(y|x)=\pi_{\text{ref}}(y|x)\exp(r_\phi(x,y))/Z_\phi(x) \tag{9} \]


where \( \phi \) denotes model parameters, and as described in the preliminary knowledge, \( Z_\phi(x)=\sum_y\pi_{\text{ref}}(y|x)\exp(r_\phi(x,y)) \).

To learn parameter \( \phi \), Behavior Cloning (BC)—a classic and widely used imitation learning method (as mentioned in the preliminary knowledge, SFT is equivalent to BC)—can be applied. This method formulates the task as minimizing the KL divergence between the policy \( \pi_\phi \) and the expert policy \( \pi_{\text{chosen}} \) that generates preferred responses \( y_w \). In other words, IL learns parameter \( \phi \) such that the model distribution imitates the distribution of preferred responses in the preference dataset:

\[ \mathop{\min}\limits_{\phi}KL(\pi_{\text{chosen}}(y|x)||\pi_\phi(y|x)) \tag{10} \]

Substituting the chosen responses from the preference data, minimizing the above forward KL divergence becomes:

\[ \mathop{\min}\limits_{\phi}\mathbb{E}_{(x,y_w)\sim\mathcal{D}}[-\log\pi_{\text{ref}}(y_w|x)\exp(r_\phi(x,y_w))/Z_\phi(x)] \Rightarrow \\\mathop{\min}\limits_{\phi}\mathbb{E}_{(x,y_w)\sim\mathcal{D}}\left[-r_\phi(x,y_w)+\log\sum_y\pi_{\text{ref}}(y|x)\exp(r_\phi(x,y))\right] \tag{11} \]


(Equation (11) is obtained by removing constants and substituting the partition function.)

There are multiple choices for sampling from the reference distribution \( \pi_{\text{ref}}(y|x) \). One setting that simplifies the above expression and practically yields RLHF is: \( \pi_{\text{ref}}(y|x) = \frac{1}{2} \mathbb{I}(Y = y_l) + \frac{1}{2} \mathbb{I}(Y = y_w) \) (note: this is a key approximation that allows RLHF to be reduced to IL). In this case, the sample-based approximation of the second term is:

\[ \mathop{\min}\limits_{\phi}\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[-r_\phi(x,y_w)+\log(\exp(r_\phi(x,y_w)) + \exp(r_\phi(x,y_l)))\right] \]

This equation is derived by substituting \( \sum_y \pi_{\text{ref}}(y|x)\exp(r_\phi(x, y)) = \frac{1}{2} \exp(r_\phi(x, y_w)) + \frac{1}{2} \exp(r_\phi(x, y_l)) \) into Equation (11) and removing the constant term \( -\log2 \). Further merging the \( -r_\phi(x,y_w) \) term into \( \log(\exp(r_\phi(x,y_w)) + \exp(r_\phi(x,y_l))) \) and simplifying gives the equivalent form:

\[ \mathop{\min}\limits_{\phi}\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[-\log\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))\right] \tag{12} \]
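
For completeness, the algebra behind the last two steps is:

\[ -r_\phi(x,y_w)+\log\left(\tfrac{1}{2}e^{r_\phi(x,y_w)}+\tfrac{1}{2}e^{r_\phi(x,y_l)}\right)=\log\frac{e^{r_\phi(x,y_w)}+e^{r_\phi(x,y_l)}}{e^{r_\phi(x,y_w)}}-\log 2=-\log\sigma\big(r_\phi(x,y_w)-r_\phi(x,y_l)\big)-\log 2 \]

and the additive constant \( -\log 2 \) does not affect the minimizer, giving Equation (12).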

Notice that this imitation learning loss based on the energy-based policy is identical to RLHF’s reward-modeling loss under the BT assumption (Equation (3)). By optimizing this loss, we directly obtain the optimal energy-based policy in Equation (9). Unfortunately, even with the estimated reward function \( r_\phi \), estimating the partition function \( Z_\phi(x) \) remains costly, making this representation impractical and significantly increasing inference cost.

Key Discussion: Why This Method Is Costly

  1. Excessively Large Output Space
    The output of a language model is a sequence of tokens; for example, a response may contain dozens or even hundreds of tokens. Each token has thousands to tens of thousands of candidate words (depending on the vocabulary size), resulting in an exponentially growing number of total output combinations:
    Assuming 50,000 word choices per step and generating 20 tokens, there are \( 50000^{20} \) possible responses!
    This means it is impossible to enumerate all \( y \) to compute \( Z_\phi(x) \).

  2. Requiring Extensive Sampling to Approximate Summation
    Since enumeration is impossible, we can only estimate the summation through sampling (a rough code sketch follows this list):

    \[ Z_\phi(x) \approx \frac{1}{N} \sum_{i=1}^N \exp(r_\phi(x, y_i)), \quad y_i \sim \pi_{\text{ref}}(y|x) \]


    However, to ensure estimation accuracy, it is necessary to:

    • Sample enough \( y_i \)
    • Run the reward model \( r_\phi(x, y_i) \) for each \( y_i \)
      This leads to:
    • High computational resource consumption (running the reward model many times per inference)
    • Increased inference time
  3. Reward Models May Be “Heavy”
    The reward model \( r_\phi(x, y) \) is typically a large neural network (e.g., based on the GPT architecture), which is already time-consuming to run once. Running it multiple times on multiple samples incurs significant costs.
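
As promised above, here is a rough sketch of the sampling-based estimate of \( Z_\phi(x) \). The sampler, the reward model, and their interfaces are placeholders; the point is only to show where the repeated reward-model calls come from:

```python
import math

def estimate_log_partition(prompt, sample_fn, reward_fn, num_samples=64):
    """Monte Carlo estimate of log Z_phi(x), with Z_phi(x) = E_{y ~ pi_ref}[exp(r_phi(x, y))].

    sample_fn(prompt) is assumed to draw one response y ~ pi_ref(y | x);
    reward_fn(prompt, response) is assumed to return the scalar reward r_phi(x, y).
    """
    rewards = [reward_fn(prompt, sample_fn(prompt)) for _ in range(num_samples)]
    # log of the sample mean of exp(r), computed stably via log-sum-exp
    m = max(rewards)
    return m + math.log(sum(math.exp(r - m) for r in rewards) / num_samples)
```

Each prompt costs `num_samples` reward-model forward passes, which is exactly the inference-time overhead described in the list above.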

To address the issue of high computational costs, researchers propose a method. Before introducing this method, we first introduce forward and reverse KL divergences (those already familiar can skip this part):

Concept Supplement: Forward and Reverse KL Divergences

  1. Basic Concept of KL Divergence
    KL divergence (Kullback-Leibler Divergence), also known as relative entropy, measures the difference between two probability distributions \( P \) and \( Q \). For discrete distributions, KL divergence is defined as:

    \[ D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \]


    For continuous distributions, the sum is replaced by an integral:

    \[ D_{\text{KL}}(P \| Q) = \int P(x) \log \frac{P(x)}{Q(x)}dx \]
  2. Forward KL Divergence (\( D_{\text{KL}}(P \| Q) \))

    • Goal: Make \( Q \) as close as possible to \( P \).
    • Characteristics: Heavily penalizes events with high probability in \( P \) but low or zero probability in \( Q \). In other words, it focuses on accurately capturing all modes in \( P \).
    • Application Scenarios: When there is a true distribution \( P \) (e.g., the true distribution of data) and a model \( Q \) needs to be trained to approximate this distribution, forward KL divergence is typically used. This is because forward KL encourages the model to cover all modes, even if it means generating some low-probability but existing samples.
    • Properties: If \( P \) has a point with probability greater than 0 where \( Q \) has zero probability, the KL divergence tends to infinity. Thus, using forward KL divergence, the model tends to cover more possibilities, even unlikely events.
  3. Reverse KL Divergence (\( D_{\text{KL}}(Q \| P) \))

    • Goal: Still make \( Q \) approximate \( P \), but with the divergence now weighted by \( Q \) rather than \( P \).
    • Characteristics: More focused on avoiding generating events with very low probability in \( P \) but high probability in \( Q \). That is, it tends to find the most likely modes and ignore less likely cases.
    • Application Scenarios: In some cases, such as Generative Adversarial Networks (GANs) or when a more deterministic model output is desired, reverse KL divergence may be a better choice. Because it encourages the model to focus on the most likely outcomes rather than trying to cover all possibilities.
    • Properties: Unlike forward KL, reverse KL allows \( Q \) to be zero in some regions as long as \( P \) is also zero or very small there. This means reverse KL can produce sparser and more concentrated distributions, sometimes leading to “mode collapse”—generating only a few specific types of samples while ignoring others.

Using reverse knowledge distillation (i.e., distillation under the reverse KL divergence), the optimal policy in Equation (9) is “distilled” into a policy with an analytical form, so that the final policy \( \pi_\theta \) requires only a single sampling pass at inference time (note: here \( \theta \) denotes the main model’s parameters and \( \phi \) the reward model’s):

\[ \mathop{\min}\limits_{\theta}KL\left(\pi_\theta(y|x)||\pi_{\text{ref}}(y|x)\exp(r_\phi(x,y)/\beta)/Z_\phi(x)\right) \tag{13} \]


where \( \beta \) is the temperature hyperparameter in the distillation process. After isolating the term \( -\mathbb{E}_{\pi_\theta(y|x)}[r_\phi(x,y)] \), removing multiplicative and additive constants, and combining the remaining terms into \( \beta KL(\pi_\theta(y|x)||\pi_{\text{ref}}(y|x)) \), it is transformed into the following objective function:

\[ \mathcal{L}(\theta)=-\mathbb{E}_{\pi_\theta(y|x)}[r_\phi(x,y)] + \beta KL(\pi_\theta(y|x)||\pi_{\text{ref}}(y|x)) \tag{14} \]

It can be observed that this distillation objective corresponds exactly to the objective of RLHF in Equation (2). Thus, researchers provide two key conclusions:

(i) Reward learning in RLHF is equivalent to an imitation learning problem for preferred responses, achieved by minimizing the forward KL divergence between \( \pi_{\text{chosen}} \) and \( \pi_\phi \) based on energy-based models (EBMs), as shown in Equation (12);

(ii) The RL step in RLHF can be interpreted as a reverse knowledge distillation process, where the EBM-based imitation policy \( \pi_\phi \) is distilled into the final analytical policy \( \pi_\theta \) by minimizing the reverse KL divergence in Equation (13), with the temperature parameter \( \beta \) determining the degree of KL regularization.

This problem is transformed into the following proposition:

Assume the chosen-response distribution \( \pi_{\text{chosen}}(y|x) \), the energy-based model \( \pi_\phi(y|x) \), and the policy \( \pi_\theta(y|x) \). When \( \beta = 1 \), KL-regularized RLHF can be regarded as the following bilevel problem:

\[ \min_{\pi_\theta} \mathrm{KL}(\pi_\theta \| \pi^*_\phi) \quad \text{s.t.} \quad \pi^*_\phi = \arg\min_{\pi_\phi} \mathrm{KL}(\pi_{\text{chosen}} \| \pi_\phi) \tag{15} \]


where \( \pi_{\text{chosen}}(y|x) = \pi^*_\phi(y|x) = \pi_\theta(y|x) \) at the equilibrium state.

Thus, imitation learning on preferred responses is equivalent to solving a standard KL-regularized RLHF problem.

Furthermore, we observe that when \( \pi^*_\phi = \pi_{\text{chosen}} \) (i.e., the optimal solution achieved by the lower-level objective), the upper-level objective essentially optimizes a reverse KL divergence \( \mathrm{KL}(\pi_\theta \| \pi_{\text{chosen}}) \).

At this point, I am truly amazed; this proof process is akin to Maxwell’s equations unifying electromagnetism in physics.

To this end, we have proven that RLHF is a special type of IL, with equivalent conditions:

  • Using EBM and forward KL divergence to fit the Reward Model (RM)
  • Using EBM and reverse knowledge distillation based on reverse KL divergence to complete RL training

The proof is complete. Researchers then pose an interesting question:

Why does SFT — which directly optimizes the forward KL divergence \( \mathrm{KL}(\pi_{\text{chosen}} \| \pi_\theta) \) in Equation (5) — perform worse than RLHF in alignment tasks?

Theoretically, minimizing the objective functions of SFT and RLHF should lead to the same optimal solution \( \pi_\theta \). However, in practice, this requires complete data coverage and unlimited computational resources, conditions rarely met.

Thus, in practical settings, minimizing different KL divergences results in learned policies with distinct characteristics. Specifically, the forward KL divergence \( \mathrm{KL}(\pi_{\text{chosen}} \| \pi_\theta) \) promotes mass-covering behavior, while the reverse KL divergence \( \mathrm{KL}(\pi_\theta \| \pi_{\text{chosen}}) \) encourages mode-seeking behavior (see supplementary knowledge on forward and reverse KL divergences above).

Mass-covering behavior tends to assign similar probabilities to all responses in the dataset, overestimating the long-tail portion of the target distribution; mode-seeking behavior concentrates probability mass in specific high-reward regions. Therefore, the goal of alignment is to generate a certain type of high-quality response, which can be more effectively achieved by minimizing the reverse KL divergence.

In summary, RLHF outperforms SFT because the forward KL objective only realizes its full potential with complete data coverage and unlimited compute, which is nearly impossible in practice. The reverse KL objective, although it does not learn to cover the full distribution, still learns to produce high-quality responses, and therefore performs better.
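
To make the mass-covering vs. mode-seeking contrast concrete, here is a small, self-contained numerical toy (a single Gaussian fit to a bimodal target by brute-force grid search; it has nothing to do with language models, but the qualitative behavior is the same):

```python
import numpy as np

# Bimodal target distribution (a stand-in for the chosen-response distribution).
xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(xs, -3.0, 0.7) + 0.5 * gauss(xs, 3.0, 0.7)

def kl(a, b, eps=1e-12):
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Fit a single Gaussian q by brute-force search under each divergence.
best_fwd, best_rev = None, None
for mu in np.linspace(-4, 4, 81):
    for sigma in np.linspace(0.3, 5.0, 48):
        q = gauss(xs, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward KL fit (mass-covering):", best_fwd)  # mean near 0, large sigma
print("reverse KL fit (mode-seeking):", best_rev)   # mean near one mode, sigma near 0.7
```

The forward-KL fit straddles both modes with a wide variance, while the reverse-KL fit concentrates on a single mode, which mirrors the SFT-versus-RLHF discussion above.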

2 Direct Imitation Learning (DIL)

In the previous section, we re-examined RLHF from the perspective of imitation learning. The analysis clearly indicates that RLHF is essentially optimized to closely align with the distribution of preferred responses. The sample-based approximation of EBM in RLHF leads to a reward loss similar to the BT model, as shown in Equation (12). However, the BT assumption does not always hold. Based on these insights, researchers propose a new alignment method, DIL (Direct Imitation Learning), that does not rely on the BT assumption. Thus, the objective of imitation learning is directly formulated as minimizing the reverse KL divergence between \( \pi_\theta \) and the unknown preferred response distribution \( \pi_{\text{chosen}} \):

\[ \min_\theta L_{\text{DIL}}(\theta) = \mathrm{KL} \left( \pi_\theta(y|x) \| \pi_{\text{chosen}}(y|x) \right) = \mathbb{E}_{\pi_\theta(y|x)} \left[ \log \left( \frac{\pi_\theta(y|x)}{\pi_{\text{chosen}}(y|x)} \right) \right] \tag{16} \]


Here, we minimize the reverse KL divergence, unlike SFT, which minimizes the forward KL divergence in Equation (5). However, optimizing the reverse KL divergence is typically challenging: directly optimizing Equation (16) cannot effectively utilize the preference data, especially since the data policy \( \pi_{\text{chosen}} \) is unknown. In the reinforcement learning literature, these challenges are addressed through adversarial training, but such methods require complex and unstable adversarial training to learn the reward function, which is impractical for large models. In this paper, a simple alternative is proposed that directly utilizes offline human preference data without learning the reward function adversarially. The DIL objective is reformulated as follows:

\[ \max_\theta \mathbb{E}_{\pi_\theta(y|x)} \left[ \log \frac{\pi_{\text{chosen}}(y|x)}{\pi_{\text{ref}}(y|x)} - \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \right]=\\ \mathbb{E}_{\pi_\theta(y|x)} \left[ \log r(x, y) \right] - \mathrm{KL} \left( \pi_\theta(y|x) \parallel \pi_{\text{ref}}(y|x) \right) \tag{17} \]


where \( r(x, y) \triangleq \frac{\pi_{\text{chosen}}(y|x)}{\pi_{\text{ref}}(y|x)} \) can be regarded as an auxiliary reward function. Equations (16) and (17) are equivalent by adding and subtracting the same term \( \log \pi_{\text{ref}}(y|x) \) in the expectation.

Interestingly, researchers find that even when only preference data is available, the form of this objective function is similar to that of RLHF in Equation (2). The main difference is that the reward here is the estimated log density ratio, which is often difficult to obtain directly in practice. Optimizing this objective involving the density ratio \( r(x, y) \) is non-intuitive and challenging. The next section will show how to efficiently optimize this objective function by effectively utilizing offline human preference data.

3 Density Ratio Reward Estimation

Before delving into the problem in Equation (17), we first describe how to compute the auxiliary reward function based on the density ratio. In a tabular setting, we can directly compute \( \pi_{\text{ref}}(y|x) \) and \( \pi_{\text{chosen}}(y|x) \). However, in high-dimensional language domains, estimating densities separately and computing their ratios is ineffective due to error accumulation.

Before introducing the solution, we first understand Bregman divergence:

Concept Supplement: Bregman Divergence

Bregman Divergence is a measure of the difference between two points, defined by the properties of convex functions. Specifically, given a strictly convex and twice differentiable function \( F \), Bregman divergence is defined as the difference between the function and its linear approximation at a certain point.

  1. Definition
    Suppose \( F \) is a strictly convex function defined on a convex set. The Bregman divergence \( D_F \) generated by \( F \) can be defined as:

    \[ D_F(\mathbf{p} \| \mathbf{q}) = F(\mathbf{p}) - F(\mathbf{q}) - \langle \nabla F(\mathbf{q}), (\mathbf{p} - \mathbf{q}) \rangle \]


    where:

    • \( \mathbf{p} \) and \( \mathbf{q} \) are two points in the space;
    • \( \nabla F(\mathbf{q}) \) denotes the gradient of \( F \) at point \( \mathbf{q} \);
    • \( \langle \cdot, \cdot \rangle \) denotes the inner product operation.

    Simply put, \( D_F(\mathbf{p} \| \mathbf{q}) \) measures the gap between the value of function \( F \) at point \( \mathbf{p} \) and the first-order Taylor expansion of \( F \) at point \( \mathbf{q} \).

  2. Properties

    • Non-negativity: For all \( \mathbf{p} \) and \( \mathbf{q} \), \( D_F(\mathbf{p} \| \mathbf{q}) \geq 0 \), with equality if and only if \( \mathbf{p} = \mathbf{q} \).
    • Asymmetry: In general, \( D_F(\mathbf{p} \| \mathbf{q}) \neq D_F(\mathbf{q} \| \mathbf{p}) \), meaning Bregman divergence is not a symmetric measure.
    • No triangle inequality: Bregman divergence does not satisfy the triangle inequality, i.e., for three points \( \mathbf{x}, \mathbf{y}, \mathbf{z} \), it is not necessarily true that \( D_F(\mathbf{x} \| \mathbf{y}) + D_F(\mathbf{y} \| \mathbf{z}) \geq D_F(\mathbf{x} \| \mathbf{z}) \).
  3. Common Examples
    Different convex functions \( F \) yield different types of Bregman divergences. Here are some common examples:

    • Squared Euclidean distance: If \( F(\mathbf{x}) = \|\mathbf{x}\|^2 \), the corresponding Bregman divergence is the square of the Euclidean distance.
    • Kullback-Leibler divergence: If \( F(\mathbf{x}) = \sum_i x_i \log x_i \), the corresponding Bregman divergence is the KL divergence.
    • Itakura-Saito distance: If \( F(\mathbf{x}) = -\sum_i \log x_i \), the corresponding Bregman divergence is the Itakura-Saito distance.

The solution proposed in this paper is to directly estimate the density ratio \( \pi_{\text{chosen}}(y|x)/\pi_{\text{ref}}(y|x) \) based on Bregman divergence. Assume the target density ratio is \( r^*(x, y) = \pi_{\text{chosen}}(y|x)/\pi_{\text{ref}}(y|x) \), and a parameterized discriminator \( r_\phi \) is used to estimate this ratio:

\[ \min_\phi D_h(r^* \| r_\phi) =\\\sum_y \pi_{\text{ref}}(y|x) B_h\left(r^*(x, y) \| r_\phi(x, y)\right) =\\\sum_y \pi_{\text{ref}}(y|x) \left[ h\left(r^*(x, y)\right) - h\left(r_\phi(x, y)\right) - \partial h\left(r_\phi(x, y)\right) \left(r^*(x, y) - r_\phi(x, y)\right) \right] \tag{18} \]

where \( B_h \) is the sample-level Bregman divergence.

For a twice continuously differentiable convex function \( h \) with a bounded derivative \( \partial h \), this divergence measures the difference between two density ratios. By subtracting the constant term \( \sum_y \pi_{\text{ref}}(y|x) h(r^*(x, y)) \) and substituting \( r^*(x, y) = \pi_{\text{chosen}}(y|x)/\pi_{\text{ref}}(y|x) \), we obtain (ignoring the constant term):

\[ \sum_y \pi_{\text{ref}}(y|x) \left[ \partial h\left(r_\phi(x, y)\right) r_\phi(x, y) - h\left(r_\phi(x, y)\right) \right] - \sum_y \pi_{\text{chosen}}(y|x) \left[ \partial h\left(r_\phi(x, y)\right) \right] \tag{19} \]

Non-exhaustive choices of \( h \) include those corresponding to Least-Squares Importance Fitting (LSIF), Binary Cross-Entropy (BCE), and the unbounded Kullback-Leibler (UKL) objective.

For example, LSIF uses \( h_{\text{LSIF}}(r) = (r - 1)^2 / 2 \), which yields the following Bregman-divergence objective for the density ratio:

\[ \min_\phi D_{h_{\text{LSIF}}}(r^* \| r_\phi) = \sum_y \frac{1}{2} \pi_{\text{ref}}(y|x) r_\phi^2(x, y) - \pi_{\text{chosen}}(y|x) r_\phi(x, y) \tag{20} \]
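
To see where Equation (20) comes from, substitute \( h_{\text{LSIF}}(r) = (r-1)^2/2 \), whose derivative is \( \partial h(r) = r - 1 \), into Equation (19) (writing \( r_\phi \) for \( r_\phi(x, y) \)):

\[ \sum_y \pi_{\text{ref}}(y|x)\left[(r_\phi-1)r_\phi-\tfrac{1}{2}(r_\phi-1)^2\right]-\sum_y \pi_{\text{chosen}}(y|x)\,(r_\phi-1)=\\ \sum_y \tfrac{1}{2}\pi_{\text{ref}}(y|x)\,r_\phi^2-\pi_{\text{chosen}}(y|x)\,r_\phi+\text{const} \]

Dropping the constant (which collects terms independent of \( \phi \), using \( \sum_y \pi_{\text{ref}} = \sum_y \pi_{\text{chosen}} = 1 \)) recovers Equation (20).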

In this case, a sample-based approximation of Equation (20) yields the following loss function:

\[ \mathcal{L}(\phi; \mathcal{D}) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \frac{1}{2} r_\phi^2(x, y_l) - r_\phi(x, y_w) \right] \tag{21} \]

Here, the rejected (non-preferred) response set \( y_l \sim \pi_{\text{ref}}(y|x) \) is used to approximate the expectation over \( \pi_{\text{ref}}(y|x) \). Researchers argue that using rejected responses \( y_l \) from the preference dataset \( \mathcal{D} \) to approximate the expectation is reasonable; it is even possible to use both preferred and rejected responses. However, since the goal is to reduce the likelihood of rejected responses, rejected responses are chosen to approximate the expectation, and good performance is observed in subsequent experiments.

Intuitively, the first term pushes the model to reduce the density ratio of rejected responses, while the second term increases the density ratio of preferred responses.
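
A minimal sketch of Equation (21), assuming a hypothetical discriminator head that outputs scalar density-ratio estimates `r_w` and `r_l` for the chosen and rejected responses of a batch:

```python
import torch

def lsif_density_ratio_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Eq. (21): 0.5 * r_phi(x, y_l)^2 - r_phi(x, y_w), averaged over the batch."""
    return (0.5 * r_l.pow(2) - r_w).mean()
```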

Furthermore, this direct estimation method based on Bregman divergence indicates that there exists a viable family of divergences for density ratio estimation, as shown in Table 1; other \( h \) functions, such as BCE and UKL (introduced later), are further discussed in Appendix A. Researchers also empirically analyze the impact of different \( h \) function objectives in Section 6.3.

4 Method Optimization

Thus far, it has been observed that combining the RL-like objective in Equation (17) with the density ratio estimation method in Equation (21) can effectively utilize preference datasets for imitation learning. However, this two-stage process is complex and unstable: first, a reward model needs to be fitted to estimate the density ratio, and then the language model policy is fine-tuned using the RL-like objective in Equation (17).

To address these issues, a simpler method is introduced. First, note that the optimal policy in Equation (17) has a closed-form solution:

\[ \pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\log r^*(x, y)\right) \tag{22} \]

where \( Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(\log r^*(x, y)\right) = \sum_y \pi_{\text{chosen}}(y|x) = 1 \), meaning the optimal policy \( \pi^*(y|x) \) is forced into a self-normalized form!

This property, determined by the definition of the reward function in Equation (17), offers a significant advantage: it allows our imitation learning to theoretically generalize to a broader class of loss functions than the pairwise BT preference model used in DPO.

Taking the logarithm of both sides of Equation (22) and performing some algebraic operations yields the following expression:

\[ \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} = \log r^*(x, y) \tag{23} \]


where \( r^*(x, y) \) is the density ratio estimated from the preference dataset using Equation (21).

Since the optimal density ratio is now represented by the optimal policy rather than a discriminator model, we can explicitly derive a maximum likelihood objective for the parameterized policy on the preference dataset. Similar to the approach used in density ratio estimation and leveraging variable substitution techniques, the DIL objective can be formalized as:

\[ \mathcal{L}_{\text{DIL}}(\theta; \mathcal{D}) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ - \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \frac{1}{2} \left( \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right)^2 \right] \tag{24} \]

where we use the alternative parameterization in Equation (23) to directly fit the density ratio implicitly defined in Equation (21).

Interestingly, the loss function has no hyperparameters (those familiar with DPO and its variants will appreciate the value of “no hyperparameters,” as it eliminates the cost of hyperparameter tuning, greatly enhancing the feasibility of algorithm deployment in industrial scenarios), yet experiments show it still achieves satisfactory performance. Since the above process is equivalent to fitting a reparameterized density ratio estimation model, it theoretically performs imitation learning by minimizing the reverse KL divergence relative to the unknown preferred response distribution. Table 1 shows a family of objective functions satisfying the definition of Bregman divergence.
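
Concretely, plugging the reparameterization of Equation (23) into the LSIF objective of Equation (21) gives Equation (24). A rough sketch, again assuming precomputed sequence-level log-probabilities (names are illustrative):

```python
import torch

def dil_loss(logp_w, ref_logp_w, logp_l, ref_logp_l):
    """Eq. (24): -ratio(y_w) + 0.5 * ratio(y_l)^2, with ratio = pi_theta / pi_ref."""
    ratio_w = torch.exp(logp_w - ref_logp_w)   # density ratio on the chosen response
    ratio_l = torch.exp(logp_l - ref_logp_l)   # density ratio on the rejected response
    return (-ratio_w + 0.5 * ratio_l.pow(2)).mean()
```

Note that, as stated above, there is no \( \beta \) or any other hyperparameter in this loss.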

5 Discussion: DPO as a Special Case of DIL

Before proceeding, we introduce a method to prepare for the subsequent proof.

Concept Supplement: Contrastive Predictive Coding

Contrastive Predictive Coding (CPC) — familiar to those in the speech domain — is a self-supervised learning method proposed by Oord et al. It is primarily used to learn effective representations from unlabeled datasets by leveraging local dependencies in sequence data.

  1. Core Idea
    CPC aims to enable the model to learn to predict future data points from current ones through contrastive learning, maximizing the mutual information between the current representation and future representations. Specifically, given a sequence \( \mathbf{x} = (x_1, x_2, \ldots, x_T) \), CPC attempts to learn an encoder \( f_\phi \) that maps each time point \( x_t \) to a latent representation \( z_t = f_\phi(x_t) \). A scoring function is then used to compute the similarity between \( z_t \) and the representation of a future time point \( z_{t+k} \), encouraging high scores for positive pairs (current and true future representations) and low scores for negative pairs (current and other time point representations).

  2. InfoNCE Loss Function
    CPC uses a specific contrastive loss called InfoNCE loss, defined as:

    \[ \mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}_{(x_t, x_{t+k})} \left[ \log \frac{\exp(g(z_t, z_{t+k}))}{\sum_{x_j} \exp(g(z_t, z_j))} \right] \]


    where \( g(z_t, z_j) \) is a scoring function measuring the similarity between \( z_t \) and \( z_j \), and the expectation \( \mathbb{E} \) averages over all possible time point pairs. This loss encourages the model to assign high scores to positive pairs and low scores to negative pairs.

In this section, researchers demonstrate that DPO can also be viewed as a special case of the DIL framework by using CPC for density ratio estimation. Given a prompt distribution \( p(x) \) and the conditional distribution of preferred responses \( \pi_{\text{chosen}}(y|x) \), we sample \( x \sim p(x) \), \( y_w \sim \pi_{\text{chosen}}(y|x) \), and \( y_l \sim \pi_{\text{ref}}(y|x) \). CPC optimizes the following objective:

\[ \mathcal{L}_{\text{CPC}}(\phi; \mathcal{D}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[ \log \frac{\exp(f_\phi(x, y_w)/\beta)}{\exp(f_\phi(x, y_w)/\beta) + \exp(f_\phi(x, y_l)/\beta)} \right] \tag{25} \]

where \( f_\phi: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R} \) is a parameterized evaluation function.
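
With a single negative sample, the softmax in Equation (25) collapses to a sigmoid of the score difference:

\[ -\log\frac{\exp(f_\phi(x, y_w)/\beta)}{\exp(f_\phi(x, y_w)/\beta)+\exp(f_\phi(x, y_l)/\beta)}=-\log\sigma\!\left(\frac{f_\phi(x, y_w)-f_\phi(x, y_l)}{\beta}\right) \]

Substituting the optimal critic characterized next in Equation (26), together with the closed-form policy of Equation (22), is what turns this expression into Equation (27).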

The optimal evaluation function for this CPC with one negative sample satisfies the following condition:

\[ f^*(x, y)/\beta = \log\left(\frac{\pi_{\text{chosen}}(y|x)}{\pi_{\text{ref}}(y|x)}\, c(x)\right) = \log r^*(x, y) + \log c(x) \tag{26} \]

where \( c(x) \) is a function dependent only on \( x \) and not on \( y \). Thus, CPC also estimates the density ratio reward in the IL objective, as shown in Equation (17).

Similar to the previous section, using the closed-form optimal policy in Equation (22) and leveraging variable substitution, we obtain:

\[ \mathcal{L}_{\text{DIL}}(\theta; \mathcal{D}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right] \tag{27} \]

This is identical to DPO’s objective. Thus, DIL can reinterpret DPO. Specifically, researchers demonstrate that DPO also conforms to the imitation learning objective in Equation (16) and essentially uses the CPC method for density ratio reward estimation.

In summary, DPO is equivalent to DIL if it uses the CPC method for density ratio reward estimation.

Key Discussion: Why the BT Assumption Reduces the Likelihood of Preferred Responses

  1. Oversimplifying Preference Structures:
    The BT model assumes that preferences between each pair of options can be compared independently and follow a specific probabilistic form. However, in practical applications, especially complex language generation tasks, this assumption may oversimplify the true preference structure. For example, preferences may be based on a combination of multiple complex factors, not just a simple comparison between two options. This may prevent the model from accurately capturing the factors that determine high-quality responses, thereby reducing the likelihood of preferred responses.

  2. Data Bias and Noise:
    When using the BT model for preference estimation, if the training data contains bias or noise, the learned preference relationships may be inaccurate. For example, if certain types of responses are over-sampled or under-sampled due to biases in the data collection process, the BT model trained on such data may incorrectly estimate true preferences, leading to lower scores for preferred responses.

  3. Limitations of the Optimization Objective:
    Using the BT model as the optimization objective may guide the model to optimize toward maximizing pairwise preference probabilities rather than directly optimizing the ability to generate high-quality responses. This may cause the model to sacrifice overall quality to improve win rates in specific comparisons, especially for responses with high intrinsic quality that are not easily highlighted in pairwise comparisons, whose likelihood may thus decrease.


  1. Xiao, Teng, et al. On a Connection Between Imitation Learning and RLHF. arXiv:2503.05079, 7 Mar. 2025. https://doi.org/10.48550/arXiv.2503.05079 ↩︎

  2. https://zhuanlan.zhihu.com/p/1910382777079165403 ↩︎