The Hidden Infinity in Preference Learning

An illustration of how length normalization aids learning from model-annotated data

TL;DR: I demonstrate from first principles how offline preference learning algorithms (e.g., SimPO) can benefit from length normalization, especially when training on model-annotated preference data. The derivation also lends insight into a subtle but important challenge in training LMs: infinite strings.


Aligning language models using human feedback has become increasingly popular. The process of collecting this feedback (i.e., a human rater reading and ranking pieces of text) is non-differentiable, so people turned to reinforcement learning (RL).

Learning from Human Feedback as RL

It turns out to be surprisingly tricky to translate the problem of improving language models via human feedback into the RL setting. First question: how should we define the states and the actions?

Originally, Jaques et al., 2017 designed the states in the problem to be the partial context, usually some combination of pre-filled prompt tokens and generated tokens, so the action was to pick a single token to generate next. I'll refer to this as the token-level formulation. On the other hand, later works (Ziegler et al., 2019; Stiennon et al., 2020) framed the problem in a bandit setting, where the state is the pre-filled prompt and the action is to select which sequence to generate out of a few candidates. I'll refer to this as the bandit formulation. Neither of these designs is perfectly suited to our setting, because we get human feedback on a sequence level but ask our LM to act on (i.e., generate) individual tokens. More recently, some works have designed objectives that incorporate token-level rewards or constraints alongside sequence-level feedback (Zeng et al., 2024; Lai et al., 2024). On the more theoretical and conceptual side, Rafailov et al., 2024 has a great illustration of this topic and draws meaningful connections between the two settings, and I'm sure that many RL papers are dedicated to parsing this gap between trajectory-level rewards and step-level actions.

The reason I bring up these two settings is to see how each one deals with the case of a model producing infinite-length strings. And yes, I know that it’s impossible in practice for LMs to actually generate an infinite-length string. But they can certainly put a tiny amount of probability mass on an infinite-length string. And if they do so for many such strings, then this ends up being a huge amount of probability mass lost to sequences that we’ll never see.
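
As a toy illustration (the decaying end-of-sequence schedule below is an assumption I made up for this sketch, not a property of any particular model), suppose the model assigns probability $p_t$ to emitting the end-of-sequence token at step $t$. The probability that generation never terminates is $\prod_t (1 - p_t)$, which stays bounded away from zero whenever $p_t$ decays quickly enough:

```python
import numpy as np

# Toy sketch: p_eos(t) is the probability of emitting EOS at step t (1-indexed).
# The probability of *never* terminating is the product of (1 - p_eos(t)) over all t;
# we approximate the infinite product with a large finite horizon.
def mass_on_infinite_strings(p_eos, horizon=1_000_000):
    t = np.arange(1, horizon + 1)
    log_survival = np.sum(np.log1p(-p_eos(t)))  # log of prod(1 - p_t), for stability
    return np.exp(log_survival)

# Constant EOS probability: essentially no mass escapes to infinite strings.
print(mass_on_infinite_strings(lambda t: np.full(t.shape, 0.01)))   # ~0.0

# EOS probability decaying like 1/(t+1)^2: about half the mass never terminates.
print(mass_on_infinite_strings(lambda t: 1.0 / (t + 1.0) ** 2))     # ~0.5
```

The point is that "each individual infinite string has vanishing probability" does not imply "the set of infinite strings has vanishing probability."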

In the token-level formulation, it’s clear that an infinite-length string corresponds to essentially an infinite-length trajectory rolled out from the policy. One may think the bandit formulation somehow circumvents this problem because it provides only two bounded-length sequences for the model to choose from. But, when deriving optimal solutions to the bandit setting, you still need to deal with the problem of infinite “arms” (i.e., possible responses sampled from your policy). For example, Appendix A.1 of the DPO paper glosses over the fact that the partition function $Z(x)$ may not exist if you don’t bound the length of the generations or somehow assume that the reward decays with length (I sketch why below). If you’re still not convinced that infinite-length strings matter, then stay tuned to see how this idea eventually plays out into a rigorous understanding of the role of length normalization.
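
To make the partition function issue concrete: in Appendix A.1 of the DPO paper, the optimal policy involves a partition function that sums over every possible response,

$$Z(x) = \sum_{y} \pi_\text{ref}(y\mid x)\exp\!\left(\frac{1}{\beta}\, r(x, y)\right) = \sum_{L=1}^{\infty} \sum_{|y| = L} \pi_\text{ref}(y\mid x)\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

As a hedged back-of-the-envelope argument (the linear reward growth and geometric mass decay are illustrative assumptions of mine): if the reward grows roughly linearly with length, $r(x,y) \approx a|y|$, and $\pi_\text{ref}$ places total mass on the order of $\delta^L$ on responses of length $L$, then the terms of the outer sum scale like $(\delta\, e^{a/\beta})^L$, and the sum diverges whenever $\delta\, e^{a/\beta} > 1$. Bounding the generation length, or assuming the reward decays with length, is exactly what rules this out.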

Length Normalization

We can now derive how length normalization might make sense from the RL perspective. My disclaimer here is that I definitely worked backwards from the idea of length normalization to arrive at this justification, so there may be many other algorithms that are equally or better suited to operating in the setting I described. This derivation is also hand-wavy and meant to provide intuition more than anything else. I'll use the bandit formulation.

Suppose we have a prompt $x$ and two candidate completions $y_1$ and $y_2$. We will use $\Pr[y_1\succ y_2\mid x]$ to denote the proportion of people who prefer $y_1$ over $y_2$ as a completion to $x$.

The first thing we need to do is make some assumption about the reward structure. The Bradley-Terry (BT) model is the standard choice for describing the distribution of human preferences:

$$\Pr[y_1\succ y_2\mid x] = \frac{\exp(r_\text{h}(x, y_1))}{\exp(r_\text{h}(x, y_1)) + \exp(r_\text{h}(x, y_2))}$$

where $r_\text{h}(x,y)$ is the human ground-truth reward function that we don’t have access to. We already arrive at our first complication. In practice, some datasets are constructed by using an auxiliary model (e.g., GPT-4 or PairRM by Jiang et al., 2023) to pick which response is preferred. So, let’s call the reward function of such a language model $r_\text{LM}(x,y)$ and now consider the case where our data is annotated by a model (i.e., “RLAIF”).

One thing we do kind of know about $r_\text{LM}(x,y)$ is that it usually grows with the length of the response $y$. That is, models tend to favor longer responses. And in fact, one popular paper (Singhal et al., 2023) showed that for some models, the reward grows linearly with the length of $y$. But in the case of alignment, it’s not clear if we always want our models to output longer sequences. For example, if we are trying to prevent harmful behavior, length is likely a spurious correlation. On the other hand, if we want to promote helpful behavior, increasing the length may somewhat correlate with providing useful information. Regardless, let’s disentangle length from the reward that we want to maximize and define $r^\star(x,y)$ such that

$$r_\text{LM}(x,y) = |y| \, r^\star(x,y)$$
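
To see what this decomposition buys us, consider a made-up numerical example (the lengths and per-token scores are invented purely for illustration): a concise 100-token response with per-token quality $r^\star = 0.8$ versus a rambling 400-token response with per-token quality $r^\star = 0.3$,

$$r_\text{LM}(x, y_\text{short}) = 100 \times 0.8 = 80, \qquad r_\text{LM}(x, y_\text{long}) = 400 \times 0.3 = 120.$$

The raw annotator reward prefers the longer, lower-quality response, while $r^\star$ prefers the shorter one; that flip is precisely the length bias we want to strip out.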

Now, we want to learn a model $\pi^\star$ that is trained with access only to data annotated according to $r_\text{LM}$ but maximizes the length-normalized reward $r^\star$. In other words, treating the log-likelihood of the learned policy as its implicit reward, we can write:

$$r^\star(x,y) = \frac{\log \pi^\star(y \mid x)}{|y|}$$

where I'm following the standard ML practice of considering the log-likelihood of a softmax-parametrized model instead of the likelihood directly. We can now define a score $\Lambda$ using $r^\star$.

$$\Lambda[y_1\succ y_2 \mid x] = \sigma(r^\star(x,y_1) - r^\star(x,y_2))$$

where $\sigma$ is the sigmoid function. Note that the DPO derivation does not rely on pulling this score out of thin air because they used the Bradley-Terry assumption on $r_\text{h}$ to show that, if the data is human-annotated, then $\sigma(r_\text{h}(x, y_1) - r_\text{h}(x, y_2))$ is exactly equal to the ground-truth probability that $y_1$ is preferred over $y_2$. Here, however, I want to make it clear that we have no reason to believe that $r^\star$ obeys the Bradley-Terry assumption. But, optimistically, if it did, then $\Lambda$ would define a valid probability distribution capturing how much the model prefers $y_1$ over $y_2$ if we removed the bias that favored long responses.
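
For completeness, the identity being invoked here, that the BT probability from earlier is exactly the sigmoid of the reward difference, is easy to verify numerically. Here is a minimal check with made-up reward values:

```python
import numpy as np

def bt_probability(r1, r2):
    """Bradley-Terry probability that response 1 is preferred over response 2."""
    return np.exp(r1) / (np.exp(r1) + np.exp(r2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

r1, r2 = 2.3, 1.1                 # made-up rewards for two candidate responses
print(bt_probability(r1, r2))     # 0.7685...
print(sigmoid(r1 - r2))           # same value: dividing through by exp(r1) gives sigma(r1 - r2)
```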

Now, we want to train the model to maximize this score in the case that $y_1 = y_w$ is the winning response and $y_2=y_l$ is the losing one. So we can write our length-normalized objective as

$$\mathcal{L}(\pi_\theta, \mathcal{D}) = \mathop{\mathbb{E}}_{(x, y_w, y_l)\sim\mathcal{D}} \left[\log \sigma \left(\frac{\log \pi_\theta (y_w\mid x)}{|y_w|} - \frac{\log \pi_\theta(y_l\mid x)}{|y_l|}\right)\right]$$
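
To make this concrete, here is a minimal sketch of how the objective above can be computed from per-token log-probabilities. This is my own illustration, not the official SimPO implementation (it omits SimPO's $\beta$ scaling and target reward margin), and it is negated so that it can be minimized as a loss:

```python
import torch
import torch.nn.functional as F

def length_normalized_preference_loss(logp_w_tokens, logp_l_tokens, mask_w, mask_l):
    """Length-normalized preference loss for one batch.

    logp_*_tokens: (batch, seq_len) per-token log-probs of the winning/losing
                   response under pi_theta, gathered at the target tokens.
    mask_*:        (batch, seq_len) 1.0 on response tokens, 0.0 on prompt/padding.
    """
    # log pi_theta(y | x) / |y|: sum per-token log-probs over the response,
    # then divide by the response length.
    avg_logp_w = (logp_w_tokens * mask_w).sum(-1) / mask_w.sum(-1).clamp(min=1)
    avg_logp_l = (logp_l_tokens * mask_l).sum(-1) / mask_l.sum(-1).clamp(min=1)

    # Maximizing E[log sigma(.)] is the same as minimizing -log sigma(.).
    return -F.logsigmoid(avg_logp_w - avg_logp_l).mean()
```

In practice, the per-token log-probabilities would come from running $\pi_\theta$ over the concatenated prompt and response and gathering the log-probability assigned to each target token.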

And so we arrive at a length-normalized preference learning objective. There are some interesting takeaways from this derivation.

  1. If we’re going to annotate preference data using other models, then we can actually adjust our objectives to counteract known biases in the scoring. Length normalization is one example of this strategy. If and when other ones come to light, we could feasibly use this approach to allow our models to learn from imperfectly annotated data. We can analogously use this approach to correct biases in human-annotated data as well, though we ought to exercise a lot more caution and thought when shifting or otherwise modifying human-expressed preferences.
  2. Note that no matter what I do, this current strategy cannot link $r_{\text{LM}}$ to $r_\text{h}$ – that is, I can’t really describe how the rewards that the model provides relate to the ground-truth human rewards. Instead, all I can say is that length-normalized objectives provide one possible strategy for mitigating the absorption of biases from the models used to annotate the data. By the same token, it doesn’t matter if the annotator LM was trained on human data or not. These nuances are something I hope to explore in future work!
  3. Connection to average reward maximization: Another way to see $r^\star(x,y)$ is that it’s the average reward of each token in the sequence. This links length normalization to a long line of RL algorithms that maximize the average reward instead of the total reward (see a survey in Mahadevan, 1996). This formulation is especially useful when dealing with cyclical or infinite-horizon problems. Note that another way to deal with infinite-horizon problems is to discount rewards that will arrive in the future – if you were to do this, then you would arrive at the length-regularized DPO objective (Park et al., 2024).
  4. If we just look at the final length-normalized objective, then we can see that it explicitly requires $\pi_\theta(y_w\mid x)$ to increase much more than it would if there were no normalization. Other modifications to the objective, like target reward margins, do not explicitly ensure that this will happen, because they operate on the distance between the two likelihoods. After all is said and done, we usually just use $\pi_\theta$ to generate sequences, so directly modifying it seems like it would be more effective.

Discussion

The derivation above demonstrates how we might adjust our objective functions to enable learning from imperfectly (e.g., automatically) annotated data. I think that using other models to grade or annotate preference data will become increasingly prevalent, due to growing evidence that preference learning with on-policy data is more effective (Tajwar et al., 2024) and the rising cost of finding and hiring qualified human annotators. In this post, I considered the case where the preference data was model-annotated and the model had a bias favoring longer responses. The analysis provides some insight into one particular situation in which length normalization makes sense, but it by no means provides a general prescriptive guideline for objective design.

I was originally motivated to derive this after reading about SimPO (Meng et al., 2024), which proposes a length-normalized preference learning objective. If you are curious about the other modifications in SimPO – removing the reference model and setting a global target reward margin – then you might be interested in reading our recent paper on how the reference model can prevent DPO from aligning effectively.

The possibility of infinite-length strings has been explored in several prior works, mainly with a focus on characterizing how prone a particular model architecture (e.g., transformer or RNN) is to leaking probability mass to infinite-length strings and how certain objectives might exacerbate or mitigate this issue (Welleck et al., 2020; Du et al., 2023).

There is also a big leap missing from what I have derived so far. We may train a policy $\pi^\star$ to maximize rewards but when we sample from the model (e.g., via greedy decoding), we don’t actually obtain a sequence that maximizes the log likelihood prescribed by the model. This was, after all, the motivation for developing many different decoding strategies, including ones that implicitly plan or search. In the context of post-training, I find this observation fascinating, given how the field has primarily focused on next-token prediction objectives as a means to constrain or guide long-form generations.¹ Some of my research going forward will be on this topic of the gap between next-token prediction and long-form generation. If you are interested in this direction, I enjoyed reading several of Clara Meister’s papers.

Since this is a blog post and not a paper, I’m not doing a full literature survey on these topics. As I mentioned, there are many RL papers on these topics, and my main research area is not RL. That being said, if I didn’t mention your paper and you think it is relevant, kindly send me an email!

Acknowledgments: Thank you (in alphabetical order) to Angelica Chen, Xinyi Chen, Yu Meng, and Mengzhou Xia for discussions that helped me refine my derivation and for suggesting several relevant papers. And also thank you to Adithya Bhaskar, Tianyu Gao, Lucy He, and Kaifeng Lyu for helping me proofread this post. Thanks also to Ben Eysenbach for pointing me in the direction of average reward maximization.

Citation: If you find this blog post to be helpful in your work, please use the following bibtex citation.

@misc{malladi2024hiddeninfinity,
  title   = {The Hidden Infinity in Preference Learning},
  author  = {Malladi, Sadhika},
  year    = {2024},
  month   = {July},
  url     = {https://www.cs.princeton.edu/~smalladi/blog/2024/06/27/dpo-infinity/},
}

  1. OK, to be fair, there are papers that now train models to predict several tokens into the future (Gloeckle et al., 2024), but I would argue this is just a patch over the underlying problem. Similarly, there are papers that allow a model to access and train on its own generations (e.g., STaR, Zelikman et al., 2022), but I’m not fully convinced that this closes the gap between training and inference.