LLM & RLHF - Paper Reading Notes

Deep Reinforcement Learning from Human Preferences

  • Summary:
    • Compared with the typical approach to complex RL systems (manually specifying complex goals), this paper explores an alternative: define the goal purely through human preferences between pairs of trajectory segments, i.e., learn a reward function from human feedback. This approach can effectively solve complex RL tasks without a hand-engineered reward function, and it also addresses the big issue of misalignment between our values and the objectives of our RL systems.
  • Methods:
    • Fit the reward function to human preference data, and only query humans about trajectory segment pairs whose predictions exhibit high variance across an ensemble of reward functions.
    • Based on the Bradley-Terry model, assume that a human's probability of preferring a segment depends exponentially on the value of the latent reward (written out after this list).
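
In the paper's notation (roughly), the Bradley-Terry assumption says the probability that a human prefers segment $\sigma^1$ over $\sigma^2$ is a softmax of the summed latent rewards, and the reward estimate $\hat{r}$ is then fit by minimizing the cross-entropy between these predicted probabilities and the human labels:

$$
\hat{P}\left[\sigma^1 \succ \sigma^2\right]
= \frac{\exp \sum_t \hat{r}\left(o^1_t, a^1_t\right)}
       {\exp \sum_t \hat{r}\left(o^1_t, a^1_t\right) + \exp \sum_t \hat{r}\left(o^2_t, a^2_t\right)}
$$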

DPO Direct Preference Optimization: Your Language Model is Secretly a Reward Model

  • Summary: This paper proposes a reparameterization idea to directly optimize an LM to align with human preferences, without fitting a reward function or using reinforcement learning. The paper also provides a theoretical analysis showing that the reparameterization does not constrain the class of learned reward models.

  • Definitions

    • RLHF pipeline: 1. SFT Phase; 2. Reward Modelling Phase; 3. RL Fine-Tuning Phase
    • DPO: Direct Preference Optimization - by deriving the optimal policy of the RLHF objective and using a property of the Bradley-Terry model (i.e., only the reward difference is needed to compute the preference probability), we can eliminate the need to fit a reward function: a reparameterization of the reward in terms of the policy turns the loss from an expression over reward functions into an expression over the optimal policy (see the sketch after this list).
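
Concretely, DPO starts from the KL-regularized RLHF objective $\max_\pi \mathbb{E}[r(x, y)] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})$, whose optimal policy implies $r(x, y) = \beta \log \frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)$; plugging this into the Bradley-Terry likelihood cancels the intractable $Z(x)$. Below is a minimal sketch of the resulting loss computed from per-sequence log-probabilities (tensor names and the $\beta$ value are illustrative, not from the paper's code):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss sketch from summed per-token log-probs of the chosen
    and rejected responses under the policy and the frozen reference model."""
    # Implicit rewards: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry only needs the reward difference, so log Z(x) cancels.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```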

Rewarded soups: Towards Pareto-optimal Alignment by Interpolating Weights Fine-tuned on Diverse Rewards

  • Summary: 1) Optimize a set of N weights (all starting from the same pre-trained model), each corresponding to one of N different proxy rewards. 2) Linearly interpolate the weights to obtain a continuous set of rewarded soups. The paper runs extensive experiments on foundation models and demonstrates that rewarded soups can mitigate reward misspecification.

  • Definitions:

    • Linear Mode Connectivity (LMC) & Weight Interpolation (WI): weights fine-tuned from a shared pre-trained initialization remain linearly connected and can therefore be interpolated. (In other words, when LMC holds, combining networks in weight space combines their abilities; see the interpolation sketch after this list.)
    • Multi-Objective Reinforcement Learning (MORL): given N different proxy rewards, combine them with M different weightings and train M different models; then, based on the user's specific reward, choose the one of the M models that maximizes it.
    • Reward Misspecification: when optimizing LLMs, we can only optimize a proxy reward during training, not the true reward that matters at test time, which we do not actually know when the model is deployed in the real world. Such misspecification may hinder the alignment of the network.
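
As a rough illustration of the interpolation step, here is a minimal sketch assuming N checkpoints fine-tuned from the same pre-trained initialization; the function and variable names are mine, not the paper's:

```python
import copy

def rewarded_soup(state_dicts, lambdas):
    """Linearly interpolate N fine-tuned checkpoints (sharing one pre-trained
    init) with coefficients `lambdas` that sum to 1."""
    assert len(state_dicts) == len(lambdas)
    assert abs(sum(lambdas) - 1.0) < 1e-6
    soup = copy.deepcopy(state_dicts[0])
    for name in soup:
        soup[name] = sum(lam * sd[name] for lam, sd in zip(lambdas, state_dicts))
    return soup

# e.g. an even mix of two reward-specific fine-tunes:
# soup = rewarded_soup([model_a.state_dict(), model_b.state_dict()], [0.5, 0.5])
```

Sweeping the coefficients over the simplex then traces out the continuous set of rewarded soups.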

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

  • Summary: Proposes Multi-Objective DPO (MODPO), an RL-free algorithm that extends DPO to multiple alignment objectives with minimal overhead. Key idea: by carrying the DPO derivation through the multi-objective formulation, the authors find that, compared with standard DPO, we only need to train one margin reward model to optimize the MODPO loss (a rough sketch follows the list below). The results show that MODPO achieves performance comparable to MORLHF and much better than rewarded soups or loss weighting.

  • Typical approach for producing multiple customized LMs (MORLHF)

    • Divide human feedback into multiple fine-grained dimensions
    • Create distinct reward models
    • Optimize different LMs with different reward weightings during the fine-tuning phase
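
A rough, hedged sketch of how MODPO modifies the DPO loss above: preference data for objective k drives a DPO-style term, while a frozen margin reward model scores the remaining objectives and its score difference is subtracted inside the sigmoid. The exact weighting and scaling in the paper may differ; all names and the two-objective weighting (w_k vs. 1 - w_k) here are illustrative:

```python
import torch.nn.functional as F

def modpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps,
               margin_chosen, margin_rejected,  # frozen margin reward model scores
               beta=0.1, w_k=0.5):
    """Hedged sketch: DPO-style loss for objective k plus a margin term that
    keeps credit assigned to the other objectives (weighted by 1 - w_k here)."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = (1.0 - w_k) * (margin_chosen - margin_rejected)
    return -F.logsigmoid((chosen - rejected - margin) / w_k).mean()
```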

Discovering Language Model Behaviors with Model-Written Evaluations

  • Summary:

    • This paper provides a method to automatically generate evaluation datasets by leveraging LMs (Language Models) and PMs (Preference Models). With this method, we can evaluate the behaviors of LMs without human-written evaluation datasets, which usually require a large amount of human effort.
    • By testing on the 154 generated evaluation datasets, the paper finds new cases of inverse scaling, where LMs get worse with size.
      • Larger LMs repeat back a dialogue user's preferred answer ("sycophancy") and express a greater desire to pursue concerning goals like resource acquisition and goal preservation.
      • More RLHF makes some behaviors worse, e.g., RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid being shut down.
  • Definitions:

    • Evaluation datasets: datasets used to evaluate the behaviors of LMs, such as a gender bias test, a stated desire for power test, and an ends-justify-means reasoning test.
    • Preference models: the models used for RLHF training, trained to mimic human preferences between two replies from LMs.
    • Inverse scaling: LM behaviors that become worse as model size increases.
  • Methods: (figure omitted; a hedged sketch of the generation-and-filtering idea follows)
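
A hedged, pseudocode-style sketch of the generation-and-filtering idea described above; the exact prompts, thresholds, and filtering criteria in the paper differ, and `lm_generate` / `pm_score` are hypothetical stand-ins for an LM sampling call and a preference-model scorer:

```python
def build_eval_dataset(lm_generate, pm_score, behavior_description,
                       n_candidates=1000, threshold=0.75):
    """Hedged sketch: ask an LM to write test statements for a target behavior,
    then keep only candidates the preference model rates as high quality."""
    prompt = ("Write a statement that someone exhibiting the following "
              f"behavior would agree with: {behavior_description}")
    examples = []
    for _ in range(n_candidates):
        statement = lm_generate(prompt)                    # sample one candidate
        score = pm_score(statement, behavior_description)  # PM judges relevance/quality
        if score >= threshold:
            examples.append(statement)
    return examples
```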