RLHF (Reinforcement Learning from Human Feedback) is the standard recipe for aligning language models to be helpful and harmless. But classical RLHF requires training a separate reward model — which is expensive and finicky. Direct Preference Optimization (DPO) sidesteps this by framing alignment directly as a classification problem over preference pairs.
I explored this with Microsoft’s Phi-2 — a 2.7B parameter model that punches well above its weight — on the Anthropic Helpful and Harmless (HH) dataset.
Setup
- Model: Microsoft Phi-2 (2.7B params)
- Dataset: Anthropic HH — pairs of (chosen, rejected) dialogue responses
- Method: DPO via HuggingFace TRL
- Efficiency: LoRA adapters + 8-bit quantization (bitsandbytes)
LoRA and quantization made this trainable on a single GPU without sacrificing much quality — a practical requirement for research-scale experiments.
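Roughly, the setup looks like the sketch below. The LoRA hyperparameters, target module names, and pad-token handling are illustrative choices on my part, not necessarily the exact values used in this run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # commonly set, since the Phi-2 tokenizer has no pad token

# Load the 2.7B base model in 8-bit via bitsandbytes so it fits on a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA: train small low-rank adapters instead of the full weight matrices.
# Module names follow the HF Phi-2 implementation; r / alpha are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```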
What DPO Does
Given a pair of responses (chosen, rejected) to the same prompt, DPO trains the policy to increase the likelihood of the chosen response relative to the rejected one, while a frozen copy of the base model serves as a reference that keeps the policy from drifting too far. No reward model, no PPO loop.
The loss is:
L_DPO = -E[log σ(β * (log π(y_w|x) - log π_ref(y_w|x)) - β * (log π(y_l|x) - log π_ref(y_l|x)))]
where y_w is the preferred response, y_l is the rejected one, and β controls how far the policy can deviate from the reference.
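To make the notation concrete, here is a straight transcription of that loss in PyTorch. It assumes the per-response log-probabilities (summed over response tokens) have already been gathered for both the policy and the reference model; the function and argument names are mine, not TRL's.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each input is a (batch,)-shaped tensor of log pi(y|x) summed over response tokens."""
    # Implicit rewards: how much more likely each response is under the policy
    # than under the frozen reference, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the margin: minimized when chosen outscores rejected.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    reward_margin = (chosen_rewards - rejected_rewards).mean()
    return loss, reward_margin
```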
Results
The fine-tuned model showed improved reward margins — when scored with a reward model, it consistently preferred the helpful/harmless responses over the rejected ones. You can see the full training run on W&B.
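For context, one simple way to run this kind of check (not necessarily the exact evaluation behind the run above) is to score each (chosen, rejected) pair with an off-the-shelf reward model and count how often the chosen side wins. The model name below is a placeholder, not the reward model used for this post:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: substitute any sequence-classification reward model trained on
# helpfulness/harmlessness preferences.
rm_name = "path/to/your-reward-model"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)
reward_model.eval()

@torch.no_grad()
def score(dialogue: str) -> float:
    """Scalar reward for a full dialogue (prompt + response)."""
    inputs = rm_tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=1024)
    return reward_model(**inputs).logits[0, 0].item()

def chosen_win_rate(pairs) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen response scores higher."""
    wins = sum(score(chosen) > score(rejected) for chosen, rejected in pairs)
    return wins / len(pairs)
```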
Takeaways
- DPO is remarkably clean to implement with TRL — the training loop is almost identical to standard SFT (see the sketch after this list)
- Phi-2 responds well to preference tuning despite being small
- LoRA + 8-bit makes alignment experiments accessible without large GPU clusters
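On that first point, here is a minimal DPOTrainer sketch, continuing from the setup above. Exact argument names vary across TRL versions (e.g. `processing_class` vs `tokenizer`, `beta` living in `DPOConfig`), and the hyperparameters are illustrative rather than the exact values from this run:

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

dataset = load_dataset("Anthropic/hh-rlhf", split="train")

def split_prompt(example):
    # HH rows are full dialogues; everything up to the final "Assistant:" turn
    # is the shared prompt, the remainder is the chosen/rejected response.
    marker = "\n\nAssistant:"
    cut = example["chosen"].rfind(marker) + len(marker)
    return {
        "prompt": example["chosen"][:cut],
        "chosen": example["chosen"][cut:],
        "rejected": example["rejected"][example["rejected"].rfind(marker) + len(marker):],
    }

dataset = dataset.map(split_prompt)

training_args = DPOConfig(
    output_dir="phi2-dpo-hh",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    beta=0.1,                # the beta from the loss above
    max_length=1024,
    report_to="wandb",
)

trainer = DPOTrainer(
    model=model,             # the LoRA-wrapped, 8-bit Phi-2 from the setup sketch
    ref_model=None,          # with LoRA, the frozen base weights serve as the reference
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # called tokenizer= in older TRL releases
)
trainer.train()
```

Passing `ref_model=None` works here because the policy is a PEFT model: TRL computes the reference log-probabilities by disabling the LoRA adapters, so no second copy of Phi-2 needs to sit in memory.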
This is a good starting point if you want to experiment with alignment at a smaller scale before moving to larger models.