Tag: rlhf
All the articles with the tag "rlhf".
Fine-tuning Phi-2 with DPO on the Anthropic HH Dataset
Fine-tuning Microsoft's Phi-2 with Direct Preference Optimization (DPO) on the Anthropic Helpful and Harmless (HH) dataset, using LoRA and 8-bit quantization.