Tag: rlhf
All the articles with the tag "rlhf".
Fine-tuning Phi-2 with DPO on the Anthropic HH Dataset
Fine-tuning Microsoft's Phi-2 with Direct Preference Optimization (DPO) on the Anthropic Helpful and Harmless (HH) dataset, using LoRA and 8-bit quantization.