SFT, RLHF, PPO, Constitutional AI, DPO, KTO — how a pretrained LLM becomes a helpful, harmless, honest assistant. The full pipeline that powers Claude, GPT-4, and LLaMA-chat.
The full pipeline has four stages:

1. Pretraining: next-token prediction on trillions of tokens; learns world knowledge and language structure.
2. Supervised fine-tuning (SFT): training on (instruction, ideal response) pairs; teaches the instruction-following format.
3. Reward modeling: train a model to score response quality from human preference rankings.
4. RL fine-tuning: use PPO or DPO to optimize the policy toward high-reward, low-KL responses.
A pretrained LLM is a next-token predictor. Ask it "How do I make coffee?" and it might output "How do I make coffee? Let me explain the history of coffee cultivation…" — continuing the document, not answering you. SFT teaches the model the format of helpful conversation. RL fine-tuning teaches it to produce responses that humans actually prefer — truthful, safe, and helpful — not just statistically plausible.
SFT is standard supervised learning: given a prompt X, maximize log P(Y|X) where Y is a high-quality demonstration response. The loss is token-level cross-entropy computed only on the response tokens, not the prompt.
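A minimal sketch of that masked loss, assuming a HuggingFace-style causal LM whose forward pass returns `.logits`; the `model` and tensor names are placeholders, not a specific library API:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Cross-entropy on response tokens only; prompt tokens are masked out."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(input_ids).logits                 # (B, T, V)

    # Labels: predict token t+1 from tokens <= t; ignore prompt positions.
    labels = input_ids.clone()
    labels[:, :prompt_ids.shape[1]] = -100           # mask the prompt
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,                            # masked positions contribute nothing
    )
```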
OpenAI hired contractors to write (instruction, ideal response) pairs. Alpaca used GPT-3.5 to generate 52K examples from 175 seed tasks — "self-instruct" for cheap SFT data.
OpenAssistant (OASST), Dolly-15K, FLAN Collection, ShareGPT, OpenHermes 2.5, Orca 2 (reasoning traces), Infinity-Instruct.
SFT teaches format and average behavior, but doesn't optimize for nuanced human preferences — safety, helpfulness tradeoffs, tone, refusing harmful requests. That's what RL brings.
Humans rank multiple responses to the same prompt. The reward model (RM) is trained to predict human preferences — effectively learning what makes a response good.
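The standard objective is a pairwise Bradley-Terry loss: the RM should score the preferred response above the rejected one. A minimal sketch, where `rm` is assumed to map (prompt, response) token ids to a scalar score:

```python
import torch.nn.functional as F

def reward_model_loss(rm, prompt_ids, chosen_ids, rejected_ids):
    """Pairwise Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    r_chosen = rm(prompt_ids, chosen_ids)      # (B,) scalar reward per example
    r_rejected = rm(prompt_ids, rejected_ids)  # (B,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```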
The KL penalty is critical — without it, the model would optimize the reward function in degenerate ways (reward hacking), producing outputs that score high on r_φ but are bizarre or harmful. The KL term anchors the policy near the well-behaved SFT model.
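In practice the penalty is folded into the reward that PPO maximizes, roughly r_φ(x, y) minus β times the KL from the frozen SFT policy. A minimal sketch of that shaped reward; the tensor names and the β value are illustrative, not from a specific library:

```python
import torch

def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward PPO optimizes: RM score minus a KL penalty to the SFT policy.

    rm_score:        (B,)   reward-model score r_phi for each sampled response
    policy_logprobs: (B, T) per-token log-probs under the current policy
    ref_logprobs:    (B, T) per-token log-probs under the frozen SFT model
    beta:            KL coefficient (illustrative value)
    """
    # Monte-Carlo KL estimate on the sampled tokens, summed over the response.
    kl = torch.sum(policy_logprobs - ref_logprobs, dim=-1)
    return rm_score - beta * kl
```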
Human feedback is expensive and inconsistent. What if we could use an LLM to judge and improve another LLM — guided by explicit principles (a "constitution")? That's Constitutional AI (CAI) — the basis for Claude. It replaces human preference labelers with AI feedback (RLAIF), scaled by a written set of values.
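Mechanically, CAI has a supervised phase (the model critiques and revises its own outputs against constitutional principles, then is fine-tuned on the revisions) and an RL phase where an AI judge supplies the preference labels (RLAIF). A rough sketch of the critique-revision loop; the principles, prompts, and `generate` helper are illustrative placeholders, not Anthropic's actual setup:

```python
import random

CONSTITUTION = [
    "Choose the response that is least likely to help with harmful activities.",
    "Choose the response that is most honest and acknowledges uncertainty.",
]  # Illustrative principles only.

def critique_and_revise(generate, prompt, n_rounds=2):
    """Draft -> critique against a principle -> revise; revisions become SFT targets.

    `generate(text) -> str` stands in for sampling from the SFT model.
    """
    response = generate(prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Principle: {principle}\nResponse: {response}\n"
            "Identify ways the response violates the principle."
        )
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```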
CAI scales much better than RLHF: AI preference labels cost ~$0.002 each vs ~$1–10 for human labels. The constitution makes the safety criteria explicit and auditable — much better than the implicit values learned from human raters.
DPO (Direct Preference Optimization) drops the explicit reward model: the Bradley-Terry reward can be rewritten in closed form in terms of the policy and a frozen reference policy, so preference learning collapses into a classification-style loss on the policy itself. DPO is used in Zephyr-7B, Tulu-2, LLaMA-2-Chat (partial), and Mistral Instruct. LoRA + DPO fits on a single consumer GPU (8–24 GB), democratizing alignment fine-tuning.
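Concretely, the loss is the reward-model loss with the RM replaced by the implicit reward β·(log π_θ − log π_ref). A minimal sketch, where the inputs are summed log-probs of each full response and β is a placeholder value:

```python
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """DPO loss on a batch of preference pairs (no separate reward model)."""
    # Implicit rewards: beta * (log pi_theta - log pi_ref) for each response.
    chosen_rewards = beta * (policy_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (policy_logps_rejected - ref_logps_rejected)
    # Bradley-Terry classification loss on the implicit reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```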
DPO requires paired preferences (for each prompt, a preferred AND rejected response). KTO (Ethayarajh et al., 2024) relaxes this — it only needs individual labeled examples: (prompt, response, good/bad label).
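The practical difference shows up in data collection. A hypothetical example of the two formats:

```python
# DPO: every prompt needs a chosen AND a rejected response.
dpo_example = {
    "prompt": "How do I make coffee?",
    "chosen": "Grind the beans, then brew with water just off the boil...",
    "rejected": "Coffee was first cultivated in Ethiopia...",
}

# KTO: each example is a single response with a thumbs-up/down label,
# e.g. harvested from production feedback. No pairing required.
kto_examples = [
    {"prompt": "How do I make coffee?", "response": "Grind the beans...", "label": True},
    {"prompt": "How do I make coffee?", "response": "Coffee was first cultivated...", "label": False},
]
```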
| Method | Data Needed | Models Held During Training | Stability | Used By |
|---|---|---|---|---|
| SFT only | Demonstration pairs | 1 | ✅ Very stable | All models (base) |
| PPO-RLHF | Preference pairs + RM | 4 | ⚠️ Unstable | GPT-4, Claude (early), InstructGPT |
| DPO | Paired preferences | 2 | ✅ Stable | LLaMA-2, Zephyr, Mistral |
| KTO | Binary labels (no pairs needed) | 2 | ✅ Stable | Mistral, newer open models |
| Constitutional AI | Constitution + AI labels | 3–4 | ✅ Scalable | Claude 2/3 |
| RLAIF | Prompts + AI judge | 3–4 | ✅ Cheaper than RLHF | Gemini, Claude variants |
1. In RLHF with PPO, why is the KL divergence penalty term essential?
2. What is the key mathematical insight that makes DPO work without a separate reward model?
3. Constitutional AI (CAI) replaces human preference labelers with what?
4. What key advantage does KTO have over DPO in terms of data requirements?