🗺️ The Post-Training Pipeline

From raw pretraining → helpful assistant
1. Pretraining

Next-token prediction on trillions of tokens. Learns world knowledge, language structure.

2. SFT

Supervised fine-tuning on (instruction, ideal response) pairs. Teaches instruction-following format.

3. Reward Modeling

Train a model to score response quality from human preference rankings.

4. RL Fine-tuning

Use PPO or DPO to optimize the policy toward high-reward, low-KL responses.

A pretrained LLM is a next-token predictor. Ask it "How do I make coffee?" and it might output "How do I make coffee? Let me explain the history of coffee cultivation…" — continuing the document, not answering you. SFT teaches the model the format of helpful conversation. RL fine-tuning teaches it to produce responses that humans actually prefer — truthful, safe, and helpful — not just statistically plausible.

📚 Supervised Fine-Tuning (SFT)

Teaching instruction-following format

📝 SFT: Mechanics & Data

What · How

SFT is standard supervised learning: given a prompt X, maximize P(Y|X) where Y is a high-quality demonstration response. The loss is computed only on the response tokens, not the prompt.

# SFT objective: next-token prediction on response tokens only
L_SFT = -Σᵢ log P_θ(yᵢ | y₁,...,yᵢ₋₁, x)

# where x = system prompt + user message
#       y = ideal assistant response
# Loss NOT computed on prompt tokens

# Data format (chat template)
<|system|>You are a helpful assistant.
<|user|>Explain gradient descent in one sentence.
<|assistant|>Gradient descent is an optimization algorithm...
             ^^ Loss computed only here
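In PyTorch terms, the prompt masking is just an ignore index on the prompt positions. A minimal sketch, with illustrative function and variable names (not from any specific library):

import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy on response tokens only.

    logits:     (seq_len, vocab) model outputs for the full sequence
    input_ids:  (seq_len,) prompt + response token ids
    prompt_len: number of prompt tokens to exclude from the loss
    """
    # Shift: the token at position t is predicted from positions < t.
    shift_logits = logits[:-1]               # predictions for positions 1..seq_len-1
    shift_labels = input_ids[1:].clone()
    # Mask prompt tokens so the loss covers only the assistant response.
    shift_labels[: prompt_len - 1] = -100    # ignore_index
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)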

InstructGPT / Alpaca Style

OpenAI hired contractors to write (instruction, ideal response) pairs. Alpaca used OpenAI's text-davinci-003 to generate 52K examples from 175 seed tasks, a "self-instruct" recipe for cheap SFT data.

Open Datasets (2024)

OpenAssistant (OASST), Dolly-15K, FLAN Collection, ShareGPT, OpenHermes 2.5, Orca 2 (reasoning traces), Infinity-Instruct.

SFT Alone Isn't Enough

SFT teaches format and average behavior, but doesn't optimize for nuanced human preferences — safety, helpfulness tradeoffs, tone, refusing harmful requests. That's what RL brings.

🎯 RLHF: Reinforcement Learning from Human Feedback

PPO · Reward Modeling · KL Penalty

🏆 Reward Model Training

How · Critical

Humans rank multiple responses to the same prompt. The reward model (RM) is trained to predict human preferences — effectively learning what makes a response good.

# Reward model: LLM + linear head
r_φ(x, y) = linear(LLM_hidden(x, y))   # scalar reward

# Training: Bradley-Terry preference model
# Given: human prefers y_w over y_l for prompt x
L_RM = -log σ(r_φ(x, y_w) - r_φ(x, y_l))
# Maximize: reward(preferred) - reward(rejected)

# Reward hacking risk:
#   Model learns to game the RM, not truly be better
# Mitigation: KL penalty, diverse RM ensemble, iterative RLHF
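Given scalar rewards for the chosen and rejected responses, the Bradley-Terry loss is a one-liner. A sketch, where reward_model is a hypothetical LLM-plus-linear-head that returns one scalar per sequence:

import torch
import torch.nn.functional as F

def rm_loss(reward_model, chosen_ids, rejected_ids):
    """L_RM = -log sigmoid(r(x, y_w) - r(x, y_l)).

    chosen_ids / rejected_ids: token ids of prompt + preferred
    response and prompt + rejected response.
    """
    r_chosen = reward_model(chosen_ids)      # scalar reward for y_w
    r_rejected = reward_model(rejected_ids)  # scalar reward for y_l
    # Maximize the margin between preferred and rejected rewards.
    return -F.logsigmoid(r_chosen - r_rejected).mean()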

⚙️ PPO for RLHF: The Full Objective

PhD Depth
# RLHF objective (per response y given prompt x)
max_π E_{x∼D, y∼π(·|x)} [ r_φ(x,y) - β·KL[π_θ(·|x) || π_ref(·|x)] ]

# Components:
#   r_φ(x,y) = reward model score        (be good)
#   -β·KL    = stay close to reference   (don't deviate)
#   β ≈ 0.2  = KL coefficient (typical)

# PPO Clipped Objective (prevents large policy updates):
L_CLIP = E[ min(rₜ·Aₜ, clip(rₜ, 1-ε, 1+ε)·Aₜ) ]
#   rₜ = π_θ(aₜ|sₜ) / π_old(aₜ|sₜ)   (policy ratio)
#   Aₜ = advantage estimate (GAE)
#   ε  = 0.2 (clip range)

# 4 models running simultaneously:
#   1. Policy π_θ         (updated)
#   2. Reference π_ref    (frozen SFT model)
#   3. Reward model r_φ   (frozen)
#   4. Value function V_ψ (updated, serves as baseline)
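The clipped surrogate itself is short in code. A sketch over a batch of token-level quantities (tensor shapes and names are illustrative):

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """L_CLIP = -E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]."""
    ratio = torch.exp(logp_new - logp_old)             # r_t = pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Negative sign: minimizing the loss maximizes the surrogate objective.
    return -torch.min(unclipped, clipped).mean()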

The KL penalty is critical — without it, the model would optimize the reward function in degenerate ways (reward hacking), producing outputs that score high on r_φ but are bizarre or harmful. The KL term anchors the policy near the well-behaved SFT model.
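In common implementations (e.g., trl's PPO trainer), this anchoring is applied as a per-token reward: a single-sample KL estimate against the frozen reference, with the RM score added at the final token. A sketch under those assumptions:

import torch

def shaped_rewards(rm_score, logp_policy, logp_ref, beta=0.2):
    """Per-token RL rewards: -beta * KL estimate, plus RM score at the end.

    rm_score:              scalar r_phi(x, y) for the whole response
    logp_policy, logp_ref: (T,) per-token log-probs under pi_theta / pi_ref
    """
    kl_per_token = logp_policy - logp_ref   # single-sample KL estimate
    rewards = -beta * kl_per_token          # penalize drift from pi_ref
    rewards[-1] += rm_score                 # sequence reward at final token
    return rewards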

📜 Constitutional AI (Anthropic, 2022)

Scaling safety without human labelers

⚖️ Constitutional AI: RLAIF

What · Why · Anthropic

Human feedback is expensive and inconsistent. What if we could use an LLM to judge and improve another LLM — guided by explicit principles (a "constitution")? That's Constitutional AI (CAI) — the basis for Claude. It replaces human preference labelers with AI feedback (RLAIF), scaled by a written set of values.

Constitutional AI Pipeline

STAGE 1 — Supervised Learning (SL-CAI):
  1. Red-team: generate harmful responses to test prompts
  2. Self-critique: ask model to identify problems per constitution
     Prompt: "Which of these responses violates principle {X}?"
  3. Self-revision: generate better response
  4. SFT on revised responses

STAGE 2 — RLAIF (Reinforcement Learning from AI Feedback):
  1. Generate response pairs (y_i, y_j) for prompts
  2. Ask feedback model: "Which is more {helpful/harmless/honest}?
     Consider principle: {constitutional principle}"
  3. Use AI preference labels to train reward model
  4. PPO on AI reward model

Constitution examples (Anthropic's CAI paper):
  "Choose the response that is least likely to contain harmful or unethical content."
  "Choose the response that is most supportive of human autonomy and individual freedoms."
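Stage 1's critique-and-revision loop is simple to express in code. An illustrative sketch, where generate is a hypothetical text-completion callable and the prompt templates paraphrase the paper's rather than reproduce them:

import random

CONSTITUTION = [
    "Choose the response that is least likely to contain harmful or unethical content.",
    "Choose the response that is most supportive of human autonomy and individual freedoms.",
]

def critique_and_revise(generate, user_prompt):
    draft = generate(user_prompt)
    principle = random.choice(CONSTITUTION)
    # Ask the model to critique its own draft against one principle.
    critique = generate(
        f"Response: {draft}\nIdentify how this response violates: {principle}"
    )
    # Ask for a revision that addresses the critique.
    revision = generate(
        f"Response: {draft}\nCritique: {critique}\n"
        f"Rewrite the response to fix the problems identified."
    )
    return (user_prompt, revision)   # (instruction, response) pair for SFT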

CAI scales much better than RLHF: AI preference labels cost ~$0.002 each versus ~$1–10 for human labels. The constitution also makes the safety criteria explicit and auditable, rather than leaving them as implicit values absorbed from human raters.

📐 DPO, KTO & Alignment Without RL

Preference optimization as supervised learning

🎯 DPO: Direct Preference Optimization

PhD Depth · NeurIPS 2023
# Key insight: the RLHF optimal policy has a closed form
π*(y|x) ∝ π_ref(y|x) · exp(r*(x,y) / β)

# Rearranging: reward = β · log policy ratio + constant
r*(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x)

# Substitute into the Bradley-Terry preference model
# (the intractable β·log Z(x) terms cancel in the difference):
P(y_w ≻ y_l | x) = σ(r*(x,y_w) - r*(x,y_l))
                 = σ(β·log(π*/π_ref)(y_w) - β·log(π*/π_ref)(y_l))

# DPO loss: replace π* with π_θ and train directly
L_DPO(θ) = -E log σ( β·log(π_θ(y_w|x)/π_ref(y_w|x))
                   - β·log(π_θ(y_l|x)/π_ref(y_l|x)) )

# ↑ Pure supervised objective on (prompt, chosen, rejected) triples
# No reward model. No RL loop. Stable training.
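Given summed response log-probs under the policy and the frozen reference, the loss is a few lines. A minimal sketch:

import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (log-ratio(y_w) - log-ratio(y_l)))."""
    ratio_w = logp_w - ref_logp_w    # log(pi_theta / pi_ref) for chosen
    ratio_l = logp_l - ref_logp_l    # log(pi_theta / pi_ref) for rejected
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()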

DPO is used in Zephyr-7B, Tulu-2, Mixtral-Instruct, and Llama 3's post-training pipeline, among others. LoRA + DPO fits on a single consumer GPU (8–24 GB), democratizing alignment fine-tuning (see the sketch below).
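In practice that is a few lines with Hugging Face's trl and peft. This is a sketch only: argument names (e.g., processing_class vs. tokenizer) and defaults have shifted across trl releases, and the model and dataset here are placeholders.

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "mistralai/Mistral-7B-v0.1"        # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference dataset with prompt / chosen / rejected columns
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),  # beta = KL strength
    train_dataset=dataset,
    processing_class=tokenizer,                      # `tokenizer=` in older trl
    peft_config=LoraConfig(r=16, lora_alpha=32),     # LoRA keeps memory low
)
trainer.train()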

🧠 KTO: Kahneman-Tversky Optimization

Prospect Theory · 2024

DPO requires paired preferences (for each prompt, a preferred AND rejected response). KTO (Ethayarajh et al., 2024) relaxes this — it only needs individual labeled examples: (prompt, response, good/bad label).

# Kahneman-Tversky: humans feel losses more than gains
# Loss aversion: v(x) = x if x ≥ 0, λx if x < 0  (λ > 1)

# KTO "desirability" value (prospect theory):
#   r_KTO(x,y) = β·log(π_θ(y|x) / π_ref(y|x))   (implicit reward)
#   z_ref      = KL baseline (reference point, mean reward under current policy)

v(x,y) = λ_D · σ(r_KTO(x,y) - z_ref)   if z=1 (desirable)
         λ_U · σ(z_ref - r_KTO(x,y))   if z=0 (undesirable)

# Loss: minimize the shortfall from full desirability
L_KTO = E_{(x,y,z)} [ λ_z - v(x,y) ]

# λ_D, λ_U weight desirable/undesirable examples (loss aversion)
# Advantage: works with binary labels, no paired preferences
# 1 positive + 1 negative example is all you need per type
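As a batch computation this looks like the sketch below (PyTorch); the z_ref estimate here is a crude stand-in for the paper's batch-level KL estimate from mismatched pairs:

import torch

def kto_loss(logp, ref_logp, desirable, beta=0.1,
             lambda_D=1.0, lambda_U=1.0):
    """logp / ref_logp: (B,) summed response log-probs; desirable: (B,) bool."""
    r = beta * (logp - ref_logp)              # implicit reward
    z_ref = r.mean().clamp(min=0).detach()    # crude KL baseline estimate
    # Prospect-theory value: saturating gains above / losses below the baseline.
    v = torch.where(
        desirable,
        lambda_D * torch.sigmoid(r - z_ref),
        lambda_U * torch.sigmoid(z_ref - r),
    )
    return (1 - v).mean()                     # minimize ⇒ maximize desirability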

📊 Alignment Methods Comparison

Method            | Data Needed               | Models in Training | Stability            | Used By
SFT only          | Demonstration pairs       | 1                  | ✅ Very stable        | All models (base)
PPO-RLHF          | Preference pairs + RM     | 4                  | ⚠️ Unstable           | InstructGPT, GPT-4, Claude (early), LLaMA-2-Chat
DPO               | Paired preferences        | 2                  | ✅ Stable             | Zephyr, Tulu-2, Mixtral, Llama 3
KTO               | Binary labels (no pairs)  | 2                  | ✅ Stable             | Mistral, newer open models
Constitutional AI | Constitution + AI labels  | 3–4                | ✅ Scalable           | Claude 2/3
RLAIF             | Prompts + AI judge        | 3–4                | ✅ Cheaper than RLHF  | Gemini, Claude variants

🧠 Knowledge Check: Post-Training Alignment

1. In RLHF with PPO, why is the KL divergence penalty term essential?

2. What is the key mathematical insight that makes DPO work without a separate reward model?

3. Constitutional AI (CAI) replaces human preference labelers with what?

4. What key advantage does KTO have over DPO in terms of data requirements?