From how LLMs generate text token-by-token, to crafting powerful prompts, to building RAG systems that give models real-world knowledge: this week bridges theory and modern practice.
Every model we've seen so far (Naive Bayes, SVM, LSTM, BERT) takes input text and classifies it. Each answers yes/no or picks a label. Generative AI does something far more open-ended: given the sentence "The weather today is", it must predict what comes next. And then after outputting "sunny", it must predict the word after that, and so on. This is a completely different game: instead of one decision, the model makes hundreds of decisions sequentially, each one conditioned on everything before it.
Generative AI is a class of AI models trained to create content (text, images, code, audio, video) based on patterns learned from massive datasets. Unlike discriminative models (which draw a boundary between classes), generative models learn the underlying data distribution and can sample from it.
Before generative models, every NLP task needed its own labeled dataset and custom classifier. Generative LLMs are trained with unsupervised next-word prediction (no labels needed; the next word is always the label). This allows training on trillions of tokens. The resulting model captures such rich language understanding that it can be adapted to almost any task via prompting alone, without retraining.
Generative models shine when the output space is open-ended: summarization, question answering, code generation, creative writing, translation, chatbots. They're overkill (and expensive) if you only need a simple label like "spam vs. not spam" with a small dataset.
Generation is an autoregressive process: each token is generated one at a time, and each new token is conditioned on all previously generated tokens. The loop runs until a special end-of-sequence token is produced or a maximum length is reached.
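The decoding loop above can be sketched with a toy stand-in for the model. Here a tiny bigram table plays the role of the LLM (a real model conditions on the full prefix, not just the last token); the loop structure — sample, append, check for end-of-sequence — is the part that matches real autoregressive generation.

```python
import random

# Toy "language model": maps the last token to possible next tokens with
# probabilities. A stand-in so the decoding loop itself is runnable.
BIGRAMS = {
    "<s>": [("The", 1.0)],
    "The": [("weather", 1.0)],
    "weather": [("today", 1.0)],
    "today": [("is", 1.0)],
    "is": [("sunny", 0.7), ("rainy", 0.3)],
    "sunny": [("<eos>", 1.0)],
    "rainy": [("<eos>", 1.0)],
}

def generate(max_tokens=10, seed=0):
    """Autoregressive decoding: sample one token at a time, each
    conditioned on what was generated so far, until <eos> or max length."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        words, probs = zip(*BIGRAMS[tokens[-1]])
        next_tok = rng.choices(words, weights=probs, k=1)[0]
        if next_tok == "<eos>":   # end-of-sequence token stops the loop
            break
        tokens.append(next_tok)
    return " ".join(tokens[1:])

print(generate())
```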
| Stage | Technique | Goal | Data |
|---|---|---|---|
| Pre-training | Next-token prediction (CLM) | Learn language, world knowledge | Trillions of tokens, unsupervised |
| Pre-training | Masked Language Model (MLM, BERT-style) | Bidirectional context understanding | Large unlabeled corpus |
| Post-training | Supervised Fine-Tuning (SFT) | Follow instructions | ~10K–100K curated instruction-response pairs |
| Alignment | RLHF (Reinforcement Learning from Human Feedback) | Helpful, harmless, honest outputs | Human preference rankings |
| Alignment | DPO (Direct Preference Optimization) | Same as RLHF, more stable/efficient | Preference pairs (chosen vs. rejected) |
Imagine you hired a brilliant consultant who has read every book ever written. Now, if you ask them "what do you think?", you get a vague answer. But if you say "You are a senior data scientist. Here are 3 examples of similar problems and their solutions. Now, think step-by-step, and tell me how to detect anomalies in sensor data using isolation forests", you get a masterpiece. That's prompt engineering: not changing the consultant's knowledge, just learning how to talk to them.
A well-designed prompt has four components that together specify exactly what you want from the model:
Ask the model to perform a task with no examples at all. Relies entirely on pre-trained knowledge. Works well for common tasks. Fails for novel formats.
Provide 2–10 examples of (input → output) pairs before your actual query. Demonstrates the pattern you want. Dramatically improves performance on custom formats.
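Few-shot prompting is ultimately just careful string assembly. A minimal sketch (the task wording and `Text:`/`Label:` format are illustrative choices, not a fixed convention):

```python
def build_few_shot_prompt(examples, query,
                          task="Classify the sentiment as positive or negative."):
    """Assemble a few-shot prompt: instruction, then (input -> output)
    demonstration pairs, then the actual query left for the model."""
    lines = [task, ""]
    for inp, out in examples:
        lines.append(f"Text: {inp}")
        lines.append(f"Label: {out}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")   # the model completes from here
    return "\n".join(lines)

examples = [
    ("I loved this movie!", "positive"),
    ("Total waste of time.", "negative"),
]
prompt = build_few_shot_prompt(examples, "An instant classic.")
print(prompt)
```

Ending the prompt mid-pattern ("Label:") nudges the model to continue in the demonstrated format rather than chat freely.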
Instruct the model to show its reasoning steps before the final answer. Magic phrase: "Let's think step by step." Massively improves arithmetic, logic, multi-step reasoning.
Run CoT N times (e.g., 5–20). Collect the N answers. Use the majority vote as the final answer. Addresses LLM stochasticity: more reliable for math/logic.
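The voting step is simple to implement. In this sketch a deliberately noisy sampler stands in for "run CoT with temperature > 0"; the majority vote recovers the most common answer:

```python
import random
from collections import Counter

def self_consistent_answer(sample_fn, n=10):
    """Run a stochastic chain-of-thought sampler n times and return
    the majority-vote answer plus its vote share."""
    answers = [sample_fn() for _ in range(n)]
    (winner, votes), = Counter(answers).most_common(1)
    return winner, votes / n

# Stand-in sampler: right most of the time, occasionally derails.
rng = random.Random(42)
def noisy_cot():
    return "42" if rng.random() < 0.8 else rng.choice(["41", "43"])

answer, share = self_consistent_answer(noisy_cot, n=20)
print(answer, share)
```

Because individual samples can be wrong in different ways while correct reasoning paths tend to converge on the same answer, the vote share also serves as a rough confidence signal.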
The model alternates between Thought → Action → Observation. Actions can call real tools (search, calculator, APIs). Enables grounded, verifiable reasoning with external data.
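A minimal sketch of the ReAct control loop, assuming a scripted stub in place of the LLM (a real system would call a model and parse its Thought/Action output each turn); the tool registry is a plain dict of callables:

```python
# Tools the "agent" may call. The calculator is a hypothetical example
# tool; eval is restricted here but still only suitable for a demo.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

# Scripted stand-in for LLM output: Thought -> Action -> Observation.
SCRIPT = [
    ("Thought", "I need to compute 17 * 23 before answering."),
    ("Action", "calculator: 17 * 23"),
    ("Thought", "The observation gives the product; I can answer now."),
    ("Finish", "17 * 23 = {obs}"),
]

def react_run(script, tools):
    """Replay a ReAct trace: execute Actions with real tools and feed
    each Observation back into the transcript."""
    transcript, last_obs = [], None
    for kind, content in script:
        if kind == "Action":
            tool_name, arg = content.split(":", 1)
            last_obs = tools[tool_name.strip()](arg.strip())
            transcript.append(f"Action: {content}")
            transcript.append(f"Observation: {last_obs}")
        elif kind == "Finish":
            transcript.append("Answer: " + content.format(obs=last_obs))
        else:
            transcript.append(f"{kind}: {content}")
    return transcript

for line in react_run(SCRIPT, TOOLS):
    print(line)
```

The key design point: the Observation comes from a real tool, not from the model, which is what makes the final answer verifiable.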
Output of prompt N becomes input of prompt N+1. Break complex multi-step tasks into a pipeline of simpler prompts. Foundation of modern AI agent pipelines.
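Prompt chaining is function composition over prompts. In this sketch a stub "LLM" returns canned outputs, which is enough to show step N's output flowing into step N+1's template (the step templates are illustrative):

```python
def chain(steps, initial_input, llm):
    """Run prompt templates sequentially: each template is filled with
    the previous step's output before being sent to the model."""
    text = initial_input
    for template in steps:
        text = llm(template.format(input=text))
    return text

# Stub "LLM": canned responses keyed on the prompt prefix.
def fake_llm(prompt):
    if prompt.startswith("Summarize:"):
        return "short summary"
    if prompt.startswith("Translate to French:"):
        return "résumé court"
    return prompt

steps = ["Summarize: {input}", "Translate to French: {input}"]
print(chain(steps, "a very long article ...", fake_llm))
```

Breaking a task into small, single-purpose prompts like this makes each step easier to test and debug than one monolithic prompt.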
Generalizes CoT: generate multiple candidate reasoning branches, evaluate each, explore the most promising paths via BFS/DFS/beam search. Published by Princeton/Google (2023).
In production systems, you rarely want free text; you need structured, parseable output that your code can reliably process. Modern LLMs support JSON mode and structured generation (constrained decoding) to guarantee output format compliance.
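Even with JSON mode, production code should validate what comes back. A minimal parse-and-validate sketch (the schema keys and the fence-stripping heuristic are illustrative assumptions, not a standard):

```python
import json

SCHEMA_KEYS = {"sentiment", "confidence"}  # hypothetical expected schema

def parse_structured(raw: str) -> dict:
    """Parse and validate an LLM response that was asked for JSON.
    Strips markdown fences (a common failure mode) before json.loads."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        cleaned = cleaned.removeprefix("json").strip()
    data = json.loads(cleaned)
    missing = SCHEMA_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

raw_response = '```json\n{"sentiment": "positive", "confidence": 0.93}\n```'
print(parse_structured(raw_response))
```

In practice you would wrap the `json.loads` in a retry loop that re-prompts the model with the parse error appended.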
Imagine a brilliant professor who graduated in 2023 and has read everything published before then. Ask them about events in 2025 and they'll confidently make things up: this is hallucination. Now imagine the same professor, but before answering, they can search the internet, pull 5 relevant articles, and read them first. That's RAG. It gives the frozen LLM access to current, specific, domain-relevant knowledge at inference time, without retraining the model at all.
| Problem | Without RAG | With RAG |
|---|---|---|
| Knowledge cutoff | Frozen at training date | Real-time retrieval from updated docs |
| Hallucination | Generates plausible but false facts | Grounds answers in retrieved evidence |
| Domain specificity | Generic training data, weak on niche domains | Retrieves from your proprietary/domain corpus |
| Verifiability | No source citations | Cites retrieved passages |
| Cost | Fine-tuning = expensive, slow | No model retraining needed |
RAG has two distinct phases: Indexing (offline, one-time) and Querying (online, per request).
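The two phases can be sketched end-to-end with a toy bag-of-words embedding (a real system would use a sentence-embedding model — the same one for both documents and queries — and a vector database instead of a list; the HR-policy chunks are made-up examples):

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts, standing in for a real
    sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# --- Indexing (offline, one-time): embed chunks once, store them ---
chunks = [
    "Employees accrue 25 vacation days per year.",
    "The dress code is business casual on weekdays.",
    "Remote work is allowed up to three days per week.",
]
index = [(c, embed(c)) for c in chunks]

# --- Querying (online, per request): embed query, retrieve, augment ---
def rag_prompt(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda ce: cosine(q, ce[1]), reverse=True)
    context = "\n".join(c for c, _ in ranked[:k])
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

print(rag_prompt("How many vacation days do I get?"))
```

Note the asymmetry: indexing cost is paid once per document, while querying must stay fast because it runs on every request.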
If your chunks are too large, you retrieve noisy context that distracts the LLM. Too small, and you lose the surrounding context needed to understand the passage. The right chunk strategy depends on your document type and query patterns.
| Strategy | How | Best For | Drawback |
|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple, fast baseline | May split mid-sentence, losing coherence |
| Sentence-based | Split on sentence boundaries | Short factual queries | Some sentences need surrounding context |
| Semantic | Split where topic changes (embedding similarity drops) | Long, topic-diverse docs | Computationally expensive |
| Hierarchical | Small chunks for retrieval, large parent chunks for generation | Complex documents | More complex indexing pipeline |
| Agentic | Agent decides retrieval strategy dynamically | Multi-hop questions | Requires agentic framework |
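The fixed-size baseline from the table is a few lines of code. A sketch over a pre-tokenized list (the 200/50 sizes are illustrative defaults; the overlap ensures a sentence split at a boundary still appears whole in at least one chunk):

```python
def chunk_fixed(tokens, size=200, overlap=50):
    """Fixed-size chunking: windows of `size` tokens, with `overlap`
    tokens shared between consecutive chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"tok{i}" for i in range(500)]
chunks = chunk_fixed(tokens, size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```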
Generate multiple reformulations of the query → retrieve for each → fuse results via Reciprocal Rank Fusion (RRF). Dramatically improves recall by covering different phrasings of the same intent.
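The fusion step is easy to implement: each document's score is the sum of 1/(k + rank) over the ranked lists it appears in, so documents that rank well across several reformulations rise to the top. A sketch with made-up document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists via RRF. k=60 is the constant from the
    original RRF paper; rank is 1-based."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results of retrieving with three reformulations of the same query:
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
print(reciprocal_rank_fusion(rankings))
```

Here `doc_b` wins despite never being unanimous, because consistent mid-to-top placement across lists beats a single first-place finish.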
Hypothetical Document Embedding: Ask the LLM to generate a hypothetical answer first → embed that answer → use it to retrieve real documents. Bridges the gap between short queries and long passages.
Instead of a single retrieval step, an agent decides when and what to retrieve, can perform multiple retrieval rounds, call APIs, synthesize across sources. State-of-the-art for complex Q&A.
Malicious content in retrieved documents can override system instructions. Mitigation: input sanitization, sandboxed retrieval, strict system prompt boundaries.
If the knowledge base contains personal data, retrieved context may expose it. Mitigation: PII detection/redaction before indexing, access-controlled vector stores.
Retrieved documents may contain biased or toxic content. Mitigation: content filtering on retrieved chunks before augmentation.
Model ignores retrieved context and falls back to parametric memory. Mitigation: explicit grounding instructions, citation requirements, faithfulness scoring.
| Dimension | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Data needed | None | Documents (unstructured) | Labeled examples (expensive) |
| Knowledge updates | Stuck in prompt | Update vector DB cheaply | Retrain required |
| Factual accuracy | Can hallucinate | Grounded in retrieved docs | Bakes in training data |
| Style/format control | Moderate | Moderate | Best: model learns new patterns |
| Cost | Cheapest | Moderate (vector DB) | Most expensive |
| Latency | Lowest | Moderate (retrieval adds ~100ms) | Same as base model |
| Best use case | Common tasks, quick iteration | Domain Q&A, dynamic knowledge | Custom tone, specialized tasks |
1. What is the key architectural difference between BERT and GPT that determines whether they can be used for generation?
2. In Chain-of-Thought prompting, what does adding "Let's think step by step" accomplish at a technical level?
3. In a RAG system, why is the same embedding model used for BOTH indexing documents AND encoding user queries?
4. A company wants to deploy a QA chatbot for its internal HR policy documents. The documents are updated quarterly. Which approach is MOST appropriate?
How do you know if your LLM is good? "Vibes" don't scale. This section covers a rigorous evaluation framework: from classical metrics with known failure modes, to modern reference-free evaluation with RAGAS and LLM-as-judge paradigms that define the frontier in 2026.
Same limitations as BLEU: surface form, not semantics. Still used as a baseline in summarization papers. Always report ROUGE-1, ROUGE-2, and ROUGE-L together.
Correlates significantly better with human judgment than BLEU/ROUGE. Uses contextual embeddings, so semantically equivalent sentences score high. Standard in 2026 papers alongside ROUGE.
RAGAS (Shahul Es et al., EACL 2024) evaluates RAG systems without requiring human-labeled ground truth. It uses an LLM to judge each dimension, enabling scalable, automated evaluation of production RAG pipelines.
Are all claims in the answer supported by the retrieved context? LLM breaks answer into statements, checks each against chunks.
Is the answer responsive to the query? LLM generates reverse questions from the answer, measures embedding similarity to original query.
Are the relevant chunks ranked higher in retrieved context? Measures proportion of relevant chunks in top-K, weighted by position.
Does retrieved context contain all info needed for the answer? Requires ground truth answer. Checks which answer sentences are attributable to context.
Is the answer factually correct vs. ground truth? Combines semantic similarity + factual overlap (requires reference answer).
What fraction of ground truth entities appear in retrieved context? Entity-level retrieval quality metric.
For production RAG debugging: low Context Precision → improve chunking strategy or reranker. Low Faithfulness → model hallucinating despite good retrieval. Low Answer Relevancy → model answering the wrong question. RAGAS can generate synthetic test sets from your documents, enabling continuous evaluation without human labelers.
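To make the Faithfulness idea concrete, here is a crude word-overlap proxy — this is NOT how RAGAS computes it (RAGAS uses an LLM to break the answer into statements and verify each against the context), only an illustration of the metric's shape; the context and statements are made-up examples:

```python
def faithfulness_proxy(answer_statements, context):
    """Fraction of answer statements whose content words (length > 3)
    all appear in the retrieved context. A toy stand-in for the
    LLM-based statement verification RAGAS actually performs."""
    ctx_words = set(context.lower().split())
    supported = 0
    for stmt in answer_statements:
        words = {w for w in stmt.lower().split() if len(w) > 3}
        if words and words <= ctx_words:
            supported += 1
    return supported / len(answer_statements)

context = "the eiffel tower is 330 metres tall and located in paris"
statements = [
    "the eiffel tower is located in paris",   # supported by context
    "the eiffel tower opened in 1889",        # not in context
]
print(faithfulness_proxy(statements, context))
```

A score below 1.0 flags statements the context cannot support, which is exactly the hallucination signal the real metric surfaces.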
Use a strong LLM (GPT-4) to score responses. Correlates well with human judgments at scale. Known biases to correct for:
| Benchmark | What It Measures |
|---|---|
| GLUE / SuperGLUE | Classic NLU tasks (NLI, QA, similarity) |
| MMLU (57 subjects) | World knowledge, professional-level QA |
| BIG-Bench Hard | Reasoning tasks where models fail |
| HELM | Holistic: accuracy + calibration + fairness + efficiency |
| MT-Bench | Multi-turn instruction following (LLM judge) |
| HumanEval / MBPP | Code generation |
| MATH / GSM8K | Mathematical reasoning |