🤖

What is Generative AI?

M12T1L1 · The shift from classification to generation

Every model we've seen so far (Naive Bayes, SVM, LSTM, BERT) takes input text and classifies it: they answer yes/no or pick a label. Generative AI does something far more open-ended: given the sentence "The weather today is", it must predict what comes next. After outputting "sunny", it must predict the word after that, and so on. This is a completely different game: instead of one decision, the model makes hundreds of decisions sequentially, each conditioned on everything before it.

Core Idea & Why It Matters

What

Generative AI is a class of AI models trained to create content (text, images, code, audio, video) based on patterns learned from massive datasets. Unlike discriminative models (which draw a boundary between classes), generative models learn the underlying data distribution and can sample from it.

Why This Approach

Before generative models, every NLP task needed its own labeled dataset and custom classifier. Generative LLMs are trained on unsupervised text prediction (no labels needed; the next word is always the label). This allows training on trillions of tokens. The resulting model captures such rich language understanding that it can be adapted to almost any task via prompting alone, without retraining.

When To Use

Generative models shine when the output space is open-ended: summarization, question answering, code generation, creative writing, translation, chatbots. They're overkill (and expensive) if you only need a simple label like "spam vs. not spam" with a small dataset.

๐Ÿญ Proprietary Models

  • GPT-4o / GPT-4 โ€” OpenAI
  • Claude 3/4 โ€” Anthropic
  • Gemini 1.5/2 โ€” Google DeepMind

๐Ÿ”“ Open Source Models

  • Llama 3 (8B, 70B, 405B) โ€” Meta
  • Mistral 7B / Mixtral 8ร—7B โ€” Mistral AI
  • Qwen 2.5 / DeepSeek-V3 โ€” Alibaba/DeepSeek

How LLMs Generate Text: The Decoder Loop DEEP DIVE

How โ€” Step by Step

Generation is an autoregressive process: each token is generated one at a time, and each new token is conditioned on all previously generated tokens. The loop runs until a special end-of-sequence token is produced or a maximum length is reached.

๐Ÿ“ Input: User Prompt โ†’ Tokenized to integer IDs
e.g., "How are you?" โ†’ [5299, 389, 345, 30]
โ†“
๐Ÿ”ข Token Embedding: Each ID โ†’ dense vector (e.g., 4096-dim)
โ†“
๐Ÿ“ + Positional Encoding: RoPE or sinusoidal, encodes token position
โ†“
๐Ÿ”€ Masked Self-Attention: Each token attends only to PAST tokens (causal)
โ†“
โš™๏ธ Feed-Forward Network: Non-linear transformation per position
โ†“ (repeat N times = N Transformer blocks)
๐Ÿ“Š Output Layer (LM Head): Linear โ†’ Softmax โ†’ Vocabulary probability distribution
โ†“
โœ… Token Selection: argmax / sampling / beam search โ†’ next token
โ†“ (append token, loop back)
🔑 Key Insight: Why "Masked" Attention? During training, we show the model the whole sentence at once (efficient). But generation must be causal: token 5 cannot see token 6. Masking future tokens during training forces the model to learn to predict from left context only. This is the crucial architectural difference between BERT (sees both directions) and GPT (sees only the left context).
✅ Autoregressive vs. Non-Autoregressive Generation: GPT-style models generate left-to-right, token by token (autoregressive). Research has explored non-autoregressive models that generate all tokens in parallel for speed, but quality is generally lower because each token is generated independently of the others in the same pass.
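The decoder loop above can be sketched in a few lines of Python. This is a toy illustration, not a real LLM: `next_token_logits` is a hypothetical stand-in for a Transformer forward pass that deterministically favors one fixed continuation, and the six-word vocabulary is invented for the example.

```python
# Toy sketch of the autoregressive decoder loop.
VOCAB = ["<eos>", "the", "weather", "today", "is", "sunny"]

def next_token_logits(context):
    """Hypothetical 'model': favors a fixed continuation, then <eos>."""
    continuation = [1, 2, 3, 4, 5]   # ids for "the weather today is sunny"
    step = len(context)
    target = continuation[step] if step < len(continuation) else 0  # 0 = <eos>
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]

def generate(prompt_ids, max_new_tokens=10):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)                      # forward pass
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        if next_id == 0:                                     # stop at <eos>
            break
        ids.append(next_id)                                  # append, loop back
    return ids

out = generate([1])                       # start from "the"
print(" ".join(VOCAB[i] for i in out))    # → "the weather today is sunny"
```

Swapping the `argmax` line for sampling from the softmax of the logits gives the stochastic decoding discussed in the temperature/top-p section.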

Training LLMs: Language Modeling vs. Instruction Fine-Tuning DEEP DIVE

Stage | Technique | Goal | Data
Pre-training | Next-token prediction (CLM) | Learn language, world knowledge | Trillions of tokens, unsupervised
Pre-training | Masked Language Model (MLM, BERT-style) | Bidirectional context understanding | Large unlabeled corpus
Post-training | Supervised Fine-Tuning (SFT) | Follow instructions | ~10K–100K curated instruction-response pairs
Alignment | RLHF (Reinforcement Learning from Human Feedback) | Helpful, harmless, honest outputs | Human preference rankings
Alignment | DPO (Direct Preference Optimization) | Same as RLHF, more stable/efficient | Preference pairs (chosen vs. rejected)
💡 Why RLHF → DPO? RLHF requires training a separate reward model and then running PPO (a complex RL algorithm), which is unstable and resource-intensive. DPO (Rafailov et al., 2023) showed you can directly optimize on preference data using a classification objective, achieving similar or better alignment without the RL training loop. Most open-source models (Llama-3-Instruct, Mistral-Instruct) now use DPO variants.
# DPO objective (simplified intuition):
# maximize the probability of the chosen response OVER the rejected response,
# relative to a reference (pre-DPO SFT) model
L_DPO = -E[ log σ( β·log(π_θ(y_w|x) / π_ref(y_w|x)) - β·log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

where:
  y_w   = chosen (preferred) response
  y_l   = rejected response
  π_θ   = model being trained
  π_ref = reference model (frozen SFT model)
  β     = temperature controlling deviation from the reference
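The loss above can be computed directly once you have the summed token log-probabilities of each response under both models. A minimal numeric sketch, with made-up log-prob values purely for illustration:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid of the scaled margin between policy and reference log-ratios."""
    r_chosen = logp_chosen - ref_logp_chosen      # log π_θ/π_ref for y_w
    r_rejected = logp_rejected - ref_logp_rejected  # log π_θ/π_ref for y_l
    margin = beta * (r_chosen - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy already prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2):
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-9.0,
                ref_logp_chosen=-6.0, ref_logp_rejected=-7.0)
```

At a margin of zero (policy identical to the reference) the loss is exactly log 2; training pushes the margin positive.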

🎛️ Interactive: Temperature & Top-p Sampling

Adjust the sliders to understand how temperature and top-p affect text diversity. These are the core decoding hyperparameters you set when calling any LLM API.
Prompt: "The weather in Atlanta today is..."
Temperature = 0.7, Top-p = 0.9 → balanced: coherent but varied
Example: "The weather in Atlanta today is warm and partly cloudy, with a high of 82°F and a slight chance of afternoon showers."
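Under the hood these two knobs are simple operations on the model's output distribution. A minimal sketch with made-up logits: low temperature sharpens the distribution toward argmax, and top-p keeps only the smallest set of tokens whose cumulative probability reaches p.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}     # renormalized nucleus

logits = [2.0, 1.0, 0.1, -1.0]                     # toy vocabulary of 4 tokens
sharp = softmax(logits, temperature=0.2)           # low T -> nearly one-hot
flat = softmax(logits, temperature=2.0)            # high T -> closer to uniform
nucleus = top_p_filter(softmax(logits), p=0.9)     # drops the low-probability tail
```

Sampling then draws the next token from `nucleus` instead of the full distribution, which is exactly what an API call with `temperature` and `top_p` set does internally.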
🧠

Prompt Engineering

M12T1L2 · Communicating with LLMs effectively

Imagine you hired a brilliant consultant who has read every book ever written. If you ask them "what do you think?", you get a vague answer. But if you say "You are a senior data scientist. Here are 3 examples of similar problems and their solutions. Now, think step-by-step, and tell me how to detect anomalies in sensor data using isolation forests", you get a masterpiece. That's prompt engineering: not changing the consultant's knowledge, just learning how to talk to them.

Anatomy of a Prompt

What

A well-designed prompt has four components that together specify exactly what you want from the model:

# COMPONENT 1: SYSTEM / ROLE
You are a senior NLP researcher at Georgia Tech. Respond with technical precision appropriate for a PhD audience.

# COMPONENT 2: CONTEXT (relevant background)
The following text is from a student NLP assignment on sequence labeling using CRFs.

# COMPONENT 3: INSTRUCTION + QUERY
Classify the following as: Helpful | Partially Helpful | Not Helpful
Text: "The Viterbi algorithm computes the most likely tag sequence."

# COMPONENT 4: OUTPUT INDICATOR
Respond ONLY with one of: Helpful | Partially Helpful | Not Helpful
Then explain your reasoning in one sentence.
✅ Best Practices for Prompt Design: Be specific about the role. Provide examples when possible. Constrain the output format. Use "Think step by step" for reasoning tasks. Separate components with clear delimiters (###, XML tags, etc.).

Prompting Techniques: From Zero-Shot to Advanced DEEP DIVE

BASIC

Zero-Shot Prompting

Ask the model to perform a task with no examples at all. Relies entirely on pre-trained knowledge. Works well for common tasks. Fails for novel formats.

BASIC

Few-Shot Prompting

Provide 2–10 examples of (input → output) pairs before your actual query. Demonstrates the pattern you want. Dramatically improves performance on custom formats.

INTERMEDIATE

Chain-of-Thought (CoT)

Instruct the model to show its reasoning steps before the final answer. Magic phrase: "Let's think step by step." Massively improves arithmetic, logic, multi-step reasoning.

ADVANCED

Self-Consistency

Run CoT N times (e.g., 5–20). Collect N answers. Use majority vote as the final answer. Addresses LLM stochasticity; more reliable for math/logic.
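The aggregation step is just a majority vote over independently sampled final answers. A minimal sketch, with invented sample answers standing in for N CoT runs:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among N CoT samples."""
    return Counter(answers).most_common(1)[0][0]

# e.g., 5 CoT samples for "3 bags x 5 apples, give away 7":
samples = ["8", "8", "7", "8", "15"]
final = majority_vote(samples)   # "8"
```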

ADVANCED

ReAct (Reason + Act)

The model alternates between Thought → Action → Observation. Actions can call real tools (search, calculator, APIs). Enables grounded, verifiable reasoning with external data.

ADVANCED

Prompt Chaining

Output of prompt N becomes input of prompt N+1. Break complex multi-step tasks into a pipeline of simpler prompts. Foundation of modern AI agent pipelines.

RESEARCH

Tree-of-Thought (ToT)

Generalizes CoT: generate multiple candidate reasoning branches, evaluate each, explore the most promising paths via BFS/DFS/beam search. Published by Princeton/Google (2023).

📊 When to Use Which Technique (Decision Guide): Zero-shot → simple tasks with common formats. Few-shot → custom formats or less common tasks. CoT → math, logic, multi-step reasoning. Self-consistency → high-stakes CoT tasks. ReAct → tasks needing real-world information lookup. Prompt chaining → complex multi-stage workflows. ToT → extremely hard combinatorial reasoning (planning, games).

🔬 Interactive: Standard vs. Chain-of-Thought

Click each approach to see how the LLM response changes for an arithmetic reasoning problem.
Question: "John has 3 bags. Each bag has 5 apples. He gives away 7 apples total. How many does he have?"
STANDARD PROMPT response: → "8" (sometimes wrong: the correct work is 3×5 = 15, 15 − 7 = 8, but the model may output "7")
Problem: the model jumps to the answer without showing its work.

Advanced: Structured Output & JSON Mode 2025 PRACTICE

Why This Matters

In production systems, you rarely want free text; you need structured, parseable output that your code can reliably process. Modern LLMs support JSON mode and structured generation (constrained decoding) to guarantee output format compliance.

# Example: constrained JSON output via system prompt
System: You extract entities from text. Respond ONLY with valid JSON matching this schema:
{"entities": [{"text": str, "type": "PERSON|ORG|LOC", "confidence": float}]}

User: "Dr. Smith from Georgia Tech arrived in Atlanta."

Model output:
{
  "entities": [
    {"text": "Dr. Smith", "type": "PERSON", "confidence": 0.97},
    {"text": "Georgia Tech", "type": "ORG", "confidence": 0.99},
    {"text": "Atlanta", "type": "LOC", "confidence": 0.98}
  ]
}
✅ Libraries for Structured Output (2025): Instructor (Python), Outlines, Guidance, Pydantic + OpenAI function calling. These frameworks enforce schema compliance at the token-sampling level, making output predictable for downstream code.
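Even without those libraries, the contract can be enforced after the fact by parsing and validating the model's raw string. A stdlib-only sketch of validating output against the entity schema above (real pipelines typically reach for Pydantic or Instructor, which also handle retries):

```python
import json

ALLOWED_TYPES = {"PERSON", "ORG", "LOC"}

def parse_entities(raw: str):
    """Parse a model's JSON reply and check it against the expected schema."""
    data = json.loads(raw)                 # raises ValueError on invalid JSON
    entities = data["entities"]            # raises KeyError if key is missing
    for e in entities:
        assert isinstance(e["text"], str)
        assert e["type"] in ALLOWED_TYPES
        assert 0.0 <= float(e["confidence"]) <= 1.0
    return entities

model_output = '{"entities": [{"text": "Atlanta", "type": "LOC", "confidence": 0.98}]}'
entities = parse_entities(model_output)
```

On failure, a production system would typically re-prompt the model with the error message rather than crash.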
🔍

Retrieval-Augmented Generation (RAG)

M12T1L3 · Giving LLMs real-world, up-to-date knowledge

Imagine a brilliant professor who graduated in 2023 and has read everything published before then. Ask them about events in 2025 and they'll confidently make things up; this is hallucination. Now imagine the same professor, but before answering, they can search the internet, pull 5 relevant articles, and read them first. That's RAG. It gives the frozen LLM access to current, specific, domain-relevant knowledge at inference time, without retraining the model at all.

Why RAG? The Limitations of Base LLMs

Problem | Without RAG | With RAG
Knowledge cutoff | Frozen at training date | Real-time retrieval from updated docs
Hallucination | Generates plausible but false facts | Grounds answers in retrieved evidence
Domain specificity | Generic training data, weak on niche domains | Retrieves from your proprietary/domain corpus
Verifiability | No source citations | Cites retrieved passages
Cost | Fine-tuning is expensive and slow | No model retraining needed

RAG Architecture: Full Pipeline DEEP DIVE


RAG has two distinct phases: Indexing (offline, one-time) and Querying (online, per request).

📦 Phase 1: Indexing (Offline)

📄 Raw Documents
→
✂️ Chunk & Split (e.g., 512 tokens)
→
🔢 Embedding Model (text → vector)
→
🗄️ Vector Database (Pinecone, Chroma, Weaviate)

🔎 Phase 2: Querying (Online, per user request)

❓ User Query
→
🔢 Embed Query (same embedding model)
→
🔍 Semantic Search (cosine similarity, top-k)
→
📎 Augment Prompt (query + retrieved chunks)
→
🤖 LLM Generation
→
✅ Grounded Answer
# Augmented prompt structure sent to the LLM:
System: Answer using ONLY the provided context. If the answer isn't in the context, say "I don't know."

Context:
[Retrieved Chunk 1]: "RAG was introduced by Lewis et al., 2020 at Meta AI..."
[Retrieved Chunk 2]: "Vector similarity search uses cosine distance..."
[Retrieved Chunk 3]: "Chunking strategy affects retrieval quality..."

User Query: When was RAG introduced and by whom?

# The LLM response is now grounded in the retrieved context
# → citable, verifiable, domain-specific
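The querying phase can be sketched end to end in a few lines. Here `embed` is a toy bag-of-words stand-in for a real embedding model (a real system would call the same neural embedder used at indexing time), and the chunks echo the example above:

```python
import math

def embed(text, vocab):
    """Toy embedding: count of each vocabulary word in the text."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, vocab, k=2):
    """Rank chunks by cosine similarity to the embedded query, keep top-k."""
    q = embed(query, vocab)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c, vocab)), reverse=True)
    return ranked[:k]

chunks = [
    "rag was introduced by lewis et al 2020 at meta ai",
    "vector similarity search uses cosine distance",
    "chunking strategy affects retrieval quality",
]
vocab = sorted({w for c in chunks for w in c.split()})
top = retrieve("when was rag introduced", chunks, vocab, k=2)
prompt = "Context:\n" + "\n".join(top) + "\n\nQuery: when was rag introduced"
```

The only change a production system makes to this shape is swapping `embed` for a neural model and the `sorted` scan for an approximate nearest-neighbor index.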

Chunking Strategies: The Most Underrated RAG Decision DEEP DIVE

Why Chunking Matters

If your chunks are too large, you retrieve noisy context that distracts the LLM. Too small, and you lose the surrounding context needed to understand the passage. The right chunk strategy depends on your document type and query patterns.

Strategy | How | Best For | Drawback
Fixed-size | Split every N tokens with overlap | Simple, fast baseline | May split mid-sentence, losing coherence
Sentence-based | Split on sentence boundaries | Short factual queries | Some sentences need surrounding context
Semantic | Split where topic changes (embedding similarity drops) | Long, topic-diverse docs | Computationally expensive
Hierarchical | Small chunks for retrieval, large parent chunks for generation | Complex documents | More complex indexing pipeline
Agentic | Agent decides retrieval strategy dynamically | Multi-hop questions | Requires agentic framework
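The fixed-size strategy from the table is easy to make concrete. A minimal sketch of overlapped fixed-size chunking over a token list (512/64 would be realistic values; tiny numbers are used here so the result is visible):

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into chunks of `size` tokens, with `overlap`
    tokens shared between consecutive chunks so no context is cut cold."""
    assert 0 <= overlap < size
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # last chunk reached the end
            break
    return chunks

tokens = list(range(10))
chunks = chunk_fixed(tokens, size=4, overlap=1)
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```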

Advanced RAG Patterns BEYOND CLASS


🔄 RAG-Fusion

Generate multiple reformulations of the query → retrieve for each → fuse results via Reciprocal Rank Fusion (RRF). Dramatically improves recall by covering different phrasings of the same intent.
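RRF itself is one line per document: each ranked list contributes 1/(k + rank) for every document it contains, with k typically around 60. A minimal sketch over invented doc ids:

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists (best first) by Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query reformulations produced three ranked result lists:
fused = rrf([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
])
# "d2" wins: two first-place and one second-place appearance
```

Documents that appear near the top of several lists accumulate the highest fused score, which is why RRF is robust to any single bad retrieval.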

💡 HyDE

Hypothetical Document Embedding: ask the LLM to generate a hypothetical answer first → embed that answer → use it to retrieve real documents. Bridges the gap between short queries and long passages.

🤖 Agentic RAG

Instead of a single retrieval step, an agent decides when and what to retrieve, can perform multiple retrieval rounds, call APIs, synthesize across sources. State-of-the-art for complex Q&A.

🔬 Research: RAG vs. Fine-Tuning vs. Long Context. Three ways to give LLMs specific knowledge: (1) RAG: retrieve at inference time; (2) fine-tuning: bake knowledge into the weights; (3) long context: stuff everything into the prompt. In 2025, the consensus is: RAG for dynamic/large knowledge, fine-tuning for style/format, long context for small but crucial documents. Many production systems use all three.

RAG Evaluation Metrics


๐Ÿ” Retrieval Quality

  • Precision@k: fraction of retrieved chunks that are relevant
  • Recall@k: fraction of relevant chunks retrieved
  • MRR: Mean Reciprocal Rank; how early the first relevant chunk appears
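The three retrieval metrics above are a few lines each. A minimal sketch, with invented chunk ids ranked best first:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks found in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant chunk (0 if none retrieved)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["c4", "c1", "c9", "c2"]   # ranked best first
relevant = {"c1", "c2", "c3"}          # ground-truth relevant chunks
p = precision_at_k(retrieved, relevant, k=4)   # 2 of 4 → 0.5
r = recall_at_k(retrieved, relevant, k=4)      # 2 of 3 → 0.667
first = mrr(retrieved, relevant)               # first hit at rank 2 → 0.5
```

In practice MRR is averaged over a query set; the single-query version shown here is the reciprocal-rank term being averaged.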

๐Ÿ“ Generation Quality

  • Faithfulness: does the answer only use retrieved context?
  • Answer Relevance: cosine similarity vs. expected answer
  • Hallucination Rate: facts not supported by context

⚡ System Performance

  • Latency: end-to-end response time
  • Cost per query: embedding + vector search + LLM tokens
  • LLM-as-judge: a stronger LLM evaluates a weaker LLM's output

RAG Governance & Safety


โš ๏ธ Prompt Injection

Malicious content in retrieved documents can override system instructions. Mitigation: input sanitization, sandboxed retrieval, strict system prompt boundaries.

🔒 PII & Data Privacy

If the knowledge base contains personal data, retrieved context may expose it. Mitigation: PII detection/redaction before indexing, access-controlled vector stores.

🚫 Toxicity & Bias

Retrieved documents may contain biased or toxic content. Mitigation: content filtering on retrieved chunks before augmentation.

🎯 Grounding Failures

Model ignores retrieved context and falls back to parametric memory. Mitigation: explicit grounding instructions, citation requirements, faithfulness scoring.

RAG vs. Fine-Tuning vs. Prompt Engineering: Decision Framework BEYOND CLASS

Dimension | Prompt Engineering | RAG | Fine-Tuning
Data needed | None | Documents (unstructured) | Labeled examples (expensive)
Knowledge updates | Stuck in prompt | Update vector DB cheaply | Retraining required
Factual accuracy | Can hallucinate | Grounded in retrieved docs | Bakes in training data
Style/format control | Moderate | Moderate | Best: model learns new patterns
Cost | Cheapest | Moderate (vector DB) | Most expensive
Latency | Lowest | Moderate (retrieval adds ~100 ms) | Same as base model
Best use case | Common tasks, quick iteration | Domain Q&A, dynamic knowledge | Custom tone, specialized tasks

🧠 Quiz 10 Prep: Generative AI, Prompting & RAG

1. What is the key architectural difference between BERT and GPT that determines whether they can be used for generation?

✅ Correct! GPT's causal masking means each token can only attend to previous tokens, enabling autoregressive generation. BERT sees the full context bidirectionally, making it excellent for classification but unable to generate coherently without modification.
❌ Review the decoder architecture. The key difference is masking: GPT masks future tokens during self-attention, enabling left-to-right generation.

2. In Chain-of-Thought prompting, what does adding "Let's think step by step" accomplish at a technical level?

✅ Correct! The reasoning tokens generated before the final answer sit in the context window and directly influence what the model predicts as the answer. The model uses its own intermediate thoughts as additional context.
❌ CoT is all about the tokens in context. The intermediate reasoning steps the model generates become part of the context that influences the final answer; no weight changes are needed.

3. In a RAG system, why is the same embedding model used for BOTH indexing documents AND encoding user queries?

✅ Correct! Cosine similarity only makes sense if the vectors live in the same geometric space. If you indexed with model A and queried with model B, the similarity scores would be meaningless; you'd be comparing vectors from two different coordinate systems.
❌ Think about what vector similarity means geometrically. The same embedding model is essential for the vector space to be consistent.

4. A company wants to deploy a QA chatbot for its internal HR policy documents. The documents are updated quarterly. Which approach is MOST appropriate?

✅ Correct! RAG is ideal here: documents update quarterly (no retraining cost), the knowledge is too large for a context window, and grounding answers in retrieved docs reduces hallucination. Update the vector DB when policies change.
❌ Consider: fine-tuning requires expensive retraining each quarter. Long context has limits and high cost. Zero-shot would hallucinate company-specific policies. RAG is the right tool here.
🔬 Beyond the Slides · Graduate Depth

Evaluation Science: BLEU, BERTScore, RAGAS & Beyond

How do you know if your LLM is good? "Vibes" don't scale. This section covers the rigorous evaluation framework that every NLP researcher uses: from classical metrics with known failure modes, to modern reference-free evaluation with RAGAS and LLM-as-judge paradigms that define the frontier in 2026.

๐Ÿ“ BLEU: The Classic You Must Know โ€” And Know Its Limits Core

BLEU Derivation

# Modified n-gram precision:
p_n = Σ_cand Σ_ngram min(count(ngram, cand), max_count_in_refs)
      / Σ_cand count_ngrams(cand)

# Brevity penalty (penalize short outputs), with c = candidate length, r = reference length:
BP = 1               if c > r
   = exp(1 - r/c)    if c ≤ r

# Final BLEU score:
BLEU = BP · exp(Σ_n w_n · log p_n)    (typically n = 1..4, uniform weights)
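The derivation above fits in a short sentence-level implementation. This is a simplified sketch: it uses the shortest reference length for the brevity penalty and returns 0 whenever any n-gram precision is 0, as BLEU's geometric mean does without smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, refs, n):
    """Clipped counts: candidate n-gram counts capped by the max count in any reference."""
    cand_counts = Counter(ngrams(cand, n))
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped, sum(cand_counts.values())

def bleu(cand, refs, max_n=4):
    log_p = 0.0
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(cand, refs, n)
        if clipped == 0:                      # geometric mean collapses to 0
            return 0.0
        log_p += math.log(clipped / total) / max_n   # uniform weights w_n = 1/max_n
    c, r = len(cand), min(len(ref) for ref in refs)  # simplified: shortest reference
    bp = 1.0 if c > r else math.exp(1 - r / c)       # brevity penalty
    return bp * math.exp(log_p)
```

A candidate identical to its reference scores exactly 1.0; a candidate sharing no n-grams scores 0, which is precisely the failure mode discussed below.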

Known Failure Modes

  • Rewards n-gram overlap, not semantic equivalence
  • Semantically equivalent paraphrases with no n-gram overlap (e.g., "The cat sat." vs. "The feline was seated.") score 0 BLEU
  • Doesn't capture fluency or coherence
  • Poor correlation with human judgment on summarization
  • Language-agnostic (treats all tokens equally)
  • Best for: MT (still standard), document-level benchmarks
  • Poor for: open-ended generation, dialogue, summarization

📊 ROUGE, BERTScore & Beyond

ROUGE Family (Summarization)

ROUGE-N = (matched n-grams) / (n-grams in reference)
# Recall-oriented; standard for summarization

ROUGE-1:    unigram overlap
ROUGE-2:    bigram overlap
ROUGE-L:    Longest Common Subsequence
ROUGE-Lsum: sentence-level LCS for summaries

Same limitations as BLEU: surface form, not semantics. Still used as a baseline in summarization papers. Always report ROUGE-1, ROUGE-2, and ROUGE-L together.

BERTScore (2020)

# For each token in the candidate, find the max cosine similarity to any
# token in the reference (using contextual BERT embeddings):
BERTScore_P = (1/|c|) · Σ_i max_j cos(ĉ_i, r_j)
BERTScore_R = (1/|r|) · Σ_j max_i cos(ĉ_i, r_j)
BERTScore_F = 2·P·R / (P + R)
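The greedy matching can be sketched with toy 2-dimensional vectors standing in for contextual BERT embeddings (a real implementation extracts one vector per token from BERT):

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def bertscore(cand_vecs, ref_vecs):
    """Greedy token matching: each token pairs with its best match on the other side."""
    p = sum(max(cos(c, rv) for rv in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    r = sum(max(cos(c, rv) for c in cand_vecs) for rv in ref_vecs) / len(ref_vecs)
    f = 2 * p * r / (p + r)
    return p, r, f

# identical token vectors -> every greedy match is perfect
vecs = [[1.0, 0.0], [0.6, 0.8]]
p, r, f = bertscore(vecs, vecs)
```

Because matching is per-token and soft, a paraphrase whose token vectors sit near the reference's scores high, even with zero n-gram overlap.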

Correlates significantly better with human judgment than BLEU/ROUGE. Uses contextual embeddings โ€” semantically equivalent sentences score high. Standard in 2026 papers alongside ROUGE.

๐Ÿ” RAGAS: Reference-Free RAG Evaluation 2024 Standard

RAGAS (Shahul Es et al., EACL 2024) evaluates RAG systems without requiring human-labeled ground truth. It uses an LLM to judge each dimension, enabling scalable, automated evaluation of production RAG pipelines.

Faithfulness
0–1

Are all claims in the answer supported by the retrieved context? LLM breaks answer into statements, checks each against chunks.

Answer Relevancy
0–1

Is the answer responsive to the query? LLM generates reverse questions from the answer, measures embedding similarity to original query.

Context Precision
0–1

Are the relevant chunks ranked higher in retrieved context? Measures proportion of relevant chunks in top-K, weighted by position.

Context Recall
0–1

Does retrieved context contain all info needed for the answer? Requires ground truth answer. Checks which answer sentences are attributable to context.

Answer Correctness
0–1

Is the answer factually correct vs. ground truth? Combines semantic similarity + factual overlap (requires reference answer).

Context Entity Recall
0–1

What fraction of ground truth entities appear in retrieved context? Entity-level retrieval quality metric.

💡 How to Use RAGAS in Practice

For production RAG debugging: low Context Precision → improve the chunking strategy or add a reranker. Low Faithfulness → the model hallucinates despite good retrieval. Low Answer Relevancy → the model answers the wrong question. RAGAS can also generate synthetic test sets from your documents, enabling continuous evaluation without human labelers.

โš–๏ธ LLM-as-Judge & Evaluation Benchmarks Frontier

LLM-as-Judge (MT-Bench, G-Eval)

Use a strong LLM (GPT-4) to score responses. Correlates well with human judgments at scale. Known biases to correct for:

  • Positional bias: first answer rated higher
  • Verbosity bias: longer answers are perceived as better
  • Self-enhancement: GPT-4 prefers GPT-4 outputs
  • Mitigation: swap order, calibrate with human labels

Benchmark Taxonomy (2026)

Benchmark | What It Measures
GLUE / SuperGLUE | Classic NLU tasks (NLI, QA, similarity)
MMLU (57 subjects) | World knowledge, professional-level QA
BIG-Bench Hard | Reasoning tasks where models fail
HELM | Holistic: accuracy + calibration + fairness + efficiency
MT-Bench | Multi-turn instruction following (LLM judge)
HumanEval / MBPP | Code generation
MATH / GSM8K | Mathematical reasoning