From how LLMs generate text token-by-token, to crafting powerful prompts, to building RAG systems that give models real-world knowledge: this week bridges theory and modern practice.
Every model we've seen so far (Naive Bayes, SVM, LSTM, BERT) takes input text and classifies it. Each answers yes/no or picks a label. Generative AI does something far more open-ended: given the sentence "The weather today is", it must predict what comes next. And then after outputting "sunny", it must predict the word after that, and so on. This is a completely different game: instead of one decision, the model makes hundreds of decisions sequentially, each one conditioned on everything before it.
Generative AI is a class of AI models trained to create content (text, images, code, audio, video) based on patterns learned from massive datasets. Unlike discriminative models (which draw a boundary between classes), generative models learn the underlying data distribution and can sample from it.
Before generative models, every NLP task needed its own labeled dataset and custom classifier. Generative LLMs are trained with unsupervised next-word prediction (no labels needed; the next word is always the label). This allows training on trillions of tokens. The resulting model captures such rich language understanding that it can be adapted to almost any task via prompting alone, without retraining.
Generative models shine when the output space is open-ended: summarization, question answering, code generation, creative writing, translation, chatbots. They're overkill (and expensive) if you only need a simple label like "spam vs. not spam" with a small dataset.
Generation is an autoregressive process: each token is generated one at a time, and each new token is conditioned on all previously generated tokens. The loop runs until a special end-of-sequence token is produced or a maximum length is reached.
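The decoding loop above can be sketched with a toy stand-in for the model. Here a tiny bigram table plays the role of the LLM (a real model conditions on the full prefix, not just the last token); the loop structure — sample, append, check for end-of-sequence — is the part that matches real autoregressive generation.

```python
import random

# Toy "language model": maps the last token to possible next tokens with
# probabilities. A stand-in so the decoding loop itself is runnable.
BIGRAMS = {
    "<s>": [("The", 1.0)],
    "The": [("weather", 1.0)],
    "weather": [("today", 1.0)],
    "today": [("is", 1.0)],
    "is": [("sunny", 0.7), ("rainy", 0.3)],
    "sunny": [("<eos>", 1.0)],
    "rainy": [("<eos>", 1.0)],
}

def generate(max_tokens=10, seed=0):
    """Autoregressive decoding: sample one token at a time, each
    conditioned on what was generated so far, until <eos> or max length."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        words, probs = zip(*BIGRAMS[tokens[-1]])
        next_tok = rng.choices(words, weights=probs, k=1)[0]
        if next_tok == "<eos>":   # end-of-sequence token stops the loop
            break
        tokens.append(next_tok)
    return " ".join(tokens[1:])

print(generate())
```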
| Stage | Technique | Goal | Data |
|---|---|---|---|
| Pre-training | Next-token prediction (CLM) | Learn language, world knowledge | Trillions of tokens, unsupervised |
| Pre-training | Masked Language Model (MLM, BERT-style) | Bidirectional context understanding | Large unlabeled corpus |
| Post-training | Supervised Fine-Tuning (SFT) | Follow instructions | ~10K–100K curated instruction-response pairs |
| Alignment | RLHF (Reinforcement Learning from Human Feedback) | Helpful, harmless, honest outputs | Human preference rankings |
| Alignment | DPO (Direct Preference Optimization) | Same as RLHF, more stable/efficient | Preference pairs (chosen vs. rejected) |
Imagine you hired a brilliant consultant who has read every book ever written. Now, if you ask them "what do you think?", you get a vague answer. But if you say "You are a senior data scientist. Here are 3 examples of similar problems and their solutions. Now, think step-by-step, and tell me how to detect anomalies in sensor data using isolation forests", you get a masterpiece. That's prompt engineering: not changing the consultant's knowledge, just learning how to talk to them.
A well-designed prompt has four components that together specify exactly what you want from the model:
Ask the model to perform a task with no examples at all. Relies entirely on pre-trained knowledge. Works well for common tasks. Fails for novel formats.
Provide 2–10 examples of (input → output) pairs before your actual query. Demonstrates the pattern you want. Dramatically improves performance on custom formats.
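Few-shot prompting is ultimately just careful string assembly. A minimal sketch (the task wording and `Text:`/`Label:` format are illustrative choices, not a fixed convention):

```python
def build_few_shot_prompt(examples, query,
                          task="Classify the sentiment as positive or negative."):
    """Assemble a few-shot prompt: instruction, then (input -> output)
    demonstration pairs, then the actual query left for the model."""
    lines = [task, ""]
    for inp, out in examples:
        lines.append(f"Text: {inp}")
        lines.append(f"Label: {out}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Label:")   # the model completes from here
    return "\n".join(lines)

examples = [
    ("I loved this movie!", "positive"),
    ("Total waste of time.", "negative"),
]
prompt = build_few_shot_prompt(examples, "An instant classic.")
print(prompt)
```

Ending the prompt mid-pattern ("Label:") nudges the model to continue in the demonstrated format rather than chat freely.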
Instruct the model to show its reasoning steps before the final answer. Magic phrase: "Let's think step by step." Massively improves arithmetic, logic, multi-step reasoning.
Run CoT N times (e.g., 5–20). Collect the N answers. Use the majority vote as the final answer. Addresses LLM stochasticity: more reliable for math/logic.
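The voting step is simple to implement. In this sketch a deliberately noisy sampler stands in for "run CoT with temperature > 0"; the majority vote recovers the most common answer:

```python
import random
from collections import Counter

def self_consistent_answer(sample_fn, n=10):
    """Run a stochastic chain-of-thought sampler n times and return
    the majority-vote answer plus its vote share."""
    answers = [sample_fn() for _ in range(n)]
    (winner, votes), = Counter(answers).most_common(1)
    return winner, votes / n

# Stand-in sampler: right most of the time, occasionally derails.
rng = random.Random(42)
def noisy_cot():
    return "42" if rng.random() < 0.8 else rng.choice(["41", "43"])

answer, share = self_consistent_answer(noisy_cot, n=20)
print(answer, share)
```

Because individual samples can be wrong in different ways while correct reasoning paths tend to converge on the same answer, the vote share also serves as a rough confidence signal.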
The model alternates between Thought → Action → Observation. Actions can call real tools (search, calculator, APIs). Enables grounded, verifiable reasoning with external data.
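A minimal sketch of the ReAct control loop, assuming a scripted stub in place of the LLM (a real system would call a model and parse its Thought/Action output each turn); the tool registry is a plain dict of callables:

```python
# Tools the "agent" may call. The calculator is a hypothetical example
# tool; eval is restricted here but still only suitable for a demo.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

# Scripted stand-in for LLM output: Thought -> Action -> Observation.
SCRIPT = [
    ("Thought", "I need to compute 17 * 23 before answering."),
    ("Action", "calculator: 17 * 23"),
    ("Thought", "The observation gives the product; I can answer now."),
    ("Finish", "17 * 23 = {obs}"),
]

def react_run(script, tools):
    """Replay a ReAct trace: execute Actions with real tools and feed
    each Observation back into the transcript."""
    transcript, last_obs = [], None
    for kind, content in script:
        if kind == "Action":
            tool_name, arg = content.split(":", 1)
            last_obs = tools[tool_name.strip()](arg.strip())
            transcript.append(f"Action: {content}")
            transcript.append(f"Observation: {last_obs}")
        elif kind == "Finish":
            transcript.append("Answer: " + content.format(obs=last_obs))
        else:
            transcript.append(f"{kind}: {content}")
    return transcript

for line in react_run(SCRIPT, TOOLS):
    print(line)
```

The key design point: the Observation comes from a real tool, not from the model, which is what makes the final answer verifiable.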
Output of prompt N becomes input of prompt N+1. Break complex multi-step tasks into a pipeline of simpler prompts. Foundation of modern AI agent pipelines.
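Prompt chaining is function composition over prompts. In this sketch a stub "LLM" returns canned outputs, which is enough to show step N's output flowing into step N+1's template (the step templates are illustrative):

```python
def chain(steps, initial_input, llm):
    """Run prompt templates sequentially: each template is filled with
    the previous step's output before being sent to the model."""
    text = initial_input
    for template in steps:
        text = llm(template.format(input=text))
    return text

# Stub "LLM": canned responses keyed on the prompt prefix.
def fake_llm(prompt):
    if prompt.startswith("Summarize:"):
        return "short summary"
    if prompt.startswith("Translate to French:"):
        return "résumé court"
    return prompt

steps = ["Summarize: {input}", "Translate to French: {input}"]
print(chain(steps, "a very long article ...", fake_llm))
```

Breaking a task into small, single-purpose prompts like this makes each step easier to test and debug than one monolithic prompt.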
Generalizes CoT: generate multiple candidate reasoning branches, evaluate each, explore the most promising paths via BFS/DFS/beam search. Published by Princeton/Google (2023).
In production systems, you rarely want free text; you need structured, parseable output that your code can reliably process. Modern LLMs support JSON mode and structured generation (constrained decoding) to guarantee output format compliance.
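Even with JSON mode, production code should validate what comes back. A minimal parse-and-validate sketch (the schema keys and the fence-stripping heuristic are illustrative assumptions, not a standard):

```python
import json

SCHEMA_KEYS = {"sentiment", "confidence"}  # hypothetical expected schema

def parse_structured(raw: str) -> dict:
    """Parse and validate an LLM response that was asked for JSON.
    Strips markdown fences (a common failure mode) before json.loads."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        cleaned = cleaned.removeprefix("json").strip()
    data = json.loads(cleaned)
    missing = SCHEMA_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

raw_response = '```json\n{"sentiment": "positive", "confidence": 0.93}\n```'
print(parse_structured(raw_response))
```

In practice you would wrap the `json.loads` in a retry loop that re-prompts the model with the parse error appended.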
Imagine a brilliant professor who graduated in 2023 and has read everything published before then. Ask them about events in 2025 and they'll confidently make things up: this is hallucination. Now imagine the same professor, but before answering, they can search the internet, pull 5 relevant articles, and read them first. That's RAG. It gives the frozen LLM access to current, specific, domain-relevant knowledge at inference time, without retraining the model at all.
| Problem | Without RAG | With RAG |
|---|---|---|
| Knowledge cutoff | Frozen at training date | Real-time retrieval from updated docs |
| Hallucination | Generates plausible but false facts | Grounds answers in retrieved evidence |
| Domain specificity | Generic training data, weak on niche domains | Retrieves from your proprietary/domain corpus |
| Verifiability | No source citations | Cites retrieved passages |
| Cost | Fine-tuning = expensive, slow | No model retraining needed |
RAG has two distinct phases: Indexing (offline, one-time) and Querying (online, per request).
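The two phases can be sketched end-to-end with a toy bag-of-words embedding (a real system would use a sentence-embedding model — the same one for both documents and queries — and a vector database instead of a list; the HR-policy chunks are made-up examples):

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts, standing in for a real
    sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# --- Indexing (offline, one-time): embed chunks once, store them ---
chunks = [
    "Employees accrue 25 vacation days per year.",
    "The dress code is business casual on weekdays.",
    "Remote work is allowed up to three days per week.",
]
index = [(c, embed(c)) for c in chunks]

# --- Querying (online, per request): embed query, retrieve, augment ---
def rag_prompt(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda ce: cosine(q, ce[1]), reverse=True)
    context = "\n".join(c for c, _ in ranked[:k])
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

print(rag_prompt("How many vacation days do I get?"))
```

Note the asymmetry: indexing cost is paid once per document, while querying must stay fast because it runs on every request.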
If your chunks are too large, you retrieve noisy context that distracts the LLM. Too small, and you lose the surrounding context needed to understand the passage. The right chunk strategy depends on your document type and query patterns.
| Strategy | How | Best For | Drawback |
|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple, fast baseline | May split mid-sentence, losing coherence |
| Sentence-based | Split on sentence boundaries | Short factual queries | Some sentences need surrounding context |
| Semantic | Split where topic changes (embedding similarity drops) | Long, topic-diverse docs | Computationally expensive |
| Hierarchical | Small chunks for retrieval, large parent chunks for generation | Complex documents | More complex indexing pipeline |
| Agentic | Agent decides retrieval strategy dynamically | Multi-hop questions | Requires agentic framework |
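The fixed-size baseline from the table is a few lines of code. A sketch over a pre-tokenized list (the 200/50 sizes are illustrative defaults; the overlap ensures a sentence split at a boundary still appears whole in at least one chunk):

```python
def chunk_fixed(tokens, size=200, overlap=50):
    """Fixed-size chunking: windows of `size` tokens, with `overlap`
    tokens shared between consecutive chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"tok{i}" for i in range(500)]
chunks = chunk_fixed(tokens, size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```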
Generate multiple reformulations of the query → retrieve for each → fuse results via Reciprocal Rank Fusion (RRF). Dramatically improves recall by covering different phrasings of the same intent.
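The fusion step is easy to implement: each document's score is the sum of 1/(k + rank) over the ranked lists it appears in, so documents that rank well across several reformulations rise to the top. A sketch with made-up document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists via RRF. k=60 is the constant from the
    original RRF paper; rank is 1-based."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results of retrieving with three reformulations of the same query:
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
print(reciprocal_rank_fusion(rankings))
```

Here `doc_b` wins despite never being unanimous, because consistent mid-to-top placement across lists beats a single first-place finish.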
Hypothetical Document Embedding: Ask the LLM to generate a hypothetical answer first → embed that answer → use it to retrieve real documents. Bridges the gap between short queries and long passages.
Instead of a single retrieval step, an agent decides when and what to retrieve, can perform multiple retrieval rounds, call APIs, synthesize across sources. State-of-the-art for complex Q&A.
Malicious content in retrieved documents can override system instructions. Mitigation: input sanitization, sandboxed retrieval, strict system prompt boundaries.
If the knowledge base contains personal data, retrieved context may expose it. Mitigation: PII detection/redaction before indexing, access-controlled vector stores.
Retrieved documents may contain biased or toxic content. Mitigation: content filtering on retrieved chunks before augmentation.
Model ignores retrieved context and falls back to parametric memory. Mitigation: explicit grounding instructions, citation requirements, faithfulness scoring.
| Dimension | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Data needed | None | Documents (unstructured) | Labeled examples (expensive) |
| Knowledge updates | Stuck in prompt | Update vector DB cheaply | Retrain required |
| Factual accuracy | Can hallucinate | Grounded in retrieved docs | Bakes in training data |
| Style/format control | Moderate | Moderate | Best: model learns new patterns |
| Cost | Cheapest | Moderate (vector DB) | Most expensive |
| Latency | Lowest | Moderate (retrieval adds ~100ms) | Same as base model |
| Best use case | Common tasks, quick iteration | Domain Q&A, dynamic knowledge | Custom tone, specialized tasks |
1. What is the key architectural difference between BERT and GPT that determines whether they can be used for generation?
2. In Chain-of-Thought prompting, what does adding "Let's think step by step" accomplish at a technical level?
3. In a RAG system, why is the same embedding model used for BOTH indexing documents AND encoding user queries?
4. A company wants to deploy a QA chatbot for its internal HR policy documents. The documents are updated quarterly. Which approach is MOST appropriate?
How do you know if your LLM is good? "Vibes" don't scale. This section covers a rigorous evaluation framework: from classical metrics with known failure modes, to modern reference-free evaluation with RAGAS and LLM-as-judge paradigms that define the frontier in 2026.
Same limitations as BLEU: surface form, not semantics. Still used as a baseline in summarization papers. Always report ROUGE-1, ROUGE-2, and ROUGE-L together.
Correlates significantly better with human judgment than BLEU/ROUGE. Uses contextual embeddings, so semantically equivalent sentences score high. Standard in 2026 papers alongside ROUGE.
RAGAS (Shahul Es et al., EACL 2024) evaluates RAG systems without requiring human-labeled ground truth. It uses an LLM to judge each dimension, enabling scalable, automated evaluation of production RAG pipelines.
Are all claims in the answer supported by the retrieved context? LLM breaks answer into statements, checks each against chunks.
Is the answer responsive to the query? LLM generates reverse questions from the answer, measures embedding similarity to original query.
Are the relevant chunks ranked higher in retrieved context? Measures proportion of relevant chunks in top-K, weighted by position.
Does retrieved context contain all info needed for the answer? Requires ground truth answer. Checks which answer sentences are attributable to context.
Is the answer factually correct vs. ground truth? Combines semantic similarity + factual overlap (requires reference answer).
What fraction of ground truth entities appear in retrieved context? Entity-level retrieval quality metric.
For production RAG debugging: low Context Precision → improve chunking strategy or reranker. Low Faithfulness → model hallucinating despite good retrieval. Low Answer Relevancy → model answering the wrong question. RAGAS can generate synthetic test sets from your documents, enabling continuous evaluation without human labelers.
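To make the Faithfulness idea concrete, here is a crude word-overlap proxy — this is NOT how RAGAS computes it (RAGAS uses an LLM to break the answer into statements and verify each against the context), only an illustration of the metric's shape; the context and statements are made-up examples:

```python
def faithfulness_proxy(answer_statements, context):
    """Fraction of answer statements whose content words (length > 3)
    all appear in the retrieved context. A toy stand-in for the
    LLM-based statement verification RAGAS actually performs."""
    ctx_words = set(context.lower().split())
    supported = 0
    for stmt in answer_statements:
        words = {w for w in stmt.lower().split() if len(w) > 3}
        if words and words <= ctx_words:
            supported += 1
    return supported / len(answer_statements)

context = "the eiffel tower is 330 metres tall and located in paris"
statements = [
    "the eiffel tower is located in paris",   # supported by context
    "the eiffel tower opened in 1889",        # not in context
]
print(faithfulness_proxy(statements, context))
```

A score below 1.0 flags statements the context cannot support, which is exactly the hallucination signal the real metric surfaces.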
Use a strong LLM (GPT-4) to score responses. Correlates well with human judgments at scale. Known biases to correct for:
| Benchmark | What It Measures |
|---|---|
| GLUE / SuperGLUE | Classic NLU tasks (NLI, QA, similarity) |
| MMLU (57 subjects) | World knowledge, professional-level QA |
| BIG-Bench Hard | Reasoning tasks where models fail |
| HELM | Holistic: accuracy + calibration + fairness + efficiency |
| MT-Bench | Multi-turn instruction following (LLM judge) |
| HumanEval / MBPP | Code generation |
| MATH / GSM8K | Mathematical reasoning |