🤝 Multi-Agent Systems: When One Agent Isn't Enough

M13T1L3 · Collaboration, specialization, and emergence

When McKinsey tackles a complex business problem, they don't send one person. They send a team: a project manager who coordinates, a financial analyst, a market researcher, a strategy consultant, and a technical expert, each contributing a specialty. The project manager ensures they're all working toward the same goal and that their outputs fit together. That's exactly how multi-agent systems work. Specialization + coordination = capabilities that no single agent (or person) could achieve alone.

Why Multi-Agent? The Limits of Single Agents

| Challenge | Single-Agent Limit | Multi-Agent Solution |
| --- | --- | --- |
| Context window | One agent, one context, limited memory | Each agent has its own context; total effective memory scales with the number of agents |
| Specialization | A generalist is mediocre at everything | Specialist agents, each optimized for one sub-task |
| Parallelism | A single agent is sequential | Multiple agents work simultaneously on independent sub-tasks |
| Error checking | A single agent's self-checks are biased | A critic agent independently reviews the generator agent's output |
| Complex workflows | Hard to manage >10 steps reliably | An orchestrator decomposes the task into sub-agents, each handling 3-5 steps |

Multi-Agent Topologies: How Agents Connect CORE


⭐ Hub-and-Spoke (Orchestrator)

           🧠 Orchestrator
          ↗       ↑       ↖
   🔍 Search   📊 Data   💻 Code
     Agent      Agent     Agent

Central orchestrator delegates to specialist agents. Easiest to implement, easy to debug. Most common production pattern. Risk: orchestrator is a single point of failure.
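The pattern above can be sketched in a few lines of plain Python. The specialist functions and the routing table here are illustrative stand-ins for LLM-backed agents:

```python
# Hub-and-spoke sketch: an orchestrator delegates sub-tasks to specialists.
# Each "agent" is a plain function standing in for an LLM call.

def search_agent(task: str) -> str:
    return f"[search results for: {task}]"

def data_agent(task: str) -> str:
    return f"[analysis of: {task}]"

def code_agent(task: str) -> str:
    return f"[code for: {task}]"

SPECIALISTS = {"search": search_agent, "data": data_agent, "code": code_agent}

def orchestrator(subtasks: list[tuple[str, str]]) -> list[str]:
    """Delegate each (specialty, task) pair to the matching specialist."""
    results = []
    for specialty, task in subtasks:
        agent = SPECIALISTS[specialty]  # the orchestrator is the single point of failure
        results.append(agent(task))
    return results

plan = [("search", "recent RAG papers"), ("data", "benchmark scores")]
outputs = orchestrator(plan)
```

In a real system the orchestrator would itself be an LLM that produces the plan; here the plan is hard-coded to isolate the delegation logic.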

🔗 Sequential Pipeline (Chain)

Researcher → Analyst → Writer → Editor
    ↓          ↓         ↓        ↓
 Sources    Summary    Draft    Final

Output of each agent feeds the next. Like an assembly line. Simple, predictable, good for document pipelines. Bad for tasks where early errors cascade.
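A minimal sketch of the assembly-line idea, with stub functions standing in for LLM agents:

```python
# Sequential pipeline: each agent's output is the next agent's input.
def researcher(q: str) -> str: return f"sources({q})"
def analyst(s: str) -> str:    return f"summary({s})"
def writer(s: str) -> str:     return f"draft({s})"
def editor(d: str) -> str:     return f"final({d})"

PIPELINE = [researcher, analyst, writer, editor]

def run_pipeline(query: str) -> str:
    out = query
    for stage in PIPELINE:
        out = stage(out)  # an error here cascades into every later stage
    return out

result = run_pipeline("topic")
# result == "final(draft(summary(sources(topic))))"
```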

🕸️ Fully Connected (Debate)

Agent A ↔ Agent B
   ↕         ↕
Agent C ↔ Agent D

All agents communicate with all others. Used for "society of mind" debates or consensus-building. Expensive (O(n²) messages) but finds blind spots no single agent sees.

📊 Hierarchical (Recursive)

            🎯 Manager
           ↙         ↘
    🧠 Sub-Mgr     🧠 Sub-Mgr
     ↙     ↘        ↙     ↘
   Workers        Workers

Manager spawns sub-managers, who spawn workers. Enables extreme decomposition. Used in AutoGen and MetaGPT for complex software engineering tasks.


Agent Communication & Message Protocols

How Agents Pass Messages

Multi-agent systems need a shared protocol for agents to communicate. The two dominant patterns are message passing (agents send structured messages to each other) and shared state (agents read/write to a central state object, typically via a graph framework like LangGraph).
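The shared-state pattern can be sketched framework-neutrally. The node functions and the state schema below are illustrative, not LangGraph's actual API:

```python
# Shared-state pattern: agents are nodes that read from and write to one
# central state object; the framework decides which node runs next.
from typing import Callable

State = dict  # e.g. {"query": ..., "docs": ..., "answer": ...}

def retrieve(state: State) -> State:
    state["docs"] = f"docs for {state['query']}"
    return state

def answer(state: State) -> State:
    state["answer"] = f"answer from {state['docs']}"
    return state

def run_graph(nodes: list[Callable[[State], State]], state: State) -> State:
    for node in nodes:       # a real graph framework would follow edges,
        state = node(state)  # support branching, and checkpoint the state
    return state

final = run_graph([retrieve, answer], {"query": "what is RAG?"})
```

The key design property: agents never call each other directly; all coordination flows through the state object, which makes the workflow inspectable and resumable.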

# AutoGen-style agent conversation example

UserProxy → AssistantAgent:
    "Write a Python function to detect email addresses in text"

AssistantAgent → UserProxy:
```python
import re

def find_emails(text):
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    return re.findall(pattern, text)
```

UserProxy [EXECUTES CODE]: run_code(function)
    RESULT: Tests pass ✅

UserProxy → AssistantAgent: "Now add unit tests"

# Conversation continues until the task is complete or max_turns is reached
💡 Key Design Decision: When to Terminate? Multi-agent loops need explicit termination conditions: max turns, a "TERMINATE" signal in the agent's response, a human-in-the-loop check, or a goal-completion evaluator. Without this, agents can loop indefinitely and rack up API costs.
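A minimal termination guard combining two of these conditions (a max-turn cap and a "TERMINATE" signal); the toy agent is illustrative:

```python
# Termination guard for an agent loop: stop on an explicit TERMINATE
# signal, or fall back to a hard max-turn cap.
def run_conversation(agent_step, max_turns: int = 8) -> list[str]:
    history: list[str] = []
    for _ in range(max_turns):      # max_turns caps runaway API costs
        reply = agent_step(history)
        history.append(reply)
        if "TERMINATE" in reply:    # explicit stop signal in the response
            break
    return history

# Toy agent that declares itself done after two working turns.
def toy_agent(history: list[str]) -> str:
    return "TERMINATE" if len(history) >= 2 else f"working (turn {len(history)})"

log = run_conversation(toy_agent)
```

A production system would add the other two conditions: a human checkpoint and a separate evaluator that judges whether the goal is actually met.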
🛠️ Leading Agent Frameworks in 2025–2026

LangGraph, AutoGen, CrewAI: when to use each

LangGraph vs. AutoGen vs. CrewAI 2026 LANDSCAPE


🔷 LangGraph

Graph-based · State-machine
  • Philosophy: Agents as nodes in a directed graph; edges define transitions
  • State: Shared typed state object, updated by each node
  • Strength: Stateful workflows, complex branching, checkpointing, human-in-loop
  • Best for: Production systems needing reliability and observability
  • Made by: LangChain team

🔷 AutoGen

Conversational · Role-playing
  • Philosophy: Agents as conversation participants with defined roles
  • State: Conversation history, multi-turn dialogue
  • Strength: Natural multi-agent conversations, code execution loop, human proxy
  • Best for: Coding assistants, collaborative problem-solving, research tasks
  • Made by: Microsoft Research

🔷 CrewAI

Role-based · Hierarchical
  • Philosophy: Agents as employees on a crew, with defined roles and tasks
  • State: Task outputs, crew memory
  • Strength: Intuitive role-based design, sequential/parallel task execution
  • Best for: Content pipelines, research reports, workflow automation
  • Made by: CrewAI Inc.
🎯 Framework Selection Guide (2026): Use LangGraph when you need fine-grained control, stateful workflows, and production reliability. Use AutoGen when the task naturally looks like a conversation between specialized agents (especially code-heavy tasks). Use CrewAI for quickly standing up role-based pipelines with clear agent personas and task flows. For maximum power, many teams use LangGraph as the orchestration layer with AutoGen-style agents as nodes.

Real-World Architecture: A Production NLP Agent System

# Production multi-agent NLP pipeline example
# Task: Automated customer support ticket resolution

Tier 1: TRIAGE AGENT
    Input:  Raw customer email
    Task:   Classify intent (billing/technical/general), extract entities
    Tools:  entity_extractor, intent_classifier, ticket_db.create()
    Output: Structured ticket + routing decision

Tier 2: SPECIALIST AGENTS (parallel fork)
    BillingAgent:  query_billing_db(), apply_discount()
    TechAgent:     search_knowledge_base(), run_diagnostics()
    EscalateAgent: send_to_human() if confidence < 0.7

Tier 3: RESPONSE AGENT
    Input: Specialist agent output + customer context
    Task:  Generate empathetic, accurate response in brand voice
    Tools: email_formatter(), sentiment_checker(), send_email()

Tier 4: QUALITY AGENT (critic)
    Input: Draft response
    Task:  Check: Is it accurate? Empathetic? Compliant? Under 200 words?
    If fail → send back to Response Agent with specific feedback

# This 4-tier system can handle 80%+ of tickets automatically,
# with human escalation for edge cases
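The Tier 2 routing decision can be sketched as a small function. The agent names and the 0.7 threshold come from the pipeline above; the intent labels and example inputs are illustrative:

```python
# Tier 2 routing sketch: route by classified intent, but escalate to a
# human whenever the triage confidence falls below the threshold.
CONFIDENCE_THRESHOLD = 0.7  # matches the EscalateAgent rule above

def route_ticket(intent: str, confidence: float) -> str:
    """Return the name of the specialist agent that should handle the ticket."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "EscalateAgent"  # send_to_human()
    routes = {"billing": "BillingAgent", "technical": "TechAgent"}
    return routes.get(intent, "EscalateAgent")  # intents without a specialist also go to a human

confident_billing = route_ticket("billing", 0.95)
unsure_technical = route_ticket("technical", 0.40)
```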

Evaluating Multi-Agent Systems DEEP DIVE


🎯 Task-Level Metrics

  • Task completion rate
  • Accuracy vs. ground truth
  • Error rate per agent

⚡ System-Level Metrics

  • Total latency end-to-end
  • Total cost (API tokens × agents)
  • Successful termination rate

🔍 Agent-Level Metrics

  • Tool call success rate
  • Hallucination rate per agent
  • Communication efficiency
⚠️ The Evaluation Problem in Agentic Systems: Traditional NLP evaluation (BLEU, F1, accuracy) doesn't transfer to agents. An agent might reach the correct final answer via a buggy process, or reach the wrong answer due to bad luck despite correct reasoning. You need to evaluate both process (were the steps reasonable?) and outcome (was the final answer correct?). Benchmarks like AgentBench, MINT, and WebArena provide standardized environments for this.
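A sketch of scoring both dimensions from an agent trace. The trace format and field names are assumptions for illustration, not a standard schema:

```python
# Evaluate an agent run on both outcome (did it reach the gold answer?)
# and process (what fraction of its tool calls succeeded?).
def evaluate_run(trace: list[dict], final_answer: str, gold: str) -> dict:
    outcome_ok = final_answer.strip().lower() == gold.strip().lower()
    tool_calls = [step for step in trace if step["type"] == "tool_call"]
    failed = [step for step in tool_calls if not step["ok"]]
    # Crude process proxy: share of tool calls that succeeded.
    process_score = 1 - len(failed) / max(len(tool_calls), 1)
    return {"outcome": outcome_ok, "process": process_score}

trace = [
    {"type": "tool_call", "ok": True},
    {"type": "tool_call", "ok": False},  # e.g. a malformed search query
    {"type": "tool_call", "ok": True},
]
report = evaluate_run(trace, "Paris", "paris")  # right answer, imperfect process
```

Real process evaluation goes further (were the calls appropriate, were observations correctly interpreted?), typically using an LLM-as-judge over the full trace.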
🌍 NLP & Agentic AI in the Real World

Where these techniques are being applied right now

Industry Applications Landscape

⚕️ Healthcare

Clinical note summarization, ICD coding, drug interaction detection, patient Q&A with RAG over medical literature

⚖️ Legal

Contract review agents, legal research (RAG over case law), document summarization, compliance checking

💰 Finance

Earnings call analysis, sentiment on financial news, fraud detection narratives, risk report generation

🛒 E-Commerce

Product description generation, review summarization, personalized recommendation explanations, CS ticket routing

🎓 Education

Personalized tutoring agents (like this study guide!), automated essay feedback, concept explanation generators

💻 Software Dev

Code generation (GitHub Copilot), bug finding agents, code review, documentation writing, test generation

🔬 Scientific Research

Literature review agents, hypothesis generation, experimental design assistants, paper writing support

🗣️ Customer Service

Ticket triage, response generation, sentiment routing, multi-lingual support, knowledge base Q&A

🚀 The Road Ahead: NLP in 2026 and Beyond

Critical research frontiers every NLP practitioner should know

NLP Evolution Timeline

Rule-Based → Statistical NLP Pre-2012

Hand-crafted rules, n-gram language models, SVMs, HMMs for sequence labeling. Brittle, domain-specific, required massive feature engineering.

Deep Learning NLP 2012–2017

Word2Vec, GloVe, LSTMs, CNNs for text. First time neural representations beat hand-crafted features at scale.

The Transformer Revolution 2017–2020

Attention is All You Need, BERT, GPT-2. Transfer learning made large-scale pre-training the new paradigm.

Large Language Models + Alignment 2020–2023

GPT-3, ChatGPT, Llama, Claude. RLHF alignment. Scale laws. In-context learning. Prompt engineering emerges as a discipline.

Agentic AI + RAG + Reasoning Models 2024–2026 (NOW)

LLM agents with tools and memory. Multi-agent frameworks. RAG as standard practice. o1/o3/R1 reasoning models. Function calling ubiquitous.

Multimodal Foundation Models 2025–2027

GPT-4o, Gemini 2.0, and successors unify text, image, audio, video in one model. NLP becomes just one modality of general AI. Agents that see, hear, and act in the physical world.

Test-Time Compute Scaling + Efficient Models 2026–2028

Models that "think longer" on hard problems (like o3). Simultaneously: efficient small models (3B parameters matching old 70B) via better architectures, data curation, and distillation.

Autonomous AI Scientists & Self-Improving Systems 2027+

AI systems that formulate hypotheses, design experiments, run code, and publish papers. Systems that improve their own capabilities. The beginning of recursive self-improvement: the frontier everyone is watching.

Open Research Frontiers You Should Know 2026 FORWARD

| Research Area | Core Problem | Why It Matters |
| --- | --- | --- |
| Long-context understanding | Do 128K+ token models actually use all context, or just the beginning/end? | Critical for document-length RAG and agent context management |
| Hallucination elimination | Why do factual LLMs still confabulate, and how to detect/prevent it at inference time? | Fundamental requirement for medical, legal, financial AI deployment |
| Interpretability / Mechanistic Interp. | What circuits and features inside transformers implement known behaviors? | Safety, debugging, alignment verification |
| Compositional generalization | Can LLMs combine known concepts in novel ways vs. just interpolating training data? | Separates genuine understanding from sophisticated pattern matching |
| Continual learning | How to update model knowledge without catastrophic forgetting of prior knowledge? | Enables models to stay current without full retraining |
| Efficient architectures | State space models (Mamba), linear attention: can they match transformers with O(n) complexity? | Enable long context at reduced cost |
| Multi-agent trust & safety | How to guarantee safety when agents make consequential decisions autonomously? | Prerequisite for deploying agents in high-stakes domains |
| Low-resource NLP | Most of the world's 7,000 languages have minimal training data | Equity in AI: most LLMs fail on non-English languages |

What Skills Make You a Strong NLP Engineer in 2026?


📚 Theoretical Foundations

  • Transformer architecture (attention, positional encoding, layer norms)
  • Language modeling objectives (CLM, MLM, instruction tuning)
  • Evaluation metrics for generation (BLEU, ROUGE, BERTScore, LLM-as-judge)
  • Probabilistic NLP: Bayesian methods, topic models
  • Sequence labeling: CRF, HMM, Viterbi

🛠️ Practical Engineering

  • Prompt engineering (all techniques from Week 13)
  • RAG pipeline implementation (chunking, embedding, vector search)
  • Fine-tuning with LoRA/QLoRA on HuggingFace
  • Agent frameworks: LangGraph, AutoGen, or CrewAI
  • LLM evaluation and observability (LangSmith, Weights & Biases)

🔬 Research Awareness

  • Read arXiv papers: cs.CL, cs.AI sections
  • Know key venues: ACL, EMNLP, NAACL, NeurIPS, ICLR
  • Follow: Papers With Code, Hugging Face model releases
  • Track benchmarks: MMLU, HumanEval, MATH, AgentBench

🧭 Critical Thinking

  • Question benchmark leakage (are models trained on test sets?)
  • Understand when RAG is better than fine-tuning (and vice versa)
  • Know the alignment tax: safer models are often less capable
  • Think about deployment constraints: latency, cost, safety

🧠 Quiz 12 Prep: Multi-Agent Systems & Applications

1. In a hub-and-spoke multi-agent system, a central orchestrator decomposes a task and delegates to specialists. What is the PRIMARY advantage over a single-agent approach?

✅ Correct! Specialization + parallelism are the core benefits. Each agent is prompt-engineered (or fine-tuned) for its specific role and can run simultaneously. The total effective context across all agents far exceeds any single agent's window. The orchestrator coordinates without needing to handle all sub-task details itself.
❌ The key benefits are specialization (each agent focused on one role) and the ability to parallelize work. This enables tackling tasks too complex for any single agent.

2. You're building a customer service agent system. 20% of complex cases should be escalated to human agents. Which architectural element handles this?

✅ Correct! Human-in-the-Loop (HITL) checkpoints are a critical safety and quality mechanism in production agentic systems. The agent evaluates its own confidence (or a confidence evaluator is inserted), and below a threshold, a human is brought in. LangGraph has native support for breakpoints and human input nodes.
❌ Temperature affects diversity, not routing to humans. The right design pattern here is HITL: explicit checkpoints where low-confidence decisions are escalated to human review.

3. What is the key distinction between LangGraph and AutoGen that guides which one to choose?

✅ Correct! This is the core architectural distinction. LangGraph is ideal when you need deterministic state transitions, checkpointing, and complex branching logic (think production workflows). AutoGen is ideal when the task naturally emerges from agents having a conversation, especially for code-heavy tasks where back-and-forth dialogue refines solutions.
❌ Both support multi-agent setups and work with various LLMs. The key difference is the interaction model: graph-based state machine (LangGraph) vs. conversational dialogue (AutoGen).

4. Why is evaluating multi-agent system performance harder than evaluating a traditional NLP classifier?

✅ Correct! An agent can reach a correct answer via faulty reasoning (lucky result) or reach a wrong answer despite good reasoning (bad luck). Evaluating only the final answer misses this. Process evaluation requires inspecting the full agent trace: were tool calls appropriate? Were observations correctly interpreted? Did the agent stay on goal? Benchmarks like AgentBench and WebArena were specifically designed for this.
❌ The challenge is the need for both process and outcome evaluation. Traditional NLP evaluates output quality; agent evaluation must also assess whether the path taken to reach the output was sound.

🎓 Course Complete!

You've journeyed from tokenization and Bag-of-Words all the way to multi-agent AI systems operating at the frontier of what's possible. The field will keep evolving, but you now have the foundations to understand and evaluate whatever comes next.

15 Weeks Covered · 13 Core Modules · 5 Homeworks · 40+ Topics Mastered
🔬 Beyond the Slides · Graduate Depth

Knowledge Graphs, GraphRAG & Structured Knowledge in 2026

Unstructured text retrieval (vector RAG) struggles with global questions, multi-hop reasoning, and relationship-heavy queries. Knowledge Graphs and GraphRAG represent the next frontier, combining symbolic structure with neural generation. This section covers KG fundamentals through Microsoft's GraphRAG system.

πŸ•ΈοΈ Knowledge Graph Fundamentals Foundations

A Knowledge Graph (KG) represents world knowledge as a directed graph: nodes are entities, edges are relations. The triple (head, relation, tail) is the atomic unit, e.g. (Berlin, capital_of, Germany).

KG Construction Pipeline

  1. Entity Recognition: identify mentions (NER)
  2. Entity Linking: map mentions to canonical KG nodes (e.g., Wikidata QIDs)
  3. Relation Extraction: extract (h, r, t) triples from text
  4. Schema Mapping: align to ontology (RDF, OWL)
  5. Conflict Resolution: handle contradictory facts across sources
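The pipeline's end product can be pictured as a queryable set of triples. A toy sketch, with illustrative facts:

```python
# Toy triple store: the (head, relation, tail) facts a KG pipeline produces.
triples = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Germany", "member_of", "EU"),
}

def query(head=None, relation=None, tail=None):
    """Match triples against a partial pattern (None acts as a wildcard)."""
    return [
        (h, r, t) for (h, r, t) in triples
        if head in (None, h) and relation in (None, r) and tail in (None, t)
    ]

capitals = query(relation="capital_of")  # all capital_of facts
```

Real KG stores answer the same kind of pattern query at scale via SPARQL (e.g. over Wikidata) rather than a list comprehension.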

Major Knowledge Bases

  • Wikidata: 100M+ items, community-maintained, SPARQL queryable
  • Freebase: deprecated (→ Wikidata), used in distant supervision RE
  • NELL: never-ending learning from web text
  • ConceptNet: commonsense knowledge (5M+ assertions)
  • UMLS: biomedical/clinical concepts

πŸ“ TransE: Knowledge Graph Embeddings PhD

TransE (Bordes et al., 2013) embeds KG entities and relations in vector space. Key idea: for a true triple (h, r, t), the relation vector r should approximate the translation from h to t.

# TransE scoring function
score(h, r, t) = -||h + r - t||

# Training with margin ranking loss
L = Σ_{(h,r,t) ∈ S} Σ_{(h',r,t') ∈ S'} max(0, γ + score(h',r,t') - score(h,r,t))
# where S' = corrupted (negative) triples
#       γ  = margin hyperparameter

# Analogy with word2vec: Berlin - Germany + France ≈ Paris
# TransE makes the same idea explicit: a relation ("capital_of",
# "president_of", ...) is itself a translation vector between entities.
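A runnable version of the scoring function and margin loss on toy 2-d embeddings (the vectors are hand-picked for illustration, not trained):

```python
# TransE in miniature: score = -||h + r - t||, trained so that true
# triples outscore corrupted ones by at least a margin gamma.
import math

def score(h, r, t):
    """Negative L2 distance of h + r from t; higher = more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(pos, neg, gamma=1.0):
    """Hinge loss: zero once the true triple beats the corrupted one by gamma."""
    return max(0.0, gamma + score(*neg) - score(*pos))

# Hand-picked embeddings where germany + capital_of lands exactly on berlin.
germany, capital_of = [1.0, 0.0], [0.0, 1.0]
berlin, paris = [1.0, 1.0], [5.0, 5.0]  # paris used as a corrupted tail

true_score = score(germany, capital_of, berlin)   # 0.0, a perfect fit
loss = margin_loss((germany, capital_of, berlin),
                   (germany, capital_of, paris))  # 0.0, margin satisfied
```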

TransE Limitations & Successors

| Model | Handles | Key Innovation |
| --- | --- | --- |
| TransE | 1-to-1 relations | h + r ≈ t |
| TransR | 1-to-N, N-to-1 | Relation-specific spaces |
| RotatE | Symmetry, inversion | Rotation in the complex plane |
| ComplEx | Asymmetric relations | Complex-valued embeddings |
| DistMult | Symmetric relations | Bilinear scoring |

🌐 GraphRAG (Microsoft, 2024) 2024 Breakthrough

📖 The Problem with Naive RAG

Standard RAG retrieves semantically similar chunks. But for global questions ("What are the main themes across all our documents?", "Summarize the key risks mentioned in our corpus"), no single chunk contains the answer. GraphRAG builds a knowledge graph from the entire corpus, clusters it, and enables both local and global queries.

GraphRAG Indexing Pipeline

  1. Extract entities & relations from all documents (LLM-based)
  2. Build knowledge graph: entities as nodes, relations as edges
  3. Run Leiden community detection algorithm
  4. Generate hierarchical community summaries (LLM)
  5. Index: vector store + graph structure + community reports
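Steps 2 and 3 of the pipeline in miniature, with connected components standing in for the Leiden algorithm (a simplification: Leiden finds denser sub-communities within components; the relations are illustrative):

```python
# Build a graph from extracted (head, relation, tail) triples, then group
# entities into communities. Connected components stand in for Leiden here.
from collections import defaultdict

relations = [("Alice", "works_at", "Acme"), ("Bob", "works_at", "Acme"),
             ("Carol", "cites", "PaperX")]

graph = defaultdict(set)
for h, _, t in relations:   # entities as nodes, relations as (undirected) edges
    graph[h].add(t)
    graph[t].add(h)

def communities(graph):
    seen, out = set(), []
    for node in list(graph):
        if node in seen:
            continue
        comp, stack = set(), [node]
        while stack:                 # depth-first flood fill
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n])
        seen |= comp
        out.append(comp)             # each community would get an LLM summary
    return out

groups = communities(graph)  # {Alice, Bob, Acme} and {Carol, PaperX}
```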

GraphRAG Query Types

  • Local search: entity-centric query → find relevant entities → retrieve neighbors + summaries → LLM answer
  • Global search: query against community reports at all levels → parallel LLM map step → reduce step → final answer
  • Drift search (2025): hybrid local+global for exploratory queries
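Local search in miniature: start from the query's entity, pull its graph neighbors and their summaries into the LLM's context. The neighbor lists and summaries below are illustrative stand-ins for the indexed graph:

```python
# Local search sketch: entity -> neighbors -> summaries -> grounding context.
neighbors = {"Acme": ["Alice", "Bob"], "Alice": ["Acme"], "Bob": ["Acme"]}
summaries = {"Acme": "A company.", "Alice": "An engineer.", "Bob": "A manager."}

def local_context(entity: str) -> str:
    """Collect the entity and its one-hop neighbors as grounding text."""
    hits = [entity] + neighbors.get(entity, [])
    return "\n".join(f"{e}: {summaries[e]}" for e in hits)

context = local_context("Acme")  # would be passed to the LLM as grounding
```

Global search is the complement: instead of walking outward from one entity, it maps the query over every community report and reduces the partial answers.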
| Dimension | Naive RAG | GraphRAG |
| --- | --- | --- |
| Retrieval unit | Text chunks (flat) | Entities, relationships, community summaries |
| Global queries | ❌ Poor (no single chunk has the answer) | ✅ Strong (community summaries) |
| Multi-hop reasoning | ❌ Weak (individual chunks) | ✅ Better (graph traversal) |
| Indexing cost | Low (embed & chunk) | High (LLM entity extraction × corpus size) |
| Query cost | Low (single vector search) | Higher (graph traversal + parallel LLM calls) |
| Best for | Factoid QA, specific retrieval | Corpus summarization, theme extraction, relationship queries |

GraphRAG improves answer comprehensiveness by 50–70% over naive RAG on global questions (Microsoft, 2024). It is actively used in enterprise document intelligence, scientific literature analysis, and legal discovery.