🤝 Multi-Agent Systems: When One Agent Isn't Enough

M13T1L3 · Collaboration, specialization, and emergence

When McKinsey tackles a complex business problem, they don't send one person. They send a team: a project manager who coordinates, a financial analyst, a market researcher, a strategy consultant, and a technical expert, each contributing a specialty. The project manager ensures they're all working toward the same goal and that their outputs fit together. That's exactly how multi-agent systems work. Specialization + coordination = capabilities that no single agent (or person) could achieve alone.

Why Multi-Agent? The Limits of Single Agents

| Challenge | Single-Agent Limit | Multi-Agent Solution |
| --- | --- | --- |
| Context window | One agent, one context, limited memory | Each agent has its own context; total effective memory scales with the number of agents |
| Specialization | A generalist is mediocre at everything | Specialist agents, each optimized for one sub-task |
| Parallelism | A single agent is sequential | Multiple agents work simultaneously on independent sub-tasks |
| Error checking | A single agent's self-checks are biased | A critic agent independently reviews the generator agent's output |
| Complex workflows | Hard to manage >10 steps reliably | An orchestrator decomposes the task into sub-agents, each handling 3-5 steps |

Multi-Agent Topologies: How Agents Connect CORE


⭐ Hub-and-Spoke (Orchestrator)

           🧠 Orchestrator
          ↗       ↑       ↖
   🔍 Search   📊 Data   💻 Code
     Agent      Agent     Agent

Central orchestrator delegates to specialist agents. Easiest to implement, easy to debug. Most common production pattern. Risk: orchestrator is a single point of failure.
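The pattern above can be sketched in a few lines of plain Python. The specialist functions and the routing table here are illustrative stand-ins for LLM-backed agents:

```python
# Hub-and-spoke sketch: an orchestrator delegates sub-tasks to specialists.
# Each "agent" is a plain function standing in for an LLM call.

def search_agent(task: str) -> str:
    return f"[search results for: {task}]"

def data_agent(task: str) -> str:
    return f"[analysis of: {task}]"

def code_agent(task: str) -> str:
    return f"[code for: {task}]"

SPECIALISTS = {"search": search_agent, "data": data_agent, "code": code_agent}

def orchestrator(subtasks: list[tuple[str, str]]) -> list[str]:
    """Delegate each (specialty, task) pair to the matching specialist."""
    results = []
    for specialty, task in subtasks:
        agent = SPECIALISTS[specialty]  # the orchestrator is the single point of failure
        results.append(agent(task))
    return results

plan = [("search", "recent RAG papers"), ("data", "benchmark scores")]
outputs = orchestrator(plan)
```

In a real system the orchestrator would itself be an LLM that produces the plan; here the plan is hard-coded to isolate the delegation logic.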

🔗 Sequential Pipeline (Chain)

Researcher → Analyst → Writer → Editor
    ↓          ↓         ↓        ↓
 Sources    Summary    Draft    Final

Output of each agent feeds the next. Like an assembly line. Simple, predictable, good for document pipelines. Bad for tasks where early errors cascade.
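A minimal sketch of the assembly-line idea, with stub functions standing in for LLM agents:

```python
# Sequential pipeline: each agent's output is the next agent's input.
def researcher(q: str) -> str: return f"sources({q})"
def analyst(s: str) -> str:    return f"summary({s})"
def writer(s: str) -> str:     return f"draft({s})"
def editor(d: str) -> str:     return f"final({d})"

PIPELINE = [researcher, analyst, writer, editor]

def run_pipeline(query: str) -> str:
    out = query
    for stage in PIPELINE:
        out = stage(out)  # an error here cascades into every later stage
    return out

result = run_pipeline("topic")
# result == "final(draft(summary(sources(topic))))"
```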

🕸️ Fully Connected (Debate)

Agent A ↔ Agent B
   ↕         ↕
Agent C ↔ Agent D

All agents communicate with all others. Used for "society of mind" debates or consensus-building. Expensive (O(n²) messages) but finds blind spots no single agent sees.

📊 Hierarchical (Recursive)

            🎯 Manager
           ↙         ↘
    🧠 Sub-Mgr     🧠 Sub-Mgr
     ↙     ↘        ↙     ↘
   Workers        Workers

Manager spawns sub-managers, who spawn workers. Enables extreme decomposition. Used in AutoGen and MetaGPT for complex software engineering tasks.


Agent Communication & Message Protocols

How Agents Pass Messages

Multi-agent systems need a shared protocol for agents to communicate. The two dominant patterns are message passing (agents send structured messages to each other) and shared state (agents read/write to a central state object, typically via a graph framework like LangGraph).
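The shared-state pattern can be sketched framework-neutrally. The node functions and the state schema below are illustrative, not LangGraph's actual API:

```python
# Shared-state pattern: agents are nodes that read from and write to one
# central state object; the framework decides which node runs next.
from typing import Callable

State = dict  # e.g. {"query": ..., "docs": ..., "answer": ...}

def retrieve(state: State) -> State:
    state["docs"] = f"docs for {state['query']}"
    return state

def answer(state: State) -> State:
    state["answer"] = f"answer from {state['docs']}"
    return state

def run_graph(nodes: list[Callable[[State], State]], state: State) -> State:
    for node in nodes:       # a real graph framework would follow edges,
        state = node(state)  # support branching, and checkpoint the state
    return state

final = run_graph([retrieve, answer], {"query": "what is RAG?"})
```

The key design property: agents never call each other directly; all coordination flows through the state object, which makes the workflow inspectable and resumable.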

# AutoGen-style agent conversation example

UserProxy → AssistantAgent:
    "Write a Python function to detect email addresses in text"

AssistantAgent → UserProxy:
```python
import re

def find_emails(text):
    pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    return re.findall(pattern, text)
```

UserProxy [EXECUTES CODE]: run_code(function)
    RESULT: Tests pass ✅

UserProxy → AssistantAgent: "Now add unit tests"

# Conversation continues until the task is complete or max_turns is reached
💡 Key Design Decision: When to Terminate? Multi-agent loops need explicit termination conditions: max turns, a "TERMINATE" signal in the agent's response, a human-in-the-loop check, or a goal-completion evaluator. Without this, agents can loop indefinitely and rack up API costs.
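A minimal termination guard combining two of these conditions (a max-turn cap and a "TERMINATE" signal); the toy agent is illustrative:

```python
# Termination guard for an agent loop: stop on an explicit TERMINATE
# signal, or fall back to a hard max-turn cap.
def run_conversation(agent_step, max_turns: int = 8) -> list[str]:
    history: list[str] = []
    for _ in range(max_turns):      # max_turns caps runaway API costs
        reply = agent_step(history)
        history.append(reply)
        if "TERMINATE" in reply:    # explicit stop signal in the response
            break
    return history

# Toy agent that declares itself done after two working turns.
def toy_agent(history: list[str]) -> str:
    return "TERMINATE" if len(history) >= 2 else f"working (turn {len(history)})"

log = run_conversation(toy_agent)
```

A production system would add the other two conditions: a human checkpoint and a separate evaluator that judges whether the goal is actually met.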
🛠️ Leading Agent Frameworks in 2025–2026

LangGraph, AutoGen, CrewAI: when to use each

LangGraph vs. AutoGen vs. CrewAI 2026 LANDSCAPE


🔷 LangGraph

Graph-based · State-machine
  • Philosophy: Agents as nodes in a directed graph; edges define transitions
  • State: Shared typed state object, updated by each node
  • Strength: Stateful workflows, complex branching, checkpointing, human-in-loop
  • Best for: Production systems needing reliability and observability
  • Made by: LangChain team

🔷 AutoGen

Conversational · Role-playing
  • Philosophy: Agents as conversation participants with defined roles
  • State: Conversation history, multi-turn dialogue
  • Strength: Natural multi-agent conversations, code execution loop, human proxy
  • Best for: Coding assistants, collaborative problem-solving, research tasks
  • Made by: Microsoft Research

🔷 CrewAI

Role-based · Hierarchical
  • Philosophy: Agents as employees on a crew, with defined roles and tasks
  • State: Task outputs, crew memory
  • Strength: Intuitive role-based design, sequential/parallel task execution
  • Best for: Content pipelines, research reports, workflow automation
  • Made by: CrewAI Inc.
🎯 Framework Selection Guide (2026): Use LangGraph when you need fine-grained control, stateful workflows, and production reliability. Use AutoGen when the task naturally looks like a conversation between specialized agents (especially code-heavy tasks). Use CrewAI for quickly standing up role-based pipelines with clear agent personas and task flows. For maximum power, many teams use LangGraph as the orchestration layer with AutoGen-style agents as nodes.

Real-World Architecture: A Production NLP Agent System

# Production multi-agent NLP pipeline example
# Task: Automated customer support ticket resolution

Tier 1: TRIAGE AGENT
    Input:  Raw customer email
    Task:   Classify intent (billing/technical/general), extract entities
    Tools:  entity_extractor, intent_classifier, ticket_db.create()
    Output: Structured ticket + routing decision

Tier 2: SPECIALIST AGENTS (parallel fork)
    BillingAgent:  query_billing_db(), apply_discount()
    TechAgent:     search_knowledge_base(), run_diagnostics()
    EscalateAgent: send_to_human() if confidence < 0.7

Tier 3: RESPONSE AGENT
    Input: Specialist agent output + customer context
    Task:  Generate empathetic, accurate response in brand voice
    Tools: email_formatter(), sentiment_checker(), send_email()

Tier 4: QUALITY AGENT (critic)
    Input: Draft response
    Task:  Check: Is it accurate? Empathetic? Compliant? Under 200 words?
    If fail → send back to Response Agent with specific feedback

# This 4-tier system can handle 80%+ of tickets automatically,
# with human escalation for edge cases
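The Tier 2 routing decision can be sketched as a small function. The agent names and the 0.7 threshold come from the pipeline above; the intent labels and example inputs are illustrative:

```python
# Tier 2 routing sketch: route by classified intent, but escalate to a
# human whenever the triage confidence falls below the threshold.
CONFIDENCE_THRESHOLD = 0.7  # matches the EscalateAgent rule above

def route_ticket(intent: str, confidence: float) -> str:
    """Return the name of the specialist agent that should handle the ticket."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "EscalateAgent"  # send_to_human()
    routes = {"billing": "BillingAgent", "technical": "TechAgent"}
    return routes.get(intent, "EscalateAgent")  # intents without a specialist also go to a human

confident_billing = route_ticket("billing", 0.95)
unsure_technical = route_ticket("technical", 0.40)
```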

Evaluating Multi-Agent Systems DEEP DIVE


🎯 Task-Level Metrics

  • Task completion rate
  • Accuracy vs. ground truth
  • Error rate per agent

⚡ System-Level Metrics

  • Total latency end-to-end
  • Total cost (API tokens × agents)
  • Successful termination rate

🔍 Agent-Level Metrics

  • Tool call success rate
  • Hallucination rate per agent
  • Communication efficiency
⚠️ The Evaluation Problem in Agentic Systems: Traditional NLP evaluation (BLEU, F1, accuracy) doesn't transfer to agents. An agent might reach the correct final answer via a buggy process, or reach the wrong answer due to bad luck despite correct reasoning. You need to evaluate both process (were the steps reasonable?) and outcome (was the final answer correct?). Benchmarks like AgentBench, MINT, and WebArena provide standardized environments for this.
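A sketch of scoring both dimensions from an agent trace. The trace format and field names are assumptions for illustration, not a standard schema:

```python
# Evaluate an agent run on both outcome (did it reach the gold answer?)
# and process (what fraction of its tool calls succeeded?).
def evaluate_run(trace: list[dict], final_answer: str, gold: str) -> dict:
    outcome_ok = final_answer.strip().lower() == gold.strip().lower()
    tool_calls = [step for step in trace if step["type"] == "tool_call"]
    failed = [step for step in tool_calls if not step["ok"]]
    # Crude process proxy: share of tool calls that succeeded.
    process_score = 1 - len(failed) / max(len(tool_calls), 1)
    return {"outcome": outcome_ok, "process": process_score}

trace = [
    {"type": "tool_call", "ok": True},
    {"type": "tool_call", "ok": False},  # e.g. a malformed search query
    {"type": "tool_call", "ok": True},
]
report = evaluate_run(trace, "Paris", "paris")  # right answer, imperfect process
```

Real process evaluation goes further (were the calls appropriate, were observations correctly interpreted?), typically using an LLM-as-judge over the full trace.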
🌍 NLP & Agentic AI in the Real World

Where these techniques are being applied right now

Industry Applications Landscape

⚕️ Healthcare

Clinical note summarization, ICD coding, drug interaction detection, patient Q&A with RAG over medical literature

⚖️ Legal

Contract review agents, legal research (RAG over case law), document summarization, compliance checking

💰 Finance

Earnings call analysis, sentiment on financial news, fraud detection narratives, risk report generation

🛒 E-Commerce

Product description generation, review summarization, personalized recommendation explanations, CS ticket routing

🎓 Education

Personalized tutoring agents (like this study guide!), automated essay feedback, concept explanation generators

💻 Software Dev

Code generation (GitHub Copilot), bug finding agents, code review, documentation writing, test generation

🔬 Scientific Research

Literature review agents, hypothesis generation, experimental design assistants, paper writing support

🗣️ Customer Service

Ticket triage, response generation, sentiment routing, multi-lingual support, knowledge base Q&A

🚀 The Road Ahead: NLP in 2026 and Beyond

Critical research frontiers every NLP practitioner should know

NLP Evolution Timeline

Rule-Based → Statistical NLP Pre-2012

Hand-crafted rules, n-gram language models, SVMs, HMMs for sequence labeling. Brittle, domain-specific, required massive feature engineering.

Deep Learning NLP 2012–2017

Word2Vec, GloVe, LSTMs, CNNs for text. First time neural representations beat hand-crafted features at scale.

The Transformer Revolution 2017–2020

Attention is All You Need, BERT, GPT-2. Transfer learning made large-scale pre-training the new paradigm.

Large Language Models + Alignment 2020–2023

GPT-3, ChatGPT, Llama, Claude. RLHF alignment. Scale laws. In-context learning. Prompt engineering emerges as a discipline.

Agentic AI + RAG + Reasoning Models 2024–2026 (NOW)

LLM agents with tools and memory. Multi-agent frameworks. RAG as standard practice. o1/o3/R1 reasoning models. Function calling ubiquitous.

Multimodal Foundation Models 2025–2027

GPT-4o, Gemini 2.0, and successors unify text, image, audio, video in one model. NLP becomes just one modality of general AI. Agents that see, hear, and act in the physical world.

Test-Time Compute Scaling + Efficient Models 2026–2028

Models that "think longer" on hard problems (like o3). Simultaneously: efficient small models (3B parameters matching old 70B) via better architectures, data curation, and distillation.

Autonomous AI Scientists & Self-Improving Systems 2027+

AI systems that formulate hypotheses, design experiments, run code, and publish papers. Systems that improve their own capabilities. The beginning of recursive self-improvement: the frontier everyone is watching.

Open Research Frontiers You Should Know 2026 FORWARD

| Research Area | Core Problem | Why It Matters |
| --- | --- | --- |
| Long-context understanding | Do 128K+ token models actually use all context, or just the beginning/end? | Critical for document-length RAG and agent context management |
| Hallucination elimination | Why do factual LLMs still confabulate, and how to detect/prevent it at inference time? | Fundamental requirement for medical, legal, financial AI deployment |
| Interpretability / Mechanistic Interp. | What circuits and features inside transformers implement known behaviors? | Safety, debugging, alignment verification |
| Compositional generalization | Can LLMs combine known concepts in novel ways vs. just interpolating training data? | Separates genuine understanding from sophisticated pattern matching |
| Continual learning | How to update model knowledge without catastrophic forgetting of prior knowledge? | Enables models to stay current without full retraining |
| Efficient architectures | State space models (Mamba), linear attention: can they match transformers with O(n) complexity? | Enable long context at reduced cost |
| Multi-agent trust & safety | How to guarantee safety when agents make consequential decisions autonomously? | Prerequisite for deploying agents in high-stakes domains |
| Low-resource NLP | Most of the world's 7,000 languages have minimal training data | Equity in AI: most LLMs fail on non-English languages |

What Skills Make You a Strong NLP Engineer in 2026?


📚 Theoretical Foundations

  • Transformer architecture (attention, positional encoding, layer norms)
  • Language modeling objectives (CLM, MLM, instruction tuning)
  • Evaluation metrics for generation (BLEU, ROUGE, BERTScore, LLM-as-judge)
  • Probabilistic NLP: Bayesian methods, topic models
  • Sequence labeling: CRF, HMM, Viterbi

🛠️ Practical Engineering

  • Prompt engineering (all techniques from Week 13)
  • RAG pipeline implementation (chunking, embedding, vector search)
  • Fine-tuning with LoRA/QLoRA on HuggingFace
  • Agent frameworks: LangGraph, AutoGen, or CrewAI
  • LLM evaluation and observability (LangSmith, Weights & Biases)

🔬 Research Awareness

  • Read arXiv papers: cs.CL, cs.AI sections
  • Know key venues: ACL, EMNLP, NAACL, NeurIPS, ICLR
  • Follow: Papers With Code, Hugging Face model releases
  • Track benchmarks: MMLU, HumanEval, MATH, AgentBench

🧭 Critical Thinking

  • Question benchmark leakage (are models trained on test sets?)
  • Understand when RAG is better than fine-tuning (and vice versa)
  • Know the alignment tax: safer models are often less capable
  • Think about deployment constraints: latency, cost, safety

🧠 Quiz 12 Prep: Multi-Agent Systems & Applications

1. In a hub-and-spoke multi-agent system, a central orchestrator decomposes a task and delegates to specialists. What is the PRIMARY advantage over a single-agent approach?

✅ Correct! Specialization + parallelism are the core benefits. Each agent is prompt-engineered (or fine-tuned) for its specific role and can run simultaneously. The total effective context across all agents far exceeds any single agent's window. The orchestrator coordinates without needing to handle all sub-task details itself.
❌ The key benefits are specialization (each agent focused on one role) and the ability to parallelize work. This enables tackling tasks too complex for any single agent.

2. You're building a customer service agent system. 20% of complex cases should be escalated to human agents. Which architectural element handles this?

✅ Correct! Human-in-the-Loop (HITL) checkpoints are a critical safety and quality mechanism in production agentic systems. The agent evaluates its own confidence (or a confidence evaluator is inserted), and below a threshold, a human is brought in. LangGraph has native support for breakpoints and human input nodes.
❌ Temperature affects diversity, not routing to humans. The right design pattern here is HITL: explicit checkpoints where low-confidence decisions are escalated to human review.

3. What is the key distinction between LangGraph and AutoGen that guides which one to choose?

✅ Correct! This is the core architectural distinction. LangGraph is ideal when you need deterministic state transitions, checkpointing, and complex branching logic (think production workflows). AutoGen is ideal when the task naturally emerges from agents having a conversation, especially for code-heavy tasks where back-and-forth dialogue refines solutions.
❌ Both support multi-agent setups and work with various LLMs. The key difference is the interaction model: graph-based state machine (LangGraph) vs. conversational dialogue (AutoGen).

4. Why is evaluating multi-agent system performance harder than evaluating a traditional NLP classifier?

✅ Correct! An agent can reach a correct answer via faulty reasoning (lucky result) or reach a wrong answer despite good reasoning (bad luck). Evaluating only the final answer misses this. Process evaluation requires inspecting the full agent trace: were tool calls appropriate? Were observations correctly interpreted? Did the agent stay on goal? Benchmarks like AgentBench and WebArena were specifically designed for this.
❌ The challenge is the need for both process and outcome evaluation. Traditional NLP evaluates output quality; agent evaluation must also assess whether the path taken to reach the output was sound.

🎓 Course Complete!

You've journeyed from tokenization and Bag-of-Words all the way to multi-agent AI systems operating at the frontier of what's possible. The field will keep evolving, but you now have the foundations to understand and evaluate whatever comes next.

15 Weeks Covered · 13 Core Modules · 5 Homeworks · 40+ Topics Mastered
🔬 Beyond the Slides · Graduate Depth

Knowledge Graphs, GraphRAG & Structured Knowledge in 2026

Unstructured text retrieval (vector RAG) struggles with global questions, multi-hop reasoning, and relationship-heavy queries. Knowledge Graphs and GraphRAG represent the next frontier, combining symbolic structure with neural generation. This section covers KG fundamentals through Microsoft's GraphRAG system.

πŸ•ΈοΈ Knowledge Graph Fundamentals Foundations

A Knowledge Graph (KG) represents world knowledge as a directed graph: nodes are entities, edges are relations. The triple (head, relation, tail) is the atomic unit, e.g. (Berlin, capital_of, Germany).

KG Construction Pipeline

  1. Entity Recognition: identify mentions (NER)
  2. Entity Linking: map mentions to canonical KG nodes (e.g., Wikidata QIDs)
  3. Relation Extraction: extract (h, r, t) triples from text
  4. Schema Mapping: align to ontology (RDF, OWL)
  5. Conflict Resolution: handle contradictory facts across sources
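The pipeline's end product can be pictured as a queryable set of triples. A toy sketch, with illustrative facts:

```python
# Toy triple store: the (head, relation, tail) facts a KG pipeline produces.
triples = {
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Germany", "member_of", "EU"),
}

def query(head=None, relation=None, tail=None):
    """Match triples against a partial pattern (None acts as a wildcard)."""
    return [
        (h, r, t) for (h, r, t) in triples
        if head in (None, h) and relation in (None, r) and tail in (None, t)
    ]

capitals = query(relation="capital_of")  # all capital_of facts
```

Real KG stores answer the same kind of pattern query at scale via SPARQL (e.g. over Wikidata) rather than a list comprehension.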

Major Knowledge Bases

  • Wikidata: 100M+ items, community-maintained, SPARQL queryable
  • Freebase: deprecated (→ Wikidata), used in distant supervision RE
  • NELL: never-ending learning from web text
  • ConceptNet: commonsense knowledge (5M+ assertions)
  • UMLS: biomedical/clinical concepts

πŸ“ TransE: Knowledge Graph Embeddings PhD

TransE (Bordes et al., 2013) embeds KG entities and relations in vector space. Key idea: for a true triple (h, r, t), the relation vector r should approximate the translation from h to t.

# TransE scoring function
score(h, r, t) = -||h + r - t||

# Training with margin ranking loss
L = Σ_{(h,r,t) ∈ S} Σ_{(h',r,t') ∈ S'} max(0, γ + score(h',r,t') - score(h,r,t))
# where S' = corrupted (negative) triples
#       γ  = margin hyperparameter

# Analogy with word2vec: Berlin - Germany + France ≈ Paris
# TransE makes the same idea explicit: a relation ("capital_of",
# "president_of", ...) is itself a translation vector between entities.
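A runnable version of the scoring function and margin loss on toy 2-d embeddings (the vectors are hand-picked for illustration, not trained):

```python
# TransE in miniature: score = -||h + r - t||, trained so that true
# triples outscore corrupted ones by at least a margin gamma.
import math

def score(h, r, t):
    """Negative L2 distance of h + r from t; higher = more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(pos, neg, gamma=1.0):
    """Hinge loss: zero once the true triple beats the corrupted one by gamma."""
    return max(0.0, gamma + score(*neg) - score(*pos))

# Hand-picked embeddings where germany + capital_of lands exactly on berlin.
germany, capital_of = [1.0, 0.0], [0.0, 1.0]
berlin, paris = [1.0, 1.0], [5.0, 5.0]  # paris used as a corrupted tail

true_score = score(germany, capital_of, berlin)   # 0.0, a perfect fit
loss = margin_loss((germany, capital_of, berlin),
                   (germany, capital_of, paris))  # 0.0, margin satisfied
```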

TransE Limitations & Successors

| Model | Handles | Key Innovation |
| --- | --- | --- |
| TransE | 1-to-1 relations | h + r ≈ t |
| TransR | 1-to-N, N-to-1 | Relation-specific spaces |
| RotatE | Symmetry, inversion | Rotation in the complex plane |
| ComplEx | Asymmetric relations | Complex-valued embeddings |
| DistMult | Symmetric relations | Bilinear scoring |

🌐 GraphRAG (Microsoft, 2024) 2024 Breakthrough

📖 The Problem with Naive RAG

Standard RAG retrieves semantically similar chunks. But for global questions ("What are the main themes across all our documents?", "Summarize the key risks mentioned in our corpus"), no single chunk contains the answer. GraphRAG builds a knowledge graph from the entire corpus, clusters it, and enables both local and global queries.

GraphRAG Indexing Pipeline

  1. Extract entities & relations from all documents (LLM-based)
  2. Build knowledge graph: entities as nodes, relations as edges
  3. Run Leiden community detection algorithm
  4. Generate hierarchical community summaries (LLM)
  5. Index: vector store + graph structure + community reports
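Steps 2 and 3 of the pipeline in miniature, with connected components standing in for the Leiden algorithm (a simplification: Leiden finds denser sub-communities within components; the relations are illustrative):

```python
# Build a graph from extracted (head, relation, tail) triples, then group
# entities into communities. Connected components stand in for Leiden here.
from collections import defaultdict

relations = [("Alice", "works_at", "Acme"), ("Bob", "works_at", "Acme"),
             ("Carol", "cites", "PaperX")]

graph = defaultdict(set)
for h, _, t in relations:   # entities as nodes, relations as (undirected) edges
    graph[h].add(t)
    graph[t].add(h)

def communities(graph):
    seen, out = set(), []
    for node in list(graph):
        if node in seen:
            continue
        comp, stack = set(), [node]
        while stack:                 # depth-first flood fill
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n])
        seen |= comp
        out.append(comp)             # each community would get an LLM summary
    return out

groups = communities(graph)  # {Alice, Bob, Acme} and {Carol, PaperX}
```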

GraphRAG Query Types

  • Local search: entity-centric query → find relevant entities → retrieve neighbors + summaries → LLM answer
  • Global search: query against community reports at all levels → parallel LLM map step → reduce step → final answer
  • Drift search (2025): hybrid local+global for exploratory queries
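Local search in miniature: start from the query's entity, pull its graph neighbors and their summaries into the LLM's context. The neighbor lists and summaries below are illustrative stand-ins for the indexed graph:

```python
# Local search sketch: entity -> neighbors -> summaries -> grounding context.
neighbors = {"Acme": ["Alice", "Bob"], "Alice": ["Acme"], "Bob": ["Acme"]}
summaries = {"Acme": "A company.", "Alice": "An engineer.", "Bob": "A manager."}

def local_context(entity: str) -> str:
    """Collect the entity and its one-hop neighbors as grounding text."""
    hits = [entity] + neighbors.get(entity, [])
    return "\n".join(f"{e}: {summaries[e]}" for e in hits)

context = local_context("Acme")  # would be passed to the LLM as grounding
```

Global search is the complement: instead of walking outward from one entity, it maps the query over every community report and reduces the partial answers.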
| Dimension | Naive RAG | GraphRAG |
| --- | --- | --- |
| Retrieval unit | Text chunks (flat) | Entities, relationships, community summaries |
| Global queries | ❌ Poor (no single chunk has the answer) | ✅ Strong (community summaries) |
| Multi-hop reasoning | ❌ Weak (individual chunks) | ✅ Better (graph traversal) |
| Indexing cost | Low (embed & chunk) | High (LLM entity extraction × corpus size) |
| Query cost | Low (single vector search) | Higher (graph traversal + parallel LLM calls) |
| Best for | Factoid QA, specific retrieval | Corpus summarization, theme extraction, relationship queries |

GraphRAG improves answer comprehensiveness by 50–70% over naive RAG on global questions (Microsoft, 2024). It is actively used in enterprise document intelligence, scientific literature analysis, and legal discovery.