Agentic AI & Tool Integrations Domain 2.1
AI agents are autonomous systems that perceive, reason, and act using tools. This is one of the highest-weight Domain 2 topics — master the architecture, tool design, and AWS service mapping.
What Is an AI Agent?
An AI agent combines a foundation model (the "brain") with a set of tools and an execution loop. It moves beyond single-turn Q&A to multi-step autonomous task completion: it reasons about which action to take, calls a tool, observes the result, and decides whether to act again or respond.
ReAct Loop (Reason → Act → Observe)
- Thought: LLM reasons about what action to take next
- Action: Calls a specific tool with parameters
- Observation: Receives and processes the tool's result
- Repeat until task is complete
- Final response returned to user
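The loop above can be sketched in a few lines. This is a minimal illustration with a stubbed-out `call_model` and tool registry (a real agent would invoke an LLM and Lambda-backed tools); the iteration cap guards against infinite loops.

```python
def call_model(history):
    # Stub standing in for an LLM call (e.g. Bedrock Converse): scripted to
    # request one tool call, then produce a final answer.
    if not any(step[0] == "observation" for step in history):
        return {"type": "action", "tool": "get_weather", "input": {"city": "Seattle"}}
    return {"type": "final", "text": "It is 12C in Seattle."}

TOOLS = {"get_weather": lambda city: {"temp_c": 12}}  # stub tool registry

def react_loop(user_message, max_iterations=5):
    history = [("user", user_message)]
    for _ in range(max_iterations):                    # cap to avoid runaway loops
        step = call_model(history)                     # Thought: decide next action
        if step["type"] == "final":
            return step["text"]                        # Final response to user
        result = TOOLS[step["tool"]](**step["input"])  # Action: call the tool
        history.append(("observation", result))        # Observation: feed result back
    return "Stopped: iteration limit reached."
```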
Bedrock Agents Architecture
- Fully managed agent orchestration on AWS
- Connects to Action Groups (Lambda-backed tools)
- Connects to Knowledge Bases for RAG context
- Supports Guardrails for safe execution
- Built-in memory (session context up to 8h with AgentCore)
- Trace mode for step-by-step debugging
Strands SDK (Open-Source)
- AWS open-source agent framework
- Full developer control vs. managed Bedrock Agents
- Supports MCP (Model Context Protocol) natively
- Modular: swap prompting strategy, memory, tools independently
- Built-in testing & evaluation hooks
- Choose when you need customization or self-hosted deployment
AgentCore Components
- Runtime: Session isolation up to 8 hours
- Policy: Natural language → Cedar policies
- Memory: Cross-session learning & persistence
- Evaluations: 13 built-in quality evaluators
Tool / Function Calling — Schema Design
Tools are the "hands" of an agent. Each tool needs a well-defined schema so the LLM knows when and how to call it.
The `description` field is what the LLM reads to decide when to call a tool. A vague description causes wrong tool selection; a precise description with "use when…" language dramatically improves accuracy.
| Tool Design Principle | Good Practice | Common Mistake |
|---|---|---|
| Tool granularity | One tool per atomic capability | One mega-tool that does everything |
| Parameter names | Self-documenting (customer_id, not id) | Generic names (param1, data) |
| Error handling | Return structured error objects the LLM can interpret | Throw raw exceptions that confuse the agent |
| Output format | Consistent JSON schema every time | Variable format depending on result |
| Idempotency | Safe to retry without side effects | Tools that charge a card or send an email on every call |
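A sketch of what a schema following these principles might look like, loosely modeled on Bedrock Action Group function schemas; the tool name, fields, and wording are illustrative, not an exact API shape.

```python
# Illustrative tool schema following the design principles above.
GET_ORDER_STATUS_TOOL = {
    "name": "get_order_status",  # one tool per atomic capability
    # Precise "use when..." description so the LLM selects the right tool.
    "description": (
        "Look up the current status of a single order. "
        "Use when the user asks where an order is or when it will arrive. "
        "Do NOT use for refunds or order changes."
    ),
    "parameters": {
        "customer_id": {  # self-documenting, not 'id' or 'param1'
            "type": "string",
            "description": "The customer's unique account identifier.",
            "required": True,
        },
        "order_id": {
            "type": "string",
            "description": "The order number, e.g. ORD-12345.",
            "required": True,
        },
    },
}
```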
Agent Error Handling & Multi-Step Patterns
Error Handling Patterns
- Tool timeout: Lambda 15-min max; return partial result + error flag
- Tool returns empty: Agent should re-reason, not loop indefinitely
- Hallucinated parameters: Validate inputs in Lambda before execution
- Infinite loops: Set `maxIterations` in Bedrock Agents config
- Step Functions circuit breaker: Use for deterministic multi-step workflows
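The hallucinated-parameters pattern can be sketched as a Lambda handler that validates LLM-supplied inputs and returns a structured error object the agent can reason about; the event shape and field names here are assumptions for illustration.

```python
# Sketch of Lambda-side validation of a tool call: check the LLM-supplied
# parameters before execution and return a structured error (not a raw
# exception) so the agent can re-reason instead of failing opaquely.
def handle_tool_call(event):
    params = event.get("parameters", {})
    order_id = params.get("order_id", "")
    if not order_id.startswith("ORD-"):
        # Structured error the agent can read and recover from
        return {"status": "error",
                "message": "order_id must look like 'ORD-12345'; ask the user to confirm it."}
    # ... real lookup would happen here ...
    return {"status": "ok", "order_id": order_id, "state": "shipped"}
```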
Sequential vs. Parallel Tool Calls
- Sequential: Each tool result feeds the next (data dependencies)
- Parallel: Independent data fetches that are merged later
- Bedrock Agents supports parallel tool calls natively
- Use parallel when: search + profile lookup + pricing query all needed
- Use sequential when: step B requires step A's output
Use Bedrock Agents When
- You want managed infrastructure & zero agent loop code
- You need built-in KB integration + Guardrails
- Minimal ops overhead is the top priority
- Standard tool-calling patterns are sufficient
Use Strands SDK When
- You need custom prompting strategy or memory logic
- Multi-model agent orchestration across providers
- MCP (Model Context Protocol) server integration needed
- Full control over agent loop is required
Use Step Functions (NOT Agents) When
- Workflow must be deterministic with audit trail
- Compliance requires guaranteed execution order
- No autonomous reasoning needed — just orchestration
MCP on ECS Fargate When
- MCP server needs persistent SSE connections
- Lambda is a trap here — no persistent connections
- Long-lived tool connections (databases, streaming)
⚠ Agentic AI Exam Traps
- Lambda for MCP servers: Wrong — Lambda can't maintain persistent SSE connections. Use ECS Fargate.
- Bedrock Flows vs. Agents: Flows = no-code visual prompt chains (not autonomous). Agents = autonomous reasoning loops.
- Step Functions vs. Agents: Step Functions when deterministic; Agents when adaptive reasoning needed.
- AgentCore Runtime session limit: 8 hours max — not unlimited, not 24 hours.
- Strands requires self-hosting: More ops overhead than Bedrock Agents — not serverless by default.
Model Deployment Strategies Domain 2.2
How you deploy a model determines your latency, cost, and scalability profile. Know each option's tradeoffs cold.
The Three Deployment Strategies
| Strategy | Service / API | When to Use | Key Trade-off |
|---|---|---|---|
| On-Demand (serverless) | Bedrock InvokeModel / Lambda | Variable/unpredictable traffic, event-driven, cost-sensitive | Cold start latency; pay per token |
| Provisioned Throughput | Bedrock Provisioned Throughput | Consistent high-volume traffic, SLA-bound latency | Reserved capacity cost even when idle; commit required |
| Hybrid | SageMaker AI Endpoints | Variable workloads needing auto-scale with baseline | More infra to manage; most flexible |
Bedrock Provisioned Throughput Deep Dive
What It Provides
- Reserved model capacity in Model Units (MUs)
- Consistent low-latency regardless of AWS load
- Required for: custom fine-tuned models, no-expiry commitments
- Commitment: 1-month or 6-month terms (no-commit = on-demand rate)
When NOT to Use It
- Unpredictable or spiky traffic patterns
- Development or testing workloads
- Traffic < 70% utilization of reserved capacity
- Short-lived experiments (pays for idle capacity)
Model Optimization Techniques
| Technique | What It Does | Use When | Trade-off |
|---|---|---|---|
| Quantization | Reduces model weight precision (FP32→FP16→INT8) | Reduce memory footprint & inference cost; edge deployment | Slight accuracy loss; INT8 saves ~75% memory vs FP32 |
| Knowledge Distillation | Trains small "student" model to mimic large "teacher" | Need small, fast model with similar capability | Training cost up front; student has lower ceiling |
| Model Pruning | Removes low-importance weights | Further reduce model size after training | Complex; can degrade quality if over-pruned |
| Speculative Decoding | Small draft model generates tokens; large model verifies | Reduce latency for large model inference | Requires two models; draft model accuracy matters |
Inference Endpoint Types (SageMaker)
| Endpoint Type | Best For | Key Feature |
|---|---|---|
| Real-Time Endpoint | Low-latency synchronous requests | Always-on, instant response |
| Serverless Endpoint | Infrequent, unpredictable traffic | Auto-scale to zero; cold start latency |
| Async Endpoint | Large payloads, long processing (up to 1 hour) | Non-blocking; result via S3 + SNS |
| Batch Transform | Offline bulk inference (no real-time needed) | Process full dataset; results to S3 |
| Multi-Model Endpoint | Many models, low per-model traffic | Dynamic model loading; cost efficient |
⚠ Batch Inference Trap
- Bedrock batch: `CreateModelInvocationJob` — for general text/image batch workloads
- `StartAsyncInvoke`: Nova Reel video generation ONLY — not general batch
- SageMaker Batch Transform = offline, S3-to-S3, no endpoint required
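A hedged sketch of submitting a Bedrock batch job with `CreateModelInvocationJob` via boto3; the bucket URIs, role ARN, job name, and model ID are placeholders, and the field names should be checked against the current API reference.

```python
# Build the request for a Bedrock batch inference job (results land in S3).
def build_batch_job_request(job_name, model_id, role_arn, in_uri, out_uri):
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,  # role Bedrock assumes to read/write the S3 buckets
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": in_uri}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": out_uri}},
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# bedrock = boto3.client("bedrock")
# bedrock.create_model_invocation_job(**build_batch_job_request(
#     "nightly-summaries", "anthropic.claude-3-haiku-20240307-v1:0",
#     "arn:aws:iam::123456789012:role/BedrockBatchRole",
#     "s3://my-bucket/batch-input/", "s3://my-bucket/batch-output/"))
```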
Containerization & Safeguarding
Container Benefits for GenAI
- Reproducible environment (model + dependencies pinned)
- Portable across dev → staging → prod
- ECS / EKS for scalable deployment
- Amazon ECR for private container registry
- Lambda container images for serverless (up to 10GB)
Safeguarding Workflows
- Step Functions: Prevent infinite workflow loops; timeout states
- Lambda timeouts: Control max run time (up to 15 min)
- IAM policies: Resource boundaries prevent unauthorized access
- ECS circuit breaker: Auto rollback on deployment failures
- Bedrock Guardrails: Input/output safety at model layer
Enterprise Integration Architectures Domain 2.3
Real-world GenAI solutions connect to existing enterprise systems. Know the integration patterns, async workflows, and identity federation approaches.
Integration Architecture Patterns
| Pattern | Components | Use Case |
|---|---|---|
| Synchronous API | API Gateway → Lambda → Bedrock | Chatbots, real-time Q&A, low-latency responses (<30s) |
| Async / Queue-backed | SQS → Lambda → Bedrock → SNS | Batch document processing, report generation, email summarization |
| Event-driven | S3 Event → EventBridge → Lambda → Bedrock | Auto-process uploads (PDFs, images) as they arrive |
| Streaming | API Gateway WebSocket / AppSync → Bedrock streaming | Real-time token streaming for chat UX |
| Workflow orchestration | Step Functions → multiple Lambda → Bedrock | Multi-step document pipelines with retry/error handling |
Identity & Access Patterns
Cognito OIDC Pattern
- User authenticates via Cognito User Pool
- Cognito issues JWT → exchanged for temporary AWS credentials
- Credentials scoped to user's IAM role
- Use for: web/mobile apps calling Bedrock directly
IAM Identity Center (SSO)
- Federated access for enterprise employees
- Maps corporate IdP (Okta, Azure AD) → AWS roles
- Use for: internal tools, developer access to Bedrock
- Supports permission sets across multiple accounts
Cross-Account Access
- Resource-based policies on Bedrock knowledge bases
- IAM role assumption from trusted accounts
- Use for: centralized AI account serving multiple business units
VPC Endpoints for Bedrock
- PrivateLink endpoint keeps traffic off public internet
- Required for compliance (HIPAA, FedRAMP)
- Endpoint type: Interface endpoint for `bedrock-runtime`
- Also use for S3, DynamoDB (Gateway endpoints — free)
SQS Buffer Pattern for Knowledge Base Sync
`StartIngestionJob` = full sync. `IngestKnowledgeBaseDocuments` = incremental/document-level sync. The SQS pattern uses the incremental API for efficiency and resilience.
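The SQS buffer pattern can be sketched as an SQS-triggered Lambda that drains a batch of document-change messages and calls the incremental API. The message body shape and the `documents` payload below are assumptions; check the `bedrock-agent` API reference for the exact document schema before relying on this.

```python
import json

def lambda_handler(event, context, client=None):
    """Drain an SQS batch of document-change messages and trigger an
    incremental knowledge base sync (instead of a full StartIngestionJob)."""
    docs = []
    for record in event["Records"]:               # SQS delivers up to 10 messages per batch by default
        body = json.loads(record["body"])
        docs.append({"s3_uri": body["s3_uri"]})   # hypothetical message shape
    if client is not None:                        # e.g. boto3.client("bedrock-agent")
        client.ingest_knowledge_base_documents(
            knowledgeBaseId="KB123456",           # placeholder IDs
            dataSourceId="DS123456",
            documents=docs,                       # real payload must follow the API's document schema
        )
    return {"ingested": len(docs)}
```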
Foundation Model API Patterns Domain 2.4
Know every Bedrock API by name, purpose, and when it is the right choice vs. a distractor.
Bedrock API Reference
| API | Purpose | Key Detail |
|---|---|---|
| `InvokeModel` | Provider-specific single-turn call | Request/response body is model-specific JSON; synchronous |
| `InvokeModelWithResponseStream` | Streaming token response | Use for chat UX; returns chunked stream |
| `Converse` | Unified multi-turn chat — provider agnostic | Standardized message format; preferred for multi-model apps |
| `ConverseStream` | Streaming version of Converse | Same as Converse but chunks token output |
| `RetrieveAndGenerate` | RAG in one managed call | Bedrock handles retrieval + generation; citations included |
| `Retrieve` | KB retrieval only (no generation) | Use when you want to handle generation yourself |
| `CreateModelInvocationJob` | Async batch inference | Large-scale text/image batch; results to S3 |
| `StartAsyncInvoke` | Nova Reel video generation async | Video ONLY — NOT general batch inference |
| `ApplyGuardrail` | Run guardrail independently of model call | Useful for testing guardrail responses in isolation |
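A minimal sketch of the provider-agnostic Converse request shape, which is what makes it preferred for multi-model apps; the model ID in the usage comment is just an example.

```python
# Build a Converse request: same message format regardless of model provider.
def build_converse_request(model_id, user_text, system_prompt=None):
    req = {
        "modelId": model_id,
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},  # standardized format
        ],
    }
    if system_prompt:
        req["system"] = [{"text": system_prompt}]  # system prompt is a top-level field
    return req

# Usage (requires boto3 and AWS credentials):
# import boto3
# rt = boto3.client("bedrock-runtime")
# resp = rt.converse(**build_converse_request(
#     "anthropic.claude-3-haiku-20240307-v1:0", "Summarize this ticket."))
# print(resp["output"]["message"]["content"][0]["text"])
```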
Inference Profiles & Cross-Region Routing
Inference Profiles
- Route requests across multiple AWS regions automatically
- Best option for: automatic failover + traffic balancing
- System-defined: AWS manages routing logic
- Cross-region: you define which regions are eligible
- Preferred over manual Route 53 routing for model calls
On-Demand vs. Provisioned
- On-demand: Pay per token, scales instantly, no commitment
- Provisioned: Reserved MUs, consistent latency, 1–6 month commit
- Fine-tuned models must use Provisioned Throughput for prod
- On-demand fine-tuned: available but at on-demand rate (no SLA)
Prompt Caching
- Caches repeated prompt prefixes (e.g. a long system prompt) so they are not re-processed on every call
- Track cache usage via `cacheWriteInputTokens` and `cacheReadInputTokens` for cost tracking
Cost Optimization & Resource Efficiency Domain 4.1
GenAI costs are token-driven. Master token efficiency, tiered model routing, and caching to dramatically reduce spend.
Token Cost Fundamentals
What Costs Tokens
- Input tokens: System prompt + conversation history + user message + retrieved context
- Output tokens: Model-generated response (typically 3–5× more expensive per token than input)
- Context tokens: Accumulated history in multi-turn conversations
- Different models have different tokenizers (same text ≠ same token count)
Token Reduction Techniques
- Prompt compression: Summarize conversation history instead of passing full context
- Context window optimization: Truncate old messages using sliding window
- Response limiting: Set `maxTokens` parameter; use stop sequences
- Shorter system prompts: Remove redundant instructions
- Retrieval precision: Retrieve fewer, better chunks (reduce RAG context size)
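The sliding-window technique can be sketched in a few lines. The word-count token estimate below is a crude stand-in, since real tokenizers vary per model, so treat the budget as approximate.

```python
# Sliding-window truncation: keep the most recent turns that fit the budget.
def truncate_history(messages, max_tokens=1000):
    def approx_tokens(msg):
        return int(len(msg["text"].split()) * 1.3)  # rough words-to-tokens estimate

    kept, total = [], 0
    for msg in reversed(messages):        # walk backwards from the newest turn
        cost = approx_tokens(msg)
        if total + cost > max_tokens:
            break                         # older messages are dropped
        kept.append(msg)
        total += cost
    return list(reversed(kept))           # restore chronological order
```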
Tiered Model Routing Strategy
Not all requests need a powerful (expensive) model. Route by query complexity:
| Query Type | Recommended Model Tier | Why |
|---|---|---|
| Simple FAQ, classification, yes/no | Haiku / Nova Micro / Lite | Cheap, fast, accurate enough |
| Multi-step reasoning, analysis | Sonnet / Nova Pro | Good cost-quality balance |
| Complex code gen, research synthesis | Opus / Nova Premier | Maximum capability needed |
| Repeated identical tasks at scale | Batch inference (any tier) | 50% cost reduction vs. real-time |
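One way to sketch tiered routing; the keyword heuristic is a stand-in for a real complexity classifier (production systems often use a small, cheap classifier model), and the model IDs are examples only.

```python
# Route each query to the cheapest model tier that can handle it.
MODEL_TIERS = {
    "simple": "anthropic.claude-3-haiku-20240307-v1:0",    # cheap, fast
    "standard": "anthropic.claude-3-sonnet-20240229-v1:0", # cost-quality balance
    "complex": "anthropic.claude-3-opus-20240229-v1:0",    # max capability
}

def route(query):
    q = query.lower()
    if any(w in q for w in ("write code", "refactor", "synthesize", "research")):
        return MODEL_TIERS["complex"]
    if any(w in q for w in ("why", "compare", "analyze", "explain")):
        return MODEL_TIERS["standard"]
    return MODEL_TIERS["simple"]          # FAQs, classification, yes/no
```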
Batch Inference for Cost Reduction
Use Batch Inference When
- Results needed in hours, not seconds
- Large volume: 100s–millions of records
- No interactive user waiting for response
- Cost reduction of ~50% vs. on-demand
- Report generation, document summarization, embeddings refresh
Do NOT Use Batch When
- User is waiting for real-time response
- Latency SLA < 5 seconds
- Interactive chat or streaming needed
- Small number of records (<100) — overhead not worth it
Capacity Planning & Auto-Scaling
| Scenario | Strategy | AWS Service |
|---|---|---|
| Steady 24/7 high volume | Provisioned Throughput (committed) | Bedrock Provisioned Throughput |
| 10× event-based spikes | Cross-Region Inference Profile | Bedrock Inference Profiles |
| Dev/test workloads | On-demand only; no reservation | Bedrock On-Demand |
| SageMaker endpoint scaling | Auto Scaling policy on endpoint | Application Auto Scaling |
| Cost monitoring & alerts | Budget alerts + Cost Explorer tags | AWS Budgets + Cost Explorer |
Tag all GenAI resources with Project, CostCenter, and Environment tags. Use AWS Cost Explorer to break down GenAI spend by tag.
Performance Optimization Domain 4.2
Latency in GenAI comes from retrieval, model inference, and network. Address each layer systematically.
Latency Optimization Framework
| Latency Source | Optimization Technique | Expected Impact |
|---|---|---|
| Model inference time | Use smaller/faster model tier; quantization; Provisioned Throughput | 20–80% reduction |
| Token generation speed | Streaming response (InvokeModelWithResponseStream); speculative decoding | Perceived latency drops for users |
| Retrieval time | HNSW vector index (ANN); pre-filter metadata; index warm-up | ms-level retrieval |
| Pre-computed responses | Cache answers for predictable queries (FAQ, product descriptions) | Near-zero latency for cached hits |
| Cold start (Lambda) | Provisioned concurrency; keep-warm pings; container reuse | Eliminates cold start |
| Prompt processing | Prompt caching for repeated prefixes; shorter system prompts | Reduces input token cost & time |
Vector Index Optimization
Index Types
- HNSW (Hierarchical NSW): Best for high-QPS, ANN search; OpenSearch default
- IVF (Inverted File): Good for large datasets; slower but memory-efficient
- Flat/Exact: 100% accurate but O(n) — only for small datasets
- Hybrid (BM25 + HNSW): OpenSearch hybrid search for best relevance
OpenSearch Auto-Tune
- Built-in feature — no new service needed
- Automatically optimizes: JVM heap, shard allocation, cache settings
- Least operational overhead for latency issues
- Exam trap: ElastiCache in front of OpenSearch adds complexity & management — wrong answer for "least ops overhead"
Query Optimization
- Metadata pre-filtering (filter by domain/date before vector search)
- Query expansion: synonym injection, HyDE (Hypothetical Document Embeddings)
- Reranker model: cross-encoder re-scores top-k after retrieval
- RRF (Reciprocal Rank Fusion): merge BM25 + semantic scores
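RRF itself is short enough to know by heart: each document scores the sum of 1/(k + rank) across the rankings it appears in, with k = 60 the commonly cited smoothing constant.

```python
# Reciprocal Rank Fusion: merge a BM25 ranking and a vector ranking
# (each a list of doc ids, best first) into one fused ordering.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```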
Parameter Tuning
- `temperature`: 0 = deterministic; 1 = creative
- `top_p`: nucleus sampling — limits vocabulary breadth
- `top_k`: limits token candidates per step
- `maxTokens`: hard cap on output length (cost + latency)
- Stop sequences: terminate output at defined strings
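These parameters map onto the Converse API's `inferenceConfig` block, sketched below with example values, not recommendations; note that `top_k` is model-specific and is typically passed via `additionalModelRequestFields` rather than this block.

```python
# Illustrative inferenceConfig for the Bedrock Converse API.
def build_inference_config(deterministic=False):
    return {
        "temperature": 0.0 if deterministic else 0.7,  # 0 = deterministic
        "topP": 0.9,                      # nucleus sampling cutoff
        "maxTokens": 512,                 # hard cap on output length
        "stopSequences": ["</answer>"],   # terminate output at this string
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# rt = boto3.client("bedrock-runtime")
# rt.converse(modelId="anthropic.claude-3-haiku-20240307-v1:0",
#             messages=[{"role": "user", "content": [{"text": "Hi"}]}],
#             inferenceConfig=build_inference_config(deterministic=True))
```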
Parallel Processing & Streaming
Parallel Requests Pattern
- Fan-out: send multiple Bedrock calls simultaneously via async Python/Lambda
- Use `asyncio.gather()` or Step Functions parallel branches
- Combine results after all complete (or first-finished wins)
- Good for: multi-section document generation, multi-query retrieval
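The fan-out pattern in minimal form; the `invoke` coroutine below stands in for an async Bedrock call (e.g. via aioboto3 or `run_in_executor` around a boto3 client).

```python
import asyncio

async def invoke(prompt):
    await asyncio.sleep(0)               # placeholder for network I/O
    return f"response to: {prompt}"      # stand-in for a model response

async def fan_out(prompts):
    # All calls run concurrently; gather returns results in input order.
    return await asyncio.gather(*(invoke(p) for p in prompts))

results = asyncio.run(fan_out(["summary", "risks", "pricing"]))
```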
Response Streaming
- Use `InvokeModelWithResponseStream` or `ConverseStream`
- First token appears in ~300ms vs. ~5s for full non-streaming response
- Required for real-time chat feel in web apps
- AppSync subscriptions or WebSocket API Gateway for browser push
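Consuming a stream can be sketched against the `contentBlockDelta` event shape; the fake stream below stands in for the event iterator a real ConverseStream response exposes (under the response's `stream` key), so the pattern runs without an AWS call.

```python
# Accumulate streamed text deltas; a real UI would flush each chunk
# to the user as it arrives instead of collecting them.
def collect_stream(stream):
    parts = []
    for event in stream:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            parts.append(delta["text"])
    return "".join(parts)

# Fake events mirroring ConverseStream's chunk shape (an assumption to verify).
fake_stream = [
    {"contentBlockDelta": {"delta": {"text": "Hel"}}},
    {"contentBlockDelta": {"delta": {"text": "lo"}}},
    {"messageStop": {"stopReason": "end_turn"}},
]
```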
Caching Strategies Domain 4.1–4.2
Caching reduces both cost (fewer model invocations) and latency. The table below covers the distinct caching layers — know when to use each.
Caching Layers
| Cache Layer | What's Cached | AWS Service | Invalidation |
|---|---|---|---|
| Prompt Cache | Repeated token prefix (system prompt) | Bedrock native prompt caching | Automatic on prefix change |
| Semantic Cache | Full responses for semantically similar queries | ElastiCache (Redis) + embedding similarity check | TTL-based or explicit flush |
| Embedding Cache | Pre-computed document embeddings | ElastiCache / DynamoDB | On document update |
| Result Cache | Exact-match query responses | DynamoDB / ElastiCache | TTL or version-key based |
| Edge Cache | Static/semi-static AI-generated content | CloudFront | Cache invalidation API |
Semantic Caching Pattern
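A minimal semantic-cache lookup, assuming query embeddings are already computed; a real system would use an embedding model (e.g. Titan Embeddings) and a Redis vector index instead of a Python list, but the cosine-threshold logic is the same.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

CACHE = []  # list of (query_embedding, cached_response) pairs

def semantic_lookup(query_embedding, threshold=0.9):
    # Find the most similar cached query; return its response on a hit.
    best = max(CACHE, key=lambda e: cosine(query_embedding, e[0]), default=None)
    if best and cosine(query_embedding, best[0]) >= threshold:
        return best[1]        # cache hit: skip the model call entirely
    return None               # cache miss: invoke the model, then store the result
```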
DynamoDB + ElastiCache Decision
| Need | Use | Why |
|---|---|---|
| Microsecond exact-match lookups at scale | ElastiCache (Redis) | In-memory; sub-millisecond reads |
| Persistent cache that survives restart | DynamoDB (TTL attribute) | Durable; serverless; auto-TTL expiry |
| Session state for multi-turn chat | DynamoDB (session_id key) | Serverless; scales with users automatically |
| Embedding cache with vector search | ElastiCache for Redis (RediSearch) | Built-in vector similarity search |
| Distributed rate limiting | ElastiCache (Redis atomic ops) | INCR + EXPIRE pattern for token bucket |
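The INCR + EXPIRE pattern in the last row can be sketched as a fixed-window limiter (a simpler cousin of a full token bucket); a tiny in-memory stand-in replaces the real `redis.Redis` client here so the logic is runnable.

```python
import time

class FakeRedis:
    """In-memory stand-in for redis.Redis, just enough for this sketch."""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, seconds):
        pass  # no-op here; real Redis would set the window's TTL

def allow_request(r, user_id, limit=5, window_s=60):
    key = f"rate:{user_id}:{int(time.time() // window_s)}"  # one key per window
    count = r.incr(key)          # atomic in real Redis, so safe under concurrency
    if count == 1:
        r.expire(key, window_s)  # start the TTL on the window's first request
    return count <= limit
```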
Exam Traps — Domain 2 & 4
These are the most common wrong-answer choices. Read each carefully before exam day.
Domain 2 Traps
| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| MCP server hosting | AWS Lambda (no persistent connection) | ECS Fargate (persistent SSE connections) |
| Deterministic multi-step workflow | Bedrock Agents | AWS Step Functions |
| Bedrock Flows purpose | "Autonomous reasoning agent" | No-code visual prompt chain (not autonomous) |
| Batch inference API | StartAsyncInvoke (video only) | CreateModelInvocationJob |
| Agent memory across sessions | "Built-in Bedrock Agents" | AgentCore Memory module |
| Fine-tuned model in production | On-demand access | Provisioned Throughput (required for prod custom models) |
| Cross-region failover | Route 53 health checks | Bedrock Inference Profiles |
| Model distillation input | Prompt-response pairs | Prompts only (Bedrock synthesizes responses) |
Domain 4 Traps
| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| OpenSearch latency with least ops | Add ElastiCache in front | Enable OpenSearch Auto-Tune (built-in, zero new service) |
| Reduce costs for repeated prompts | Smaller model only | Prompt caching (same prefix → 90% token discount) |
| Real-time chat feel | Full synchronous response | Streaming (ConverseStream / InvokeModelWithResponseStream) |
| Consistent high traffic | On-demand Bedrock | Provisioned Throughput |
| Unpredictable traffic | Provisioned Throughput | On-demand or inference profiles for burst |
| Cost tagging GenAI workloads | CloudTrail only | AWS Cost Explorer + resource tags |
Decision Trees
Use these frameworks to quickly pick the right answer on scenario-based questions.
Which Agent Framework?
- Managed infrastructure, standard tool calling, minimal ops → Bedrock Agents
- Custom agent loop, MCP integration, multi-provider orchestration → Strands SDK
- Deterministic workflow with audit trail, no autonomous reasoning → Step Functions
Which Deployment Mode?
- User waiting in real time → On-demand (spiky traffic) or Provisioned Throughput (consistent high volume)
- Results needed in hours, large volume, no one waiting → batch: `CreateModelInvocationJob` (Bedrock) or SageMaker Batch Transform.