Agentic AI & Tool Integrations Domain 2.1
AI agents are autonomous systems that perceive, reason, and act using tools. This is one of the highest-weight Domain 2 topics — master the architecture, tool design, and AWS service mapping.
What Is an AI Agent?
An AI agent combines a foundation model (the "brain") with a set of tools and an execution loop. It moves beyond single-turn Q&A to multi-step autonomous task completion: it reasons about which action to take, calls a tool, observes the result, and decides whether to act again or respond.
ReAct Loop (Reason → Act → Observe)
- Thought: LLM reasons about what action to take next
- Action: Calls a specific tool with parameters
- Observation: Receives and processes the tool's result
- Repeat until task is complete
- Final response returned to user
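The loop above can be sketched in a few lines. This is a minimal illustration with a stubbed-out `call_model` and tool registry (a real agent would invoke an LLM and Lambda-backed tools); the iteration cap guards against infinite loops.

```python
def call_model(history):
    # Stub standing in for an LLM call (e.g. Bedrock Converse): scripted to
    # request one tool call, then produce a final answer.
    if not any(step[0] == "observation" for step in history):
        return {"type": "action", "tool": "get_weather", "input": {"city": "Seattle"}}
    return {"type": "final", "text": "It is 12C in Seattle."}

TOOLS = {"get_weather": lambda city: {"temp_c": 12}}  # stub tool registry

def react_loop(user_message, max_iterations=5):
    history = [("user", user_message)]
    for _ in range(max_iterations):                    # cap to avoid runaway loops
        step = call_model(history)                     # Thought: decide next action
        if step["type"] == "final":
            return step["text"]                        # Final response to user
        result = TOOLS[step["tool"]](**step["input"])  # Action: call the tool
        history.append(("observation", result))        # Observation: feed result back
    return "Stopped: iteration limit reached."
```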
Bedrock Agents Architecture
- Fully managed agent orchestration on AWS
- Connects to Action Groups (Lambda-backed tools)
- Connects to Knowledge Bases for RAG context
- Supports Guardrails for safe execution
- Built-in memory (session context up to 8h with AgentCore)
- Trace mode for step-by-step debugging
Strands SDK (Open-Source)
- AWS open-source agent framework
- Full developer control vs. managed Bedrock Agents
- Supports MCP (Model Context Protocol) natively
- Modular: swap prompting strategy, memory, tools independently
- Built-in testing & evaluation hooks
- Choose when you need customization or self-hosted deployment
AgentCore Components
- Runtime: Session isolation up to 8 hours
- Policy: Natural language → Cedar policies
- Memory: Cross-session learning & persistence
- Evaluations: 13 built-in quality evaluators
Tool / Function Calling — Schema Design
Tools are the "hands" of an agent. Each tool needs a well-defined schema so the LLM knows when and how to call it.
The `description` field is what the LLM reads to decide when to call a tool. A vague description causes wrong tool selection; a precise description with "use when…" language dramatically improves accuracy.
| Tool Design Principle | Good Practice | Common Mistake |
|---|---|---|
| Tool granularity | One tool per atomic capability | One mega-tool that does everything |
| Parameter names | Self-documenting (customer_id, not id) | Generic names (param1, data) |
| Error handling | Return structured error objects the LLM can interpret | Throw raw exceptions that confuse the agent |
| Output format | Consistent JSON schema every time | Variable format depending on result |
| Idempotency | Safe to retry without side effects | Tools that charge a card or send an email on every call |
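A sketch of what a schema following these principles might look like, loosely modeled on Bedrock Action Group function schemas; the tool name, fields, and wording are illustrative, not an exact API shape.

```python
# Illustrative tool schema following the design principles above.
GET_ORDER_STATUS_TOOL = {
    "name": "get_order_status",  # one tool per atomic capability
    # Precise "use when..." description so the LLM selects the right tool.
    "description": (
        "Look up the current status of a single order. "
        "Use when the user asks where an order is or when it will arrive. "
        "Do NOT use for refunds or order changes."
    ),
    "parameters": {
        "customer_id": {  # self-documenting, not 'id' or 'param1'
            "type": "string",
            "description": "The customer's unique account identifier.",
            "required": True,
        },
        "order_id": {
            "type": "string",
            "description": "The order number, e.g. ORD-12345.",
            "required": True,
        },
    },
}
```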
Agent Error Handling & Multi-Step Patterns
Error Handling Patterns
- Tool timeout: Lambda 15-min max; return partial result + error flag
- Tool returns empty: Agent should re-reason, not loop indefinitely
- Hallucinated parameters: Validate inputs in Lambda before execution
- Infinite loops: Set `maxIterations` in Bedrock Agents config
- Step Functions circuit breaker: Use for deterministic multi-step workflows
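The hallucinated-parameters pattern can be sketched as a Lambda handler that validates LLM-supplied inputs and returns a structured error object the agent can reason about; the event shape and field names here are assumptions for illustration.

```python
# Sketch of Lambda-side validation of a tool call: check the LLM-supplied
# parameters before execution and return a structured error (not a raw
# exception) so the agent can re-reason instead of failing opaquely.
def handle_tool_call(event):
    params = event.get("parameters", {})
    order_id = params.get("order_id", "")
    if not order_id.startswith("ORD-"):
        # Structured error the agent can read and recover from
        return {"status": "error",
                "message": "order_id must look like 'ORD-12345'; ask the user to confirm it."}
    # ... real lookup would happen here ...
    return {"status": "ok", "order_id": order_id, "state": "shipped"}
```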
Sequential vs. Parallel Tool Calls
- Sequential: Each tool result feeds the next (data dependencies)
- Parallel: Independent data fetches that are merged later
- Bedrock Agents supports parallel tool calls natively
- Use parallel when: search + profile lookup + pricing query all needed
- Use sequential when: step B requires step A's output
Use Bedrock Agents When
- You want managed infrastructure & zero agent loop code
- You need built-in KB integration + Guardrails
- Minimal ops overhead is the top priority
- Standard tool-calling patterns are sufficient
Use Strands SDK When
- You need custom prompting strategy or memory logic
- Multi-model agent orchestration across providers
- MCP (Model Context Protocol) server integration needed
- Full control over agent loop is required
Use Step Functions (NOT Agents) When
- Workflow must be deterministic with audit trail
- Compliance requires guaranteed execution order
- No autonomous reasoning needed — just orchestration
MCP on ECS Fargate When
- MCP server needs persistent SSE connections
- Lambda is a trap here — no persistent connections
- Long-lived tool connections (databases, streaming)
⚠ Agentic AI Exam Traps
- Lambda for MCP servers: Wrong — Lambda can't maintain persistent SSE connections. Use ECS Fargate.
- Bedrock Flows vs. Agents: Flows = no-code visual prompt chains (not autonomous). Agents = autonomous reasoning loops.
- Step Functions vs. Agents: Step Functions when deterministic; Agents when adaptive reasoning needed.
- AgentCore Runtime session limit: 8 hours max — not unlimited, not 24 hours.
- Strands requires self-hosting: More ops overhead than Bedrock Agents — not serverless by default.
Model Deployment Strategies Domain 2.2
How you deploy a model determines your latency, cost, and scalability profile. Know each option's tradeoffs cold.
The Three Deployment Strategies
| Strategy | Service / API | When to Use | Key Trade-off |
|---|---|---|---|
| On-Demand (serverless) | Bedrock InvokeModel / Lambda | Variable/unpredictable traffic, event-driven, cost-sensitive | Cold start latency; pay per token |
| Provisioned Throughput | Bedrock Provisioned Throughput | Consistent high-volume traffic, SLA-bound latency | Reserved capacity cost even when idle; commit required |
| Hybrid | SageMaker AI Endpoints | Variable workloads needing auto-scale with baseline | More infra to manage; most flexible |
Bedrock Provisioned Throughput Deep Dive
What It Provides
- Reserved model capacity in Model Units (MUs)
- Consistent low-latency regardless of AWS load
- Required for: custom fine-tuned models, no-expiry commitments
- Commitment: 1-month or 6-month terms (no-commit = on-demand rate)
When NOT to Use It
- Unpredictable or spiky traffic patterns
- Development or testing workloads
- Traffic < 70% utilization of reserved capacity
- Short-lived experiments (pays for idle capacity)
Model Optimization Techniques
| Technique | What It Does | Use When | Trade-off |
|---|---|---|---|
| Quantization | Reduces model weight precision (FP32→FP16→INT8) | Reduce memory footprint & inference cost; edge deployment | Slight accuracy loss; INT8 saves ~75% memory vs FP32 |
| Knowledge Distillation | Trains small "student" model to mimic large "teacher" | Need small, fast model with similar capability | Training cost up front; student has lower ceiling |
| Model Pruning | Removes low-importance weights | Further reduce model size after training | Complex; can degrade quality if over-pruned |
| Speculative Decoding | Small draft model generates tokens; large model verifies | Reduce latency for large model inference | Requires two models; draft model accuracy matters |
Inference Endpoint Types (SageMaker)
| Endpoint Type | Best For | Key Feature |
|---|---|---|
| Real-Time Endpoint | Low-latency synchronous requests | Always-on, instant response |
| Serverless Endpoint | Infrequent, unpredictable traffic | Auto-scale to zero; cold start latency |
| Async Endpoint | Large payloads, long processing (up to 1 hour) | Non-blocking; result via S3 + SNS |
| Batch Transform | Offline bulk inference (no real-time needed) | Process full dataset; results to S3 |
| Multi-Model Endpoint | Many models, low per-model traffic | Dynamic model loading; cost efficient |
⚠ Batch Inference Trap
- Bedrock batch: `CreateModelInvocationJob` — for general text/image batch workloads
- `StartAsyncInvoke`: Nova Reel video generation ONLY — not general batch
- SageMaker Batch Transform = offline, S3-to-S3, no endpoint required
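A hedged sketch of submitting a Bedrock batch job with `CreateModelInvocationJob` via boto3; the bucket URIs, role ARN, job name, and model ID are placeholders, and the field names should be checked against the current API reference.

```python
# Build the request for a Bedrock batch inference job (results land in S3).
def build_batch_job_request(job_name, model_id, role_arn, in_uri, out_uri):
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,  # role Bedrock assumes to read/write the S3 buckets
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": in_uri}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": out_uri}},
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# bedrock = boto3.client("bedrock")
# bedrock.create_model_invocation_job(**build_batch_job_request(
#     "nightly-summaries", "anthropic.claude-3-haiku-20240307-v1:0",
#     "arn:aws:iam::123456789012:role/BedrockBatchRole",
#     "s3://my-bucket/batch-input/", "s3://my-bucket/batch-output/"))
```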
Containerization & Safeguarding
Container Benefits for GenAI
- Reproducible environment (model + dependencies pinned)
- Portable across dev → staging → prod
- ECS / EKS for scalable deployment
- Amazon ECR for private container registry
- Lambda container images for serverless (up to 10GB)
Safeguarding Workflows
- Step Functions: Prevent infinite workflow loops; timeout states
- Lambda timeouts: Control max run time (up to 15 min)
- IAM policies: Resource boundaries prevent unauthorized access
- ECS circuit breaker: Auto rollback on deployment failures
- Bedrock Guardrails: Input/output safety at model layer
Enterprise Integration Architectures Domain 2.3
Real-world GenAI solutions connect to existing enterprise systems. Know the integration patterns, async workflows, and identity federation approaches.
Integration Architecture Patterns
| Pattern | Components | Use Case |
|---|---|---|
| Synchronous API | API Gateway → Lambda → Bedrock | Chatbots, real-time Q&A, low-latency responses (<30s) |
| Async / Queue-backed | SQS → Lambda → Bedrock → SNS | Batch document processing, report generation, email summarization |
| Event-driven | S3 Event → EventBridge → Lambda → Bedrock | Auto-process uploads (PDFs, images) as they arrive |
| Streaming | API Gateway WebSocket / AppSync → Bedrock streaming | Real-time token streaming for chat UX |
| Workflow orchestration | Step Functions → multiple Lambda → Bedrock | Multi-step document pipelines with retry/error handling |
Identity & Access Patterns
Cognito OIDC Pattern
- User authenticates via Cognito User Pool
- Cognito issues JWT → exchanged for temporary AWS credentials
- Credentials scoped to user's IAM role
- Use for: web/mobile apps calling Bedrock directly
IAM Identity Center (SSO)
- Federated access for enterprise employees
- Maps corporate IdP (Okta, Azure AD) → AWS roles
- Use for: internal tools, developer access to Bedrock
- Supports permission sets across multiple accounts
Cross-Account Access
- Resource-based policies on Bedrock knowledge bases
- IAM role assumption from trusted accounts
- Use for: centralized AI account serving multiple business units
VPC Endpoints for Bedrock
- PrivateLink endpoint keeps traffic off public internet
- Required for compliance (HIPAA, FedRAMP)
- Endpoint type: Interface endpoint for `bedrock-runtime`
- Also use for S3, DynamoDB (Gateway endpoints — free)
SQS Buffer Pattern for Knowledge Base Sync
`StartIngestionJob` = full sync. `IngestKnowledgeBaseDocuments` = incremental/document-level sync. The SQS pattern uses the incremental API for efficiency and resilience.
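The SQS buffer pattern can be sketched as an SQS-triggered Lambda that drains a batch of document-change messages and calls the incremental API. The message body shape and the `documents` payload below are assumptions; check the `bedrock-agent` API reference for the exact document schema before relying on this.

```python
import json

def lambda_handler(event, context, client=None):
    """Drain an SQS batch of document-change messages and trigger an
    incremental knowledge base sync (instead of a full StartIngestionJob)."""
    docs = []
    for record in event["Records"]:               # SQS delivers up to 10 messages per batch by default
        body = json.loads(record["body"])
        docs.append({"s3_uri": body["s3_uri"]})   # hypothetical message shape
    if client is not None:                        # e.g. boto3.client("bedrock-agent")
        client.ingest_knowledge_base_documents(
            knowledgeBaseId="KB123456",           # placeholder IDs
            dataSourceId="DS123456",
            documents=docs,                       # real payload must follow the API's document schema
        )
    return {"ingested": len(docs)}
```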
Foundation Model API Patterns Domain 2.4
Know every Bedrock API by name, purpose, and when it is the right choice vs. a distractor.
Bedrock API Reference
| API | Purpose | Key Detail |
|---|---|---|
| `InvokeModel` | Provider-specific single-turn call | Request/response body is model-specific JSON; synchronous |
| `InvokeModelWithResponseStream` | Streaming token response | Use for chat UX; returns chunked stream |
| `Converse` | Unified multi-turn chat — provider agnostic | Standardized message format; preferred for multi-model apps |
| `ConverseStream` | Streaming version of Converse | Same as Converse but chunks token output |
| `RetrieveAndGenerate` | RAG in one managed call | Bedrock handles retrieval + generation; citations included |
| `Retrieve` | KB retrieval only (no generation) | Use when you want to handle generation yourself |
| `CreateModelInvocationJob` | Async batch inference | Large-scale text/image batch; results to S3 |
| `StartAsyncInvoke` | Nova Reel video generation async | Video ONLY — NOT general batch inference |
| `ApplyGuardrail` | Run guardrail independently of model call | Useful for testing guardrail responses in isolation |
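A minimal sketch of the provider-agnostic Converse request shape, which is what makes it preferred for multi-model apps; the model ID in the usage comment is just an example.

```python
# Build a Converse request: same message format regardless of model provider.
def build_converse_request(model_id, user_text, system_prompt=None):
    req = {
        "modelId": model_id,
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},  # standardized format
        ],
    }
    if system_prompt:
        req["system"] = [{"text": system_prompt}]  # system prompt is a top-level field
    return req

# Usage (requires boto3 and AWS credentials):
# import boto3
# rt = boto3.client("bedrock-runtime")
# resp = rt.converse(**build_converse_request(
#     "anthropic.claude-3-haiku-20240307-v1:0", "Summarize this ticket."))
# print(resp["output"]["message"]["content"][0]["text"])
```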
Inference Profiles & Cross-Region Routing
Inference Profiles
- Route requests across multiple AWS regions automatically
- Best option for: automatic failover + traffic balancing
- System-defined: AWS manages routing logic
- Cross-region: you define which regions are eligible
- Preferred over manual Route 53 routing for model calls
On-Demand vs. Provisioned
- On-demand: Pay per token, scales instantly, no commitment
- Provisioned: Reserved MUs, consistent latency, 1–6 month commit
- Fine-tuned models must use Provisioned Throughput for prod
- On-demand fine-tuned: available but at on-demand rate (no SLA)
Prompt Caching
- Caches repeated prompt prefixes (e.g. a long system prompt) so they are not re-processed on every call
- Track cache usage via `cacheWriteInputTokens` and `cacheReadInputTokens` for cost tracking
Cost Optimization & Resource Efficiency Domain 4.1
GenAI costs are token-driven. Master token efficiency, tiered model routing, and caching to dramatically reduce spend.
Token Cost Fundamentals
What Costs Tokens
- Input tokens: System prompt + conversation history + user message + retrieved context
- Output tokens: Model-generated response (typically 3–5× more expensive per token than input)
- Context tokens: Accumulated history in multi-turn conversations
- Different models have different tokenizers (same text ≠ same token count)
Token Reduction Techniques
- Prompt compression: Summarize conversation history instead of passing full context
- Context window optimization: Truncate old messages using sliding window
- Response limiting: Set `maxTokens` parameter; use stop sequences
- Shorter system prompts: Remove redundant instructions
- Retrieval precision: Retrieve fewer, better chunks (reduce RAG context size)
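The sliding-window technique can be sketched in a few lines. The word-count token estimate below is a crude stand-in, since real tokenizers vary per model, so treat the budget as approximate.

```python
# Sliding-window truncation: keep the most recent turns that fit the budget.
def truncate_history(messages, max_tokens=1000):
    def approx_tokens(msg):
        return int(len(msg["text"].split()) * 1.3)  # rough words-to-tokens estimate

    kept, total = [], 0
    for msg in reversed(messages):        # walk backwards from the newest turn
        cost = approx_tokens(msg)
        if total + cost > max_tokens:
            break                         # older messages are dropped
        kept.append(msg)
        total += cost
    return list(reversed(kept))           # restore chronological order
```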
Tiered Model Routing Strategy
Not all requests need a powerful (expensive) model. Route by query complexity:
| Query Type | Recommended Model Tier | Why |
|---|---|---|
| Simple FAQ, classification, yes/no | Haiku / Nova Micro / Lite | Cheap, fast, accurate enough |
| Multi-step reasoning, analysis | Sonnet / Nova Pro | Good cost-quality balance |
| Complex code gen, research synthesis | Opus / Nova Premier | Maximum capability needed |
| Repeated identical tasks at scale | Batch inference (any tier) | 50% cost reduction vs. real-time |
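One way to sketch tiered routing; the keyword heuristic is a stand-in for a real complexity classifier (production systems often use a small, cheap classifier model), and the model IDs are examples only.

```python
# Route each query to the cheapest model tier that can handle it.
MODEL_TIERS = {
    "simple": "anthropic.claude-3-haiku-20240307-v1:0",    # cheap, fast
    "standard": "anthropic.claude-3-sonnet-20240229-v1:0", # cost-quality balance
    "complex": "anthropic.claude-3-opus-20240229-v1:0",    # max capability
}

def route(query):
    q = query.lower()
    if any(w in q for w in ("write code", "refactor", "synthesize", "research")):
        return MODEL_TIERS["complex"]
    if any(w in q for w in ("why", "compare", "analyze", "explain")):
        return MODEL_TIERS["standard"]
    return MODEL_TIERS["simple"]          # FAQs, classification, yes/no
```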
Batch Inference for Cost Reduction
Use Batch Inference When
- Results needed in hours, not seconds
- Large volume: 100s–millions of records
- No interactive user waiting for response
- Cost reduction of ~50% vs. on-demand
- Report generation, document summarization, embeddings refresh
Do NOT Use Batch When
- User is waiting for real-time response
- Latency SLA < 5 seconds
- Interactive chat or streaming needed
- Small number of records (<100) — overhead not worth it
Capacity Planning & Auto-Scaling
| Scenario | Strategy | AWS Service |
|---|---|---|
| Steady 24/7 high volume | Provisioned Throughput (committed) | Bedrock Provisioned Throughput |
| 10× event-based spikes | Cross-Region Inference Profile | Bedrock Inference Profiles |
| Dev/test workloads | On-demand only; no reservation | Bedrock On-Demand |
| SageMaker endpoint scaling | Auto Scaling policy on endpoint | Application Auto Scaling |
| Cost monitoring & alerts | Budget alerts + Cost Explorer tags | AWS Budgets + Cost Explorer |
Tag all GenAI resources with Project, CostCenter, and Environment tags. Use AWS Cost Explorer to break down GenAI spend by tag.
Performance Optimization Domain 4.2
Latency in GenAI comes from retrieval, model inference, and network. Address each layer systematically.
Latency Optimization Framework
| Latency Source | Optimization Technique | Expected Impact |
|---|---|---|
| Model inference time | Use smaller/faster model tier; quantization; Provisioned Throughput | 20–80% reduction |
| Token generation speed | Streaming response (InvokeModelWithResponseStream); speculative decoding | Perceived latency drops for users |
| Retrieval time | HNSW vector index (ANN); pre-filter metadata; index warm-up | ms-level retrieval |
| Pre-computed responses | Cache answers for predictable queries (FAQ, product descriptions) | Near-zero latency for cached hits |
| Cold start (Lambda) | Provisioned concurrency; keep-warm pings; container reuse | Eliminates cold start |
| Prompt processing | Prompt caching for repeated prefixes; shorter system prompts | Reduces input token cost & time |
Vector Index Optimization
Index Types
- HNSW (Hierarchical NSW): Best for high-QPS, ANN search; OpenSearch default
- IVF (Inverted File): Good for large datasets; slower but memory-efficient
- Flat/Exact: 100% accurate but O(n) — only for small datasets
- Hybrid (BM25 + HNSW): OpenSearch hybrid search for best relevance
OpenSearch Auto-Tune
- Built-in feature — no new service needed
- Automatically optimizes: JVM heap, shard allocation, cache settings
- Least operational overhead for latency issues
- Exam trap: ElastiCache in front of OpenSearch adds complexity & management — wrong answer for "least ops overhead"
Query Optimization
- Metadata pre-filtering (filter by domain/date before vector search)
- Query expansion: synonym injection, HyDE (Hypothetical Document Embeddings)
- Reranker model: cross-encoder re-scores top-k after retrieval
- RRF (Reciprocal Rank Fusion): merge BM25 + semantic scores
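RRF itself is short enough to know by heart: each document scores the sum of 1/(k + rank) across the rankings it appears in, with k = 60 the commonly cited smoothing constant.

```python
# Reciprocal Rank Fusion: merge a BM25 ranking and a vector ranking
# (each a list of doc ids, best first) into one fused ordering.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```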
Parameter Tuning
- `temperature`: 0 = deterministic; 1 = creative
- `top_p`: nucleus sampling — limits vocabulary breadth
- `top_k`: limits token candidates per step
- `maxTokens`: hard cap on output length (cost + latency)
- Stop sequences: terminate output at defined strings
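These parameters map onto the Converse API's `inferenceConfig` block, sketched below with example values, not recommendations; note that `top_k` is model-specific and is typically passed via `additionalModelRequestFields` rather than this block.

```python
# Illustrative inferenceConfig for the Bedrock Converse API.
def build_inference_config(deterministic=False):
    return {
        "temperature": 0.0 if deterministic else 0.7,  # 0 = deterministic
        "topP": 0.9,                      # nucleus sampling cutoff
        "maxTokens": 512,                 # hard cap on output length
        "stopSequences": ["</answer>"],   # terminate output at this string
    }

# Usage (requires boto3 and AWS credentials):
# import boto3
# rt = boto3.client("bedrock-runtime")
# rt.converse(modelId="anthropic.claude-3-haiku-20240307-v1:0",
#             messages=[{"role": "user", "content": [{"text": "Hi"}]}],
#             inferenceConfig=build_inference_config(deterministic=True))
```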
Parallel Processing & Streaming
Parallel Requests Pattern
- Fan-out: send multiple Bedrock calls simultaneously via async Python/Lambda
- Use `asyncio.gather()` or Step Functions parallel branches
- Combine results after all complete (or first-finished wins)
- Good for: multi-section document generation, multi-query retrieval
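The fan-out pattern in minimal form; the `invoke` coroutine below stands in for an async Bedrock call (e.g. via aioboto3 or `run_in_executor` around a boto3 client).

```python
import asyncio

async def invoke(prompt):
    await asyncio.sleep(0)               # placeholder for network I/O
    return f"response to: {prompt}"      # stand-in for a model response

async def fan_out(prompts):
    # All calls run concurrently; gather returns results in input order.
    return await asyncio.gather(*(invoke(p) for p in prompts))

results = asyncio.run(fan_out(["summary", "risks", "pricing"]))
```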
Response Streaming
- Use `InvokeModelWithResponseStream` or `ConverseStream`
- First token appears in ~300ms vs. ~5s for full non-streaming response
- Required for real-time chat feel in web apps
- AppSync subscriptions or WebSocket API Gateway for browser push
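Consuming a stream can be sketched against the `contentBlockDelta` event shape; the fake stream below stands in for the event iterator a real ConverseStream response exposes (under the response's `stream` key), so the pattern runs without an AWS call.

```python
# Accumulate streamed text deltas; a real UI would flush each chunk
# to the user as it arrives instead of collecting them.
def collect_stream(stream):
    parts = []
    for event in stream:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            parts.append(delta["text"])
    return "".join(parts)

# Fake events mirroring ConverseStream's chunk shape (an assumption to verify).
fake_stream = [
    {"contentBlockDelta": {"delta": {"text": "Hel"}}},
    {"contentBlockDelta": {"delta": {"text": "lo"}}},
    {"messageStop": {"stopReason": "end_turn"}},
]
```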
Caching Strategies Domain 4.1–4.2
Caching reduces both cost (fewer model invocations) and latency. The table below covers the distinct caching layers — know when to use each.
Caching Layers
| Cache Layer | What's Cached | AWS Service | Invalidation |
|---|---|---|---|
| Prompt Cache | Repeated token prefix (system prompt) | Bedrock native prompt caching | Automatic on prefix change |
| Semantic Cache | Full responses for semantically similar queries | ElastiCache (Redis) + embedding similarity check | TTL-based or explicit flush |
| Embedding Cache | Pre-computed document embeddings | ElastiCache / DynamoDB | On document update |
| Result Cache | Exact-match query responses | DynamoDB / ElastiCache | TTL or version-key based |
| Edge Cache | Static/semi-static AI-generated content | CloudFront | Cache invalidation API |
Semantic Caching Pattern
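A minimal semantic-cache lookup, assuming query embeddings are already computed; a real system would use an embedding model (e.g. Titan Embeddings) and a Redis vector index instead of a Python list, but the cosine-threshold logic is the same.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

CACHE = []  # list of (query_embedding, cached_response) pairs

def semantic_lookup(query_embedding, threshold=0.9):
    # Find the most similar cached query; return its response on a hit.
    best = max(CACHE, key=lambda e: cosine(query_embedding, e[0]), default=None)
    if best and cosine(query_embedding, best[0]) >= threshold:
        return best[1]        # cache hit: skip the model call entirely
    return None               # cache miss: invoke the model, then store the result
```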
DynamoDB + ElastiCache Decision
| Need | Use | Why |
|---|---|---|
| Microsecond exact-match lookups at scale | ElastiCache (Redis) | In-memory; sub-millisecond reads |
| Persistent cache that survives restart | DynamoDB (TTL attribute) | Durable; serverless; auto-TTL expiry |
| Session state for multi-turn chat | DynamoDB (session_id key) | Serverless; scales with users automatically |
| Embedding cache with vector search | ElastiCache for Redis (RediSearch) | Built-in vector similarity search |
| Distributed rate limiting | ElastiCache (Redis atomic ops) | INCR + EXPIRE pattern for token bucket |
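The INCR + EXPIRE pattern in the last row can be sketched as a fixed-window limiter (a simpler cousin of a full token bucket); a tiny in-memory stand-in replaces the real `redis.Redis` client here so the logic is runnable.

```python
import time

class FakeRedis:
    """In-memory stand-in for redis.Redis, just enough for this sketch."""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, seconds):
        pass  # no-op here; real Redis would set the window's TTL

def allow_request(r, user_id, limit=5, window_s=60):
    key = f"rate:{user_id}:{int(time.time() // window_s)}"  # one key per window
    count = r.incr(key)          # atomic in real Redis, so safe under concurrency
    if count == 1:
        r.expire(key, window_s)  # start the TTL on the window's first request
    return count <= limit
```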
Exam Traps — Domain 2 & 4
These are the most common wrong-answer choices. Read each carefully before exam day.
Domain 2 Traps
| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| MCP server hosting | AWS Lambda (no persistent connection) | ECS Fargate (persistent SSE connections) |
| Deterministic multi-step workflow | Bedrock Agents | AWS Step Functions |
| Bedrock Flows purpose | "Autonomous reasoning agent" | No-code visual prompt chain (not autonomous) |
| Batch inference API | StartAsyncInvoke (video only) | CreateModelInvocationJob |
| Agent memory across sessions | "Built-in Bedrock Agents" | AgentCore Memory module |
| Fine-tuned model in production | On-demand access | Provisioned Throughput (required for prod custom models) |
| Cross-region failover | Route 53 health checks | Bedrock Inference Profiles |
| Model distillation input | Prompt-response pairs | Prompts only (Bedrock synthesizes responses) |
Domain 4 Traps
| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| OpenSearch latency with least ops | Add ElastiCache in front | Enable OpenSearch Auto-Tune (built-in, zero new service) |
| Reduce costs for repeated prompts | Smaller model only | Prompt caching (same prefix → 90% token discount) |
| Real-time chat feel | Full synchronous response | Streaming (ConverseStream / InvokeModelWithResponseStream) |
| Consistent high traffic | On-demand Bedrock | Provisioned Throughput |
| Unpredictable traffic | Provisioned Throughput | On-demand or inference profiles for burst |
| Cost tagging GenAI workloads | CloudTrail only | AWS Cost Explorer + resource tags |
Decision Trees
Use these frameworks to quickly pick the right answer on scenario-based questions.
Which Agent Framework?
- Managed infrastructure, standard tool calling, minimal ops → Bedrock Agents
- Custom agent loop, MCP integration, multi-provider orchestration → Strands SDK
- Deterministic workflow with audit trail, no autonomous reasoning → Step Functions
Which Deployment Mode?
- User waiting in real time → On-demand (spiky traffic) or Provisioned Throughput (consistent high volume)
- Results needed in hours, large volume, no one waiting → batch: `CreateModelInvocationJob` (Bedrock) or SageMaker Batch Transform.