Agentic AI & Tool Integrations (Domain 2.1)

AI agents are autonomous systems that perceive, reason, and act using tools. This is one of the highest-weight Domain 2 topics — master the architecture, tool design, and AWS service mapping.

An AI agent combines a foundation model (the "brain") with a set of tools and an execution loop. It moves beyond single-turn Q&A to multi-step autonomous task completion: it reasons about which action to take, calls a tool, observes the result, and decides whether to act again or respond.

ReAct Loop (Reason → Act → Observe)
  • Thought: LLM reasons about what action to take next
  • Action: Calls a specific tool with parameters
  • Observation: Receives and processes the tool's result
  • Repeat until task is complete
  • Final response returned to user
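The loop above can be sketched in a few lines of Python. This is an illustrative skeleton, not Bedrock's internal implementation; `call_model`, the `TOOLS` registry, and the message format are stand-ins for a real model invocation and your own tool code.

```python
import json

# Hypothetical tool registry: tool name -> callable
TOOLS = {"get_weather": lambda city: {"city": city, "temp_c": 21}}

def react_loop(call_model, user_msg, max_iterations=5):
    """Minimal ReAct loop: on each turn the model either requests a
    tool call (an Action) or returns a final answer."""
    history = [{"role": "user", "content": user_msg}]
    for _ in range(max_iterations):          # guard against infinite loops
        step = call_model(history)           # Thought + decision
        if step["type"] == "final":
            return step["answer"]
        tool = TOOLS[step["tool"]]           # Action
        observation = tool(**step["args"])   # Observe
        history.append({"role": "tool", "content": json.dumps(observation)})
    return "Max iterations reached"
```

Note the `max_iterations` cap: it is the same safeguard as `maxIterations` in a Bedrock Agents configuration.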
Bedrock Agents Architecture
  • Fully managed agent orchestration on AWS
  • Connects to Action Groups (Lambda-backed tools)
  • Connects to Knowledge Bases for RAG context
  • Supports Guardrails for safe execution
  • Built-in memory (session context up to 8h with AgentCore)
  • Trace mode for step-by-step debugging
Strands SDK (Open-Source)
  • AWS open-source agent framework
  • Full developer control vs. managed Bedrock Agents
  • Supports MCP (Model Context Protocol) natively
  • Modular: swap prompting strategy, memory, tools independently
  • Built-in testing & evaluation hooks
  • Choose when you need customization or self-hosted deployment
AgentCore Components
  • Runtime: Session isolation up to 8 hours
  • Policy: Natural language → Cedar policies
  • Memory: Cross-session learning & persistence
  • Evaluations: 13 built-in quality evaluators

Tool / Function Calling — Schema Design

Tools are the "hands" of an agent. Each tool needs a well-defined schema so the LLM knows when and how to call it.

# Bedrock Agents Action Group — Lambda Tool Schema example

```json
{
  "name": "query_customer_database",
  "description": "Retrieve customer order history by customer ID. Use when the user asks about past orders, delivery status, or purchase records.",
  "parameters": {
    "customer_id": {
      "type": "string",
      "description": "Unique customer identifier (format: CUST-XXXXX)",
      "required": true
    },
    "date_range_days": {
      "type": "integer",
      "description": "Number of days to look back (default: 90)",
      "required": false
    }
  }
}
```
Key rule: The description field is what the LLM reads to decide when to call a tool. A vague description causes wrong tool selection. A precise description with "use when…" language dramatically improves accuracy.
| Tool Design Principle | Good Practice | Common Mistake |
|---|---|---|
| Tool granularity | One tool per atomic capability | One mega-tool that does everything |
| Parameter names | Self-documenting (customer_id, not id) | Generic names (param1, data) |
| Error handling | Return structured error objects the LLM can interpret | Throw raw exceptions that confuse the agent |
| Output format | Consistent JSON schema every time | Variable format depending on result |
| Idempotency | Safe to retry without side effects | Tools that charge a card or send an email on every call |
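Several of these principles show up in a single handler. The sketch below (hypothetical function and field names, matching the schema example above) validates input before execution and returns a structured error object instead of raising.

```python
import re

# Format from the tool schema: CUST- followed by five digits
CUSTOMER_ID_PATTERN = re.compile(r"^CUST-\d{5}$")

def query_customer_database(event, context=None):
    """Lambda-style tool handler. Validates the (possibly hallucinated)
    customer_id before touching any backend, and always returns the
    same JSON shape so the agent can interpret failures."""
    customer_id = event.get("customer_id", "")
    if not CUSTOMER_ID_PATTERN.match(customer_id):
        # Structured error the LLM can reason about -- not a raw exception
        return {"success": False,
                "error": "invalid_customer_id",
                "hint": "customer_id must match CUST-XXXXX (5 digits)"}
    # A real implementation would query the order database here
    return {"success": True, "customer_id": customer_id, "orders": []}
```

The `hint` field gives the agent enough context to re-ask the user or retry with a corrected parameter.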

Agent Error Handling & Multi-Step Patterns

Error Handling Patterns
  • Tool timeout: Lambda 15-min max; return partial result + error flag
  • Tool returns empty: Agent should re-reason, not loop indefinitely
  • Hallucinated parameters: Validate inputs in Lambda before execution
  • Infinite loops: Set maxIterations in Bedrock Agents config
  • Step Functions circuit breaker: Use for deterministic multi-step workflows
Sequential vs. Parallel Tool Calls
  • Sequential: Each tool result feeds the next (data dependencies)
  • Parallel: Independent data fetches that are merged later
  • Bedrock Agents supports parallel tool calls natively
  • Use parallel when: search + profile lookup + pricing query all needed
  • Use sequential when: step B requires step A's output

Use Bedrock Agents When

  • You want managed infrastructure & zero agent loop code
  • You need built-in KB integration + Guardrails
  • Minimal ops overhead is the top priority
  • Standard tool-calling patterns are sufficient

Use Strands SDK When

  • You need custom prompting strategy or memory logic
  • Multi-model agent orchestration across providers
  • MCP (Model Context Protocol) server integration needed
  • Full control over agent loop is required

Use Step Functions (NOT Agents) When

  • Workflow must be deterministic with audit trail
  • Compliance requires guaranteed execution order
  • No autonomous reasoning needed — just orchestration

MCP on ECS Fargate When

  • MCP server needs persistent SSE connections
  • Lambda is a trap here — no persistent connections
  • Long-lived tool connections (databases, streaming)

⚠ Agentic AI Exam Traps

  • Lambda for MCP servers: Wrong — Lambda can't maintain persistent SSE connections. Use ECS Fargate.
  • Bedrock Flows vs. Agents: Flows = no-code visual prompt chains (not autonomous). Agents = autonomous reasoning loops.
  • Step Functions vs. Agents: Step Functions when deterministic; Agents when adaptive reasoning needed.
  • AgentCore Runtime session limit: 8 hours max — not unlimited, not 24 hours.
  • Strands requires self-hosting: More ops overhead than Bedrock Agents — not serverless by default.

Model Deployment Strategies (Domain 2.2)

How you deploy a model determines your latency, cost, and scalability profile. Know each option's tradeoffs cold.

| Strategy | Service / API | When to Use | Key Trade-off |
|---|---|---|---|
| On-Demand (serverless) | Bedrock InvokeModel / Lambda | Variable/unpredictable traffic, event-driven, cost-sensitive | Cold start latency; pay per token |
| Provisioned Throughput | Bedrock Provisioned Throughput | Consistent high-volume traffic, SLA-bound latency | Reserved capacity cost even when idle; commit required |
| Hybrid | SageMaker AI Endpoints | Variable workloads needing auto-scale with baseline | More infra to manage; most flexible |

Bedrock Provisioned Throughput Deep Dive

What It Provides
  • Reserved model capacity in Model Units (MUs)
  • Consistent low-latency regardless of AWS load
  • Required for: custom fine-tuned models, no-expiry commitments
  • Commitment: no-commitment (hourly), 1-month, or 6-month terms (longer commitments cost less per MU)
When NOT to Use It
  • Unpredictable or spiky traffic patterns
  • Development or testing workloads
  • Traffic < 70% utilization of reserved capacity
  • Short-lived experiments (pays for idle capacity)
Exam rule: Provisioned Throughput is correct when the question says "consistent high-volume", "guaranteed latency SLA", or "fine-tuned custom model in production." It is a distractor when traffic is described as "unpredictable" or having "long idle periods."

Model Optimization Techniques

| Technique | What It Does | Use When | Trade-off |
|---|---|---|---|
| Quantization | Reduces model weight precision (FP32→FP16→INT8) | Reduce memory footprint & inference cost; edge deployment | Slight accuracy loss; INT8 saves ~75% memory vs FP32 |
| Knowledge Distillation | Trains small "student" model to mimic large "teacher" | Need small, fast model with similar capability | Training cost up front; student has lower ceiling |
| Model Pruning | Removes low-importance weights | Further reduce model size after training | Complex; can degrade quality if over-pruned |
| Speculative Decoding | Small draft model generates tokens; large model verifies | Reduce latency for large model inference | Requires two models; draft model accuracy matters |
Distillation on Bedrock: When using Bedrock Model Distillation, provide only prompts (not prompt-response pairs) — Bedrock uses the teacher model to synthesize the responses. Student model must be smaller than teacher (e.g., Nova Lite ← Nova Pro).

Inference Endpoint Types (SageMaker)

| Endpoint Type | Best For | Key Feature |
|---|---|---|
| Real-Time Endpoint | Low-latency synchronous requests | Always-on, instant response |
| Serverless Endpoint | Infrequent, unpredictable traffic | Auto-scale to zero; cold start latency |
| Async Endpoint | Large payloads, long processing (up to 1 hour) | Non-blocking; result via S3 + SNS |
| Batch Transform | Offline bulk inference (no real-time needed) | Process full dataset; results to S3 |
| Multi-Model Endpoint | Many models, low per-model traffic | Dynamic model loading; cost efficient |

⚠ Batch Inference Trap

  • Bedrock batch: CreateModelInvocationJob — for general text/image batch workloads
  • StartAsyncInvoke: Nova Reel video generation ONLY — not general batch
  • SageMaker Batch Transform = offline, S3-to-S3, no endpoint required

Containerization & Safeguarding

Container Benefits for GenAI
  • Reproducible environment (model + dependencies pinned)
  • Portable across dev → staging → prod
  • ECS / EKS for scalable deployment
  • Amazon ECR for private container registry
  • Lambda container images for serverless (up to 10GB)
Safeguarding Workflows
  • Step Functions: Prevent infinite workflow loops; timeout states
  • Lambda timeouts: Control max run time (up to 15 min)
  • IAM policies: Resource boundaries prevent unauthorized access
  • ECS circuit breaker: Auto rollback on deployment failures
  • Bedrock Guardrails: Input/output safety at model layer

Enterprise Integration Architectures (Domain 2.3)

Real-world GenAI solutions connect to existing enterprise systems. Know the integration patterns, async workflows, and identity federation approaches.

| Pattern | Components | Use Case |
|---|---|---|
| Synchronous API | API Gateway → Lambda → Bedrock | Chatbots, real-time Q&A, low-latency responses (<30s) |
| Async / Queue-backed | SQS → Lambda → Bedrock → SNS | Batch document processing, report generation, email summarization |
| Event-driven | S3 Event → EventBridge → Lambda → Bedrock | Auto-process uploads (PDFs, images) as they arrive |
| Streaming | API Gateway WebSocket / AppSync → Bedrock streaming | Real-time token streaming for chat UX |
| Workflow orchestration | Step Functions → multiple Lambda → Bedrock | Multi-step document pipelines with retry/error handling |

Identity & Access Patterns

Cognito OIDC Pattern
  • User authenticates via Cognito User Pool
  • Cognito issues JWT → exchanged for temporary AWS credentials
  • Credentials scoped to user's IAM role
  • Use for: web/mobile apps calling Bedrock directly
IAM Identity Center (SSO)
  • Federated access for enterprise employees
  • Maps corporate IdP (Okta, Azure AD) → AWS roles
  • Use for: internal tools, developer access to Bedrock
  • Supports permission sets across multiple accounts
Cross-Account Access
  • Resource-based policies on Bedrock knowledge bases
  • IAM role assumption from trusted accounts
  • Use for: centralized AI account serving multiple business units
VPC Endpoints for Bedrock
  • PrivateLink endpoint keeps traffic off public internet
  • Required for compliance (HIPAA, FedRAMP)
  • Endpoint type: Interface endpoint for bedrock-runtime
  • Also use for S3, DynamoDB (Gateway endpoints — free)

SQS Buffer Pattern for Knowledge Base Sync

1. S3 Event Notification fires when a document is uploaded. Triggers a message to the SQS queue immediately.
2. SQS buffers the event. Provides resilience if KB sync is temporarily unavailable; batch processing up to 10 messages at once.
3. Lambda polls SQS and calls IngestKnowledgeBaseDocuments or DeleteKnowledgeBaseDocuments. Triggers incremental sync — only the changed document is processed, not the full KB.
4. Knowledge Base logs the ingestion result to CloudWatch Logs. Check for RESOURCE_IGNORED, EMBEDDING_FAILED, or INDEXING_FAILED error codes to diagnose issues.
Exam distinction: StartIngestionJob = full sync. IngestKnowledgeBaseDocuments = incremental/document-level sync. The SQS pattern uses the incremental API for efficiency and resilience.
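A minimal sketch of the Lambda side of this pattern. The `ingest_knowledge_base_documents` call and its request shape are assumptions based on the bedrock-agent API; verify the exact parameters in the current SDK reference before relying on them. Only the pure event parsing is exercised here.

```python
import json

def parse_s3_records(sqs_record_body):
    """Extract (bucket, key, event_name) tuples from an S3 event
    notification that was delivered via SQS."""
    event = json.loads(sqs_record_body)
    return [(r["s3"]["bucket"]["name"],
             r["s3"]["object"]["key"],
             r["eventName"]) for r in event.get("Records", [])]

def handler(event, context=None):
    """Lambda entry point. The bedrock-agent request below is a sketch;
    treat the document shape as an assumption to confirm against the
    current IngestKnowledgeBaseDocuments API reference."""
    import boto3
    client = boto3.client("bedrock-agent")
    for record in event["Records"]:
        for bucket, key, name in parse_s3_records(record["body"]):
            if name.startswith("ObjectCreated"):
                client.ingest_knowledge_base_documents(
                    knowledgeBaseId="KB_ID",       # placeholder IDs
                    dataSourceId="DS_ID",
                    documents=[{"content": {
                        "dataSourceType": "S3",
                        "s3": {"s3Location": {"uri": f"s3://{bucket}/{key}"}}}}])
            # ObjectRemoved events would route to
            # delete_knowledge_base_documents instead.
```

Keeping the parser separate from the boto3 call makes the incremental-sync logic unit-testable without AWS credentials.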

Foundation Model API Patterns (Domain 2.4)

Know every Bedrock API by name, purpose, and when it is the right choice vs. a distractor.

| API | Purpose | Key Detail |
|---|---|---|
| InvokeModel | Provider-specific single-turn call | Request/response body is model-specific JSON; synchronous |
| InvokeModelWithResponseStream | Streaming token response | Use for chat UX; returns chunked stream |
| Converse | Unified multi-turn chat — provider agnostic | Standardized message format; preferred for multi-model apps |
| ConverseStream | Streaming version of Converse | Same as Converse but chunks token output |
| RetrieveAndGenerate | RAG in one managed call | Bedrock handles retrieval + generation; citations included |
| Retrieve | KB retrieval only (no generation) | Use when you want to handle generation yourself |
| CreateModelInvocationJob | Async batch inference | Large-scale text/image batch; results to S3 |
| StartAsyncInvoke | Nova Reel video generation async | Video ONLY — NOT general batch inference |
| ApplyGuardrail | Run guardrail independently of model call | Useful for testing guardrail responses in isolation |

Inference Profiles & Cross-Region Routing

Inference Profiles
  • Route requests across multiple AWS regions automatically
  • Best option for: automatic failover + traffic balancing
  • System-defined: AWS manages routing logic
  • Cross-region: you define which regions are eligible
  • Preferred over manual Route 53 routing for model calls
On-Demand vs. Provisioned
  • On-demand: Pay per token, scales instantly, no commitment
  • Provisioned: Reserved MUs, consistent latency, 1–6 month commit
  • Fine-tuned models must use Provisioned Throughput for prod
  • On-demand fine-tuned: available but at on-demand rate (no SLA)
Exam trap: "Best Region for inference with managed failover" → answer is Inference Profiles, not Route 53 health checks or Lambda with retry logic.

Prompt Caching

1. First request: cache miss. The full prompt (system prompt + user message) is sent; Bedrock processes all tokens and creates a cache checkpoint for the prefix.
2. Subsequent requests: cache hit. The same prefix (system prompt) is detected. Only the new tokens (user message) are processed at full cost; the cached prefix is billed at a reduced rate (~90% discount).
3. Cache metadata returned. The response includes cacheWriteInputTokens and cacheReadInputTokens for cost tracking.
When to use: Repeated long system prompts (>1,000 tokens) reused across many requests. NOT effective for single-use or highly variable prompts. Prefix must be exact byte-for-byte match to trigger cache hit.
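A sketch of how a request might mark the system prompt as a cacheable prefix, assuming the Converse API's `cachePoint` content block (available only on models with prompt caching enabled; verify the field shape against the current docs).

```python
def build_cached_request(system_prompt, user_message):
    """Build a Converse-API request body whose long system prompt is
    marked as a cache checkpoint. Everything before the cachePoint
    block must be byte-for-byte identical across requests to hit."""
    return {
        "system": [
            {"text": system_prompt},
            {"cachePoint": {"type": "default"}},  # cache everything above
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_message}]},
        ],
    }
```

Usage would look like `bedrock.converse(modelId=..., **build_cached_request(long_prompt, question))`, with only the user message varying per call.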

Cost Optimization & Resource Efficiency (Domain 4.1)

GenAI costs are token-driven. Master token efficiency, tiered model routing, and caching to dramatically reduce spend.

What Costs Tokens
  • Input tokens: System prompt + conversation history + user message + retrieved context
  • Output tokens: Model-generated response (typically 3–5× more expensive per token than input)
  • Context tokens: Accumulated history in multi-turn conversations
  • Different models have different tokenizers (same text ≠ same token count)
Token Reduction Techniques
  • Prompt compression: Summarize conversation history instead of passing full context
  • Context window optimization: Truncate old messages using sliding window
  • Response limiting: Set maxTokens parameter; use stop sequences
  • Shorter system prompts: Remove redundant instructions
  • Retrieval precision: Retrieve fewer, better chunks (reduce RAG context size)
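The sliding-window technique can be sketched as a small helper. Character counts stand in for tokens here (an assumption for simplicity; swap in the model's real tokenizer, since the same text tokenizes differently per model).

```python
def sliding_window(messages, max_chars=4000, keep_first=1):
    """Keep the first `keep_first` messages (e.g. task framing) plus as
    many of the most recent messages as fit in the budget, dropping the
    middle of the conversation first."""
    head = messages[:keep_first]
    budget = max_chars - sum(len(m["content"]) for m in head)
    tail = []
    for m in reversed(messages[keep_first:]):   # newest first
        if len(m["content"]) > budget:
            break                               # window is full
        tail.append(m)
        budget -= len(m["content"])
    return head + list(reversed(tail))          # restore chronological order
```

A variant of the same idea replaces the dropped middle with a model-generated summary (prompt compression) rather than discarding it outright.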

Tiered Model Routing Strategy

Not all requests need a powerful (expensive) model. Route by query complexity:

| Query Type | Recommended Model Tier | Why |
|---|---|---|
| Simple FAQ, classification, yes/no | Haiku / Nova Micro / Lite | Cheap, fast, accurate enough |
| Multi-step reasoning, analysis | Sonnet / Nova Pro | Good cost-quality balance |
| Complex code gen, research synthesis | Opus / Nova Premier | Maximum capability needed |
| Repeated identical tasks at scale | Batch inference (any tier) | 50% cost reduction vs. real-time |
Model Cascading: Try cheap model first; escalate to expensive model only if confidence is below threshold. Implement with a confidence score check in Lambda after first model call.
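The cascading pattern, sketched with stub callables: how you derive the confidence score (token logprobs, a judge model, a keyword heuristic) is an implementation choice outside this sketch.

```python
def cascade(query, cheap_model, strong_model, threshold=0.8):
    """Try the cheap model first; escalate to the expensive model only
    when the cheap model's confidence is below threshold. Both models
    are callables returning (answer, confidence) -- stand-ins for two
    Bedrock invocations plus a confidence check in Lambda."""
    answer, confidence = cheap_model(query)
    if confidence >= threshold:
        return answer, "cheap"          # most traffic stops here
    answer, _ = strong_model(query)
    return answer, "strong"             # escalation path
```

If most queries clear the threshold, the blended per-request cost approaches the cheap tier's price while hard queries still get full capability.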

Batch Inference for Cost Reduction

Use Batch Inference When

  • Results needed in hours, not seconds
  • Large volume: 100s–millions of records
  • No interactive user waiting for response
  • Cost reduction of ~50% vs. on-demand
  • Report generation, document summarization, embeddings refresh

Do NOT Use Batch When

  • User is waiting for real-time response
  • Latency SLA < 5 seconds
  • Interactive chat or streaming needed
  • Small number of records (<100) — overhead not worth it

Capacity Planning & Auto-Scaling

| Scenario | Strategy | AWS Service |
|---|---|---|
| Steady 24/7 high volume | Provisioned Throughput (committed) | Bedrock Provisioned Throughput |
| 10× event-based spikes | Cross-Region Inference Profile | Bedrock Inference Profiles |
| Dev/test workloads | On-demand only; no reservation | Bedrock On-Demand |
| SageMaker endpoint scaling | Auto Scaling policy on endpoint | Application Auto Scaling |
| Cost monitoring & alerts | Budget alerts + Cost Explorer tags | AWS Budgets + Cost Explorer |
Tagging for cost allocation: Tag all Bedrock / SageMaker resources with Project, CostCenter, Environment tags. Use AWS Cost Explorer to break down GenAI spend by tag.

Performance Optimization (Domain 4.2)

Latency in GenAI comes from retrieval, model inference, and network. Address each layer systematically.

| Latency Source | Optimization Technique | Expected Impact |
|---|---|---|
| Model inference time | Use smaller/faster model tier; quantization; Provisioned Throughput | 20–80% reduction |
| Token generation speed | Streaming response (InvokeModelWithResponseStream); speculative decoding | Perceived latency drops for users |
| Retrieval time | HNSW vector index (ANN); pre-filter metadata; index warm-up | ms-level retrieval |
| Pre-computed responses | Cache answers for predictable queries (FAQ, product descriptions) | Near-zero latency for cached hits |
| Cold start (Lambda) | Provisioned concurrency; keep-warm pings; container reuse | Eliminates cold start |
| Prompt processing | Prompt caching for repeated prefixes; shorter system prompts | Reduces input token cost & time |

Vector Index Optimization

Index Types
  • HNSW (Hierarchical NSW): Best for high-QPS, ANN search; OpenSearch default
  • IVF (Inverted File): Good for large datasets; slower but memory-efficient
  • Flat/Exact: 100% accurate but O(n) — only for small datasets
  • Hybrid (BM25 + HNSW): OpenSearch hybrid search for best relevance
OpenSearch Auto-Tune
  • Built-in feature — no new service needed
  • Automatically optimizes: JVM heap, shard allocation, cache settings
  • Least operational overhead for latency issues
  • Exam trap: ElastiCache in front of OpenSearch adds complexity & management — wrong answer for "least ops overhead"
Query Optimization
  • Metadata pre-filtering (filter by domain/date before vector search)
  • Query expansion: synonym injection, HyDE (Hypothetical Document Embeddings)
  • Reranker model: cross-encoder re-scores top-k after retrieval
  • RRF (Reciprocal Rank Fusion): merge BM25 + semantic scores
Parameter Tuning
  • temperature: 0 = deterministic; 1 = creative
  • top_p: nucleus sampling — limits vocabulary breadth
  • top_k: limits token candidates per step
  • maxTokens: hard cap on output length (cost + latency)
  • Stop sequences: terminate output at defined strings
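A sketch mapping these knobs onto a Converse-API request. Note that top_k is not part of `inferenceConfig`; for some providers (e.g. Anthropic) it is commonly passed via `additionalModelRequestFields`. Treat the exact field names as assumptions to verify per model.

```python
def build_inference_config(deterministic=True):
    """Assemble Converse-API tuning parameters. temperature=0 gives
    (near-)deterministic output; maxTokens hard-caps output length,
    bounding both cost and latency."""
    return {
        "inferenceConfig": {
            "temperature": 0.0 if deterministic else 0.7,
            "topP": 0.9,                      # nucleus sampling breadth
            "maxTokens": 512,                 # output cap: cost + latency
            "stopSequences": ["\n\nHuman:"],  # terminate at this string
        },
        # top_k is model-specific, not a standard Converse field
        "additionalModelRequestFields": {"top_k": 50},
    }
```

These kwargs would be spread into a `bedrock.converse(...)` call alongside `modelId` and `messages`.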

Parallel Processing & Streaming

Parallel Requests Pattern
  • Fan-out: send multiple Bedrock calls simultaneously via async Python/Lambda
  • Use asyncio.gather() or Step Functions parallel branches
  • Combine results after all complete (or first-finished wins)
  • Good for: multi-section document generation, multi-query retrieval
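The fan-out pattern above can be sketched with `asyncio.gather`; here `call_model` is any async callable, e.g. a wrapper around an async Bedrock client such as aioboto3 (an assumption, not shown).

```python
import asyncio

async def fan_out(queries, call_model, timeout=30):
    """Fire independent model calls concurrently and collect results in
    input order. A timeout bounds the slowest branch."""
    tasks = [asyncio.create_task(call_model(q)) for q in queries]
    return await asyncio.wait_for(asyncio.gather(*tasks), timeout)
```

Total wall-clock time approaches the slowest single call instead of the sum of all calls, which is exactly why this suits multi-query retrieval and multi-section generation.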
Response Streaming
  • Use InvokeModelWithResponseStream or ConverseStream
  • First token appears in ~300ms vs. ~5s for full non-streaming response
  • Required for real-time chat feel in web apps
  • AppSync subscriptions or WebSocket API Gateway for browser push
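A small helper for draining a streamed response. The `contentBlockDelta` event shape follows the ConverseStream format, but verify it against the current API reference; the helper is shown against plain dicts so it stays testable offline.

```python
def stream_text(events):
    """Yield text deltas from a ConverseStream-style event stream.
    Each event is a dict with one top-level key; text arrives inside
    contentBlockDelta events as {'delta': {'text': ...}}."""
    for event in events:
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        if "text" in delta:
            yield delta["text"]
```

Against a real client this would be used roughly as `for chunk in stream_text(client.converse_stream(...)["stream"]): push_to_websocket(chunk)`.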

Caching Strategies (Domain 4.1–4.2)

Caching reduces both cost (fewer model invocations) and latency. There are five distinct caching layers — know when to use each.

| Cache Layer | What's Cached | AWS Service | Invalidation |
|---|---|---|---|
| Prompt Cache | Repeated token prefix (system prompt) | Bedrock native prompt caching | Automatic on prefix change |
| Semantic Cache | Full responses for semantically similar queries | ElastiCache (Redis) + embedding similarity check | TTL-based or explicit flush |
| Embedding Cache | Pre-computed document embeddings | ElastiCache / DynamoDB | On document update |
| Result Cache | Exact-match query responses | DynamoDB / ElastiCache | TTL or version-key based |
| Edge Cache | Static/semi-static AI-generated content | CloudFront | Cache invalidation API |

Semantic Caching Pattern

1. User query arrives. Generate an embedding for the query using the same embedding model used for the cache.
2. Similarity search in cache. Compare the query embedding against stored embeddings in ElastiCache or Redis with vector search. If cosine similarity > threshold (e.g., 0.95), return the cached response.
3. Cache miss → call model. Send the query to the foundation model; store the (query embedding, response) pair in the cache with a TTL.
4. Return response. Identical for cache hit and miss from the user's perspective — zero model cost on hits.
When semantic caching works best: FAQ-heavy workloads, support chatbots, product Q&A where many users ask similar questions. Cache hit rate of 30–60% typical in production.
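The pattern in a self-contained, in-memory sketch; a production version would replace the linear scan with ElastiCache/Redis vector search and `embed` with a real embedding-model call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Minimal semantic cache: store (embedding, response) pairs and
    serve a hit when a new query embeds close enough to a stored one."""
    def __init__(self, embed, threshold=0.95):
        self.embed = embed            # callable: text -> vector
        self.threshold = threshold
        self.entries = []             # list of (vector, response)

    def get(self, query):
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]            # cache hit: zero model cost
        return None                   # miss: caller invokes the model

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

The threshold is the key tuning knob: too low and users get stale or wrong answers for merely related questions; too high and the hit rate collapses.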

DynamoDB + ElastiCache Decision

| Need | Use | Why |
|---|---|---|
| Microsecond exact-match lookups at scale | ElastiCache (Redis) | In-memory; sub-millisecond reads |
| Persistent cache that survives restart | DynamoDB (TTL attribute) | Durable; serverless; auto-TTL expiry |
| Session state for multi-turn chat | DynamoDB (session_id key) | Serverless; scales with users automatically |
| Embedding cache with vector search | ElastiCache for Redis (RediSearch) | Built-in vector similarity search |
| Distributed rate limiting | ElastiCache (Redis atomic ops) | INCR + EXPIRE pattern for token bucket |
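The INCR + EXPIRE pattern from the last row, sketched as a fixed-window variant against any redis-py-compatible client; a tiny stub stands in for a real server so the logic can be exercised locally.

```python
import time

def allow_request(redis, user_id, limit=10, window_s=60):
    """Fixed-window rate limiter: one counter per (user, window). INCR
    is atomic, so concurrent Lambdas cannot double-count; the first hit
    in a window sets its TTL so stale counters self-expire."""
    key = f"rl:{user_id}:{int(time.time()) // window_s}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window_s)
    return count <= limit

class FakeRedis:
    """Minimal in-memory stub exposing the two calls the limiter uses."""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, ttl):
        pass  # a real server would attach the TTL here
```

A true token bucket additionally refills allowance continuously; the fixed-window form shown here is the simplest production-safe approximation.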

Exam Traps — Domain 2 & 4

These are the most common wrong-answer choices. Read each carefully before exam day.

Domain 2 Traps

| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| MCP server hosting | AWS Lambda (no persistent connection) | ECS Fargate (persistent SSE connections) |
| Deterministic multi-step workflow | Bedrock Agents | AWS Step Functions |
| Bedrock Flows purpose | "Autonomous reasoning agent" | No-code visual prompt chain (not autonomous) |
| Batch inference API | StartAsyncInvoke (video only) | CreateModelInvocationJob |
| Agent memory across sessions | "Built-in Bedrock Agents" | AgentCore Memory module |
| Fine-tuned model in production | On-demand access | Provisioned Throughput (required for prod custom models) |
| Cross-region failover | Route 53 health checks | Bedrock Inference Profiles |
| Model distillation input | Prompt-response pairs | Prompts only (Bedrock synthesizes responses) |

Domain 4 Traps

| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| OpenSearch latency with least ops | Add ElastiCache in front | Enable OpenSearch Auto-Tune (built-in, zero new service) |
| Reduce costs for repeated prompts | Smaller model only | Prompt caching (same prefix → 90% token discount) |
| Real-time chat feel | Full synchronous response | Streaming (ConverseStream / InvokeModelWithResponseStream) |
| Consistent high traffic | On-demand Bedrock | Provisioned Throughput |
| Unpredictable traffic | Provisioned Throughput | On-demand or inference profiles for burst |
| Cost tagging GenAI workloads | CloudTrail only | AWS Cost Explorer + resource tags |

Decision Trees

Use these frameworks to quickly pick the right answer on scenario-based questions.

Which Agent Framework?

  • Is the workflow deterministic with a fixed execution order? YES → Step Functions. STOP.
  • Do you need visual no-code prompt chaining? YES → Bedrock Flows. STOP.
  • Do you need full developer control, custom memory, or MCP integration? YES → Strands SDK. STOP.
  • Otherwise (managed, serverless, standard tool-use, KB integration)? → Bedrock Agents.

Which Deployment Mode?

  • Is traffic consistent and high-volume with SLA requirements? YES → Bedrock Provisioned Throughput.
  • Is it a custom fine-tuned model going to production? YES → Provisioned Throughput (mandatory).
  • Is traffic unpredictable or are there long idle periods? YES → On-demand Bedrock via Lambda.
  • Is cross-region failover needed? YES → Inference Profiles.
  • Is it a large offline batch job? YES → CreateModelInvocationJob (Bedrock) or SageMaker Batch Transform.

Which Caching Strategy?

  • Is the same long system prompt sent on every request? YES → Bedrock Prompt Caching (prefix must be identical).
  • Are many users asking semantically similar questions? YES → Semantic Cache with ElastiCache + embedding similarity.
  • Are exact-match repeated queries the pattern? YES → DynamoDB result cache with TTL.
  • Is static GenAI-generated content served to many users? YES → CloudFront edge caching.