AIP-C01 Deep Study Guide

AWS Certified Generative AI Developer — Professional
Professional-Level Beta Exam · 5 Domains · 75 Questions (65 scored)

AIP-C01 Exam Blueprint & Study Map

AWS Certified Generative AI Developer — Professional · 75 Questions (65 scored) · 130 minutes · ~$300 USD

Target: Developers with 2+ yrs AWS + 1+ yr GenAI hands-on
At a glance: 5 content domains · 65 scored questions · 130 minutes · ~100 Bedrock models available · 750 passing score (scaled)

📋 Domain Weight Distribution & Key Focus Areas

Domain | Approx Weight | Core Tasks | High-Priority Services
1. FM Integration, Data & Compliance | ~30% | FM selection, data pipelines, vector stores, RAG, prompt engineering | Bedrock, OpenSearch, SageMaker, S3, Glue
2. Implementation & Integration | ~25% | Agentic AI, deployment strategies, enterprise integration, API patterns | Bedrock Agents, Step Functions, Lambda, API GW, ECS
3. AI Safety, Security & Governance | ~20% | Guardrails, IAM, compliance, responsible AI, audit | Bedrock Guardrails, IAM, CloudTrail, Security Hub, KMS
4. Operational Efficiency & Optimization | ~15% | Cost optimization, performance tuning, monitoring, caching | CloudWatch, Cost Explorer, Bedrock Prompt Routing, SageMaker
5. Testing, Validation & Troubleshooting | ~10% | Model evaluation, QA frameworks, debugging, regression testing | Bedrock Model Eval, SageMaker Experiments, CloudWatch Logs
🎯 Exam Strategy (given your AIF-C01 background): Focus deep on (1) agentic AI patterns — Bedrock Agents, Strands, MCP, multi-agent orchestration; (2) advanced RAG — chunking strategies, hybrid search, re-ranking; (3) the Nova model family — Pro/Lite/Sonic/Forge use cases; (4) deployment decision trees — when to use Lambda vs. Provisioned Throughput vs. SageMaker endpoints; (5) Guardrails deep configuration.
🗺️ Services You Must Know Cold

🟠 Amazon Bedrock Core
  • InvokeModel / InvokeModelWithResponseStream
  • Converse API (unified multi-model)
  • Knowledge Bases (RAG)
  • Agents + Action Groups
  • Guardrails (content/PII/topic)
  • Prompt Management + Flows
  • Model Evaluation
  • Custom Model Import
  • Provisioned Throughput
  • Cross-Region Inference
  • Intelligent Prompt Routing
🔵 Amazon Nova Family
  • Nova 2 Pro — complex multi-step reasoning
  • Nova 2 Lite — cost-effective, high-volume
  • Nova 2 Sonic — real-time voice/conversation
  • Nova Canvas — image generation
  • Nova Reel — video generation
  • Nova Forge — custom model training from checkpoints
  • Supports system messages, multimodal, streaming
🟢 Agentic Stack
  • Bedrock Agents (managed)
  • Bedrock AgentCore (composable services)
  • Strands Agents SDK (open-source)
  • AWS Agent Squad (multi-agent)
  • MCP (Model Context Protocol)
  • Step Functions (ReAct/CoT workflows)
  • Lambda (stateless MCP servers)
  • ECS (complex MCP servers)
🟣 Vector & RAG Stack
  • Amazon OpenSearch (neural plugin, HNSW)
  • Aurora PostgreSQL + pgvector
  • Amazon S3 Vectors (new — billions of vectors)
  • Bedrock Knowledge Bases
  • Amazon Titan Embeddings (V1/V2/Multimodal)
  • Amazon Kendra (keyword + semantic hybrid)
  • Bedrock Data Automation
🔴 Security & Governance
  • Bedrock Guardrails (content filter, PII, topic)
  • IAM + resource-based policies
  • AWS KMS (data encryption)
  • AWS CloudTrail (audit)
  • AWS Security Hub (near-real-time risk)
  • Amazon Macie (PII in S3)
  • VPC Endpoints (PrivateLink)
  • SageMaker Model Cards
🩵 Deployment & Ops
  • SageMaker Real-time, Serverless, Async endpoints
  • SageMaker Multi-model / Multi-container endpoints
  • SageMaker Inference Components
  • EC2 UltraServers (large model inference)
  • DeepSpeed / Triton model parallelism
  • Nova Forge (custom training via SageMaker)
  • CloudWatch (metrics, alarms, dashboards)
  • AWS X-Ray (distributed tracing)
⚡ Critical "Know the Difference" Decision Points

Decision | Option A | Option B | Choose When
Bedrock vs SageMaker | Bedrock — managed, pay-per-token | SageMaker — custom containers, full control | Bedrock for standard FMs; SageMaker for custom/open-source models or complex inference pipelines
On-demand vs Provisioned Throughput | On-demand — variable traffic, pay-per-use | Provisioned — predictable, dedicated capacity | Provisioned for steady high-volume; on-demand for spiky/low traffic
RAG vs Fine-tuning | RAG — dynamic, updatable knowledge | Fine-tuning — baked-in domain knowledge | RAG when data changes frequently; fine-tuning for style/tone/format adaptation
Lambda vs ECS for MCP | Lambda — stateless, lightweight tools | ECS — stateful, complex compute tools | Lambda for simple tool calls; ECS for code execution, image processing
OpenSearch vs pgvector vs S3 Vectors | OpenSearch — full-text + vector hybrid | pgvector — relational + vector in RDS | S3 Vectors for billions of vectors, cost-optimized; pgvector when you need SQL joins; OpenSearch for hybrid keyword + semantic
Fixed vs Semantic Chunking | Fixed-size — simple, predictable | Semantic — content-aware boundaries | Fixed for uniform content; semantic for varied documents; hierarchical for structured docs
Nova Pro vs Lite vs Sonic | Pro — complex reasoning tasks | Lite — high-volume, cost-efficient | Sonic for real-time voice/conversation; Lite for batch/simple; Pro for analysis/reasoning
Bedrock Agents vs Strands vs Step Functions | Bedrock Agents — fully managed, conversational | Strands — open-source, custom control | Step Functions for deterministic workflows with branching; Bedrock Agents/Strands for autonomous LLM-driven action selection

Domain 1: Foundation Model Integration, Data Management & Compliance

Tasks 1.1–1.6 · Covers FM selection, data pipelines, vector stores, RAG, and prompt engineering

~30% of Exam Weight
1.1 Analyze Requirements & Design GenAI Solutions

πŸ—οΈ Architectural Patterns for GenAI Solutions

Three primary integration approaches based on control/expertise trade-offs:

Amazon Bedrock (Unified API)
  • Fully managed, no infrastructure
  • Pay-per-use (tokens)
  • Quick time-to-market
  • ~100 models via single API
  • Best for: standard FMs, rapid prototyping
SageMaker (Custom/Control)
  • Bring your own model/container
  • Fine-grained instance control
  • Supports open-source models
  • GPU selection (g5, p4d families)
  • Best for: custom models, complex inference
AWS AI Factories (On-Prem)
  • AWS-managed infra in your DC
  • Cloud-like AI in own environment
  • For data sovereignty requirements
  • Also: AWS Outposts for hybrid
🎯 Exam Tip: Questions about "which service to use" almost always hinge on: (1) level of control needed, (2) data residency requirements, (3) traffic pattern (steady vs. spiky), and (4) team ML expertise. Memorize these trade-offs.
AWS Well-Architected for GenAI (6 Pillars applied)
Pillar | GenAI-Specific Consideration
Operational Excellence | Automated model retraining, baseline behavior metrics, self-healing capabilities
Security | IAM for model access, VPC endpoints, KMS encryption, Guardrails for output safety
Reliability | Multi-AZ by default, cross-Region inference for HA, circuit breakers, fallback models
Performance Efficiency | Right-sizing (Lambda vs. Provisioned), caching embeddings/responses, batch inference
Cost Optimization | On-demand vs. provisioned throughput, model cascading (cheap → expensive), prompt caching
Sustainability | Model distillation, parameter-efficient fine-tuning (PEFT), smaller specialized models
PoC → Production Transition Framework
  1. Define use case scope with success criteria and ROI metrics
  2. Select FM: benchmark on custom eval set, not just public leaderboards
  3. Build PoC using Bedrock + Lambda + simple front-end
  4. Validate with stakeholders: accuracy, latency, cost per inference
  5. Harden for production: add Guardrails, monitoring, error handling
  6. Phased rollout: pilot → limited release → full production
🧩 Enterprise Adoption Strategy
AI Center of Excellence (CoE)
  • Central governance and best practices
  • Pattern library and code templates
  • Model governance committee
  • Standardized onboarding process
  • Cross-functional team structure
Production Monitoring Framework
  • Technical: inference latency (p50/p95/p99), throughput, error rates
  • Business: cost per inference, user satisfaction, task completion
  • Quality: accuracy, consistency, hallucination rate
  • CloudWatch dashboards + automated alerts
💡 Nova Forge — Cost Optimization for Custom Models: Nova Forge allows continued pre-training from checkpoints (pre/mid/post-training phases), blending proprietary data with Nova-curated data. This significantly reduces cost vs. full retraining and preserves foundational skills. Uses RL with your own reward functions and an orchestrator for multi-turn rollouts. Accessed via Amazon SageMaker AI.
1.2 Select & Configure Foundation Models

📊 FM Evaluation Frameworks & Benchmarks
General Benchmarks
  • MMLU — 57 subjects, knowledge breadth
  • HELM — 42+ models, multidimensional (fairness, bias, toxicity)
  • BIG-bench — 204+ diverse tasks, capability boundaries
  • BIG-Bench Hard — complex multi-step reasoning
  • GLUE/SuperGLUE — language understanding
Task-Specific
  • HumanEval+ / MBPP+ — code generation
  • GSM8K / MATH — mathematical reasoning
  • MT-Bench — multi-turn conversation (GPT-4 as judge)
  • MedPaLM — medical domain
  • FinanceBench — financial analysis + compliance
Multimodal
  • MMMLU — text/image/audio/video
  • MME — fine-grained perception vs. reasoning
  • MMMU — professional multimodal tasks
  • LMSYS Chatbot Arena — human preference (Elo ratings)
🎯 Key Insight: Benchmark scores don't always translate to real-world performance. Always supplement with custom benchmarks built from your actual use case data. The Bedrock Model Evaluation feature lets you run your own evals (automatic + human review).
🔀 Model Routing Strategies (Critical Topic)
Strategy | How It Works | Best For | AWS Implementation
Static Routing | Predetermined rules (department, content type, user role) | Simple, predictable workloads | Lambda + JSON routing config, AppConfig feature flags
Dynamic / Intelligent Routing | Runtime analysis of prompt complexity, content type, cost/quality | Mixed workloads needing optimization | Bedrock Intelligent Prompt Routing
Content-Based Routing | Step Functions Choice states evaluate input characteristics | Specialized models per domain | Step Functions + Lambda classifier
Model Cascading | Start cheap (Nova Lite) → escalate to Pro only if quality < threshold | Cost optimization with quality floor | Lambda confidence scoring + escalation logic
Cross-Region Inference | Distribute requests across AWS Regions | Throughput scaling, HA, latency optimization | Bedrock Cross-Region Inference Profiles
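The cascading row above can be sketched in a few lines. This is a minimal illustration, not a production router: the model IDs, the `invoke` callable, and the `score` heuristic are placeholders you would supply (e.g., a Lambda that scores confidence).

```python
# Minimal model-cascading sketch: try the cheap model first, escalate only
# when a confidence score falls below the quality floor.
def cascade(prompt, invoke, score, threshold=0.8):
    """invoke(model_id, prompt) -> response text; score(response) -> 0..1."""
    cheap = invoke("nova-lite", prompt)           # placeholder model ID
    if score(cheap) >= threshold:
        return cheap, "nova-lite"
    # Quality floor not met: escalate to the stronger, costlier model.
    return invoke("nova-pro", prompt), "nova-pro"
```

In practice the scorer might check response length, a self-reported confidence field, or a lightweight classifier; log every escalation so you can tune the threshold against cost.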
Nova Model Routing Tiers
Nova 2 Pro

High-complexity: multi-step reasoning, detailed analysis, document understanding. Highest cost/quality.

Nova 2 Lite

Medium complexity: standard generation, high-volume processing. Best cost/performance ratio for most workloads.

Nova 2 Sonic

Real-time conversational AI with lowest latency. Optimized for voice applications and streaming dialogue.

💡 Bedrock Intelligent Prompt Routing — Key Facts
  • Analyzes: prompt length, complexity, content type, performance requirements
  • Evaluates: latency requirements, cost limits, quality thresholds
  • Routes between: Claude, Nova, Titan, and other Bedrock models
  • Learns from performance data over time (improves routing accuracy)
  • Maintains consistent response formats across models
  • Configure via the Bedrock API — minimal code changes required
πŸ›‘οΈ Resilient AI System Design
Circuit Breaker Pattern
  • Monitor error rate over N requests
  • Threshold: ~50% failures → open circuit
  • Recovery timeout: 30-60 seconds
  • Half-open: test 10-20% traffic
  • Implement with: Step Functions + CloudWatch alarms
Fallback Hierarchy
  • Primary model → smaller model
  • Smaller model → cached response
  • Cached → static response
  • Each level has a quality threshold check
  • Log all fallback events for analysis
Cross-Region HA
  • Active-active or active-passive
  • Route 53 health checks (30s interval, 3 failures)
  • Cross-region inference profiles in Bedrock
  • DynamoDB Global Tables for state sync
  • S3 Cross-Region Replication for assets
⚠️ Watch Out: "Graceful degradation" is a key exam theme. Know how to design systems that degrade gracefully: return a cached or simplified response rather than fail completely. Use AWS AppConfig for dynamic configuration updates without redeployment.
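The circuit-breaker numbers above (≈50% failure rate over a window, 30-60 s recovery, half-open trial) can be sketched as a small class. This is an illustrative in-process sketch, assuming a rolling window of outcomes; a real deployment would use CloudWatch alarms or Step Functions state instead.

```python
import time

# Minimal circuit-breaker sketch matching the thresholds above.
class CircuitBreaker:
    def __init__(self, window=10, failure_ratio=0.5, recovery_s=30):
        self.window = window
        self.failure_ratio = failure_ratio
        self.recovery_s = recovery_s
        self.results = []          # rolling window of True/False outcomes
        self.opened_at = None      # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: let a trial request through after the recovery timeout.
        return time.time() - self.opened_at >= self.recovery_s

    def record(self, success):
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if success:
            self.opened_at = None  # a success (e.g., half-open trial) closes the circuit
        elif len(self.results) >= self.window and failures / len(self.results) >= self.failure_ratio:
            self.opened_at = time.time()  # open the circuit
```

Callers check `allow()` before invoking the primary model and fall through the hierarchy (smaller model, cache, static response) when it returns False.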
1.3 Implement Data Validation & Processing Pipelines

🔄 Data Quality & Validation Architecture

Data quality impacts FMs through three channels: prompts, retrieved information (RAG), and fine-tuning datasets.

Tool | Role in Pipeline | Key Capability
AWS Glue | ETL, Data Catalog, crawlers | Schema detection, data cataloging, validation workflows, PySpark transforms
SageMaker Data Wrangler | Data exploration & transformation UI | 300+ built-in transforms, data quality reports, bias detection
SageMaker Processing Jobs | Large-scale data processing | Pre-built scikit-learn/Spark containers, feature engineering, evaluation
AWS Lambda | Custom validation logic, real-time checks | Schema validation, type checks, range validation, normalization
Step Functions | Pipeline orchestration with quality gates | Error handling, retries, parallel processing, feedback loops
Amazon Comprehend | NLP enrichment | Entity extraction, sentiment, PII detection for data enhancement
Bedrock Data Automation | Unstructured data processing | Auto-cleansing, tokenization, formatting for training/RAG data
CloudWatch | Data quality monitoring | Custom metrics for data drift, quality scores, anomaly detection
📦 JSON Formatting for Bedrock APIs (Must Know)

Each FM has a specific JSON schema. The Converse API provides a unified interface.

Claude / Nova (messages format):

// Claude format
{
  "anthropic_version": "bedrock-2023-05-31",
  "max_tokens": 1000,
  "system": "You are an assistant...",
  "messages": [{ "role": "user", "content": "Your prompt here" }],
  "temperature": 0.7,
  "top_p": 0.9
}

// Nova format (similar, uses inferenceConfig)
{
  "system": [{"text": "..."}],
  "messages": [{...}],
  "inferenceConfig": { "maxTokens": 1000, "temperature": 0.7 }
}

Amazon Titan format:

{
  "inputText": "Your prompt",
  "textGenerationConfig": {
    "maxTokenCount": 500,
    "temperature": 0.8,
    "topP": 0.9,
    "stopSequences": ["User:"]
  }
}

HTTP Error Codes:

  • 400 — Bad Request (invalid JSON, missing fields)
  • 401/403 — Auth/permission issues (non-retriable)
  • 429 — Throttling (retriable with backoff)
  • 500/503 — Service errors (retriable)
🎯 Retry Strategy: Exponential backoff starting at 100ms, factor of 2, max 3-5 attempts, add ±100ms jitter. The SDK does this automatically if configured.
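The retry schedule above is easy to compute explicitly. A minimal sketch (the boto3 retry config handles this for you in practice; `rng` is injectable so the jitter can be controlled in tests):

```python
import random

# Exponential backoff with jitter: base 100 ms, factor 2, plus up to
# ±jitter seconds of uniform random noise, for a capped number of attempts.
def backoff_delays(attempts=4, base=0.1, factor=2.0, jitter=0.1, rng=random.random):
    delays = []
    for attempt in range(attempts):
        delay = base * (factor ** attempt)
        delay += (2 * rng() - 1) * jitter   # uniform in [-jitter, +jitter]
        delays.append(max(delay, 0.0))      # never sleep a negative duration
    return delays
```

Only retry the retriable codes (429, 500, 503); a 400 or 403 will fail identically on every attempt.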
Multimodal Input (image in messages):
{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this diagram"},
      {"type": "image", "source": {
        "type": "base64",
        "media_type": "image/jpeg",
        "data": "<base64-encoded-image>"
      }}
    ]
  }]
}
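Since the Converse API abstracts over these per-model schemas, it is worth seeing its request shape once. A sketch (the model ID is an example; building the payload in a helper keeps it unit-testable without AWS credentials):

```python
# Sketch of the unified Converse API call shape (boto3).
def build_converse_request(model_id, user_text, system_text=None,
                           max_tokens=1000, temperature=0.7):
    req = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": temperature},
    }
    if system_text:
        req["system"] = [{"text": system_text}]
    return req

# Invocation (requires AWS credentials and Bedrock model access):
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.converse(**build_converse_request(
#     "amazon.nova-lite-v1:0", "Summarize this ticket", "You are concise."))
# print(resp["output"]["message"]["content"][0]["text"])
```

Note the uniform shape: the same `messages`/`system`/`inferenceConfig` structure works across Claude, Nova, and Titan chat models, which is the whole point of the Converse API.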
🎭 Multimodal Data Processing
Text Processing
  • Amazon Comprehend: entities, sentiment, PII
  • AWS Glue: ETL, normalization
  • Lambda: custom cleaning, tokenization
  • Bedrock Data Automation: AI-powered prep
Image Processing
  • Amazon Rekognition: object detection, labels
  • Bedrock Nova Canvas/Titan Image
  • Base64 encoding for Bedrock API
  • S3 + Lambda trigger pipeline
Audio/Video
  • Amazon Transcribe: speech-to-text
  • Cross-modal alignment (sync audio/video)
  • Nova Reel: video generation
  • Nova Sonic: real-time audio conversation
💡 S3 Vectors (New Feature): Amazon S3 Vectors is a new capability for storing and querying vector embeddings natively in S3. It supports billions of vectors with sub-second query latency. Key advantages: 40-60% cost reduction with Intelligent-Tiering, metadata pre-filtering that reduces the search space 50-70%, multi-region replication with <15 min sync, and ABAC for fine-grained access.
1.4 Design & Implement Vector Store Solutions

πŸ“ Vector Database Deep Dive
Distance Metrics β€” Know All Three:
Metric | Formula Concept | Best For | Notes
Cosine Similarity | Angle between vectors (direction only) | Text embeddings, docs of different lengths | Range: -1 to 1; ignores magnitude; most common for NLP
Euclidean Distance | Straight-line distance in vector space | When magnitude matters, dense embeddings | Sensitive to dimensionality; lower = more similar
Dot Product | Magnitude + direction combined | When content volume is relevant | Can favor longer documents; efficient to compute
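The three metrics side by side, in pure Python for clarity (real vector stores compute these inside the index engine):

```python
import math

def dot(a, b):
    # Magnitude + direction combined; can favor longer documents.
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance; lower = more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Angle only: dividing by the norms removes magnitude from the comparison.
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb)
```

On L2-normalized vectors, cosine ordering and dot-product ordering agree, which is why Titan Embeddings V2's normalization option pairs naturally with cosine similarity.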
AWS Vector Store Options:
Service | Index Type | Hybrid Search | Scale | Best Use Case
OpenSearch Neural | HNSW or IVF | ✅ Keyword + vector | Large to very large | Full-text + semantic search, enterprise search
Aurora pgvector | IVFFlat, HNSW | ✅ SQL + vector | Medium | Need relational queries + similarity (e.g., filter by user_id, then similarity)
S3 Vectors | Native S3 distributed | ❌ Vector only | Billions of vectors | Cost-optimized large-scale vector storage
Bedrock Knowledge Bases | Managed (OSS backend) | ✅ Managed hybrid | Enterprise | Managed RAG — no infra management
Amazon MemoryDB | Redis-compatible | ❌ | Medium | Ultra-low-latency vector + key-value
πŸ” OpenSearch HNSW Configuration (Deep Detail)

Hierarchical Navigable Small World (HNSW) is the primary index type for vector search in OpenSearch:

Index Construction Parameters
  • M: max connections per node — higher M = better recall but more memory (typical: 16-64)
  • ef_construction: search width during build — higher = better quality, slower indexing (typical: 100-512)
  • max_connections: upper limit on node connections
Search Parameters
  • ef_search: search width during query — higher = better recall, slower (typical: 100-512)
  • num_candidates: candidates to evaluate
  • rescore: enable for improved accuracy
  • Performance: p50/p95/p99 latency + recall@k
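These parameters map onto fields of an OpenSearch k-NN index mapping. A sketch, assuming a 1024-dimension embedding, cosine space, and the nmslib engine (dimension, engine, and space_type depend on your embedding model and cluster):

```json
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 256
    }
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": { "m": 16, "ef_construction": 256 }
        }
      }
    }
  }
}
```

Note that for the nmslib engine, ef_search is an index-level setting as shown; other engines (faiss, lucene) expose it differently, so check the k-NN plugin docs for your engine.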
4-Stage Hierarchical Search Pipeline:
  1. Coarse filtering: Apply metadata filters, document clustering, semantic routing to relevant partitions
  2. Approximate ANN search: Fast approximate nearest neighbor, retrieve larger candidate set
  3. Fine-grained ranking: Precise cosine scores, business logic weighting, diversity algorithms
  4. Result assembly: Retrieve full content + metadata, final formatting, relevance explanations
🎯 S3 Vectors Performance: Pre-filter metadata BEFORE vector calculations to reduce search space 50-70%. Use prefix-based hierarchical organization for efficient filtering. Configure Intelligent-Tiering for 40-60% cost reduction on infrequently accessed vectors.
🔄 Vector Store Data Maintenance Systems
Event-Driven Updates
  • S3 event → Lambda → re-embed → upsert
  • DynamoDB Streams → update pipeline
  • Near real-time freshness
  • Best for: frequently changing docs
Batch Sync
  • Scheduled Glue jobs or Step Functions
  • Delta detection (last-modified timestamps)
  • Cost-efficient for bulk updates
  • Best for: large corpora, nightly updates
Hybrid Approach
  • Real-time for high-priority content
  • Batch for bulk/archival content
  • Drift monitoring with CloudWatch
  • Version control for knowledge bases
S3 Metadata Framework for RAG Enhancement:
System-Defined Metadata
  • Content-Type, Content-Length
  • Last-Modified timestamp
  • ETag (content fingerprint)
  • x-amz-version-id
User-Defined Metadata (x-amz-meta-*)
  • document-author, department, category
  • expiry-date, version, language
  • security-classification, jurisdiction
  • Enables pre-filtering before vector search
1.5 Design Retrieval Mechanisms for FM Augmentation (RAG)

βœ‚οΈ Chunking Strategies β€” Critical Deep Dive
Strategy | How It Works | Pros | Cons | Use When
Fixed-Size | Split every N tokens (e.g., 512) with optional overlap (e.g., 50 tokens) | Simple, predictable, consistent embeddings | May break semantic units | Uniform content (FAQs, reports)
Recursive Character | Try splitting on paragraphs → sentences → words → chars | Preserves natural boundaries better | Variable chunk sizes | General-purpose documents
Semantic | Split where embedding similarity drops below a threshold | Content-aware, preserves meaning | Slower; requires embedding during chunking | Varied documents, conversational content
Hierarchical | Parent chunks (large context) + child chunks (precise retrieval) | Best of both worlds: precision + context | More complex, higher storage cost | Long documents needing both broad and specific retrieval
Document-Structure | Use headers, sections, paragraphs as boundaries | Preserves logical document structure | Requires structured input | PDFs, Word docs, HTML with clear structure
💡 Chunking Best Practices
  • Overlap: 10-20% of chunk size to preserve cross-boundary context
  • Include metadata in chunk (source, page, section) for better retrieval context
  • Measure: chunk cohesion (intra-chunk cosine similarity), retrievability metrics
  • Bedrock Knowledge Bases offers: fixed-size, semantic, and hierarchical chunking built-in
  • Custom chunking: Lambda function for complex logic (hierarchical workflows)
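Fixed-size chunking with overlap, the simplest strategy above, fits in a few lines. A sketch where "tokens" are pre-tokenized items (in real pipelines you would use the embedding model's tokenizer, not whitespace words):

```python
# Fixed-size chunking with overlap. The 10-20% overlap rule above means
# e.g. size=512, overlap=50-100.
def chunk_fixed(tokens, size=512, overlap=50):
    assert 0 <= overlap < size
    step = size - overlap               # each chunk starts `step` tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                       # last chunk reached; avoid trailing slivers
    return chunks
```

The overlap means the tail of each chunk is repeated at the head of the next, so sentences that straddle a boundary are fully contained in at least one chunk.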
🧲 Embedding Models — Amazon Titan Embeddings
Titan Text Embeddings V2
  • Dimensions: 256, 512, or 1024 (configurable)
  • Supports normalization (for cosine)
  • English + multilingual support
  • Best for: text-only semantic search
Titan Multimodal Embeddings G1
  • Embeds both text AND images in same space
  • Cross-modal similarity search
  • Dimension: 1024
  • Best for: product search, media retrieval
Embedding Selection Criteria
  • Match dimensionality to quality/cost need
  • Use SAME model for indexing AND querying
  • Consider: throughput, cost per 1K tokens
  • Cohere Embed for multilingual enterprise
🎯 Critical Rule: Always use the exact same embedding model for both creating the vector index AND for embedding queries at search time. Mixing models produces meaningless similarity scores.
🔎 Advanced Query Engineering
Query Enhancement Techniques:
Query Expansion
  • Use LLM to generate synonyms/related terms
  • HyDE: generate hypothetical answer, embed it, search for similar docs
  • Multi-query: generate N variations → union results
  • Domain-specific expansion (medical/legal terms)
Query Decomposition
  • Break complex queries into sub-queries
  • Identify: temporal, entity, constraint components
  • Run sub-queries in parallel (Lambda)
  • Aggregate + deduplicate results
  • Use Step Functions for orchestration
Re-ranking
  • First-pass: fast ANN retrieval (top-k)
  • Re-rank with cross-encoder model
  • Apply business logic weighting
  • Diversity algorithms (avoid result clustering)
  • Amazon Kendra: hybrid keyword + semantic
# Query pipeline classification example
def select_processing_pipeline(query, classification):
    if classification == 'simple':
        return ['expansion']
    elif classification == 'complex':
        return ['decomposition', 'expansion', 'transformation']
    elif classification == 'domain_specific':
        return ['domain_expansion', 'specialized_transformation']
    return ['expansion']  # default: fall back to basic expansion
1.6 Implement Prompt Engineering Strategies & Governance

✍️ Advanced Prompt Engineering Techniques
Technique | Description | AWS Implementation | Best For
Chain-of-Thought (CoT) | "Think step by step" — forces intermediate reasoning before the answer | System message + prompt structure; Step Functions for multi-step | Math, logic, complex analysis
ReAct (Reason + Act) | Interleaved Reasoning-Action-Observation loop | Step Functions state machine (Reason state → Action state → Observe state) | Agentic tasks needing tool use
Few-Shot | Provide 3-5 examples in the prompt | Bedrock Prompt Management templates with examples | Classification, format adherence
Tree of Thoughts | Explore multiple reasoning branches in parallel | Step Functions Parallel states + aggregation Lambda | Complex multi-path problems
Self-Consistency | Sample N responses, majority vote | Lambda to invoke the model N times + aggregation | Factual accuracy, reducing hallucination
Prompt Chaining | Output of prompt A feeds prompt B | Bedrock Flows (visual) or Step Functions | Multi-stage document processing
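The self-consistency row above amounts to "sample N times, take the majority." A sketch, where `sample` stands in for a model invocation (temperature > 0) plus final-answer extraction:

```python
from collections import Counter

# Self-consistency: sample the model N times and majority-vote the answers.
def self_consistent_answer(prompt, sample, n=5):
    answers = [sample(prompt) for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    # Agreement ratio doubles as a rough confidence proxy.
    return winner, votes / n
```

In the Lambda implementation the exam describes, each sample would be an InvokeModel call and the aggregation would run after all N responses return (e.g., fanned out via Step Functions Parallel or Map states).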
💡 Bedrock Flows: Visual, node-based builder for prompt chains. Nodes include: FM inference nodes, Lambda nodes, Condition nodes, Iterator nodes, Collector nodes, Knowledge Base retrieval nodes. Use for: RAG + generation pipelines, multi-step reasoning, conditional branching without custom code.
πŸ›‘οΈ Bedrock Guardrails β€” Deep Configuration
Content Filters
  • Categories: Hate, Insults, Sexual, Violence, Misconduct, Prompt Attack
  • Severity levels: LOW, MEDIUM, HIGH
  • Applies to: INPUT and/or OUTPUT
  • Custom threshold per category
Topic Denial
  • Define forbidden topics with plain language
  • Examples: competitor products, legal advice, medical diagnoses
  • LLM-based classification (no regex)
  • Returns custom denial message
PII Redaction
  • 50+ PII types: SSN, credit card, email, phone, name, address
  • Modes: REDACT (replace with type) or BLOCK
  • Applies to both input and output
  • Audit-ready with CloudTrail logging
Grounding Check
  • Detects hallucinations vs. source documents
  • Checks if output is grounded in retrieved context
  • Relevance scoring threshold configurable
  • Essential for RAG pipelines
Word Filters
  • Custom blocked word lists
  • Managed lists (profanity)
  • Applied post-generation
Prompt Injection Defense
  • PROMPT_ATTACK filter category in content filter
  • Detects jailbreak attempts, role-play attacks
  • System prompt separation (protected)
  • Input validation in Lambda pre-Bedrock call
⚠️ Guardrails Gotcha: Guardrails must be explicitly associated with a model invocation (via guardrailIdentifier + guardrailVersion in the API call). They do NOT auto-apply to all Bedrock calls. Also: Guardrails can be applied at both REQUEST and RESPONSE level independently.
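Attaching a guardrail to a Converse call looks roughly like this. A sketch: the guardrail ID is a placeholder, and the helper just decorates a request dict so it can be tested without calling AWS.

```python
# Explicitly associate a guardrail with a Converse request, per the gotcha
# above. guardrailVersion is a version number string or "DRAFT".
def with_guardrail(request, guardrail_id, version="1"):
    request = dict(request)  # copy; leave the caller's dict untouched
    request["guardrailConfig"] = {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": version,
        # "trace": "enabled",  # optional: see which policy intervened and why
    }
    return request

# Usage (requires AWS credentials):
# client.converse(**with_guardrail(base_request, "my-guardrail-id"))
```

Without the `guardrailConfig` block the invocation runs completely unguarded, which is exactly the failure mode the gotcha warns about.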
📋 Prompt Management & Governance (Enterprise)
Bedrock Prompt Management Features:
  • Centralized repository: Store prompt templates with versions
  • Parameterization: Variables in templates ({{input}}, {{context}})
  • Version control: Draft → Review → Approved → Production
  • Approval workflows: Governance gates before deployment
  • A/B testing: Route % traffic to different prompt versions
  • Analytics: Track performance per prompt version
Governance Architecture:
  • CloudTrail: All prompt management API calls logged
  • IAM policies: Role-based access to prompt versions
  • Security Hub: Near-real-time risk analytics for FM deployments
  • Centralized vs. Federated: Central policy + distributed implementation
  • Async monitoring: Don't impact latency with sync governance checks
🎯 QA for Probabilistic Outputs: FM outputs are probabilistic — the same input can produce different outputs. Design validation around semantic similarity scoring (not exact match), threshold-based acceptance, and statistical validation over N runs. Use SageMaker AI serverless RL-based customization (new feature) to reduce fine-tuning time from months to days.

Domain 2: Implementation & Integration

Tasks 2.1–2.5 · Agentic AI, Deployment Strategies, Enterprise Integration, API Patterns, Dev Tools

~25% of Exam Weight
2.1 Implement Agentic AI Solutions & Tool Integrations

🤖 Agentic AI Architecture Overview
Technology | Type | Key Characteristics | When to Use
Amazon Bedrock Agents | Fully managed | Built-in orchestration, action groups, knowledge bases, memory, Guardrails integration | Standard agentic workflows, minimal infra management, conversational agents
Bedrock AgentCore | Composable services | Framework-agnostic (works with any SDK/model), AgentCore Policy (governance), AgentCore Evaluations, episodic memory for enhanced context | Complex agents needing fine-grained composability, multi-framework environments
Strands Agents SDK | Open-source | Full code visibility, modular (swap components), built-in eval, MCP integration, @tool decorator | Custom agent logic, need for transparency/control, contributing to open source
AWS Agent Squad | Multi-agent orchestration | Coordinates multiple specialized agents, shared context/state, task delegation | Complex tasks requiring collaboration between specialized agents
Step Functions (ReAct) | Workflow engine | Deterministic state machines, guaranteed execution, built-in error handling, human approval steps | Predictable workflows needing an audit trail, human-in-the-loop, compliance
🔗 Model Context Protocol (MCP) — Deep Dive

MCP is a standardized protocol for agent-tool interactions. Agents discover tools, invoke them, and get results via MCP servers.

MCP Transport Protocols
  • stdio: Local process communication (dev/local)
  • SSE: Server-Sent Events (streaming, HTTP)
  • streamable-http: For AWS deployments (Mcp-Session-Id header for isolation)
MCP Server Hosting Options
  • Lambda: Stateless, lightweight tools (web search, calculations, data retrieval)
  • ECS: Stateful, complex tools (code execution, image processing, large compute)
  • API Gateway: Expose MCP-compatible endpoints for existing services
6-Step MCP Workflow:
  1. MCP Client Initialization: Agent app connects to MCP server via transport protocol
  2. Tool Discovery: Agent calls list_tools() — gets name, description, and input schema for each tool
  3. Agent Creation: Agent created with discovered tools; LLM can now see tools in system prompt
  4. Reasoning & Tool Selection: LLM analyzes user query, decides which tool to call and with what arguments
  5. MCP Server Execution: Server executes tool function, returns result to agent (server is stateless)
  6. Final Response: Agent synthesizes tool results into coherent response to user
# Strands Agent with MCP integration pattern
from mcp import stdio_client, StdioServerParameters
from strands import Agent
from strands.tools.mcp import MCPClient

mcp_client = MCPClient(lambda: stdio_client(
    StdioServerParameters(command="uvx",
                          args=["awslabs.aws-documentation-mcp-server@latest"])
))

with mcp_client:
    tools = mcp_client.list_tools_sync()
    agent = Agent(tools=tools, model="anthropic.claude-3-5-sonnet-20241022-v2:0")
    response = agent("What is the Bedrock Converse API?")  # Agent auto-selects tools
🔒 Safeguarded AI Workflows
Stopping Conditions
  • Step Functions: max iteration count in Choice state
  • Lambda: timeout settings (predictable execution)
  • CloudWatch alarms: auto-halt on error rate threshold
  • Circuit breaker: 50% failure → open circuit 30-60s
IAM Boundaries for Agents
  • Least-privilege resource policies
  • Restrict agent to only necessary actions/resources
  • Deny any unneeded service calls
  • Session policies for temporary credentials
Human-in-the-Loop
  • Step Functions Human Task state (wait for token)
  • API Gateway: collect human feedback
  • DynamoDB: store review decisions with TTL
  • Escalation criteria based on confidence scores
Input Validation
  • Schema validation before agent processing
  • Lambda pre-processing for malformed inputs
  • Bedrock Guardrails: prompt injection detection
  • Rate limiting via API Gateway usage plans
🎯 ReAct Pattern in Step Functions: The state machine alternates: Reason state (invoke LLM) → Parse Action state (Lambda extracts the tool call) → Execute Action state (call the tool) → Observe state (feed the result back to the LLM) → repeat until a final answer or max steps reached.
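The ReAct state machine above can be mirrored as a plain loop. A sketch with stand-ins: `llm` returns either `("final", answer)` or `("action", tool_name, args)`, and `tools` maps tool names to callables (in the Step Functions version these are the Reason/Parse/Execute Lambda states).

```python
# ReAct loop sketch mirroring the state machine described above.
def react_loop(question, llm, tools, max_steps=5):
    observations = []
    for _ in range(max_steps):              # stopping condition: max iterations
        step = llm(question, observations)  # Reason state
        if step[0] == "final":
            return step[1]
        _, tool_name, args = step           # Parse Action state
        result = tools[tool_name](**args)   # Execute Action state
        observations.append(result)         # Observe state: result feeds next turn
    return None                             # budget exhausted: no final answer
```

The `max_steps` cap is the same safeguard as the Choice-state iteration limit in the previous section: without it, a confused agent can loop on tool calls indefinitely.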
🤝 Multi-Agent Coordination Patterns
Ensemble / Aggregation
  • Multiple agents/models on same task
  • Majority voting for classification
  • Weighted averaging for numeric outputs
  • Ranked fusion for retrieval
  • Lambda aggregation logic
Specialized Routing
  • Agent Squad: route to specialized agent
  • Claude → complex reasoning tasks
  • Nova Pro → document analysis
  • Nova Lite → simple/high-volume tasks
  • Domain-specific agents (medical, legal)
Hierarchical Agents
  • Orchestrator agent decomposes task
  • Sub-agents handle specific components
  • Results aggregated by orchestrator
  • Step Functions manages coordination
  • DynamoDB shares state between agents
2.2 Implement Model Deployment Strategies

🚀 Deployment Strategy Decision Framework
Strategy | Service | Traffic Pattern | Latency | Cost Model | Key Config
On-Demand Serverless | Lambda + Bedrock | Spiky, unpredictable | Variable (cold-start risk) | Pay per invocation | Memory, timeout, concurrency limits
Bedrock On-Demand | Bedrock InvokeModel | Any | Low-medium | Pay per token | Model ID, throttling limits
Bedrock Provisioned Throughput | Bedrock PT | Steady, high-volume | Consistent, low | Per-hour commitment (1 mo/6 mo) | Model Units (MUs), CloudWatch monitoring
SageMaker Real-time | SageMaker Endpoints | Consistent, latency-sensitive | Low (<1s) | Instance hours + data | Instance type, auto-scaling policy
SageMaker Serverless | SageMaker Serverless | Intermittent | Medium (cold start) | Pay per request | Memory size, max concurrency
SageMaker Async | SageMaker Async Endpoints | Batch, non-latency-sensitive | Minutes | Instance hours (scale-to-zero) | S3 input/output, max concurrency
Multi-Model Endpoint | SageMaker MME | Many models, low per-model traffic | Variable (model loading) | Shared instance across models | Container + model artifacts, routing
πŸ–₯️ Large Language Model Deployment Challenges
Memory Management
  • LLMs can be 10s-100s of GB
  • SageMaker: up to 500GB model size
  • GPU instances: ml.g5, ml.p4d.24xlarge (for large models)
  • CPU for small NER/classification: ml.c5.9xlarge
  • Container health check timeout: up to 60 min
Model Parallelism
  • DeepSpeed: tensor/pipeline parallelism
  • Triton + FasterTransformer: optimized inference
  • SageMaker Distributed Inference
  • UltraServers: multi-EC2 instances with low-latency interconnect
  • For models larger than single GPU memory
Token Processing Optimization
  • Batching: group requests to maximize GPU utilization
  • Continuous batching: admit and retire requests at token granularity to keep the GPU saturated
  • KV-cache: reuse attention computations
  • Quantization (INT8/INT4): reduce model size
  • Knowledge distillation: train smaller model from large
SageMaker Endpoint Types Comparison:
Inference Components (New)
  • Host multiple models on single endpoint
  • Define separate scaling policies per model
  • Control memory/CPU allocation per component
  • Scale each model independently based on traffic
  • Best for: multi-model serving with different traffic patterns
Serial Inference Pipelines
  • Chain multiple models in sequence
  • Output of model N β†’ input of model N+1
  • E.g.: preprocessing model β†’ LLM β†’ postprocessing
  • Single endpoint for the pipeline
  • Best for: fixed multi-step inference workflows
πŸ’‘ Nova Forge via SageMaker: Custom model training starting from Nova checkpoints. Mix proprietary data with Nova-curated data across all training phases (pre/mid/post-training). Supports RL with custom reward functions and custom orchestrator for multi-turn rollouts. Prevents catastrophic forgetting better than pure custom training.
πŸ’‘ Bedrock Custom Model Import: Import models trained/fine-tuned in SageMaker into Bedrock. Get on-demand API access without managing endpoints. More cost-effective than provisioned throughput for variable traffic.
βš–οΈ Optimized Deployment Approaches
Model Cascading Architecture:
  1. Route all requests to smallest/cheapest model first (Nova Lite)
  2. Evaluate response quality with confidence scoring Lambda
  3. If quality < threshold (e.g., 0.7-0.9), escalate to Nova Pro
  4. Cache high-quality responses for similar future queries
  5. Monitor cascade metrics: escalation rate, cost savings, quality distribution
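The routing logic in steps 1-3 can be sketched as below. The `invoke_*` callables and the scorer are injected so the cascade stays model-agnostic; in practice they would wrap bedrock-runtime Converse calls to Nova Lite/Pro and a confidence-scoring Lambda (an assumption of this sketch, not a prescribed API).

```python
def cascade(prompt, invoke_small, invoke_large, score, threshold=0.8):
    """Try the cheap model first; escalate only when confidence
    falls below the quality threshold from step 3."""
    answer = invoke_small(prompt)
    confidence = score(prompt, answer)
    if confidence >= threshold:
        return answer, "small", confidence   # cheap model was good enough
    answer = invoke_large(prompt)            # escalate to the larger model
    return answer, "large", score(prompt, answer)
```

Logging which tier served each request gives you the escalation-rate metric from step 5 for free.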
Caching Strategies
  • Response caching: ElastiCache/DynamoDB for identical/near-identical queries
  • Embedding caching: Avoid re-embedding same content
  • Semantic caching: Return cached if query vector is close enough (similarity threshold)
  • API Gateway cache: 300s default TTL for GET requests
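Semantic caching can be sketched as below, assuming an injected `embed` callable (e.g. a Titan Embeddings call) and an in-memory list standing in for ElastiCache/DynamoDB; the 0.95 threshold matches the guidance above.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Serve a cached response when the query embedding is close
    enough to a previously seen query."""
    def __init__(self, embed, threshold=0.95):
        self.embed, self.threshold, self.entries = embed, threshold, []

    def get(self, query):
        qv = self.embed(query)
        for vec, response in self.entries:
            if cosine(qv, vec) > self.threshold:
                return response        # near-identical query: cache hit
        return None                    # miss: caller invokes the model

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

A production version would do the nearest-neighbor lookup in a vector store rather than a linear scan.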
Asynchronous Inference Pattern
  • SQS queue β†’ Lambda β†’ SageMaker Async Endpoint
  • Results stored in S3, notification via SNS
  • Scale to zero when no traffic
  • SQS visibility timeout matches processing duration (5-15 min for LLMs)
  • DLQ after 3-5 failed attempts
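The SageMaker half of the pattern uses `InvokeEndpointAsync`, whose payload is referenced by S3 URI rather than sent inline. A sketch of the request shape (endpoint and bucket names are placeholders):

```python
def async_invoke_request(endpoint_name, input_s3_uri, timeout_s=900):
    """Request dict for sagemaker-runtime InvokeEndpointAsync: input is
    read from S3, results land back in S3, and SNS notifies on completion.
    Call as: boto3.client("sagemaker-runtime").invoke_endpoint_async(**request)
    """
    assert input_s3_uri.startswith("s3://"), "payload must already be in S3"
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,
        "ContentType": "application/json",
        "InvocationTimeoutSeconds": timeout_s,  # LLM jobs: allow minutes, not seconds
    }
```

The SQS visibility timeout should comfortably exceed `InvocationTimeoutSeconds` so in-flight messages are not redelivered mid-inference.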
2.3 Design & Implement Enterprise Integration Architectures

🏒 Enterprise Connectivity Patterns
API-Based Integration
  • API Gateway: REST/HTTP/WebSocket APIs
  • Custom domain mappings for branding
  • Regional (low-latency) vs Edge-optimized (global)
  • Lambda integration for custom logic
  • Usage plans + throttling per API key
Event-Driven Integration
  • EventBridge: route business events to FM processing
  • Pattern matching: select which events need GenAI
  • SQS DLQ: handle failed event processing
  • EventBridge Pipes: source β†’ filter β†’ enrich β†’ target
  • Loose coupling between systems
Hybrid/On-Premises
  • AWS Outposts: run FM inference in your DC
  • AWS Wavelength: edge deployments for ultra-low latency
  • Local Zones: geographic compliance
  • Direct Connect: dedicated network to AWS
  • Site-to-Site VPN: encrypted connectivity
πŸ” Secure Access Framework for GenAI
Security Layer | Service/Pattern | Implementation Detail
Identity Federation | IAM Identity Center / Cognito | Attribute mapping from IdP, role assignment per user group
Fine-grained Access | Amazon Verified Permissions | Cedar policy language, attribute-based (ABAC) policies on resources
Network Isolation | VPC Endpoints (PrivateLink) | Private connectivity to Bedrock without internet; security groups + NACLs
Encryption in Transit | ACM + TLS 1.2+ | All API calls to Bedrock are TLS encrypted by default
Encryption at Rest | AWS KMS | Customer-managed keys (CMK) for model artifacts, prompt logs, knowledge bases
Audit Logging | CloudTrail + CloudWatch Logs | Log all FM API calls with request/response for compliance
πŸ”§ CI/CD for GenAI + GenAI Gateway Architecture
CI/CD Pipeline (CodePipeline + CodeBuild)
  1. Source: CodeCommit/GitHub trigger
  2. Build: CodeBuild β€” package Lambda, validate prompts, dependency scan
  3. Test: Automated FM behavior tests (deterministic + probabilistic)
  4. Security scan: SAST/DAST, dependency vulnerabilities
  5. Staging deploy: limited traffic rollout
  6. Approval gate: human review or automated quality check
  7. Production deploy: blue/green or canary
  8. Post-deploy: CloudWatch alarms, rollback trigger
GenAI Gateway Pattern
  • Centralized entry point for all FM access
  • API Gateway β†’ Lambda Gateway β†’ Bedrock/SageMaker
  • Enforces: auth, rate limiting, logging, cost tracking
  • Model routing logic centralized here
  • X-Ray tracing across all hops
  • Cost allocation by team/use-case via tags
  • Supports: A/B testing, gradual rollout
2.4 Implement FM API Integrations

🌊 Streaming & Real-Time AI
Bedrock Streaming API
  • InvokeModelWithResponseStream
  • Returns chunks as they're generated
  • Buffer management: 5-20 chunks, flush on sentence completion
  • Client-side progressive rendering
  • Error recovery: fallback to full-response API if streaming fails persistently
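The buffer-management bullet above can be sketched as a pure flush policy: accumulate streamed text deltas and emit on sentence completion or when the buffer cap is hit. In a real client, `chunks` would come from iterating a ConverseStream/InvokeModelWithResponseStream event stream; here it is any iterable of strings.

```python
def stream_with_buffer(chunks, flush, max_buffer=20):
    """Buffer streamed text chunks; flush on sentence-ending punctuation
    or when the buffer reaches max_buffer chunks (5-20 per the guidance)."""
    buf = []
    for chunk in chunks:
        buf.append(chunk)
        if chunk.rstrip().endswith((".", "!", "?")) or len(buf) >= max_buffer:
            flush("".join(buf))
            buf = []
    if buf:
        flush("".join(buf))  # emit any trailing partial sentence
```

Flushing on sentence boundaries keeps progressive rendering readable instead of word-by-word jitter.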
WebSocket / SSE Patterns
  • WebSocket: bidirectional, keep-alive ping every 30-60s
  • Idle timeout: ~10 min for interactive sessions
  • SSE: reconnection backoff from 1s to max 30-60s
  • Event IDs: resume streams after disconnection
  • API Gateway: chunked transfer encoding
πŸ”„ Resilience Patterns β€” Key Numbers to Know
Retry Configuration
  • Initial backoff: 100ms
  • Backoff factor: 2x (exponential)
  • Max backoff: 20 seconds
  • Max attempts: 3-5
  • Jitter: Β±100ms or factor 0.1-0.3
  • Retriable: 429, 500, 503
  • Non-retriable: 400, 401, 403
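The retry numbers above translate into a backoff schedule like the following. This is a sketch of the math only; the AWS SDKs ship adaptive retry modes that implement this for you.

```python
import random

RETRIABLE = {429, 500, 503}       # throttling and transient server errors
NON_RETRIABLE = {400, 401, 403}   # client/auth errors: retrying cannot help

def backoff_delays(max_attempts=5, base=0.1, factor=2.0, cap=20.0, jitter=0.1):
    """Exponential backoff with jitter: 100 ms initial delay, 2x growth,
    20 s cap, +/-10% jitter by default (matching the figures above)."""
    delays = []
    for attempt in range(max_attempts):
        delay = min(cap, base * factor ** attempt)
        delays.append(delay * (1 + random.uniform(-jitter, jitter)))
    return delays
```

Before sleeping, check the status code against `RETRIABLE`; a 400/401/403 should fail fast.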
Circuit Breaker
  • Failure threshold: 50% over 10 requests
  • Recovery timeout: 30-60 seconds
  • Half-open test traffic: 10-20%
  • Implement: Step Functions + CloudWatch
  • Alert: CloudWatch alarm β†’ SNS β†’ Lambda
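A minimal circuit-breaker sketch using the thresholds above (open at 50% failures over a 10-request window, half-open after the recovery timeout). The injectable clock is an assumption to make the behavior testable; a Step Functions implementation would keep this state in DynamoDB and alarm via CloudWatch.

```python
import time

class CircuitBreaker:
    """Open at >= failure_rate over a sliding window; allow a trial
    request after recovery_s; close again on a successful trial."""
    def __init__(self, window=10, failure_rate=0.5, recovery_s=30,
                 clock=time.monotonic):
        self.window, self.failure_rate = window, failure_rate
        self.recovery_s, self.clock = recovery_s, clock
        self.results = []      # sliding window of recent outcomes
        self.opened_at = None  # None = closed

    def allow(self):
        if self.opened_at is None:
            return True
        # half-open: let a trial request through after the recovery timeout
        return self.clock() - self.opened_at >= self.recovery_s

    def record(self, success):
        if self.opened_at is not None:
            if success:
                self.results, self.opened_at = [], None  # trial ok: close
            else:
                self.opened_at = self.clock()            # trial failed: stay open
            return
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) >= self.window and \
                failures / len(self.results) >= self.failure_rate:
            self.opened_at = self.clock()                # trip the breaker
```

While open, callers should serve a fallback (cached answer, smaller model) instead of hammering the failing dependency.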
Throttling Config
  • Account-level: 10,000 RPS
  • Stage-level: 1,000-5,000 RPS
  • Route-level: 50-500 RPS (complex models)
  • SQS for request buffering under throttle
  • SQS visibility timeout: 5-15 min for LLMs
Connection Pooling
  • Pool size: 10-20 connections per instance
  • Connection TTL: 60-300 seconds
  • Reduce SDK client instantiation (reuse across Lambda invocations)
  • Use global variables for SDK clients in Lambda
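The "global SDK client" bullet is about Lambda execution-environment reuse: anything created at module scope (or memoized) survives across warm invocations, so the client and its connection pool are built once. In a real Lambda the factory body would be `boto3.client(service)`; a stand-in object keeps this sketch runnable without AWS installed.

```python
import functools
import types

@functools.lru_cache(maxsize=None)
def get_client(service):
    """Create each SDK client once per execution environment and reuse it.
    Real Lambda: return boto3.client(service)  (placeholder used here)."""
    return types.SimpleNamespace(service=service)

def handler(event, context):
    client = get_client("bedrock-runtime")  # reused across warm invocations
    return client.service
```

The common idiom `client = boto3.client("bedrock-runtime")` at the top of the module achieves the same thing; the memoized factory just makes the reuse explicit.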
🎯 X-Ray Tracing Pattern: Add custom subsegments for: (1) input preprocessing, (2) model invocation, (3) response postprocessing. Annotate with: model_name, input_complexity_score, output_quality_score, cost_estimate. This enables performance analysis by model and query type.

Domain 3: AI Safety, Security & Governance

Guardrails, IAM, Responsible AI, Compliance, Audit, Data Privacy

~20% of Exam Weight
3.1 Implement AI Safety Controls & Responsible AI

πŸ›‘οΈ Bedrock Guardrails β€” Complete Configuration
Guardrail Feature | Configuration Detail | Applied At | Use Case
Content Filters | Categories: HATE, INSULTS, SEXUAL, VIOLENCE, MISCONDUCT, PROMPT_ATTACK. Severity: LOW/MEDIUM/HIGH per category | INPUT and/or OUTPUT independently | Block harmful content generation
Denied Topics | Plain language topic description (LLM-based classification, not regex). Custom denial message. | INPUT (topic detection) | Block competitor questions, legal/medical advice
Word Filters | Custom word lists + AWS managed profanity list | OUTPUT | Enforce brand/compliance word policies
PII Detection & Redaction | 50+ entity types: SSN, email, phone, credit card, name, address, IP. Mode: REDACT or BLOCK | INPUT and/or OUTPUT | HIPAA, PCI-DSS, GDPR compliance
Grounding Check | Verifies output is grounded in source context. Configurable relevance threshold. | OUTPUT (requires context) | Reduce hallucinations in RAG pipelines
Sensitive Info Filters | Regex patterns for custom sensitive data (e.g., employee IDs, internal codes) | INPUT and OUTPUT | Organization-specific PII beyond standard types
⚠️ Guardrails Must Be Explicitly Invoked: Pass guardrailIdentifier + guardrailVersion in InvokeModel/Converse API call. They don't auto-apply. Can test guardrails independently with ApplyGuardrail API before deploying.
🎯 Prompt Injection Defense: Use PROMPT_ATTACK content filter + keep system prompt in system parameter (separate from user messages, protected by Bedrock). Also: validate/sanitize user input in Lambda before sending to Bedrock. Input validation + Guardrails = defense in depth.
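Both points above show up in the Converse request shape: the guardrail rides along explicitly in `guardrailConfig`, and the system prompt travels in `system`, separate from user messages. A sketch of the keyword arguments (model and guardrail identifiers are placeholders):

```python
def converse_with_guardrail_kwargs(model_id, user_text, guardrail_id, version):
    """Kwargs for bedrock-runtime Converse with an explicitly attached
    guardrail -- guardrails never auto-apply.
    Usage: boto3.client("bedrock-runtime").converse(**kwargs)"""
    return {
        "modelId": model_id,
        # system prompt kept out of the user-controlled message list
        "system": [{"text": "You are a support assistant."}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "guardrailConfig": {
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": version,
            "trace": "enabled",   # surfaces which guardrail policy intervened
        },
    }

kwargs = converse_with_guardrail_kwargs(
    "amazon.nova-lite-v1:0", "Hello", "gr-example-id", "1")
```

The same identifier/version pair can be exercised standalone via the ApplyGuardrail API before wiring it into production calls.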
βš–οΈ Responsible AI Principles on AWS
Bias Detection & Mitigation
  • SageMaker Clarify: bias metrics (class imbalance, DPPL, KL divergence)
  • SageMaker Data Wrangler: data quality reports
  • Model Cards (SageMaker): document model limitations, bias findings
  • HELM benchmark: includes fairness + toxicity metrics
Explainability
  • SageMaker Clarify: SHAP values for feature importance
  • Chain-of-Thought prompting: expose reasoning
  • Model Cards: document intended use, out-of-scope uses
  • Attribution in RAG: cite source documents
Privacy & Data Protection
  • Amazon Macie: detect PII in S3 automatically
  • Bedrock: no model training on customer data (by default)
  • VPC Endpoints: data doesn't leave AWS network
  • KMS CMK: customer controls encryption keys
Auditability
  • CloudTrail: all Bedrock API calls logged
  • CloudWatch Logs: model inputs/outputs (optional logging)
  • Bedrock Model Invocation Logging: S3 + CloudWatch
  • CloudTrail Lake: query audit events with SQL
πŸ”‘ IAM & Security for GenAI
Key IAM Actions for Bedrock (Know These)
  • bedrock:InvokeModel
  • bedrock:InvokeModelWithResponseStream
  • bedrock:Retrieve
  • bedrock:RetrieveAndGenerate
  • bedrock:ApplyGuardrail
  • bedrock:CreateKnowledgeBase
  • bedrock:GetFoundationModel
  • bedrock:ListFoundationModels
Service Control Policies (SCPs)
  • Org-level deny: prevent use of non-approved models
  • Region restrictions: only us-east-1, us-west-2
  • Require conditions: VPC source, MFA, time-of-day
  • Block: CreateProvisionedModelThroughput without approval
Resource-Based Policies for Bedrock
  • Knowledge Base policies: control who can Retrieve/RetrieveAndGenerate
  • Cross-account access for shared models
  • Agent resource policies: restrict which roles can invoke
  • Condition keys: bedrock:RequestedModelId (restrict to approved models)
πŸ’‘ AWS Security Hub for GenAI: New enhanced capabilities β€” near-real-time risk analytics, improved prioritization for FM-related findings. Integrates with CloudTrail for prompt management audit events. Correlates findings across sources. Configure custom security standards for FM-specific risks (prompt injection attempts, unusual API usage patterns).
3.2 Compliance, Governance & Data Privacy

πŸ“œ Compliance Frameworks & Controls
Regulation | Key Requirement | AWS Controls
GDPR | Data minimization, right to erasure, consent | Macie (PII detection), Guardrails (PII redaction), KMS (encryption), VPC endpoints (data residency)
HIPAA | PHI protection, audit trails, BAA | Bedrock HIPAA eligibility (with BAA), Macie, CloudTrail, dedicated endpoints, encryption
PCI-DSS | Cardholder data protection | Guardrails PII filter (credit card), KMS, VPC, CloudTrail, WAF on API Gateway
SOC 2 | Security, availability, confidentiality | CloudTrail audit, Security Hub, GuardDuty, Access Analyzer
🎯 Data Residency: For regulatory requirements, use: (1) VPC Endpoints to keep data within AWS network, (2) AWS Outposts for on-premises data that can't leave DC, (3) Specific region selection (e.g., eu-west-1 for EU data), (4) S3 Object Lock for retention compliance, (5) Local Zones for specific geographic requirements.

Domain 4: Operational Efficiency & Optimization

Cost optimization, Performance tuning, Caching, Monitoring, Auto-scaling

~15% of Exam Weight
4.1 Cost Optimization for GenAI

πŸ’° Cost Optimization Strategies
Model Selection
  • Use smaller models for simple tasks (cascading)
  • Nova Lite for high-volume β†’ Pro only when needed
  • Measure: cost per task completion (not per token)
  • A/B test model quality vs. cost
Prompt Optimization
  • Shorter prompts = fewer input tokens = lower cost
  • Remove unnecessary context
  • Structured prompts produce shorter outputs
  • Max tokens limit prevents runaway costs
Caching
  • Semantic cache: return cached if cosine similarity > 0.95
  • Response cache: ElastiCache for exact-match queries
  • Embedding cache: avoid re-embedding same documents
  • Up to 90% cost reduction for repetitive queries
Provisioned Throughput
  • 1-month or 6-month commitment
  • Break-even: typically ~70% utilization
  • Use CloudWatch to track PT utilization
  • Only for truly steady, predictable workloads
πŸ’‘ Nova Forge Cost Optimization: Continued pre-training from checkpoints is dramatically cheaper than full retraining. Blending approach reduces catastrophic forgetting, meaning you don't need to retrain as often when adding new domain knowledge. RL with custom reward functions enables efficient post-training.
πŸ’‘ S3 Vectors + Intelligent-Tiering: Automatically moves infrequently accessed vectors to lower-cost tiers. 40-60% storage cost reduction for large vector collections. No performance impact for frequently queried vectors (cached in high-performance tier).
4.2 Performance Optimization & Monitoring

πŸ“Š Key Metrics & Monitoring Architecture
Technical Metrics
  • Inference latency: p50, p90, p95, p99
  • Throughput (tokens/s, requests/s)
  • Error rate by error type
  • Token utilization (input vs. output)
  • Cache hit rate
  • Model invocation count
Business Metrics
  • Cost per inference / per task completion
  • User satisfaction (CSAT, thumbs up/down)
  • Task completion rate
  • Time-to-first-token (UX)
  • Business value per dollar spent
Quality Metrics
  • Hallucination rate (grounding check score)
  • Response relevance (semantic similarity)
  • Guardrail trigger rate (by category)
  • Human review escalation rate
  • Model drift (quality degradation over time)
CloudWatch Dashboard Components:
  • Bedrock invocation metrics (built-in namespace: AWS/Bedrock)
  • Custom metrics: quality scores, cache hit rates, business KPIs (via PutMetricData)
  • Log Insights queries: identify patterns in prompt confusion, slow responses
  • Composite alarms: trigger only when multiple conditions met simultaneously
  • Anomaly detection: ML-based baseline for adaptive alerting
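Custom metrics reach the dashboard via `PutMetricData`. A sketch of the request body for a per-model quality score (metric and dimension names are illustrative, not a prescribed schema):

```python
def quality_metric_payload(score, model_id, namespace="GenAI/Quality"):
    """CloudWatch PutMetricData request for a custom quality score.
    Publish with: boto3.client("cloudwatch").put_metric_data(**payload)"""
    return {
        "Namespace": namespace,
        "MetricData": [{
            "MetricName": "ResponseQualityScore",
            "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            "Value": float(score),
            "Unit": "None",   # dimensionless score in [0, 1]
        }],
    }

payload = quality_metric_payload(0.87, "nova-lite")
```

Dimensioning by model ID lets a single alarm or anomaly detector compare quality across models on one dashboard.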
🎯 Bedrock Model Invocation Logging: Enable it to capture full request/response payloads in S3 or CloudWatch Logs. Use for: quality auditing, debugging, fine-tuning data collection, compliance. Configured per Region at the account level. Important: logging adds slight latency β€” consider async delivery to S3 via Firehose for high-volume production.
πŸ“ˆ Auto-Scaling Strategies
Service | Scaling Trigger | Scaling Type | Notes
SageMaker Endpoints | InvocationsPerInstance, CPU utilization, custom metrics | Target tracking or Step scaling | Cooldown periods prevent thrashing; Inference Components allow per-model scaling
Lambda | Concurrent executions (auto) | Automatic, up to account limit | Reserved concurrency for predictability; Provisioned concurrency for cold start elimination
Bedrock Provisioned Throughput | Manual or CloudWatch-triggered scaling | Model Units (MUs) | No auto-scale; plan capacity from usage metrics
OpenSearch | CPU, memory, storage utilization | Horizontal (add data nodes) or vertical | UltraWarm for cost-efficient historical vectors; Auto-Tune for JVM optimization
API Gateway | Throttling limits per stage/route | Usage plans (no auto-scale) | SQS buffer behind API GW for burst handling

Domain 5: Testing, Validation & Troubleshooting

Model evaluation, QA frameworks, Regression testing, Debugging GenAI applications

~10% of Exam Weight
5.1 Model Evaluation & Validation Frameworks

πŸ§ͺ Bedrock Model Evaluation
Automatic Evaluation
  • Metrics: accuracy, robustness, toxicity
  • Uses built-in or custom datasets
  • Comparisons across multiple models
  • Results in S3 and viewable in console
  • ROUGE, METEOR, BERTScore for text quality
Human Evaluation (A/B)
  • Side-by-side model comparison
  • Human raters rank responses
  • Criteria: accuracy, coherence, helpfulness
  • Works with AWS Mechanical Turk or internal teams
  • Statistical significance testing
πŸ’‘ Probabilistic Validation Approach: For deterministic outputs (JSON schemas, specific formats) β†’ use exact match / schema validation. For generative outputs (summaries, answers) β†’ use semantic similarity scoring (cosine similarity > threshold), not exact match. Run N samples, validate distribution of quality scores.
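The "run N samples, validate the distribution" idea can be sketched as a pure gating function. `embed` and `similarity` are injected assumptions (in practice: an embedding model plus cosine similarity); the thresholds are illustrative.

```python
def distribution_pass(samples, reference_vec, embed, similarity,
                      threshold=0.8, min_pass=0.9):
    """Score N sampled generations against a reference embedding and
    pass only if enough of them clear the similarity threshold --
    never exact-match a single stochastic sample."""
    scores = [similarity(embed(s), reference_vec) for s in samples]
    pass_rate = sum(sc >= threshold for sc in scores) / len(scores)
    return pass_rate >= min_pass, pass_rate, scores
```

For deterministic outputs (JSON, fixed formats), skip this entirely and use schema validation or exact match instead.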
πŸ” Testing Frameworks & Strategies
Test Type | What It Tests | Implementation
Functional Testing | Correct outputs for expected inputs | Lambda test harness, expected output comparison
Edge Case Testing | Boundary inputs, empty strings, very long prompts, special characters | Parameterized test suite, automated via Step Functions
Prompt Injection Testing | Resistance to jailbreak/injection attacks | Red-teaming prompts, Guardrail PROMPT_ATTACK filter testing
Regression Testing | New model/prompt version doesn't degrade previous quality | Golden dataset + automated quality comparison, CloudWatch quality metrics
Load Testing | Performance under expected traffic | Lambda concurrent invocations, API GW throttle testing
Hallucination Testing | Factual accuracy, grounding in source docs | Bedrock Grounding Check, RAGAs framework, human spot checks
Bias Testing | Consistent quality across demographic groups | SageMaker Clarify, HELM fairness metrics
🎯 SageMaker AI RL-based Customization: Serverless RL-based fine-tuning (new feature) reduces fine-tuning time from months to days. Enables rapid testing of new prompt architectures and domain-specific model behaviors. Used in QA workflows to quickly validate if a model variant improves target metrics.
πŸ› Troubleshooting GenAI Applications
Poor RAG Quality
  • Check: chunk size vs. query complexity
  • Verify: same embedding model for index + query
  • Inspect: similarity scores (too low = bad embeddings)
  • Review: metadata filters (over-filtering?)
  • Embedding drift: re-embed if model updated
High Latency
  • X-Ray trace: find slow subsegment
  • Cold starts: enable Lambda Provisioned Concurrency
  • Vector search: reduce ef_search, add metadata pre-filter
  • Model: try smaller model or Cross-Region inference
  • Cache hit rate too low: review semantic threshold
Hallucinations
  • Enable Bedrock Grounding Check guardrail
  • Increase retrieved context (more chunks)
  • Add citation requirement to prompt
  • Reduce temperature (0.1-0.3 for factual tasks)
  • Use CoT to expose reasoning
Throttling (429 errors)
  • Check: Bedrock quota limits in Service Quotas
  • Request quota increase via support ticket
  • Implement exponential backoff + jitter
  • Add SQS buffer for burst absorption
  • Use Cross-Region inference for capacity
⚠️ CloudWatch Logs Insights for GenAI Debugging: Query patterns: filter @message like "throttle" | stats count() by @logStream β€” find throttled Lambda functions. Also query for guardrail trigger patterns, slow invocations (filter @duration > 5000), and error message patterns to identify systemic issues.

Deep Dive by Service

Reference architecture, configuration details, and exam tips per AWS service

🟠 Amazon Bedrock β€” Complete Service Reference

πŸ“‘ Bedrock APIs β€” Know Every One
API | Purpose | Key Parameters | Response Type
InvokeModel | Synchronous single inference | modelId, body (model-specific JSON) | Complete response
InvokeModelWithResponseStream | Streaming inference | modelId, body | Event stream (chunk by chunk)
Converse | Unified multi-model API (recommended) | modelId, messages[], system[], inferenceConfig | Complete, unified format
ConverseStream | Unified streaming API | Same as Converse | Event stream, unified format
Retrieve | Knowledge Base vector search only | knowledgeBaseId, retrievalQuery | Retrieved chunks + metadata
RetrieveAndGenerate | RAG: retrieve + generate in one call | knowledgeBaseId, input, retrievalConfig, generationConfig | Generated response + citations
ApplyGuardrail | Test guardrails without model call | guardrailIdentifier, guardrailVersion, source, content | Action (NONE/GUARDRAIL_INTERVENED) + assessments
CreateModelEvaluationJob | Automated model evaluation | evaluationConfig, inferenceConfig, outputDataConfig | Job ARN
🎯 Converse API Advantage: Unified format works across all Bedrock models (Claude, Nova, Titan, Llama, Mistral). One code path for all models. Handles system prompts, multi-turn conversation, tool use, document understanding. Preferred over InvokeModel for new development.
⚑ Provisioned Throughput β€” Detailed Config
What is a Model Unit (MU)?

A Model Unit represents a specific throughput capacity (tokens per minute). Different models have different MU sizes. Purchase 1+ MUs based on peak throughput requirement.

  • 1-month term: lower commitment, higher per-MU cost
  • 6-month term: better rate, more risk
  • No-commitment: available for some models (most expensive)
  • CloudWatch: TokensPerMinute metric for utilization
When to Use Provisioned Throughput
  • Steady traffic (70%+ utilization to break even)
  • Need guaranteed capacity (SLA requirements)
  • Consistent low latency requirements
  • Avoid: spiky/unpredictable traffic (use on-demand)
  • Cross-Region inference: use inference profiles instead

⚑ Quick Reference Cheat Sheet

Critical numbers, decision trees, and patterns for exam day

πŸ”’ Critical Numbers & Thresholds to Memorize

⚑ Retry & Resilience
  • Initial backoff: 100ms
  • Backoff factor: 2x (exponential)
  • Max backoff: 20 seconds
  • Max attempts: 3-5
  • Jitter: Β±100ms or 0.1-0.3 factor
  • Circuit open at: 50% fail over 10 requests
  • Recovery timeout: 30-60 seconds
  • Half-open traffic: 10-20%
πŸ”„ SQS for LLMs
  • Visibility timeout: 5-15 minutes (LLM tasks)
  • DLQ after: 3-5 failed attempts
  • Max message size: 256KB (use S3 pointer for large)
  • Retention: up to 14 days
  • FIFO vs Standard: FIFO for ordered processing
πŸ”Œ API Gateway
  • Request timeout max: 29 seconds (REST API)
  • Account-level throttle: 10,000 RPS default
  • Stage-level: 1,000-5,000 RPS
  • Route-level: 50-500 RPS (complex models)
  • Burst limit: 2-3x steady-state rate
  • Cache TTL: 300s default
πŸ”΄ Bedrock Throttling
  • 429 error: throttled (retriable)
  • 400 error: bad request (NOT retriable)
  • 401/403: auth (NOT retriable)
  • 500/503: service error (retriable)
  • Bedrock timeout: up to 120s for complex models
  • Simple models: 15-30s timeout OK
πŸ“Š Vector Search
  • HNSW M param: 16-64 (connections/node)
  • ef_construction: 100-512 (build quality)
  • ef_search: 100-512 (query quality)
  • Cosine range: -1 to 1 (1 = identical)
  • Semantic cache threshold: cosine > 0.95
  • S3 Vectors pre-filter saves: 50-70% of search space
πŸ”— Lambda + GenAI
  • Connection pool size: 10-20 connections/instance
  • Connection TTL: 60-300 seconds
  • Provisioned concurrency: eliminates cold starts
  • Global SDK client: initialize OUTSIDE the handler
  • LLM task timeout: 5-15 minutes (async)
  • Max Lambda timeout: 15 minutes
πŸ’° Cost Thresholds
  • Provisioned break-even: ~70% utilization
  • S3 Vectors Intelligent-Tiering: 40-60% cost reduction
  • Semantic cache savings: up to 90% for repetitive queries
  • Model cascade target: start with Nova Lite, escalate only if quality < threshold
πŸ“‘ Streaming
  • Buffer size: 5-20 chunks
  • Flush trigger: sentence completion or 100-500ms
  • WebSocket keep-alive: ping every 30-60s
  • WebSocket idle timeout: ~10 minutes
  • SSE reconnect: 1s β†’ backoff β†’ max 30-60s
🧠 Last-Mile Exam Traps

🎯 Why this matters: The official-style practice questions are less about broad GenAI knowledge and more about picking the precise AWS-native answer when multiple options feel plausible. Use this section as your final decision matrix.
Common Trap | Correct Lean | What AWS Is Testing
Need managed failover and performance-aware regional routing | Inference profile | Not just "multi-Region"; choose the Bedrock-native routing construct.
Need general async batch inference for text/image workloads | CreateModelInvocationJob | StartAsyncInvoke is the distractor for Nova Reel video generation.
Need to inspect which knowledge base files failed ingestion | Knowledge base logging to CloudWatch Logs | CloudTrail audits API calls, not document-level ingestion outcomes.
Need to reorder already relevant retrieval results | Reranker models | Hybrid search improves retrieval; rerank improves final ordering.
Need to guarantee every inference call includes a guardrail | IAM condition key bedrock:GuardrailIdentifier | Central enforcement beats custom proxy code.
Need to know which specific guardrail layer intervened | trace: "enabled" + GuardrailPolicyType metrics | Not just whether input/output was blocked, but which policy fired.
Need generation to halt on a phrase | Stop sequences | Prompt instructions are weaker than inference parameters.
Unpredictable traffic with long idle periods | On-demand Bedrock | Provisioned Throughput is usually only right for high steady utilization.
Deterministic workflow with audit and mandatory sequence | Step Functions | Agents/Flows are often distractors when compliance is explicit.
Persistent MCP tool servers | ECS/Fargate | Lambda is attractive, but poor for persistent SSE-style connections.
πŸ’‘ Final Heuristic: If two answers both work, the exam usually wants the option with the least custom infrastructure, the clearest AWS-native fit, and the most direct match to the exact constraint in the question stem.
🌳 Decision Trees β€” What to Use When

πŸš€ Deployment Strategy Decision Tree
Q: What type of traffic pattern?
β”œβ”€β”€ Spiky / unpredictable, low volume
β”‚   └── β†’ Lambda + Bedrock On-Demand (pay per token)
β”œβ”€β”€ Steady, high volume (>70% utilization)
β”‚   └── β†’ Bedrock Provisioned Throughput (hourly commitment)
β”œβ”€β”€ Need custom/open-source model
β”‚   β”œβ”€β”€ Low latency needed β†’ SageMaker Real-time Endpoint
β”‚   β”œβ”€β”€ Intermittent traffic β†’ SageMaker Serverless Inference
β”‚   β”œβ”€β”€ Batch, non-urgent β†’ SageMaker Async Endpoint
β”‚   └── Many models, low per-model traffic β†’ SageMaker MME
β”œβ”€β”€ Model too large for single GPU (>80GB)
β”‚   └── β†’ SageMaker + DeepSpeed/Triton OR EC2 UltraServer
└── On-premises data requirement
    └── β†’ AWS Outposts OR AWS Wavelength (edge)
πŸ” Vector Store Selection Decision Tree
Q: What are your requirements?
β”œβ”€β”€ Billions of vectors, lowest cost
β”‚   └── β†’ Amazon S3 Vectors (new) with Intelligent-Tiering
β”œβ”€β”€ Need keyword + semantic (hybrid) search
β”‚   └── β†’ Amazon OpenSearch Service with Neural Plugin
β”œβ”€β”€ Need vector + relational SQL queries (JOINs, filters)
β”‚   └── β†’ Aurora PostgreSQL + pgvector extension
β”œβ”€β”€ Fully managed RAG (no infra)
β”‚   └── β†’ Amazon Bedrock Knowledge Bases (managed KNN with OpenSearch Serverless)
β”œβ”€β”€ Ultra-low latency + key-value
β”‚   └── β†’ Amazon MemoryDB (Redis-compatible)
└── Enterprise search + AI (hybrid)
    └── β†’ Amazon Kendra (semantic + keyword, pre-built connectors)
πŸ€– Agentic Architecture Decision Tree
Q: What type of agentic workflow?
β”œβ”€β”€ Simple autonomous agent (managed)
β”‚   └── β†’ Amazon Bedrock Agents
β”‚       (+ Knowledge Bases for RAG, + Guardrails for safety)
β”œβ”€β”€ Need full code control + transparency
β”‚   └── β†’ Strands Agents SDK (open-source)
β”‚       + MCP servers (Lambda for simple, ECS for complex tools)
β”œβ”€β”€ Multiple specialized agents collaborating
β”‚   └── β†’ AWS Agent Squad + Strands OR Bedrock Agents multi-agent
β”œβ”€β”€ Deterministic workflow (compliance, audit required)
β”‚   └── β†’ AWS Step Functions state machine
β”‚       (ReAct pattern or sequential with LLM at each step)
β”œβ”€β”€ Complex governance + composability
β”‚   └── β†’ Amazon Bedrock AgentCore (framework-agnostic composable services)
└── Human approval required in workflow
    └── β†’ Step Functions waitForTaskToken (human-in-the-loop)
πŸ”’ Guardrails Decision Tree
Q: What safety requirement do you have?
β”œβ”€β”€ Block harmful content (hate, violence, sexual)
β”‚   └── β†’ ContentPolicyConfig with appropriate inputStrength/outputStrength
β”œβ”€β”€ Block specific topics (competitors, legal advice)
β”‚   └── β†’ TopicPolicyConfig (DENY + plain language description)
β”œβ”€β”€ Detect/redact PII (HIPAA, GDPR)
β”‚   └── β†’ SensitiveInformationPolicyConfig (REDACT or BLOCK per PII type)
β”œβ”€β”€ Block custom words/phrases
β”‚   └── β†’ WordPolicyConfig (custom list + managed profanity)
β”œβ”€β”€ Prevent hallucinations in RAG
β”‚   └── β†’ GroundingPolicyConfig (GROUNDING filter with threshold 0.7-0.9)
β”œβ”€β”€ Prevent prompt injection attacks
β”‚   └── β†’ ContentPolicyConfig with PROMPT_ATTACK filter (HIGH on INPUT)
└── Need to test guardrail before deploying
    └── β†’ Use ApplyGuardrail API independently
βœ‚οΈ Chunking Strategy Decision Tree
Q: What type of document and use case?
β”œβ”€β”€ Uniform, structured content (FAQs, product descriptions)
β”‚   └── β†’ Fixed-size chunking (256-512 tokens, 10-20% overlap)
β”œβ”€β”€ Long documents needing both precision + context
β”‚   └── β†’ Hierarchical chunking (parent=large context, child=small precision)
β”œβ”€β”€ Varied documents, conversational data
β”‚   └── β†’ Semantic chunking (split where similarity drops)
β”œβ”€β”€ Documents with clear headers/sections (PDFs, docs)
β”‚   └── β†’ Document-structure chunking (split at headings)
β”œβ”€β”€ Custom logic required (domain-specific, preprocessing)
β”‚   └── β†’ Lambda custom chunking workflow
└── Using Bedrock Knowledge Bases (managed)
    └── β†’ Choose: Default (300 tokens), Fixed, Hierarchical, or Semantic in KB config
🎯 Top 20 Exam-Day Tips (High-Yield)

  1. Converse API = unified multi-model: Single code path for Claude, Nova, Titan, Llama. Preferred for new development over InvokeModel.
  2. Bedrock Guardrails must be explicitly invoked: Add guardrailConfig to every API call. Not auto-applied. Test with ApplyGuardrail API.
  3. Same embedding model for index AND query: Never mix models. This is a common trap question.
  4. Nova Forge = SageMaker AI: Accessed through SageMaker, NOT directly in Bedrock. Training from checkpoints to prevent catastrophic forgetting.
  5. S3 Vectors = billions of vectors: New service for massive-scale vector storage with Intelligent-Tiering. Pre-filter metadata before vector calculation (50-70% savings).
  6. Provisioned Throughput break-even ~70% utilization: Below that, on-demand is cheaper. Use CloudWatch to track PT utilization before committing.
  7. Step Functions + Bedrock = native integration: No Lambda needed for InvokeModel or RetrieveAndGenerate in Step Functions.
  8. MCP: Lambda = lightweight, ECS = complex: Lambda for stateless tool access (search, calc), ECS for code execution or image processing.
  9. HNSW ef_search tradeoff: Higher ef_search = better recall but slower queries. Tune based on acceptable latency at p99.
  10. pgvector advantage = SQL + vector: When you need relational queries combined with similarity search. Not for billion-scale.
  11. Cross-Region Inference Profiles: Use for distributing load across regions. Automatic failover, no additional cost vs. on-demand tokens.
  12. AgentCore vs. Agents: AgentCore = composable services (Policy, Evaluations, Memory) that work with ANY framework/model. Bedrock Agents = specific managed agent runtime.
  13. Probabilistic validation: FM outputs vary. Use semantic similarity scoring + thresholds (not exact match) for QA. Run N samples, validate distribution.
  14. ReAct in Step Functions: Reason (LLM) β†’ Parse Action (Lambda) β†’ Execute Tool (Lambda/API) β†’ Observe (Pass state) β†’ loop. Max iterations in Choice state.
  15. Hierarchical chunking = parent+child: Child chunks for precise matching, parent chunks returned as context. Built into Bedrock Knowledge Bases.
  16. Grounding Check = hallucination prevention: Bedrock Guardrail feature. Only works when you pass source context. Set threshold 0.7-0.9.
  17. Lambda cold starts: Use Provisioned Concurrency for latency-sensitive paths. Initialize SDK clients OUTSIDE handler function (global scope).
  18. SageMaker Inference Components: Host multiple models on one endpoint with INDEPENDENT scaling policies per model. Different from Multi-Model Endpoints (which share compute).
  19. Model cascading pattern: Route ALL traffic to Nova Lite first. Escalate to Nova Pro only when quality score < threshold. Can save 60-80% of inference costs.
  20. Security Hub + Bedrock: Near-real-time risk analytics for FM deployments. Correlates CloudTrail events with security findings. Configure custom standards for FM-specific risks.
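Tips 1 and 2 combine into one request shape: the Converse API keeps a single code path across model families, and the guardrail is only active if you attach it per call. A minimal sketch (the guardrail ID/version are placeholders, not real resources):

```python
# Sketch of tips 1-2: one Converse request works across Claude, Nova, Titan,
# and Llama; Guardrails must be attached explicitly on every call.

def build_converse_request(model_id: str, user_text: str,
                           guardrail_id: str, guardrail_version: str) -> dict:
    return {
        "modelId": model_id,  # swap model IDs, same code path
        "messages": [
            {"role": "user", "content": [{"text": user_text}]},
        ],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.2},
        # Guardrails are NOT auto-applied: omit this and nothing is filtered.
        "guardrailConfig": {
            "guardrailIdentifier": guardrail_id,
            "guardrailVersion": guardrail_version,
        },
    }

# Usage with placeholder IDs:
req = build_converse_request(
    "anthropic.claude-3-5-sonnet-20240620-v1:0",  # or an amazon.nova-* ID
    "Summarize our refund policy.",
    guardrail_id="gr-EXAMPLE", guardrail_version="1",
)
# bedrock_runtime = boto3.client("bedrock-runtime")
# response = bedrock_runtime.converse(**req)
```

Test guardrail behavior in isolation with the `ApplyGuardrail` API before wiring it into the request path.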
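The cascading pattern in tip 19 is just a threshold gate in front of the expensive model. A sketch with hypothetical `invoke` and `score` stand-ins for a Bedrock call and a quality evaluator:

```python
# Sketch of tip 19 (model cascading): send everything to the cheap model,
# escalate only when a quality score falls below threshold.
# invoke() and score() are hypothetical stubs, not Bedrock APIs.

CHEAP, STRONG = "amazon.nova-lite-v1:0", "amazon.nova-pro-v1:0"

def cascade(prompt: str, invoke, score, threshold: float = 0.7):
    """Return (model_id, answer); escalate to the strong model if needed."""
    answer = invoke(CHEAP, prompt)
    if score(prompt, answer) >= threshold:
        return CHEAP, answer  # most traffic stops here -> cost savings
    return STRONG, invoke(STRONG, prompt)

# Toy usage with stubbed functions:
def fake_invoke(model, prompt):
    return f"{model}:{prompt}"

def fake_score(prompt, answer):
    return 0.9 if "easy" in prompt else 0.3

model_used, _ = cascade("easy question", fake_invoke, fake_score)   # stays on Lite
escalated, _ = cascade("hard question", fake_invoke, fake_score)    # goes to Pro
```

The savings claim depends on what fraction of traffic clears the threshold, so log `score` values to CloudWatch and tune the threshold against real traffic.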
πŸ“š

Services Not to Forget (Often Overlooked)

β–Ά
Amazon Kendra

Enterprise search combining keyword (BM25) + semantic. Pre-built connectors (S3, SharePoint, Confluence, Salesforce). FAQ extraction. Relevance tuning. Use when: existing enterprise docs, need zero-config search quality.

Amazon Macie

ML-powered PII detection in S3. Auto-discovers sensitive data. Integrates with Security Hub. Use for: data governance, GDPR compliance, before feeding data to FMs.

AWS AppConfig

Dynamic configuration without redeployment. Use for: routing rules, model selection logic, feature flags, A/B test percentages. Supports gradual rollout with automatic rollback.
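A sketch of how AppConfig-driven routing might look: the JSON schema below is an assumption (AppConfig stores whatever document you define), and in Lambda you would fetch it via the `appconfigdata` session APIs rather than hardcoding it:

```python
# Sketch: model-routing percentages held in AWS AppConfig as JSON, so A/B
# splits change without redeployment. The config schema is an assumption.
import json

CONFIG_JSON = json.dumps({               # what AppConfig would return
    "ab_test": {"nova-pro": 10, "nova-lite": 90}   # percent of traffic
})

def choose_model(config_json: str, roll: float) -> str:
    """roll in [0, 100); pick a model by cumulative traffic percentage."""
    weights = json.loads(config_json)["ab_test"]
    cumulative = 0
    for model, pct in weights.items():
        cumulative += pct
        if roll < cumulative:
            return model
    return model  # fall through to the last entry

# In Lambda, fetch CONFIG_JSON with the appconfigdata client
# (StartConfigurationSession / GetLatestConfiguration) instead of hardcoding.
```

Because the percentages live in AppConfig, a bad rollout can be reverted by the service's automatic rollback rather than a code deploy.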

Amazon Verified Permissions

Fine-grained authorization using Cedar policy language. ABAC (attribute-based) policies. Use for: controlling which users can query which knowledge bases, role-based FM access.

Amazon Bedrock Data Automation

AI-powered pipeline for processing unstructured documents (PDFs, images, audio, video). Extracts structured data automatically. Reduces manual preprocessing for RAG pipelines.

AWS X-Ray

Distributed tracing across Lambda β†’ Bedrock β†’ OpenSearch. Custom segments + annotations (model_name, cost, quality_score). Service map visualization. Filter traces by annotation. Use for latency debugging.

Amazon Comprehend

NLP enrichment for data pipelines. Entity extraction, sentiment, key phrases, PII detection, topic modeling. Use to enrich documents with metadata BEFORE indexing into vector store.
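A sketch of the enrichment step: attach high-confidence Comprehend output as filterable metadata before indexing. The response shapes below mimic `detect_entities`/`detect_sentiment`; in production they come from boto3's `comprehend` client:

```python
# Sketch: enrich a document with Comprehend-style output before indexing
# into the vector store. Responses are stubbed here for illustration.

def enrich_metadata(doc: dict, entities: list, sentiment: str,
                    min_score: float = 0.8) -> dict:
    """Attach high-confidence entities and sentiment as filterable metadata."""
    doc = dict(doc)
    doc["metadata"] = {
        "entities": sorted({e["Text"] for e in entities if e["Score"] >= min_score}),
        "sentiment": sentiment,  # POSITIVE / NEGATIVE / NEUTRAL / MIXED
    }
    return doc

# Stubbed Comprehend-shaped response (real calls: comprehend.detect_entities,
# comprehend.detect_sentiment):
entities = [
    {"Text": "Amazon Bedrock", "Type": "ORGANIZATION", "Score": 0.97},
    {"Text": "maybe-a-date", "Type": "DATE", "Score": 0.41},  # below threshold
]
enriched = enrich_metadata({"text": "source document body"}, entities, "NEUTRAL")
```

The resulting metadata fields then support pre-filtering at query time (e.g. restrict similarity search to documents mentioning a given entity).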

CloudTrail Lake

Query audit events with SQL (Athena-like). Use for: compliance reporting on FM usage, query who invoked which model when, detect unusual access patterns in prompt management.
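A sketch of a typical compliance query: "who invoked which Bedrock model, how often, in the last week." The event data store ID is a placeholder; field names (`eventSource`, `eventName`, `userIdentity`) follow the standard CloudTrail event schema, and the SQL dialect should be checked against CloudTrail Lake's documented constraints:

```python
# Sketch: CloudTrail Lake SQL for auditing Bedrock usage.
# EVENT_DATA_STORE is a placeholder, not a real resource ID.

EVENT_DATA_STORE = "EXAMPLE-EDS-ID"

def bedrock_usage_query(store_id: str) -> str:
    return f"""
    SELECT userIdentity.arn AS caller, eventName, COUNT(*) AS calls
    FROM {store_id}
    WHERE eventSource = 'bedrock.amazonaws.com'
      AND eventTime > DATE_ADD('day', -7, NOW())
    GROUP BY userIdentity.arn, eventName
    ORDER BY calls DESC
    """

sql = bedrock_usage_query(EVENT_DATA_STORE)
# Submit with cloudtrail.start_query(QueryStatement=sql), then poll
# get_query_results with the returned QueryId.
```

The same pattern (filter on `eventSource`, group by identity) extends to detecting unusual access in prompt management: look for callers whose invocation counts spike against their baseline.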