CloudWatch & Observability Domain 4.3

GenAI monitoring differs from traditional apps — you need to track token usage, prompt quality, hallucination rates, and retrieval relevance in addition to standard infrastructure metrics.

| Metric / Log | What It Measures | When to Alert |
|---|---|---|
| InputTokenCount | Tokens consumed in prompts per request | Sudden spike → prompt injection or runaway context |
| OutputTokenCount | Tokens generated in responses | Consistently near maxTokens → responses being truncated |
| InvocationLatency | End-to-end model call latency | P99 > SLA threshold |
| InvocationThrottles | Throttled requests due to quota limits | Any non-zero value → scale up provisioned throughput |
| GuardrailPolicyType | Which specific guardrail policy triggered | High rate on ContentPolicy → review content filter config |
| Model Invocation Logs | Full request/response JSON logged to S3/CloudWatch | Enable for debugging; disable in production if PII present |
GuardrailPolicyType dimension: Identifies exactly which policy blocked a request — ContentPolicy, TopicPolicy, SensitiveInformationPolicy, PromptAttackPolicy. Use this to distinguish why a block occurred, not just that it occurred.
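A minimal sketch of alarming on a token spike, assuming the Bedrock runtime namespace AWS/Bedrock with the ModelId dimension; the helper only builds the `put_metric_alarm` parameters (the SNS topic ARN and threshold are placeholders), so you can inspect them before calling boto3:

```python
# Sketch: build CloudWatch alarm parameters for a sudden InputTokenCount spike.
# "AWS/Bedrock" and the "ModelId" dimension follow the Bedrock runtime metric
# conventions; the threshold and SNS topic ARN are illustrative placeholders.
def token_spike_alarm_params(model_id, topic_arn, threshold=50_000):
    return {
        "AlarmName": f"bedrock-input-token-spike-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": 300,                       # 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,              # tokens per 5 minutes
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }

# Usage (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **token_spike_alarm_params("anthropic.claude-3-sonnet-20240229-v1:0",
#                                "arn-of-alert-topic"))
```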

Monitoring Architecture — Three Layers

Infrastructure Layer
  • Lambda: invocations, errors, duration, concurrency
  • API Gateway: 4xx/5xx rates, latency, cache hit rate
  • SQS: ApproximateNumberOfMessages (queue depth)
  • DynamoDB: read/write capacity consumed vs. provisioned
  • ECS: CPU/memory utilization, task count
AI Application Layer
  • Token consumption per user / per session
  • Prompt effectiveness score (custom metric)
  • Cache hit rate (semantic & prompt cache)
  • Knowledge Base retrieval latency
  • Guardrail block rate by policy type
  • Agent tool call success/failure rate
Business Layer
  • Cost per query / per user session
  • User satisfaction scores (thumbs up/down)
  • Task completion rate for agents
  • Hallucination rate (from evaluator pipeline)
  • Response quality score trend over time
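Business-layer metrics like cost per query are custom metrics you publish yourself. A hedged sketch, where the `GenAI/Business` namespace, the dimension names, and the per-1K-token prices are all illustrative assumptions:

```python
# Sketch: build a custom business-layer metric datum (cost per query).
# Namespace, dimension names, and prices below are illustrative assumptions.
def cost_per_query_datum(user_id, input_tokens, output_tokens,
                         in_price_per_1k=0.003, out_price_per_1k=0.015):
    cost = (input_tokens / 1000 * in_price_per_1k
            + output_tokens / 1000 * out_price_per_1k)
    return {
        "MetricName": "CostPerQuery",
        "Dimensions": [{"Name": "UserId", "Value": user_id}],
        "Value": round(cost, 6),
        "Unit": "None",
    }

# Usage (requires AWS credentials):
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="GenAI/Business",
#     MetricData=[cost_per_query_datum("u-1", 1200, 400)])
```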

X-Ray Distributed Tracing for GenAI

X-Ray traces the full request path across services, letting you pinpoint exactly where latency or errors originate:

1. API Gateway receives request: X-Ray trace ID generated and propagated downstream.
2. Lambda invoked: Subsegment created; measure time before Bedrock call. Identify if Lambda code is the bottleneck.
3. Bedrock InvokeModel call: Subsegment records model call latency separately from Lambda overhead.
4. OpenSearch / pgvector retrieval: Subsegment for KB retrieval; isolate vector search latency from model latency.
5. Response returned: Full trace shows per-segment time breakdown. Service map visualizes bottleneck service.
X-Ray vs. CloudWatch Logs: X-Ray for latency attribution across services (where is slow?). CloudWatch Logs for content-level debugging (what did the prompt/response look like?). Use both together.
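The attribution step above can be sketched as plain arithmetic over subsegment durations. The field names here are illustrative, not the X-Ray trace schema:

```python
# Sketch: attribute end-to-end latency to services from X-Ray-style subsegment
# durations (milliseconds). Keys are illustrative, not the X-Ray schema.
def attribute_latency(total_ms, subsegments):
    accounted = sum(subsegments.values())
    breakdown = dict(subsegments)
    breakdown["unattributed_overhead"] = max(total_ms - accounted, 0)
    bottleneck = max(breakdown, key=breakdown.get)
    return breakdown, bottleneck

breakdown, slowest = attribute_latency(
    2400, {"bedrock_invoke": 1800, "opensearch_retrieve": 350, "lambda_code": 150})
# bedrock_invoke dominates → model inference, not retrieval, is the bottleneck
```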

Anomaly Detection & Alerting

| Anomaly Type | Detection Method | Response |
|---|---|---|
| Token cost spike | CloudWatch metric alarm on InputTokenCount sum | SNS alert → Lambda auto-throttle user sessions |
| Guardrail block surge | CloudWatch alarm on GuardrailCoverage > threshold | Review prompt injection attack patterns in logs |
| Retrieval quality drop | CloudWatch custom metric for retrieval score | Trigger KB re-sync or embedding model refresh |
| Latency regression | CloudWatch Anomaly Detection band on P99 latency | Auto-scale or switch to provisioned throughput |
| Hallucination rate increase | Automated evaluator Lambda checking factuality score | Alert on-call; roll back prompt template |
Baseline establishment: GenAI metrics are variable — token counts and latency fluctuate significantly by request type. Use CloudWatch Anomaly Detection (ML-based) rather than fixed thresholds. Collect 2 weeks of baseline data before enabling anomaly alerts.
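An anomaly-band alarm replaces the fixed `Threshold` with an `ANOMALY_DETECTION_BAND` expression. A sketch of the `put_metric_alarm` parameters, where the namespace, metric name, and band width (2 standard deviations) are assumptions:

```python
# Sketch: CloudWatch anomaly-detection alarm on P99 latency (ML band instead of
# a fixed threshold). Namespace/metric name and band width are assumptions.
def p99_anomaly_alarm_params(topic_arn):
    return {
        "AlarmName": "genai-p99-latency-anomaly",
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",        # alarm compares m1 against the band
        "Metrics": [
            {"Id": "m1",
             "MetricStat": {
                 "Metric": {"Namespace": "GenAI/App",
                            "MetricName": "InvocationLatency"},
                 "Period": 300,
                 "Stat": "p99"},
             "ReturnData": True},
            {"Id": "band",
             "Expression": "ANOMALY_DETECTION_BAND(m1, 2)"},  # 2 std devs wide
        ],
        "AlarmActions": [topic_arn],
    }

# Usage (requires AWS credentials):
# boto3.client("cloudwatch").put_metric_alarm(**p99_anomaly_alarm_params(arn))
```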

Vector Database Monitoring Domain 4.3

RAG system performance depends heavily on vector database health. These metrics are specific to high-dimensional operations.

  • Query Latency P99: Time for vector similarity search
  • Index Build Time: Time to index new documents
  • Recall@K: % of true matches in top-K results
  • QPS: Queries per second throughput
  • Index Size: Memory footprint of vector index
  • Shard Health: OpenSearch shard distribution
| KPI | Healthy Range | Problem Indicator |
|---|---|---|
| Query latency P99 | < 100ms for HNSW | > 500ms → index not warmed, over-sharded, or undersized |
| Recall@10 | > 0.95 | < 0.90 → HNSW efSearch too low; index quality degraded |
| Index memory usage | < 75% of available RAM | > 90% → risk of OOM; add nodes or reduce vector dimension |
| Shard count per index | ~10GB per shard | Too many small shards → overhead; too few → hot shards |

OpenSearch Auto-Tune vs. Manual Tuning

OpenSearch Auto-Tune (Built-In)
  • Automatically optimizes JVM heap sizing
  • Adjusts shard allocation across nodes
  • Optimizes cache settings (field data, query cache)
  • Schedules optimizations during low-traffic windows
  • Zero new infrastructure — just enable it
  • Exam answer for "least ops overhead"
Manual Tuning Options
  • HNSW parameters: m (connections), ef_construction (build quality), ef_search (query quality)
  • Refresh interval: increase for write-heavy workloads
  • Segment merging: control merge policy for index size
  • Replica count: 1 replica = HA; 0 = write performance only
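The manual knobs above map directly onto the OpenSearch k-NN index body. A sketch, where the engine/space choices (`nmslib`, `cosinesimil`), the dimension, and the parameter values are illustrative rather than recommendations:

```python
# Sketch: OpenSearch k-NN index body wiring the HNSW tuning parameters above.
# Engine, space type, dimension, and values are illustrative assumptions.
def knn_index_body(dim=1536, m=16, ef_construction=128, ef_search=100, replicas=1):
    return {
        "settings": {
            "index": {
                "knn": True,
                "knn.algo_param.ef_search": ef_search,  # query-time quality/speed
                "number_of_replicas": replicas,         # 1 = HA; 0 = faster writes
                "refresh_interval": "30s",              # relax for write-heavy loads
            }
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dim,                   # must match embedding model
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "nmslib",
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                }
            }
        },
    }

# Usage: opensearch_client.indices.create(index="kb-index", body=knn_index_body())
```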

⚠ Exam Trap: OpenSearch + ElastiCache

  • Adding ElastiCache in front of OpenSearch = additional service to manage, additional cost, cache invalidation complexity
  • The exam question says "least operational overhead" → Auto-Tune wins over ElastiCache every time
  • Only add ElastiCache when the need is specifically caching identical repeated queries (not general latency improvement)

Knowledge Base Ingestion Monitoring

When using Bedrock Knowledge Bases, ingestion failures log to CloudWatch Logs — not CloudTrail. Know the error codes:

| Error Code | Meaning | Fix |
|---|---|---|
| RESOURCE_IGNORED | Document skipped (unsupported format, too large, or duplicate) | Check file format + size limits; deduplicate source data |
| EMBEDDING_FAILED | Embedding model call failed (timeout, rate limit) | Check embedding model quota; add retry with exponential backoff |
| INDEXING_FAILED | Vector store write failed (OpenSearch down, auth error) | Check OpenSearch cluster health; verify IAM permissions for KB |
| METADATA_EXTRACTION_FAILED | Metadata parsing failed (malformed JSON/tags) | Validate metadata format before ingestion |
CloudTrail vs. CloudWatch Logs for KB: CloudTrail logs API calls (who called StartIngestionJob, when). CloudWatch Logs captures document-level ingestion status (which specific document failed and why). For troubleshooting ingestion failures, you need CloudWatch Logs.

Diagnostic Framework Domain 5.2

When a GenAI application misbehaves, the first step is isolating which layer is the problem. This framework guides you to the right diagnosis quickly.

Layer 1: Retrieval

  • Wrong documents retrieved
  • No results returned
  • Irrelevant chunks in context
  • Stale / outdated information
  • Low recall (missing correct docs)

Layer 2: Generation

  • Hallucination (fabricated facts)
  • Off-topic or irrelevant response
  • Format errors (wrong JSON, missing fields)
  • Truncated responses
  • Inconsistent style or language

Layer 3: Infrastructure

  • High latency (> expected)
  • Throttling errors (429)
  • Timeout errors
  • Cold start delays
  • Memory / compute exhaustion

Layer 4: Data / Input

  • PII leakage in output
  • Prompt injection detected
  • Guardrail blocks legitimate content
  • Data quality issues in source docs
  • Schema drift in structured inputs

Step-by-Step Diagnostic Decision Tree

1. Is the response factually wrong or invented? → Hallucination → Go to Generation Failures section. Check: was the correct information even in the retrieved context?
2. Is the retrieved context correct but the answer still wrong? → Generation issue → model isn't using context correctly. Try: better instruction in system prompt; higher-capability model; fewer distracting chunks.
3. Is the retrieved context empty or irrelevant? → Retrieval failure → Go to Retrieval Failures section. Check: embedding model mismatch, query preprocessing, chunk quality.
4. Is the latency too high? → Infrastructure issue. Use X-Ray to identify which service is slow. Check: cold starts, vector DB query time, model inference time separately.
5. Is a guardrail blocking legitimate content? → Data/safety layer issue. Check guardrail trace with trace: ENABLED; identify which GuardrailPolicyType triggered; adjust policy thresholds.
6. Is the problem intermittent? → Check for: ThrottlingExceptions (use exponential backoff), embedding model quota limits, OpenSearch shard contention during index refresh.

Retrieval Failure Troubleshooting Domain 5.2

Most RAG quality problems trace back to retrieval — wrong chunks, missing chunks, or irrelevant context being passed to the model.

| Symptom | Root Cause | Fix |
|---|---|---|
| Always returns same documents regardless of query | Embedding model saturating (all vectors similar); poor chunking | Increase chunk diversity; try different embedding model; add metadata filtering |
| Returns 0 results | Index empty / not synced; embedding dimension mismatch; query preprocessing issue | Check KB sync status in CloudWatch; verify embedding dimensions match between index and query |
| Returns technically relevant but contextually wrong docs | Pure semantic search missing keyword signals | Switch to hybrid search (BM25 + semantic); add reranker |
| Retrieves outdated information | KB not re-synced after source update | Enable S3 Event Notifications → incremental sync (IngestKnowledgeBaseDocuments) |
| Retrieves correct docs but model hallucinates anyway | Chunks too large (irrelevant content dilutes signal); too many chunks | Reduce top_k; use smaller, more precise chunks; add reranker to filter top-k |
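Switching a Knowledge Base query to hybrid search is a request-shape change. A sketch of the Retrieve API parameters, where the KB ID is a placeholder:

```python
# Sketch: Bedrock Knowledge Base retrieval request using hybrid (BM25 + semantic)
# search instead of the pure-semantic default. KB ID is a placeholder.
def hybrid_retrieve_params(kb_id, query, top_k=5):
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "overrideSearchType": "HYBRID",  # vs. "SEMANTIC"
            }
        },
    }

# Usage (requires AWS credentials):
# boto3.client("bedrock-agent-runtime").retrieve(
#     **hybrid_retrieve_params("KB_ID", "user question"))
```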

Embedding & Chunking Issues

Embedding Model Mismatch
  • Document embeddings and query embeddings must use the same model
  • Changing embedding model → must re-index all documents
  • Check: embedding dimension (e.g., 1536 vs. 3072) matches index config
  • Drift detection: monitor cosine similarity distribution over time
Chunking Strategy Fixes
  • Fixed-size chunks returning wrong context: Switch to semantic chunking
  • Tables being split across chunks: Use Bedrock Data Automation + chunking
  • Hierarchical docs (headers + sub-content): Hierarchical chunking
  • Chunk too short: Add overlap (10-15%) to preserve context across boundaries
Reranker Usage
  • Retrieve top-50 docs, rerank to top-5 before passing to model
  • Reranker uses cross-encoder (expensive but accurate)
  • When to add: retrieval finds relevant docs but ordering is poor
  • Metrics: compare NDCG@10 before and after reranker
Query Preprocessing
  • HyDE: Generate a hypothetical answer, embed it, search with that
  • Query expansion: Add synonyms or related terms before embedding
  • Sub-query decomposition: Split complex query into 2-3 simpler searches, merge results with RRF
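The RRF merge mentioned above is a few lines of arithmetic. A minimal sketch, using the commonly cited constant k=60:

```python
# Sketch: Reciprocal Rank Fusion merging ranked result lists from sub-queries
# (or from BM25 + semantic search). k=60 is the commonly used RRF constant.
def rrf_merge(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1/(k + rank) for every doc it ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([["d1", "d2", "d3"], ["d2", "d4", "d1"]])
# d1 and d2 appear in both lists, so they outrank d3 and d4
```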

Generation Failure Troubleshooting Domain 5.2

Generation failures happen at the model layer — hallucination, format errors, and safety blocks are the most common categories.

Detecting Hallucinations
  • Citation checking: Verify every claim in response is supported by a retrieved chunk
  • Faithfulness score: RAGAS faithfulness metric (automated)
  • LLM-as-Judge: Use a second model to evaluate factual accuracy
  • Ground truth comparison: Compare against known-correct answers
  • Bedrock Guardrail trace: Check groundednessScore if grounding filter enabled
Prevention Techniques
  • Explicit grounding instruction: "Only use information from the provided context. Say 'I don't know' if the answer isn't in the context."
  • Reduce temperature: Lower temperature = less creative/hallucinated
  • Citation enforcement: Require model to cite source chunk IDs
  • Retrieval quality first: Hallucination drops when relevant context is present
  • Smaller context window: Fewer chunks reduces off-topic generation
Important distinction: Hallucination is a generation problem but its most common cause is a retrieval problem. If the correct information is absent from retrieved context, the model invents an answer. Fix retrieval first.
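The citation-checking idea above can be illustrated with a crude word-overlap heuristic. This is only a sketch of the concept; a real pipeline would use RAGAS faithfulness or an LLM judge, and the 0.5 threshold is an arbitrary assumption:

```python
# Sketch: flag answer sentences with little word overlap against the retrieved
# context. Crude heuristic for illustration only; the threshold is arbitrary.
def unsupported_sentences(answer, context, min_overlap=0.5):
    context_words = set(context.lower().split())
    flagged = []
    for sentence in filter(None, (s.strip() for s in answer.split("."))):
        words = set(sentence.lower().split())
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

flags = unsupported_sentences(
    "The refund window is 30 days. Shipping is free worldwide.",
    "Our policy: the refund window is 30 days from purchase.")
# "Shipping is free worldwide" has no support in the context, so it is flagged
```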

Format & Safety Block Failures

| Failure Type | Diagnosis | Fix |
|---|---|---|
| JSON output malformed | Model didn't follow schema; output truncated by maxTokens | Use JSON mode / constrained generation; increase maxTokens; provide schema in prompt with example |
| Response truncated mid-sentence | maxTokens too low | Increase maxTokens limit; or add "be concise" instruction to reduce response length |
| Guardrail blocks legitimate queries | Over-aggressive policy; topic too broad in denied topics list | Enable Guardrail trace; identify which policy triggered; narrow the policy scope or adjust sensitivity |
| PII in output despite filter | PII type not covered in sensitive info policy; entity not recognized by model | Add specific PII types to guardrail; test with Amazon Comprehend for detection validation |
| Prompt injection bypassed guardrail | Attack pattern not covered by PROMPT_ATTACK filter | Update Guardrail config; add adversarial examples to filter; use system prompt injection warnings |

Agent Failure Troubleshooting Domain 5.2

Agent failures are multi-step — the bug could be in tool selection, parameter extraction, tool execution, or response synthesis.

Enable Bedrock Agents trace to see each reasoning step:

# Enable trace in Bedrock Agents API call
response = bedrock_agent_runtime.invoke_agent(
    agentId='AGENT_ID',
    agentAliasId='ALIAS_ID',
    sessionId='SESSION_123',
    inputText='user query',
    enableTrace=True  # Returns step-by-step reasoning
)

# Trace output includes:
# - modelInvocationInput: what was sent to the model
# - rationale: model's reasoning about next action
# - invocationInput: tool call + parameters
# - observation: tool result returned
# - finalResponse: agent's final answer

Common Agent Failure Patterns

| Symptom | Root Cause | Fix |
|---|---|---|
| Agent calls wrong tool | Tool descriptions are ambiguous or overlapping | Rewrite tool descriptions with explicit "use when..." guidance; disambiguate similar tools |
| Agent passes wrong parameters to tool | Parameter names/types unclear; model hallucinated values | Improve parameter descriptions; add validation in Lambda; use enum types for constrained values |
| Agent loops indefinitely | Tool returns error agent can't interpret; no exit condition | Set maxIterations; return structured error objects; add fallback "give up" instruction |
| Agent returns "I couldn't complete the task" without reason | All tools failed; guardrail blocked; token limit hit | Check trace for failed tool calls; increase token budget; review guardrail trace |
| Tool Lambda timing out | Database query too slow; external API rate-limited | Add DynamoDB caching; implement exponential backoff for external APIs; increase Lambda timeout |
| Agent ignores retrieved KB context | Tool result format unrecognized by agent; context window overflow | Standardize tool output format; reduce top-k chunks; prioritize most recent/relevant |

ThrottlingException Handling

# Exponential backoff pattern for Bedrock throttling
import time, random

def invoke_with_retry(client, model_id, body, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.invoke_model(modelId=model_id, body=body)
        except client.exceptions.ThrottlingException:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) * 0.1 + random.uniform(0, 0.1)
            time.sleep(wait)  # 100ms, 200ms, 400ms, ... plus jitter
Retry pattern for exam questions: Initial delay 100ms → double on each retry → add random jitter → circuit breaker after max retries. The jitter prevents thundering herd if many clients retry simultaneously.

Evaluation Metrics Domain 5.1

GenAI evaluation requires different metrics than traditional ML. Know which metric to use for which evaluation goal.

| What You're Measuring | Metric | Range & Interpretation |
|---|---|---|
| Retrieval ranking quality | NDCG@K (Normalized Discounted Cumulative Gain) | 0–1; higher = better; rewards finding relevant docs in top positions |
| First relevant result position | MRR (Mean Reciprocal Rank) | 0–1; MRR=1 means first result always relevant; good for single-answer queries |
| % of relevant docs retrieved | Recall@K | 0–1; "Did we find all relevant docs in top K?" Critical for high-recall use cases |
| % of retrieved docs that are relevant | Precision@K | 0–1; "How many of our top-K results are actually relevant?" |
| Summarization / translation quality | ROUGE (n-gram overlap) | 0–1; ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence) |
| Machine translation quality | BLEU (Bilingual Evaluation Understudy) | 0–1; compares n-gram overlap with human reference; strict on word choice |
| Semantic similarity (meaning-preserving) | BERTScore | 0–1; uses BERT embeddings; tolerates paraphrasing unlike BLEU/ROUGE |
| Language model quality | Perplexity | Lower = better; measures how well model predicts text; not suited for open-ended generation |
BLEU vs. ROUGE: BLEU = translation quality (precision-focused). ROUGE = summarization quality (recall-focused). BLEU penalizes missing reference words; ROUGE penalizes missing summary content.
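The three retrieval metrics above fit in a few lines each. A sketch with binary relevance judgments (a doc is either relevant or not):

```python
# Sketch: ranking metrics with binary relevance. `retrieved` is the ranked
# result list; `relevant` is the set of ground-truth relevant doc IDs.
import math

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved, relevant):
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank       # reciprocal rank of first relevant result
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# A relevant doc at rank 1 scores higher than the same doc at rank 3:
# ndcg_at_k(["d1", "x", "y"], {"d1"}, 3) > ndcg_at_k(["x", "y", "d1"], {"d1"}, 3)
```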

LLM-as-Judge Evaluation

What It Is
  • Use a powerful LLM (e.g., Claude Opus) to evaluate outputs of another model
  • Judge scores: relevance, factuality, helpfulness, safety, coherence
  • Can replace or augment human evaluators at scale
  • Bedrock supports LLM-as-Judge natively in evaluation jobs
When to Use
  • No ground truth available (open-ended Q&A)
  • Need scalable evaluation without human labelers
  • Evaluating creative or long-form content
  • A/B testing two model versions at scale
Limitations
  • Judge model can have same biases as evaluated model
  • Positional bias: judge often prefers response listed first
  • Length bias: longer responses rated higher regardless of quality
  • Mitigation: swap order of responses; use multiple judges; calibrate with human labels
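The order-swapping mitigation can be sketched as a small harness: judge both orderings and keep only verdicts that survive the swap. The `judge` callable stands in for a real LLM call; this is an illustration of the pattern, not a Bedrock API:

```python
# Sketch: mitigate positional bias by judging both orderings and keeping only
# consistent verdicts. `judge(first, second)` returns "A" or "B" by position
# as presented; in practice it would wrap an LLM call.
def debiased_judgment(judge, response_a, response_b):
    first = judge(response_a, response_b)          # A presented first
    second = judge(response_b, response_a)         # order swapped
    swapped = {"A": "B", "B": "A"}[second]         # map back to original labels
    return first if first == swapped else "tie"   # disagreement = positional bias

# With a toy judge that always prefers whichever response is longer:
longer = lambda x, y: "A" if len(x) >= len(y) else "B"
verdict = debiased_judgment(longer, "short", "a much longer answer")
# Both orderings pick the longer answer, so the verdict is a consistent "B"
```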

Evaluation Frameworks in AWS

Framework / ServiceWhat It EvaluatesKey Feature
Bedrock Model EvaluationBuilt-in evaluation for Bedrock modelsAutomated + human evaluation jobs; supports custom metrics
SageMaker ClarifyBias detection, explainability, driftModel cards integration; fairness metrics
AgentCore EvaluationsAgent quality (13 built-in evaluators)Task success rate, tool accuracy, response quality
RAGASRAG pipeline end-to-end quality4 core metrics: faithfulness, answer relevance, context precision, context recall

RAGAS Framework Domain 5.1

RAGAS (Retrieval Augmented Generation Assessment) is the standard evaluation framework for RAG pipelines. Know all four metrics and what they measure independently.

1. Faithfulness

Does the generated answer contain only information from the retrieved context? Measures hallucination at the answer level.

  • Score: 0–1 (1 = fully grounded in context)
  • Low score = model is hallucinating or ignoring context
  • Fix: improve grounding instructions; reduce temperature
2. Answer Relevance

Does the generated answer actually address the user's question? Measures if the response is on-topic.

  • Score: 0–1 (1 = directly answers the question)
  • Low score = model gave tangential or off-topic response
  • Fix: improve system prompt instructions; better few-shot examples
3. Context Precision

Of the retrieved chunks, how many were actually needed to answer the question? Measures retrieval relevance.

  • Score: 0–1 (1 = all retrieved chunks were relevant)
  • Low score = retrieval is returning noisy/irrelevant chunks
  • Fix: better chunking; metadata filtering; reranker
4. Context Recall

Were all necessary pieces of information retrieved? Measures retrieval completeness.

  • Score: 0–1 (1 = all needed context was retrieved)
  • Low score = relevant docs are in the KB but not being retrieved
  • Fix: increase top-k; improve query preprocessing; hybrid search
Exam diagnostic pattern: Low Faithfulness → generation problem. Low Answer Relevance → generation problem. Low Context Precision → retrieval returning noise. Low Context Recall → retrieval missing relevant docs.

What RAGAS Tells You About Your Pipeline

| RAGAS Pattern | Diagnosis | Fix |
|---|---|---|
| Low Faithfulness + High Context Precision | Good retrieval but model ignores context (hallucinates) | Stronger grounding instructions; lower temperature; citation enforcement |
| High Faithfulness + Low Answer Relevance | Model uses context faithfully but answers wrong question | Better query understanding; clarify intent in system prompt |
| Low Context Precision + High Context Recall | Retrieving too many docs (noisy); recall is fine | Reduce top-k; add reranker; tighten metadata filter |
| High Context Precision + Low Context Recall | Retrieved docs are relevant but missing key information | Increase top-k; improve chunking to avoid splitting key facts |
| All four metrics low | Systemic failure — likely embedding model or KB indexing issue | Audit KB ingestion logs; test embedding model independently; re-index |
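Simplified versions of the two retrieval-side metrics make the diagnostic pattern concrete. Real RAGAS derives the relevance judgments with an LLM; here they are supplied as labels, so this is a sketch of the definitions only:

```python
# Sketch: simplified context precision/recall with labeled ground truth.
# (Real RAGAS derives these judgments with an LLM; here labels are provided.)
def context_precision(retrieved_chunks, useful_chunks):
    if not retrieved_chunks:
        return 0.0
    return len(set(retrieved_chunks) & set(useful_chunks)) / len(retrieved_chunks)

def context_recall(retrieved_chunks, needed_chunks):
    if not needed_chunks:
        return 1.0
    return len(set(retrieved_chunks) & set(needed_chunks)) / len(needed_chunks)

# Noisy retrieval: low precision, full recall → reduce top-k / add a reranker
p = context_precision(["c1", "c2", "c3", "c4"], ["c1"])   # 0.25
r = context_recall(["c1", "c2", "c3", "c4"], ["c1"])      # 1.0
```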

A/B Testing & Quality Assurance Domain 5.1

Systematic testing before and after changes — new model version, new prompt, new chunking strategy — prevents regressions.

1. Define the change and success metric: E.g., "Switching from Claude Sonnet to Nova Pro. Success = faithfulness ≥ 0.90 and answer relevance ≥ 0.85 with 20% lower cost."
2. Create evaluation dataset: 100–500 test queries with expected answers (ground truth). Include edge cases, adversarial inputs, and high-frequency query types.
3. Run both variants against the dataset: Model A (control) and Model B (treatment) process identical queries. Use Bedrock Model Evaluation jobs or custom evaluation Lambda.
4. Score with RAGAS + LLM-as-Judge: Automated scoring for all four RAGAS metrics plus cost per query for each variant.
5. Statistical significance check: Ensure sample size is large enough (n ≥ 100 per variant) and difference is statistically significant (p < 0.05).
6. Canary rollout if variant wins: Route 5% → 20% → 50% → 100% of traffic to new variant. Monitor production metrics at each stage before full rollout.
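The significance check in step 5 is, for a pass/fail metric like task success rate, a two-proportion z-test. A sketch using the pooled-variance normal approximation, which is reasonable at n ≥ 100 per variant:

```python
# Sketch: two-proportion z-test for an A/B comparison on a pass/fail metric
# (e.g. task success rate). Pooled-variance normal approximation.
import math

def two_proportion_p_value(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

# 85% vs. 92% success over 500 queries each:
p = two_proportion_p_value(425, 500, 460, 500)
significant = p < 0.05
```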

Regression Testing & Quality Gates

Regression Test Suite
  • Maintain a golden dataset of queries with expected outputs
  • Run on every deployment (CI/CD pipeline)
  • Fail the pipeline if faithfulness < threshold
  • Track metric trends over time (detect gradual drift)
Quality Gates
  • Pre-deploy gate: Automated RAGAS score on staging dataset
  • Safety gate: Guardrail test with adversarial inputs
  • Performance gate: P99 latency < SLA threshold
  • Cost gate: Average tokens/query within budget
Synthetic Evaluation Data
  • When real labeled data is scarce, generate synthetic test cases
  • Use a powerful LLM to create query-answer pairs from your KB documents
  • Validate synthetic data before using (LLM can make errors)
  • Mix synthetic + real data for best coverage

Security Controls Domain 3.2

Defense-in-depth for GenAI: seven layers from network to application. Know which AWS service addresses which layer.

| Layer | Control | AWS Service |
|---|---|---|
| 1. Network isolation | Keep traffic off public internet | VPC + PrivateLink (Interface VPC Endpoints) for Bedrock, S3, DynamoDB |
| 2. Identity & Access | Least-privilege IAM roles; no wildcard permissions | IAM + AWS Organizations SCPs + IAM Identity Center |
| 3. Data protection at rest | Encrypt all stored data | KMS (CMK for Bedrock KB, S3, DynamoDB); automatic key rotation |
| 4. Data protection in transit | TLS 1.2+ everywhere | Enforced by Bedrock, API Gateway; validate certificates |
| 5. Input/Output safety | Filter harmful content, PII, injections | Bedrock Guardrails; Amazon Comprehend; Amazon Macie |
| 6. Monitoring & audit | Log all access and changes | CloudTrail (API calls); CloudWatch Logs (app logs); AWS Config (resource drift) |
| 7. Fine-grained data access | Row/column level data access control | AWS Lake Formation; DynamoDB fine-grained access control |

PII Detection & Redaction

Bedrock Guardrails (PII)
  • Supported actions: BLOCK, MASK, ANONYMIZE, DETECT
  • Per entity type: name, SSN, credit card, phone, email, address
  • Applied to: input (pre-model), output (post-model), or both
  • Only one global blockedInputMessaging and blockedOutputsMessaging — not per entity type
Amazon Comprehend
  • PII detection and redaction in text at scale
  • Supports 100+ PII entity types
  • Redact mode: replace with [PII] placeholder
  • Use for: batch document scanning, preprocessing before KB ingestion
  • Also: sentiment analysis, entity recognition, key phrase extraction
Amazon Macie
  • Automated sensitive data discovery in S3 buckets
  • Detects: PII, financial data, credentials in stored files
  • Use for: ongoing monitoring of S3 data lake for sensitive content
  • Alerts when new sensitive data appears in monitored buckets
Decision rule: Guardrails = real-time PII filtering at model I/O boundary. Comprehend = batch processing or pre-ingestion scanning. Macie = S3 data lake monitoring for sensitive data discovery.
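Comprehend's `DetectPiiEntities` returns character offsets, not redacted text, so redaction is a post-processing step. A sketch applying the offsets from the end of the string backward (the entity list is a hard-coded stand-in for a real response):

```python
# Sketch: redact PII using the offset spans that Comprehend's DetectPiiEntities
# returns. The entity list below is a stand-in for a real API response.
def redact_pii(text, entities):
    # Replace from the end so earlier offsets stay valid after each substitution.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text

sample = "Contact Jane Doe at jane@example.com"
entities = [  # shape matches detect_pii_entities(...)['Entities']
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36},
]
redacted = redact_pii(sample, entities)
# → "Contact [NAME] at [EMAIL]"
```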

IAM Enforcement for Guardrails

# IAM Deny policy — forces all Bedrock calls to include a guardrail
{
  "Effect": "Deny",
  "Action": ["bedrock:InvokeModel", "bedrock:Converse"],
  "Resource": "*",
  "Condition": {
    "StringNotEquals": {
      "bedrock:GuardrailIdentifier": [
        "guardrail-id-prod-v1",
        "guardrail-id-prod-v2"
      ]
    }
  }
}
# Any InvokeModel call WITHOUT a valid GuardrailIdentifier is DENIED
# This is the CENTRAL enforcement mechanism — no proxy Lambda needed
Exam pattern: "Force all developers to use guardrails on every Bedrock call" → IAM Deny with bedrock:GuardrailIdentifier condition key. This is stronger than documentation or guidelines — it's technically enforced at the API level.

AI Governance & Compliance Domain 3.3–3.4

Governance ensures AI systems are deployed responsibly, with audit trails, model versioning, and compliance frameworks in place.

Governance
  • Framework of rules, practices, and processes
  • Defines who has authority and accountability
  • SageMaker Model Registry: version control for models
  • SageMaker Model Cards: document model metadata, intended use, limitations
  • AWS Organizations SCPs: prevent policy violations across accounts
Compliance
  • Adhering to laws, regulations, and internal policies
  • HIPAA: PHI encryption + access logging + BAA with AWS
  • SOC2: audit logging, access controls, change management
  • FedRAMP: requires GovCloud region; PrivateLink required
  • GDPR: right to erasure → document data lineage
  • AWS Artifact: download compliance reports
Audit & Monitoring
  • CloudTrail: All API calls logged (who, when, what) — immutable
  • AWS Config: Resource configuration drift detection
  • CloudWatch Logs: Application-level audit trail
  • S3 Object Lock: WORM compliance for audit logs
  • Macie: Monitor for accidental PII exposure in S3

SageMaker Model Cards

Model Cards provide structured documentation for ML models — a governance requirement for responsible AI deployment.

What Model Cards Document
  • Model description and intended use cases
  • Training data description and known biases
  • Evaluation results (performance metrics by demographic)
  • Ethical considerations and limitations
  • Model version and deployment history
Integration with Model Registry
  • Each model version in SageMaker Model Registry links to its Model Card
  • Approval workflow: models require approval before prod deployment
  • SageMaker Clarify generates bias reports attached to Model Cards
  • SageMaker Model Dashboard: unified view of all deployed models

Responsible AI — Bias & Explainability

| Tool / Feature | Purpose | When to Use |
|---|---|---|
| SageMaker Clarify | Bias detection in training data & model predictions; feature importance (SHAP) | Before and after model training to audit fairness |
| SageMaker Model Monitor | Detect data drift and model quality drift in production | After deployment to catch degradation over time |
| Bedrock Guardrails (grounding) | Prevent model from making claims not supported by retrieved context | RAG applications where hallucination is a compliance risk |
| Amazon Augmented AI (A2I) | Human review workflow for low-confidence AI predictions | When automated decisions need human oversight (lending, healthcare) |
Fairness metrics (SageMaker Clarify): Demographic parity difference, disparate impact, equal opportunity difference. These measure whether model performance differs across demographic groups.

CloudWatch Insights Queries

Ready-to-use CloudWatch Logs Insights queries for diagnosing common GenAI application issues.

# High token consumption sessions (top 20)
fields @timestamp, sessionId, inputTokens, outputTokens
| filter inputTokens + outputTokens > 5000
| sort (inputTokens + outputTokens) desc
| limit 20
# Token usage trend by hour
stats sum(inputTokens) as totalInput, sum(outputTokens) as totalOutput by bin(1h)
| sort @timestamp asc

Latency & Performance Queries

# P50/P90/P99 latency breakdown
stats percentile(latencyMs, 50) as p50,
      percentile(latencyMs, 90) as p90,
      percentile(latencyMs, 99) as p99
      by bin(5m)
# Slow requests (> 10 seconds)
fields @timestamp, requestId, latencyMs, modelId, inputTokens
| filter latencyMs > 10000
| sort latencyMs desc
| limit 50

Error & Guardrail Queries

# Throttling errors by minute
filter @message like /ThrottlingException/
| stats count(*) as throttleCount by bin(1m)
| sort @timestamp asc

# KB ingestion failures by error type
fields @timestamp, documentId, errorCode, errorMessage
| filter errorCode in ["RESOURCE_IGNORED", "EMBEDDING_FAILED", "INDEXING_FAILED"]
| stats count(*) as failures by errorCode
| sort failures desc

# Guardrail blocks by policy type
fields @timestamp, requestId, guardrailPolicyType, action
| filter action = "BLOCKED"
| stats count(*) as blockCount by guardrailPolicyType
| sort blockCount desc

Agent Debugging Queries

# Failed tool calls in Bedrock Agents
fields @timestamp, sessionId, toolName, errorType, errorMessage
| filter type = "TOOL_CALL_FAILURE"
| stats count(*) as failures by toolName, errorType
| sort failures desc

# Agent sessions that exceeded max iterations
fields @timestamp, sessionId, iterationCount, finalStatus
| filter finalStatus = "MAX_ITERATIONS_EXCEEDED"
| sort iterationCount desc
| limit 20

Exam Traps — Monitoring, Troubleshooting & Evaluation

The most common wrong-answer choices in Domain 4 and 5 questions.

Monitoring & Troubleshooting Traps

| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| KB ingestion failures | CloudTrail (logs API calls) | CloudWatch Logs (EMBEDDING_FAILED, INDEXING_FAILED, RESOURCE_IGNORED) |
| Which guardrail policy blocked? | CloudTrail events | GuardrailPolicyType CloudWatch metric dimension |
| OpenSearch latency with least ops overhead | Add ElastiCache in front | Enable OpenSearch Auto-Tune (built-in, no new service) |
| Multi-service latency attribution | CloudWatch Logs only | AWS X-Ray (shows per-service breakdown in trace) |
| Guardrail per-action-type custom message | "Bedrock Guardrails settings per PII type" | Not possible — only one global blockedInputMessaging + blockedOutputsMessaging per guardrail |
| Force guardrail use on all API calls | Custom proxy Lambda | IAM Deny with bedrock:GuardrailIdentifier condition key |

Evaluation Metric Traps

| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| Measuring if retrieved docs are in correct order | ROUGE | NDCG@K (accounts for position of relevant documents) |
| Measuring summarization quality | BLEU | ROUGE (BLEU is for translation; ROUGE for summarization) |
| Semantic similarity (paraphrase-tolerant) | BLEU or ROUGE | BERTScore (uses embeddings, not n-gram overlap) |
| RAG faithfulness (no hallucination) | BLEU score | RAGAS Faithfulness metric |
| Retrieval missing relevant docs | Low Context Precision | Low Context Recall (Precision = noise in retrieved; Recall = missed relevant) |
| A/B testing LLM versions for correctness | Run on 5 examples manually | Automated evaluation with LLM-as-Judge on 100+ query golden dataset |