CloudWatch & Observability Domain 4.3
GenAI monitoring differs from traditional apps — you need to track token usage, prompt quality, hallucination rates, and retrieval relevance in addition to standard infrastructure metrics.
Bedrock-Specific CloudWatch Metrics
| Metric / Log | What It Measures | When to Alert |
|---|---|---|
| InputTokenCount | Tokens consumed in prompts per request | Sudden spike → prompt injection or runaway context |
| OutputTokenCount | Tokens generated in responses | Consistently near maxTokens → responses being truncated |
| InvocationLatency | End-to-end model call latency | P99 > SLA threshold |
| InvocationThrottles | Throttled requests due to quota limits | Any non-zero value → scale up provisioned throughput |
| GuardrailPolicyType | Which specific guardrail policy triggered | High rate on ContentPolicy → review content filter config |
| Model Invocation Logs | Full request/response JSON logged to S3/CloudWatch | Enable for debugging; disable in production if PII present |
Monitoring Architecture — Three Layers
Infrastructure Layer
- Lambda: invocations, errors, duration, concurrency
- API Gateway: 4xx/5xx rates, latency, cache hit rate
- SQS: ApproximateNumberOfMessages (queue depth)
- DynamoDB: read/write capacity consumed vs. provisioned
- ECS: CPU/memory utilization, task count
AI Application Layer
- Token consumption per user / per session
- Prompt effectiveness score (custom metric)
- Cache hit rate (semantic & prompt cache)
- Knowledge Base retrieval latency
- Guardrail block rate by policy type
- Agent tool call success/failure rate
Business Layer
- Cost per query / per user session
- User satisfaction scores (thumbs up/down)
- Task completion rate for agents
- Hallucination rate (from evaluator pipeline)
- Response quality score trend over time
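Application- and business-layer metrics like these are custom metrics you publish yourself via PutMetricData. A minimal sketch; the GenAI/App namespace and dimension names are our own conventions, not a standard, and the boto3 call is commented out since it needs credentials.

```python
# Sketch: publishing custom GenAI application metrics to CloudWatch.
# The "GenAI/App" namespace and "SessionId" dimension are illustrative,
# not a standard -- pick your own conventions.

def build_metric_data(session_id: str, input_tokens: int,
                      output_tokens: int, cost_usd: float) -> list[dict]:
    """Build a PutMetricData payload for one request."""
    dims = [{"Name": "SessionId", "Value": session_id}]
    return [
        {"MetricName": "InputTokens", "Dimensions": dims,
         "Value": input_tokens, "Unit": "Count"},
        {"MetricName": "OutputTokens", "Dimensions": dims,
         "Value": output_tokens, "Unit": "Count"},
        {"MetricName": "CostPerQuery", "Dimensions": dims,
         "Value": cost_usd, "Unit": "None"},
    ]

data = build_metric_data("sess-42", input_tokens=812, output_tokens=256,
                         cost_usd=0.0041)

# With credentials configured you would publish it like this:
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="GenAI/App", MetricData=data)
```

From here, cost per query and token consumption per session become first-class CloudWatch metrics you can alarm on like any other.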
X-Ray Distributed Tracing for GenAI
X-Ray traces the full request path across services, letting you pinpoint exactly where latency or errors originate.
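A minimal instrumentation sketch: subsegments around the retrieval and generation steps so the trace shows where time is spent. It assumes the aws-xray-sdk package and an active segment (e.g. Lambda with tracing enabled); without the SDK it falls back to a no-op so the handler still runs locally.

```python
# Sketch: X-Ray subsegments around retrieval and generation. Assumes the
# aws-xray-sdk package; falls back to a no-op context manager when the
# SDK is absent so the code also runs outside AWS.
from contextlib import contextmanager

try:
    from aws_xray_sdk.core import xray_recorder, patch_all
    xray_recorder.configure(context_missing="LOG_ERROR")  # don't raise outside AWS
    patch_all()  # auto-instruments boto3/requests calls
    subsegment = xray_recorder.in_subsegment
except ImportError:
    @contextmanager
    def subsegment(name):  # no-op stand-in when the SDK is absent
        yield

def handle_query(query: str) -> str:
    with subsegment("retrieve"):
        chunks = ["chunk-1", "chunk-2"]  # stand-in for a KB retrieval call
    with subsegment("generate"):
        answer = f"answer using {len(chunks)} chunks"  # stand-in for invoke_model
    return answer
```

In the X-Ray console, the "retrieve" and "generate" subsegments then appear as separate timed spans under each request.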
Anomaly Detection & Alerting
| Anomaly Type | Detection Method | Response |
|---|---|---|
| Token cost spike | CloudWatch metric alarm on InputTokenCount sum | SNS alert → Lambda auto-throttle user sessions |
| Guardrail block surge | CloudWatch alarm on GuardrailCoverage > threshold | Review prompt injection attack patterns in logs |
| Retrieval quality drop | CloudWatch custom metric for retrieval score | Trigger KB re-sync or embedding model refresh |
| Latency regression | CloudWatch Anomaly Detection band on P99 latency | Auto-scale or switch to provisioned throughput |
| Hallucination rate increase | Automated evaluator Lambda checking factuality score | Alert on-call; roll back prompt template |
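The token cost spike row above maps to a straightforward metric alarm. A sketch where the threshold and SNS topic ARN are placeholders; the parameters are built as a plain dict so they can be inspected before calling put_metric_alarm.

```python
# Sketch: CloudWatch alarm for the "token cost spike" row above.
# AWS/Bedrock and InputTokenCount are the Bedrock runtime metrics;
# the threshold and topic ARN are illustrative placeholders.

def token_spike_alarm(model_id: str, threshold: int, topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_metric_alarm(**params)."""
    return {
        "AlarmName": f"bedrock-token-spike-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": 300,                   # 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],     # SNS topic -> auto-throttle Lambda
    }

params = token_spike_alarm(
    "anthropic.claude-3-haiku",
    threshold=500_000,
    topic_arn="arn:aws:sns:us-east-1:123456789012:genai-alerts")
# boto3.client("cloudwatch").put_metric_alarm(**params)
```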
Vector Database Monitoring Domain 4.3
RAG system performance depends heavily on vector database health. These metrics are specific to high-dimensional operations.
Key Performance Indicators
| KPI | Healthy Range | Problem Indicator |
|---|---|---|
| Query latency P99 | < 100ms for HNSW | > 500ms → index not warmed, over-sharded, or undersized |
| Recall@10 | > 0.95 | < 0.90 → HNSW efSearch too low; index quality degraded |
| Index memory usage | < 75% of available RAM | > 90% → risk of OOM; add nodes or reduce vector dimension |
| Shard size / count | ~10GB per shard | Too many small shards → overhead; too few → hot shards |
OpenSearch Auto-Tune vs. Manual Tuning
OpenSearch Auto-Tune (Built-In)
- Automatically optimizes JVM heap sizing
- Adjusts shard allocation across nodes
- Optimizes cache settings (field data, query cache)
- Schedules optimizations during low-traffic windows
- Zero new infrastructure — just enable it
- Exam answer for "least ops overhead"
Manual Tuning Options
- HNSW parameters: m (connections), ef_construction (build quality), ef_search (query quality)
- Refresh interval: increase for write-heavy workloads
- Segment merging: control merge policy for index size
- Replica count: 1 replica = HA; 0 = write performance only
⚠ Exam Trap: OpenSearch + ElastiCache
- Adding ElastiCache in front of OpenSearch = additional service to manage, additional cost, cache invalidation complexity
- The exam question says "least operational overhead" → Auto-Tune wins over ElastiCache every time
- Only add ElastiCache when the need is specifically caching identical repeated queries (not general latency improvement)
Knowledge Base Ingestion Monitoring
When using Bedrock Knowledge Bases, ingestion failures log to CloudWatch Logs — not CloudTrail. Know the error codes:
| Error Code | Meaning | Fix |
|---|---|---|
| RESOURCE_IGNORED | Document skipped (unsupported format, too large, or duplicate) | Check file format + size limits; deduplicate source data |
| EMBEDDING_FAILED | Embedding model call failed (timeout, rate limit) | Check embedding model quota; add retry with exponential backoff |
| INDEXING_FAILED | Vector store write failed (OpenSearch down, auth error) | Check OpenSearch cluster health; verify IAM permissions for KB |
| METADATA_EXTRACTION_FAILED | Metadata parsing failed (malformed JSON/tags) | Validate metadata format before ingestion |
Diagnostic Framework Domain 5.2
When a GenAI application misbehaves, the first step is isolating which layer is the problem. This framework guides you to the right diagnosis quickly.
The Four-Layer Diagnostic Model
Layer 1: Retrieval
- Wrong documents retrieved
- No results returned
- Irrelevant chunks in context
- Stale / outdated information
- Low recall (missing correct docs)
Layer 2: Generation
- Hallucination (fabricated facts)
- Off-topic or irrelevant response
- Format errors (wrong JSON, missing fields)
- Truncated responses
- Inconsistent style or language
Layer 3: Infrastructure
- High latency (> expected)
- Throttling errors (429)
- Timeout errors
- Cold start delays
- Memory / compute exhaustion
Layer 4: Data / Input
- PII leakage in output
- Prompt injection detected
- Guardrail blocks legitimate content
- Data quality issues in source docs
- Schema drift in structured inputs
Step-by-Step Diagnostic Decision Tree
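The four layers above can be triaged with a first-pass mapping from symptom to layer. A minimal sketch; the keyword lists are illustrative stand-ins for real signals (metrics, X-Ray traces, guardrail trace output), not a production classifier.

```python
# Sketch: first-pass triage mapping a reported symptom onto the
# four-layer diagnostic model. Keyword matching is illustrative only;
# real triage should use structured signals, not string search.

LAYER_SIGNALS = {
    "retrieval": ["wrong documents", "no results", "irrelevant chunks", "stale"],
    "generation": ["hallucination", "off-topic", "malformed json", "truncated"],
    "infrastructure": ["latency", "throttl", "timeout", "cold start", "oom"],
    "data-input": ["pii", "prompt injection", "guardrail block", "schema drift"],
}

def triage(symptom: str) -> str:
    s = symptom.lower()
    for layer, signals in LAYER_SIGNALS.items():
        if any(sig in s for sig in signals):
            return layer
    return "unknown -- start with an X-Ray trace"
```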
- Guardrail blocks: run with trace: ENABLED; identify which GuardrailPolicyType triggered; adjust policy thresholds.
Retrieval Failure Troubleshooting Domain 5.2
Most RAG quality problems trace back to retrieval — wrong chunks, missing chunks, or irrelevant context being passed to the model.
Common Retrieval Failure Patterns
| Symptom | Root Cause | Fix |
|---|---|---|
| Always returns same documents regardless of query | Embedding model saturating (all vectors similar); poor chunking | Increase chunk diversity; try different embedding model; add metadata filtering |
| Returns 0 results | Index empty / not synced; embedding dimension mismatch; query preprocessing issue | Check KB sync status in CloudWatch; verify embedding dimensions match between index and query |
| Returns technically relevant but contextually wrong docs | Pure semantic search missing keyword signals | Switch to hybrid search (BM25 + semantic); add reranker |
| Retrieves outdated information | KB not re-synced after source update | Enable S3 Event Notifications → incremental sync (IngestKnowledgeBaseDocuments) |
| Retrieves correct docs but model hallucinates anyway | Chunks too large (irrelevant content dilutes signal); too many chunks | Reduce top_k; use smaller, more precise chunks; add reranker to filter top-k |
Embedding & Chunking Issues
Embedding Model Mismatch
- Document embeddings and query embeddings must use the same model
- Changing embedding model → must re-index all documents
- Check: embedding dimension (e.g., 1536 vs. 3072) matches index config
- Drift detection: monitor cosine similarity distribution over time
Chunking Strategy Fixes
- Fixed-size chunks returning wrong context: Switch to semantic chunking
- Tables being split across chunks: Use Bedrock Data Automation + chunking
- Hierarchical docs (headers + sub-content): Hierarchical chunking
- Chunks too short (context lost at boundaries): Add 10-15% overlap to preserve context across boundaries
Reranker Usage
- Retrieve top-50 docs, rerank to top-5 before passing to model
- Reranker uses cross-encoder (expensive but accurate)
- When to add: retrieval finds relevant docs but ordering is poor
- Metrics: compare NDCG@10 before and after reranker
Query Preprocessing
- HyDE: Generate a hypothetical answer, embed it, search with that
- Query expansion: Add synonyms or related terms before embedding
- Sub-query decomposition: Split complex query into 2-3 simpler searches, merge results with RRF
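The RRF merge used after sub-query decomposition is only a few lines. A minimal sketch, using the conventional k = 60 smoothing constant:

```python
# Sketch: Reciprocal Rank Fusion (RRF) to merge ranked doc-ID lists from
# sub-queries (or from BM25 + semantic search). k=60 is the conventional
# smoothing constant.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists; each doc scores sum of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([["d1", "d2", "d3"], ["d1", "d3", "d4"]])
# -> ["d1", "d3", "d2", "d4"]  (d1 tops both lists; d3 appears twice)
```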
Generation Failure Troubleshooting Domain 5.2
Generation failures happen at the model layer — hallucination, format errors, and safety blocks are the most common categories.
Hallucination Detection & Prevention
Detecting Hallucinations
- Citation checking: Verify every claim in response is supported by a retrieved chunk
- Faithfulness score: RAGAS faithfulness metric (automated)
- LLM-as-Judge: Use a second model to evaluate factual accuracy
- Ground truth comparison: Compare against known-correct answers
- Bedrock Guardrail trace: Check groundednessScore if grounding filter enabled
Prevention Techniques
- Explicit grounding instruction: "Only use information from the provided context. Say 'I don't know' if the answer isn't in the context."
- Reduce temperature: Lower temperature = less creative/hallucinated
- Citation enforcement: Require model to cite source chunk IDs
- Retrieval quality first: Hallucination drops when relevant context is present
- Smaller context window: Fewer chunks reduce off-topic generation
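Citation checking can start crude. A sketch that flags response sentences with low word overlap against any retrieved chunk; real pipelines use an NLI model or the RAGAS faithfulness metric instead of bag-of-words overlap.

```python
# Sketch: crude citation check -- flag response sentences whose word
# overlap with the retrieved chunks falls below a threshold. A real
# pipeline would use an NLI model or RAGAS faithfulness instead.
import re

def unsupported_sentences(response: str, chunks: list[str],
                          min_overlap: float = 0.5) -> list[str]:
    chunk_words = set(w.lower() for c in chunks for w in re.findall(r"\w+", c))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = [w.lower() for w in re.findall(r"\w+", sent)]
        if not words:
            continue
        overlap = sum(w in chunk_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sent)
    return flagged
```

Flagged sentences are hallucination candidates: either the model invented them, or retrieval failed to surface the supporting chunk.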
Format & Safety Block Failures
| Failure Type | Diagnosis | Fix |
|---|---|---|
| JSON output malformed | Model didn't follow schema; output truncated by maxTokens | Use JSON mode / constrained generation; increase maxTokens; provide schema in prompt with example |
| Response truncated mid-sentence | maxTokens too low | Increase maxTokens limit; or add "be concise" instruction to reduce response length |
| Guardrail blocks legitimate queries | Over-aggressive policy; topic too broad in denied topics list | Enable Guardrail trace; identify which policy triggered; narrow the policy scope or adjust sensitivity |
| PII in output despite filter | PII type not covered in sensitive info policy; entity not recognized by model | Add specific PII types to guardrail; test with Amazon Comprehend for detection validation |
| Prompt injection bypassed guardrail | Attack pattern not covered by PROMPT_ATTACK filter | Update Guardrail config; add adversarial examples to filter; use system prompt injection warnings |
Agent Failure Troubleshooting Domain 5.2
Agent failures are multi-step — the bug could be in tool selection, parameter extraction, tool execution, or response synthesis.
Agent Trace Debugging
Enable the Bedrock Agents trace to see each reasoning step.
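A sketch of trace-driven debugging: invoke_agent with enableTrace=True streams trace events alongside the response. The field path trace → trace → orchestrationTrace follows the bedrock-agent-runtime event-stream shape; confirm it against your SDK version. The parser is pure so it can be tested without AWS.

```python
# Sketch: parse Bedrock Agents trace events into readable reasoning
# steps. The nesting trace -> trace -> orchestrationTrace follows the
# bedrock-agent-runtime response shape; verify against your SDK version.

def extract_steps(events) -> list[str]:
    steps = []
    for event in events:
        orch = event.get("trace", {}).get("trace", {}).get("orchestrationTrace", {})
        if "rationale" in orch:
            steps.append("THINK: " + orch["rationale"].get("text", ""))
        if "invocationInput" in orch:
            steps.append("ACT: tool call issued")
        if "observation" in orch:
            steps.append("OBSERVE: tool result received")
    return steps

demo_events = [  # shape of real trace events, with illustrative content
    {"trace": {"trace": {"orchestrationTrace":
        {"rationale": {"text": "look up the order status"}}}}},
    {"trace": {"trace": {"orchestrationTrace": {"invocationInput": {}}}}},
    {"trace": {"trace": {"orchestrationTrace": {"observation": {}}}}},
]
steps = extract_steps(demo_events)

# Against a live agent (credentials required):
# client = boto3.client("bedrock-agent-runtime")
# response = client.invoke_agent(agentId="...", agentAliasId="...",
#                                sessionId="debug-1", inputText="...",
#                                enableTrace=True)
# print(extract_steps(response["completion"]))
```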
Common Agent Failure Patterns
| Symptom | Root Cause | Fix |
|---|---|---|
| Agent calls wrong tool | Tool descriptions are ambiguous or overlapping | Rewrite tool descriptions with explicit "use when..." guidance; disambiguate similar tools |
| Agent passes wrong parameters to tool | Parameter names/types unclear; model hallucinated values | Improve parameter descriptions; add validation in Lambda; use enum types for constrained values |
| Agent loops indefinitely | Tool returns error agent can't interpret; no exit condition | Set maxIterations; return structured error objects; add fallback "give up" instruction |
| Agent returns "I couldn't complete the task" without reason | All tools failed; guardrail blocked; token limit hit | Check trace for failed tool calls; increase token budget; review guardrail trace |
| Tool Lambda timing out | Database query too slow; external API rate-limited | Add DynamoDB caching; implement exponential backoff for external APIs; increase Lambda timeout |
| Agent ignores retrieved KB context | Tool result format unrecognized by agent; context window overflow | Standardize tool output format; reduce top-k chunks; prioritize most recent/relevant |
ThrottlingException Handling
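Throttled Bedrock calls (429 / ThrottlingException) should be absorbed client-side with exponential backoff and full jitter. A minimal sketch; the retryable check is parameterized so the helper is testable offline, and with boto3 you would match err.response["Error"]["Code"] == "ThrottlingException".

```python
# Sketch: retry with exponential backoff + full jitter for throttled
# model calls. `is_retryable` is injected so the helper runs without
# AWS; with boto3, check the ClientError code for "ThrottlingException".
import random
import time

def retry_with_backoff(fn, is_retryable, max_attempts=5, base=0.5, cap=8.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as err:
            if not is_retryable(err) or attempt == max_attempts - 1:
                raise
            # full jitter: sleep a random amount up to the capped backoff
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage with boto3 (credentials required):
# result = retry_with_backoff(
#     lambda: bedrock.invoke_model(modelId="...", body=payload),
#     lambda e: getattr(e, "response", {}).get("Error", {}).get("Code")
#               == "ThrottlingException")
```

If retries keep exhausting, that is the signal from the metrics table earlier: scale up provisioned throughput rather than retrying harder.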
Evaluation Metrics Domain 5.1
GenAI evaluation requires different metrics than traditional ML. Know which metric to use for which evaluation goal.
Metric Selection Guide
| What You're Measuring | Metric | Range & Interpretation |
|---|---|---|
| Retrieval ranking quality | NDCG@K (Normalized Discounted Cumulative Gain) | 0–1; higher = better; rewards finding relevant docs in top positions |
| First relevant result position | MRR (Mean Reciprocal Rank) | 0–1; MRR=1 means first result always relevant; good for single-answer queries |
| % of relevant docs retrieved | Recall@K | 0–1; "Did we find all relevant docs in top K?" Critical for high-recall use cases |
| % of retrieved docs that are relevant | Precision@K | 0–1; "How many of our top-K results are actually relevant?" |
| Summarization / translation quality | ROUGE (n-gram overlap) | 0–1; ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence) |
| Machine translation quality | BLEU (Bilingual Evaluation Understudy) | 0–1; compares n-gram overlap with human reference; strict on word choice |
| Semantic similarity (meaning-preserving) | BERTScore | 0–1; uses BERT embeddings; tolerates paraphrasing unlike BLEU/ROUGE |
| Language model quality | Perplexity | Lower = better; measures how well model predicts text; not suited for open-ended generation |
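The retrieval-ranking rows above are easy to compute directly. A minimal sketch over binary relevance labels, where `retrieved` is an ordered list of doc IDs and `relevant` is the ground-truth set:

```python
# Sketch: retrieval-ranking metrics from the table, over binary
# relevance (doc is relevant or not).
import math

def precision_at_k(retrieved, relevant, k):
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(retrieved, relevant):
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0
```

Note how NDCG rewards position: a relevant doc at rank 1 contributes more than the same doc at rank 5, which precision@k and recall@k cannot distinguish.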
LLM-as-Judge Evaluation
What It Is
- Use a powerful LLM (e.g., Claude Opus) to evaluate outputs of another model
- Judge scores: relevance, factuality, helpfulness, safety, coherence
- Can replace or augment human evaluators at scale
- Bedrock supports LLM-as-Judge natively in evaluation jobs
When to Use
- No ground truth available (open-ended Q&A)
- Need scalable evaluation without human labelers
- Evaluating creative or long-form content
- A/B testing two model versions at scale
Limitations
- Judge model can have same biases as evaluated model
- Positional bias: judge often prefers response listed first
- Length bias: longer responses rated higher regardless of quality
- Mitigation: swap order of responses; use multiple judges; calibrate with human labels
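The order-swap mitigation can be wrapped directly around any judge callable. A sketch where `judge` is any function (e.g. an LLM-as-Judge wrapper) that returns "first" or "second":

```python
# Sketch: positional-bias mitigation -- judge each pair twice with the
# order swapped and keep only consistent verdicts. `judge` is any
# callable returning "first" or "second".

def debiased_preference(judge, resp_a: str, resp_b: str) -> str:
    v1 = judge(resp_a, resp_b)   # A shown first
    v2 = judge(resp_b, resp_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"                 # inconsistent -> likely positional bias
```

A judge that always prefers whichever answer is listed first collapses to "tie" here, which is exactly the signal that it needs calibration against human labels.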
Evaluation Frameworks in AWS
| Framework / Service | What It Evaluates | Key Feature |
|---|---|---|
| Bedrock Model Evaluation | Built-in evaluation for Bedrock models | Automated + human evaluation jobs; supports custom metrics |
| SageMaker Clarify | Bias detection, explainability, drift | Model cards integration; fairness metrics |
| AgentCore Evaluations | Agent quality (13 built-in evaluators) | Task success rate, tool accuracy, response quality |
| RAGAS | RAG pipeline end-to-end quality | 4 core metrics: faithfulness, answer relevance, context precision, context recall |
RAGAS Framework Domain 5.1
RAGAS (Retrieval Augmented Generation Assessment) is the standard evaluation framework for RAG pipelines. Know all four metrics and what they measure independently.
The Four RAGAS Metrics
1. Faithfulness
Does the generated answer contain only information from the retrieved context? Measures hallucination at the answer level.
- Score: 0–1 (1 = fully grounded in context)
- Low score = model is hallucinating or ignoring context
- Fix: improve grounding instructions; reduce temperature
2. Answer Relevance
Does the generated answer actually address the user's question? Measures if the response is on-topic.
- Score: 0–1 (1 = directly answers the question)
- Low score = model gave tangential or off-topic response
- Fix: improve system prompt instructions; better few-shot examples
3. Context Precision
Of the retrieved chunks, how many were actually needed to answer the question? Measures retrieval relevance.
- Score: 0–1 (1 = all retrieved chunks were relevant)
- Low score = retrieval is returning noisy/irrelevant chunks
- Fix: better chunking; metadata filtering; reranker
4. Context Recall
Were all necessary pieces of information retrieved? Measures retrieval completeness.
- Score: 0–1 (1 = all needed context was retrieved)
- Low score = relevant docs are in the KB but not being retrieved
- Fix: increase top-k; improve query preprocessing; hybrid search
What RAGAS Tells You About Your Pipeline
| RAGAS Pattern | Diagnosis | Fix |
|---|---|---|
| Low Faithfulness + High Context Precision | Good retrieval but model ignores context (hallucinates) | Stronger grounding instructions; lower temperature; citation enforcement |
| High Faithfulness + Low Answer Relevance | Model uses context faithfully but answers wrong question | Better query understanding; clarify intent in system prompt |
| Low Context Precision + High Context Recall | Retrieving too many docs (noisy); recall is fine | Reduce top-k; add reranker; tighten metadata filter |
| High Context Precision + Low Context Recall | Retrieved docs are relevant but missing key information | Increase top-k; improve chunking to avoid splitting key facts |
| All four metrics low | Systemic failure — likely embedding model or KB indexing issue | Audit KB ingestion logs; test embedding model independently; re-index |
A/B Testing & Quality Assurance Domain 5.1
Systematic testing before and after changes — new model version, new prompt, new chunking strategy — prevents regressions.
A/B Testing Framework for LLM Changes
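A minimal A/B harness: deterministic hash-based assignment (so a user always sees the same variant) plus a naive score comparison. This is a sketch; a real test would add a significance check (bootstrap or t-test) before declaring a winner.

```python
# Sketch: A/B harness for LLM changes. Hash-based assignment keeps each
# user pinned to one variant; the comparison is naive (means only).
import hashlib
import statistics

def assign_variant(user_id: str, split: float = 0.5) -> str:
    """Stable assignment: same user_id always maps to the same variant."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "A" if (h % 1000) / 1000 < split else "B"

def compare(scores_a: list[float], scores_b: list[float]) -> str:
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    return f"A={mean_a:.3f} B={mean_b:.3f} winner={'A' if mean_a >= mean_b else 'B'}"
```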
Regression Testing & Quality Gates
Regression Test Suite
- Maintain a golden dataset of queries with expected outputs
- Run on every deployment (CI/CD pipeline)
- Fail the pipeline if faithfulness < threshold
- Track metric trends over time (detect gradual drift)
Quality Gates
- Pre-deploy gate: Automated RAGAS score on staging dataset
- Safety gate: Guardrail test with adversarial inputs
- Performance gate: P99 latency < SLA threshold
- Cost gate: Average tokens/query within budget
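The gates above can run as a single CI step that fails the pipeline on any breach. A sketch with illustrative thresholds; tune them against your golden dataset.

```python
# Sketch: pre-deploy quality gates over evaluation scores. Threshold
# values are illustrative, not recommendations.

GATES = {
    "faithfulness": 0.90,       # hallucination guard (higher is better)
    "answer_relevance": 0.85,   # on-topic guard (higher is better)
    "p99_latency_ms": 3000,     # SLA guard (lower is better)
}

def passes_gates(scores: dict) -> tuple[bool, list[str]]:
    failures = []
    for metric, threshold in GATES.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif metric == "p99_latency_ms":       # upper-bound metric
            if value > threshold:
                failures.append(f"{metric}: {value} > {threshold}")
        elif value < threshold:                # lower-bound metrics
            failures.append(f"{metric}: {value} < {threshold}")
    return (not failures, failures)

# In CI: ok, failures = passes_gates(eval_scores); sys.exit(0 if ok else 1)
```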
Synthetic Evaluation Data
- When real labeled data is scarce, generate synthetic test cases
- Use a powerful LLM to create query-answer pairs from your KB documents
- Validate synthetic data before using (LLM can make errors)
- Mix synthetic + real data for best coverage
Security Controls Domain 3.2
Defense-in-depth for GenAI: seven layers from network to application. Know which AWS service addresses which layer.
Seven-Layer Security Model
| Layer | Control | AWS Service |
|---|---|---|
| 1. Network isolation | Keep traffic off public internet | VPC + PrivateLink (Interface VPC Endpoints) for Bedrock, S3, DynamoDB |
| 2. Identity & Access | Least-privilege IAM roles; no wildcard permissions | IAM + AWS Organizations SCPs + IAM Identity Center |
| 3. Data protection at rest | Encrypt all stored data | KMS (CMK for Bedrock KB, S3, DynamoDB); automatic key rotation |
| 4. Data protection in transit | TLS 1.2+ everywhere | Enforced by Bedrock, API Gateway; validate certificates |
| 5. Input/Output safety | Filter harmful content, PII, injections | Bedrock Guardrails; Amazon Comprehend; Amazon Macie |
| 6. Monitoring & audit | Log all access and changes | CloudTrail (API calls); CloudWatch Logs (app logs); AWS Config (resource drift) |
| 7. Fine-grained data access | Row/column level data access control | AWS Lake Formation; DynamoDB fine-grained access control |
PII Detection & Redaction
Bedrock Guardrails (PII)
- Supported actions: BLOCK, MASK, ANONYMIZE, DETECT
- Per entity type: name, SSN, credit card, phone, email, address
- Applied to: input (pre-model), output (post-model), or both
- Only one global blockedInputMessaging and blockedOutputsMessaging — not per entity type
Amazon Comprehend
- PII detection and redaction in text at scale
- Supports 100+ PII entity types
- Redact mode: replace with [PII] placeholder
- Use for: batch document scanning, preprocessing before KB ingestion
- Also: sentiment analysis, entity recognition, key phrase extraction
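A sketch of offset-based redaction using detect_pii_entities output. The redactor itself is pure (it works on BeginOffset/EndOffset spans) so it runs offline; the Comprehend API call is shown commented out.

```python
# Sketch: redact PII spans returned by Comprehend detect_pii_entities.
# Entities carry BeginOffset/EndOffset character positions.

def redact(text: str, entities: list[dict], placeholder: str = "[PII]") -> str:
    """Replace each entity span with a placeholder, left to right."""
    out, cursor = [], 0
    for e in sorted(entities, key=lambda e: e["BeginOffset"]):
        out.append(text[cursor:e["BeginOffset"]])
        out.append(placeholder)
        cursor = e["EndOffset"]
    out.append(text[cursor:])
    return "".join(out)

# With credentials configured:
# resp = boto3.client("comprehend").detect_pii_entities(
#     Text=text, LanguageCode="en")
# clean = redact(text, resp["Entities"])
```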
Amazon Macie
- Automated sensitive data discovery in S3 buckets
- Detects: PII, financial data, credentials in stored files
- Use for: ongoing monitoring of S3 data lake for sensitive content
- Alerts when new sensitive data appears in monitored buckets
IAM Enforcement for Guardrails
Force every model invocation through an approved guardrail by attaching an IAM policy that denies bedrock:InvokeModel unless the request includes the bedrock:GuardrailIdentifier condition key. This is stronger than documentation or guidelines — it's technically enforced at the API level.
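A sketch of the Deny policy, expressed as a Python dict for readability; the account ID, guardrail ARN, and Sid are placeholders. With StringNotEquals, requests that omit the guardrail key entirely also match the Deny, which is the point.

```python
# Sketch: IAM policy denying InvokeModel unless the request carries the
# approved guardrail. ARN and account ID are placeholders.
import json

deny_without_guardrail = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInvokeWithoutGuardrail",
        "Effect": "Deny",
        "Action": ["bedrock:InvokeModel",
                   "bedrock:InvokeModelWithResponseStream"],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {
                # requests without this key also match -> denied
                "bedrock:GuardrailIdentifier":
                    "arn:aws:bedrock:us-east-1:123456789012:guardrail/gr-example"
            }
        }
    }]
}
policy_json = json.dumps(deny_without_guardrail, indent=2)
```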
AI Governance & Compliance Domain 3.3–3.4
Governance ensures AI systems are deployed responsibly, with audit trails, model versioning, and compliance frameworks in place.
Governance, Security, and Compliance Triad
Governance
- Framework of rules, practices, and processes
- Defines who has authority and accountability
- SageMaker Model Registry: version control for models
- SageMaker Model Cards: document model metadata, intended use, limitations
- AWS Organizations SCPs: prevent policy violations across accounts
Compliance
- Adhering to laws, regulations, and internal policies
- HIPAA: PHI encryption + access logging + BAA with AWS
- SOC2: audit logging, access controls, change management
- FedRAMP: requires GovCloud region; PrivateLink required
- GDPR: right to erasure → document data lineage
- AWS Artifact: download compliance reports
Audit & Monitoring
- CloudTrail: All API calls logged (who, when, what) — immutable
- AWS Config: Resource configuration drift detection
- CloudWatch Logs: Application-level audit trail
- S3 Object Lock: WORM compliance for audit logs
- Macie: Monitor for accidental PII exposure in S3
SageMaker Model Cards
Model Cards provide structured documentation for ML models — a governance requirement for responsible AI deployment.
What Model Cards Document
- Model description and intended use cases
- Training data description and known biases
- Evaluation results (performance metrics by demographic)
- Ethical considerations and limitations
- Model version and deployment history
Integration with Model Registry
- Each model version in SageMaker Model Registry links to its Model Card
- Approval workflow: models require approval before prod deployment
- SageMaker Clarify generates bias reports attached to Model Cards
- SageMaker Model Dashboard: unified view of all deployed models
Responsible AI — Bias & Explainability
| Tool / Feature | Purpose | When to Use |
|---|---|---|
| SageMaker Clarify | Bias detection in training data & model predictions; feature importance (SHAP) | Before and after model training to audit fairness |
| SageMaker Model Monitor | Detect data drift and model quality drift in production | After deployment to catch degradation over time |
| Bedrock Guardrails (grounding) | Prevent model from making claims not supported by retrieved context | RAG applications where hallucination is a compliance risk |
| Amazon Augmented AI (A2I) | Human review workflow for low-confidence AI predictions | When automated decisions need human oversight (lending, healthcare) |
CloudWatch Insights Queries
Ready-to-use CloudWatch Logs Insights queries for diagnosing common GenAI application issues.
Token Usage & Cost Queries
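A starting point, assuming Bedrock model invocation logging is enabled to CloudWatch Logs; the input.inputTokenCount / output.outputTokenCount field names follow the invocation-log JSON schema, so verify them against your log group.

```
# Hourly token consumption by model -- field names assume the Bedrock
# model invocation log schema; verify against your log group
fields @timestamp, modelId, input.inputTokenCount, output.outputTokenCount
| stats sum(input.inputTokenCount) as inTokens,
        sum(output.outputTokenCount) as outTokens by modelId, bin(1h)
| sort inTokens desc
```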
Latency & Performance Queries
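A sketch that assumes your application writes a JSON log line with a latencyMs field per request (our naming, not a standard); pct() is the built-in Logs Insights percentile function.

```
# Latency percentiles over 5-minute windows -- assumes an app-emitted
# JSON field `latencyMs`; adjust to your own log schema
filter ispresent(latencyMs)
| stats pct(latencyMs, 50) as p50, pct(latencyMs, 90) as p90,
        pct(latencyMs, 99) as p99 by bin(5m)
```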
Error & Guardrail Queries
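A sketch for spotting guardrail interventions and throttles in application logs; the matched strings are illustrative patterns, not fixed log formats.

```
# Guardrail interventions and throttle errors per hour -- the matched
# strings are illustrative; align them with your actual log output
fields @timestamp, @message
| filter @message like /GUARDRAIL_INTERVENED/ or @message like /ThrottlingException/
| stats count(*) as hits by bin(1h)
| sort hits desc
```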
Agent Debugging Queries
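A sketch assuming agent trace events are forwarded to CloudWatch Logs as JSON; the failureReason parse pattern is an assumption to adapt to your trace format.

```
# Most common agent tool-call failure reasons -- assumes agent traces
# are logged as JSON; adapt the parse pattern to your trace format
fields @timestamp, @message
| filter @message like /orchestrationTrace/
| parse @message '"failureReason":"*"' as failureReason
| filter ispresent(failureReason)
| stats count(*) as failures by failureReason
| sort failures desc
```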
Exam Traps — Monitoring, Troubleshooting & Evaluation
The most common wrong-answer choices in Domain 4 and 5 questions.
Monitoring & Troubleshooting Traps
| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| KB ingestion failures | CloudTrail (logs API calls) | CloudWatch Logs (EMBEDDING_FAILED, INDEXING_FAILED, RESOURCE_IGNORED) |
| Which guardrail policy blocked? | CloudTrail events | GuardrailPolicyType CloudWatch metric dimension |
| OpenSearch latency with least ops overhead | Add ElastiCache in front | Enable OpenSearch Auto-Tune (built-in, no new service) |
| Multi-service latency attribution | CloudWatch Logs only | AWS X-Ray (shows per-service breakdown in trace) |
| Guardrail per-action-type custom message | "Bedrock Guardrails settings per PII type" | Not possible — only one global blockedInputMessaging + blockedOutputsMessaging per guardrail |
| Force guardrail use on all API calls | Custom proxy Lambda | IAM Deny with bedrock:GuardrailIdentifier condition key |
Evaluation Metric Traps
| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| Measuring if retrieved docs are in correct order | ROUGE | NDCG@K (accounts for position of relevant documents) |
| Measuring summarization quality | BLEU | ROUGE (BLEU is for translation; ROUGE for summarization) |
| Semantic similarity (paraphrase-tolerant) | BLEU or ROUGE | BERTScore (uses embeddings, not n-gram overlap) |
| RAG faithfulness (no hallucination) | BLEU score | RAGAS Faithfulness metric |
| Retrieval missing relevant docs | Low Context Precision | Low Context Recall (Precision = noise in retrieved; Recall = missed relevant) |
| A/B testing LLM versions for correctness | Run on 5 examples manually | Automated evaluation with LLM-as-Judge on 100+ query golden dataset |