CloudWatch & Observability Domain 4.3
GenAI monitoring differs from traditional apps — you need to track token usage, prompt quality, hallucination rates, and retrieval relevance in addition to standard infrastructure metrics.
Bedrock-Specific CloudWatch Metrics
| Metric / Log | What It Measures | When to Alert |
|---|---|---|
| InputTokenCount | Tokens consumed in prompts per request | Sudden spike → prompt injection or runaway context |
| OutputTokenCount | Tokens generated in responses | Consistently near maxTokens → responses being truncated |
| InvocationLatency | End-to-end model call latency | P99 > SLA threshold |
| InvocationThrottles | Throttled requests due to quota limits | Any non-zero value → scale up provisioned throughput |
| GuardrailPolicyType | Which specific guardrail policy triggered | High rate on ContentPolicy → review content filter config |
| Model Invocation Logs | Full request/response JSON logged to S3/CloudWatch | Enable for debugging; disable in production if PII present |
Monitoring Architecture — Three Layers
Infrastructure Layer
- Lambda: invocations, errors, duration, concurrency
- API Gateway: 4xx/5xx rates, latency, cache hit rate
- SQS: ApproximateNumberOfMessages (queue depth)
- DynamoDB: read/write capacity consumed vs. provisioned
- ECS: CPU/memory utilization, task count
AI Application Layer
- Token consumption per user / per session
- Prompt effectiveness score (custom metric)
- Cache hit rate (semantic & prompt cache)
- Knowledge Base retrieval latency
- Guardrail block rate by policy type
- Agent tool call success/failure rate
Business Layer
- Cost per query / per user session
- User satisfaction scores (thumbs up/down)
- Task completion rate for agents
- Hallucination rate (from evaluator pipeline)
- Response quality score trend over time
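Application- and business-layer metrics like these are custom metrics you publish yourself via PutMetricData. A minimal sketch; the GenAI/App namespace and dimension names are our own conventions, not a standard, and the boto3 call is commented out since it needs credentials.

```python
# Sketch: publishing custom GenAI application metrics to CloudWatch.
# The "GenAI/App" namespace and "SessionId" dimension are illustrative,
# not a standard -- pick your own conventions.

def build_metric_data(session_id: str, input_tokens: int,
                      output_tokens: int, cost_usd: float) -> list[dict]:
    """Build a PutMetricData payload for one request."""
    dims = [{"Name": "SessionId", "Value": session_id}]
    return [
        {"MetricName": "InputTokens", "Dimensions": dims,
         "Value": input_tokens, "Unit": "Count"},
        {"MetricName": "OutputTokens", "Dimensions": dims,
         "Value": output_tokens, "Unit": "Count"},
        {"MetricName": "CostPerQuery", "Dimensions": dims,
         "Value": cost_usd, "Unit": "None"},
    ]

data = build_metric_data("sess-42", input_tokens=812, output_tokens=256,
                         cost_usd=0.0041)

# With credentials configured you would publish it like this:
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="GenAI/App", MetricData=data)
```

From here, cost per query and token consumption per session become first-class CloudWatch metrics you can alarm on like any other.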
X-Ray Distributed Tracing for GenAI
X-Ray traces the full request path across services, letting you pinpoint exactly where latency or errors originate.
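A minimal instrumentation sketch: subsegments around the retrieval and generation steps so the trace shows where time is spent. It assumes the aws-xray-sdk package and an active segment (e.g. Lambda with tracing enabled); without the SDK it falls back to a no-op so the handler still runs locally.

```python
# Sketch: X-Ray subsegments around retrieval and generation. Assumes the
# aws-xray-sdk package; falls back to a no-op context manager when the
# SDK is absent so the code also runs outside AWS.
from contextlib import contextmanager

try:
    from aws_xray_sdk.core import xray_recorder, patch_all
    xray_recorder.configure(context_missing="LOG_ERROR")  # don't raise outside AWS
    patch_all()  # auto-instruments boto3/requests calls
    subsegment = xray_recorder.in_subsegment
except ImportError:
    @contextmanager
    def subsegment(name):  # no-op stand-in when the SDK is absent
        yield

def handle_query(query: str) -> str:
    with subsegment("retrieve"):
        chunks = ["chunk-1", "chunk-2"]  # stand-in for a KB retrieval call
    with subsegment("generate"):
        answer = f"answer using {len(chunks)} chunks"  # stand-in for invoke_model
    return answer
```

In the X-Ray console, the "retrieve" and "generate" subsegments then appear as separate timed spans under each request.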
Anomaly Detection & Alerting
| Anomaly Type | Detection Method | Response |
|---|---|---|
| Token cost spike | CloudWatch metric alarm on InputTokenCount sum | SNS alert → Lambda auto-throttle user sessions |
| Guardrail block surge | CloudWatch alarm on GuardrailCoverage > threshold | Review prompt injection attack patterns in logs |
| Retrieval quality drop | CloudWatch custom metric for retrieval score | Trigger KB re-sync or embedding model refresh |
| Latency regression | CloudWatch Anomaly Detection band on P99 latency | Auto-scale or switch to provisioned throughput |
| Hallucination rate increase | Automated evaluator Lambda checking factuality score | Alert on-call; roll back prompt template |
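The token cost spike row above maps to a straightforward metric alarm. A sketch where the threshold and SNS topic ARN are placeholders; the parameters are built as a plain dict so they can be inspected before calling put_metric_alarm.

```python
# Sketch: CloudWatch alarm for the "token cost spike" row above.
# AWS/Bedrock and InputTokenCount are the Bedrock runtime metrics;
# the threshold and topic ARN are illustrative placeholders.

def token_spike_alarm(model_id: str, threshold: int, topic_arn: str) -> dict:
    """Parameters for cloudwatch.put_metric_alarm(**params)."""
    return {
        "AlarmName": f"bedrock-token-spike-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": 300,                   # 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],     # SNS topic -> auto-throttle Lambda
    }

params = token_spike_alarm(
    "anthropic.claude-3-haiku",
    threshold=500_000,
    topic_arn="arn:aws:sns:us-east-1:123456789012:genai-alerts")
# boto3.client("cloudwatch").put_metric_alarm(**params)
```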
Vector Database Monitoring Domain 4.3
RAG system performance depends heavily on vector database health. These metrics are specific to high-dimensional operations.
Key Performance Indicators
| KPI | Healthy Range | Problem Indicator |
|---|---|---|
| Query latency P99 | < 100ms for HNSW | > 500ms → index not warmed, over-sharded, or undersized |
| Recall@10 | > 0.95 | < 0.90 → HNSW efSearch too low; index quality degraded |
| Index memory usage | < 75% of available RAM | > 90% → risk of OOM; add nodes or reduce vector dimension |
| Shard size / count | ~10GB per shard | Too many small shards → overhead; too few → hot shards |
OpenSearch Auto-Tune vs. Manual Tuning
OpenSearch Auto-Tune (Built-In)
- Automatically optimizes JVM heap sizing
- Adjusts shard allocation across nodes
- Optimizes cache settings (field data, query cache)
- Schedules optimizations during low-traffic windows
- Zero new infrastructure — just enable it
- Exam answer for "least ops overhead"
Manual Tuning Options
- HNSW parameters: m (connections), ef_construction (build quality), ef_search (query quality)
- Refresh interval: increase for write-heavy workloads
- Segment merging: control merge policy for index size
- Replica count: 1 replica = HA; 0 = write performance only
⚠ Exam Trap: OpenSearch + ElastiCache
- Adding ElastiCache in front of OpenSearch = additional service to manage, additional cost, cache invalidation complexity
- The exam question says "least operational overhead" → Auto-Tune wins over ElastiCache every time
- Only add ElastiCache when the need is specifically caching identical repeated queries (not general latency improvement)
Knowledge Base Ingestion Monitoring
When using Bedrock Knowledge Bases, ingestion failures log to CloudWatch Logs — not CloudTrail. Know the error codes:
| Error Code | Meaning | Fix |
|---|---|---|
| RESOURCE_IGNORED | Document skipped (unsupported format, too large, or duplicate) | Check file format + size limits; deduplicate source data |
| EMBEDDING_FAILED | Embedding model call failed (timeout, rate limit) | Check embedding model quota; add retry with exponential backoff |
| INDEXING_FAILED | Vector store write failed (OpenSearch down, auth error) | Check OpenSearch cluster health; verify IAM permissions for KB |
| METADATA_EXTRACTION_FAILED | Metadata parsing failed (malformed JSON/tags) | Validate metadata format before ingestion |
Diagnostic Framework Domain 5.2
When a GenAI application misbehaves, the first step is isolating which layer is the problem. This framework guides you to the right diagnosis quickly.
The Four-Layer Diagnostic Model
Layer 1: Retrieval
- Wrong documents retrieved
- No results returned
- Irrelevant chunks in context
- Stale / outdated information
- Low recall (missing correct docs)
Layer 2: Generation
- Hallucination (fabricated facts)
- Off-topic or irrelevant response
- Format errors (wrong JSON, missing fields)
- Truncated responses
- Inconsistent style or language
Layer 3: Infrastructure
- High latency (> expected)
- Throttling errors (429)
- Timeout errors
- Cold start delays
- Memory / compute exhaustion
Layer 4: Data / Input
- PII leakage in output
- Prompt injection detected
- Guardrail blocks legitimate content
- Data quality issues in source docs
- Schema drift in structured inputs
Step-by-Step Diagnostic Decision Tree
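The four layers above can be triaged with a first-pass mapping from symptom to layer. A minimal sketch; the keyword lists are illustrative stand-ins for real signals (metrics, X-Ray traces, guardrail trace output), not a production classifier.

```python
# Sketch: first-pass triage mapping a reported symptom onto the
# four-layer diagnostic model. Keyword matching is illustrative only;
# real triage should use structured signals, not string search.

LAYER_SIGNALS = {
    "retrieval": ["wrong documents", "no results", "irrelevant chunks", "stale"],
    "generation": ["hallucination", "off-topic", "malformed json", "truncated"],
    "infrastructure": ["latency", "throttl", "timeout", "cold start", "oom"],
    "data-input": ["pii", "prompt injection", "guardrail block", "schema drift"],
}

def triage(symptom: str) -> str:
    s = symptom.lower()
    for layer, signals in LAYER_SIGNALS.items():
        if any(sig in s for sig in signals):
            return layer
    return "unknown -- start with an X-Ray trace"
```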
- Guardrail blocks: run with trace: ENABLED; identify which GuardrailPolicyType triggered; adjust policy thresholds.
Retrieval Failure Troubleshooting Domain 5.2
Most RAG quality problems trace back to retrieval — wrong chunks, missing chunks, or irrelevant context being passed to the model.
Common Retrieval Failure Patterns
| Symptom | Root Cause | Fix |
|---|---|---|
| Always returns same documents regardless of query | Embedding model saturating (all vectors similar); poor chunking | Increase chunk diversity; try different embedding model; add metadata filtering |
| Returns 0 results | Index empty / not synced; embedding dimension mismatch; query preprocessing issue | Check KB sync status in CloudWatch; verify embedding dimensions match between index and query |
| Returns technically relevant but contextually wrong docs | Pure semantic search missing keyword signals | Switch to hybrid search (BM25 + semantic); add reranker |
| Retrieves outdated information | KB not re-synced after source update | Enable S3 Event Notifications → incremental sync (IngestKnowledgeBaseDocuments) |
| Retrieves correct docs but model hallucinates anyway | Chunks too large (irrelevant content dilutes signal); too many chunks | Reduce top_k; use smaller, more precise chunks; add reranker to filter top-k |
Embedding & Chunking Issues
Embedding Model Mismatch
- Document embeddings and query embeddings must use the same model
- Changing embedding model → must re-index all documents
- Check: embedding dimension (e.g., 1536 vs. 3072) matches index config
- Drift detection: monitor cosine similarity distribution over time
Chunking Strategy Fixes
- Fixed-size chunks returning wrong context: Switch to semantic chunking
- Tables being split across chunks: Use Bedrock Data Automation + chunking
- Hierarchical docs (headers + sub-content): Hierarchical chunking
- Chunks too short (context lost at boundaries): Add 10-15% overlap to preserve context across boundaries
Reranker Usage
- Retrieve top-50 docs, rerank to top-5 before passing to model
- Reranker uses cross-encoder (expensive but accurate)
- When to add: retrieval finds relevant docs but ordering is poor
- Metrics: compare NDCG@10 before and after reranker
Query Preprocessing
- HyDE: Generate a hypothetical answer, embed it, search with that
- Query expansion: Add synonyms or related terms before embedding
- Sub-query decomposition: Split complex query into 2-3 simpler searches, merge results with RRF
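The RRF merge used after sub-query decomposition is only a few lines. A minimal sketch, using the conventional k = 60 smoothing constant:

```python
# Sketch: Reciprocal Rank Fusion (RRF) to merge ranked doc-ID lists from
# sub-queries (or from BM25 + semantic search). k=60 is the conventional
# smoothing constant.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists; each doc scores sum of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([["d1", "d2", "d3"], ["d1", "d3", "d4"]])
# -> ["d1", "d3", "d2", "d4"]  (d1 tops both lists; d3 appears twice)
```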
Generation Failure Troubleshooting Domain 5.2
Generation failures happen at the model layer — hallucination, format errors, and safety blocks are the most common categories.
Hallucination Detection & Prevention
Detecting Hallucinations
- Citation checking: Verify every claim in response is supported by a retrieved chunk
- Faithfulness score: RAGAS faithfulness metric (automated)
- LLM-as-Judge: Use a second model to evaluate factual accuracy
- Ground truth comparison: Compare against known-correct answers
- Bedrock Guardrail trace: Check groundednessScore if grounding filter enabled
Prevention Techniques
- Explicit grounding instruction: "Only use information from the provided context. Say 'I don't know' if the answer isn't in the context."
- Reduce temperature: Lower temperature = less creative/hallucinated
- Citation enforcement: Require model to cite source chunk IDs
- Retrieval quality first: Hallucination drops when relevant context is present
- Smaller context window: Fewer chunks reduce off-topic generation
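Citation checking can start crude. A sketch that flags response sentences with low word overlap against any retrieved chunk; real pipelines use an NLI model or the RAGAS faithfulness metric instead of bag-of-words overlap.

```python
# Sketch: crude citation check -- flag response sentences whose word
# overlap with the retrieved chunks falls below a threshold. A real
# pipeline would use an NLI model or RAGAS faithfulness instead.
import re

def unsupported_sentences(response: str, chunks: list[str],
                          min_overlap: float = 0.5) -> list[str]:
    chunk_words = set(w.lower() for c in chunks for w in re.findall(r"\w+", c))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = [w.lower() for w in re.findall(r"\w+", sent)]
        if not words:
            continue
        overlap = sum(w in chunk_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sent)
    return flagged
```

Flagged sentences are hallucination candidates: either the model invented them, or retrieval failed to surface the supporting chunk.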
Format & Safety Block Failures
| Failure Type | Diagnosis | Fix |
|---|---|---|
| JSON output malformed | Model didn't follow schema; output truncated by maxTokens | Use JSON mode / constrained generation; increase maxTokens; provide schema in prompt with example |
| Response truncated mid-sentence | maxTokens too low | Increase maxTokens limit; or add "be concise" instruction to reduce response length |
| Guardrail blocks legitimate queries | Over-aggressive policy; topic too broad in denied topics list | Enable Guardrail trace; identify which policy triggered; narrow the policy scope or adjust sensitivity |
| PII in output despite filter | PII type not covered in sensitive info policy; entity not recognized by model | Add specific PII types to guardrail; test with Amazon Comprehend for detection validation |
| Prompt injection bypassed guardrail | Attack pattern not covered by PROMPT_ATTACK filter | Update Guardrail config; add adversarial examples to filter; use system prompt injection warnings |
Agent Failure Troubleshooting Domain 5.2
Agent failures are multi-step — the bug could be in tool selection, parameter extraction, tool execution, or response synthesis.
Agent Trace Debugging
Enable the Bedrock Agents trace to see each reasoning step.
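A sketch of trace-driven debugging: invoke_agent with enableTrace=True streams trace events alongside the response. The field path trace → trace → orchestrationTrace follows the bedrock-agent-runtime event-stream shape; confirm it against your SDK version. The parser is pure so it can be tested without AWS.

```python
# Sketch: parse Bedrock Agents trace events into readable reasoning
# steps. The nesting trace -> trace -> orchestrationTrace follows the
# bedrock-agent-runtime response shape; verify against your SDK version.

def extract_steps(events) -> list[str]:
    steps = []
    for event in events:
        orch = event.get("trace", {}).get("trace", {}).get("orchestrationTrace", {})
        if "rationale" in orch:
            steps.append("THINK: " + orch["rationale"].get("text", ""))
        if "invocationInput" in orch:
            steps.append("ACT: tool call issued")
        if "observation" in orch:
            steps.append("OBSERVE: tool result received")
    return steps

demo_events = [  # shape of real trace events, with illustrative content
    {"trace": {"trace": {"orchestrationTrace":
        {"rationale": {"text": "look up the order status"}}}}},
    {"trace": {"trace": {"orchestrationTrace": {"invocationInput": {}}}}},
    {"trace": {"trace": {"orchestrationTrace": {"observation": {}}}}},
]
steps = extract_steps(demo_events)

# Against a live agent (credentials required):
# client = boto3.client("bedrock-agent-runtime")
# response = client.invoke_agent(agentId="...", agentAliasId="...",
#                                sessionId="debug-1", inputText="...",
#                                enableTrace=True)
# print(extract_steps(response["completion"]))
```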
Common Agent Failure Patterns
| Symptom | Root Cause | Fix |
|---|---|---|
| Agent calls wrong tool | Tool descriptions are ambiguous or overlapping | Rewrite tool descriptions with explicit "use when..." guidance; disambiguate similar tools |
| Agent passes wrong parameters to tool | Parameter names/types unclear; model hallucinated values | Improve parameter descriptions; add validation in Lambda; use enum types for constrained values |
| Agent loops indefinitely | Tool returns error agent can't interpret; no exit condition | Set maxIterations; return structured error objects; add fallback "give up" instruction |
| Agent returns "I couldn't complete the task" without reason | All tools failed; guardrail blocked; token limit hit | Check trace for failed tool calls; increase token budget; review guardrail trace |
| Tool Lambda timing out | Database query too slow; external API rate-limited | Add DynamoDB caching; implement exponential backoff for external APIs; increase Lambda timeout |
| Agent ignores retrieved KB context | Tool result format unrecognized by agent; context window overflow | Standardize tool output format; reduce top-k chunks; prioritize most recent/relevant |
ThrottlingException Handling
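Throttled Bedrock calls (429 / ThrottlingException) should be absorbed client-side with exponential backoff and full jitter. A minimal sketch; the retryable check is parameterized so the helper is testable offline, and with boto3 you would match err.response["Error"]["Code"] == "ThrottlingException".

```python
# Sketch: retry with exponential backoff + full jitter for throttled
# model calls. `is_retryable` is injected so the helper runs without
# AWS; with boto3, check the ClientError code for "ThrottlingException".
import random
import time

def retry_with_backoff(fn, is_retryable, max_attempts=5, base=0.5, cap=8.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as err:
            if not is_retryable(err) or attempt == max_attempts - 1:
                raise
            # full jitter: sleep a random amount up to the capped backoff
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage with boto3 (credentials required):
# result = retry_with_backoff(
#     lambda: bedrock.invoke_model(modelId="...", body=payload),
#     lambda e: getattr(e, "response", {}).get("Error", {}).get("Code")
#               == "ThrottlingException")
```

If retries keep exhausting, that is the signal from the metrics table earlier: scale up provisioned throughput rather than retrying harder.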
Evaluation Metrics Domain 5.1
GenAI evaluation requires different metrics than traditional ML. Know which metric to use for which evaluation goal.
Metric Selection Guide
| What You're Measuring | Metric | Range & Interpretation |
|---|---|---|
| Retrieval ranking quality | NDCG@K (Normalized Discounted Cumulative Gain) | 0–1; higher = better; rewards finding relevant docs in top positions |
| First relevant result position | MRR (Mean Reciprocal Rank) | 0–1; MRR=1 means first result always relevant; good for single-answer queries |
| % of relevant docs retrieved | Recall@K | 0–1; "Did we find all relevant docs in top K?" Critical for high-recall use cases |
| % of retrieved docs that are relevant | Precision@K | 0–1; "How many of our top-K results are actually relevant?" |
| Summarization / translation quality | ROUGE (n-gram overlap) | 0–1; ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence) |
| Machine translation quality | BLEU (Bilingual Evaluation Understudy) | 0–1; compares n-gram overlap with human reference; strict on word choice |
| Semantic similarity (meaning-preserving) | BERTScore | 0–1; uses BERT embeddings; tolerates paraphrasing unlike BLEU/ROUGE |
| Language model quality | Perplexity | Lower = better; measures how well model predicts text; not suited for open-ended generation |
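The retrieval-ranking rows above are easy to compute directly. A minimal sketch over binary relevance labels, where `retrieved` is an ordered list of doc IDs and `relevant` is the ground-truth set:

```python
# Sketch: retrieval-ranking metrics from the table, over binary
# relevance (doc is relevant or not).
import math

def precision_at_k(retrieved, relevant, k):
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def mrr(retrieved, relevant):
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0
```

Note how NDCG rewards position: a relevant doc at rank 1 contributes more than the same doc at rank 5, which precision@k and recall@k cannot distinguish.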
LLM-as-Judge Evaluation
What It Is
- Use a powerful LLM (e.g., Claude Opus) to evaluate outputs of another model
- Judge scores: relevance, factuality, helpfulness, safety, coherence
- Can replace or augment human evaluators at scale
- Bedrock supports LLM-as-Judge natively in evaluation jobs
When to Use
- No ground truth available (open-ended Q&A)
- Need scalable evaluation without human labelers
- Evaluating creative or long-form content
- A/B testing two model versions at scale
Limitations
- Judge model can have same biases as evaluated model
- Positional bias: judge often prefers response listed first
- Length bias: longer responses rated higher regardless of quality
- Mitigation: swap order of responses; use multiple judges; calibrate with human labels
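The order-swap mitigation can be wrapped directly around any judge callable. A sketch where `judge` is any function (e.g. an LLM-as-Judge wrapper) that returns "first" or "second":

```python
# Sketch: positional-bias mitigation -- judge each pair twice with the
# order swapped and keep only consistent verdicts. `judge` is any
# callable returning "first" or "second".

def debiased_preference(judge, resp_a: str, resp_b: str) -> str:
    v1 = judge(resp_a, resp_b)   # A shown first
    v2 = judge(resp_b, resp_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"                 # inconsistent -> likely positional bias
```

A judge that always prefers whichever answer is listed first collapses to "tie" here, which is exactly the signal that it needs calibration against human labels.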
Evaluation Frameworks in AWS
| Framework / Service | What It Evaluates | Key Feature |
|---|---|---|
| Bedrock Model Evaluation | Built-in evaluation for Bedrock models | Automated + human evaluation jobs; supports custom metrics |
| SageMaker Clarify | Bias detection, explainability, drift | Model cards integration; fairness metrics |
| AgentCore Evaluations | Agent quality (13 built-in evaluators) | Task success rate, tool accuracy, response quality |
| RAGAS | RAG pipeline end-to-end quality | 4 core metrics: faithfulness, answer relevance, context precision, context recall |
RAGAS Framework Domain 5.1
RAGAS (Retrieval Augmented Generation Assessment) is the standard evaluation framework for RAG pipelines. Know all four metrics and what they measure independently.
The Four RAGAS Metrics
1. Faithfulness
Does the generated answer contain only information from the retrieved context? Measures hallucination at the answer level.
- Score: 0–1 (1 = fully grounded in context)
- Low score = model is hallucinating or ignoring context
- Fix: improve grounding instructions; reduce temperature
2. Answer Relevance
Does the generated answer actually address the user's question? Measures if the response is on-topic.
- Score: 0–1 (1 = directly answers the question)
- Low score = model gave tangential or off-topic response
- Fix: improve system prompt instructions; better few-shot examples
3. Context Precision
Of the retrieved chunks, how many were actually needed to answer the question? Measures retrieval relevance.
- Score: 0–1 (1 = all retrieved chunks were relevant)
- Low score = retrieval is returning noisy/irrelevant chunks
- Fix: better chunking; metadata filtering; reranker
4. Context Recall
Were all necessary pieces of information retrieved? Measures retrieval completeness.
- Score: 0–1 (1 = all needed context was retrieved)
- Low score = relevant docs are in the KB but not being retrieved
- Fix: increase top-k; improve query preprocessing; hybrid search
What RAGAS Tells You About Your Pipeline
| RAGAS Pattern | Diagnosis | Fix |
|---|---|---|
| Low Faithfulness + High Context Precision | Good retrieval but model ignores context (hallucinates) | Stronger grounding instructions; lower temperature; citation enforcement |
| High Faithfulness + Low Answer Relevance | Model uses context faithfully but answers wrong question | Better query understanding; clarify intent in system prompt |
| Low Context Precision + High Context Recall | Retrieving too many docs (noisy); recall is fine | Reduce top-k; add reranker; tighten metadata filter |
| High Context Precision + Low Context Recall | Retrieved docs are relevant but missing key information | Increase top-k; improve chunking to avoid splitting key facts |
| All four metrics low | Systemic failure — likely embedding model or KB indexing issue | Audit KB ingestion logs; test embedding model independently; re-index |
A/B Testing & Quality Assurance Domain 5.1
Systematic testing before and after changes — new model version, new prompt, new chunking strategy — prevents regressions.
A/B Testing Framework for LLM Changes
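A minimal A/B harness: deterministic hash-based assignment (so a user always sees the same variant) plus a naive score comparison. This is a sketch; a real test would add a significance check (bootstrap or t-test) before declaring a winner.

```python
# Sketch: A/B harness for LLM changes. Hash-based assignment keeps each
# user pinned to one variant; the comparison is naive (means only).
import hashlib
import statistics

def assign_variant(user_id: str, split: float = 0.5) -> str:
    """Stable assignment: same user_id always maps to the same variant."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "A" if (h % 1000) / 1000 < split else "B"

def compare(scores_a: list[float], scores_b: list[float]) -> str:
    mean_a, mean_b = statistics.mean(scores_a), statistics.mean(scores_b)
    return f"A={mean_a:.3f} B={mean_b:.3f} winner={'A' if mean_a >= mean_b else 'B'}"
```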
Regression Testing & Quality Gates
Regression Test Suite
- Maintain a golden dataset of queries with expected outputs
- Run on every deployment (CI/CD pipeline)
- Fail the pipeline if faithfulness < threshold
- Track metric trends over time (detect gradual drift)
Quality Gates
- Pre-deploy gate: Automated RAGAS score on staging dataset
- Safety gate: Guardrail test with adversarial inputs
- Performance gate: P99 latency < SLA threshold
- Cost gate: Average tokens/query within budget
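The gates above can run as a single CI step that fails the pipeline on any breach. A sketch with illustrative thresholds; tune them against your golden dataset.

```python
# Sketch: pre-deploy quality gates over evaluation scores. Threshold
# values are illustrative, not recommendations.

GATES = {
    "faithfulness": 0.90,       # hallucination guard (higher is better)
    "answer_relevance": 0.85,   # on-topic guard (higher is better)
    "p99_latency_ms": 3000,     # SLA guard (lower is better)
}

def passes_gates(scores: dict) -> tuple[bool, list[str]]:
    failures = []
    for metric, threshold in GATES.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif metric == "p99_latency_ms":       # upper-bound metric
            if value > threshold:
                failures.append(f"{metric}: {value} > {threshold}")
        elif value < threshold:                # lower-bound metrics
            failures.append(f"{metric}: {value} < {threshold}")
    return (not failures, failures)

# In CI: ok, failures = passes_gates(eval_scores); sys.exit(0 if ok else 1)
```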
Synthetic Evaluation Data
- When real labeled data is scarce, generate synthetic test cases
- Use a powerful LLM to create query-answer pairs from your KB documents
- Validate synthetic data before using (LLM can make errors)
- Mix synthetic + real data for best coverage
Security Controls Domain 3.2
Defense-in-depth for GenAI: seven layers from network to application. Know which AWS service addresses which layer.
Seven-Layer Security Model
| Layer | Control | AWS Service |
|---|---|---|
| 1. Network isolation | Keep traffic off public internet | VPC + PrivateLink (Interface VPC Endpoints) for Bedrock, S3, DynamoDB |
| 2. Identity & Access | Least-privilege IAM roles; no wildcard permissions | IAM + AWS Organizations SCPs + IAM Identity Center |
| 3. Data protection at rest | Encrypt all stored data | KMS (CMK for Bedrock KB, S3, DynamoDB); automatic key rotation |
| 4. Data protection in transit | TLS 1.2+ everywhere | Enforced by Bedrock, API Gateway; validate certificates |
| 5. Input/Output safety | Filter harmful content, PII, injections | Bedrock Guardrails; Amazon Comprehend; Amazon Macie |
| 6. Monitoring & audit | Log all access and changes | CloudTrail (API calls); CloudWatch Logs (app logs); AWS Config (resource drift) |
| 7. Fine-grained data access | Row/column level data access control | AWS Lake Formation; DynamoDB fine-grained access control |
PII Detection & Redaction
Bedrock Guardrails (PII)
- Supported actions: BLOCK, MASK, ANONYMIZE, DETECT
- Per entity type: name, SSN, credit card, phone, email, address
- Applied to: input (pre-model), output (post-model), or both
- Only one global blockedInputMessaging and blockedOutputsMessaging — not per entity type
Amazon Comprehend
- PII detection and redaction in text at scale
- Supports 100+ PII entity types
- Redact mode: replace with [PII] placeholder
- Use for: batch document scanning, preprocessing before KB ingestion
- Also: sentiment analysis, entity recognition, key phrase extraction
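A sketch of offset-based redaction using detect_pii_entities output. The redactor itself is pure (it works on BeginOffset/EndOffset spans) so it runs offline; the Comprehend API call is shown commented out.

```python
# Sketch: redact PII spans returned by Comprehend detect_pii_entities.
# Entities carry BeginOffset/EndOffset character positions.

def redact(text: str, entities: list[dict], placeholder: str = "[PII]") -> str:
    """Replace each entity span with a placeholder, left to right."""
    out, cursor = [], 0
    for e in sorted(entities, key=lambda e: e["BeginOffset"]):
        out.append(text[cursor:e["BeginOffset"]])
        out.append(placeholder)
        cursor = e["EndOffset"]
    out.append(text[cursor:])
    return "".join(out)

# With credentials configured:
# resp = boto3.client("comprehend").detect_pii_entities(
#     Text=text, LanguageCode="en")
# clean = redact(text, resp["Entities"])
```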
Amazon Macie
- Automated sensitive data discovery in S3 buckets
- Detects: PII, financial data, credentials in stored files
- Use for: ongoing monitoring of S3 data lake for sensitive content
- Alerts when new sensitive data appears in monitored buckets
IAM Enforcement for Guardrails
Force every model invocation through an approved guardrail by attaching an IAM policy that denies bedrock:InvokeModel unless the request includes the bedrock:GuardrailIdentifier condition key. This is stronger than documentation or guidelines — it's technically enforced at the API level.
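A sketch of the Deny policy, expressed as a Python dict for readability; the account ID, guardrail ARN, and Sid are placeholders. With StringNotEquals, requests that omit the guardrail key entirely also match the Deny, which is the point.

```python
# Sketch: IAM policy denying InvokeModel unless the request carries the
# approved guardrail. ARN and account ID are placeholders.
import json

deny_without_guardrail = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInvokeWithoutGuardrail",
        "Effect": "Deny",
        "Action": ["bedrock:InvokeModel",
                   "bedrock:InvokeModelWithResponseStream"],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {
                # requests without this key also match -> denied
                "bedrock:GuardrailIdentifier":
                    "arn:aws:bedrock:us-east-1:123456789012:guardrail/gr-example"
            }
        }
    }]
}
policy_json = json.dumps(deny_without_guardrail, indent=2)
```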
AI Governance & Compliance Domain 3.3–3.4
Governance ensures AI systems are deployed responsibly, with audit trails, model versioning, and compliance frameworks in place.
Governance, Security, and Compliance Triad
Governance
- Framework of rules, practices, and processes
- Defines who has authority and accountability
- SageMaker Model Registry: version control for models
- SageMaker Model Cards: document model metadata, intended use, limitations
- AWS Organizations SCPs: prevent policy violations across accounts
Compliance
- Adhering to laws, regulations, and internal policies
- HIPAA: PHI encryption + access logging + BAA with AWS
- SOC2: audit logging, access controls, change management
- FedRAMP: requires GovCloud region; PrivateLink required
- GDPR: right to erasure → document data lineage
- AWS Artifact: download compliance reports
Audit & Monitoring
- CloudTrail: All API calls logged (who, when, what) — immutable
- AWS Config: Resource configuration drift detection
- CloudWatch Logs: Application-level audit trail
- S3 Object Lock: WORM compliance for audit logs
- Macie: Monitor for accidental PII exposure in S3
SageMaker Model Cards
Model Cards provide structured documentation for ML models — a governance requirement for responsible AI deployment.
What Model Cards Document
- Model description and intended use cases
- Training data description and known biases
- Evaluation results (performance metrics by demographic)
- Ethical considerations and limitations
- Model version and deployment history
Integration with Model Registry
- Each model version in SageMaker Model Registry links to its Model Card
- Approval workflow: models require approval before prod deployment
- SageMaker Clarify generates bias reports attached to Model Cards
- SageMaker Model Dashboard: unified view of all deployed models
Responsible AI — Bias & Explainability
| Tool / Feature | Purpose | When to Use |
|---|---|---|
| SageMaker Clarify | Bias detection in training data & model predictions; feature importance (SHAP) | Before and after model training to audit fairness |
| SageMaker Model Monitor | Detect data drift and model quality drift in production | After deployment to catch degradation over time |
| Bedrock Guardrails (grounding) | Prevent model from making claims not supported by retrieved context | RAG applications where hallucination is a compliance risk |
| Amazon Augmented AI (A2I) | Human review workflow for low-confidence AI predictions | When automated decisions need human oversight (lending, healthcare) |
CloudWatch Insights Queries
Ready-to-use CloudWatch Logs Insights queries for diagnosing common GenAI application issues.
Token Usage & Cost Queries
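A starting point, assuming Bedrock model invocation logging is enabled to CloudWatch Logs; the input.inputTokenCount / output.outputTokenCount field names follow the invocation-log JSON schema, so verify them against your log group.

```
# Hourly token consumption by model -- field names assume the Bedrock
# model invocation log schema; verify against your log group
fields @timestamp, modelId, input.inputTokenCount, output.outputTokenCount
| stats sum(input.inputTokenCount) as inTokens,
        sum(output.outputTokenCount) as outTokens by modelId, bin(1h)
| sort inTokens desc
```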
Latency & Performance Queries
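A sketch that assumes your application writes a JSON log line with a latencyMs field per request (our naming, not a standard); pct() is the built-in Logs Insights percentile function.

```
# Latency percentiles over 5-minute windows -- assumes an app-emitted
# JSON field `latencyMs`; adjust to your own log schema
filter ispresent(latencyMs)
| stats pct(latencyMs, 50) as p50, pct(latencyMs, 90) as p90,
        pct(latencyMs, 99) as p99 by bin(5m)
```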
Error & Guardrail Queries
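A sketch for spotting guardrail interventions and throttles in application logs; the matched strings are illustrative patterns, not fixed log formats.

```
# Guardrail interventions and throttle errors per hour -- the matched
# strings are illustrative; align them with your actual log output
fields @timestamp, @message
| filter @message like /GUARDRAIL_INTERVENED/ or @message like /ThrottlingException/
| stats count(*) as hits by bin(1h)
| sort hits desc
```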
Agent Debugging Queries
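A sketch assuming agent trace events are forwarded to CloudWatch Logs as JSON; the failureReason parse pattern is an assumption to adapt to your trace format.

```
# Most common agent tool-call failure reasons -- assumes agent traces
# are logged as JSON; adapt the parse pattern to your trace format
fields @timestamp, @message
| filter @message like /orchestrationTrace/
| parse @message '"failureReason":"*"' as failureReason
| filter ispresent(failureReason)
| stats count(*) as failures by failureReason
| sort failures desc
```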
Exam Traps — Monitoring, Troubleshooting & Evaluation
The most common wrong-answer choices in Domain 4 and 5 questions.
Monitoring & Troubleshooting Traps
| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| KB ingestion failures | CloudTrail (logs API calls) | CloudWatch Logs (EMBEDDING_FAILED, INDEXING_FAILED, RESOURCE_IGNORED) |
| Which guardrail policy blocked? | CloudTrail events | GuardrailPolicyType CloudWatch metric dimension |
| OpenSearch latency with least ops overhead | Add ElastiCache in front | Enable OpenSearch Auto-Tune (built-in, no new service) |
| Multi-service latency attribution | CloudWatch Logs only | AWS X-Ray (shows per-service breakdown in trace) |
| Guardrail per-action-type custom message | "Bedrock Guardrails settings per PII type" | Not possible — only one global blockedInputMessaging + blockedOutputsMessaging per guardrail |
| Force guardrail use on all API calls | Custom proxy Lambda | IAM Deny with bedrock:GuardrailIdentifier condition key |
Evaluation Metric Traps
| Trap | Wrong Answer | Correct Answer |
|---|---|---|
| Measuring if retrieved docs are in correct order | ROUGE | NDCG@K (accounts for position of relevant documents) |
| Measuring summarization quality | BLEU | ROUGE (BLEU is for translation; ROUGE for summarization) |
| Semantic similarity (paraphrase-tolerant) | BLEU or ROUGE | BERTScore (uses embeddings, not n-gram overlap) |
| RAG faithfulness (no hallucination) | BLEU score | RAGAS Faithfulness metric |
| Retrieval missing relevant docs | Low Context Precision | Low Context Recall (Precision = noise in retrieved; Recall = missed relevant) |
| A/B testing LLM versions for correctness | Run on 5 examples manually | Automated evaluation with LLM-as-Judge on 100+ query golden dataset |