Bedrock Evaluation - The Full Picture
Automatic model evaluation: Bedrock runs the evaluation pipeline for you using built-in or custom metrics. You provide a prompt dataset and a judge model. Bedrock invokes each candidate FM, judges the outputs, and writes a report to S3.
Human-based evaluation: You define the evaluation criteria and bring your own expert workforce. Bedrock routes outputs to human reviewers through a workflow you configure.
The API is CreateEvaluationJob. Use a consistent evaluator (judge) model for ALL candidate FMs in the same comparison - different judge models give incomparable scores.
Running batch inference jobs + custom Lambda LLM-judge + Spearman correlation is reinventing the wheel. CreateEvaluationJob does all of this as a managed service - parallel, consistent, results to S3. Your Lambda only needs to do post-processing analysis on the S3 results.
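The managed path can be sketched as assembling one CreateEvaluationJob request per comparison. This is a sketch only: the exact evaluationConfig shape should be verified against the current Bedrock API reference, and the role ARN, S3 URIs, and model IDs below are placeholders.

```python
# Sketch: build a CreateEvaluationJob request (automatic evaluation).
# Field shapes are hedged -- verify against the current Bedrock API
# reference. Role ARN, S3 URIs, and model IDs are placeholders.

def build_evaluation_job_request(job_name: str, candidate_model_ids: list[str],
                                 judge_model_id: str, dataset_s3_uri: str,
                                 output_s3_uri: str, role_arn: str) -> dict:
    """Assemble kwargs for bedrock.create_evaluation_job.

    One judge model for ALL candidate FMs, so scores stay comparable.
    """
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "General",
                    "dataset": {"name": "custom",
                                "datasetLocation": {"s3Uri": dataset_s3_uri}},
                    "metricNames": ["Builtin.Correctness", "Builtin.Helpfulness"],
                }],
                # Same judge for every candidate FM in the comparison.
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{"modelIdentifier": judge_model_id}]
                },
            }
        },
        "inferenceConfig": {
            "models": [{"bedrockModel": {"modelIdentifier": mid}}
                       for mid in candidate_model_ids]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }

request = build_evaluation_job_request(
    job_name="compare-candidates",
    candidate_model_ids=["candidate-model-a", "candidate-model-b"],  # placeholders
    judge_model_id="judge-model",                                    # same judge for all
    dataset_s3_uri="s3://my-eval-bucket/prompts.jsonl",              # placeholder
    output_s3_uri="s3://my-eval-bucket/results/",                    # placeholder
    role_arn="arn:aws:iam::123456789012:role/my-eval-role",          # placeholder
)
# boto3.client("bedrock").create_evaluation_job(**request)  # the actual call
```

The report then lands under the outputDataConfig S3 prefix, which is what the post-processing Lambda reads.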
Evaluation Dataset Hard Limit - 1,000 Prompts
Bedrock automatic model evaluation jobs have a hard limit of 1,000 prompts per dataset per job. If your dataset has 5,000 prompts, split it into 5 jobs of 1,000 each.
S3 bucket versioning, CORS configuration, file compression - none of these affect the prompt limit. The limit is in Bedrock's evaluation job API, not in S3. The only fix is to split the dataset.
Evaluation job fails with a size/quota error → split your dataset. Not a storage issue, not a permissions issue - a Bedrock quota issue.
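The split itself is trivial; a minimal sketch, assuming a list of prompt records that would each become one line of a JSONL dataset file:

```python
# Split a prompt dataset into chunks that fit Bedrock's
# 1,000-prompts-per-job limit -- one chunk per evaluation job.

MAX_PROMPTS_PER_JOB = 1000

def split_dataset(prompts: list[dict], limit: int = MAX_PROMPTS_PER_JOB) -> list[list[dict]]:
    """Return consecutive chunks of at most `limit` prompts."""
    return [prompts[i:i + limit] for i in range(0, len(prompts), limit)]

# 5,000 prompts -> 5 jobs of 1,000 each
dataset = [{"prompt": f"question {i}"} for i in range(5000)]
chunks = split_dataset(dataset)
print(len(chunks))  # 5
```

Each chunk is then written to its own JSONL file in S3 and submitted as its own evaluation job.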
Human Evaluation Workforce - Cognito, Not Ground Truth
For Bedrock human-based model evaluation with a custom expert workforce: create an Amazon Cognito user pool, add your experts as users, then assign them to a work team in Bedrock's evaluation configuration.
Amazon SageMaker Ground Truth = labeling service for ML training datasets (image annotation, text classification for model training). It is NOT the workforce tool for Bedrock model quality evaluation. Different product, different purpose.
Bedrock human eval workforce = Cognito user pools. SageMaker labeling workforce = SageMaker Ground Truth. The exam swaps these frequently.
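The Cognito side of this can be sketched with the cognito-idp API. Pool name and expert emails below are placeholders, the calls are shown commented out, and the final work-team assignment happens in Bedrock's human-evaluation job configuration:

```python
# Sketch: request payloads for creating an expert workforce in Cognito.
# Pool name and emails are placeholders. The user pool is what Bedrock's
# human-based evaluation work team points at -- NOT SageMaker Ground Truth.

def build_user_pool_request(pool_name: str) -> dict:
    return {"PoolName": pool_name, "AutoVerifiedAttributes": ["email"]}

def build_create_user_requests(pool_id: str, expert_emails: list[str]) -> list[dict]:
    return [
        {
            "UserPoolId": pool_id,
            "Username": email,
            "UserAttributes": [{"Name": "email", "Value": email}],
        }
        for email in expert_emails
    ]

pool_req = build_user_pool_request("bedrock-eval-experts")   # placeholder name
user_reqs = build_create_user_requests(
    "us-east-1_EXAMPLE",                                     # placeholder pool id
    ["expert1@example.com", "expert2@example.com"],
)
# cognito = boto3.client("cognito-idp")
# pool = cognito.create_user_pool(**pool_req)
# for req in user_reqs:
#     cognito.admin_create_user(**req)
```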
Custom Metrics for Domain-Specific Evaluation
Industry-standard benchmarks (MMLU, HumanEval, GLUE) measure general model capability. They tell you nothing about whether the model writes in your brand's tone or uses your company's preferred formality level.
For company-specific quality criteria (brand tone, formality scale, domain vocabulary): create a human-validated custom evaluation dataset + define custom metrics that match your criteria.
Scenario: "evaluate model for company-specific tone and formality, RAG knowledge system" → custom dataset + custom metrics. NOT industry benchmark (MMLU etc.) + generic human eval.
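A custom dataset is just JSONL with prompts and human-validated reference responses. A sketch, assuming the prompt/referenceResponse/category field names used by Bedrock evaluation datasets (verify against the current docs); the example prompts and categories are invented:

```python
import json

# Sketch: write a human-validated custom evaluation dataset as JSONL.
# Field names (prompt / referenceResponse / category) follow the Bedrock
# evaluation dataset convention -- verify against the current docs.
examples = [
    {
        "prompt": "Draft a reply to a customer asking about order delays.",
        "referenceResponse": "Reply in our brand voice: warm, formal, no slang.",
        "category": "brand-tone",   # bucket for a custom metric
    },
    {
        "prompt": "Summarize this policy for an internal engineer.",
        "referenceResponse": "Concise, uses our domain vocabulary.",
        "category": "formality",
    },
]

with open("custom_eval_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```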
Evaluation Process - Know the Output
When the exam asks about a systematic evaluation process for replacing a model in production, the final deliverable is an evaluation report with analysis - not just creating a test dataset.
Evaluation process order: define metrics → create/select dataset → run evaluation job → analyze results + generate comprehensive evaluation report → make go/no-go decision for model swap.
The trap is stopping at "create a diverse test dataset." That's step 1. The answer the exam wants is the analysis and report generation - what you do WITH the evaluation results.
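That last step can be sketched as a small post-processing pass over per-prompt judge scores, producing the report summary and the go/no-go call. The scores and the 0.8 threshold below are hypothetical:

```python
# Sketch: turn raw per-prompt evaluation scores into a report summary
# plus a go/no-go decision. Scores and the 0.8 threshold are hypothetical.

def summarize(results: list[dict], metric: str, threshold: float = 0.8) -> dict:
    scores = [r[metric] for r in results]
    mean = sum(scores) / len(scores)
    return {
        "metric": metric,
        "mean_score": round(mean, 3),
        "num_prompts": len(scores),
        "worst_score": min(scores),
        "decision": "go" if mean >= threshold else "no-go",
    }

candidate_results = [
    {"correctness": 0.91}, {"correctness": 0.88}, {"correctness": 0.76},
]
report = summarize(candidate_results, "correctness")
print(report["decision"])  # mean 0.85 >= 0.8 -> "go"
```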
Output Traceability - Tag at Generation Time
When you want to trace which knowledge base documents influenced a specific AI response, the pattern is to tag the FM output with source metadata (document ID, chunk reference, source URL) at the time of generation.
Model invocation logs capture what went in and came out of the LLM. They do not tell you which retrieved KB chunk was the actual source of information. You have to add that context explicitly at generation time.
Scenario: "content generation system, need to verify which data source influenced each output" → tag outputs with source metadata during generation. Not invocation logging alone.
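A sketch of the tag-at-generation pattern, assuming a RAG flow where retrieved chunks already carry document IDs; the traced-output shape and field names are illustrative, not a Bedrock API structure:

```python
from datetime import datetime, timezone

# Sketch: attach source metadata to an FM output at generation time.
# The output shape is illustrative -- the point is that attribution is
# recorded WHEN the response is generated, because invocation logs alone
# cannot recover which KB chunk was the source.

def tag_output(generated_text: str, retrieved_chunks: list[dict]) -> dict:
    return {
        "output": generated_text,
        "sources": [
            {
                "document_id": c["document_id"],
                "chunk_ref": c["chunk_ref"],
                "source_url": c.get("source_url"),
            }
            for c in retrieved_chunks
        ],
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

chunks = [
    {"document_id": "kb-doc-42", "chunk_ref": "chunk-3",
     "source_url": "https://example.com/policy"},
]
traced = tag_output("Refunds are processed within 5 business days.", chunks)
```

The traced record (not just the bare completion) is what gets persisted, so every response can be walked back to its KB documents.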
Guardrails Contextual Grounding for RAG Quality
Contextual grounding is technically a guardrails feature but it shows up in evaluation scenarios. It scores every response against the retrieved context and blocks responses that aren't grounded.
SageMaker Clarify does not evaluate RAG quality - it's for ML model explainability and bias detection in traditional ML. For RAG trustworthiness → Bedrock Guardrails contextual grounding.
Scenario: "scalable, trustworthy evaluation of RAG application output quality" → configure Bedrock Guardrails with contextual grounding checks. Automated, no infrastructure needed.
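A sketch of that configuration as a create_guardrail request; the guardrail name and the 0.75 thresholds are placeholders, and the payload shape should be verified against the current API reference:

```python
# Sketch: Bedrock Guardrail with contextual grounding checks.
# Name and thresholds (0.0-1.0) are placeholders -- responses scoring
# below a threshold are blocked as ungrounded or irrelevant to the
# retrieved context. Verify the shape against the current API reference.

guardrail_request = {
    "name": "rag-grounding-guardrail",  # placeholder name
    "blockedInputMessaging": "Request blocked.",
    "blockedOutputsMessaging": "Response was not grounded in the source documents.",
    "contextualGroundingPolicyConfig": {
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},  # grounded in retrieved context?
            {"type": "RELEVANCE", "threshold": 0.75},  # relevant to the user query?
        ]
    },
}
# boto3.client("bedrock").create_guardrail(**guardrail_request)  # the actual call
```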
Bias Evaluation - BOLD Dataset via Bedrock Eval
For evaluating demographic bias in text generation (not classification bias): use Bedrock model evaluation + BOLD dataset. BOLD (Bias in Open-Ended Language Generation) is specifically designed for this.
BOLD = demographic bias in text generation. RealToxicityPrompts = toxicity (not the same as bias). WikiText2 = general text quality. T-REx = factual knowledge. Know these dataset names - the exam uses them as distractors.