🔵 Domain 4 & 5 — Operations, Evaluation & Troubleshooting

Domain 4: Operational Efficiency · Domain 5: Testing, Validation & Troubleshooting

📊 Domain 4 — Operational Efficiency & Optimization

4.1 — Cost Optimization & Resource Efficiency

Batch Inference for High-Volume Cost Efficiency

Batch inference jobs = submit all your inputs at once (stored in S3), Bedrock processes them offline and writes results back to S3. Significantly cheaper than real-time inference for large volumes because you're not paying for always-on capacity.
Pattern: S3 (input JSONL) → CreateModelInvocationJob API → processing → S3 (output JSONL). No real-time latency requirement = batch. Combine with inference profiles to distribute across regions for even higher throughput.
Synchronous API calls + Lambda concurrent invocations can scale, but you pay real-time pricing for everything. For millions of documents, batch inference is the cost-efficient approach. Real-time = pay premium for low latency. Batch = pay less for high throughput.
Scenario: "process millions of content summaries, maximize throughput, cost-effective" → batch inference jobs + S3 I/O configuration + inference profiles for workload distribution.
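Preparing the batch input can be sketched as follows. This is a minimal sketch assuming the documented batch record shape (a `recordId` plus a model-specific `modelInput`) and an Anthropic-style message body; verify the exact `modelInput` fields for the model you target:

```python
import json

def build_batch_input(prompts, max_tokens=512):
    """Format prompts as JSONL records for a Bedrock batch inference job:
    one JSON object per line, each with a recordId and a modelInput."""
    lines = []
    for i, prompt in enumerate(prompts):
        record = {
            "recordId": f"REC{i:07d}",
            # modelInput is model-specific; this is the Anthropic Messages shape
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

# Upload the JSONL to S3, then start the job with
# bedrock.create_model_invocation_job(inputDataConfig=..., outputDataConfig=...).
```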

Inference Profiles for Cost Attribution

Inference profiles tag your Bedrock API calls with a logical identifier (profile ARN). AWS Cost Explorer then shows costs grouped by profile — perfect for per-team, per-product, or per-environment cost visibility.
One profile per cost center. Invoke models via profile ARN instead of model ARN directly. The profile knows which model to use — you don't change the model, just the entry point.
Scenario: "need to track Bedrock costs per business unit / per clinic / per department" → inference profiles. One profile per unit. Cost reports break down automatically.
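The entry-point swap can be sketched as below. The profile ARNs and unit names are hypothetical; the point is that only the identifier passed as `modelId` changes, not the model or the call itself:

```python
# Hypothetical application inference profile ARNs -- one per cost center.
PROFILE_ARNS = {
    "radiology": "arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/radiology-prof",
    "billing":   "arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/billing-prof",
}

def model_id_for(unit):
    """Resolve the invocation target: the profile ARN replaces the model ARN."""
    return PROFILE_ARNS[unit]

# Invocation is unchanged apart from the identifier, e.g.:
# bedrock_runtime.converse(modelId=model_id_for("billing"), messages=[...])
```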

Model Distillation for Cost Reduction

Run inference on a smaller, cheaper student model that has learned to mimic the expensive teacher model. Token costs drop substantially while accuracy remains high for the trained tasks.
The input to distillation = only prompts (not prompt-response pairs). Bedrock runs the teacher model against those prompts during distillation training to generate the training signal internally.
Scenario: "high accuracy application currently using Nova Pro, need to cut costs, maintain accuracy" → distillation to Nova Lite (student) using Nova Pro (teacher). Supply prompts from invocation logs — Bedrock generates its own responses for training.

4.2 — Optimize Application Performance

Semantic Caching — OpenSearch k-NN

Semantic caching avoids FM invocations entirely for queries that are semantically similar to previously-seen ones. "What's the return policy?" and "Can I return a product?" should both hit the same cache entry.
Architecture: incoming query → Lambda generates embedding → query OpenSearch k-NN vector index for similar past queries → if cosine similarity exceeds threshold → return cached response → skip Bedrock entirely.
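The threshold decision at the heart of the cache can be sketched in plain Python. In production the nearest-neighbor search runs inside OpenSearch; this only shows the similarity gate logic:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cache_lookup(query_vec, cache, threshold=0.9):
    """cache: list of (embedding, cached_response) pairs. Return the response
    of the most similar past query if it clears the threshold, else None
    (meaning: fall through to a real Bedrock invocation)."""
    best_score, best_resp = -1.0, None
    for vec, resp in cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_resp = score, resp
    return best_resp if best_score >= threshold else None
```

Tuning the threshold is the key design choice: too low and dissimilar questions get stale answers, too high and the hit rate collapses toward exact matching.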

OpenSearch k-NN Semantic Cache ✓

  • Stores embeddings of past queries
  • Matches by meaning, not exact text
  • High cache hit rate for conversational AI
  • Similar questions → same cache hit

ElastiCache Redis Exact-Match ✗

  • Stores query string as key
  • Only matches identical strings
  • Low hit rate — users never type the same thing twice
  • NOT semantic — just fast key-value lookup

Response Streaming for Perceived Latency

Streaming sends tokens to the user as they're generated instead of waiting for the full response. A 6-second response feels much faster if the first tokens appear in 400ms.
Streaming = better perceived latency. It does NOT make the model faster or more accurate. The total tokens generated is the same — you're just delivering them progressively.
Scenario: "users experience long wait times before seeing any output" → implement streaming. Scenario: "users receive wrong product information" → RAG. Never confuse these two.
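The consumption side can be sketched like this. The event shape below is a simplified stand-in for real response-stream chunks (e.g. from `invoke_model_with_response_stream`), so adapt the field access to the actual chunk format of your model:

```python
def consume_stream(events, on_token=lambda t: None):
    """Drain a token stream, handing each text delta to the UI as it arrives.
    The user starts reading at the first delta instead of waiting for the end."""
    parts = []
    for event in events:
        delta = event.get("delta", {}).get("text", "")
        if delta:
            on_token(delta)   # paint immediately: this is the perceived-latency win
            parts.append(delta)
    return "".join(parts)     # same total output as a non-streamed call
```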

Inference Profile HA — Primary + Secondary Region

Configure an inference profile with a primary region and a secondary (failover) region. If the primary region's model capacity is unavailable, requests automatically route to the secondary.
This is NOT load balancing / round-robin. The secondary only activates when the primary fails. If you want true geographic load distribution, that's a different architecture (e.g., multiple endpoints with Route 53 latency routing).
Scenario: "need high availability for Bedrock, handle regional outages" → inference profile with primary + secondary region config.

4.3 — Monitoring Systems for GenAI Applications

CloudWatch GenAI Observability — Purpose-Built

Amazon CloudWatch GenAI observability is a purpose-built monitoring capability for generative AI applications using Lambda and Bedrock. It automatically collects model invocation metrics, latency, token counts, and error rates without custom instrumentation.
CloudWatch Application Signals + X-Ray is for distributed tracing of microservices. While useful for general apps, it wasn't built for GenAI metrics. The exam expects you to know that CloudWatch GenAI observability exists and is the right choice for Lambda + Bedrock apps.
Scenario: "serverless app uses Lambda + Bedrock, need to monitor model performance" → CloudWatch GenAI observability. Not Application Signals + X-Ray.

CloudWatch Dashboard for RAG Performance (Not X-Ray)

For troubleshooting RAG performance (slow retrievals, degraded KB queries), build a custom CloudWatch dashboard combining: context retrieval latency metrics + OpenSearch operation counts + Bedrock invocation log analysis.
X-Ray traces show you the call graph for individual requests, but for RAG performance analysis you need aggregated metrics across many requests — that's what CloudWatch dashboards give you. X-Ray is great for debugging individual slow requests, not for identifying systematic retrieval degradation patterns.
Bedrock invocation logs → CloudWatch Logs → CloudWatch Logs Insights queries → identify which KB queries are consistently slow or returning low-quality results.
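A hedged sketch of such a Logs Insights query, built as a Python string. The field names (`modelId`, `input.inputTokenCount`, `output.outputTokenCount`) are assumptions about the invocation-log JSON shape and should be checked against your actual log group:

```python
# Aggregate Bedrock invocation logs per model: call volume and average output size.
QUERY = """
fields @timestamp, modelId, input.inputTokenCount, output.outputTokenCount
| filter ispresent(modelId)
| stats avg(output.outputTokenCount) as avgOut, count(*) as calls by modelId
| sort calls desc
""".strip()

# Run it with the CloudWatch Logs API, e.g.:
# logs = boto3.client("logs")
# logs.start_query(logGroupName="/aws/bedrock/invocations",
#                  queryString=QUERY, startTime=..., endTime=...)
```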

Three Different Log Types — Know the Difference

| Log Type | What It Captures | Where to Enable |
|---|---|---|
| Model invocation logs | LLM input prompts, output text, token counts, latency, model ID | Bedrock console → Model invocation logging |
| KB ingestion logs | Document processing events, chunking/embedding success/failure per file | Knowledge Base settings → CloudWatch Logs |
| CloudTrail API logs | Who called which Bedrock API, when, from which role, with which parameters | Enabled by default, goes to CloudTrail |

KB file failing to ingest → KB ingestion logs.
Track what prompts users sent → model invocation logs.
Compliance audit who invoked which model → CloudTrail.

🧪 Domain 5 — Testing, Validation & Troubleshooting

5.1 — Evaluation Systems ⚠️ Your LOWEST score (17%) — highest priority

Bedrock Evaluation — The Full Picture

Automatic model evaluation: Bedrock runs the evaluation pipeline for you using built-in or custom metrics. You provide a prompt dataset and a judge model. Bedrock invokes each candidate FM, judges the outputs, and writes a report to S3.
Human-based evaluation: You define the evaluation criteria and bring your own expert workforce. Bedrock routes outputs to human reviewers through a workflow you configure.
The API is CreateEvaluationJob. Use a consistent evaluator (judge) model for ALL candidate FMs in the same comparison — different judge models give incomparable scores.
Running batch inference jobs + custom Lambda LLM-judge + Spearman correlation is reinventing the wheel. CreateEvaluationJob does all of this managed — parallel, consistent, results to S3. Your Lambda only needs to do post-processing analysis on the S3 results.
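A sketch of what an automatic evaluation job request might look like. The field structure follows the CreateEvaluationJob API as best understood here, and all ARNs, S3 URIs, model IDs, and metric names are placeholders; verify every field against the current API reference before use:

```python
# Assumed request shape for bedrock.create_evaluation_job (verify before use).
request = {
    "jobName": "summarizer-eval-2024",
    "roleArn": "arn:aws:iam::123456789012:role/BedrockEvalRole",   # placeholder
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "Summarization",
                "dataset": {
                    "name": "my-prompts",
                    "datasetLocation": {"s3Uri": "s3://eval-bucket/prompts.jsonl"},
                },
                "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
            }]
        }
    },
    # One entry per candidate FM; the same judge is applied across all of them,
    # which is what keeps the scores comparable.
    "inferenceConfig": {"models": [
        {"bedrockModel": {"modelIdentifier": "amazon.nova-lite-v1:0"}},
    ]},
    "outputDataConfig": {"s3Uri": "s3://eval-bucket/results/"},
}
# bedrock.create_evaluation_job(**request)
```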

Evaluation Dataset Hard Limit — 1,000 Prompts

Bedrock automatic model evaluation jobs have a hard limit of 1,000 prompts per dataset per job. If your dataset has 5,000 prompts, split it into 5 jobs of 1,000 each.
S3 bucket versioning, CORS configuration, file compression — none of these affect the prompt limit. The limit is in Bedrock's evaluation job API, not in S3. The only fix is to split the dataset.
Evaluation job fails with a size/quota error → split your dataset. Not a storage issue, not a permissions issue — a Bedrock quota issue.
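The split itself is mechanical; a minimal sketch that chunks a prompt list so each chunk fits one evaluation job:

```python
def split_eval_dataset(prompts, limit=1000):
    """Split a prompt list into chunks within the per-job prompt quota.
    Each chunk becomes the dataset for one separate evaluation job."""
    return [prompts[i:i + limit] for i in range(0, len(prompts), limit)]
```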

Human Evaluation Workforce — Cognito, Not Ground Truth

For Bedrock human-based model evaluation with a custom expert workforce: create an Amazon Cognito user pool, add your experts as users, then assign them to a work team in Bedrock's evaluation configuration.
Amazon SageMaker Ground Truth = labeling service for ML training datasets (image annotation, text classification for model training). It is NOT the workforce tool for Bedrock model quality evaluation. Different product, different purpose.
Bedrock human eval workforce = Cognito user pools. SageMaker labeling workforce = SageMaker Ground Truth. The exam swaps these frequently.

Custom Metrics for Domain-Specific Evaluation

Industry-standard benchmarks (MMLU, HumanEval, GLUE) measure general model capability. They tell you nothing about whether the model writes in your brand's tone or uses your company's preferred formality level.
For company-specific quality criteria (brand tone, formality scale, domain vocabulary): create a human-validated custom evaluation dataset + define custom metrics that match your criteria.
Scenario: "evaluate model for company-specific tone and formality, RAG knowledge system" → custom dataset + custom metrics. NOT industry benchmark (MMLU etc.) + generic human eval.

Evaluation Process — Know the Output

When the exam asks about a systematic evaluation process for replacing a model in production, the final deliverable is an evaluation report with analysis — not just test dataset creation.
Evaluation process order: define metrics → create/select dataset → run evaluation job → analyze results + generate comprehensive evaluation report → make go/no-go decision for model swap.
The trap is stopping at "create a diverse test dataset." That's step 1. The answer the exam wants is the analysis and report generation — what you do WITH the evaluation results.

Output Traceability — Tag at Generation Time

When you want to trace which knowledge base documents influenced a specific AI response, the pattern is to tag the FM output with source metadata (document ID, chunk reference, source URL) at the time of generation.
Model invocation logs capture what went in and came out of the LLM. They do not tell you which retrieved KB chunk was the actual source of information. You have to add that context explicitly at generation time.
Scenario: "content generation system, need to verify which data source influenced each output" → tag outputs with source metadata during generation. Not invocation logging alone.
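One way to sketch the tagging step. The chunk field names here (`documentId`, `chunkId`, `uri`) are illustrative, not a fixed schema; the pattern is simply recording which retrieved chunks were in the prompt context when the answer was produced:

```python
def tag_generation(answer_text, retrieved_chunks):
    """Attach provenance at generation time: pair the model's answer with the
    KB chunks that were supplied as context for this specific generation."""
    return {
        "answer": answer_text,
        "sources": [
            {"documentId": c["documentId"], "chunkId": c["chunkId"], "uri": c.get("uri")}
            for c in retrieved_chunks
        ],
    }
```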

Guardrails Contextual Grounding for RAG Quality

Contextual grounding is technically a guardrails feature but it shows up in evaluation scenarios. It scores every response against the retrieved context and blocks responses that aren't grounded.
SageMaker Clarify does not evaluate RAG quality — it's for ML model explainability and bias detection in traditional ML. For RAG trustworthiness → Bedrock Guardrails contextual grounding.
Scenario: "scalable, trustworthy evaluation of RAG application output quality" → configure Bedrock Guardrails with contextual grounding checks. Automated, no infrastructure needed.
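A sketch of the guardrail configuration. The `contextualGroundingPolicyConfig` structure and the GROUNDING/RELEVANCE filter names reflect the boto3 `create_guardrail` shape as best understood here; verify field names and tune thresholds against the current API reference:

```python
# GROUNDING scores the answer against the retrieved context;
# RELEVANCE scores it against the user query. Thresholds are in [0, 1].
grounding_config = {
    "name": "rag-grounding-guard",
    "blockedInputMessaging": "Request blocked.",
    "blockedOutputsMessaging": "Response was not grounded in the retrieved context.",
    "contextualGroundingPolicyConfig": {
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},
            {"type": "RELEVANCE", "threshold": 0.75},
        ]
    },
}
# bedrock.create_guardrail(**grounding_config)
```

Raising the thresholds blocks more borderline responses at the cost of more false positives; start moderate and adjust from observed block rates.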

Bias Evaluation — BOLD Dataset via Bedrock Eval

For evaluating demographic bias in text generation (not classification bias): use Bedrock model evaluation + BOLD dataset. BOLD (Bias in Open-Ended Language Generation) is specifically designed for this.
BOLD = demographic bias in text generation. RealToxicityPrompts = toxicity (not the same as bias). WikiText2 = general text quality. T-REx = factual knowledge. Know these dataset names — the exam uses them as distractors.

5.2 — Troubleshoot GenAI Applications

Troubleshooting Decision Table — Symptom → Fix

| Symptom | Likely Cause | Fix | NOT This |
|---|---|---|---|
| Response cuts off mid-sentence | max_tokens too low | Increase max_tokens parameter | Don't analyze logs — the fix is the parameter |
| Response is factually wrong about your data | Model doesn't have the knowledge | Add RAG / Knowledge Base with current data | Not streaming, not temperature |
| Responses too repetitive / boring | Temperature too low or top-p too restrictive | Increase temperature or top-p | Not chunking, not embeddings |
| Users wait long before seeing anything | No streaming, response delayed | Enable response streaming | Not RAG, not more memory |
| KB documents fail to ingest | File too large / unsupported format | Split large docs into smaller files | Don't compress — doesn't help Bedrock limits |
| Retrieved chunks irrelevant to query | Poor retrieval ranking | Enable reranking in Knowledge Base | Don't just increase K (more chunks = more noise) |
| High latency in RAG responses | Vector search or embedding generation slow | CloudWatch dashboard: retrieval + OpenSearch metrics | X-Ray alone misses the aggregate picture |

SageMaker Shadow Test vs A/B Test

Shadow Test — Safe Validation

  • New model receives a copy of traffic
  • Real users only see old model responses
  • Zero user impact during testing
  • Compare new vs old performance metrics
  • Use: validate before any user exposure

A/B Test (Production Variants)

  • Traffic split between old and new model
  • Real users see both variants
  • Good for incremental rollout
  • Risk: users may see worse responses
  • Use: after initial shadow validation

If the question says "validate without production impact" → shadow test. If it says "gradually shift traffic" → A/B or canary deployment. Shadow first, A/B after.

KB Document Size — Split, Don't Compress

Bedrock Knowledge Bases has per-document content size limits. Large PDFs (hundreds of pages) may exceed these limits.
Fix: split the large document into smaller files (by chapter, section, or page range) before uploading to S3 for KB ingestion.
S3 compression (gzip) reduces file storage size but Bedrock decompresses the file before processing — it still sees the original large document. Compression helps S3 storage costs, not Bedrock ingestion limits.

Truncated Responses — max_tokens Is Always the Fix

If model responses are getting cut off before completing the expected output → the max_tokens (or maxTokens) parameter in your model invocation is too low.
This is a parameter tuning issue, not a monitoring issue, not a logging issue, not a chunking issue. Just increase max_tokens.
Using CloudWatch Logs Insights to "analyze API call patterns" for a truncation problem is over-engineering a parameter fix. The logs will confirm truncation happened, but the fix is always max_tokens.
Common model parameters: max_tokens (output length), temperature (randomness), top_p (nucleus sampling), top_k (vocabulary restriction), stop_sequences (where to stop).
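A sketch of where these parameters live in an invocation body, assuming the Anthropic Messages format on Bedrock (other model families name and nest them differently):

```python
import json

def build_invoke_body(prompt, max_tokens=1024, temperature=0.7, top_p=0.9,
                      stop_sequences=None):
    """Request body for InvokeModel (Anthropic Messages format).
    If responses truncate, raise max_tokens here: the fix is this
    parameter, not more logging."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,        # output length cap
        "temperature": temperature,      # randomness
        "top_p": top_p,                  # nucleus sampling
        "stop_sequences": stop_sequences or [],
        "messages": [{"role": "user", "content": prompt}],
    })

# bedrock_runtime.invoke_model(modelId=..., body=build_invoke_body("..."))
```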