Bedrock Evaluation - The Full Picture
Automatic model evaluation: Bedrock runs the evaluation pipeline for you using built-in or custom metrics. You provide a prompt dataset and a judge model. Bedrock invokes each candidate FM, judges the outputs, and writes a report to S3.
Human-based evaluation: You define the evaluation criteria and bring your own expert workforce. Bedrock routes outputs to human reviewers through a workflow you configure.
The API is CreateEvaluationJob. Use a consistent evaluator (judge) model for ALL candidate FMs in the same comparison - different judge models give incomparable scores.
Running batch inference jobs + custom Lambda LLM-judge + Spearman correlation is reinventing the wheel. CreateEvaluationJob does all of this as a managed service - parallel, consistent, results to S3. Your Lambda only needs to do post-processing analysis on the S3 results.
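The managed path can be sketched as assembling one CreateEvaluationJob request per comparison. This is a sketch only: the exact evaluationConfig shape should be verified against the current Bedrock API reference, and the role ARN, S3 URIs, and model IDs below are placeholders.

```python
# Sketch: build a CreateEvaluationJob request (automatic evaluation).
# Field shapes are hedged -- verify against the current Bedrock API
# reference. Role ARN, S3 URIs, and model IDs are placeholders.

def build_evaluation_job_request(job_name: str, candidate_model_ids: list[str],
                                 judge_model_id: str, dataset_s3_uri: str,
                                 output_s3_uri: str, role_arn: str) -> dict:
    """Assemble kwargs for bedrock.create_evaluation_job.

    One judge model for ALL candidate FMs, so scores stay comparable.
    """
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "General",
                    "dataset": {"name": "custom",
                                "datasetLocation": {"s3Uri": dataset_s3_uri}},
                    "metricNames": ["Builtin.Correctness", "Builtin.Helpfulness"],
                }],
                # Same judge for every candidate FM in the comparison.
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{"modelIdentifier": judge_model_id}]
                },
            }
        },
        "inferenceConfig": {
            "models": [{"bedrockModel": {"modelIdentifier": mid}}
                       for mid in candidate_model_ids]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }

request = build_evaluation_job_request(
    job_name="compare-candidates",
    candidate_model_ids=["candidate-model-a", "candidate-model-b"],  # placeholders
    judge_model_id="judge-model",                                    # same judge for all
    dataset_s3_uri="s3://my-eval-bucket/prompts.jsonl",              # placeholder
    output_s3_uri="s3://my-eval-bucket/results/",                    # placeholder
    role_arn="arn:aws:iam::123456789012:role/my-eval-role",          # placeholder
)
# boto3.client("bedrock").create_evaluation_job(**request)  # the actual call
```

The report then lands under the outputDataConfig S3 prefix, which is what the post-processing Lambda reads.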
Evaluation Dataset Hard Limit - 1,000 Prompts
Bedrock automatic model evaluation jobs have a hard limit of 1,000 prompts per dataset per job. If your dataset has 5,000 prompts, split it into 5 jobs of 1,000 each.
S3 bucket versioning, CORS configuration, file compression - none of these affect the prompt limit. The limit is in Bedrock's evaluation job API, not in S3. The only fix is to split the dataset.
Evaluation job fails with a size/quota error → split your dataset. Not a storage issue, not a permissions issue - a Bedrock quota issue.
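The split itself is trivial; a minimal sketch, assuming a list of prompt records that would each become one line of a JSONL dataset file:

```python
# Split a prompt dataset into chunks that fit Bedrock's
# 1,000-prompts-per-job limit -- one chunk per evaluation job.

MAX_PROMPTS_PER_JOB = 1000

def split_dataset(prompts: list[dict], limit: int = MAX_PROMPTS_PER_JOB) -> list[list[dict]]:
    """Return consecutive chunks of at most `limit` prompts."""
    return [prompts[i:i + limit] for i in range(0, len(prompts), limit)]

# 5,000 prompts -> 5 jobs of 1,000 each
dataset = [{"prompt": f"question {i}"} for i in range(5000)]
chunks = split_dataset(dataset)
print(len(chunks))  # 5
```

Each chunk is then written to its own JSONL file in S3 and submitted as its own evaluation job.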
Human Evaluation Workforce - Cognito, Not Ground Truth
For Bedrock human-based model evaluation with a custom expert workforce: create an Amazon Cognito user pool, add your experts as users, then assign them to a work team in Bedrock's evaluation configuration.
Amazon SageMaker Ground Truth = labeling service for ML training datasets (image annotation, text classification for model training). It is NOT the workforce tool for Bedrock model quality evaluation. Different product, different purpose.
Bedrock human eval workforce = Cognito user pools. SageMaker labeling workforce = SageMaker Ground Truth. The exam swaps these frequently.
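The Cognito side of this can be sketched with the cognito-idp API. Pool name and expert emails below are placeholders, the calls are shown commented out, and the final work-team assignment happens in Bedrock's human-evaluation job configuration:

```python
# Sketch: request payloads for creating an expert workforce in Cognito.
# Pool name and emails are placeholders. The user pool is what Bedrock's
# human-based evaluation work team points at -- NOT SageMaker Ground Truth.

def build_user_pool_request(pool_name: str) -> dict:
    return {"PoolName": pool_name, "AutoVerifiedAttributes": ["email"]}

def build_create_user_requests(pool_id: str, expert_emails: list[str]) -> list[dict]:
    return [
        {
            "UserPoolId": pool_id,
            "Username": email,
            "UserAttributes": [{"Name": "email", "Value": email}],
        }
        for email in expert_emails
    ]

pool_req = build_user_pool_request("bedrock-eval-experts")   # placeholder name
user_reqs = build_create_user_requests(
    "us-east-1_EXAMPLE",                                     # placeholder pool id
    ["expert1@example.com", "expert2@example.com"],
)
# cognito = boto3.client("cognito-idp")
# pool = cognito.create_user_pool(**pool_req)
# for req in user_reqs:
#     cognito.admin_create_user(**req)
```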
Custom Metrics for Domain-Specific Evaluation
Industry-standard benchmarks (MMLU, HumanEval, GLUE) measure general model capability. They tell you nothing about whether the model writes in your brand's tone or uses your company's preferred formality level.
For company-specific quality criteria (brand tone, formality scale, domain vocabulary): create a human-validated custom evaluation dataset + define custom metrics that match your criteria.
Scenario: "evaluate model for company-specific tone and formality, RAG knowledge system" → custom dataset + custom metrics. NOT industry benchmark (MMLU etc.) + generic human eval.
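A custom dataset is just JSONL with prompts and human-validated reference responses. A sketch, assuming the prompt/referenceResponse/category field names used by Bedrock evaluation datasets (verify against the current docs); the example prompts and categories are invented:

```python
import json

# Sketch: write a human-validated custom evaluation dataset as JSONL.
# Field names (prompt / referenceResponse / category) follow the Bedrock
# evaluation dataset convention -- verify against the current docs.
examples = [
    {
        "prompt": "Draft a reply to a customer asking about order delays.",
        "referenceResponse": "Reply in our brand voice: warm, formal, no slang.",
        "category": "brand-tone",   # bucket for a custom metric
    },
    {
        "prompt": "Summarize this policy for an internal engineer.",
        "referenceResponse": "Concise, uses our domain vocabulary.",
        "category": "formality",
    },
]

with open("custom_eval_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```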
Evaluation Process - Know the Output
When the exam asks about a systematic evaluation process for replacing a model in production, the final deliverable is an evaluation report with analysis - not just creating a test dataset.
Evaluation process order: define metrics → create/select dataset → run evaluation job → analyze results + generate comprehensive evaluation report → make go/no-go decision for model swap.
The trap is stopping at "create a diverse test dataset." That's step 1. The answer the exam wants is the analysis and report generation - what you do WITH the evaluation results.
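That last step can be sketched as a small post-processing pass over per-prompt judge scores, producing the report summary and the go/no-go call. The scores and the 0.8 threshold below are hypothetical:

```python
# Sketch: turn raw per-prompt evaluation scores into a report summary
# plus a go/no-go decision. Scores and the 0.8 threshold are hypothetical.

def summarize(results: list[dict], metric: str, threshold: float = 0.8) -> dict:
    scores = [r[metric] for r in results]
    mean = sum(scores) / len(scores)
    return {
        "metric": metric,
        "mean_score": round(mean, 3),
        "num_prompts": len(scores),
        "worst_score": min(scores),
        "decision": "go" if mean >= threshold else "no-go",
    }

candidate_results = [
    {"correctness": 0.91}, {"correctness": 0.88}, {"correctness": 0.76},
]
report = summarize(candidate_results, "correctness")
print(report["decision"])  # mean 0.85 >= 0.8 -> "go"
```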
Output Traceability - Tag at Generation Time
When you want to trace which knowledge base documents influenced a specific AI response, the pattern is to tag the FM output with source metadata (document ID, chunk reference, source URL) at the time of generation.
Model invocation logs capture what went in and came out of the LLM. They do not tell you which retrieved KB chunk was the actual source of information. You have to add that context explicitly at generation time.
Scenario: "content generation system, need to verify which data source influenced each output" → tag outputs with source metadata during generation. Not invocation logging alone.
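A sketch of the tag-at-generation pattern, assuming a RAG flow where retrieved chunks already carry document IDs; the traced-output shape and field names are illustrative, not a Bedrock API structure:

```python
from datetime import datetime, timezone

# Sketch: attach source metadata to an FM output at generation time.
# The output shape is illustrative -- the point is that attribution is
# recorded WHEN the response is generated, because invocation logs alone
# cannot recover which KB chunk was the source.

def tag_output(generated_text: str, retrieved_chunks: list[dict]) -> dict:
    return {
        "output": generated_text,
        "sources": [
            {
                "document_id": c["document_id"],
                "chunk_ref": c["chunk_ref"],
                "source_url": c.get("source_url"),
            }
            for c in retrieved_chunks
        ],
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

chunks = [
    {"document_id": "kb-doc-42", "chunk_ref": "chunk-3",
     "source_url": "https://example.com/policy"},
]
traced = tag_output("Refunds are processed within 5 business days.", chunks)
```

The traced record (not just the bare completion) is what gets persisted, so every response can be walked back to its KB documents.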
Guardrails Contextual Grounding for RAG Quality
Contextual grounding is technically a guardrails feature but it shows up in evaluation scenarios. It scores every response against the retrieved context and blocks responses that aren't grounded.
SageMaker Clarify does not evaluate RAG quality - it's for ML model explainability and bias detection in traditional ML. For RAG trustworthiness → Bedrock Guardrails contextual grounding.
Scenario: "scalable, trustworthy evaluation of RAG application output quality" → configure Bedrock Guardrails with contextual grounding checks. Automated, no infrastructure needed.
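A sketch of that configuration as a create_guardrail request; the guardrail name and the 0.75 thresholds are placeholders, and the payload shape should be verified against the current API reference:

```python
# Sketch: Bedrock Guardrail with contextual grounding checks.
# Name and thresholds (0.0-1.0) are placeholders -- responses scoring
# below a threshold are blocked as ungrounded or irrelevant to the
# retrieved context. Verify the shape against the current API reference.

guardrail_request = {
    "name": "rag-grounding-guardrail",  # placeholder name
    "blockedInputMessaging": "Request blocked.",
    "blockedOutputsMessaging": "Response was not grounded in the source documents.",
    "contextualGroundingPolicyConfig": {
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.75},  # grounded in retrieved context?
            {"type": "RELEVANCE", "threshold": 0.75},  # relevant to the user query?
        ]
    },
}
# boto3.client("bedrock").create_guardrail(**guardrail_request)  # the actual call
```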
Bias Evaluation - BOLD Dataset via Bedrock Eval
For evaluating demographic bias in text generation (not classification bias): use Bedrock model evaluation + BOLD dataset. BOLD (Bias in Open-Ended Language Generation) is specifically designed for this.
BOLD = demographic bias in text generation. RealToxicityPrompts = toxicity (not the same as bias). WikiText2 = general text quality. T-REx = factual knowledge. Know these dataset names - the exam uses them as distractors.