📗 Domain 1 - Foundation Models, Data & RAG

Subdomains 1.1-1.6 · Your gap areas: 1.3 Data Pipelines · 1.4 Vector Stores · 1.5 Retrieval

1.2 - Select & Configure Foundation Models

Model Distillation - A Recurring Trap

Distillation = compress a large teacher model into a small student model. You supply only the prompts; Bedrock runs the teacher internally to generate the training responses.
The trap: using "prompt-response pairs from invocation logs." That approach is for fine-tuning, NOT distillation. With distillation, the teacher generates its own responses; you cannot inject pre-captured responses.
Distillation input = prompts only. Fine-tuning input = prompt + response pairs in JSONL. Never mix them up.
If the scenario says "reduce cost while maintaining accuracy using an existing high-accuracy model" → think distillation → supply only prompts, choose a smaller student (e.g., Nova Lite from Nova Pro).
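The input-format difference is worth making concrete. A minimal sketch, assuming illustrative field names (the exact JSONL schema varies by model; check the Bedrock docs for the real one):

```python
import json

def distillation_record(prompt):
    # Distillation input: the prompt only. The teacher model generates
    # the response internally. (Field name is illustrative, not the
    # exact Bedrock schema.)
    return {"prompt": prompt}

def fine_tuning_record(prompt, completion):
    # Fine-tuning input: a prompt + desired-response pair.
    return {"prompt": prompt, "completion": completion}

# One JSON object per line in the training JSONL file:
record = fine_tuning_record("Summarize this bill.", "The bill covers March usage.")
jsonl_line = json.dumps(record)
```

The point to internalize: a distillation record has no response field at all, so invocation-log response data has nowhere to go.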

Inference Profiles - Two Separate Jobs

Job 1 - Cost attribution: Create one inference profile per cost center / clinic / team. Invoke models through the profile → AWS Cost Explorer breaks costs down by profile ID automatically.
Job 2 - High availability: Configure an inference profile with a primary region + secondary region. Bedrock fails over automatically.
Cross-Region inference provides failover capability, but round-robin load balancing is the wrong mental model. The exam wants failover (primary fails → secondary takes over).
Scenario: "multi-clinic app, need to track costs per clinic" → inference profiles per clinic ID.
Scenario: "high availability across regions" → inference profile with primary + secondary region config.
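For the cost-attribution job, the application side reduces to passing the right profile ARN as the model ID. A sketch with made-up clinic IDs and placeholder ARNs (real application inference profile ARNs come from your own account after you create the profiles):

```python
# Hypothetical per-clinic profile ARNs -- placeholders, not real resources.
PROFILE_ARNS = {
    "clinic-a": "arn:aws:bedrock:us-east-1:111122223333:application-inference-profile/clinic-a",
    "clinic-b": "arn:aws:bedrock:us-east-1:111122223333:application-inference-profile/clinic-b",
}

def model_id_for(clinic_id):
    """Return the profile ARN to pass as modelId, so Cost Explorer
    attributes the invocation's cost to this clinic's profile."""
    try:
        return PROFILE_ARNS[clinic_id]
    except KeyError:
        raise ValueError(f"no inference profile configured for {clinic_id}")
```

The model itself is identical across clinics; only the ARN you invoke through changes, and that ARN is what the billing breakdown keys on.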

Fine-Tuning vs Other Customization Techniques

Technique | When to Use | Data Input
Fine-tuning | Model needs domain-specific behavior / tone / format | Prompt + response JSONL pairs
Distillation | Make a cheaper, smaller model that mimics a large one | Prompts only (teacher generates responses)
Continued pre-training | Model needs deep domain knowledge (raw text) | Unlabeled text corpus
RAG | Model needs up-to-date or private knowledge | No model training; runtime retrieval
Prompt engineering | Quick behavior changes, no training needed | Just the prompt

RAG = no model modification. Fine-tuning = model weights change. Distillation = new smaller model. These are the three things the exam loves to mix.

Bedrock On-Demand vs SageMaker Endpoints

Scenario mentions a native Bedrock model (Nova, Claude, Titan, Llama via Bedrock) with unpredictable traffic → use Bedrock on-demand inference via Lambda. No endpoints to manage, automatic scaling, pay-per-token.
SageMaker real-time endpoints are for custom or self-managed models (models you trained yourself, Hugging Face models via JumpStart). For first-party Bedrock models, you don't need SageMaker endpoints at all.
Avoid: "deploy Nova to a SageMaker endpoint + auto scaling." Nova is a Bedrock-native model; you never deploy it to SageMaker. Use Bedrock's API directly.
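A minimal sketch of the Lambda-side call, assuming the Converse API response shape and an illustrative Nova model ID; the `client` parameter exists only so the function can be exercised without AWS access:

```python
MODEL_ID = "amazon.nova-lite-v1:0"  # illustrative model ID; verify in your region

def ask(prompt, client=None):
    """Invoke a Bedrock-native model on demand -- no endpoint to manage."""
    if client is None:
        import boto3  # deferred import so the sketch is testable offline
        client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```

Contrast with SageMaker: there is no endpoint to create, scale, or pay for while idle; you are billed per token on each call.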

1.3 - Data Validation & Processing Pipelines ⚠️ Your weakest subdomain (33%)

Bedrock Data Automation (BDA) - The Full Picture

BDA is a multimodal document intelligence service. It extracts structured information from PDFs, images, audio, and video automatically; think of it as a smart parser that understands document structure without you writing extraction logic.
BDA architecture: 1 project → multiple blueprints. Each blueprint describes one document type (e.g., electric bill, water bill, gas bill). When you send a document in, BDA automatically selects the right blueprint; you don't pick it.
The trap is creating one project per document type. That's backwards. One project, many blueprints inside it. The auto-selection is the whole point.
Scenario: "extract fields from various document types (bills, contracts, invoices)" → BDA with one project containing one blueprint per document type → invoke via InvokeDataAutomationAsync API.
BDA as RAG pre-processor: For complex multimodal content (financial filings, PDFs with charts), use BDA first to extract structured text, THEN feed into a Knowledge Base for RAG. Raw PDFs in a KB miss the embedded chart data; BDA catches it all.

BDA Blueprint: Transformation vs Validation

Transformation = reshape or reformat an extracted value. Example: split "John Smith" into FIRST_NAME + LAST_NAME using a custom type (reusable transformation definition).
Validation = check a constraint and reject if violated. Example: reject if a field is null or malformed.
When asked "how do you split a field into subcomponents?" → the answer is transformation with a custom type, not validation. Validation rejects; transformation reshapes.
Avoid confusing "enforce required subfields" with "split into subfields." Enforcing = validation. Splitting/reformatting = transformation.
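The reshape-vs-reject distinction in plain code, as a stand-in for what the blueprint expresses declaratively (function names and return shapes are illustrative, not BDA syntax):

```python
def transform_split_name(value):
    """Transformation: reshape the value -- split a full name into
    subfields. Nothing is rejected; the data is reformatted."""
    first, _, last = value.partition(" ")
    return {"FIRST_NAME": first, "LAST_NAME": last}

def validate_not_null(value):
    """Validation: check a constraint and reject on violation --
    the value passes through unchanged or the document is flagged."""
    if value is None or not str(value).strip():
        raise ValueError("required field is null or empty")
    return value
```

If the exam asks for splitting, the answer that raises/rejects is the wrong one; if it asks for enforcing presence, the answer that reshapes is the wrong one.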

Fine-Tuning Data Pipeline - Glue ETL, Not EMR

The standard fine-tuning data pipeline is: S3 (raw data) → AWS Glue crawler (catalog it) → Glue Data Catalog → Glue ETL jobs (transform to JSONL in Bedrock Converse API format) → S3 (curated) → Bedrock fine-tuning job.
EMR with Apache Spark is a valid tool for big data transformation, but it requires cluster management. For a straightforward fine-tuning data prep pipeline, Glue ETL is the AWS-recommended approach: serverless, managed, no infrastructure.
Glue crawler = discovers and catalogs data. Glue ETL = transforms it. These are two different things; you need BOTH in the pipeline.
Scenario: "prepare customer support transcripts for fine-tuning" → Glue crawler (catalog) → Glue ETL job (transform to JSONL) → S3 → Bedrock fine-tuning. Not EMR, not Lambda alone.

Amazon Comprehend - Entity Recognition vs Classification

Entity Recognition - use when extracting

  • Pull product names, brands, specs FROM text
  • Extract people, places, dates, organizations
  • Structured attribute extraction from prose
  • "What is in this text?"

Custom Classification - use when labeling

  • Assign a category label to a whole document
  • Sentiment (positive/negative/neutral)
  • Topic labeling ("this is a billing complaint")
  • "What type is this text?"

If the scenario asks to "extract product attributes" or "pull specs from descriptions" → entity recognition. Classification just puts a label on the whole thing; it doesn't extract individual fields.

Chunking Strategies for Knowledge Bases

Fixed-size chunking: Split at N tokens. Simple, fast, but can break mid-sentence. Good for homogeneous content.
Hierarchical chunking (built-in): Parent chunks + child chunks. Parent provides context, child provides precision. Good for standard documents with clear structure.
Semantic chunking: Split based on meaning/topic shifts. Better retrieval quality, slightly more expensive.
Custom Lambda chunking: Your own logic via Lambda + libraries like LangChain. Use for complex HTML, custom formats, non-standard structures.
Scenario: "complex HTML with nested headers and mixed content types" → custom Lambda chunking with LangChain deployed as a layer. The built-in strategies don't handle arbitrary HTML structure well enough.
The built-in hierarchical chunker works for Word docs and simple PDFs. For HTML or any complex custom format → custom Lambda chunking.
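To make "fixed-size with overlap" concrete, here is a naive whitespace-token chunker. This is a generic sketch, not the Bedrock custom-chunking Lambda contract (which has its own input/output JSON schema); a real custom chunker would apply format-aware logic (e.g., split on HTML headers) instead of counting words:

```python
def chunk(text, max_tokens=200, overlap=20):
    """Split text into fixed-size chunks of whitespace 'tokens',
    with each chunk overlapping the previous one by `overlap` tokens
    so sentences cut at a boundary still appear whole somewhere."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap is the detail exams like: without it, a fact straddling a chunk boundary is retrievable from neither chunk.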

PII in Data Pipelines

The right tool depends on where the data is and what you need:

Scenario | Right Tool
Detect + redact PII in text (emails, transcripts) before sending to FM | Amazon Comprehend: real-time PII detection API
Discover PII across S3 buckets at scale | Amazon Macie: automated S3 data discovery, custom classifiers for continuous monitoring
Extract text from scanned documents / images | Amazon Textract: OCR only, not NLP/PII detection
Search through documents after PII removal | Amazon Kendra: enterprise search (comes after Comprehend removes PII)

Textract is NOT a PII detector; it just extracts text from images/PDFs. Macie does not redact; it discovers and alerts. Comprehend actually redacts.
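Strictly speaking, Comprehend's `DetectPiiEntities` API returns entity types with character offsets, and your code does the redaction. A sketch of that pattern (the response shape with `Type`/`BeginOffset`/`EndOffset` follows the documented API; the `client` parameter is for offline testing):

```python
def redact_pii(text, client=None):
    """Mask PII spans found by Comprehend before the text reaches an FM."""
    if client is None:
        import boto3  # deferred import so the sketch is testable offline
        client = boto3.client("comprehend")
    resp = client.detect_pii_entities(Text=text, LanguageCode="en")
    # Replace from the end of the string backwards so earlier
    # offsets stay valid as the text shrinks/grows.
    for e in sorted(resp["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text
```

Note the reverse-sorted loop: replacing left-to-right would invalidate the remaining offsets whenever the mask differs in length from the original span.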

S3 Metadata Types - Exact Distinction

Type | Set By | Examples
System-defined | S3 automatically | Last-Modified, Content-Type, ETag, Content-Length, Storage-Class
User-defined | You, at upload time | x-amz-meta-author, x-amz-meta-source-id, x-amz-meta-department
Object tags | You, any time (even after upload) | Classification=Confidential, Discipline=Physics, env=prod

Timestamps (when the file was uploaded/modified) are system-defined; S3 manages them automatically. You cannot set Last-Modified yourself. Authorship details, source identifiers → user-defined metadata.
The difference matters because system metadata = S3-controlled, user metadata = you control at upload, tags = you control any time post-upload.

BDA vs EventBridge + Step Functions for Orchestration

BDA is an intelligence extraction service; it understands documents and media. It is not a workflow orchestrator.
Scenario: "process uploaded videos using Rekognition (object detection) + Bedrock (summaries)" → EventBridge + Step Functions: S3 upload → EventBridge rule → Step Functions state machine → call Rekognition → call Bedrock FMs → store results.
You cannot use a BDA blueprint to "orchestrate" multi-service video processing. BDA blueprints define what fields to extract from a document, not how to chain services together.
Avoid: "create a BDA blueprint to orchestrate the processing steps." BDA blueprints = field extraction schemas. Orchestration = Step Functions.
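The Rekognition → Bedrock chain sketches out as a Step Functions state machine. Heavily simplified ASL: `Parameters`, retries, and result handling are omitted, and the task resource ARNs follow the documented SDK/optimized integration patterns but should be verified against the Step Functions docs:

```json
{
  "Comment": "Illustrative: detect labels, then summarize with a Bedrock FM",
  "StartAt": "DetectLabels",
  "States": {
    "DetectLabels": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:rekognition:detectLabels",
      "Next": "Summarize"
    },
    "Summarize": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "End": true
    }
  }
}
```

An EventBridge rule on the S3 upload event starts an execution of this machine; the state machine, not BDA, is what chains the services.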

1.4 - Vector Store Solutions

OpenSearch Service vs Serverless - Hybrid Search

Hybrid search = combine dense vectors (semantic / embedding-based) with sparse vectors (keyword / BM25). You get semantic understanding AND exact-term matching in one query.

OpenSearch Service (managed)

  • Full feature set including hybrid search
  • Both sparse + dense vector support
  • Sub-second latency with k-NN indexing
  • Advanced filtering, analytics, sharding
  • ✓ Use for sub-second hybrid search

OpenSearch Serverless

  • Simpler, auto-scaling, less ops burden
  • Good for variable workloads
  • Dense vectors only (limited sparse support)
  • Fewer advanced search features
  • ✗ Not ideal for full hybrid search

If the exam scenario requires sub-second response or hybrid search (semantic + keyword) → OpenSearch Service (managed), NOT Serverless.
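For intuition, a hybrid request in OpenSearch's query DSL combines a BM25 clause and a k-NN clause under one `hybrid` query. The index and field names (`text`, `embedding`) and the toy vector below are made up, and OpenSearch additionally requires a search pipeline with a normalization processor to blend the two score scales:

```json
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "text": { "query": "billing dispute refund" } } },
        { "knn": { "embedding": { "vector": [0.12, -0.41, 0.88], "k": 10 } } }
      ]
    }
  }
}
```

The `match` clause supplies exact-term recall; the `knn` clause supplies semantic recall; the pipeline's normalization step decides how the two rankings merge.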

Semantic Cache with OpenSearch k-NN

Semantic caching = cache responses by meaning, not by exact string. "What's the price?" and "How much does it cost?" should hit the same cache entry.
Implementation: Lambda generates an embedding for the incoming query → searches OpenSearch k-NN vector index → if similarity score exceeds threshold → return cached response → skip FM invocation entirely.
ElastiCache (Redis) with key-value exact string matching is NOT semantic caching. It will only hit cache if the user types the exact same question. For conversational AI this has very low hit rates.
Scenario: "high costs from repeated similar questions, need caching" → OpenSearch k-NN semantic cache, not ElastiCache with exact matching.
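The threshold logic is the core idea. A self-contained sketch with an in-memory list standing in for the OpenSearch k-NN index (the threshold value and tiny vectors are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """In-memory stand-in for an OpenSearch k-NN index: store
    (embedding, response) pairs, return a cached response only when
    the nearest neighbor clears the similarity threshold."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best and cosine(embedding, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip the FM invocation entirely
        return None  # cache miss: call the FM, then put() the result
```

With exact-string matching, a paraphrased question always misses; here it hits as long as its embedding lands close enough to a stored one.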

Vector Store Selection Guide

Service | Best For | Key Trait
OpenSearch Service | High-performance, real-time, hybrid search | Most features, sub-second latency
OpenSearch Serverless | Variable workloads, managed scaling | No cluster management
Aurora PostgreSQL + pgvector | Relational data + vector search in same DB | SQL interface, ACID transactions
Amazon S3 Vectors | Large-scale, cost-effective storage | Cheapest per vector
Bedrock Knowledge Bases | Fully managed RAG, no vector-ops expertise needed | Zero infrastructure, end-to-end managed

1.5 - Retrieval Mechanisms for FM Augmentation

Reranking - When More Docs Isn't the Answer

After the initial vector search returns Top-K chunks, a reranker does a second pass: it scores each chunk against the query for relevance and reorders them. The best chunks go to the LLM; noisy irrelevant ones fall off.
Increasing the number of retrieved documents (K) does NOT improve relevance; it usually makes things worse by adding low-quality noise into the context. The model gets confused by irrelevant chunks.
Scenario: "retrieved chunks are correct but summaries are contextually irrelevant" → enable reranking in Bedrock Knowledge Bases. The retrieval is finding the right documents, but the ranking is wrong.
Retrieval pipeline: Query → Vector search → Top-K candidates → Reranker → Best-N → LLM context window. The reranker is the quality filter before the LLM sees anything.
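The pipeline's second pass reduces to "score, sort, truncate." A sketch where `score_fn` stands in for the reranker model (in Bedrock this would be a managed rerank model; here it is any query-chunk relevance function):

```python
def rerank(query, chunks, score_fn, keep=3):
    """Second-pass rerank: score each retrieved chunk against the
    query and keep only the best `keep` for the LLM context window."""
    scored = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:keep]
```

The truncation is the point: the LLM sees the best N, not the raw Top-K, which is why raising K without a reranker only adds noise.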

Knowledge Base Sync - SQS for Resilience

Pattern for near-real-time KB sync: S3 Event Notification → SQS queue → Lambda polls queue → calls IngestKnowledgeBaseDocuments API.
Direct Lambda trigger (S3 → Lambda directly, no SQS) works but has no retry buffer. If Lambda fails during ingestion, the event is gone. With SQS, the message stays in the queue and gets retried automatically.
SQS = resilience via message persistence + retry. Direct Lambda trigger = fire-and-forget with no retry. For production KB sync, always use SQS as the buffer.
For deletes: S3 fires both object-created AND object-deleted events. Your Lambda should call the appropriate KB API for each event type.
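A sketch of the Lambda in that pattern: each SQS record's body wraps an S3 event, and creates vs. deletes route to different KB calls. `kb_client.ingest` / `kb_client.delete` are hypothetical wrappers around the KB document ingest/remove APIs, not real boto3 method names:

```python
import json

def handler(event, kb_client):
    """SQS-triggered Lambda for KB sync. Any unhandled exception here
    leaves the message in the queue, so SQS retries it automatically."""
    for record in event["Records"]:          # SQS records
        s3_event = json.loads(record["body"])  # each body is an S3 event
        for rec in s3_event.get("Records", []):
            key = rec["s3"]["object"]["key"]
            if rec["eventName"].startswith("ObjectCreated"):
                kb_client.ingest(key)   # hypothetical ingest wrapper
            elif rec["eventName"].startswith("ObjectRemoved"):
                kb_client.delete(key)   # hypothetical removal wrapper
```

The resilience argument lives in the comment on the first line: with a direct S3 trigger a raised exception drops the event, while here it returns the message to the queue.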

KB Ingestion - Split Large Docs, Don't Compress

Bedrock KB has per-document size limits. If ingestion fails for large PDFs, the solution is to split the document into smaller files before uploading to S3.
S3 bucket compression (gzip, etc.) changes file size, but Bedrock decompresses before processing; it still sees the same large document. Compression doesn't help KB size limits.
Avoid: "enable S3 compression to reduce document size." The KB limit is about document content size, not storage footprint.
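A splitting sketch for text content, breaking on paragraph boundaries so no chunk file cuts mid-paragraph. The byte limit is a placeholder, not the actual KB quota, and a single paragraph larger than the limit is kept whole (a real pipeline would need a fallback for that case):

```python
def split_document(text, max_bytes=1_000_000):
    """Split a large text document into pieces under a per-document
    size limit (limit value illustrative), on paragraph boundaries."""
    parts, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate.encode("utf-8")) > max_bytes and current:
            parts.append(current)   # close out the full piece
            current = para          # start the next piece
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts
```

Each returned piece is uploaded to S3 as its own object, so the KB sees several small documents instead of one oversized one.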

RAG vs Response Streaming - Different Problems

RAG solves accuracy and knowledge freshness problems. The model doesn't know your product catalog? → inject it via RAG.
Response streaming solves perceived latency problems. The model response takes 8 seconds to complete? → stream tokens as they're generated so the user sees output immediately.
If the scenario describes inaccurate answers about products → RAG. If the scenario describes slow user experience / users waiting → streaming. Never use streaming to fix accuracy.
Streaming doesn't change what the model knows. It only changes when the user sees the response. RAG changes what the model can answer correctly.

Knowledge Base Logging - Two Separate Log Types

Model invocation logs = what was sent to the LLM and what it replied (input prompts, output text, token counts, latency). Captured via Amazon S3 or CloudWatch Logs in Bedrock settings.
Knowledge base ingestion logs = what happened during document processing (which files succeeded, which failed, why chunking/embedding errored). Configured separately in KB settings → CloudWatch Logs destination.
Scenario: "documents are failing to ingest / embeddings not generated" → KB ingestion logs → CloudWatch Logs Insights.
Scenario: "track what prompts users are sending" → model invocation logs.
These are two completely separate log streams. You cannot find KB ingestion failures in model invocation logs.