📗 Domain 1 - Foundation Models, Data & RAG

Subdomains 1.1-1.6 · Your gap areas: 1.3 Data Pipelines · 1.4 Vector Stores · 1.5 Retrieval

1.2 - Select & Configure Foundation Models

Model Distillation - A Recurring Trap

Distillation = compress a large teacher model into a small student model. You supply only the prompts; Bedrock runs the teacher internally to generate the training responses.
The trap: using "prompt-response pairs from invocation logs." That approach is for fine-tuning, NOT distillation. With distillation, the teacher generates its own responses; you cannot inject pre-captured responses.
Distillation input = prompts only. Fine-tuning input = prompt + response pairs in JSONL. Never mix them up.
If the scenario says "reduce cost while maintaining accuracy using an existing high-accuracy model" → think distillation → supply only prompts, choose a smaller student (e.g., Nova Lite from Nova Pro).
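The input-format difference is worth making concrete. A minimal sketch, assuming illustrative field names (the exact JSONL schema varies by model; check the Bedrock docs for the real one):

```python
import json

def distillation_record(prompt):
    # Distillation input: the prompt only. The teacher model generates
    # the response internally. (Field name is illustrative, not the
    # exact Bedrock schema.)
    return {"prompt": prompt}

def fine_tuning_record(prompt, completion):
    # Fine-tuning input: a prompt + desired-response pair.
    return {"prompt": prompt, "completion": completion}

# One JSON object per line in the training JSONL file:
record = fine_tuning_record("Summarize this bill.", "The bill covers March usage.")
jsonl_line = json.dumps(record)
```

The point to internalize: a distillation record has no response field at all, so invocation-log response data has nowhere to go.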

Inference Profiles - Two Separate Jobs

Job 1 - Cost attribution: Create one inference profile per cost center / clinic / team. Invoke models through the profile → AWS Cost Explorer breaks costs down by profile ID automatically.
Job 2 - High availability: Configure an inference profile with a primary region + secondary region. Bedrock fails over automatically.
Cross-Region inference provides failover capability, but round-robin load balancing is the wrong mental model. The exam wants failover (primary fails → secondary takes over).
Scenario: "multi-clinic app, need to track costs per clinic" → inference profiles per clinic ID.
Scenario: "high availability across regions" → inference profile with primary + secondary region config.
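For the cost-attribution job, the application side reduces to passing the right profile ARN as the model ID. A sketch with made-up clinic IDs and placeholder ARNs (real application inference profile ARNs come from your own account after you create the profiles):

```python
# Hypothetical per-clinic profile ARNs -- placeholders, not real resources.
PROFILE_ARNS = {
    "clinic-a": "arn:aws:bedrock:us-east-1:111122223333:application-inference-profile/clinic-a",
    "clinic-b": "arn:aws:bedrock:us-east-1:111122223333:application-inference-profile/clinic-b",
}

def model_id_for(clinic_id):
    """Return the profile ARN to pass as modelId, so Cost Explorer
    attributes the invocation's cost to this clinic's profile."""
    try:
        return PROFILE_ARNS[clinic_id]
    except KeyError:
        raise ValueError(f"no inference profile configured for {clinic_id}")
```

The model itself is identical across clinics; only the ARN you invoke through changes, and that ARN is what the billing breakdown keys on.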

Fine-Tuning vs Other Customization Techniques

Technique | When to Use | Data Input
Fine-tuning | Model needs domain-specific behavior / tone / format | Prompt + response JSONL pairs
Distillation | Make a cheaper, smaller model that mimics a large one | Prompts only (teacher generates responses)
Continued pre-training | Model needs deep domain knowledge (raw text) | Unlabeled text corpus
RAG | Model needs up-to-date or private knowledge | No model training; runtime retrieval
Prompt engineering | Quick behavior changes, no training needed | Just the prompt

RAG = no model modification. Fine-tuning = model weights change. Distillation = new smaller model. These are the three things the exam loves to mix.

Bedrock On-Demand vs SageMaker Endpoints

Scenario mentions a native Bedrock model (Nova, Claude, Titan, Llama via Bedrock) with unpredictable traffic → use Bedrock on-demand inference via Lambda. No endpoints to manage, automatic scaling, pay-per-token.
SageMaker real-time endpoints are for custom or self-managed models (models you trained yourself, Hugging Face models via JumpStart). For first-party Bedrock models, you don't need SageMaker endpoints at all.
Avoid: "deploy Nova to a SageMaker endpoint + auto scaling." Nova is a Bedrock-native model; you never deploy it to SageMaker. Use Bedrock's API directly.
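A minimal sketch of the Lambda-side call, assuming the Converse API response shape and an illustrative Nova model ID; the `client` parameter exists only so the function can be exercised without AWS access:

```python
MODEL_ID = "amazon.nova-lite-v1:0"  # illustrative model ID; verify in your region

def ask(prompt, client=None):
    """Invoke a Bedrock-native model on demand -- no endpoint to manage."""
    if client is None:
        import boto3  # deferred import so the sketch is testable offline
        client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```

Contrast with SageMaker: there is no endpoint to create, scale, or pay for while idle; you are billed per token on each call.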

1.3 - Data Validation & Processing Pipelines ⚠️ Your weakest subdomain (33%)

Bedrock Data Automation (BDA) - The Full Picture

BDA is a multimodal document intelligence service. It extracts structured information from PDFs, images, audio, and video automatically; think of it as a smart parser that understands document structure without you writing extraction logic.
BDA architecture: 1 project → multiple blueprints. Each blueprint describes one document type (e.g., electric bill, water bill, gas bill). When you send a document in, BDA automatically selects the right blueprint; you don't pick it.
The trap is creating one project per document type. That's backwards. One project, many blueprints inside it. The auto-selection is the whole point.
Scenario: "extract fields from various document types (bills, contracts, invoices)" → BDA with one project containing one blueprint per document type → invoke via InvokeDataAutomationAsync API.
BDA as RAG pre-processor: For complex multimodal content (financial filings, PDFs with charts), use BDA first to extract structured text, THEN feed into a Knowledge Base for RAG. Raw PDFs in a KB miss the embedded chart data; BDA catches it all.

BDA Blueprint: Transformation vs Validation

Transformation = reshape or reformat an extracted value. Example: split "John Smith" into FIRST_NAME + LAST_NAME using a custom type (reusable transformation definition).
Validation = check a constraint and reject if violated. Example: reject if a field is null or malformed.
When asked "how do you split a field into subcomponents?" → the answer is transformation with a custom type, not validation. Validation rejects; transformation reshapes.
Avoid confusing "enforce required subfields" with "split into subfields." Enforcing = validation. Splitting/reformatting = transformation.
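The reshape-vs-reject distinction in plain code, as a stand-in for what the blueprint expresses declaratively (function names and return shapes are illustrative, not BDA syntax):

```python
def transform_split_name(value):
    """Transformation: reshape the value -- split a full name into
    subfields. Nothing is rejected; the data is reformatted."""
    first, _, last = value.partition(" ")
    return {"FIRST_NAME": first, "LAST_NAME": last}

def validate_not_null(value):
    """Validation: check a constraint and reject on violation --
    the value passes through unchanged or the document is flagged."""
    if value is None or not str(value).strip():
        raise ValueError("required field is null or empty")
    return value
```

If the exam asks for splitting, the answer that raises/rejects is the wrong one; if it asks for enforcing presence, the answer that reshapes is the wrong one.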

Fine-Tuning Data Pipeline - Glue ETL, Not EMR

The standard fine-tuning data pipeline is: S3 (raw data) → AWS Glue crawler (catalog it) → Glue Data Catalog → Glue ETL jobs (transform to JSONL in Bedrock Converse API format) → S3 (curated) → Bedrock fine-tuning job.
EMR with Apache Spark is a valid tool for big data transformation, but it requires cluster management. For a straightforward fine-tuning data prep pipeline, Glue ETL is the AWS-recommended approach: serverless, managed, no infrastructure.
Glue crawler = discovers and catalogs data. Glue ETL = transforms it. These are two different things; you need BOTH in the pipeline.
Scenario: "prepare customer support transcripts for fine-tuning" → Glue crawler (catalog) → Glue ETL job (transform to JSONL) → S3 → Bedrock fine-tuning. Not EMR, not Lambda alone.

Amazon Comprehend - Entity Recognition vs Classification

Entity Recognition - use when extracting

  • Pull product names, brands, specs FROM text
  • Extract people, places, dates, organizations
  • Structured attribute extraction from prose
  • "What is in this text?"

Custom Classification - use when labeling

  • Assign a category label to a whole document
  • Sentiment (positive/negative/neutral)
  • Topic labeling ("this is a billing complaint")
  • "What type is this text?"

If the scenario asks to "extract product attributes" or "pull specs from descriptions" → entity recognition. Classification just puts a label on the whole thing; it doesn't extract individual fields.

Chunking Strategies for Knowledge Bases

Fixed-size chunking: Split at N tokens. Simple, fast, but can break mid-sentence. Good for homogeneous content.
Hierarchical chunking (built-in): Parent chunks + child chunks. Parent provides context, child provides precision. Good for standard documents with clear structure.
Semantic chunking: Split based on meaning/topic shifts. Better retrieval quality, slightly more expensive.
Custom Lambda chunking: Your own logic via Lambda + libraries like LangChain. Use for complex HTML, custom formats, non-standard structures.
Scenario: "complex HTML with nested headers and mixed content types" → custom Lambda chunking with LangChain deployed as a layer. The built-in strategies don't handle arbitrary HTML structure well enough.
The built-in hierarchical chunker works for Word docs and simple PDFs. For HTML or any complex custom format → custom Lambda chunking.
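To make "fixed-size with overlap" concrete, here is a naive whitespace-token chunker. This is a generic sketch, not the Bedrock custom-chunking Lambda contract (which has its own input/output JSON schema); a real custom chunker would apply format-aware logic (e.g., split on HTML headers) instead of counting words:

```python
def chunk(text, max_tokens=200, overlap=20):
    """Split text into fixed-size chunks of whitespace 'tokens',
    with each chunk overlapping the previous one by `overlap` tokens
    so sentences cut at a boundary still appear whole somewhere."""
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

The overlap is the detail exams like: without it, a fact straddling a chunk boundary is retrievable from neither chunk.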

PII in Data Pipelines

The right tool depends on where the data is and what you need:

Scenario | Right Tool
Detect + redact PII in text (emails, transcripts) before sending to FM | Amazon Comprehend: real-time PII detection API
Discover PII across S3 buckets at scale | Amazon Macie: automated S3 data discovery, custom classifiers for continuous monitoring
Extract text from scanned documents / images | Amazon Textract: OCR only, not NLP/PII detection
Search through documents after PII removal | Amazon Kendra: enterprise search (comes after Comprehend removes PII)

Textract is NOT a PII detector; it just extracts text from images/PDFs. Macie does not redact; it discovers and alerts. Comprehend actually redacts.
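Strictly speaking, Comprehend's `DetectPiiEntities` API returns entity types with character offsets, and your code does the redaction. A sketch of that pattern (the response shape with `Type`/`BeginOffset`/`EndOffset` follows the documented API; the `client` parameter is for offline testing):

```python
def redact_pii(text, client=None):
    """Mask PII spans found by Comprehend before the text reaches an FM."""
    if client is None:
        import boto3  # deferred import so the sketch is testable offline
        client = boto3.client("comprehend")
    resp = client.detect_pii_entities(Text=text, LanguageCode="en")
    # Replace from the end of the string backwards so earlier
    # offsets stay valid as the text shrinks/grows.
    for e in sorted(resp["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text
```

Note the reverse-sorted loop: replacing left-to-right would invalidate the remaining offsets whenever the mask differs in length from the original span.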

S3 Metadata Types - Exact Distinction

Type | Set By | Examples
System-defined | S3 automatically | Last-Modified, Content-Type, ETag, Content-Length, Storage-Class
User-defined | You, at upload time | x-amz-meta-author, x-amz-meta-source-id, x-amz-meta-department
Object tags | You, any time (even after upload) | Classification=Confidential, Discipline=Physics, env=prod

Timestamps (when the file was uploaded/modified) are system-defined; S3 manages them automatically. You cannot set Last-Modified yourself. Authorship details, source identifiers → user-defined metadata.
The difference matters because system metadata = S3-controlled, user metadata = you control at upload, tags = you control any time post-upload.

BDA vs EventBridge + Step Functions for Orchestration

BDA is an intelligence extraction service; it understands documents and media. It is not a workflow orchestrator.
Scenario: "process uploaded videos using Rekognition (object detection) + Bedrock (summaries)" → EventBridge + Step Functions: S3 upload → EventBridge rule → Step Functions state machine → call Rekognition → call Bedrock FMs → store results.
You cannot use a BDA blueprint to "orchestrate" multi-service video processing. BDA blueprints define what fields to extract from a document, not how to chain services together.
Avoid: "create a BDA blueprint to orchestrate the processing steps." BDA blueprints = field extraction schemas. Orchestration = Step Functions.
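The Rekognition → Bedrock chain sketches out as a Step Functions state machine. Heavily simplified ASL: `Parameters`, retries, and result handling are omitted, and the task resource ARNs follow the documented SDK/optimized integration patterns but should be verified against the Step Functions docs:

```json
{
  "Comment": "Illustrative: detect labels, then summarize with a Bedrock FM",
  "StartAt": "DetectLabels",
  "States": {
    "DetectLabels": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:rekognition:detectLabels",
      "Next": "Summarize"
    },
    "Summarize": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "End": true
    }
  }
}
```

An EventBridge rule on the S3 upload event starts an execution of this machine; the state machine, not BDA, is what chains the services.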

1.4 - Vector Store Solutions

OpenSearch Service vs Serverless - Hybrid Search

Hybrid search = combine dense vectors (semantic / embedding-based) with sparse vectors (keyword / BM25). You get semantic understanding AND exact-term matching in one query.

OpenSearch Service (managed)

  • Full feature set including hybrid search
  • Both sparse + dense vector support
  • Sub-second latency with k-NN indexing
  • Advanced filtering, analytics, sharding
  • ✓ Use for sub-second hybrid search

OpenSearch Serverless

  • Simpler, auto-scaling, less ops burden
  • Good for variable workloads
  • Dense vectors only (limited sparse support)
  • Fewer advanced search features
  • ✗ Not ideal for full hybrid search

If the exam scenario requires sub-second response or hybrid search (semantic + keyword) → OpenSearch Service (managed), NOT Serverless.
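For intuition, a hybrid request in OpenSearch's query DSL combines a BM25 clause and a k-NN clause under one `hybrid` query. The index and field names (`text`, `embedding`) and the toy vector below are made up, and OpenSearch additionally requires a search pipeline with a normalization processor to blend the two score scales:

```json
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "text": { "query": "billing dispute refund" } } },
        { "knn": { "embedding": { "vector": [0.12, -0.41, 0.88], "k": 10 } } }
      ]
    }
  }
}
```

The `match` clause supplies exact-term recall; the `knn` clause supplies semantic recall; the pipeline's normalization step decides how the two rankings merge.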

Semantic Cache with OpenSearch k-NN

Semantic caching = cache responses by meaning, not by exact string. "What's the price?" and "How much does it cost?" should hit the same cache entry.
Implementation: Lambda generates an embedding for the incoming query → searches OpenSearch k-NN vector index → if similarity score exceeds threshold → return cached response → skip FM invocation entirely.
ElastiCache (Redis) with key-value exact string matching is NOT semantic caching. It will only hit cache if the user types the exact same question. For conversational AI this has very low hit rates.
Scenario: "high costs from repeated similar questions, need caching" → OpenSearch k-NN semantic cache, not ElastiCache with exact matching.
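The threshold logic is the core idea. A self-contained sketch with an in-memory list standing in for the OpenSearch k-NN index (the threshold value and tiny vectors are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """In-memory stand-in for an OpenSearch k-NN index: store
    (embedding, response) pairs, return a cached response only when
    the nearest neighbor clears the similarity threshold."""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best and cosine(embedding, best[0]) >= self.threshold:
            return best[1]  # cache hit: skip the FM invocation entirely
        return None  # cache miss: call the FM, then put() the result
```

With exact-string matching, a paraphrased question always misses; here it hits as long as its embedding lands close enough to a stored one.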

Vector Store Selection Guide

Service | Best For | Key Trait
OpenSearch Service | High-performance, real-time, hybrid search | Most features, sub-second latency
OpenSearch Serverless | Variable workloads, managed scaling | No cluster management
Aurora PostgreSQL + pgvector | Relational data + vector search in same DB | SQL interface, ACID transactions
Amazon S3 Vectors | Large-scale, cost-effective storage | Cheapest per vector
Bedrock Knowledge Bases | Fully managed RAG, no vector-ops expertise needed | Zero infrastructure, end-to-end managed

1.5 - Retrieval Mechanisms for FM Augmentation

Reranking - When More Docs Isn't the Answer

After the initial vector search returns Top-K chunks, a reranker does a second pass: it scores each chunk against the query for relevance and reorders them. The best chunks go to the LLM; noisy irrelevant ones fall off.
Increasing the number of retrieved documents (K) does NOT improve relevance; it usually makes things worse by adding low-quality noise into the context. The model gets confused by irrelevant chunks.
Scenario: "retrieved chunks are correct but summaries are contextually irrelevant" → enable reranking in Bedrock Knowledge Bases. The retrieval is finding the right documents, but the ranking is wrong.
Retrieval pipeline: Query → Vector search → Top-K candidates → Reranker → Best-N → LLM context window. The reranker is the quality filter before the LLM sees anything.
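The pipeline's second pass reduces to "score, sort, truncate." A sketch where `score_fn` stands in for the reranker model (in Bedrock this would be a managed rerank model; here it is any query-chunk relevance function):

```python
def rerank(query, chunks, score_fn, keep=3):
    """Second-pass rerank: score each retrieved chunk against the
    query and keep only the best `keep` for the LLM context window."""
    scored = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:keep]
```

The truncation is the point: the LLM sees the best N, not the raw Top-K, which is why raising K without a reranker only adds noise.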

Knowledge Base Sync - SQS for Resilience

Pattern for near-real-time KB sync: S3 Event Notification → SQS queue → Lambda polls queue → calls IngestKnowledgeBaseDocuments API.
Direct Lambda trigger (S3 → Lambda directly, no SQS) works but has no retry buffer. If Lambda fails during ingestion, the event is gone. With SQS, the message stays in the queue and gets retried automatically.
SQS = resilience via message persistence + retry. Direct Lambda trigger = fire-and-forget with no retry. For production KB sync, always use SQS as the buffer.
For deletes: S3 fires both object-created AND object-deleted events. Your Lambda should call the appropriate KB API for each event type.
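A sketch of the Lambda in that pattern: each SQS record's body wraps an S3 event, and creates vs. deletes route to different KB calls. `kb_client.ingest` / `kb_client.delete` are hypothetical wrappers around the KB document ingest/remove APIs, not real boto3 method names:

```python
import json

def handler(event, kb_client):
    """SQS-triggered Lambda for KB sync. Any unhandled exception here
    leaves the message in the queue, so SQS retries it automatically."""
    for record in event["Records"]:          # SQS records
        s3_event = json.loads(record["body"])  # each body is an S3 event
        for rec in s3_event.get("Records", []):
            key = rec["s3"]["object"]["key"]
            if rec["eventName"].startswith("ObjectCreated"):
                kb_client.ingest(key)   # hypothetical ingest wrapper
            elif rec["eventName"].startswith("ObjectRemoved"):
                kb_client.delete(key)   # hypothetical removal wrapper
```

The resilience argument lives in the comment on the first line: with a direct S3 trigger a raised exception drops the event, while here it returns the message to the queue.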

KB Ingestion - Split Large Docs, Don't Compress

Bedrock KB has per-document size limits. If ingestion fails for large PDFs, the solution is to split the document into smaller files before uploading to S3.
S3 bucket compression (gzip, etc.) changes file size, but Bedrock decompresses before processing; it still sees the same large document. Compression doesn't help KB size limits.
Avoid: "enable S3 compression to reduce document size." The KB limit is about document content size, not storage footprint.
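A splitting sketch for text content, breaking on paragraph boundaries so no chunk file cuts mid-paragraph. The byte limit is a placeholder, not the actual KB quota, and a single paragraph larger than the limit is kept whole (a real pipeline would need a fallback for that case):

```python
def split_document(text, max_bytes=1_000_000):
    """Split a large text document into pieces under a per-document
    size limit (limit value illustrative), on paragraph boundaries."""
    parts, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para) if current else para
        if len(candidate.encode("utf-8")) > max_bytes and current:
            parts.append(current)   # close out the full piece
            current = para          # start the next piece
        else:
            current = candidate
    if current:
        parts.append(current)
    return parts
```

Each returned piece is uploaded to S3 as its own object, so the KB sees several small documents instead of one oversized one.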

RAG vs Response Streaming - Different Problems

RAG solves accuracy and knowledge freshness problems. The model doesn't know your product catalog? → inject it via RAG.
Response streaming solves perceived latency problems. The model response takes 8 seconds to complete? → stream tokens as they're generated so the user sees output immediately.
If the scenario describes inaccurate answers about products → RAG. If the scenario describes slow user experience / users waiting → streaming. Never use streaming to fix accuracy.
Streaming doesn't change what the model knows. It only changes when the user sees the response. RAG changes what the model can answer correctly.

Knowledge Base Logging - Two Separate Log Types

Model invocation logs = what was sent to the LLM and what it replied (input prompts, output text, token counts, latency). Captured via Amazon S3 or CloudWatch Logs in Bedrock settings.
Knowledge base ingestion logs = what happened during document processing (which files succeeded, which failed, why chunking/embedding errored). Configured separately in KB settings → CloudWatch Logs destination.
Scenario: "documents are failing to ingest / embeddings not generated" → KB ingestion logs → CloudWatch Logs Insights.
Scenario: "track what prompts users are sending" → model invocation logs.
These are two completely separate log streams. You cannot find KB ingestion failures in model invocation logs.