Bedrock Data Automation (BDA): The Full Picture
BDA is a multimodal document intelligence service. It extracts structured information from PDFs, images, audio, and video automatically. Think of it as a smart parser that understands document structure without you writing extraction logic.
BDA architecture: one project → multiple blueprints. Each blueprint describes one document type (e.g., electric bill, water bill, gas bill). When you send a document in, BDA automatically selects the right blueprint; you don't pick it.
The trap is creating one project per document type. That's backwards. One project, many blueprints inside it. The auto-selection is the whole point.
Scenario: "extract fields from various document types (bills, contracts, invoices)" → BDA with one project containing one blueprint per document type → invoke via the InvokeDataAutomationAsync API.
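A minimal sketch of what that invocation looks like. The request is built locally here (no AWS call); the parameter names follow the BDA runtime API as I understand it, and all ARNs and S3 URIs are placeholders — verify field names against the current boto3 docs for `bedrock-data-automation-runtime` before relying on them.

```python
# Sketch: build the request for InvokeDataAutomationAsync. Note that the
# request names only the PROJECT -- never a specific blueprint. BDA matches
# the incoming document to the right blueprint automatically.
def build_bda_request(input_s3_uri: str, output_s3_uri: str,
                      project_arn: str, profile_arn: str) -> dict:
    return {
        "inputConfiguration": {"s3Uri": input_s3_uri},
        "outputConfiguration": {"s3Uri": output_s3_uri},
        "dataAutomationConfiguration": {"dataAutomationProjectArn": project_arn},
        "dataAutomationProfileArn": profile_arn,
    }

request = build_bda_request(
    "s3://docs-in/bill-0042.pdf",
    "s3://docs-out/",
    "arn:aws:bedrock:us-east-1:111122223333:data-automation-project/my-project",
    "arn:aws:bedrock:us-east-1:111122223333:data-automation-profile/us.data-automation-v1",
)
# client = boto3.client("bedrock-data-automation-runtime")
# client.invoke_data_automation_async(**request)
```

The absence of any blueprint identifier in the request is the point: one project, many blueprints, auto-selection.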
BDA as RAG pre-processor: For complex multimodal content (financial filings, PDFs with charts), use BDA first to extract structured text, THEN feed into a Knowledge Base for RAG. Raw PDFs in a KB miss the embedded chart data; BDA catches it all.
BDA Blueprint: Transformation vs Validation
Transformation = reshape or reformat an extracted value. Example: split "John Smith" into FIRST_NAME + LAST_NAME using a custom type (reusable transformation definition).
Validation = check a constraint and reject if violated. Example: reject if a field is null or malformed.
When asked "how do you split a field into subcomponents?" → the answer is transformation with a custom type, not validation. Validation rejects; transformation reshapes.
Avoid confusing "enforce required subfields" with "split into subfields." Enforcing = validation. Splitting/reformatting = transformation.
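The distinction can be made concrete in plain Python (this is conceptual pseudologic, not BDA's actual blueprint schema language):

```python
# Transformation: reshape a value into subfields. It never rejects.
def transform_full_name(value: str) -> dict:
    first, _, last = value.partition(" ")
    return {"FIRST_NAME": first, "LAST_NAME": last}

# Validation: check a constraint. On failure, the field/document is rejected.
def validate_required(value) -> bool:
    return value is not None and str(value).strip() != ""

transform_full_name("John Smith")  # reshapes into two subfields
validate_required("")              # fails -> reject
```

If the scenario ends with new subfields existing, it's a transformation; if it ends with a document being rejected, it's a validation.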
Fine-Tuning Data Pipeline: Glue ETL, Not EMR
The standard fine-tuning data pipeline is: S3 (raw data) → AWS Glue crawler (catalog it) → Glue Data Catalog → Glue ETL jobs (transform to JSONL in Bedrock Converse API format) → S3 (curated) → Bedrock fine-tuning job.
EMR with Apache Spark is a valid tool for big data transformation, but it requires cluster management. For a straightforward fine-tuning data prep pipeline, Glue ETL is the AWS-recommended approach: serverless, managed, no infrastructure.
Glue crawler = discovers and catalogs data. Glue ETL = transforms it. These are two different things; you need BOTH in the pipeline.
Scenario: "prepare customer support transcripts for fine-tuning" → Glue crawler (catalog) → Glue ETL job (transform to JSONL) → S3 → Bedrock fine-tuning. Not EMR, not Lambda alone.
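The core of the Glue ETL step can be sketched as a simple transform from (question, answer) transcript pairs into Converse-style JSONL. The exact record schema varies by model family (some require extra fields such as a schema version), so treat this shape as illustrative and check the Bedrock fine-tuning docs for your target model.

```python
import json

# Sketch: transform raw transcript pairs into one JSONL line per training
# example, using Converse-style message structure.
def to_converse_jsonl(pairs):
    lines = []
    for question, answer in pairs:
        record = {
            "messages": [
                {"role": "user", "content": [{"text": question}]},
                {"role": "assistant", "content": [{"text": answer}]},
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_converse_jsonl([
    ("How do I reset my password?", "Go to Settings, then Security, then Reset."),
])
```

In the real pipeline this logic runs inside a Glue ETL job reading from the Data Catalog table the crawler created, writing the JSONL back to a curated S3 prefix.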
Amazon Comprehend: Entity Recognition vs Classification
Entity Recognition: use when extracting
- Pull product names, brands, specs FROM text
- Extract people, places, dates, organizations
- Structured attribute extraction from prose
- "What is in this text?"
Custom Classification: use when labeling
- Assign a category label to a whole document
- Sentiment (positive/negative/neutral)
- Topic labeling ("this is a billing complaint")
- "What type is this text?"
If the scenario asks to "extract product attributes" or "pull specs from descriptions" → entity recognition. Classification just puts a label on the whole thing; it doesn't extract individual fields.
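The output shapes make the distinction obvious. The responses below are mocked (no API call), shaped like Comprehend's documented `DetectEntities` and custom-classification (`ClassifyDocument`) responses:

```python
# Entity recognition: MANY spans pulled FROM the text, each with offsets.
entity_response = {
    "Entities": [
        {"Text": "AcmePhone X2", "Type": "COMMERCIAL_ITEM", "Score": 0.97,
         "BeginOffset": 0, "EndOffset": 12},
        {"Text": "6.1-inch", "Type": "QUANTITY", "Score": 0.91,
         "BeginOffset": 20, "EndOffset": 28},
    ]
}

# Custom classification: ONE label for the whole document.
classify_response = {
    "Classes": [{"Name": "billing_complaint", "Score": 0.93}]
}

extracted = [e["Text"] for e in entity_response["Entities"]]  # fields pulled out
label = classify_response["Classes"][0]["Name"]               # single label
```

Extraction yields a list of fields with positions; classification yields one label. If the scenario needs the fields, classification cannot help.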
Chunking Strategies for Knowledge Bases
Fixed-size chunking: Split at N tokens. Simple, fast, but can break mid-sentence. Good for homogeneous content.
Hierarchical chunking (built-in): Parent chunks + child chunks. Parent provides context, child provides precision. Good for standard documents with clear structure.
Semantic chunking: Split based on meaning/topic shifts. Better retrieval quality, slightly more expensive.
Custom Lambda chunking: Your own logic via Lambda + libraries like LangChain. Use for complex HTML, custom formats, non-standard structures.
Scenario: "complex HTML with nested headers and mixed content types" → custom Lambda chunking with LangChain deployed as a layer. The built-in strategies don't handle arbitrary HTML structure well enough.
The built-in hierarchical chunker works for Word docs and simple PDFs. For HTML or any complex custom format → custom Lambda chunking.
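A stdlib-only sketch of the kind of logic a custom chunking Lambda runs: start a new chunk at each header so nested sections stay together. The `lambda_handler` event/response shape here is a simplified, hypothetical contract; the real input/output format for Bedrock KB custom chunking (and any LangChain-based splitter you layer in) is defined in the Bedrock documentation.

```python
from html.parser import HTMLParser

# Header-aware chunker: each h1/h2/h3 opens a new chunk, so a section and
# its body text land in the same chunk instead of being split at N tokens.
class HeaderChunker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = [[]]

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3") and self.chunks[-1]:
            self.chunks.append([])  # header boundary -> new chunk

    def handle_data(self, data):
        if data.strip():
            self.chunks[-1].append(data.strip())

def chunk_html(html: str):
    parser = HeaderChunker()
    parser.feed(html)
    return [" ".join(c) for c in parser.chunks if c]

def lambda_handler(event, context):  # hypothetical simplified contract
    return {"chunks": chunk_html(event["html"])}
```

Fixed-size chunking would happily cut between a header and its body; this is exactly the structure-awareness the built-in strategies lack for arbitrary HTML.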
PII in Data Pipelines
The right tool depends on where the data is and what you need:
| Scenario | Right Tool |
|---|---|
| Detect + redact PII in text (emails, transcripts) before sending to an FM | Amazon Comprehend: real-time PII detection API |
| Discover PII across S3 buckets at scale | Amazon Macie: automated sensitive data discovery for S3, custom data identifiers for continuous monitoring |
| Extract text from scanned documents / images | Amazon Textract: OCR only, not NLP/PII detection |
| Search through documents after PII removal | Amazon Kendra: enterprise search (comes after Comprehend removes PII) |
Textract is NOT a PII detector; it just extracts text from images/PDFs. Macie does not redact; it discovers and alerts. Comprehend actually redacts.
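Redaction in practice uses the `BeginOffset`/`EndOffset` spans that Comprehend's PII detection returns. The entities below are a mocked response; in a real pipeline they come from the `DetectPiiEntities` API:

```python
# Sketch: replace each detected PII span with its type tag.
def redact(text: str, entities) -> str:
    # Replace from the end of the string first so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text

mock_entities = [
    {"Type": "EMAIL", "BeginOffset": 11, "EndOffset": 27, "Score": 0.99},
]
redact("Contact me jane@example.com today.", mock_entities)
# -> "Contact me [EMAIL] today."
```

Note the reverse-sorted loop: replacing left-to-right would shift every later offset and corrupt the redaction.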
S3 Metadata Types: Exact Distinction
| Type | Set By | Examples |
|---|---|---|
| System-defined | S3 automatically | Last-Modified, Content-Type, ETag, Content-Length, Storage-Class |
| User-defined | You, at upload time | x-amz-meta-author, x-amz-meta-source-id, x-amz-meta-department |
| Object tags | You, any time (even after upload) | Classification=Confidential, Discipline=Physics, env=prod |
Timestamps (when the file was uploaded/modified) are system-defined; S3 manages them automatically. You cannot set Last-Modified yourself. Authorship details, source identifiers → user-defined metadata.
The difference matters because system metadata = S3-controlled, user metadata = you control at upload, tags = you control any time post-upload.
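The three types map to different API surfaces. A sketch of the request shapes (bucket, key, and values are placeholders; parameter shapes follow the S3 `put_object` / `put_object_tagging` APIs):

```python
put_object_params = {
    "Bucket": "my-bucket",
    "Key": "papers/quantum.pdf",
    "Body": b"file bytes here",
    # User-defined metadata: set ONLY at upload time. S3 stores each key
    # prefixed as x-amz-meta-<name>; changing it means re-uploading the object.
    "Metadata": {"author": "j-doe", "source-id": "ingest-42"},
}

# Object tags: can be set or changed ANY time after upload.
put_tagging_params = {
    "Bucket": "my-bucket",
    "Key": "papers/quantum.pdf",
    "Tagging": {"TagSet": [{"Key": "Discipline", "Value": "Physics"}]},
}

# System-defined metadata (Last-Modified, ETag, Content-Length) has no
# request field at all -- it only appears in responses, set by S3 itself.
```

If a scenario says "attach classification labels after ingestion", only tags fit; if it says "record the author at upload", that's user-defined metadata.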
BDA vs EventBridge + Step Functions for Orchestration
BDA is an intelligence extraction service: it understands documents and media. It is not a workflow orchestrator.
Scenario: "process uploaded videos using Rekognition (object detection) + Bedrock (summaries)" → EventBridge + Step Functions: S3 upload → EventBridge rule → Step Functions state machine → call Rekognition → call Bedrock FMs → store results.
You cannot use a BDA blueprint to "orchestrate" multi-service video processing. BDA blueprints define what fields to extract from a document, not how to chain services together.
Avoid: "create a BDA blueprint to orchestrate the processing steps." BDA blueprints = field extraction schemas. Orchestration = Step Functions.
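The state machine for that scenario might look like the ASL sketch below. The service-integration resource ARNs and parameters are abbreviated and partly hypothetical (the real Rekognition video flow is async and needs a wait/poll step, and Bedrock `InvokeModel` needs a request body); check the Step Functions service-integration docs before using them.

```python
import json

# Sketch: three chained Task states -- Rekognition, then Bedrock, then S3.
# This chaining is what Step Functions does and BDA blueprints do not.
state_machine = {
    "StartAt": "DetectLabels",
    "States": {
        "DetectLabels": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:rekognition:startLabelDetection",
            "Parameters": {"Video": {"S3Object": {"Bucket.$": "$.bucket",
                                                  "Name.$": "$.key"}}},
            "Next": "Summarize",
        },
        "Summarize": {
            "Type": "Task",
            "Resource": "arn:aws:states:::bedrock:invokeModel",
            "Parameters": {"ModelId": "anthropic.claude-3-haiku-20240307-v1:0"},
            "Next": "StoreResults",
        },
        "StoreResults": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:s3:putObject",
            "End": True,
        },
    },
}
definition = json.dumps(state_machine)
```

Notice there is no "blueprint" anywhere: a BDA blueprint could describe fields to extract from one document, but it has no concept of `Next`, states, or calling other services.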