DP-700 Exam Overview

  • Passing score: 700 / 1000
  • ~55 questions (estimated)
  • 3 equally weighted domains
  • Open-book exam (MS Learn access)
Domain 1 (30–35%): Implement & Manage an Analytics Solution
Workspace settings (Spark, domain, OneLake, Dataflows Gen2), lifecycle management (Git, deployment pipelines, database projects), security & governance (RLS, CLS, DDM, sensitivity labels, audit logs), orchestration patterns.

Domain 2 (30–35%): Ingest & Transform Data
Full/incremental loads, dimensional model prep, shortcuts, mirroring, pipelines, PySpark/SQL/KQL transforms, streaming (Eventstreams, Spark Structured Streaming), windowing functions.

Domain 3 (30–35%): Monitor & Optimize an Analytics Solution
Monitor ingestion/transformation/semantic model refresh, configure alerts, resolve pipeline/Dataflow/notebook/KQL/Eventstream/T-SQL errors, optimize Lakehouse/pipeline/warehouse/Spark/query performance.
📖
Open-book strategy: At exam start, immediately open 3 MS Learn tabs: search arg_max() (KQL reference), DENSE_RANK (T-SQL reference), pyspark.sql (PySpark reference). Use split-screen for at most 5–10 lookups — it eats time fast. Answer all questions first.
🎯
What this exam really tests: Pick the RIGHT Fabric item for each scenario. Three pattern types dominate: (1) Lakehouse vs Warehouse vs Eventhouse, (2) Pipeline vs Dataflow vs Notebook, (3) Write/debug code in T-SQL / PySpark / KQL. Fabric-native ALWAYS beats legacy Azure equivalents.

The 5 "Always Wrong" Answers

| If you see… | The exam answer is… |
|---|---|
| Azure Data Factory | Fabric Data Pipeline (same UI, SaaS-native) |
| Synapse Dedicated SQL Pool | Fabric Warehouse |
| Azure Stream Analytics | Eventstream + Eventhouse |
| DAX authoring questions | Not on DP-700 (that's DP-600) |
| ARM templates for deployment | Fabric Deployment Pipeline + Git integration |
D1·1 Architecture & OneLake
💡
OneLake = "OneDrive for data." One per tenant. Zero provisioning. Built on ADLS Gen2 (same APIs/SDKs) but fully SaaS — no storage accounts, no access keys, no resource groups. Every Fabric item writes Delta Parquet to OneLake automatically.

Hierarchy — Memorize This

| Level | Description | Key fact |
|---|---|---|
| Tenant | One OneLake per org | Single Azure AD tenant |
| Capacity | Compute unit (CUs) | Runs in ONE Azure region; defines SKU (F2–F2048) |
| Workspace | Primary security boundary | Linked to exactly one capacity |
| Item | Lakehouse, Warehouse, Pipeline, Notebook, Eventhouse… | All store data in OneLake as Delta Parquet |
| Sub-item | /Tables, /Files folders | Only /Tables (Delta) visible in SQL endpoint & Direct Lake |

OneLake Shortcuts — The Exam Loves These

Shortcuts are symbolic links that point to data elsewhere without copying it.

| Property | Detail |
|---|---|
| Where creatable | Lakehouses and KQL databases ONLY |
| Internal sources | Other lakehouses, warehouses, KQL databases, mirrored databases |
| External sources | ADLS Gen2, Amazon S3, Google Cloud Storage, Dataverse, Azure Blob Storage, S3-compatible |
| In /Tables | Top-level ONLY — target must contain Delta data to be recognized as a table |
| In /Files | Any folder level |
| Internal security | Passthrough (caller's identity) |
| External security | Delegated (stored credentials — any lakehouse user can read through) |
| Cross-cloud caching | 1–28 day retention; files >1 GB are NOT cached |
| Deleting a shortcut | Removes pointer ONLY — target data untouched |
🔴
Exam trap: "S3 data with egress cost concerns" → shortcut with caching disabled (no local copy). Know the difference between accelerated vs. non-accelerated shortcuts in Real-Time Intelligence (query acceleration adds a cached copy in OneLake).

Lakehouse vs. Warehouse vs. Eventhouse

| Signal in the question | Answer |
|---|---|
| Spark, PySpark, notebooks, unstructured data, ML, medallion architecture | Lakehouse |
| T-SQL, stored procedures, multi-table transactions, star schema, dimensional model | Warehouse |
| Streaming, telemetry, IoT, logs, time-series, sub-second latency, KQL | Eventhouse / KQL DB |
| Semi-structured data readable by T-SQL, KQL, AND Spark, written by Spark | Lakehouse |
| On-premises SQL Server data to Fabric | Data Pipeline (Copy) |

SQL Analytics Endpoint — Read-Only, Period

Every Lakehouse auto-gets a SQL analytics endpoint. Supports: SELECT, CREATE VIEW, CREATE FUNCTION, CREATE SECURITY POLICY. No INSERT, UPDATE, DELETE, CREATE TABLE, or any DML. Auto-syncs Delta tables from /Tables within ~1 min. Sync halts after 15 min of inactivity.

Note: Lakehouses now support custom schemas (beyond just `dbo`). Tables organized into custom schemas via Spark sync successfully to the SQL analytics endpoint.

🔴
Exam trap: Unmanaged external tables created via Spark with a custom LOCATION pointing to /Files are NOT visible in the SQL analytics endpoint. Only managed Delta tables in /Tables appear. Use OneLake shortcuts to bridge the gap.

Managed vs. External Tables

| | Managed | External (Unmanaged) |
|---|---|---|
| Spark manages | Data + metadata | Metadata only |
| Storage | /Tables (default) | User-specified (e.g., /Files/bronze/) |
| On DROP TABLE | Data + metadata deleted | Only metadata removed — files persist |
| SQL analytics endpoint | ✅ Visible | ❌ Not visible |
| Direct Lake (Power BI) | ✅ Works | ❌ Does not work |

Fabric Warehouse Key Facts

Supported

  • Full T-SQL DML/DDL: INSERT, UPDATE, DELETE, MERGE
  • Stored procedures, COPY INTO, CTAS
  • CREATE TABLE AS CLONE OF (zero-copy, 30-day time travel)
  • Three-part cross-database queries (same workspace)
  • IDENTITY property for generating surrogate keys

NOT Supported

  • Triggers, user-created indexes, cursors
  • Recursive CTEs, materialized views
  • @@IDENTITY / SCOPE_IDENTITY(), FOR XML
  • DBCC commands, CREATE USER
  • Named procedure parameters
  • Spark access (Warehouse is T-SQL only)
⚠️
Data type migration gotchas: money → decimal(19,4), nvarchar → varchar, tinyint → smallint, datetimeoffset → datetime2 (time zone data lost)
D1·2 Security & Governance

Workspace Roles Permission Matrix

| Capability | Admin | Member | Contributor | Viewer |
|---|---|---|---|---|
| Delete workspace, manage Git integration | ✅ | ❌ | ❌ | ❌ |
| Add/remove users (lower roles) | ✅ | ✅ | ❌ | ❌ |
| Create/modify/execute items (notebooks, pipelines) | ✅ | ✅ | ✅ | ❌ |
| Read data via T-SQL / SQL endpoint | ✅ | ✅ | ✅ | ✅ |
| Read data via Spark / OneLake APIs | ✅ | ✅ | ✅ | ❌ |
| Share items with others | ✅ | ✅ | ❌ | ❌ |
🔴
Critical: Admin/Member/Contributor have implicit OneLake Write access that overrides OneLake security roles and RLS/CLS. Viewers can query the SQL analytics endpoint but CANNOT access data via Spark, Lakehouse explorer, or OneLake APIs.

Item-Level Permissions

| Permission | Allows | Watch out |
|---|---|---|
| Read | Connect to the item | |
| ReadData | Query data via T-SQL (SQL endpoint) | Respects RLS/CLS/DDM |
| ReadAll | Access raw OneLake files directly | BYPASSES all SQL-level security! |
| Write | Full administrative access | |
| Share | Manage permissions for others | |
🔴
ReadAll bypasses RLS/CLS/DDM. When implementing SQL security, ensure users have ONLY ReadData, not ReadAll. ReadAll gives access to raw Parquet files in OneLake, bypassing all warehouse-level security.

Row-Level Security (RLS)

T-SQL — RLS Setup
-- Step 1: Create filter function
CREATE FUNCTION Security.fn_filter(@SalesRep varchar(50))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT 1 AS result
WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'manager@contoso.com';

-- Step 2: Create security policy
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE Security.fn_filter(SalesPersonColumn)
ON dbo.Sales WITH (STATE = ON);

Works in both Warehouse and Lakehouse SQL analytics endpoint. Uses USER_NAME() or SUSER_SNAME(). Admin/Member/Contributor bypass RLS — it only restricts Viewer-level and shared-access users.

Dynamic Data Masking (DDM)

| Mask type | Output example | Use case |
|---|---|---|
| default() | XXXX / 0 / 01-01-1900 | Generic sensitive values |
| email() | aXX@XXXX.com | Email addresses |
| partial(2,"XXX",2) | 55XXX89 | Phone numbers, IDs |
| random(1,100) | 42 (random number) | Numeric values |
⚠️
Admin/Member/Contributor see unmasked data. Use GRANT UNMASK to selectively reveal. DDM alone doesn't protect against inference attacks — pair with RLS/CLS.
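DDM's partial() mask keeps a fixed prefix and suffix and substitutes the literal padding string for everything in between. A minimal pure-Python sketch of those semantics (illustrative only, not the SQL engine's implementation; `partial_mask` is a hypothetical helper name):

```python
def partial_mask(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Emulate DDM partial(prefix, padding, suffix): keep the first `prefix`
    and last `suffix` characters, replace the middle with the literal
    `padding` string."""
    if len(value) < prefix + suffix:
        # Too short to expose both ends: mask the whole value
        return padding
    tail = value[len(value) - suffix:] if suffix > 0 else ""
    return value[:prefix] + padding + tail

print(partial_mask("5551234589", 2, "XXX", 2))  # 55XXX89
```

Note the padding is used verbatim; the number of X characters in the output does not depend on the input's length.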

OneLake Security (Data Access Roles)

  • Fine-grained deny-by-default access at storage layer for Lakehouses
  • Create roles granting Read (or ReadWrite) to specific tables/folders, assign Entra users/groups
  • Only affects Viewer-level users — Admin/Member/Contributor override it
  • Every Lakehouse has a DefaultReader role — can be modified or deleted

Microsoft Purview Integration

📖 Unified Data Catalog
Auto-catalogs all Fabric assets for discovery
🔗 Data Lineage
Traces data flow across items automatically
🏷️ Sensitivity Labels
Propagate downstream — label source, derived items inherit. Requires Power BI Pro + AIP license
🛡️ DLP Policies
Detect sensitive data in OneLake
📋 Audit Logs
Activity logging via Fabric audit log (Admin portal → Audit logs)
D1·3 Orchestration: Pipeline vs Dataflow vs Notebook
| Tool | Use when | Do NOT use when |
|---|---|---|
| Dataflows Gen2 | Low-code Power Query transforms, 100+ connectors, light ingestion + reshaping | Complex orchestration, massive datasets, code-heavy logic |
| Data Pipelines | Orchestrating end-to-end workflows, scheduling, chaining, conditional logic, error handling | Transforming data (pipelines have no native transformation) |
| Spark Notebooks | Complex transforms, large-scale data, SCD Type 2 logic, ML, schema inference | Simple copy operations (overkill — use Copy activity) |
🔴
Key rule: Pipelines are the orchestrator (the "glue"). They call Notebooks, Dataflows, Stored Procedures, and Copy activities. They never transform data themselves.
• "Orchestrate/schedule" → Pipeline  • "Transform 50M rows" → Notebook  • "Ingest from SharePoint/API with light cleanup" → Dataflow Gen2

Dataflows Gen2 Key Facts

Gen2 vs Gen1

  • Gen2 outputs to multiple destinations (Lakehouse, Warehouse, Azure SQL DB, KQL DB) — Gen1 only wrote to internal storage
  • Gen2 supports Git integration; Gen1 does not
  • Gen2 requires Fabric capacity (no Pro workspace)

Output modes & Staging

  • Only append or replace — no native upsert/MERGE
  • For SCD Type 2 → use a Notebook
  • Staging: Enabled by default for ALL destinations, including Lakehouses. Best practice: manually disable for simple copy-to-lakehouse to avoid double-write penalty.

Pipeline Activities Cheat Sheet

| Activity | Purpose |
|---|---|
| Copy Data | Moves data source→destination, no transformation |
| Notebook | Runs a Spark notebook |
| Dataflow | Triggers a Dataflow Gen2 |
| Stored Procedure | Runs an SP in supported SQL databases |
| ForEach | Iterates over a collection |
| If Condition / Switch | Conditional branching |
| Lookup | Reads a value from a source; output feeds downstream activities |
| Set Variable / Append Variable | Manage pipeline variables |
| Invoke Pipeline | Calls another pipeline (supports cross-workspace) |
| Web | Calls a REST endpoint |
| Get Metadata | Retrieves metadata from data sources |
| Fail | Forces pipeline failure with a custom error message |
🔴
Activities with no dependency arrows run SIMULTANEOUSLY (not sequentially). Max 120 activities per pipeline.
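The parallel-by-default rule can be pictured as wave-based scheduling: every activity whose dependency arrows are all satisfied runs concurrently in the current wave. A pure-Python sketch (`execution_waves` is a hypothetical illustration, not Fabric's scheduler):

```python
def execution_waves(activities, deps):
    """Group pipeline activities into waves. `deps` maps an activity to the
    set of upstream activities it depends on (the dependency arrows).
    Everything within a wave runs simultaneously."""
    remaining = set(activities)
    done, waves = set(), []
    while remaining:
        wave = sorted(a for a in remaining if deps.get(a, set()) <= done)
        if not wave:
            raise ValueError("cyclic dependencies")
        waves.append(wave)
        done |= set(wave)
        remaining -= set(wave)
    return waves

# CopyA and CopyB have no arrows -> run simultaneously; Notebook waits on both
print(execution_waves(["CopyA", "CopyB", "Notebook"],
                      {"Notebook": {"CopyA", "CopyB"}}))
# [['CopyA', 'CopyB'], ['Notebook']]
```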
D1·4 Lifecycle Management & Deployment

Git Integration

  • Supports Azure DevOps and GitHub
  • For Azure DevOps you need: organization, project, Git repository, and branch
  • Only one Git repo branch per workspace at a time
  • Not all items support Git (e.g., dashboards, some real-time items)

Deployment Pipelines

  • Fabric-native CI/CD: Dev → Test → Prod stages
  • Use Fabric Deployment Pipeline (not GitHub Actions, not ADF, not ARM templates)
  • Deploy eventhouse, lakehouse, warehouse, notebooks, pipelines
  • Pre/post-deployment rules can parameterize connections per stage
🔴
"Deploy an eventhouse to production" → Use a Fabric deployment pipeline, NOT GitHub Actions, NOT Azure DevOps pipeline, NOT ARM templates.

Workspace Settings (Exam Targets)

| Setting type | What you configure |
|---|---|
| Spark settings | Spark compute pools, default runtime version, environment defaults, high concurrency mode |
| OneLake settings | OneLake security roles, data access controls at folder/table level |
| Domain settings | Assign workspace to a domain for governance grouping in Purview |
| Dataflows Gen2 settings | Enable/disable staging, configure default destination |
D2·1 The Big Decision Matrices
🧠
These decision trees appear in nearly every question set. Memorize them cold.

Choosing the Right Data Store

| Need | Use | Why |
|---|---|---|
| File storage + Spark processing + SQL access | Lakehouse | Delta tables in /Tables auto-exposed to SQL endpoint |
| Full T-SQL with transactions, star schema | Warehouse | Full DML, stored procs, MERGE, CLONE OF |
| Real-time streaming, IoT, telemetry, KQL analytics | Eventhouse/KQL DB | Auto-indexed, partitioned by time, KQL is native |
| Connect to existing external data without copying | OneLake Shortcut | Symbolic link — no data movement, no egress |
| Mirror existing Azure/external DB to Fabric | Mirroring | Near-real-time replication into OneLake Delta |

Choosing the Right Transform / Ingest Tool

| Scenario | Best tool |
|---|---|
| Large-scale transform, SCD Type 2, ML, complex joins | Spark Notebook |
| Low-code, 100+ connectors, Power Query logic | Dataflow Gen2 |
| Copy data source to destination (no transform) | Pipeline: Copy Data activity |
| Schedule, orchestrate, chain multiple tasks | Data Pipeline |
| T-SQL transform inside Warehouse | T-SQL / Stored Procedure |
| KQL transform in Eventhouse | KQL / Update Policy |

Choosing the Right Streaming Engine

| Scenario | Use |
|---|---|
| Ingest from Event Hubs / IoT Hub / Kafka, route to Eventhouse/Lakehouse | Eventstream |
| Complex event processing with PySpark on a stream | Spark Structured Streaming |
| Query, aggregate, alert on time-series data in KQL DB | KQL + Activator |
| Trigger action when condition met (email, Teams, PA flow) | Fabric Activator |

Native vs. OneLake Shortcuts in Real-Time Intelligence

| Option | When to use | Tradeoff |
|---|---|---|
| Native KQL tables | Best query performance, real-time alerting | Data stored inside Eventhouse |
| OneLake shortcut (standard) | Data lives in OneLake/Lakehouse, query via KQL | No caching; queries go to OneLake every time |
| OneLake shortcut (query-accelerated) | Frequently queried external OneLake data | Cached copy maintained, faster queries |
D2·2 Batch Ingestion & Transformation

Loading Patterns

🔄 Full Load
Truncate-and-reload. Simple but expensive. Use for small/slowly changing dimensions or when change tracking isn't available.
➕ Incremental Load
Load only new/changed records. Use watermark column (timestamp/ID). More complex but efficient for large tables.
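The watermark pattern can be sketched in plain Python (`incremental_load` is a hypothetical helper; real implementations persist the watermark in a control table and filter in the source query):

```python
def incremental_load(source_rows, watermark):
    """Load only rows whose watermark column exceeds the stored high-water
    mark, then advance the mark. Rows here are (id, modified_at) tuples."""
    new_rows = [r for r in source_rows if r[1] > watermark]
    new_watermark = max((r[1] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [(1, 100), (2, 150), (3, 200)]
loaded, wm = incremental_load(rows, watermark=120)
print(loaded, wm)  # [(2, 150), (3, 200)] 200
```

The next run starts from the new watermark, so unchanged rows are never re-read.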

SCD Type 2 with Hash-Based Change Detection

Exam tip: "SCD Type 2 with many columns" → use a hash function on SOURCE TABLE attributes to detect changes efficiently instead of comparing each column.
PySpark — Hash-Based Change Detection
from pyspark.sql import functions as F

# Create hash of all tracked columns
df_source = df_source.withColumn(
    "row_hash",
    F.md5(F.concat_ws("||", F.col("col1"), F.col("col2"), F.col("col3")))
)
# Compare hash with target to detect changes
changed = df_source.join(df_target, "id").filter(
    df_source.row_hash != df_target.row_hash
)

Mirroring

  • Near-real-time replication of external DBs into Fabric OneLake (Delta format)
  • Supported sources: Azure SQL DB, Azure Cosmos DB, Azure SQL MI, Snowflake, and more
  • No pipeline needed — zero-ETL replication
  • Data lands as Delta Parquet in OneLake → queryable via SQL endpoint, Spark, etc.

Data Quality: Handling Common Issues

| Issue | PySpark approach |
|---|---|
| Duplicates | df.dropDuplicates(["id", "date"]) or MERGE with dedup logic |
| Missing values | df.fillna({"col": 0}) or df.dropna(subset=["key_col"]) |
| Late-arriving data | Watermarking in Spark Structured Streaming; or reprocess with incremental load using event_time |
| Schema mismatch | mergeSchema=True for evolution; schema enforcement catches mismatches early |
D2·3 Streaming & Real-Time Intelligence

Eventstream Sources & Destinations

Sources (ingest FROM)

  • Azure Event Hubs, IoT Hub
  • Custom App / Kafka / Confluent Cloud Kafka
  • Azure SQL CDC, PostgreSQL CDC, MySQL CDC, Cosmos DB CDC
  • Google Cloud Pub/Sub, Amazon Kinesis
  • Azure Blob Storage events, Fabric workspace/OneLake events
  • Sample data (built-in)

Destinations (route TO)

  • Eventhouse / KQL Database
  • Lakehouse
  • Custom endpoint
  • Fabric Activator (alerts/actions)
  • Derived Stream (branch the stream)
🔴
"Propagate events from Azure SQL to an Eventstream" → Enable Change Data Capture (CDC) on the database. NOT Extended Events, NOT Stream Analytics, NOT a read-only replica.

Eventhouse / KQL Database

  • Eventhouse = container for KQL databases (creating one auto-provisions a default KQL DB)
  • Data is auto-indexed and partitioned by ingestion time
  • Eventhouses auto-suspend when idle
  • T-SQL endpoint on KQL DB is read-only and extremely limited — use KQL for everything
  • Enable "One Logical Copy" to expose KQL data in OneLake as Delta Parquet

Fabric Activator (formerly Data Activator / Reflex)

No-code event detection engine. Triggers actions (email, Teams message, Power Automate flow, run Pipeline, run Notebook) when conditions are met. Consumes from: eventstreams, Power BI reports, real-time dashboards. Evaluates rules per "business object" (identified by an Object ID column).

Condition types: "Changes by amount/percentage" or "Becomes greater/less than threshold"
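"Becomes greater than" fires only on the transition across the threshold, not on every reading that stays above it. A pure-Python sketch of that rule semantics (illustrative, not Activator's engine; `becomes_greater_than` is a hypothetical name):

```python
def becomes_greater_than(values, threshold):
    """Return the indices where the value crosses from <= threshold to
    > threshold. Fires on the transition only, so a value that stays
    above the threshold does not re-trigger."""
    fired, above = [], False
    for i, v in enumerate(values):
        if v > threshold and not above:
            fired.append(i)
        above = v > threshold
    return fired

print(becomes_greater_than([10, 60, 70, 40, 80], 50))  # [1, 4]
```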

Real-Time Hub

A tenant-wide single logical place for all streaming data in motion (one per tenant, like OneLake). Discovery + management layer — lists all eventstream outputs and KQL tables you have access to.

Windowing Functions (Streaming)

| Window type | Behavior | Use case |
|---|---|---|
| Tumbling | Fixed, non-overlapping intervals | Hourly totals, daily summaries |
| Hopping | Fixed size, overlapping (hop < window size) | Rolling average every 5 min over last 1 hour |
| Sliding | Fires when an event occurs within the window | Detect patterns within any N-minute span |
| Session | Groups events with gaps ≤ timeout | User session analytics |
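The difference between tumbling and hopping assignment can be sketched with plain integer timestamps (hypothetical helpers for illustration, not an Eventstream API):

```python
def tumbling_window(ts, size):
    """Each event belongs to exactly one fixed, non-overlapping bucket."""
    start = (ts // size) * size
    return [(start, start + size)]

def hopping_windows(ts, size, hop):
    """Overlapping windows of length `size` starting every `hop` units;
    one event can belong to several windows."""
    first = ((ts - size) // hop + 1) * hop
    return [(s, s + size)
            for s in range(max(first, 0), ts + 1, hop)
            if s <= ts < s + size]

# Event at t=130: one tumbling bucket, two overlapping hopping windows
print(tumbling_window(130, 60))      # [(120, 180)]
print(hopping_windows(130, 60, 30))  # [(90, 150), (120, 180)]
```

When hop equals size, hopping degenerates into tumbling, which is a common exam distinction.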
D2·4 Delta Lake Deep Dive

The Four Delta Commands You Must Know

SQL — MERGE (upsert pattern)
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET target.name = source.name
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (source.id, source.name)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
-- Note: Fabric enables Low Shuffle Merge by default — no code changes needed
SQL — OPTIMIZE (file compaction)
OPTIMIZE table_name;                           -- bin-packing (compacts small files)
OPTIMIZE table_name ZORDER BY (customer_id);   -- co-locate related data for filter queries
OPTIMIZE table_name VORDER;                    -- apply V-Order encoding
SQL — VACUUM (cleanup old files)
VACUUM table_name RETAIN 168 HOURS;  -- default 7 days = 168 hours
-- Removes data files no longer referenced older than retention period
-- Does NOT remove _delta_log
-- After VACUUM: time travel to vacuumed versions is IMPOSSIBLE
SQL — Time Travel
SELECT * FROM my_table VERSION AS OF 5;
SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15T10:30:00Z';
DESCRIBE HISTORY my_table;
RESTORE TABLE my_table TO VERSION AS OF 3;

-- Warehouse syntax:
SELECT * FROM dbo.dim_customer
OPTION (FOR TIMESTAMP AS OF '2024-05-23T08:00:00');
| Feature | Lakehouse | Warehouse |
|---|---|---|
| File retention / time travel window | 7 days (default) | 30 days (max) |
| VACUUM minimum retention | 7 days (portal enforces this) | N/A |
| OPTIMIZE ZORDER | ✅ Supported | ❌ Not available (compaction is automatic) |
| V-Order default | ✅ Enabled | ✅ Enabled |

V-Order vs. OPTIMIZE — Know the Difference

| | V-Order | OPTIMIZE |
|---|---|---|
| What it does | Optimizes file internal layout (encoding/compression within row groups) | Compacts many small files into fewer large ones |
| When applied | Write time (automatic) | On-demand manual command |
| Write penalty | ~15% slower writes | No write penalty |
| Read benefit | Up to 50% faster reads (especially Direct Lake) | Fewer file reads, better partition pruning |
🔴
V-Order exam traps:
• Write-intensive staging tables read once then dropped → Disable V-Order (write overhead wasted)
• Read-heavy reporting tables → Enable V-Order
• V-Order is enabled by default across all Fabric compute engines (including Spark/Lakehouses and Warehouses).

Schema Enforcement vs. Evolution

PySpark — Schema Evolution
# Schema evolution — opt-in per write operation
df.write.option("mergeSchema", "true").mode("append").saveAsTable("target")

# Complete schema replacement (use rarely)
df.write.option("overwriteSchema", "true").mode("overwrite").saveAsTable("target")

# For MERGE with schema evolution, use Spark SQL:
# spark.sql("MERGE WITH SCHEMA EVOLUTION INTO target USING source ON ...")
🔴
Avoid setting spark.databricks.delta.schema.autoMerge.enabled = true at session level in production — use per-operation mergeSchema instead for safety.
D3·1 Monitoring & Capacity Management
⚠️
Capacity Constraint: Dataflows Gen2 and Eventhouses require an active Fabric capacity (F-SKU) or Premium capacity (P-SKU). They cannot run in a standard Power BI Pro workspace.

Capacity Metrics App — 3 Pages

Compute Page
14-day view. CU utilization %, throttling timeline, per-item CU consumption matrix. Data latency: 10–15 minutes.
Storage Page
30-day view. Billable storage by workspace.
Health Page
(Preview) High-level capacity overview, throttling/rejection identification.

Monitoring Hub

Select "Monitor" in nav pane. Shows active and recently completed jobs (up to 100 per item, 30-day history) across pipelines, notebooks, dataflows, Spark jobs, semantic model refreshes, eventstreams. You only see activities you have permission to view.

CU Consumption Model

| Operation type | Smoothing window | Examples |
|---|---|---|
| Interactive (user-triggered) | 5–64 minutes | Ad-hoc queries, notebook runs triggered by a user |
| Background (scheduled) | 24 hours | Scheduled pipelines, Spark jobs, most warehouse operations |

Throttling Escalation Order — Memorize This

| Stage | Effect |
|---|---|
| 1. No throttling | Normal operation |
| 2. Interactive Delay | 20-second delay added to interactive operations |
| 3. Interactive Rejection | Interactive requests rejected; background continues |
| 4. Background Rejection | Background jobs also rejected |
| 5. Full Rejection | Everything rejected |

Recovery options: Wait for natural burndown, scale up SKU, or pause workloads. Spark capacity: each CU = 2 Spark VCores (F64 = 128 VCores). HTTP 430 error = capacity fully consumed.

Bursting

Temporary use of more CUs than the SKU provides — jobs finish faster but consume more momentarily. Smoothing spreads the accounting so bursts don't immediately throttle.
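The smoothing idea reduces to simple arithmetic: a job's total CU-seconds are spread evenly over the smoothing window, so what counts against the capacity is a small steady rate rather than a spike. An illustrative calculation with assumed numbers (not an official billing formula):

```python
def smoothed_rate(cu_seconds, window_hours):
    """Spread a job's total CU-seconds evenly over the smoothing window,
    returning the steady CU load accounted against the capacity."""
    return cu_seconds / (window_hours * 3600)

# A background Spark job that consumed 864,000 CU-seconds, smoothed over
# the 24-hour background window, adds a steady 10 CU of accounted load.
print(smoothed_rate(864_000, 24))  # 10.0
```

This is why a burst can exceed the SKU momentarily without immediately triggering the throttling stages above.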

D3·2 Troubleshooting & Error Resolution
| Item type | Where to look | Common errors |
|---|---|---|
| Pipeline | Pipeline run history → activity details → Error tab | Source/sink connectivity, schema mismatch, timeout, permission denied |
| Dataflow Gen2 | Monitoring hub → Dataflow → refresh history → error details | Staging lakehouse issues, connector auth, output destination schema mismatch |
| Notebook | Spark job details → executor logs → event timeline | OOM errors (increase executor memory), SparkException, AnalysisException |
| Eventhouse | Diagnostics logs → ingestion errors; .show ingestion failures KQL command | Schema mismatch, mapping errors, bad data format |
| Eventstream | Monitoring metrics tab → error rate; stream health indicators | Source/destination connectivity loss, throughput limits |
| T-SQL | sys.dm_exec_sessions, sys.dm_exec_requests, Query Insights views | Deadlocks, timeouts, unsupported syntax, permission errors |
| OneLake Shortcut | Lakehouse explorer → shortcut → properties; check credential validity | Invalid credentials, source path changed, ADLS firewall blocking |

"Find active running queries in a Warehouse" → use sys.dm_exec_sessions filtered by session status = 'running'
D3·3 Performance Optimization
Optimize a Lakehouse Table
  • Run OPTIMIZE to compact small files (bin-packing)
  • Run OPTIMIZE ZORDER BY (col) to co-locate data for selective filter queries
  • Enable Optimize Write (default ON): auto-coalesces small writes at write time
  • Run VACUUM periodically to remove stale files
  • Use partitioning for very large tables (partition by date for time-series)
  • V-Order is enabled by default; disable for write-only staging tables to save compute.
Optimize a Data Warehouse
  • V-Order enabled by default — benefits Direct Lake and SQL queries
  • Use CREATE TABLE AS CLONE OF for zero-copy snapshots (testing, backups)
  • Use sys.dm_exec_sessions and Query Insights (DMVs) to find slow queries
  • Use views to abstract complex joins and pre-filter data
  • Statistics are auto-maintained — no manual UPDATE STATISTICS needed
Optimize Spark Performance
  • Cache DataFrames used multiple times: df.cache() or df.persist()
  • Use broadcast joins for small lookup tables: broadcast(small_df)
  • Filter early before joins to reduce shuffle size
  • Avoid collect() on large datasets (brings all data to driver)
  • Use columnar formats (Delta/Parquet) — avoid CSV for large Spark jobs
  • Tune spark.sql.shuffle.partitions (default 200 — often too high or too low)
  • Enable Adaptive Query Execution (AQE): spark.sql.adaptive.enabled=true
  • High Concurrency mode: shares Spark cluster across multiple notebooks (reduces startup time)
Optimize Eventstreams & Eventhouses
  • Use batching in eventstream to reduce ingestion overhead for high-volume streams
  • Apply update policies in KQL DB to transform data at ingestion time (instead of at query time)
  • Use materialized views in KQL DB for pre-aggregated data
  • Enable query acceleration on OneLake shortcuts used frequently in KQL
  • Filter in KQL using where as the first operator to leverage auto-indexing
REF KQL Syntax Reference
📌
KQL is pipe-based and read-only. Pattern: Filter early → transform → aggregate → render. render must ALWAYS be the LAST operator.
KQL — Core Operators
// where — filter rows (put FIRST for best performance)
StormEvents | where State == "FLORIDA" | where DamageProperty > 1000000

// project — select columns (REDUCES output; more efficient than extend)
StormEvents | project State, EventType, DamageProperty

// extend — add computed columns (KEEPS all existing columns)
StormEvents | extend TotalInjuries = InjuriesDirect + InjuriesIndirect

// summarize — aggregation (replaces GROUP BY)
StormEvents | summarize EventCount = count(), AvgDamage = avg(DamageProperty) by State
StormEvents | summarize count() by bin(StartTime, 1h)   // time-bucket (1 hour bins)

// join — combine datasets (filter BEFORE join for performance!)
Table1 | where Region == "US"
       | join kind=inner (Table2) on CommonColumn

// sort by — ordering (default is DESCENDING)
StormEvents | sort by DamageProperty desc | take 10

// render — MUST BE LAST
StormEvents | summarize count() by State
            | render barchart with (title="Events by State")

Key KQL Functions

count(), sum(), avg(), dcount(), countif(), sumif(), percentile(), arg_max(), arg_min(), bin(), ago(), prev(), next(), parse_json(), case(), strcat()

Join Kinds

| Kind | Returns |
|---|---|
| inner | Only matching rows from both tables |
| leftouter | All left rows + matching right (NULLs for non-matches) |
| rightouter | All right rows + matching left |
| fullouter | All rows from both sides |
| anti / leftanti | Left rows with NO match in right |
| semi / leftsemi | Left rows that HAVE a match in right (no right columns returned) |
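leftsemi and leftanti are the two kinds most often confused; plain set-membership logic captures their semantics (`kql_join` is a hypothetical illustration, not the KQL engine):

```python
def kql_join(left, right, key, kind):
    """Illustrate KQL semi/anti join semantics on lists of dicts:
    leftsemi keeps left rows WITH a match in right (no right columns
    returned); leftanti keeps left rows WITHOUT a match."""
    right_keys = {r[key] for r in right}
    if kind == "leftsemi":
        return [row for row in left if row[key] in right_keys]
    if kind == "leftanti":
        return [row for row in left if row[key] not in right_keys]
    raise ValueError(f"unsupported kind: {kind}")

left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2}]
print(kql_join(left, right, "id", "leftsemi"))  # [{'id': 2}]
print(kql_join(left, right, "id", "leftanti"))  # [{'id': 1}, {'id': 3}]
```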
🔴
KQL exam traps:
sort by is valid KQL (order by also works but sort by is preferred KQL style)
project is more efficient than extend — use project when you don't need all columns
• Moving a where filter BEFORE a join significantly improves performance
kind=outer join INCREASES dataset size and WORSENS performance — never the "optimization" answer
REF PySpark & T-SQL Snippets
PySpark — Common DataFrame Operations
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Read Delta table
df = spark.read.format("delta").load("Tables/sales")
df = spark.table("lakehouse.sales")

# Common transforms
df = df.filter(F.col("amount") > 100)
df = df.withColumn("year", F.year(F.col("order_date")))
df = df.groupBy("category").agg(F.sum("amount").alias("total"))
df = df.dropDuplicates(["id", "date"])
df = df.fillna({"amount": 0, "category": "Unknown"})

# Window function (rank)
w = Window.partitionBy("region").orderBy(F.desc("sales"))
df = df.withColumn("rank", F.dense_rank().over(w))

# Write managed Delta table (visible in SQL endpoint)
df.write.mode("overwrite").saveAsTable("gold_sales")
df.write.mode("append").saveAsTable("gold_sales")

# Write with schema evolution
df.write.option("mergeSchema", "true").mode("append").saveAsTable("gold_sales")
T-SQL — Warehouse Operations
-- COPY INTO (bulk load)
COPY INTO dbo.fact_sales
FROM 'https://storage.dfs.core.windows.net/container/data/*.parquet'
WITH (FILE_TYPE = 'PARQUET');

-- Zero-copy clone (snapshot for testing)
CREATE TABLE dbo.fact_sales_backup AS CLONE OF dbo.fact_sales;

-- Column-level security
GRANT SELECT ON dbo.customers (CustomerName, Region) TO [analyst@contoso.com];

-- Dynamic data masking
ALTER TABLE dbo.customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Cross-database query (same workspace only)
INSERT INTO WarehouseA.dbo.Table
SELECT * FROM LakehouseB.dbo.Table;

-- Time travel
SELECT * FROM dbo.dim_customer
OPTION (FOR TIMESTAMP AS OF '2024-05-23T08:00:00');
🔴 Exam Traps & Gotchas — Memorize These
🔴
1. Parallel pipeline activities: Activities with no dependency arrows run simultaneously, not sequentially. Max 120 activities per pipeline.
🔴
2. ReadAll bypasses security: Users with ReadAll permission bypass ALL warehouse-level security (RLS, CLS, DDM) because they access raw Parquet. Give only ReadData for SQL-only access.
🔴
3. External tables invisible to SQL endpoint: Unmanaged Spark external tables with custom LOCATION (e.g., /Files) are NOT visible in the SQL analytics endpoint. Use managed tables or shortcuts.
🔴
4. SQL analytics endpoint is READ-ONLY: Only SELECT, CREATE VIEW, CREATE FUNCTION, CREATE SECURITY POLICY are supported. No DML. Sync halts after 15 min of inactivity.
🔴
5. V-Order is the standard: It is enabled by default across all Fabric compute engines (Warehouses and Lakehouses). Manually disable it only for write-heavy, read-rarely staging tables.
🔴
6. After VACUUM, time travel is lost: VACUUM removes files older than retention — those versions can no longer be queried. Lakehouse default = 7 days. Warehouse max = 30 days.
🔴
7. Dataflows Gen2 = append or replace only: No native MERGE/upsert. Staging is enabled by default for all destinations. For SCD Type 2 or upsert logic → use a Spark Notebook with MERGE.
🔴
8. Admin/Member/Contributor bypass RLS: Row-level security only restricts Viewer-level and item-share users. Workspace contributors can see all data regardless of RLS policies.
🔴
9. Cross-database queries = same workspace only: Three-part naming (DB.Schema.Table) works only within the same workspace. Cross-workspace and cross-region queries are NOT supported.
🔴
10. Azure SQL → Eventstream requires CDC: Must enable Change Data Capture on the source database. NOT Extended Events, NOT Stream Analytics, NOT read-only replica.
🔴
11. Git integration with Azure DevOps needs 4 things: Organization + Project + Git repository + Branch. All four required.
🔴
12. Eventhouse deployment = Fabric deployment pipeline: NOT GitHub Actions, NOT Azure DevOps pipeline, NOT ARM templates.
🔴
13. Shortcuts in /Tables = top-level only: Cannot create shortcuts in subdirectories of /Tables. In /Files, any folder level is allowed.
🔴
14. KQL outer join worsens performance: kind=outer increases dataset size. It is NEVER the answer to "optimize a KQL query" questions.
🔴
15. SCD Type 2 hash tip: For tables with many columns, use a hash function on source attributes to detect changes efficiently rather than comparing each column individually.