Open-book strategy: At exam start, immediately open 3 MS Learn tabs: search arg_max() (KQL reference), DENSE_RANK (T-SQL reference), pyspark.sql (PySpark reference). Answer all questions first, then use split-screen for at most 5–10 targeted lookups — searching eats time fast.
🎯
What this exam really tests: Pick the RIGHT Fabric item for each scenario. Three pattern types dominate: (1) Lakehouse vs Warehouse vs Eventhouse, (2) Pipeline vs Dataflow vs Notebook, (3) Write/debug code in T-SQL / PySpark / KQL. Fabric-native ALWAYS beats legacy Azure equivalents.
The 5 "Always Wrong" Answers
| If you see… | The exam answer is… |
| --- | --- |
| Azure Data Factory | Fabric Data Pipeline (same UI, SaaS-native) |
| Synapse Dedicated SQL Pool | Fabric Warehouse |
| Azure Stream Analytics | Eventstream + Eventhouse |
| DAX authoring questions | Not on DP-700 (that's DP-600) |
| ARM templates for deployment | Fabric Deployment Pipeline + Git integration |
D1·1 Architecture & OneLake
💡
OneLake = "OneDrive for data." One per tenant. Zero provisioning. Built on ADLS Gen2 (same APIs/SDKs) but fully SaaS — no storage accounts, no access keys, no resource groups. Every Fabric item writes Delta Parquet to OneLake automatically.
| Shortcut behavior | Detail |
| --- | --- |
| In /Tables | Top-level ONLY — target must contain Delta data to be recognized as a table |
| In /Files | Any folder level |
| Internal security | Passthrough (caller's identity) |
| External security | Delegated (stored credentials — any lakehouse user can read through) |
| Cross-cloud caching | 1–28 day retention; files >1 GB are NOT cached |
| Deleting a shortcut | Removes pointer ONLY — target data untouched |
🔴
Exam trap: "S3 data with egress cost concerns" → shortcut with caching disabled (no local copy). Know the difference between accelerated vs. non-accelerated shortcuts in Real-Time Intelligence (query acceleration adds a cached copy in OneLake).
| Scenario | Use |
| --- | --- |
| Semi-structured data readable by T-SQL, KQL, AND Spark, written by Spark | Lakehouse |
| On-premises SQL Server data to Fabric | Data Pipeline (Copy) |
SQL Analytics Endpoint — Read-Only, Period
Every Lakehouse auto-gets a SQL analytics endpoint. Supports: SELECT, CREATE VIEW, CREATE FUNCTION, CREATE SECURITY POLICY. No INSERT, UPDATE, DELETE, or CREATE TABLE — no DML and no table DDL. Auto-syncs Delta tables from /Tables within ~1 min. Sync halts after 15 min of inactivity.
Note: Lakehouses now support custom schemas (e.g., beyond just `dbo`). Tables organized into custom schemas via Spark will successfully sync to the SQL analytics endpoint.
🔴
Exam trap: Unmanaged external tables created via Spark with a custom LOCATION pointing to /Files are NOT visible in the SQL analytics endpoint. Only managed Delta tables in /Tables appear. Use OneLake shortcuts to bridge the gap.
Managed vs. External Tables
| | Managed | External (Unmanaged) |
| --- | --- | --- |
| Spark manages | Data + metadata | Metadata only |
| Storage | /Tables (default) | User-specified (e.g., /Files/bronze/) |
| On DROP TABLE | Data + metadata deleted | Only metadata removed — files persist |
| SQL analytics endpoint | ✅ Visible | ❌ Not visible |
| Direct Lake (Power BI) | ✅ Works | ❌ Does not work |
Fabric Warehouse Key Facts
Supported:
- Full T-SQL DML/DDL: INSERT, UPDATE, DELETE, MERGE
- Stored procedures, COPY INTO, CTAS
- CREATE TABLE AS CLONE OF (zero-copy, 30-day time travel)
Critical: Admin/Member/Contributor have implicit OneLake Write access that overrides OneLake security roles and RLS/CLS. Viewers can query the SQL analytics endpoint but CANNOT access data via Spark, Lakehouse explorer, or OneLake APIs.
Item-Level Permissions
| Permission | Allows | Watch out |
| --- | --- | --- |
| Read | Connect to the item | |
| ReadData | Query data via T-SQL (SQL endpoint) | Respects RLS/CLS/DDM |
| ReadAll | Access raw OneLake files directly | BYPASSES all SQL-level security! |
| Write | Full administrative access | |
| Share | Manage permissions for others | |
🔴
ReadAll bypasses RLS/CLS/DDM. When implementing SQL security, ensure users have ONLY ReadData, not ReadAll. ReadAll gives access to raw Parquet files in OneLake, bypassing all warehouse-level security.
Row-Level Security (RLS)
T-SQL — RLS Setup
-- Step 1: Create filter function
CREATE FUNCTION Security.fn_filter(@SalesRep varchar(50))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT 1 AS result
WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'manager@contoso.com';

-- Step 2: Create security policy
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE Security.fn_filter(SalesPersonColumn)
ON dbo.Sales WITH (STATE = ON);
Works in both Warehouse and Lakehouse SQL analytics endpoint. Uses USER_NAME() or SUSER_SNAME(). Admin/Member/Contributor bypass RLS — it only restricts Viewer-level and shared-access users.
Dynamic Data Masking (DDM)
| Mask type | Output example | Use case |
| --- | --- | --- |
| default() | XXXX / 0 / 01-01-1900 | Generic sensitive values |
| email() | aXX@XXXX.com | Email addresses |
| partial(2,"XXX",2) | 55XXXXX89 | Phone numbers, IDs |
| random(1,100) | 42 (random number) | Numeric values |
⚠️
Admin/Member/Contributor see unmasked data. Use GRANT UNMASK to selectively reveal. DDM alone doesn't protect against inference attacks — pair with RLS/CLS.
OneLake Security (Data Access Roles)
Fine-grained deny-by-default access at storage layer for Lakehouses
Create roles granting Read (or ReadWrite) to specific tables/folders, assign Entra users/groups
Only affects Viewer-level users — Admin/Member/Contributor override it
Every Lakehouse has a DefaultReader role — can be modified or deleted
Microsoft Purview Integration
📖 Unified Data Catalog
Auto-catalogs all Fabric assets for discovery
🔗 Data Lineage
Traces data flow across items automatically
🏷️ Sensitivity Labels
Propagate downstream — label source, derived items inherit. Requires Power BI Pro + AIP license
Don't use a Notebook or Dataflow for simple copy operations (overkill — use the Copy activity)
🔴
Key rule: Pipelines are the orchestrator (the "glue"). They call Notebooks, Dataflows, Stored Procedures, and Copy activities. They never transform data themselves. • "Orchestrate/schedule" → Pipeline • "Transform 50M rows" → Notebook • "Ingest from SharePoint/API with light cleanup" → Dataflow Gen2
Dataflows Gen2 Key Facts
Gen2 vs Gen1
Gen2 outputs to multiple destinations (Lakehouse, Warehouse, Azure SQL DB, KQL DB) — Gen1 only wrote to internal storage
Gen2 supports Git integration; Gen1 does not
Gen2 requires Fabric capacity (no Pro workspace)
Output modes & Staging
Only append or replace — no native upsert/MERGE
For SCD Type 2 → use a Notebook
Staging: Enabled by default for ALL destinations, including Lakehouses. Best practice: manually disable for simple copy-to-lakehouse to avoid double-write penalty.
Pipeline Activities Cheat Sheet
| Activity | Purpose |
| --- | --- |
| Copy Data | Moves data source→destination, no transformation |
| Notebook | Runs a Spark notebook |
| Dataflow | Triggers a Dataflow Gen2 |
| Stored Procedure | Runs an SP in supported SQL databases |
| ForEach | Iterates over a collection |
| If Condition / Switch | Conditional branching |
| Lookup | Reads a value from a source; output feeds downstream activities |
| Set Variable / Append Variable | Manage pipeline variables |
| Invoke Pipeline | Calls another pipeline (supports cross-workspace) |
| Web | Calls a REST endpoint |
| Get Metadata | Retrieves metadata from data sources |
| Fail | Forces pipeline failure with a custom error message |
🔴
Activities with no dependency arrows run SIMULTANEOUSLY (not sequentially). Max 120 activities per pipeline.
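This fan-out behavior can be sketched in plain Python. The toy scheduler below (all names hypothetical, not a Fabric API) groups activities into "waves" by their dependency arrows; everything within a wave has no unmet upstream dependency and therefore runs simultaneously:

```python
def execution_waves(deps):
    """Group activities into waves; activities in the same wave
    have no unmet dependency arrows and run simultaneously."""
    remaining, done, waves = dict(deps), set(), []
    while remaining:
        # Ready = every upstream dependency already completed
        ready = sorted(a for a, upstream in remaining.items() if upstream <= done)
        if not ready:
            raise ValueError("cycle in dependency arrows")
        waves.append(ready)
        done.update(ready)
        for a in ready:
            del remaining[a]
    return waves

# 'copy' and 'notebook' have no inbound arrows -> same wave; 'merge' waits for both
waves = execution_waves({"copy": set(), "notebook": set(), "merge": {"copy", "notebook"}})
# → [['copy', 'notebook'], ['merge']]
```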
D1·4 Lifecycle Management & Deployment
Git Integration
Supports Azure DevOps and GitHub
For Azure DevOps you need: organization, project, Git repository, and branch
Only one Git repo branch per workspace at a time
Not all items support Git (e.g., dashboards, some real-time items)
Deployment Pipelines
Fabric-native CI/CD: Dev → Test → Prod stages
Use Fabric Deployment Pipeline (not GitHub Actions, not ADF, not ARM templates)
Choosing the Right Tool
| Task | Use |
| --- | --- |
| Large-scale transform, SCD Type 2, ML, complex joins | Spark Notebook |
| Low-code, 100+ connectors, Power Query logic | Dataflow Gen2 |
| Copy data source to destination (no transform) | Pipeline: Copy Data activity |
| Schedule, orchestrate, chain multiple tasks | Data Pipeline |
| T-SQL transform inside Warehouse | T-SQL / Stored Procedure |
| KQL transform in Eventhouse | KQL / Update Policy |
Choosing the Right Streaming Engine
| Scenario | Use |
| --- | --- |
| Ingest from Event Hubs / IoT Hub / Kafka, route to Eventhouse/Lakehouse | Eventstream |
| Complex event processing with PySpark on a stream | Spark Structured Streaming |
| Query, aggregate, alert on time-series data in KQL DB | KQL + Activator |
| Trigger action when condition met (email, Teams, PA flow) | Fabric Activator |
Native vs. OneLake Shortcuts in Real-Time Intelligence
| Option | When to use | Tradeoff |
| --- | --- | --- |
| Native KQL tables | Best query performance, real-time alerting | Data stored inside Eventhouse |
| OneLake shortcut (standard) | Data lives in OneLake/Lakehouse, query via KQL | No caching; queries go to OneLake every time |
| OneLake shortcut (query-accelerated) | Frequently queried external OneLake data | Cached copy maintained, faster queries |
D2·2 Batch Ingestion & Transformation
Loading Patterns
🔄 Full Load
Truncate-and-reload. Simple but expensive. Use for small/slowly changing dimensions or when change tracking isn't available.
➕ Incremental Load
Load only new/changed records. Use watermark column (timestamp/ID). More complex but efficient for large tables.
SCD Type 2 with Hash-Based Change Detection
✅
Exam tip: "SCD Type 2 with many columns" → use a hash function on SOURCE TABLE attributes to detect changes efficiently instead of comparing each column.
PySpark — Hash-Based Change Detection
from pyspark.sql import functions as F
# Create hash of all tracked columns
df_source = df_source.withColumn(
"row_hash",
F.md5(F.concat_ws("||", F.col("col1"), F.col("col2"), F.col("col3")))
)
# Compare hash with target to detect changes
changed = df_source.join(df_target, "id").filter(
df_source.row_hash != df_target.row_hash
)
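The same idea in plain Python, runnable without Spark (hashlib stands in for `F.md5`/`F.concat_ws`; the sample rows are hypothetical): one hash comparison replaces comparing every tracked column individually.

```python
import hashlib

def row_hash(*cols):
    """md5 over '||'-joined values — same idea as F.md5(F.concat_ws("||", ...)) above."""
    return hashlib.md5("||".join(str(c) for c in cols).encode("utf-8")).hexdigest()

source = {"id": 1, "name": "Ada", "city": "Boston"}
target = {"id": 1, "name": "Ada", "city": "Seattle"}

# One hash comparison detects that some tracked attribute changed
changed = row_hash(source["name"], source["city"]) != row_hash(target["name"], target["city"])
```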
Mirroring
Near-real-time replication of external DBs into Fabric OneLake (Delta format)
Supported sources: Azure SQL DB, Azure Cosmos DB, Azure SQL MI, Snowflake, and more
No pipeline needed — zero-ETL replication
Data lands as Delta Parquet in OneLake → queryable via SQL endpoint, Spark, etc.
Data Quality: Handling Common Issues
| Issue | PySpark approach |
| --- | --- |
| Duplicates | `df.dropDuplicates(["id", "date"])` or MERGE with dedup logic |
| Missing values | `df.fillna({"col": 0})` or `df.dropna(subset=["key_col"])` |
| Late-arriving data | Watermarking in Spark Structured Streaming; or reprocess with incremental load using event_time |
| Schema mismatch | `mergeSchema=True` for evolution; schema enforcement catches mismatches early |
D2·3 Streaming & Real-Time Intelligence
Eventstream Sources & Destinations
Sources (ingest FROM)
Azure Event Hubs, IoT Hub
Custom App / Kafka / Confluent Cloud Kafka
Azure SQL CDC, PostgreSQL CDC, MySQL CDC, Cosmos DB CDC
"Propagate events from Azure SQL to an Eventstream" → Enable Change Data Capture (CDC) on the database. NOT Extended Events, NOT Stream Analytics, NOT a read-only replica.
Eventhouse / KQL Database
Eventhouse = container for KQL databases (creating one auto-provisions a default KQL DB)
Data is auto-indexed and partitioned by ingestion time
Eventhouses auto-suspend when idle
T-SQL endpoint on KQL DB is read-only and extremely limited — use KQL for everything
Enable "One Logical Copy" to expose KQL data in OneLake as Delta Parquet
Fabric Activator (formerly Data Activator / Reflex)
No-code event detection engine. Triggers actions (email, Teams message, Power Automate flow, run Pipeline, run Notebook) when conditions are met. Consumes from: eventstreams, Power BI reports, real-time dashboards. Evaluates rules per "business object" (identified by an Object ID column).
Condition types: "Changes by amount/percentage" or "Becomes greater/less than threshold"
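A "becomes greater than" rule fires on the upward crossing, not continuously while the value stays above the threshold. A minimal Python sketch of that semantics (an illustration, not Activator's actual implementation):

```python
def becomes_greater_than(values, threshold):
    """Fire only on an upward crossing of the threshold,
    not while the metric merely stays above it."""
    alerts, prev = [], None
    for v in values:
        if prev is not None and prev <= threshold < v:
            alerts.append(v)
        prev = v
    return alerts

alerts = becomes_greater_than([10, 20, 35, 40, 25, 50], threshold=30)
# fires at 35 (first crossing) and 50 (re-crossing after dropping to 25), not at 40
```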
Real-Time Hub
A tenant-wide single logical place for all streaming data in motion (one per tenant, like OneLake). Discovery + management layer — lists all eventstream outputs and KQL tables you have access to.
Windowing Functions (Streaming)
| Window type | Behavior | Use case |
| --- | --- | --- |
| Tumbling | Fixed, non-overlapping intervals | Hourly totals, daily summaries |
| Hopping | Fixed size, overlapping (hop < window size) | Rolling average every 5 min over last 1 hour |
| Sliding | Fires when an event occurs within the window | Detect patterns within any N-minute span |
| Session | Groups events with gaps ≤ timeout | User session analytics |
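Tumbling windows are the simplest to reason about: each event lands in exactly one fixed bin. A small Python sketch over epoch-second timestamps (hypothetical helper, same bucketing idea as KQL's `bin(StartTime, 1h)`):

```python
from collections import Counter

def tumbling_counts(epoch_seconds, window_s=3600):
    """Each event falls in exactly one fixed, non-overlapping bin."""
    counts = Counter()
    for t in epoch_seconds:
        counts[(t // window_s) * window_s] += 1  # floor to the window start
    return dict(counts)

counts = tumbling_counts([0, 100, 3599, 3600, 7300])
# → {0: 3, 3600: 1, 7200: 1}: three events in the first hour, one in each of the next two
```

Hopping windows would instead assign each event to every window it overlaps, which is why their output rows can exceed the input count.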
D2·4 Delta Lake Deep Dive
The Four Delta Commands You Must Know
SQL — MERGE (upsert pattern)
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET target.name = source.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (source.id, source.name)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
-- Note: Fabric enables Low Shuffle Merge by default — no code changes needed
SQL — OPTIMIZE (file compaction)
OPTIMIZE table_name;                          -- bin-packing (compacts small files)
OPTIMIZE table_name ZORDER BY (customer_id);  -- co-locate related data for filter queries
OPTIMIZE table_name VORDER;                   -- apply V-Order encoding
SQL — VACUUM (cleanup old files)
VACUUM table_name RETAIN 168 HOURS; -- default 7 days = 168 hours
-- Removes unreferenced data files older than the retention period
-- Does NOT remove _delta_log
-- After VACUUM: time travel to vacuumed versions is IMPOSSIBLE
SQL — Time Travel
SELECT * FROM my_table VERSION AS OF 5;
SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15T10:30:00Z';
DESCRIBE HISTORY my_table;
RESTORE TABLE my_table TO VERSION AS OF 3;
-- Warehouse syntax:
SELECT * FROM dbo.dim_customer
OPTION (FOR TIMESTAMP AS OF '2024-05-23T08:00:00');
| Feature | Lakehouse | Warehouse |
| --- | --- | --- |
| File retention / time travel window | 7 days (default) | 30 days (max) |
| VACUUM minimum retention | 7 days (portal enforces this) | N/A |
| OPTIMIZE ZORDER | ✅ | ✅ |
| V-Order default | ✅ Enabled | ✅ Enabled |
V-Order vs. OPTIMIZE — Know the Difference
| | V-Order | OPTIMIZE |
| --- | --- | --- |
| What it does | Optimizes file internal layout (encoding/compression within row groups) | Compacts many small files into fewer large ones |
| When applied | Write time (automatic) | On-demand manual command |
| Write penalty | ~15% slower writes | No write penalty |
| Read benefit | Up to 50% faster reads (especially Direct Lake) | Fewer file reads, better partition pruning |
🔴
V-Order exam traps:
• Write-intensive staging tables read once then dropped → Disable V-Order (write overhead wasted)
• Read-heavy reporting tables → Enable V-Order
• V-Order is enabled by default across all Fabric compute engines (including Spark/Lakehouses and Warehouses).
Schema Enforcement vs. Evolution
PySpark — Schema Evolution
# Schema evolution — opt-in per write operation
df.write.option("mergeSchema", "true").mode("append").saveAsTable("target")
# Complete schema replacement (use rarely)
df.write.option("overwriteSchema", "true").mode("overwrite").saveAsTable("target")
# For MERGE with schema evolution (Spark SQL):
# MERGE WITH SCHEMA EVOLUTION INTO target USING source ON ...
🔴
Avoid setting spark.databricks.delta.schema.autoMerge.enabled = true at session level in production — use per-operation mergeSchema instead for safety.
D3·1 Monitoring & Capacity Management
⚠️
Capacity Constraint: Dataflows Gen2 and Eventhouses require an active Fabric capacity (F-SKU) or Premium capacity (P-SKU). They cannot run in a standard Power BI Pro workspace.
Capacity Metrics App — 3 Pages
Compute Page
14-day view. CU utilization %, throttling timeline, per-item CU consumption matrix. Data latency: 10–15 minutes.
Monitoring Hub
Select "Monitor" in the nav pane. Shows active and recently completed jobs (up to 100 per item, 30-day history) across pipelines, notebooks, dataflows, Spark jobs, semantic model refreshes, and eventstreams. You only see activities you have permission to view.
CU Consumption Model
| Operation type | Smoothing window | Examples |
| --- | --- | --- |
| Interactive (user-triggered) | 5–64 minutes | Ad-hoc queries, notebook runs triggered by a user |
| Background (scheduled) | 24 hours | Scheduled pipelines, Spark jobs, most warehouse operations |
Recovery options: Wait for natural burndown, scale up SKU, or pause workloads. Spark capacity: each CU = 2 Spark VCores (F64 = 128 VCores). HTTP 430 error = capacity fully consumed.
Bursting
Temporary use of more CUs than the SKU provides — jobs finish faster but consume more momentarily. Smoothing spreads the accounting so bursts don't immediately throttle.
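The smoothing arithmetic is worth internalizing. A minimal sketch (hypothetical helper; numbers chosen for round results): spread a job's total CU-seconds evenly across its smoothing window to get the rate charged against capacity.

```python
def smoothed_cu_rate(total_cu_seconds, window_minutes):
    """Average CU draw per second when usage is spread over the smoothing window."""
    return total_cu_seconds / (window_minutes * 60)

# A background job that burned 86,400 CU-seconds, smoothed over 24 hours:
rate = smoothed_cu_rate(86_400, window_minutes=24 * 60)  # 1.0 CU/s against capacity

# Spark VCore rule of thumb from above: each CU = 2 Spark VCores, so F64:
vcores = 64 * 2  # 128 VCores
```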
D3·2 Troubleshooting & Error Resolution
| Item type | Where to look |
| --- | --- |
| Pipeline | Pipeline run history → activity details → Error tab |

High Concurrency mode: shares a Spark cluster across multiple notebooks (reduces startup time)
Optimize Eventstreams & Eventhouses ▶
Use batching in eventstream to reduce ingestion overhead for high-volume streams
Apply update policies in KQL DB to transform data at ingestion time (instead of at query time)
Use materialized views in KQL DB for pre-aggregated data
Enable query acceleration on OneLake shortcuts used frequently in KQL
Filter in KQL using where as the first operator to leverage auto-indexing
REF KQL Syntax Reference
📌
KQL is pipe-based and read-only. Pattern: Filter early → transform → aggregate → render. render must ALWAYS be the LAST operator.
KQL — Core Operators
// where — filter rows (put FIRST for best performance)
StormEvents | where State == "FLORIDA" | where DamageProperty > 1000000
// project — select columns (REDUCES output; more efficient than extend)
StormEvents | project State, EventType, DamageProperty
// extend — add computed columns (KEEPS all existing columns)
StormEvents | extend TotalInjuries = InjuriesDirect + InjuriesIndirect
// summarize — aggregation (replaces GROUP BY)
StormEvents | summarize EventCount = count(), AvgDamage = avg(DamageProperty) by State
StormEvents | summarize count() by bin(StartTime, 1h) // time-bucket (1-hour bins)
// join — combine datasets (filter BEFORE join for performance!)
Table1 | where Region == "US"
| join kind=inner (Table2) on CommonColumn
// sort by — ordering (default is DESCENDING)
StormEvents | sort by DamageProperty desc | take 10
// render — MUST BE LAST
StormEvents | summarize count() by State
| render barchart with (title="Events by State")
KQL Join Kinds
| Join kind | Returns |
| --- | --- |
| leftouter | All left rows + matching right (NULLs for non-matches) |
| rightouter | All right rows + matching left |
| fullouter | All rows from both sides |
| anti / leftanti | Left rows with NO match in right |
| semi / leftsemi | Left rows that HAVE a match in right (no right columns returned) |
🔴
KQL exam traps:
• sort by is valid KQL (order by also works but sort by is preferred KQL style)
• project is more efficient than extend — use project when you don't need all columns
• Moving a where filter BEFORE a join significantly improves performance
• kind=outer join INCREASES dataset size and WORSENS performance — never the "optimization" answer
REF PySpark & T-SQL Snippets
PySpark — Common DataFrame Operations
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Read Delta table
df = spark.read.format("delta").load("Tables/sales")
df = spark.table("lakehouse.sales")
# Common transforms
df = df.filter(F.col("amount") > 100)
df = df.withColumn("year", F.year(F.col("order_date")))
df = df.groupBy("category").agg(F.sum("amount").alias("total"))
df = df.dropDuplicates(["id", "date"])
df = df.fillna({"amount": 0, "category": "Unknown"})
# Window function (rank)
w = Window.partitionBy("region").orderBy(F.desc("sales"))
df = df.withColumn("rank", F.dense_rank().over(w))
# Write managed Delta table (visible in SQL endpoint)
df.write.mode("overwrite").saveAsTable("gold_sales")
df.write.mode("append").saveAsTable("gold_sales")
# Write with schema evolution
df.write.option("mergeSchema", "true").mode("append").saveAsTable("gold_sales")
T-SQL — Warehouse Operations
-- COPY INTO (bulk load)
COPY INTO dbo.fact_sales
FROM 'https://storage.dfs.core.windows.net/container/data/*.parquet'
WITH (FILE_TYPE = 'PARQUET');

-- Zero-copy clone (snapshot for testing)
CREATE TABLE dbo.fact_sales_backup AS CLONE OF dbo.fact_sales;

-- Column-level security
GRANT SELECT ON dbo.customers (CustomerName, Region) TO [analyst@contoso.com];

-- Dynamic data masking
ALTER TABLE dbo.customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Cross-database query (same workspace only)
INSERT INTO WarehouseA.dbo.Table
SELECT * FROM LakehouseB.dbo.Table;

-- Time travel
SELECT * FROM dbo.dim_customer
OPTION (FOR TIMESTAMP AS OF '2024-05-23T08:00:00');
🔴 Exam Traps & Gotchas — Memorize These
🔴
1. Parallel pipeline activities: Activities with no dependency arrows run simultaneously, not sequentially. Max 120 activities per pipeline.
🔴
2. ReadAll bypasses security: Users with ReadAll permission bypass ALL warehouse-level security (RLS, CLS, DDM) because they access raw Parquet. Give only ReadData for SQL-only access.
🔴
3. External tables invisible to SQL endpoint: Unmanaged Spark external tables with custom LOCATION (e.g., /Files) are NOT visible in the SQL analytics endpoint. Use managed tables or shortcuts.
🔴
4. SQL analytics endpoint is READ-ONLY: Only SELECT, CREATE VIEW, CREATE FUNCTION, CREATE SECURITY POLICY are supported. No DML. Sync halts after 15 min of inactivity.
🔴
5. V-Order is the standard: It is enabled by default across all Fabric compute engines (Warehouses and Lakehouses). Manually disable it only for write-heavy, read-rarely staging tables.
🔴
6. After VACUUM, time travel is lost: VACUUM removes files older than retention — those versions can no longer be queried. Lakehouse default = 7 days. Warehouse max = 30 days.
🔴
7. Dataflows Gen2 = append or replace only: No native MERGE/upsert. Staging is enabled by default for all destinations. For SCD Type 2 or upsert logic → use a Spark Notebook with MERGE.
🔴
8. Admin/Member/Contributor bypass RLS: Row-level security only restricts Viewer-level and item-share users. Workspace contributors can see all data regardless of RLS policies.
🔴
9. Cross-database queries = same workspace only: Three-part naming (DB.Schema.Table) works only within the same workspace. Cross-workspace and cross-region queries are NOT supported.
🔴
10. Azure SQL → Eventstream requires CDC: Must enable Change Data Capture on the source database. NOT Extended Events, NOT Stream Analytics, NOT read-only replica.
🔴
11. Git integration with Azure DevOps needs 4 things: Organization + Project + Git repository + Branch. All four required.
🔴
12. Eventhouse deployment = Fabric deployment pipeline: NOT GitHub Actions, NOT Azure DevOps pipeline, NOT ARM templates.
🔴
13. Shortcuts in /Tables = top-level only: Cannot create shortcuts in subdirectories of /Tables. In /Files, any folder level is allowed.
🔴
14. KQL outer join worsens performance: kind=outer increases dataset size. It is NEVER the answer to "optimize a KQL query" questions.
🔴
15. SCD Type 2 hash tip: For tables with many columns, use a hash function on source attributes to detect changes efficiently rather than comparing each column individually.