DP-700 Exam Overview

  • Passing score: 700 / 1000
  • ~55 questions (estimated)
  • 3 equally weighted domains
  • Open-book exam (MS Learn access)
Domain 1 (30–35%): Implement & Manage an Analytics Solution
Workspace settings (Spark, domain, OneLake, Dataflows Gen2), lifecycle management (Git, deployment pipelines, database projects), security & governance (RLS, CLS, DDM, sensitivity labels, audit logs), orchestration patterns.

Domain 2 (30–35%): Ingest & Transform Data
Full/incremental loads, dimensional model prep, shortcuts, mirroring, pipelines, PySpark/SQL/KQL transforms, streaming (Eventstreams, Spark Structured Streaming), windowing functions.

Domain 3 (30–35%): Monitor & Optimize an Analytics Solution
Monitor ingestion/transformation/semantic model refresh, configure alerts, resolve pipeline/Dataflow/notebook/KQL/Eventstream/T-SQL errors, optimize Lakehouse/pipeline/warehouse/Spark/query performance.
📖
Open-book strategy: At exam start, immediately open 3 MS Learn tabs: search arg_max() (KQL reference), DENSE_RANK (T-SQL reference), pyspark.sql (PySpark reference). Use split-screen for at most 5–10 lookups — it eats time fast. Answer all questions first.
🎯
What this exam really tests: Pick the RIGHT Fabric item for each scenario. Three pattern types dominate: (1) Lakehouse vs Warehouse vs Eventhouse, (2) Pipeline vs Dataflow vs Notebook, (3) Write/debug code in T-SQL / PySpark / KQL. Fabric-native ALWAYS beats legacy Azure equivalents.

The 5 "Always Wrong" Answers

| If you see… | The exam answer is… |
|---|---|
| Azure Data Factory | Fabric Data Pipeline (same UI, SaaS-native) |
| Synapse Dedicated SQL Pool | Fabric Warehouse |
| Azure Stream Analytics | Eventstream + Eventhouse |
| DAX authoring questions | Not on DP-700 (that's DP-600) |
| ARM templates for deployment | Fabric Deployment Pipeline + Git integration |
D1·1 Architecture & OneLake
💡
OneLake = "OneDrive for data." One per tenant. Zero provisioning. Built on ADLS Gen2 (same APIs/SDKs) but fully SaaS — no storage accounts, no access keys, no resource groups. Every Fabric item writes Delta Parquet to OneLake automatically.

Hierarchy — Memorize This

| Level | Description | Key fact |
|---|---|---|
| Tenant | One OneLake per org | Single Azure AD tenant |
| Capacity | Compute unit (CUs) | Runs in ONE Azure region; defines SKU (F2–F2048) |
| Workspace | Primary security boundary | Linked to exactly one capacity |
| Item | Lakehouse, Warehouse, Pipeline, Notebook, Eventhouse… | All store data in OneLake as Delta Parquet |
| Sub-item | /Tables, /Files folders | Only /Tables (Delta) visible in SQL endpoint & Direct Lake |

OneLake Shortcuts — The Exam Loves These

Shortcuts are symbolic links that point to data elsewhere without copying it.

| Property | Detail |
|---|---|
| Where creatable | Lakehouses and KQL databases ONLY |
| Internal sources | Other lakehouses, warehouses, KQL databases, mirrored databases |
| External sources | ADLS Gen2, Amazon S3, Google Cloud Storage, Dataverse, Azure Blob Storage, S3-compatible |
| In /Tables | Top-level ONLY — target must contain Delta data to be recognized as a table |
| In /Files | Any folder level |
| Internal security | Passthrough (caller's identity) |
| External security | Delegated (stored credentials — any lakehouse user can read through) |
| Cross-cloud caching | 1–28 day retention; files >1 GB are NOT cached |
| Deleting a shortcut | Removes pointer ONLY — target data untouched |
🔴
Exam trap: "S3 data with egress cost concerns" → shortcut with caching disabled (no local copy). Know the difference between accelerated vs. non-accelerated shortcuts in Real-Time Intelligence (query acceleration adds a cached copy in OneLake).

Lakehouse vs. Warehouse vs. Eventhouse

| Signal in the question | Answer |
|---|---|
| Spark, PySpark, notebooks, unstructured data, ML, medallion architecture | Lakehouse |
| T-SQL, stored procedures, multi-table transactions, star schema, dimensional model | Warehouse |
| Streaming, telemetry, IoT, logs, time-series, sub-second latency, KQL | Eventhouse / KQL DB |
| Semi-structured data readable by T-SQL, KQL, AND Spark, written by Spark | Lakehouse |
| On-premises SQL Server data to Fabric | Data Pipeline (Copy) |

SQL Analytics Endpoint — Read-Only, Period

Every Lakehouse auto-gets a SQL analytics endpoint. Supports: SELECT, CREATE VIEW, CREATE FUNCTION, CREATE SECURITY POLICY. No INSERT, UPDATE, DELETE, CREATE TABLE, or any DML. Auto-syncs Delta tables from /Tables within ~1 min. Sync halts after 15 min of inactivity.

Note: Lakehouses now support custom schemas (beyond just `dbo`). Tables organized into custom schemas via Spark sync successfully to the SQL analytics endpoint.

🔴
Exam trap: Unmanaged external tables created via Spark with a custom LOCATION pointing to /Files are NOT visible in the SQL analytics endpoint. Only managed Delta tables in /Tables appear. Use OneLake shortcuts to bridge the gap.

Managed vs. External Tables

| | Managed | External (Unmanaged) |
|---|---|---|
| Spark manages | Data + metadata | Metadata only |
| Storage | /Tables (default) | User-specified (e.g., /Files/bronze/) |
| On DROP TABLE | Data + metadata deleted | Only metadata removed — files persist |
| SQL analytics endpoint | ✅ Visible | ❌ Not visible |
| Direct Lake (Power BI) | ✅ Works | ❌ Does not work |

Fabric Warehouse Key Facts

Supported

  • Full T-SQL DML/DDL: INSERT, UPDATE, DELETE, MERGE
  • Stored procedures, COPY INTO, CTAS
  • CREATE TABLE AS CLONE OF (zero-copy, 30-day time travel)
  • Three-part cross-database queries (same workspace)
  • IDENTITY property for generating surrogate keys

NOT Supported

  • Triggers, user-created indexes, cursors
  • Recursive CTEs, materialized views
  • @@IDENTITY / SCOPE_IDENTITY(), FOR XML
  • DBCC commands, CREATE USER
  • Named procedure parameters
  • Spark access (Warehouse is T-SQL only)
⚠️
Data type migration gotchas: money → decimal(19,4), nvarchar → varchar, tinyint → smallint, datetimeoffset → datetime2 (time zone data lost)
D1·2 Security & Governance

Workspace Roles Permission Matrix

| Capability | Admin | Member | Contributor | Viewer |
|---|---|---|---|---|
| Delete workspace, manage Git integration | ✅ | ❌ | ❌ | ❌ |
| Add/remove users (lower roles) | ✅ | ✅ | ❌ | ❌ |
| Create/modify/execute items (notebooks, pipelines) | ✅ | ✅ | ✅ | ❌ |
| Read data via T-SQL / SQL endpoint | ✅ | ✅ | ✅ | ✅ |
| Read data via Spark / OneLake APIs | ✅ | ✅ | ✅ | ❌ |
| Share items with others | ✅ | ✅ | ❌ | ❌ |
🔴
Critical: Admin/Member/Contributor have implicit OneLake Write access that overrides OneLake security roles and RLS/CLS. Viewers can query the SQL analytics endpoint but CANNOT access data via Spark, Lakehouse explorer, or OneLake APIs.

Item-Level Permissions

| Permission | Allows | Watch out |
|---|---|---|
| Read | Connect to the item | |
| ReadData | Query data via T-SQL (SQL endpoint) | Respects RLS/CLS/DDM |
| ReadAll | Access raw OneLake files directly | BYPASSES all SQL-level security! |
| Write | Full administrative access | |
| Share | Manage permissions for others | |
🔴
ReadAll bypasses RLS/CLS/DDM. When implementing SQL security, ensure users have ONLY ReadData, not ReadAll. ReadAll gives access to raw Parquet files in OneLake, bypassing all warehouse-level security.

Row-Level Security (RLS)

T-SQL — RLS Setup
-- Step 1: Create filter function
CREATE FUNCTION Security.fn_filter(@SalesRep varchar(50))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN SELECT 1 AS result
WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'manager@contoso.com';

-- Step 2: Create security policy
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE Security.fn_filter(SalesPersonColumn)
ON dbo.Sales WITH (STATE = ON);

Works in both Warehouse and Lakehouse SQL analytics endpoint. Uses USER_NAME() or SUSER_SNAME(). Admin/Member/Contributor bypass RLS — it only restricts Viewer-level and shared-access users.

Dynamic Data Masking (DDM)

| Mask type | Output example | Use case |
|---|---|---|
| default() | XXXX / 0 / 01-01-1900 | Generic sensitive values |
| email() | aXX@XXXX.com | Email addresses |
| partial(2,"XXX",2) | 55XXX89 | Phone numbers, IDs |
| random(1,100) | 42 (random number) | Numeric values |
⚠️
Admin/Member/Contributor see unmasked data. Use GRANT UNMASK to selectively reveal. DDM alone doesn't protect against inference attacks — pair with RLS/CLS.
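DDM's partial() mask keeps a fixed prefix and suffix and substitutes the literal padding string for everything in between. A minimal pure-Python sketch of those semantics (illustrative only, not the SQL engine's implementation; `partial_mask` is a hypothetical helper name):

```python
def partial_mask(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Emulate DDM partial(prefix, padding, suffix): keep the first `prefix`
    and last `suffix` characters, replace the middle with the literal
    `padding` string."""
    if len(value) < prefix + suffix:
        # Too short to expose both ends: mask the whole value
        return padding
    tail = value[len(value) - suffix:] if suffix > 0 else ""
    return value[:prefix] + padding + tail

print(partial_mask("5551234589", 2, "XXX", 2))  # 55XXX89
```

Note the padding is used verbatim; the number of X characters in the output does not depend on the input's length.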

OneLake Security (Data Access Roles)

  • Fine-grained deny-by-default access at storage layer for Lakehouses
  • Create roles granting Read (or ReadWrite) to specific tables/folders, assign Entra users/groups
  • Only affects Viewer-level users — Admin/Member/Contributor override it
  • Every Lakehouse has a DefaultReader role — can be modified or deleted

Microsoft Purview Integration

📖 Unified Data Catalog
Auto-catalogs all Fabric assets for discovery
🔗 Data Lineage
Traces data flow across items automatically
🏷️ Sensitivity Labels
Propagate downstream — label source, derived items inherit. Requires Power BI Pro + AIP license
🛡️ DLP Policies
Detect sensitive data in OneLake
📋 Audit Logs
Activity logging via Fabric audit log (Admin portal → Audit logs)
D1·3 Orchestration: Pipeline vs Dataflow vs Notebook
| Tool | Use when | Do NOT use when |
|---|---|---|
| Dataflows Gen2 | Low-code Power Query transforms, 100+ connectors, light ingestion + reshaping | Complex orchestration, massive datasets, code-heavy logic |
| Data Pipelines | Orchestrating end-to-end workflows, scheduling, chaining, conditional logic, error handling | Transforming data (pipelines have no native transformation) |
| Spark Notebooks | Complex transforms, large-scale data, SCD Type 2 logic, ML, schema inference | Simple copy operations (overkill — use Copy activity) |
🔴
Key rule: Pipelines are the orchestrator (the "glue"). They call Notebooks, Dataflows, Stored Procedures, and Copy activities. They never transform data themselves.
• "Orchestrate/schedule" → Pipeline  • "Transform 50M rows" → Notebook  • "Ingest from SharePoint/API with light cleanup" → Dataflow Gen2

Dataflows Gen2 Key Facts

Gen2 vs Gen1

  • Gen2 outputs to multiple destinations (Lakehouse, Warehouse, Azure SQL DB, KQL DB) — Gen1 only wrote to internal storage
  • Gen2 supports Git integration; Gen1 does not
  • Gen2 requires Fabric capacity (no Pro workspace)

Output modes & Staging

  • Only append or replace — no native upsert/MERGE
  • For SCD Type 2 → use a Notebook
  • Staging: Enabled by default for ALL destinations, including Lakehouses. Best practice: manually disable for simple copy-to-lakehouse to avoid double-write penalty.

Pipeline Activities Cheat Sheet

| Activity | Purpose |
|---|---|
| Copy Data | Moves data source→destination, no transformation |
| Notebook | Runs a Spark notebook |
| Dataflow | Triggers a Dataflow Gen2 |
| Stored Procedure | Runs an SP in supported SQL databases |
| ForEach | Iterates over a collection |
| If Condition / Switch | Conditional branching |
| Lookup | Reads a value from a source; output feeds downstream activities |
| Set Variable / Append Variable | Manage pipeline variables |
| Invoke Pipeline | Calls another pipeline (supports cross-workspace) |
| Web | Calls a REST endpoint |
| Get Metadata | Retrieves metadata from data sources |
| Fail | Forces pipeline failure with a custom error message |
🔴
Activities with no dependency arrows run SIMULTANEOUSLY (not sequentially). Max 120 activities per pipeline.
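The parallel-by-default rule can be pictured as wave-based scheduling: every activity whose dependency arrows are all satisfied runs concurrently in the current wave. A pure-Python sketch (`execution_waves` is a hypothetical illustration, not Fabric's scheduler):

```python
def execution_waves(activities, deps):
    """Group pipeline activities into waves. `deps` maps an activity to the
    set of upstream activities it depends on (the dependency arrows).
    Everything within a wave runs simultaneously."""
    remaining = set(activities)
    done, waves = set(), []
    while remaining:
        wave = sorted(a for a in remaining if deps.get(a, set()) <= done)
        if not wave:
            raise ValueError("cyclic dependencies")
        waves.append(wave)
        done |= set(wave)
        remaining -= set(wave)
    return waves

# CopyA and CopyB have no arrows -> run simultaneously; Notebook waits on both
print(execution_waves(["CopyA", "CopyB", "Notebook"],
                      {"Notebook": {"CopyA", "CopyB"}}))
# [['CopyA', 'CopyB'], ['Notebook']]
```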
D1·4 Lifecycle Management & Deployment

Git Integration

  • Supports Azure DevOps and GitHub
  • For Azure DevOps you need: organization, project, Git repository, and branch
  • Only one Git repo branch per workspace at a time
  • Not all items support Git (e.g., dashboards, some real-time items)

Deployment Pipelines

  • Fabric-native CI/CD: Dev → Test → Prod stages
  • Use Fabric Deployment Pipeline (not GitHub Actions, not ADF, not ARM templates)
  • Deploy eventhouse, lakehouse, warehouse, notebooks, pipelines
  • Pre/post-deployment rules can parameterize connections per stage
🔴
"Deploy an eventhouse to production" → Use a Fabric deployment pipeline, NOT GitHub Actions, NOT Azure DevOps pipeline, NOT ARM templates.

Workspace Settings (Exam Targets)

| Setting type | What you configure |
|---|---|
| Spark settings | Spark compute pools, default runtime version, environment defaults, high concurrency mode |
| OneLake settings | OneLake security roles, data access controls at folder/table level |
| Domain settings | Assign workspace to a domain for governance grouping in Purview |
| Dataflows Gen2 settings | Enable/disable staging, configure default destination |
D2·1 The Big Decision Matrices
🧠
These decision trees appear in nearly every question set. Memorize them cold.

Choosing the Right Data Store

| Need | Use | Why |
|---|---|---|
| File storage + Spark processing + SQL access | Lakehouse | Delta tables in /Tables auto-exposed to SQL endpoint |
| Full T-SQL with transactions, star schema | Warehouse | Full DML, stored procs, MERGE, CLONE OF |
| Real-time streaming, IoT, telemetry, KQL analytics | Eventhouse/KQL DB | Auto-indexed, partitioned by time, KQL is native |
| Connect to existing external data without copying | OneLake Shortcut | Symbolic link — no data movement, no egress |
| Mirror existing Azure/external DB to Fabric | Mirroring | Near-real-time replication into OneLake Delta |

Choosing the Right Transform / Ingest Tool

| Scenario | Best tool |
|---|---|
| Large-scale transform, SCD Type 2, ML, complex joins | Spark Notebook |
| Low-code, 100+ connectors, Power Query logic | Dataflow Gen2 |
| Copy data source to destination (no transform) | Pipeline: Copy Data activity |
| Schedule, orchestrate, chain multiple tasks | Data Pipeline |
| T-SQL transform inside Warehouse | T-SQL / Stored Procedure |
| KQL transform in Eventhouse | KQL / Update Policy |

Choosing the Right Streaming Engine

| Scenario | Use |
|---|---|
| Ingest from Event Hubs / IoT Hub / Kafka, route to Eventhouse/Lakehouse | Eventstream |
| Complex event processing with PySpark on a stream | Spark Structured Streaming |
| Query, aggregate, alert on time-series data in KQL DB | KQL + Activator |
| Trigger action when condition met (email, Teams, PA flow) | Fabric Activator |

Native vs. OneLake Shortcuts in Real-Time Intelligence

| Option | When to use | Tradeoff |
|---|---|---|
| Native KQL tables | Best query performance, real-time alerting | Data stored inside Eventhouse |
| OneLake shortcut (standard) | Data lives in OneLake/Lakehouse, query via KQL | No caching; queries go to OneLake every time |
| OneLake shortcut (query-accelerated) | Frequently queried external OneLake data | Cached copy maintained, faster queries |
D2·2 Batch Ingestion & Transformation

Loading Patterns

🔄 Full Load
Truncate-and-reload. Simple but expensive. Use for small/slowly changing dimensions or when change tracking isn't available.
➕ Incremental Load
Load only new/changed records. Use watermark column (timestamp/ID). More complex but efficient for large tables.
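The watermark pattern can be sketched in plain Python (`incremental_load` is a hypothetical helper; real implementations persist the watermark in a control table and filter in the source query):

```python
def incremental_load(source_rows, watermark):
    """Load only rows whose watermark column exceeds the stored high-water
    mark, then advance the mark. Rows here are (id, modified_at) tuples."""
    new_rows = [r for r in source_rows if r[1] > watermark]
    new_watermark = max((r[1] for r in new_rows), default=watermark)
    return new_rows, new_watermark

rows = [(1, 100), (2, 150), (3, 200)]
loaded, wm = incremental_load(rows, watermark=120)
print(loaded, wm)  # [(2, 150), (3, 200)] 200
```

The next run starts from the new watermark, so unchanged rows are never re-read.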

SCD Type 2 with Hash-Based Change Detection

Exam tip: "SCD Type 2 with many columns" → use a hash function on SOURCE TABLE attributes to detect changes efficiently instead of comparing each column.
PySpark — Hash-Based Change Detection
from pyspark.sql import functions as F

# Create hash of all tracked columns
df_source = df_source.withColumn(
    "row_hash",
    F.md5(F.concat_ws("||", F.col("col1"), F.col("col2"), F.col("col3")))
)
# Compare hash with target to detect changes
changed = df_source.join(df_target, "id").filter(
    df_source.row_hash != df_target.row_hash
)

Mirroring

  • Near-real-time replication of external DBs into Fabric OneLake (Delta format)
  • Supported sources: Azure SQL DB, Azure Cosmos DB, Azure SQL MI, Snowflake, and more
  • No pipeline needed — zero-ETL replication
  • Data lands as Delta Parquet in OneLake → queryable via SQL endpoint, Spark, etc.

Data Quality: Handling Common Issues

| Issue | PySpark approach |
|---|---|
| Duplicates | df.dropDuplicates(["id", "date"]) or MERGE with dedup logic |
| Missing values | df.fillna({"col": 0}) or df.dropna(subset=["key_col"]) |
| Late-arriving data | Watermarking in Spark Structured Streaming; or reprocess with incremental load using event_time |
| Schema mismatch | mergeSchema=True for evolution; schema enforcement catches mismatches early |
D2·3 Streaming & Real-Time Intelligence

Eventstream Sources & Destinations

Sources (ingest FROM)

  • Azure Event Hubs, IoT Hub
  • Custom App / Kafka / Confluent Cloud Kafka
  • Azure SQL CDC, PostgreSQL CDC, MySQL CDC, Cosmos DB CDC
  • Google Cloud Pub/Sub, Amazon Kinesis
  • Azure Blob Storage events, Fabric workspace/OneLake events
  • Sample data (built-in)

Destinations (route TO)

  • Eventhouse / KQL Database
  • Lakehouse
  • Custom endpoint
  • Fabric Activator (alerts/actions)
  • Derived Stream (branch the stream)
🔴
"Propagate events from Azure SQL to an Eventstream" → Enable Change Data Capture (CDC) on the database. NOT Extended Events, NOT Stream Analytics, NOT a read-only replica.

Eventhouse / KQL Database

  • Eventhouse = container for KQL databases (creating one auto-provisions a default KQL DB)
  • Data is auto-indexed and partitioned by ingestion time
  • Eventhouses auto-suspend when idle
  • T-SQL endpoint on KQL DB is read-only and extremely limited — use KQL for everything
  • Enable "One Logical Copy" to expose KQL data in OneLake as Delta Parquet

Fabric Activator (formerly Data Activator / Reflex)

No-code event detection engine. Triggers actions (email, Teams message, Power Automate flow, run Pipeline, run Notebook) when conditions are met. Consumes from: eventstreams, Power BI reports, real-time dashboards. Evaluates rules per "business object" (identified by an Object ID column).

Condition types: "Changes by amount/percentage" or "Becomes greater/less than threshold"
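"Becomes greater than" fires only on the transition across the threshold, not on every reading that stays above it. A pure-Python sketch of that rule semantics (illustrative, not Activator's engine; `becomes_greater_than` is a hypothetical name):

```python
def becomes_greater_than(values, threshold):
    """Return the indices where the value crosses from <= threshold to
    > threshold. Fires on the transition only, so a value that stays
    above the threshold does not re-trigger."""
    fired, above = [], False
    for i, v in enumerate(values):
        if v > threshold and not above:
            fired.append(i)
        above = v > threshold
    return fired

print(becomes_greater_than([10, 60, 70, 40, 80], 50))  # [1, 4]
```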

Real-Time Hub

A tenant-wide single logical place for all streaming data in motion (one per tenant, like OneLake). Discovery + management layer — lists all eventstream outputs and KQL tables you have access to.

Windowing Functions (Streaming)

| Window type | Behavior | Use case |
|---|---|---|
| Tumbling | Fixed, non-overlapping intervals | Hourly totals, daily summaries |
| Hopping | Fixed size, overlapping (hop < window size) | Rolling average every 5 min over last 1 hour |
| Sliding | Fires when an event occurs within the window | Detect patterns within any N-minute span |
| Session | Groups events with gaps ≤ timeout | User session analytics |
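The difference between tumbling and hopping assignment can be sketched with plain integer timestamps (hypothetical helpers for illustration, not an Eventstream API):

```python
def tumbling_window(ts, size):
    """Each event belongs to exactly one fixed, non-overlapping bucket."""
    start = (ts // size) * size
    return [(start, start + size)]

def hopping_windows(ts, size, hop):
    """Overlapping windows of length `size` starting every `hop` units;
    one event can belong to several windows."""
    first = ((ts - size) // hop + 1) * hop
    return [(s, s + size)
            for s in range(max(first, 0), ts + 1, hop)
            if s <= ts < s + size]

# Event at t=130: one tumbling bucket, two overlapping hopping windows
print(tumbling_window(130, 60))      # [(120, 180)]
print(hopping_windows(130, 60, 30))  # [(90, 150), (120, 180)]
```

When hop equals size, hopping degenerates into tumbling, which is a common exam distinction.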
D2·4 Delta Lake Deep Dive

The Four Delta Commands You Must Know

SQL — MERGE (upsert pattern)
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET target.name = source.name
WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (source.id, source.name)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
-- Note: Fabric enables Low Shuffle Merge by default — no code changes needed
SQL — OPTIMIZE (file compaction)
OPTIMIZE table_name;                           -- bin-packing (compacts small files)
OPTIMIZE table_name ZORDER BY (customer_id);   -- co-locate related data for filter queries
OPTIMIZE table_name VORDER;                    -- apply V-Order encoding
SQL — VACUUM (cleanup old files)
VACUUM table_name RETAIN 168 HOURS;  -- default 7 days = 168 hours
-- Removes data files no longer referenced older than retention period
-- Does NOT remove _delta_log
-- After VACUUM: time travel to vacuumed versions is IMPOSSIBLE
SQL — Time Travel
SELECT * FROM my_table VERSION AS OF 5;
SELECT * FROM my_table TIMESTAMP AS OF '2024-01-15T10:30:00Z';
DESCRIBE HISTORY my_table;
RESTORE TABLE my_table TO VERSION AS OF 3;

-- Warehouse syntax:
SELECT * FROM dbo.dim_customer
OPTION (FOR TIMESTAMP AS OF '2024-05-23T08:00:00');
| Feature | Lakehouse | Warehouse |
|---|---|---|
| File retention / time travel window | 7 days (default) | 30 days (max) |
| VACUUM minimum retention | 7 days (portal enforces this) | N/A |
| OPTIMIZE ZORDER | ✅ Supported | ❌ Not available (compaction is automatic) |
| V-Order default | ✅ Enabled | ✅ Enabled |

V-Order vs. OPTIMIZE — Know the Difference

| | V-Order | OPTIMIZE |
|---|---|---|
| What it does | Optimizes file internal layout (encoding/compression within row groups) | Compacts many small files into fewer large ones |
| When applied | Write time (automatic) | On-demand manual command |
| Write penalty | ~15% slower writes | No write penalty |
| Read benefit | Up to 50% faster reads (especially Direct Lake) | Fewer file reads, better partition pruning |
🔴
V-Order exam traps:
• Write-intensive staging tables read once then dropped → Disable V-Order (write overhead wasted)
• Read-heavy reporting tables → Enable V-Order
• V-Order is enabled by default across all Fabric compute engines (including Spark/Lakehouses and Warehouses).

Schema Enforcement vs. Evolution

PySpark — Schema Evolution
# Schema evolution — opt-in per write operation
df.write.option("mergeSchema", "true").mode("append").saveAsTable("target")

# Complete schema replacement (use rarely)
df.write.option("overwriteSchema", "true").mode("overwrite").saveAsTable("target")

# For MERGE with schema evolution, use Spark SQL:
# spark.sql("MERGE WITH SCHEMA EVOLUTION INTO target USING source ON ...")
🔴
Avoid setting spark.databricks.delta.schema.autoMerge.enabled = true at session level in production — use per-operation mergeSchema instead for safety.
D3·1 Monitoring & Capacity Management
⚠️
Capacity Constraint: Dataflows Gen2 and Eventhouses require an active Fabric capacity (F-SKU) or Premium capacity (P-SKU). They cannot run in a standard Power BI Pro workspace.

Capacity Metrics App — 3 Pages

Compute Page
14-day view. CU utilization %, throttling timeline, per-item CU consumption matrix. Data latency: 10–15 minutes.
Storage Page
30-day view. Billable storage by workspace.
Health Page
(Preview) High-level capacity overview, throttling/rejection identification.

Monitoring Hub

Select "Monitor" in nav pane. Shows active and recently completed jobs (up to 100 per item, 30-day history) across pipelines, notebooks, dataflows, Spark jobs, semantic model refreshes, eventstreams. You only see activities you have permission to view.

CU Consumption Model

| Operation type | Smoothing window | Examples |
|---|---|---|
| Interactive (user-triggered) | 5–64 minutes | Ad-hoc queries, notebook runs triggered by a user |
| Background (scheduled) | 24 hours | Scheduled pipelines, Spark jobs, most warehouse operations |

Throttling Escalation Order — Memorize This

| Stage | Effect |
|---|---|
| 1. No throttling | Normal operation |
| 2. Interactive Delay | 20-second delay added to interactive operations |
| 3. Interactive Rejection | Interactive requests rejected; background continues |
| 4. Background Rejection | Background jobs also rejected |
| 5. Full Rejection | Everything rejected |

Recovery options: Wait for natural burndown, scale up SKU, or pause workloads. Spark capacity: each CU = 2 Spark VCores (F64 = 128 VCores). HTTP 430 error = capacity fully consumed.

Bursting

Temporary use of more CUs than the SKU provides — jobs finish faster but consume more momentarily. Smoothing spreads the accounting so bursts don't immediately throttle.
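The smoothing idea reduces to simple arithmetic: a job's total CU-seconds are spread evenly over the smoothing window, so what counts against the capacity is a small steady rate rather than a spike. An illustrative calculation with assumed numbers (not an official billing formula):

```python
def smoothed_rate(cu_seconds, window_hours):
    """Spread a job's total CU-seconds evenly over the smoothing window,
    returning the steady CU load accounted against the capacity."""
    return cu_seconds / (window_hours * 3600)

# A background Spark job that consumed 864,000 CU-seconds, smoothed over
# the 24-hour background window, adds a steady 10 CU of accounted load.
print(smoothed_rate(864_000, 24))  # 10.0
```

This is why a burst can exceed the SKU momentarily without immediately triggering the throttling stages above.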

D3·2 Troubleshooting & Error Resolution
| Item type | Where to look | Common errors |
|---|---|---|
| Pipeline | Pipeline run history → activity details → Error tab | Source/sink connectivity, schema mismatch, timeout, permission denied |
| Dataflow Gen2 | Monitoring hub → Dataflow → refresh history → error details | Staging lakehouse issues, connector auth, output destination schema mismatch |
| Notebook | Spark job details → executor logs → event timeline | OOM errors (increase executor memory), SparkException, AnalysisException |
| Eventhouse | Diagnostics logs → ingestion errors; .show ingestion failures KQL command | Schema mismatch, mapping errors, bad data format |
| Eventstream | Monitoring metrics tab → error rate; stream health indicators | Source/destination connectivity loss, throughput limits |
| T-SQL | sys.dm_exec_sessions, sys.dm_exec_requests, Query Insights views | Deadlocks, timeouts, unsupported syntax, permission errors |
| OneLake Shortcut | Lakehouse explorer → shortcut → properties; check credential validity | Invalid credentials, source path changed, ADLS firewall blocking |

"Find active running queries in a Warehouse" → use sys.dm_exec_sessions filtered by session status = 'running'
D3·3 Performance Optimization
Optimize a Lakehouse Table
  • Run OPTIMIZE to compact small files (bin-packing)
  • Run OPTIMIZE ZORDER BY (col) to co-locate data for selective filter queries
  • Enable Optimize Write (default ON): auto-coalesces small writes at write time
  • Run VACUUM periodically to remove stale files
  • Use partitioning for very large tables (partition by date for time-series)
  • V-Order is enabled by default; disable for write-only staging tables to save compute.
Optimize a Data Warehouse
  • V-Order enabled by default — benefits Direct Lake and SQL queries
  • Use CREATE TABLE AS CLONE OF for zero-copy snapshots (testing, backups)
  • Use sys.dm_exec_sessions and Query Insights (DMVs) to find slow queries
  • Use views to abstract complex joins and pre-filter data
  • Statistics are auto-maintained — no manual UPDATE STATISTICS needed
Optimize Spark Performance
  • Cache DataFrames used multiple times: df.cache() or df.persist()
  • Use broadcast joins for small lookup tables: broadcast(small_df)
  • Filter early before joins to reduce shuffle size
  • Avoid collect() on large datasets (brings all data to driver)
  • Use columnar formats (Delta/Parquet) — avoid CSV for large Spark jobs
  • Tune spark.sql.shuffle.partitions (default 200 — often too high or too low)
  • Enable Adaptive Query Execution (AQE): spark.sql.adaptive.enabled=true
  • High Concurrency mode: shares Spark cluster across multiple notebooks (reduces startup time)
Optimize Eventstreams & Eventhouses
  • Use batching in eventstream to reduce ingestion overhead for high-volume streams
  • Apply update policies in KQL DB to transform data at ingestion time (instead of at query time)
  • Use materialized views in KQL DB for pre-aggregated data
  • Enable query acceleration on OneLake shortcuts used frequently in KQL
  • Filter in KQL using where as the first operator to leverage auto-indexing
REF KQL Syntax Reference
📌
KQL is pipe-based and read-only. Pattern: Filter early → transform → aggregate → render. render must ALWAYS be the LAST operator.
KQL — Core Operators
// where — filter rows (put FIRST for best performance)
StormEvents | where State == "FLORIDA" | where DamageProperty > 1000000

// project — select columns (REDUCES output; more efficient than extend)
StormEvents | project State, EventType, DamageProperty

// extend — add computed columns (KEEPS all existing columns)
StormEvents | extend TotalInjuries = InjuriesDirect + InjuriesIndirect

// summarize — aggregation (replaces GROUP BY)
StormEvents | summarize EventCount = count(), AvgDamage = avg(DamageProperty) by State
StormEvents | summarize count() by bin(StartTime, 1h)   // time-bucket (1 hour bins)

// join — combine datasets (filter BEFORE join for performance!)
Table1 | where Region == "US"
       | join kind=inner (Table2) on CommonColumn

// sort by — ordering (default is DESCENDING)
StormEvents | sort by DamageProperty desc | take 10

// render — MUST BE LAST
StormEvents | summarize count() by State
            | render barchart with (title="Events by State")

Key KQL Functions

count(), sum(), avg(), dcount(), countif(), sumif(), percentile(), arg_max(), arg_min(), bin(), ago(), prev(), next(), parse_json(), case(), strcat()

Join Kinds

| Kind | Returns |
|---|---|
| inner | Only matching rows from both tables |
| leftouter | All left rows + matching right (NULLs for non-matches) |
| rightouter | All right rows + matching left |
| fullouter | All rows from both sides |
| anti / leftanti | Left rows with NO match in right |
| semi / leftsemi | Left rows that HAVE a match in right (no right columns returned) |
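leftsemi and leftanti are the two kinds most often confused; plain set-membership logic captures their semantics (`kql_join` is a hypothetical illustration, not the KQL engine):

```python
def kql_join(left, right, key, kind):
    """Illustrate KQL semi/anti join semantics on lists of dicts:
    leftsemi keeps left rows WITH a match in right (no right columns
    returned); leftanti keeps left rows WITHOUT a match."""
    right_keys = {r[key] for r in right}
    if kind == "leftsemi":
        return [row for row in left if row[key] in right_keys]
    if kind == "leftanti":
        return [row for row in left if row[key] not in right_keys]
    raise ValueError(f"unsupported kind: {kind}")

left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2}]
print(kql_join(left, right, "id", "leftsemi"))  # [{'id': 2}]
print(kql_join(left, right, "id", "leftanti"))  # [{'id': 1}, {'id': 3}]
```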
🔴
KQL exam traps:
sort by is valid KQL (order by also works but sort by is preferred KQL style)
project is more efficient than extend — use project when you don't need all columns
• Moving a where filter BEFORE a join significantly improves performance
kind=outer join INCREASES dataset size and WORSENS performance — never the "optimization" answer
REF PySpark & T-SQL Snippets
PySpark — Common DataFrame Operations
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Read Delta table
df = spark.read.format("delta").load("Tables/sales")
df = spark.table("lakehouse.sales")

# Common transforms
df = df.filter(F.col("amount") > 100)
df = df.withColumn("year", F.year(F.col("order_date")))
df = df.groupBy("category").agg(F.sum("amount").alias("total"))
df = df.dropDuplicates(["id", "date"])
df = df.fillna({"amount": 0, "category": "Unknown"})

# Window function (rank)
w = Window.partitionBy("region").orderBy(F.desc("sales"))
df = df.withColumn("rank", F.dense_rank().over(w))

# Write managed Delta table (visible in SQL endpoint)
df.write.mode("overwrite").saveAsTable("gold_sales")
df.write.mode("append").saveAsTable("gold_sales")

# Write with schema evolution
df.write.option("mergeSchema", "true").mode("append").saveAsTable("gold_sales")
T-SQL — Warehouse Operations
-- COPY INTO (bulk load)
COPY INTO dbo.fact_sales
FROM 'https://storage.dfs.core.windows.net/container/data/*.parquet'
WITH (FILE_TYPE = 'PARQUET');

-- Zero-copy clone (snapshot for testing)
CREATE TABLE dbo.fact_sales_backup AS CLONE OF dbo.fact_sales;

-- Column-level security
GRANT SELECT ON dbo.customers (CustomerName, Region) TO [analyst@contoso.com];

-- Dynamic data masking
ALTER TABLE dbo.customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Cross-database query (same workspace only)
INSERT INTO WarehouseA.dbo.Table
SELECT * FROM LakehouseB.dbo.Table;

-- Time travel
SELECT * FROM dbo.dim_customer
OPTION (FOR TIMESTAMP AS OF '2024-05-23T08:00:00');
🔴 Exam Traps & Gotchas — Memorize These
🔴
1. Parallel pipeline activities: Activities with no dependency arrows run simultaneously, not sequentially. Max 120 activities per pipeline.
🔴
2. ReadAll bypasses security: Users with ReadAll permission bypass ALL warehouse-level security (RLS, CLS, DDM) because they access raw Parquet. Give only ReadData for SQL-only access.
🔴
3. External tables invisible to SQL endpoint: Unmanaged Spark external tables with custom LOCATION (e.g., /Files) are NOT visible in the SQL analytics endpoint. Use managed tables or shortcuts.
🔴
4. SQL analytics endpoint is READ-ONLY: Only SELECT, CREATE VIEW, CREATE FUNCTION, CREATE SECURITY POLICY are supported. No DML. Sync halts after 15 min of inactivity.
🔴
5. V-Order is the standard: It is enabled by default across all Fabric compute engines (Warehouses and Lakehouses). Manually disable it only for write-heavy, read-rarely staging tables.
🔴
6. After VACUUM, time travel is lost: VACUUM removes files older than retention — those versions can no longer be queried. Lakehouse default = 7 days. Warehouse max = 30 days.
🔴
7. Dataflows Gen2 = append or replace only: No native MERGE/upsert. Staging is enabled by default for all destinations. For SCD Type 2 or upsert logic → use a Spark Notebook with MERGE.
🔴
8. Admin/Member/Contributor bypass RLS: Row-level security only restricts Viewer-level and item-share users. Workspace contributors can see all data regardless of RLS policies.
🔴
9. Cross-database queries = same workspace only: Three-part naming (DB.Schema.Table) works only within the same workspace. Cross-workspace and cross-region queries are NOT supported.
🔴
10. Azure SQL → Eventstream requires CDC: Must enable Change Data Capture on the source database. NOT Extended Events, NOT Stream Analytics, NOT read-only replica.
🔴
11. Git integration with Azure DevOps needs 4 things: Organization + Project + Git repository + Branch. All four required.
🔴
12. Eventhouse deployment = Fabric deployment pipeline: NOT GitHub Actions, NOT Azure DevOps pipeline, NOT ARM templates.
🔴
13. Shortcuts in /Tables = top-level only: Cannot create shortcuts in subdirectories of /Tables. In /Files, any folder level is allowed.
🔴
14. KQL outer join worsens performance: kind=outer increases dataset size. It is NEVER the answer to "optimize a KQL query" questions.
🔴
15. SCD Type 2 hash tip: For tables with many columns, use a hash function on source attributes to detect changes efficiently rather than comparing each column individually.