04 — Quick Reference Cheatsheet

📁 ← Back to Home

⚡ Key Numbers to Remember

Item	Value	Notes
Passing score	700 / 1000	Scaled score — not 70%
Exam duration	100 minutes	120 min seat time total
Domains	3	Equally weighted (30–35% each)
VACUUM default retention	7 days (168 hours)	Shorter retention breaks time travel
Delta checkpoint interval	Every 10 versions	Transaction log compaction
Deployment pipeline stages	Up to 10	Typically Dev → Test → Prod
Broadcast join limit	~100 MB	Larger tables cause OOM
Eventhouse hot cache default	Unlimited (policy-configurable)	Set explicitly for cost control
OneLake shortcut sources	5 types	OneLake, ADLS Gen2, S3, GCS, Dataverse

🧭 Master Decision Trees

“Where Should I Store This Data?”

flowchart TD
    START([Data Storage Decision]) --> Q1{Primary query\nengine?}
    Q1 -->|"T-SQL read+write\nstored procs"| WH["🏢 Warehouse"]
    Q1 -->|"Spark + SQL\nmixed workloads"| LH["🏠 Lakehouse"]
    Q1 -->|"KQL\nreal-time analytics"| EH["⚡ Eventhouse"]
    Q1 -->|"Don't copy\njust reference"| SC["🔗 Shortcut"]
    Q1 -->|"Near real-time\nreplica needed"| MR["🪞 Mirroring"]

“How Should I Transform This Data?”

flowchart TD
    START([Transform Decision]) --> Q1{Who is doing it?}
    Q1 -->|"Analyst\nno-code"| DF["🔄 Dataflow Gen2\n(Power Query M)"]
    Q1 -->|"Data engineer\ncode-first"| Q2{Scale?}
    Q2 -->|"Large scale\ncomplex logic"| NB["📓 Notebook\n(PySpark)"]
    Q2 -->|"SQL preferred"| Q3{Where's the data?}
    Q3 -->|"Warehouse"| TSQL["T-SQL in Warehouse"]
    Q3 -->|"Lakehouse"| SPARKSQL["Spark SQL in Notebook"]
    Q1 -->|"Real-time\nstreaming"| Q4{Style?}
    Q4 -->|"No-code routing"| ES["⚡ Eventstream"]
    Q4 -->|"Code, joins with batch"| SSS["📓 Spark Structured Streaming"]
    Q4 -->|"KQL analytics"| KQL["KQL in Eventhouse"]

“How Should I Orchestrate This?”

flowchart TD
    START([Orchestration Decision]) --> Q1{Single transform\nor multi-step?}
    Q1 -->|"Single transform"| Q2{Type?}
    Q2 -->|"No-code"| DF["🔄 Dataflow Gen2"]
    Q2 -->|"Code"| NB["📓 Notebook"]
    Q1 -->|"Multi-step workflow"| PIPE["🔧 Pipeline"]
    PIPE --> ACT["Activities:\n• Copy → Notebook → Dataflow\n• ForEach / If / Until\n• Parameters & expressions"]

🔒 Security Layer Reference

graph TD
    T["🏢 Tenant Level\n(Admin portal)"] --> W["📁 Workspace Level\n(Admin, Member, Contributor, Viewer)"]
    W --> I["📄 Item Level\n(Read, ReadAll, ReadData, Write, Reshare)"]
    I --> D["📊 Data Level"]
    D --> RLS["🔐 RLS — Row-Level Security\n(T-SQL security policy / DAX filter)"]
    D --> CLS["🔐 CLS — Column-Level Security\n(T-SQL GRANT/DENY on columns)"]
    D --> OLS["🔐 OLS — Object-Level Security\n(Semantic Model roles)"]
    D --> DDM["🎭 DDM — Dynamic Data Masking\n(Mask functions on columns)"]
    D --> FLS["📁 Folder/File-Level\n(OneLake data access roles)"]

Security Layer	Where Supported	Method
Workspace roles	All items	Admin, Member, Contributor, Viewer
Item permissions	All items	Share dialog / API
RLS	Warehouse, SQL Endpoint, Semantic Model	T-SQL security policy / DAX
CLS	Warehouse, SQL Endpoint	T-SQL GRANT/DENY
OLS	Semantic Model	Model role with `none` permission
DDM	Warehouse	T-SQL masked column
Folder/file	Lakehouse	OneLake data access roles

🔄 Lakehouse vs Warehouse — Quick Compare

	Lakehouse	Warehouse
Engine	Spark (primary) + SQL Endpoint (read-only)	T-SQL (read/write)
Format	Delta Lake	Delta Lake
Write via SQL	❌ No (SQL Endpoint is read-only)	✅ Yes
Stored procs	❌ No	✅ Yes
Files section	✅ Yes (unstructured data)	❌ No
Cross-DB queries	Via shortcuts	✅ Native
Best persona	Data engineer, data scientist	SQL analyst, BI developer

📦 Delta Lake Operations Cheatsheet

Operation	Purpose	Command
OPTIMIZE	Compact small files	`OPTIMIZE table_name`
Z-ORDER	Co-locate data by columns	`OPTIMIZE table_name ZORDER BY (col)`
VACUUM	Remove old files	`VACUUM table_name RETAIN 168 HOURS`
V-Order	Columnar read optimization	Enabled by default in Fabric
MERGE	Upsert (insert/update)	Delta `MERGE INTO ... WHEN MATCHED ... WHEN NOT MATCHED`
Time Travel	Query past versions	`SELECT * FROM table VERSION AS OF 5`
Schema Evolution	Add columns on write	`.option("mergeSchema", "true")`
Change Data Feed	Stream changes from Delta	`.option("readChangeFeed", "true")`

⚡ Spark Performance Cheatsheet

Problem	Solution
Slow join (small + large table)	Broadcast join — `broadcast(small_df)`
Data skew in join/group	Salt keys or rely on AQE (auto)
Too many small output files	`coalesce(n)` before write
Too many shuffle partitions	AQE auto-coalesces; or set `spark.sql.shuffle.partitions`
Slow scan on filtered column	Z-ORDER on that column
General read perf	V-Order (default) + OPTIMIZE
Repeated intermediate use	`df.cache()` or `df.persist()`
Slow UDFs	Replace with built-in PySpark functions
Cold start delay	Starter pools (pre-warmed)
Need 4x faster on supported ops	Native execution engine

🌊 Streaming Quick Reference

Feature	Eventstream	Spark Structured Streaming	KQL
Code	No-code	PySpark	KQL
Latency	Seconds	Seconds–min	Seconds
Windowing	Basic	Full	Full
Joins with batch	Limited	✅ Full	✅ Via shortcuts/materialized views
Destination	Eventhouse, Lakehouse, Warehouse	Lakehouse (Delta)	KQL Database
Best for	Routing, simple transforms	Complex transforms, ML	Real-time analytics, dashboards

🚨 Top Exam Traps

Lakehouse SQL Endpoint is READ-ONLY — you cannot INSERT/UPDATE/DELETE via T-SQL on a Lakehouse. Use Warehouse for that.
OPTIMIZE ≠ VACUUM — OPTIMIZE compacts files; VACUUM deletes old files. They solve different problems.
VACUUM retention < 7 days breaks time travel — the default 168 hours exists for a reason.
V-Order is NOT Z-Order — V-Order optimizes within files (columnar sorting); Z-Order optimizes across files (data co-location).
Mirroring ≠ Shortcuts — Mirroring copies data (CDC-based); shortcuts point to data without copying.
Git syncs definitions, NOT data — Git integration version-controls item metadata, not the actual data in tables.
Deployment pipelines deploy definitions, NOT data — same as Git — item definitions are promoted, not data.
DDM is not a security boundary — users with UNMASK permission see the real data. Don’t rely on DDM for sensitive access control.
Workspace Viewer ≠ data access — Viewer role lets you see items but may not grant access to the underlying data.
Dataflow Gen2 memory issues → enable staging — when Dataflow Gen2 runs out of memory, configure a staging Lakehouse.
AQE is enabled by default — don’t manually tune what AQE handles automatically (skew joins, partition coalescing).
Broadcast join OOM — if the “small” table is > ~100MB, broadcast will fail. Check size first.
Shortcut data is NOT affected by OPTIMIZE/VACUUM — only native Lakehouse data is optimized.
KQL query acceleration on shortcuts adds CU cost — it caches data for performance but consumes additional capacity.

📋 Pre-Exam Checklist

📊 Full Scenario Decision Table

Scenario	Key Constraint	Answer
SQL analysts write stored procedures	T-SQL read/write	Warehouse
Data engineers use PySpark + SQL	Mixed engine	Lakehouse
Access ADLS Gen2 without copying	No data movement	OneLake Shortcut
Near real-time replica of Azure SQL	Automatic CDC	Mirroring
No-code CSV-to-table transform	Citizen developer	Dataflow Gen2
Complex PySpark ETL on 500M rows	Code-first, large scale	Notebook
Multi-step: Copy → Transform → Refresh	Orchestration	Data Pipeline
Real-time IoT event routing	Stream processing	Eventstream → Eventhouse
5-min tumbling window aggregation	Time-series	KQL `bin()` or Spark `window()`
Upsert changed rows into Delta	Incremental load	Delta MERGE
Compact small files for read perf	File management	OPTIMIZE
Remove old unreferenced files	Storage cleanup	VACUUM
Speed up filtered queries	Query optimization	Z-ORDER BY
Restrict rows by user’s region	Data security	RLS (T-SQL security policy)
Hide column from specific users	Column security	CLS (T-SQL GRANT/DENY)
Mask email in query output	Data masking	DDM
Promote Dev → Test → Prod	CI/CD	Deployment Pipelines
Version-control notebooks	Source control	Git Integration
Slow Dataflow Gen2, OOM	Memory issue	Enable staging Lakehouse
Spark join slow, small + large table	Join optimization	Broadcast join
Too many small output files	Write optimization	coalesce() before write
Slow KQL on historical data	Cache miss	Extend caching policy / query acceleration

← 03 — Monitor & Optimize an Analytics Solution

Back to Home →