✨ Appendix D — Spark SQL Reference & Exam Caveats

Based on: Apache Spark SQL & Delta Lake in Fabric (Microsoft Learn)

📁 ← Back to Home

Syntax reference for Spark SQL as tested on DP-600. Spark SQL runs in Notebooks and Spark job definitions against Lakehouse Delta tables. The exam tests where Spark SQL applies (vs T-SQL), Delta-specific commands, and the read/write boundaries — not authoring big ETL scripts from scratch.

Table of contents

🧭 1 — Where Spark SQL Shows Up in Fabric
🏗️ 2 — Creating Tables (Managed vs External)
🔺 3 — Delta Lake DML: MERGE, UPDATE, DELETE
🧹 4 — Delta Maintenance: OPTIMIZE, VACUUM, Z-ORDER
🕰️ 5 — Time Travel
👁️ 6 — Views, Temp Views & the %%sql Magic
🔤 7 — Spark SQL vs T-SQL (Key Differences)
⚠️ 8 — Spark SQL Exam Traps (Rapid Fire)

🧭 1 — Where Spark SQL Shows Up in Fabric

Surface	Spark SQL?	Notes
Notebook	✅	`%%sql` magic cell, or `spark.sql("...")` in Python
Spark Job Definition	✅	Batch script execution
Lakehouse	✅	Reads/writes managed & external Delta tables
Warehouse	❌	Uses T-SQL, not Spark
SQL analytics endpoint	❌	T-SQL read-only surface over the Lakehouse
KQL Database	❌	Uses KQL

Exam Caveat: Spark SQL writes to the Lakehouse (unlike the read-only SQL analytics endpoint). If a scenario needs to transform and persist Lakehouse Delta tables, the answer is a Notebook (Spark SQL / PySpark) or a Dataflow — not the SQL endpoint. See Appendix C — SQL for the T-SQL side.

🏗️ 2 — Creating Tables (Managed vs External)

-- Managed (data lives in the Lakehouse-managed Tables/ area)
CREATE TABLE sales USING DELTA AS
SELECT * FROM staging_sales;

-- External / unmanaged (you control the path)
CREATE TABLE ext_sales
USING DELTA
LOCATION 'Files/curated/sales';

Table type	Data location	`DROP TABLE` deletes data?
Managed	Lakehouse `Tables/` (engine-owned)	✅ Yes — metadata and files
External	A path you specify (`LOCATION`)	❌ No — drops metadata only

Exam Caveat: Dropping a managed table deletes the underlying files; dropping an external table leaves the data in place. This managed-vs-external distinction (and who owns the files) is a recurring exam question.

Exam Tip: In Fabric, USING DELTA is the default and preferred format — Delta enables ACID, time travel, and Direct Lake. Tables must be Delta (V-Order optimized) to be consumed by a Direct Lake semantic model.

🔺 3 — Delta Lake DML: MERGE, UPDATE, DELETE

Unlike the T-SQL Warehouse, Spark SQL on Delta supports MERGE — the upsert workhorse.

MERGE INTO target t
USING updates u  ON t.Id = u.Id
WHEN MATCHED THEN UPDATE SET t.Amount = u.Amount
WHEN NOT MATCHED THEN INSERT (Id, Amount) VALUES (u.Id, u.Amount)
WHEN NOT MATCHED BY SOURCE THEN DELETE;

Command	Purpose
`MERGE INTO`	Upsert (insert + update + optional delete)
`UPDATE ... SET ... WHERE`	In-place row updates
`DELETE FROM ... WHERE`	Row deletes
`INSERT INTO / OVERWRITE`	Append / replace

Exam Caveat: MERGE, UPDATE, and DELETE on individual rows are Delta/Spark capabilities — the classic scenario “upsert changed rows into a Lakehouse table” points to a Notebook, not the SQL analytics endpoint (read-only) and historically not the Warehouse.

🧹 4 — Delta Maintenance: OPTIMIZE, VACUUM, Z-ORDER

Command	What it does	Exam angle
`OPTIMIZE t`	Compacts many small files into fewer large ones	Fixes the “small files problem”
`OPTIMIZE t ZORDER BY (col)`	Co-locates data by column for faster filters	Query performance tuning
`VACUUM t RETAIN 168 HOURS`	Deletes files no longer referenced (default 7-day retention)	Reclaims storage
`DESCRIBE HISTORY t`	Shows the Delta transaction log versions	Auditing / time travel

Exam Caveat: VACUUM permanently removes old data files, which breaks time travel beyond the retention window. Running VACUUM RETAIN 0 HOURS can destroy the ability to roll back — the default 7-day retention exists to protect time travel. This trade-off is exam-favourite.

Exam Tip: The small-files problem (slow reads from thousands of tiny Parquet files) is solved with OPTIMIZE (compaction) and, for filter-heavy queries, ZORDER BY. Fabric also applies V-Order on write to boost Direct Lake/Power BI read speed.

🕰️ 5 — Time Travel

Delta keeps a versioned log, so you can query the past.

SELECT * FROM sales VERSION AS OF 12;
SELECT * FROM sales TIMESTAMP AS OF '2026-06-01T00:00:00';

-- Restore a table to an earlier version
RESTORE TABLE sales TO VERSION AS OF 12;

Exam Tip: Time travel answers “recover from a bad load” or “compare to last week” scenarios — as long as VACUUM hasn’t purged those versions. VERSION AS OF uses the integer log version; TIMESTAMP AS OF uses a point in time.

👁️ 6 — Views, Temp Views & the `%%sql` Magic

# In a PySpark cell — register a DataFrame as a queryable view
df.createOrReplaceTempView("recent_sales")

%%sql
-- Now query it with Spark SQL in a separate cell
SELECT Category, SUM(Amount) FROM recent_sales GROUP BY Category;

Object	Scope	Persisted?
Temp view (`createOrReplaceTempView`)	Current Spark session	❌
Global temp view (`global_temp.`)	All sessions on the cluster	❌
View (`CREATE VIEW`)	Catalog (metastore)	Definition only
Table (`saveAsTable`)	Lakehouse	✅ Data persisted

Exam Caveat: A temp view is session-scoped — it vanishes when the Spark session ends and is invisible to the SQL analytics endpoint or Power BI. To expose data to downstream tools you must write a managed/external Delta table (saveAsTable / CREATE TABLE), not a temp view.

🔤 7 — Spark SQL vs T-SQL (Key Differences)

Aspect	Spark SQL (Lakehouse)	T-SQL (Warehouse)
Upsert	✅ `MERGE`	Historically limited
Create-from-query	`CREATE TABLE ... USING DELTA AS SELECT`	`CREATE TABLE AS SELECT` (CTAS)
Bulk load	`spark.read...saveAsTable`	`COPY INTO`
String concat	`concat()` / `\|\|`	`+` or `CONCAT()`
Top N	`... LIMIT 5`	`SELECT TOP 5`
Identifier quoting	Back-ticks `col`	Brackets `[col]`
Case sensitivity	Case-insensitive by default	Depends on collation

Exam Caveat: LIMIT (Spark SQL) vs TOP (T-SQL), and back-ticks vs brackets, are surface-specific. Mixing SELECT TOP 5 into a Spark cell or LIMIT into a Warehouse query is a syntax-error distractor the exam uses.

⚠️ 8 — Spark SQL Exam Traps (Rapid Fire)

Spark SQL runs in Notebooks/Spark jobs, writing Lakehouse Delta — not in the Warehouse or SQL endpoint.
Managed table drop deletes files; external table drop keeps them.
Delta supports MERGE/UPDATE/DELETE — the upsert answer for Lakehouse tables.
OPTIMIZE fixes small files; ZORDER BY speeds filtered reads.
VACUUM breaks time travel past the retention window (default 7 days).
Temp views are session-scoped — write a Delta table to expose data downstream.
LIMIT not TOP; back-ticks not brackets in Spark SQL.
USING DELTA is required for Direct Lake consumption (plus V-Order).
Spark SQL and PySpark are interchangeable within a Notebook (spark.sql() bridges them) — see Appendix E — PySpark.
Time travel needs the versions to still exist — VERSION AS OF / TIMESTAMP AS OF.

Exam Tip: For “which tool writes/transforms Lakehouse data” questions, remember the hierarchy: Notebook (Spark) for pro-code transforms and upserts, Dataflow Gen2 for low-code, Pipeline Copy for bulk movement, SQL endpoint for read-only T-SQL queries.

← Appendix C — SQL Reference

Appendix E — PySpark Reference →