🐍 Appendix E — PySpark Reference & Exam Caveats
- Based on: PySpark DataFrame API in Fabric Notebooks (Microsoft Learn)
- 📁 ← Back to Home
Syntax reference for PySpark (the Python DataFrame API) as tested on DP-600. PySpark runs in Notebooks and Spark job definitions to ingest, transform, and write Lakehouse Delta tables. The exam tests reading a snippet, save modes, lazy evaluation, and when a Notebook beats a Dataflow — not writing large pipelines by hand.
Table of contents
- 🧭 1 — PySpark, Spark SQL & Where They Run
- 📥 2 — Reading Data
- 🔄 3 — Transformations (Lazy) vs Actions (Eager)
- 🧰 4 — Common DataFrame Operations
- 📤 5 — Writing Data & Save Modes
- 🔺 6 — Delta Upserts from PySpark (
DeltaTable.merge) - 🖥️ 7 — Displaying & Converting Results
- ⚙️ 8 — When to Use a Notebook (PySpark) at All
- ⚠️ 9 — PySpark Exam Traps (Rapid Fire)
🧭 1 — PySpark, Spark SQL & Where They Run
PySpark and Spark SQL are two front-ends over the same Spark engine. In a Notebook you can switch freely:
1
2
df = spark.sql("SELECT * FROM sales WHERE Amount > 100") # SQL → DataFrame
df.createOrReplaceTempView("big_sales") # DataFrame → SQL
| Front-end | Style | Best for |
|---|---|---|
| PySpark | Python, DataFrame methods | Complex logic, ML prep, reusable functions |
| Spark SQL | Declarative SQL | Set-based transforms, analysts |
Exam Tip: They are interchangeable and mixable in the same Notebook —
spark.sql()returns a DataFrame;createOrReplaceTempView()exposes a DataFrame to SQL. A question implying you must pick one exclusively is usually a distractor.
📥 2 — Reading Data
1
2
3
4
5
6
7
# Read a Lakehouse Delta table (by name)
df = spark.read.table("sales")
# Read files
df = spark.read.format("delta").load("Tables/sales")
csv = spark.read.option("header", True).option("inferSchema", True).csv("Files/raw/sales.csv")
pq = spark.read.parquet("Files/raw/sales.parquet")
| Source | Call |
|---|---|
| Managed table | spark.read.table("name") |
| Delta path | spark.read.format("delta").load(path) |
| CSV | spark.read.option("header", True).csv(path) |
| Parquet / JSON | spark.read.parquet(path) / .json(path) |
Exam Tip:
inferSchema=Truecosts an extra pass over the data. For large or stable files, supplying an explicitStructTypeschema is faster and safer than inference — a performance-oriented answer.
🔄 3 — Transformations (Lazy) vs Actions (Eager)
flowchart LR
R["read"] --> T1["select / filter / withColumn<br/>(transformations — lazy)"]
T1 --> T2["groupBy / join<br/>(lazy)"]
T2 --> A["show / count / write / collect<br/>(ACTION → triggers execution)"]
Key Point: PySpark is lazy. Transformations (
select,filter,withColumn,join,groupBy) only build a plan; nothing runs until an action (show(),count(),collect(),write,display()) forces evaluation. This is why “the cell was instant but the write took minutes” — the write triggered the whole chain.
| Transformations (lazy) | Actions (eager) |
|---|---|
select, filter/where, withColumn |
show, display, count |
groupBy().agg(), join, orderBy |
collect, take, first |
drop, distinct, withColumnRenamed |
write, save, saveAsTable |
Exam Caveat:
collect()pulls all rows to the driver and can crash it on big data — prefershow()/display()for inspection andwritefor output. “Why did the driver run out of memory?” is often acollect()on a large DataFrame.
🧰 4 — Common DataFrame Operations
1
2
3
4
5
6
7
8
9
10
11
from pyspark.sql import functions as F
result = (
df.filter(F.col("Amount") > 100)
.withColumn("Tax", F.col("Amount") * 0.2)
.groupBy("Category")
.agg(F.sum("Amount").alias("Total"),
F.countDistinct("OrderId").alias("Orders"))
.orderBy(F.desc("Total"))
)
result.show()
| Task | PySpark |
|---|---|
| Filter rows | df.filter(F.col("x") > 1) / df.where(...) |
| Add / change column | df.withColumn("c", F.col("a") + F.col("b")) |
| Select / rename | df.select("a", F.col("b").alias("bb")) |
| Group + aggregate | df.groupBy("g").agg(F.sum("v")) |
| Join | df1.join(df2, "key", "left") |
| Deduplicate | df.dropDuplicates(["k"]) |
| Sort | df.orderBy(F.desc("v")) |
Exam Caveat: Use
F.col("x")ordf["x"]for column references inside expressions — a bare string works for simpleselect/groupBybut not for arithmetic or conditions. And join type defaults toinner; specify"left","anti", etc., explicitly when needed.
📤 5 — Writing Data & Save Modes
1
2
3
4
5
6
7
8
# Save as a managed Lakehouse Delta table
df.write.format("delta").mode("overwrite").saveAsTable("sales_curated")
# Save to a path (external)
df.write.format("delta").mode("append").save("Files/curated/sales")
# Partitioned write
df.write.partitionBy("Year").mode("overwrite").saveAsTable("sales_by_year")
mode(...) |
Behaviour if table/data exists |
|---|---|
"overwrite" |
Replace existing data |
"append" |
Add new rows |
"error" / "errorifexists" (default) |
Fail |
"ignore" |
Silently skip the write |
Exam Caveat:
saveAsTable()registers a managed table in the Lakehouse metastore (visible to the SQL endpoint & Power BI);save(path)writes files without a catalog entry. If a scenario needs the data queryable from the SQL analytics endpoint or Direct Lake, usesaveAsTable()(or otherwise register the table) — a rawsave()alone won’t surface it as a table.
Exam Tip: The default write mode is error-if-exists. Re-running a Notebook that uses the default mode will fail on the second run — production loads use
overwrite(full) orappend/MERGE(incremental).
🔺 6 — Delta Upserts from PySpark (DeltaTable.merge)
For row-level upserts, PySpark uses the Delta API rather than a plain DataFrame write:
1
2
3
4
5
6
7
8
from delta.tables import DeltaTable
target = DeltaTable.forName(spark, "sales")
(target.alias("t")
.merge(updates_df.alias("u"), "t.Id = u.Id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
Exam Tip:
DeltaTable.merge(...)is the PySpark equivalent of Spark SQL’sMERGE INTO— the go-to for incremental/CDC upserts into a Lakehouse. A plainmode("append")write would duplicate rows instead of updating them.
🖥️ 7 — Displaying & Converting Results
| Call | Effect |
|---|---|
display(df) |
Rich, interactive Fabric grid + charts (preferred) |
df.show(n) |
Plain-text preview of n rows |
df.printSchema() |
Show column names & types |
df.toPandas() |
Convert to a pandas DataFrame (driver memory!) |
Exam Caveat:
toPandas()andcollect()both materialize all data on the driver node — fine for small results, dangerous at scale. For big data, keep everything distributed and write back to Delta rather than converting to pandas.
⚙️ 8 — When to Use a Notebook (PySpark) at All
flowchart TD
Q{"Ingest / transform data"} --> A{"Pro-code or<br/>large / complex?"}
A -- "Yes" --> N["Notebook<br/>(PySpark / Spark SQL)"]
A -- "No, low-code" --> D["Dataflow Gen2<br/>(Power Query)"]
Q --> B{"Just move files<br/>store-to-store?"}
B -- "Yes" --> C["Pipeline Copy activity"]
Exam Tip: Choose Notebook/PySpark for complex transformations, ML feature engineering, large-scale ELT, or reusable parameterized logic. Choose Dataflow Gen2 for low-code Power Query transforms, and Copy activity for straight bulk movement. Matching tool to scenario is heavily tested.
⚠️ 9 — PySpark Exam Traps (Rapid Fire)
- Transformations are lazy; actions trigger execution —
write/show/count/collectrun the plan. collect()/toPandas()pull all data to the driver — OOM risk at scale.saveAsTable()= managed table (catalog-visible);save(path)= files only.- Default write mode is error-if-exists — re-runs fail without
overwrite/append. - Use
F.col("x")for column expressions, not bare strings, in arithmetic/conditions. - Row-level upserts use
DeltaTable.merge, not anappendwrite. - PySpark and Spark SQL are interchangeable in one Notebook via
spark.sql()/ temp views. partitionByaffects file layout — helps large-table filter/prune performance.- Prefer
display()for inspection in Fabric; reserveshow()for plain text. - Notebook for pro-code/large transforms; Dataflow Gen2 for low-code; Copy for bulk movement.
Exam Tip: For snippet-reading questions, first find the action — that’s where work happens. Then check the write:
saveAsTablevssave, and themode, decide whether the data becomes a queryable table and whether a re-run succeeds.