00 — Microsoft Fabric Prerequisites & Core Concepts

  • Based on: Microsoft Fabric documentation (Microsoft Learn)
  • 📁 ← Back to Home

🏗️ Microsoft Fabric Architecture

What is Microsoft Fabric?

Microsoft Fabric is an end-to-end, unified analytics platform that brings together data engineering, data integration, data warehousing, real-time intelligence, data science, and business intelligence into a single SaaS product.

graph TD
    FABRIC["🏗️ Microsoft Fabric\n(Unified SaaS Analytics Platform)"] --> ONELAKE["🗄️ OneLake\n(Single unified data lake)"]
    FABRIC --> DE["🔧 Data Engineering\nLakehouse, Notebooks, Spark"]
    FABRIC --> DI["🔄 Data Integration\nPipelines, Dataflow Gen2"]
    FABRIC --> DW["🏢 Data Warehouse\nT-SQL Analytics"]
    FABRIC --> RTI["⚡ Real-Time Intelligence\nEventstreams, Eventhouse, KQL"]
    FABRIC --> DS["🧪 Data Science\nML Models, Experiments"]
    FABRIC --> PBI["📊 Power BI\nReports, Dashboards, Semantic Models"]
    ONELAKE --> DELTA["📁 Delta Lake Format\n(Open-source, ACID, Parquet-based)"]

OneLake — The Unified Data Lake

OneLake is Fabric’s single, unified, logical data lake for the entire organization — think of it as the “OneDrive for data.”

Attribute Detail
Storage format Delta Lake (Parquet + transaction log)
Protocol ADLS Gen2-compatible (abfss://)
Multi-cloud Supports OneLake shortcuts to Azure, AWS S3, GCS
Governance Centralized — one copy of data, one set of permissions
Hierarchy Tenant → Workspace → Lakehouse/Warehouse → Tables/Files
graph TD
    TENANT["🏢 Fabric Tenant"] --> WS1["📁 Workspace A"]
    TENANT --> WS2["📁 Workspace B"]
    WS1 --> LH1["🏠 Lakehouse 1"]
    WS1 --> WH1["🏢 Warehouse 1"]
    WS2 --> LH2["🏠 Lakehouse 2"]
    LH1 --> TABLES1["📊 Tables (Delta)"]
    LH1 --> FILES1["📁 Files (unmanaged)"]
    WH1 --> TABLES2["📊 Tables (T-SQL)"]
    LH2 --> SHORTCUT["🔗 Shortcut\n(to external storage)"]

Exam Caveat ⚠️: OneLake is not a separate Azure resource — it comes automatically with every Fabric tenant. There is no need to provision storage accounts separately.


🧊 Lakehouse vs Warehouse

Understanding when to use a Lakehouse vs a Warehouse is one of the most frequently tested concepts on the DP-700.

Decision Flow

flowchart TD
    Start([Data Engineering Need]) -->|"Primarily SQL-based\nT-SQL heavy workloads?"| WH["🏢 Warehouse\n(T-SQL endpoint)"]
    Start -->|"Mixed workloads?\nPySpark + SQL?"| LH["🏠 Lakehouse\n(Spark + SQL endpoint)"]
    Start -->|"Unstructured / semi-structured\ndata exploration?"| LH
    LH -->|"Need T-SQL access too?"| SQLEP["SQL Analytics Endpoint\n(auto-generated, read-only)"]
    WH -->|"Need Spark access?"| SHORTCUT2["Create shortcut\nfrom Lakehouse to Warehouse"]

Comparison Table

Feature Lakehouse Warehouse
Storage format Delta Lake (Parquet) Delta Lake (Parquet)
Primary engine Apache Spark (PySpark, Spark SQL) T-SQL
SQL access SQL Analytics Endpoint (read-only, auto-generated) Full T-SQL (read/write)
Schema support Schema-on-read + schema-on-write Schema-on-write
File support Managed tables + unmanaged files (Files/) Tables only
Stored procedures Not supported Supported
Cross-database queries Via shortcuts Native cross-database queries
Best for Data engineers, data scientists, mixed workloads SQL analysts, BI workloads, traditional DW

Exam Caveat ⚠️: The Lakehouse’s SQL Analytics Endpoint is read-only — you cannot INSERT, UPDATE, or DELETE via T-SQL on a Lakehouse. For write operations via T-SQL, use a Warehouse.


⚡ Fabric Capacity & Licensing

Capacity Units (CUs)

Fabric is licensed via capacities, measured in Capacity Units (CUs).

SKU CUs Spark vCores Typical Use
F2 2 8 Dev/test
F4 4 16 Small team
F8 8 32 Small production
F16 16 64 Medium production
F32 32 128 Large production
F64 64 256 Enterprise
F128+ 128+ 512+ Large enterprise

Exam Caveat ⚠️: Fabric uses a consumption-based model — different workloads (Spark, SQL, Dataflow Gen2, pipeline) consume CUs at different rates. Spark jobs tend to be the heaviest CU consumers.


🔄 Fabric Item Types

mindmap
  root((Fabric Items))
    Data Engineering
      Lakehouse
      Notebook
      Spark Job Definition
      Environment
    Data Integration
      Data Pipeline
      Dataflow Gen2
    Data Warehouse
      Warehouse
      SQL Analytics Endpoint
    Real-Time Intelligence
      Eventhouse
      KQL Database
      KQL Queryset
      Eventstream
    Data Science
      Experiment
      ML Model
    Power BI
      Semantic Model
      Report
      Dashboard

📁 Delta Lake Fundamentals

Delta Lake is the default storage format in Microsoft Fabric. It’s an open-source storage layer that adds ACID transactions to Apache Spark and big data workloads.

Key Features

Feature Description
ACID Transactions Atomic, consistent, isolated, durable operations on data lakes
Schema Enforcement Rejects writes that don’t match the table schema
Schema Evolution Supports adding new columns via mergeSchema option
Time Travel Query previous versions of data using VERSION AS OF or TIMESTAMP AS OF
OPTIMIZE Compacts small files into larger ones for better read performance
VACUUM Removes old files no longer referenced by the transaction log
V-Order Fabric-specific optimization — columnar sorting for faster reads
Z-Order Co-locates related data for faster filtering on specific columns

Delta Lake File Structure

graph TD
    TABLE["📊 Delta Table"] --> LOG["📋 _delta_log/\n(Transaction Log — JSON + Parquet checkpoints)"]
    TABLE --> PARTS["📁 Parquet Files\n(Actual data, partitioned)"]
    LOG --> V0["00000.json\n(Version 0)"]
    LOG --> V1["00001.json\n(Version 1)"]
    LOG --> CP["00010.checkpoint.parquet\n(Checkpoint every 10 versions)"]

Exam Caveat ⚠️:

  • OPTIMIZE compacts small files but does not remove old files — you need VACUUM for that
  • VACUUM default retention is 7 days — running VACUUM with a shorter retention can break time travel
  • V-Order is a Fabric-specific optimization that is enabled by default in Fabric lakehouses

🔗 OneLake Shortcuts

Shortcuts are pointers to data stored in other locations — they allow you to access external data without copying it into OneLake.

Supported Shortcut Sources

Source Protocol Notes
Another OneLake location Internal Cross-workspace, cross-lakehouse
Azure Data Lake Storage Gen2 abfss:// External Azure storage
Amazon S3 s3:// Cross-cloud
Google Cloud Storage gs:// Cross-cloud
Dataverse Dataverse API Power Platform integration
flowchart LR
    LH["🏠 Lakehouse\n(Fabric)"] -->|Shortcut| ADLS["☁️ ADLS Gen2\n(Azure)"]
    LH -->|Shortcut| S3["☁️ Amazon S3\n(AWS)"]
    LH -->|Shortcut| GCS["☁️ Google Cloud\n(GCP)"]
    LH -->|Shortcut| OTHER_LH["🏠 Other Lakehouse\n(Fabric)"]

Exam Caveat ⚠️:

  • Shortcuts provide read access to external data — the data is not copied into OneLake
  • Security on shortcut data is governed by both the source permissions and OneLake permissions
  • Shortcuts appear as regular folders/tables in the Lakehouse

🔧 Core Languages in Fabric

Language Where Used Best For
PySpark Notebooks, Spark Job Definitions Large-scale data transformation, complex ETL
Spark SQL Notebooks (via %%sql magic) SQL-based transforms on Lakehouse tables
T-SQL Warehouse, SQL Analytics Endpoint Traditional SQL workloads, stored procedures
KQL (Kusto Query Language) Eventhouse, KQL Database Real-time analytics, log analysis, time-series
DAX Semantic Models, Power BI Business intelligence calculations
M (Power Query) Dataflow Gen2 No-code/low-code data transformation

Exam Caveat ⚠️: The exam expects you to know when to use each language, not necessarily write complex code. Typical questions: “Which language should you use to transform streaming data in an Eventhouse?” → KQL.


🧭 Fabric vs Azure Data Services

Feature Microsoft Fabric Azure Data Factory + Synapse
Deployment SaaS (no Azure resources to manage) PaaS (you manage resources)
Data lake OneLake (automatic) ADLS Gen2 (you provision)
Compute Shared capacity (CUs) Dedicated pools / integration runtimes
Licensing Capacity-based (F SKUs) Per-resource pricing
Unified experience Single portal (app.fabric.microsoft.com) Multiple portals (portal.azure.com, Synapse Studio)
Real-time Eventstreams + Eventhouse (built-in) Event Hubs + Stream Analytics (separate services)
Governance Built-in (Purview integration) Separate Purview deployment

📊 Quick-Reference Scenario Table

Scenario Requirement Fabric Component
SQL analysts need to write stored procedures T-SQL read/write Warehouse
Data engineers need PySpark + SQL on same data Mixed engine Lakehouse
Access external ADLS Gen2 without copying Virtual access OneLake Shortcut
No-code data transformation Low-code ETL Dataflow Gen2
Real-time event processing and analytics Streaming + KQL Eventstream → Eventhouse
Orchestrate multi-step data pipelines Workflow engine Data Pipeline
Complex PySpark transformations Code-first ETL Notebook
Build BI reports on Lakehouse data Business intelligence Semantic Model + Power BI