02 — Design Data Storage Solutions

Official Exam Weight: 25–30%


🗺️ Domain Overview

mindmap
  root((Data Storage Solutions))
    Relational Storage
      Azure SQL Database
      SQL Managed Instance
      SQL on Azure VM
      Azure DB for PostgreSQL
      Azure DB for MySQL
    Non-Relational
      Cosmos DB NoSQL
      Cosmos DB MongoDB API
      Cosmos DB Table API
      Cosmos DB Cassandra
      Cosmos DB Gremlin
    Object Storage
      Azure Blob Storage
      Azure Data Lake Gen2
      Azure Files
      Azure Queues
    Integration & Analytics
      Azure Data Factory
      Azure Databricks
      Azure Synapse Analytics
      Azure Stream Analytics

🗄️ 2.1 Design Relational Data Storage

Azure SQL Service Selection

flowchart TD
    START([Which SQL service?]) --> Q1{Need 100% SQL Server\nfeature compatibility?\nSQL Agent, CLR, SSISDB,\nlinked servers, etc.}
    Q1 -->|Yes| Q2{Need full OS control\nor third-party tools\nor BYOL for max control?}
    Q1 -->|No| SQLDB["✅ Azure SQL Database\n(Best PaaS — standard SQL workloads)"]
    Q2 -->|Yes| SQLVM["🖥️ SQL Server on Azure VM\n(IaaS — full control, 100% compat)"]
    Q2 -->|No| SQLMI["✅ Azure SQL Managed Instance\n(Best lift-and-shift PaaS\nnear-100% compatibility)"]

Azure SQL Service Comparison

| Feature | Azure SQL DB | SQL Managed Instance | SQL on Azure VM |
|---|---|---|---|
| Service model | PaaS | PaaS | IaaS |
| SQL Server compat | ~95% | ~99% | 100% |
| SQL Agent | ❌ (use Elastic Jobs) | ✅ | ✅ |
| CLR | ❌ | ✅ | ✅ |
| Linked servers | ❌ | ✅ | ✅ |
| SSISDB hosting | ✅ (via Azure-SSIS IR) | ✅ (via Azure-SSIS IR) | ✅ (native SSIS) |
| Cross-DB queries | Same server only | ✅ | ✅ |
| Backup management | Automatic | Automatic | Manual |
| Patching | Automatic | Automatic | Manual |
| SLA | 99.99% | 99.99% | 99.9% (Premium SSD) |
| Max DB size | 4 TB (GP) / 100 TB (Hyperscale) | 16 TB | Limited by disk |

Exam Caveats ⚠️:

  • SQL MI is the recommended PaaS target for on-premises SQL Server lift-and-shift migrations
  • SQL MI uses a dedicated VNet subnet — it’s network-isolated by design
  • Microsoft Entra authentication is supported by both PaaS options, but Windows Authentication for Entra principals is available only on SQL MI, not on Azure SQL Database (check current docs)
  • Elastic Jobs replaces SQL Agent for Azure SQL Database

Azure SQL Database Tiers

graph LR
    subgraph "Purchasing Models"
        DTU["💳 DTU Model\n(bundled compute+storage)\nSimpler, less flexible"]
        VCORE["⚡ vCore Model\n(separate compute & storage)\nAzure Hybrid Benefit eligible\nRecommended for new workloads"]
    end

    subgraph "vCore Service Tiers"
        GP["🟢 General Purpose\nRemote storage (Azure Premium)\nGood price/performance\nSLA: 99.99%"]
        BC["🔵 Business Critical\nLocal SSD, built-in HA replica\nHighest IOPS\nReadable secondary\nSLA: 99.99%"]
        HS["🟣 Hyperscale\nUp to 100 TB\nRapid scale out\nMultiple replicas\nSLA: 99.9%–99.99%"]
    end

    VCORE --> GP
    VCORE --> BC
    VCORE --> HS

Business Critical tier — High Availability built in:

| Tier | HA Mechanism | Readable Secondaries | SLA |
|---|---|---|---|
| General Purpose | Azure Storage replication | ❌ | 99.99% |
| Business Critical | Always On AG (3 replicas) | ✅ 1 free readable replica | 99.99% |
| Hyperscale | Multiple page server replicas | ✅ Up to 4 | 99.9%–99.99% |

Exam Caveats ⚠️:

  • Azure Hybrid Benefit only works with the vCore model — can save up to 55%
  • Business Critical includes a free readable secondary — useful for reporting workloads
  • Hyperscale supports up to 100 TB and rapid scaling but doesn’t support all SQL DB features
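
To make the tier and licensing choices concrete, here is a minimal provisioning sketch. It assumes the azure-mgmt-sql and azure-identity Python packages; the resource group, server, database name, and SKU are illustrative placeholders, not recommendations.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import Database, Sku

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

# GP_Gen5_2 = General Purpose, Gen5 hardware, 2 vCores (BC_* / HS_* for Business Critical / Hyperscale).
# license_type="BasePrice" applies Azure Hybrid Benefit; "LicenseIncluded" pays for the SQL licence instead.
poller = client.databases.begin_create_or_update(
    resource_group_name="rg-data",
    server_name="sql-contoso",
    database_name="ordersdb",
    parameters=Database(
        location="westeurope",
        sku=Sku(name="GP_Gen5_2", tier="GeneralPurpose"),
        license_type="BasePrice",
    ),
)
db = poller.result()
print(db.status)
```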

Active Geo-Replication vs Auto-Failover Groups

| Feature | Active Geo-Replication | Auto-Failover Groups |
|---|---|---|
| Supported on | SQL Database | SQL Database + SQL MI |
| Max readable secondaries | 4 | 1 (secondary region) |
| Failover type | Manual only | Manual or Automatic |
| Same connection string after failover | ❌ String changes | ✅ Listener endpoint stays the same |
| RPO | < 5 seconds | < 5 seconds |
| RTO (auto) | N/A | < 30 seconds |
| Use case | Custom failover logic | Transparent app failover |

Exam Caveats ⚠️:

  • Auto-Failover Groups are preferred in exam scenarios requiring transparent failover (same endpoint)
  • Active Geo-Replication gives you more control but requires application-level handling of the new endpoint
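
The listener behaviour is easiest to see in the connection strings. Below is a minimal sketch assuming Python with pyodbc and a hypothetical failover group named contoso-fog; the database, table, and credentials are placeholders. The point is that both listener DNS names survive a failover unchanged.

```python
import pyodbc

# Failover group listener endpoints stay constant across failovers (hypothetical group "contoso-fog"):
#   read-write listener: contoso-fog.database.windows.net
#   read-only  listener: contoso-fog.secondary.database.windows.net
READ_WRITE = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:contoso-fog.database.windows.net,1433;"
    "Database=ordersdb;Uid=appuser;Pwd=<password>;Encrypt=yes;"
)
READ_ONLY = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:contoso-fog.secondary.database.windows.net,1433;"
    "Database=ordersdb;Uid=appuser;Pwd=<password>;Encrypt=yes;"
    "ApplicationIntent=ReadOnly;"  # route this connection to the readable secondary
)

# Writes always go through the read-write listener; after a failover the same DNS name
# simply resolves to the new primary, so no application config change is needed.
rw = pyodbc.connect(READ_WRITE)
rw.execute("INSERT INTO dbo.Orders (CustomerId) VALUES (?)", 42)
rw.commit()

# Reporting traffic can use the read-only listener in the secondary region.
ro = pyodbc.connect(READ_ONLY)
order_count = ro.execute("SELECT COUNT(*) FROM dbo.Orders").fetchval()
```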

Open-Source Managed Databases

| Service | Engine | SLA | Zone-Redundant HA (Flexible Server) |
|---|---|---|---|
| Azure DB for PostgreSQL | PostgreSQL 11–16 | 99.9% | ✅ (99.99%) |
| Azure DB for MySQL | MySQL 5.7, 8.0 | 99.9% | ✅ (99.99%) |

Exam Caveat ⚠️: “Flexible Server” is the current recommended deployment mode for both PostgreSQL and MySQL. The older “Single Server” mode is deprecated — do not recommend it in exam scenarios.


🌐 2.2 Design Non-Relational Data Storage (Cosmos DB)

Cosmos DB — API Selection

graph TD
    Q{Migrating existing\napp or starting fresh?}
    Q -->|Existing MongoDB app| MON["🍃 MongoDB API\n(wire-protocol compatible)"]
    Q -->|Existing Cassandra app| CASS["📦 Cassandra API\n(CQL compatible)"]
    Q -->|Existing Table Storage app| TABLE["📋 Table API\n(key-value, Table Storage compat)"]
    Q -->|Graph data / relationships| GREM["🕸️ Gremlin API\n(Apache TinkerPop)"]
    Q -->|New app — document store| NOSQL["✅ NoSQL API (Core/SQL)\nNative, richest feature set\nSQL-like query syntax"]
    Q -->|PostgreSQL + distributed scale| PG["🐘 PostgreSQL API\n(Citus distributed)"]

Cosmos DB Consistency Levels — Visualised

graph LR
    S["🔒 Strong\n(Highest consistency\nHighest latency/cost)"]
    BS["📊 Bounded Staleness\n(Lag by K ops or T time)"]
    SE["👤 Session (Default)\n(Per-session consistency)"]
    CP["📋 Consistent Prefix\n(Ordered, no gaps)"]
    EV["⚡ Eventual\n(Lowest latency/cost\nNo order guarantee)"]

    S --> BS --> SE --> CP --> EV

| Level | Guarantee | Use Case | SLA |
|---|---|---|---|
| Strong | Linearizable (always read the latest) | Financial transactions, inventory | 99.99% (not available with multi-region writes) |
| Bounded Staleness | Read within K versions or T interval | Multi-region apps needing near-consistency | 99.999% |
| Session (default) | Consistent within a session | User profiles, shopping carts | 99.999% |
| Consistent Prefix | No out-of-order reads | Social media timelines | 99.999% |
| Eventual | No order guarantee | Like counts, view counters | 99.999% |

Exam Caveats ⚠️:

  • The 99.999% SLA applies to multi-region write configurations
  • Single-region Cosmos DB SLA is 99.99%
  • Session consistency is the default — most apps don’t need to change it
  • Strong consistency is not available with multi-region write configurations
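
To make the default and the override concrete, a minimal client sketch follows, assuming the azure-cosmos Python SDK; the endpoint, key, database, and container names are placeholders.

```python
from azure.cosmos import CosmosClient

# Placeholders; substitute your own account endpoint and key (or use azure-identity instead of a key).
ENDPOINT = "https://<account>.documents.azure.com:443/"
KEY = "<primary-key>"

# If consistency_level is omitted, the client uses the account default (Session).
# Per client/request you can only relax consistency, never exceed the account default,
# so an account configured for Session cannot be read with Strong from here.
client = CosmosClient(ENDPOINT, credential=KEY, consistency_level="Eventual")

container = client.get_database_client("appdb").get_container_client("profiles")
item = container.read_item(item="user-42", partition_key="user-42")
```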

Cosmos DB Capacity Modes

| Mode | Billing | Scale | Best For |
|---|---|---|---|
| Provisioned Throughput | Per RU/s provisioned | Manual or autoscale | Predictable production workloads |
| Autoscale Provisioned | Per highest RU/s scaled to each hour | Automatic (scales between min and max RU/s) | Variable workloads |
| Serverless | Per RU consumed | Automatic | Dev/test, spiky low-traffic apps |

Request Unit (RU) rough reference:

| Operation | Approximate RU Cost |
|---|---|
| Read 1 KB item by point-read | 1 RU |
| Write 1 KB item | 5 RU |
| Query (depends on complexity) | Varies widely |
| Stored procedure execution | Varies |

Exam Caveat ⚠️: A cross-partition query (no partition key filter) consumes significantly more RUs than a point-read. Partition key design directly impacts cost.
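
The sketch below illustrates that caveat, assuming the azure-cosmos Python SDK and a hypothetical orders container partitioned on /userId; it prints the request charge reported for a point-read versus a cross-partition query.

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("appdb").get_container_client("orders")

# Point-read: id + partition key known up front, roughly 1 RU for a 1 KB item.
container.read_item(item="order-1001", partition_key="user-42")
print("point-read RU:",
      container.client_connection.last_response_headers["x-ms-request-charge"])

# Cross-partition query: no partition key filter, so every physical partition is scanned.
list(container.query_items(
    query="SELECT * FROM c WHERE c.status = 'Shipped'",
    enable_cross_partition_query=True,
))
# The header reflects the most recent service response (the last page, for queries).
print("cross-partition query RU:",
      container.client_connection.last_response_headers["x-ms-request-charge"])
```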

Cosmos DB Partition Key Design

Good partition keys:

  • ✅ High cardinality (many distinct values)
  • ✅ Even distribution of reads and writes
  • ✅ Used frequently in queries (enables point-reads)
  • ✅ Examples: /userId, /deviceId, /orderId, /sessionId

Bad partition keys:

  • ❌ Low cardinality: /status (Active/Inactive), /country (only 50 values)
  • ❌ Monotonically increasing: /timestamp (all writes go to the same hot partition)
  • ❌ Never queried: wastes the partition key benefit
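
Putting the guidance above into practice, here is a minimal container-creation sketch assuming the azure-cosmos Python SDK; the appdb database, orders container, and /userId key are illustrative assumptions.

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("appdb")

# /userId is high-cardinality, evenly distributed, and present in most queries,
# so reads can be served as cheap single-partition point-reads.
orders = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/userId"),
    offer_throughput=400,  # provisioned RU/s; omit for a serverless account
)

orders.upsert_item({
    "id": "order-1001",
    "userId": "user-42",  # partition key value
    "status": "Shipped",
    "total": 129.95,
})
```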

🗃️ 2.3 Design Object / Blob Storage

Blob Storage — Access Tier Decision Flow

flowchart TD
    Q1{How frequently is\nthe data accessed?}
    Q1 -->|"Daily / very frequently"| HOT["🔥 Hot Tier\nLowest access cost\nHighest storage cost"]
    Q1 -->|"Occasionally (monthly)"| COOL["❄️ Cool Tier\nMin 30 days\nBalance of cost"]
    Q1 -->|"Rarely (quarterly)"| COLD["🧊 Cold Tier\nMin 90 days\nLower storage cost"]
    Q1 -->|"Almost never\n(compliance archive)"| ARCH["⬛ Archive Tier\nMin 180 days\nOffline — must rehydrate\nLowest storage cost"]

Blob type selection:

| Blob Type | Optimised For | Examples |
|---|---|---|
| Block Blob | Large sequential files | Images, videos, documents, backups |
| Append Blob | Append-only writes | Log files, audit streams |
| Page Blob | Random read/write (512-byte pages) | Azure VM OS and data disks |

Exam Caveats ⚠️:

  • Archive blobs are OFFLINE — you cannot read them without rehydrating first (hours to complete)
  • Early deletion fees apply if you delete a blob before minimum storage duration expires
  • Lifecycle Management Policies automate tier transitions and deletions — define rule conditions by daysAfterModificationGreaterThan or daysAfterCreationGreaterThan
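
A lifecycle policy can be expressed as a small JSON document. The sketch below builds one as a Python dict and writes it to policy.json for use with az storage account management-policy create; the rule name, logs/ prefix, and day thresholds are illustrative assumptions.

```python
import json

# Hypothetical rule: blobs under "logs/" move to Cool after 30 days without modification,
# to Archive after 180 days, and are deleted after roughly 7 years.
policy = {
    "rules": [
        {
            "enabled": True,
            "name": "tier-down-logs",
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["logs/"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                        "delete": {"daysAfterModificationGreaterThan": 2555},
                    }
                },
            },
        }
    ]
}

with open("policy.json", "w") as f:
    json.dump(policy, f, indent=2)

# Apply (assumed CLI invocation):
#   az storage account management-policy create \
#       --account-name <storage-account> --resource-group <rg> --policy @policy.json
```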

Azure Data Lake Storage Gen2 (ADLS Gen2)

| Feature | Blob Storage | ADLS Gen2 |
|---|---|---|
| Hierarchical namespace | ❌ Flat | ✅ True directory structure |
| POSIX ACLs | ❌ | ✅ File and folder-level permissions |
| Hadoop-compatible | Limited (legacy WASB driver) | ✅ (ABFS driver) |
| Analytics compatibility | Limited | ✅ Databricks, Synapse, HDInsight |
| Cost difference | Baseline | Same (just a feature toggle) |
| How to enable | n/a | Enable “Hierarchical Namespace” at storage account creation |

Exam Caveats ⚠️:

  • ADLS Gen2 = Blob Storage with Hierarchical Namespace enabled — NOT a separate service
  • Hierarchical Namespace cannot be disabled after the storage account is created
  • Use ADLS Gen2 for all big data / analytics workloads in Azure
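
Since the hierarchical namespace is a creation-time decision, a minimal provisioning sketch follows, assuming the azure-mgmt-storage and azure-identity Python packages; the subscription, resource group, account name, and region are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# is_hns_enabled can only be chosen here, at creation time;
# it cannot be switched off later on the same account.
poller = client.storage_accounts.begin_create(
    resource_group_name="rg-analytics",
    account_name="stdatalake001",
    parameters=StorageAccountCreateParameters(
        location="westeurope",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,
    ),
)
account = poller.result()
print(account.primary_endpoints.dfs)  # the DFS/ABFS endpoint used by analytics engines
```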

Storage Security Model

| Method | Scope | Granularity | Recommended? |
|---|---|---|---|
| Entra ID + RBAC (data roles) | Container, blob | Role-based | ✅ Best for Azure resources |
| User Delegation SAS | Container, blob | Fine-grained, time-limited | ✅ Best for external users |
| Service SAS | Service-level | Fine-grained, time-limited | ✅ Good for external access |
| Account SAS | Entire account | Coarse | ⚠️ Use sparingly |
| Storage Account Keys | Entire account | Full access | ❌ Avoid (too much access) |
| Stored Access Policy | Defined on the container (server-side) | Can be revoked server-side | ✅ Use with Service SAS |

Exam Caveats ⚠️:

  • User Delegation SAS is signed with Entra ID credentials (not account keys) — the most secure SAS type
  • Rotating a Storage Account Key immediately invalidates all SAS tokens signed with the old key
  • Private Endpoint for storage = traffic never leaves your VNet (preferred over Service Endpoints for on-prem access)
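
A minimal User Delegation SAS sketch follows, assuming the azure-storage-blob and azure-identity Python packages; the account, container, and blob names are placeholders. The point is that the SAS is signed with a key issued by Entra ID, not with a storage account key.

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobSasPermissions, BlobServiceClient, generate_blob_sas

ACCOUNT = "stcontoso001"  # placeholder account name
service = BlobServiceClient(
    account_url=f"https://{ACCOUNT}.blob.core.windows.net",
    credential=DefaultAzureCredential(),  # Entra ID identity, no account key involved
)

now = datetime.now(timezone.utc)
# The user delegation key is issued by Entra ID and is what signs the SAS.
udk = service.get_user_delegation_key(key_start_time=now, key_expiry_time=now + timedelta(hours=1))

sas = generate_blob_sas(
    account_name=ACCOUNT,
    container_name="reports",
    blob_name="q3-summary.pdf",
    user_delegation_key=udk,
    permission=BlobSasPermissions(read=True),
    expiry=now + timedelta(minutes=30),
)
url = f"https://{ACCOUNT}.blob.core.windows.net/reports/q3-summary.pdf?{sas}"
```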

🔄 2.4 Design Data Integration Solutions

Integration Service Selection

graph TD
    Q{Primary need?}
    Q -->|"ETL / ELT pipelines\ndata movement / orchestration"| ADF["🔄 Azure Data Factory\nBest for: orchestration, copy data\nSupports SSIS lift-and-shift"]
    Q -->|"Large-scale Spark\ncomplex transformations + ML"| ADB["⚡ Azure Databricks\nApache Spark\nPython, Scala, SQL, R"]
    Q -->|"Unified warehouse + analytics\nsingle platform"| SYN["🏛️ Azure Synapse Analytics\nDedicated SQL Pool (DW)\nServerless SQL Pool\nSpark Pool + ADF built-in"]
    Q -->|"Real-time event\nstream processing"| ASA["📡 Azure Stream Analytics\nSQL-like query on streams\nLow-latency"]
    Q -->|"High-throughput\nevent ingestion"| EH["📥 Azure Event Hubs\nKafka-compatible\nRetention 1–90 days"]

Azure Synapse Analytics — Pools Comparison

| Pool Type | Purpose | Billing | Max Scale |
|---|---|---|---|
| Dedicated SQL Pool | Enterprise data warehouse, high-performance SQL | Per DWU provisioned | 30,000 DWUs |
| Serverless SQL Pool | Query ADLS Gen2 data without provisioning | Per TB scanned | Automatic |
| Apache Spark Pool | Big data processing, ML, ETL | Per vCore-hour while running | 200 nodes |

Exam Caveats ⚠️:

  • Serverless SQL Pool is free to provision — you only pay for queries run (TB scanned)
  • Pause a Dedicated SQL Pool when it is not in use to stop compute billing (storage is still billed); this is a common cost-management question
  • Synapse Link provides near real-time analytical queries directly against Cosmos DB data — no ETL needed
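
To illustrate the pay-per-query model, a minimal sketch follows, assuming Python with pyodbc, a hypothetical workspace named contoso-synapse, and Parquet files in an ADLS Gen2 account; only the data scanned by the query is billed.

```python
import pyodbc

# Serverless SQL pool endpoint of a hypothetical workspace "contoso-synapse".
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=contoso-synapse-ondemand.sql.azuresynapse.net;"
    "Database=master;Uid=sqladminuser;Pwd=<password>;Encrypt=yes;"
)

# OPENROWSET reads the Parquet files in place: no dedicated pool, no data loading.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://stdatalake001.dfs.core.windows.net/data/sales/year=2024/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""
for row in conn.execute(query):
    print(row)
```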

Azure Data Factory — Key Components

| Component | Description |
|---|---|
| Pipeline | Workflow of one or more activities |
| Activity | A single step (Copy Data, Notebook, Stored Proc, etc.) |
| Dataset | Named reference to data in a linked service |
| Linked Service | Connection to a data source (like a connection string) |
| Integration Runtime | Compute used to execute activities |
| Trigger | Schedules or events that start pipelines |

Integration Runtime Types:

| IR Type | Use For | Network |
|---|---|---|
| Azure IR | Azure data stores and cloud services | Public or Managed VNet |
| Self-hosted IR | On-premises or other clouds (AWS, GCP) | Requires an agent on your network |
| Azure-SSIS IR | Running SSIS packages in Azure | Azure or Managed VNet |

Exam Caveats ⚠️:

  • To run SSIS packages in Azure, use ADF with Azure-SSIS Integration Runtime
  • Self-hosted IR requires installing an agent — design for HA with multiple nodes
  • ADF itself has 99.9% SLA

🎯 Domain 2 — Exam Scenario Quick-Reference

| Scenario | Answer |
|---|---|
| Lift-and-shift on-prem SQL Server, uses SQL Agent, CLR | Azure SQL Managed Instance |
| New cloud-native SQL app, standard T-SQL | Azure SQL Database (vCore, General Purpose) |
| Multiple DBs with variable workloads sharing resources | Azure SQL Elastic Pool |
| Need transparent failover after SQL DB regional outage | Auto-Failover Group (same endpoint listener) |
| Globally distributed app, sub-10 ms latency, multi-region writes | Cosmos DB (multi-region write, Session or Eventual consistency) |
| Existing MongoDB app migrating to Azure | Cosmos DB MongoDB API |
| Store 200 TB of rarely accessed compliance archives | Blob Storage Archive tier |
| Big data analytics on files stored in Azure Storage | ADLS Gen2 + Azure Synapse Serverless SQL Pool |
| Run existing SSIS packages in the cloud | ADF with Azure-SSIS Integration Runtime |
| Query the data lake directly without provisioning a DW | Synapse Serverless SQL Pool |
| Real-time event stream analysis (IoT sensor data) | Azure Stream Analytics |
| High-throughput event ingestion (millions/sec) | Azure Event Hubs |
| Need 99.999% availability for a database | Cosmos DB with multi-region write |
| SQL workload cost reduction with existing SQL Server licences (Software Assurance) | Azure Hybrid Benefit (vCore model) |