04 — Design Infrastructure Solutions

Official Exam Weight: 30–35% (heaviest domain)


🗺️ Domain Overview

mindmap
  root((Infrastructure Solutions))
    Compute Solutions
      Virtual Machines & VMSS
      Azure App Service
      Azure Kubernetes Service
      Azure Container Instances
      Azure Container Apps
      Azure Functions
      Azure Batch
    Application Architecture
      Service Bus & Event Grid
      Event Hubs
      API Management
      Azure Cache for Redis
      Azure CDN & Front Door
      Logic Apps
    Migration Solutions
      Azure Migrate
      6 Rs Strategy
      Azure DMS
      Azure Data Box
    Network Solutions
      VNet Design
      VPN Gateway & ExpressRoute
      Azure Firewall & NSGs
      Azure Bastion
      Private Endpoints
      Hub-and-Spoke
      Azure Virtual WAN

🖥️ 4.1 Design Compute Solutions

Compute Service Decision Tree

flowchart TD
    START([What is the workload?]) --> VM{Need full OS control\nor lift-and-shift IaaS?}
    VM -->|Yes| VMSERVICE["🖥️ Azure Virtual Machines\n+ VMSS for autoscaling"]
    VM -->|No| WEB{Web app or REST API\nno containers?}
    WEB -->|Yes| APP["🌐 Azure App Service\nPaaS, managed runtime"]
    WEB -->|No| CONT{Containerised?}
    CONT -->|"Short-lived,\nno orchestration"| ACI["📦 Azure Container Instances\nServerless containers"]
    CONT -->|"Production\nKubernetes"| AKS["☸️ Azure Kubernetes Service\nManaged K8s"]
    CONT -->|"Serverless\nmicroservices"| CAPP["🔲 Azure Container Apps\n(built on K8s + KEDA)"]
    CONT -->|No| EVENT{Event-triggered,\nshort execution?}
    EVENT -->|Yes| FUNC["⚡ Azure Functions\nServerless compute"]
    EVENT -->|No| BATCH["⚙️ Azure Batch\nHPC / parallel jobs"]
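
The decision tree above can be read as a chain of early returns. The following is a minimal sketch of that reading, not an official selection algorithm; the `Workload` fields and the `container_style` labels are hypothetical names chosen to mirror the flowchart branches.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical flags, one per question in the flowchart.
    needs_os_control: bool = False   # full OS control or lift-and-shift IaaS
    plain_web_app: bool = False      # web app or REST API, no containers
    containerised: bool = False
    container_style: str = ""        # "short-lived", "kubernetes" or "serverless"
    event_triggered: bool = False    # event-triggered, short execution


def choose_compute(w: Workload) -> str:
    """Walk the flowchart top to bottom and return the first matching service."""
    if w.needs_os_control:
        return "Azure Virtual Machines (+ VMSS)"
    if w.plain_web_app:
        return "Azure App Service"
    if w.containerised:
        return {
            "short-lived": "Azure Container Instances",
            "kubernetes": "Azure Kubernetes Service",
            "serverless": "Azure Container Apps",
        }[w.container_style]
    if w.event_triggered:
        return "Azure Functions"
    return "Azure Batch"


print(choose_compute(Workload(containerised=True, container_style="serverless")))
```

The order of the checks matters: a workload that needs full OS control resolves to VMs even if it is also containerised, which matches how the flowchart asks its questions.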

Azure VM Scale Sets (VMSS)

Key features:

  • 🔢 Scale from 0 to 1,000 VMs (custom images: 600)
  • ⚖️ Autoscale based on metrics (CPU, memory, queue depth) or schedule
  • 🔄 Rolling upgrades — update VMs in batches to avoid downtime
  • 🏗️ Two orchestration modes:

| Mode | Description | Best For |
|---|---|---|
| Uniform | All instances identical, same SKU | Stateless workloads, AKS node pools |
| Flexible | Mix of VM configs, better fault domain spread | Workloads requiring VM-level flexibility |

Exam Caveats ⚠️:

  • For new VMSS deployments, Flexible orchestration is the Microsoft-recommended mode
  • Load-balanced scale sets in Flexible orchestration mode, spanning Availability Zones, or above 100 instances require Standard Load Balancer (not Basic)
  • Define autoscale rules with cool-down periods to prevent flapping (scale in/out too rapidly)
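
To see why a cool-down period prevents flapping, here is a small sketch of a scaling decision that ignores the metric until enough time has passed since the last scale action. The thresholds, the 300-second cool-down, and the instance limits are illustrative assumptions, not Azure defaults.

```python
def autoscale_decision(cpu_percent: float, instances: int,
                       seconds_since_last_scale: float,
                       scale_out_at: float = 70.0, scale_in_at: float = 30.0,
                       cooldown_seconds: float = 300,
                       min_instances: int = 2, max_instances: int = 10) -> int:
    """Return the new instance count for one evaluation of the rule."""
    # Inside the cool-down window no action is taken, whatever the metric says.
    if seconds_since_last_scale < cooldown_seconds:
        return instances
    if cpu_percent > scale_out_at and instances < max_instances:
        return instances + 1
    if cpu_percent < scale_in_at and instances > min_instances:
        return instances - 1
    return instances


print(autoscale_decision(cpu_percent=90, instances=3, seconds_since_last_scale=60))   # 3
print(autoscale_decision(cpu_percent=90, instances=3, seconds_since_last_scale=600))  # 4
```

Keeping a gap between the scale-out and scale-in thresholds (70% and 30% here) is the second half of the same defence: a metric hovering around a single threshold would otherwise trigger alternating actions.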

Azure Functions — Hosting Plans Comparison

| Plan | Scale | Cold Start | Timeout | VNet Integration | SLA |
|---|---|---|---|---|---|
| Consumption | Auto (unlimited) | ✅ Yes | 5 min (max 10) | ❌ No | 99.9% |
| Flex Consumption | Auto + pre-provisioned | ✅ Reduced | Unlimited | ✅ Yes | 99.9% |
| Premium (EP) | Auto | ❌ Pre-warmed | Unlimited | ✅ Yes | 99.9% |
| Dedicated (App Svc) | Manual / auto | ❌ Always on | Unlimited | ✅ Yes | 99.9% |
| Container Apps | Auto (KEDA) | ✅ Possible | Unlimited | ✅ Yes | 99.9% |

Exam Caveats ⚠️:

  • Consumption plan cannot use VNet Integration; use Flex Consumption, Premium, or Dedicated if needed
  • Premium plan eliminates cold starts via pre-warmed instances
  • Durable Functions enables stateful workflows across multiple function executions
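
The plan comparison reduces to three questions: is VNet Integration needed, are cold starts acceptable, and how long does one execution run. The sketch below encodes only those three rules from the table; the function name and parameters are hypothetical, and real plan selection also involves cost and scale limits that are not modelled here.

```python
def choose_functions_plan(needs_vnet: bool, tolerates_cold_start: bool,
                          max_run_minutes: float) -> str:
    """Pick a hosting plan from the three constraints in the comparison table."""
    # Consumption only fits when nothing rules it out: no VNet, cold starts
    # acceptable, and executions within the 10-minute maximum timeout.
    if not needs_vnet and tolerates_cold_start and max_run_minutes <= 10:
        return "Consumption"
    # Pre-warmed instances are what removes cold starts.
    if not tolerates_cold_start:
        return "Premium (EP)"
    # VNet or long executions, with cold starts still acceptable.
    return "Flex Consumption"


print(choose_functions_plan(needs_vnet=True, tolerates_cold_start=False, max_run_minutes=2))
```

The example call matches the "serverless API, no cold start tolerance, needs VNet" scenario and returns the Premium plan.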

Azure Kubernetes Service (AKS)

Key design decisions:

| Decision | Options |
|---|---|
| Network plugin | Kubenet (basic, pods get no VNet IP) vs Azure CNI (each pod gets a VNet IP; required for Azure network policies, Windows node pools, and virtual nodes) |
| Node pools | System (K8s system services) + User (your workloads), kept separate for isolation |
| HA | Availability Zones for node distribution (raises the Standard-tier SLA from 99.9% to 99.95%) |
| Autoscaling | Cluster Autoscaler (nodes) + Horizontal Pod Autoscaler (pods) |
| Cluster tier | Free (no SLA) / Standard (99.9%, or 99.95% with Availability Zones) / Premium (Standard SLA + LTS) |

AKS SLAs:

| Tier | SLA | Notes |
|---|---|---|
| Free | No SLA | Dev/test only |
| Standard | 99.9% | Production recommended |
| Standard + Availability Zones | 99.95% | Mission-critical production |

Exam Caveats ⚠️:

  • Azure CNI is required for: Advanced Networking policies, Windows node pools, Virtual nodes
  • Kubenet is simpler but has limitations — use for basic workloads only
  • AKS System node pools cannot be scaled to 0 — they must always have at least 1 node

Azure App Service — Deployment Slots

flowchart LR
    DEV["👨‍💻 Developer\npushes code"] --> STAGING["🔵 Staging Slot\n(pre-production)"]
    STAGING --> TEST["🧪 Testing &\nWarm-up"]
    TEST --> SWAP["🔄 Slot Swap\n(zero-downtime deploy)"]
    SWAP --> PROD["🟢 Production Slot\n(live traffic)"]
    SWAP -.->|"Rollback:\nswap back"| STAGING

Slot behaviour:

  • ✅ Slots share the same App Service Plan resources
  • 🔁 Swap exchanges slot content and settings with zero downtime
  • ⚙️ Some settings are “slot-sticky” (stay with the slot) vs non-sticky (swap with deployment)
  • 📊 Traffic routing — send % of traffic to a slot for A/B testing
  • Slots per tier: Standard = 5, Premium = 20, Isolated = 20
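
The difference between slot-sticky and non-sticky settings is easiest to see on a toy model. The sketch below treats each slot as a plain dictionary of settings and is only an illustration of the swap semantics described above; the setting names `APP_VERSION` and `DB_CONN` are made up.

```python
def swap_slots(production: dict, staging: dict, sticky: set) -> tuple:
    """Return (new_production, new_staging) after a slot swap."""
    def merge(incoming: dict, resident: dict) -> dict:
        # Non-sticky settings travel with the deployment being swapped in.
        swapped = {k: v for k, v in incoming.items() if k not in sticky}
        # Sticky settings stay with the slot they were defined on.
        kept = {k: v for k, v in resident.items() if k in sticky}
        return {**swapped, **kept}

    return merge(staging, production), merge(production, staging)


prod = {"APP_VERSION": "1.0", "DB_CONN": "prod-db"}
stage = {"APP_VERSION": "1.1", "DB_CONN": "staging-db"}
new_prod, new_stage = swap_slots(prod, stage, sticky={"DB_CONN"})
print(new_prod)   # {'APP_VERSION': '1.1', 'DB_CONN': 'prod-db'}
print(new_stage)  # {'APP_VERSION': '1.0', 'DB_CONN': 'staging-db'}
```

After the swap, production runs version 1.1 but still points at the production database, and swapping again restores the previous state, which is the rollback path shown in the diagram.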

🏛️ 4.2 Design Application Architecture

Messaging Service Comparison

graph LR
    subgraph "MESSAGES (pull/push, reliable delivery)"
        SB["📬 Azure Service Bus\n————————\nQueues: point-to-point\nTopics: fan-out pub/sub\nMax message: 100 MB\nOrdered, FIFO, DLQ\nSLA: 99.9%"]
        SQ["📋 Azure Storage Queue\n————————\nSimple FIFO queue\nMax message: 64 KB\nHigh volume, low cost\nSLA: 99.9%"]
    end

    subgraph "EVENTS (fire-and-forget, reactive)"
        EG["⚡ Azure Event Grid\n————————\nEvent routing (push)\nFire-and-forget\nNo retention\nSLA: 99.9%"]
        EH["📡 Azure Event Hubs\n————————\nEvent streaming (pull)\nKafka-compatible\nRetention: 1–90 days\nMB/s to GB/s throughput\nSLA: 99.9%"]
    end

Detailed comparison — Service Bus vs Storage Queue:

| Feature | Azure Service Bus | Azure Storage Queue |
|---|---|---|
| Max message size | 256 KB–100 MB | 64 KB |
| Max queue size | 80 GB | Unlimited (storage limit) |
| Guaranteed ordering (FIFO) | ✅ (with sessions) | ❌ Best-effort |
| Dead-letter queue (DLQ) | ✅ | ❌ |
| Duplicate detection | ✅ | ❌ |
| Transactions | ✅ | ❌ |
| Message lock / peek-lock | ✅ | ❌ (visibility timeout instead) |
| Pub/sub topics | ✅ | ❌ |
| SLA | 99.9% | 99.9% |
| Cost | Higher | Lower |
| Use when | Enterprise messaging, ordering, reliability | Simple, high-volume, cheap queuing |

Event Grid vs Event Hubs:

| Feature | Azure Event Grid | Azure Event Hubs |
|---|---|---|
| Pattern | Event routing (push) | Event streaming (pull/checkpoint) |
| Retention | None (delivery is retried, but events cannot be replayed) | 1–90 days |
| Consumer model | Push to subscribers | Pull via consumer groups |
| Throughput | Millions of events/sec | Millions of events/sec |
| Ordering | ❌ | ✅ (within partition) |
| Replay | ❌ | ✅ |
| Kafka compat | ❌ | ✅ |
| SLA | 99.9% | 99.9% |
| Use when | Trigger serverless functions, webhooks | Log aggregation, telemetry, analytics |

Exam Caveats ⚠️:

  • Service Bus = enterprise messaging (reliability, ordering, DLQ) — use when message delivery guarantee matters
  • Event Hubs = high-throughput streaming (think: Apache Kafka use cases, IoT telemetry)
  • Event Grid = reactive programming trigger (e.g., blob created → trigger a function)
  • Storage Queue = simplest, cheapest; use only when Service Bus features are not needed
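
The four caveats above amount to a two-step choice: first message versus event, then whether the richer features are needed. A minimal sketch of that rule of thumb follows; the parameter names are hypothetical and the function deliberately ignores cost and throughput sizing.

```python
def choose_messaging(kind: str, needs_ordering_or_dlq: bool = False,
                     needs_replay_or_streaming: bool = False) -> str:
    """Map the message/event split and one follow-up question to a service."""
    if kind == "message":
        # Service Bus only when its enterprise features are actually required.
        return "Azure Service Bus" if needs_ordering_or_dlq else "Azure Storage Queue"
    if kind == "event":
        # Event Hubs for retained, replayable streams; Event Grid for reactive triggers.
        return "Azure Event Hubs" if needs_replay_or_streaming else "Azure Event Grid"
    raise ValueError("kind must be 'message' or 'event'")


print(choose_messaging("message", needs_ordering_or_dlq=True))  # Azure Service Bus
print(choose_messaging("event"))                                # Azure Event Grid
```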

API Management (APIM) — SKU Comparison

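| Tier | Units | SLA | VNet Integration | Self-hosted Gateway | Approx Cost/mo |
|---|---|---|---|---|---|
| Developer | 1 | No SLA | ✅ (non-production) | ✅ | ~€45 |
| Basic | 1–2 | 99.95% | ❌ | ❌ | ~€145 |
| Standard | 1–4 | 99.95% | ❌ | ❌ | ~€725 |
| Premium | 1–31 per region | 99.95% (99.99% multi-zone or multi-region) | ✅ Internal & External | ✅ | ~€3,000+ |
| Consumption | Serverless | 99.95% | ❌ | ❌ | Pay-per-call |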

Exam Caveats ⚠️:

  • Premium tier is required in production for: Internal VNet mode, multi-region deployments, self-hosted gateway (the Developer tier also offers VNet injection and the self-hosted gateway, but carries no SLA)
  • Consumption tier has no built-in cache and no VNet support; it suits low-traffic/dev scenarios
  • 99.99% SLA requires Premium deployed across two or more Availability Zones or regions

Azure Cache for Redis — Tier Comparison

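| Tier | Max Memory | Clustering | Persistence | Geo-Replication | SLA |
|---|---|---|---|---|---|
| Basic | 53 GB | ❌ | ❌ | ❌ | No SLA |
| Standard | 53 GB | ❌ | ❌ | ❌ | 99.9% |
| Premium | 1.2 TB | ✅ | ✅ (RDB/AOF) | ✅ (passive) | 99.9% |
| Enterprise | 2 TB | ✅ | ✅ | ✅ (active-active) | 99.9% |
| Enterprise Flash | 13 TB | ✅ | ✅ | ✅ (active-active) | 99.9% |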

Exam Caveats ⚠️:

  • Basic has no SLA and no replication — dev/test only
  • Active geo-replication (multi-region write for Redis) requires Enterprise tier
  • Persistence (RDB/AOF for data durability) requires Premium or higher

🚀 4.3 Design Migration Solutions

Migration Strategy — The 6 Rs (+ Retire)

graph LR
    CURRENT["🏢 On-premises\nWorkload"]

    CURRENT -->|"Low effort\nLift & Shift"| REHOST["🔁 Rehost\n(IaaS)\nVMs in Azure"]
    CURRENT -->|"Minor changes\nfor PaaS"| REPLATFORM["🔼 Replatform\n(Lift & Optimize)\nApp Service, SQL MI"]
    CURRENT -->|"Redesign for cloud"| REFACTOR["🏗️ Refactor / Re-architect\nMicroservices, PaaS"]
    CURRENT -->|"Rewrite from scratch"| REBUILD["🔨 Rebuild\nCloud-native app"]
    CURRENT -->|"Replace with SaaS"| REPLACE["🔄 Replace\nSaaS alternative"]
    CURRENT -->|"Keep on-prem"| RETAIN["⏸️ Retain\nNot ready to migrate"]
    CURRENT -->|"Decommission"| RETIRE["🗑️ Retire\nNo longer needed"]

| Strategy | Cloud Benefit | Effort | Time to Cloud |
|---|---|---|---|
| Rehost (Lift & Shift) | Low | Low | Fastest |
| Replatform (Lift & Optimize) | Medium | Medium | Moderate |
| Refactor / Re-architect | High | High | Slower |
| Rebuild | Very High | Very High | Slowest |
| Replace | Variable | Low | Fast |
| Retain | None | None | N/A |
| Retire | Cost eliminated | Minimal | Immediate |

Exam Caveat ⚠️: Exam scenarios frequently have budget or time constraints. Rehost = fastest, lowest risk. Refactor = best long-term cloud benefit but most expensive and risky.

Azure Migrate — Core Tools

| Tool | Purpose |
|---|---|
| Azure Migrate: Discovery & Assessment | Discover on-prem VMs, assess Azure readiness, estimate costs |
| Azure Migrate: Server Migration | Migrate VMware, Hyper-V, physical servers, and cloud VMs to Azure |
| Azure Database Migration Service (DMS) | Migrate databases to Azure SQL, MySQL, PostgreSQL |
| Azure Data Box | Offline bulk data transfer (40 TB – petabytes) |
| Azure Data Box Disk | Offline transfer (up to 35 TB per disk set) |
| Storage Mover | Migrate file shares to Azure Files or ADLS Gen2 |
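
Whether an offline transfer is worth considering usually comes down to simple arithmetic on data volume and link speed. The sketch below estimates online transfer time; the 80% sustained link utilisation is an assumption for illustration, and decimal units (1 TB = 10^12 bytes) are used throughout.

```python
def transfer_days(terabytes: float, link_mbps: float, utilisation: float = 0.8) -> float:
    """Days needed to push `terabytes` over a link of `link_mbps` megabits per second."""
    bits = terabytes * 1e12 * 8                         # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * utilisation)    # effective bits per second
    return seconds / 86_400                             # seconds per day


print(round(transfer_days(200, 1_000), 1))  # 200 TB over 1 Gbps   -> 23.1 days
print(round(transfer_days(200, 100), 1))    # 200 TB over 100 Mbps -> 231.5 days
```

With numbers like these, a scenario that says "200 TB and the internet link is too slow" points towards an offline option such as Data Box rather than a network copy.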

Azure Migrate Workflow:

flowchart LR
    P1["1️⃣ Create\nAzure Migrate\nProject"] --> P2["2️⃣ Deploy\nAppliance\n(on-premises)"]
    P2 --> P3["3️⃣ Discover\nVMs, servers,\ndatabases, apps"]
    P3 --> P4["4️⃣ Assess\nReadiness + sizing\n+ cost estimation"]
    P4 --> P5["5️⃣ Replicate\n(continuous\nreplication)"]
    P5 --> P6["6️⃣ Test\nMigration\n(isolated test)"]
    P6 --> P7["7️⃣ Migrate\n(cutover)"]
    P7 --> P8["8️⃣ Optimise\n& Monitor"]

🌐 4.4 Design Network Solutions

Connectivity Options Comparison

graph LR
    ONPREM["🏢 On-premises"]

    subgraph "Over Public Internet (encrypted)"
        VPN["🔒 VPN Gateway\nSite-to-Site IPsec\nUp to 10 Gbps\nSLA: 99.9%–99.95%"]
        P2S["📱 Point-to-Site VPN\nIndividual client\nUp to 1 Gbps"]
    end

    subgraph "Private Circuit (not internet)"
        ER["🔷 ExpressRoute\nDedicated private circuit\n50 Mbps – 100 Gbps\nSLA: 99.95%"]
        ERD["⚡ ExpressRoute Direct\nDirect to Microsoft peering\n10/100 Gbps\nSLA: 99.95%"]
    end

    subgraph "Azure-to-Azure"
        PEER["🔗 VNet Peering\nMicrosoft backbone\nLow latency\nNot transitive"]
        VWAN["🌐 Azure Virtual WAN\nManaged hub-spoke\nAny-to-any routing"]
    end

    ONPREM --> VPN
    ONPREM --> P2S
    ONPREM --> ER
    ONPREM --> ERD

| Option | Private? | Bandwidth | Latency | SLA | Cost |
|---|---|---|---|---|---|
| Site-to-Site VPN | ✅ Encrypted (over public internet) | Up to 10 Gbps | Variable | 99.9%–99.95% | Low |
| ExpressRoute | ✅ True private | 50 Mbps–100 Gbps | Consistent, low | 99.95% | High |
| ExpressRoute + VPN | ✅ Both | ExpressRoute primary | Low + failover | 99.95%+ | Highest |
| VNet Peering | ✅ (Azure backbone) | Limited by VMs | Very low | Depends on VMs | Low |
| Azure Virtual WAN | ✅ | Aggregated | Low | 99.9% | Medium-High |

Exam Caveats ⚠️:

  • ExpressRoute = private circuit, NOT over public internet — for sensitive data, regulatory requirements
  • VPN Gateway = encrypted over the public internet (still passes through internet infrastructure)
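  • VNet Peering is NOT transitive: spoke-to-spoke traffic needs Azure Virtual WAN, or User Defined Routes that send it through a firewall/NVA in the hub
  • A single ExpressRoute circuit is still a single point of failure at its peering location; pair it with a VPN Gateway (or a second circuit) for failover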
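
The benefit of pairing ExpressRoute with a VPN failover path can be estimated with the standard availability formulas for redundant and chained components. This is a rough sketch that assumes the two paths fail independently, which real networks only approximate; the SLA figures are the ones quoted in the comparison above.

```python
def parallel_availability(*slas: float) -> float:
    """Redundant paths: the connection is down only if every path is down."""
    all_down = 1.0
    for a in slas:
        all_down *= (1.0 - a)
    return 1.0 - all_down


def serial_availability(*slas: float) -> float:
    """Chained components: the connection is up only if every component is up."""
    all_up = 1.0
    for a in slas:
        all_up *= a
    return all_up


def downtime_minutes_per_year(availability: float) -> float:
    return (1.0 - availability) * 365 * 24 * 60


single = 0.9995                                   # ExpressRoute alone
paired = parallel_availability(0.9995, 0.9995)    # ExpressRoute + VPN failover
print(round(downtime_minutes_per_year(single), 1))  # 262.8
print(round(downtime_minutes_per_year(paired), 2))  # 0.13
```

The same `serial_availability` helper shows the opposite effect: every additional component in the request path (gateway, firewall, application tier) multiplies availability down, which is why composite SLAs are always lower than the weakest single service.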

Network Security Layering (Defence in Depth)

graph TD
    INTERNET["🌍 Internet\nInbound Traffic"]
    DDOS["🛡️ Azure DDoS Protection\n(Network Protection tier, ~€2,940/mo for first 100 IPs)"]
    AFD["🌐 Azure Front Door / App Gateway\n(WAF, SSL Termination)"]
    FW["🔥 Azure Firewall\n(L3-L7, FQDN filtering, threat intel)"]
    NSG_SUB["📋 NSG — Subnet Level\n(L3-L4 filter)"]
    NSG_NIC["📋 NSG — NIC Level\n(L3-L4 filter)"]
    HOST["💻 Host-based (Defender for Endpoint)"]
    APP["🔒 Application / Data Layer\n(encryption, Key Vault, auth)"]

    INTERNET --> DDOS --> AFD --> FW --> NSG_SUB --> NSG_NIC --> HOST --> APP

Azure Firewall vs NSG:

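| Feature | Azure Firewall | NSG |
|---|---|---|
| OSI Layer | L3–L7 | L3–L4 |
| FQDN filtering | ✅ (e.g., allow *.microsoft.com) | ❌ |
| Threat intelligence | ✅ | ❌ |
| TLS inspection | ✅ (Premium) | ❌ |
| Stateful | ✅ | ✅ |
| Scope | Centralised (hub) | Per subnet or NIC |
| SLA | 99.95% (99.99% across 2+ Availability Zones) | N/A (free service) |
| Approx cost | ~€1,100+/month | Free |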

Exam Caveats ⚠️:

  • NSGs are free but operate only at L3-L4 (IP/port) — cannot filter by domain name
  • Azure Firewall is required when you need FQDN-based rules or threat intelligence
  • In hub-and-spoke, place Azure Firewall in the hub and route all spoke traffic through it
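
NSG rules are evaluated in priority order, lowest number first, and evaluation stops at the first match. The sketch below models only that behaviour for inbound traffic, using source CIDR and destination port; the `Rule` type, the addresses, and the final implicit deny are simplifications for illustration and do not reproduce the full set of NSG default rules.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    priority: int          # lower number = evaluated first
    source: str            # source CIDR, e.g. "10.1.0.0/24"
    port: Optional[int]    # destination port, None means any port
    allow: bool


def is_allowed(rules: List[Rule], source_ip: str, port: int) -> bool:
    """Return the action of the first rule that matches, by ascending priority."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if ip in ipaddress.ip_network(rule.source) and rule.port in (None, port):
            return rule.allow
    return False  # simplification: nothing matched, so deny


rules = [
    Rule(priority=100, source="10.1.0.0/24", port=443, allow=True),   # specific allow
    Rule(priority=4096, source="0.0.0.0/0", port=None, allow=False),  # deny all
]
print(is_allowed(rules, "10.1.0.5", 443))  # True
print(is_allowed(rules, "10.2.0.5", 443))  # False
```

Note that every condition here is an IP range or a port. There is no way to express "allow *.microsoft.com" in this model, which is the gap Azure Firewall's FQDN rules fill.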

Private Endpoints vs Service Endpoints

| Feature | Service Endpoints | Private Endpoints |
|---|---|---|
| Resource gets private IP in VNet | ❌ (traffic routes via Azure backbone) | ✅ Yes, private IP in your VNet |
| Accessible from on-premises (VPN/ER) | ❌ | ✅ |
| DNS required | No | ✅ (private DNS zone) |
| Cost | Free | ~€7–10/month per endpoint |
| Network path | Azure backbone to the service's public endpoint | Private IP inside your VNet |
| SLA impact | None | None (follows service SLA) |

Exam Caveats ⚠️:

  • Private Endpoints are required when on-premises resources need to access Azure PaaS services privately (over VPN or ExpressRoute)
  • Service Endpoints do NOT give on-premises resources access to Azure PaaS services
  • Private Endpoints require a Private DNS Zone for name resolution to work correctly

Hub-and-Spoke Network Topology

graph TD
    HUB["🏛️ HUB VNet\n————————\n🔥 Azure Firewall\n🔒 VPN / ER Gateway\n🖥️ Azure Bastion\n📊 Shared Services\n🔐 AD DS / Entra ID"]

    SPOKE1["🔵 Spoke 1 — Production\nApp + DB subnets\nPeered to Hub"]
    SPOKE2["🟡 Spoke 2 — Development\nDev workloads\nPeered to Hub"]
    SPOKE3["🟢 Spoke 3 — DMZ\nInternet-facing\nPeered to Hub"]

    ONPREM["🏢 On-premises\n(via VPN or ExpressRoute)"]

    HUB <-->|"VNet Peering"| SPOKE1
    HUB <-->|"VNet Peering"| SPOKE2
    HUB <-->|"VNet Peering"| SPOKE3
    ONPREM <-->|"VPN / ExpressRoute"| HUB

    note["⚠️ Spokes do NOT peer\nto each other directly\nAll inter-spoke traffic\nroutes via Hub Firewall"]

Hub-and-Spoke design rules:

  • ✅ Spokes peer only to the hub, never directly to each other
  • 🔥 Use a User Defined Route (UDR) in each spoke to force traffic through the Hub Firewall
  • 🔒 Hub contains all shared security services (firewall, gateway, Bastion, monitoring)
  • 💰 Centralising shared services in the hub reduces cost compared to per-spoke duplication
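
A User Defined Route steers spoke traffic because route selection picks the most specific (longest) matching prefix. The sketch below shows that selection for a hypothetical Spoke 1 route table; the address ranges and next-hop labels are invented for the example, and tie-breaking between route sources is not modelled.

```python
import ipaddress


def next_hop(routes: dict, destination: str) -> str:
    """Return the next hop of the longest-prefix route that contains `destination`."""
    dest = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matching:
        raise LookupError(f"no route to {destination}")
    # The most specific prefix (largest prefix length) wins.
    return max(matching, key=lambda pair: pair[0].prefixlen)[1]


# Hypothetical route table attached to the subnets of Spoke 1 (10.1.0.0/16).
spoke1_routes = {
    "10.1.0.0/16": "local VNet",       # traffic inside the spoke stays local
    "0.0.0.0/0": "hub firewall",       # UDR: everything else goes to the hub firewall
}
print(next_hop(spoke1_routes, "10.1.3.4"))  # local VNet
print(next_hop(spoke1_routes, "10.2.0.4"))  # hub firewall (Spoke 2 is reached via the hub)
```

Because the spokes are not peered to each other, the default route to the hub firewall is what makes inter-spoke traffic possible at all, and it also guarantees that the traffic is inspected on the way.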

🎯 Domain 4 — Exam Scenario Quick-Reference

| Scenario | Answer |
|---|---|
| Modernise .NET app with minimal code changes, PaaS | Replatform → Azure App Service |
| Run containerised microservices at enterprise scale | AKS (Standard tier + Availability Zones, 99.95% SLA) |
| Serverless API, no cold start tolerance, needs VNet | Azure Functions Premium (EP) Plan |
| Decouple order processing, guarantee message ordering | Azure Service Bus (queues with sessions) |
| IoT telemetry ingestion at millions of events/sec | Azure Event Hubs |
| Route events from blob upload to trigger a function | Azure Event Grid |
| Expose internal APIs to external partners securely | API Management Premium (Internal VNet mode) |
| Lift-and-shift 500 VMware VMs to Azure | Azure Migrate (Server Migration) |
| Migrate 20 TB on-prem SQL Server, minimal downtime | Azure DMS (online migration) to SQL MI |
| Transfer 200 TB of files to Azure, internet too slow | Azure Data Box (offline transfer) |
| Private connection to Azure, not over public internet | ExpressRoute |
| Centrally control all Azure network security (FQDN rules) | Azure Firewall in Hub VNet |
| Allow RDP/SSH to VMs without public IPs | Azure Bastion |
| Block all traffic except from specific subnets | NSG with Deny All + specific Allow rules |
| DDoS protection with analytics and response team | Azure DDoS Network Protection |
| Connect multiple Azure regions with any-to-any routing | Azure Virtual WAN (Standard) |
| Storage account access from on-premises via VPN | Private Endpoint for Storage (not Service Endpoint) |