04 — Design Infrastructure Solutions
Official Exam Weight: 30–35% (heaviest domain)
🗺️ Domain Overview
mindmap
  root((Infrastructure Solutions))
    Compute Solutions
      Virtual Machines & VMSS
      Azure App Service
      Azure Kubernetes Service
      Azure Container Instances
      Azure Container Apps
      Azure Functions
      Azure Batch
    Application Architecture
      Service Bus & Event Grid
      Event Hubs
      API Management
      Azure Managed Redis
      Azure CDN & Front Door
      Logic Apps
    Migration Solutions
      Azure Migrate
      6 Rs Strategy
      Azure DMS
      Azure Data Box
    Network Solutions
      VNet Design
      VPN Gateway & ExpressRoute
      Azure Firewall & NSGs
      Azure Bastion
      Private Endpoints
      Hub-and-Spoke
      Azure Virtual WAN
🖥️ 4.1 Design Compute Solutions
Compute Service Decision Tree
flowchart TD
START([What is the workload?]) --> VM{Need full OS control\nor lift-and-shift IaaS?}
VM -->|Yes| VMSERVICE["🖥️ Azure Virtual Machines\n+ VMSS for autoscaling"]
VM -->|No| WEB{Web app or REST API\nno containers?}
WEB -->|Yes| APP["🌐 Azure App Service\nPaaS, managed runtime"]
WEB -->|No| CONT{Containerised?}
CONT -->|"Short-lived,\nno orchestration"| ACI["📦 Azure Container Instances\nServerless containers"]
CONT -->|"Production\nKubernetes"| AKS["☸️ Azure Kubernetes Service\nManaged K8s"]
CONT -->|"Serverless\nmicroservices"| CAPP["🔲 Azure Container Apps\n(built on K8s + KEDA)"]
CONT -->|No| EVENT{Event-triggered,\nshort execution?}
EVENT -->|Yes| FUNC["⚡ Azure Functions\nServerless compute"]
EVENT -->|No| BATCH["⚙️ Azure Batch\nHPC / parallel jobs"]
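The decision tree above can be read as an ordered chain of checks where the first match wins. A minimal Python sketch of that reading follows; the `Workload` fields and the `choose_compute` function are illustrative names invented here, not part of any Azure SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    """Hypothetical workload description; all field names are illustrative."""
    needs_os_control: bool = False          # full OS control or lift-and-shift IaaS
    is_web_or_api: bool = False             # web app or REST API without containers
    container_style: Optional[str] = None   # "short-lived", "kubernetes", "serverless", or None
    event_triggered: bool = False           # event-triggered with short execution


def choose_compute(w: Workload) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if w.needs_os_control:
        return "Azure Virtual Machines (+ VMSS)"
    if w.is_web_or_api:
        return "Azure App Service"
    if w.container_style == "short-lived":
        return "Azure Container Instances"
    if w.container_style == "kubernetes":
        return "Azure Kubernetes Service"
    if w.container_style == "serverless":
        return "Azure Container Apps"
    if w.event_triggered:
        return "Azure Functions"
    return "Azure Batch"  # remaining branch: HPC / parallel jobs
```

The order of the checks matters: a containerised, event-triggered workload is routed by its container style before the Functions branch is ever considered, which mirrors the flowchart.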
Azure VM Scale Sets (VMSS)
Key features:
- 🔢 Scale from 0 to 1,000 VMs (custom images: 600)
- ⚖️ Autoscale based on metrics (CPU, memory, queue depth) or schedule
- 🔄 Rolling upgrades — update VMs in batches to avoid downtime
- 🏗️ Two orchestration modes:
| Mode | Description | Best For |
|---|---|---|
| Uniform | All instances identical, same SKU | Stateless workloads, AKS node pools |
| Flexible | Mix of VM configs, better fault domain spread | Workloads requiring VM-level flexibility |
Exam Caveats ⚠️:
- For new VMSS deployments, Flexible orchestration is the Microsoft-recommended mode
- Flexible orchestration supports only Standard Load Balancer (not Basic); autoscale itself does not depend on the load balancer SKU
- Define autoscale rules with cool-down periods to prevent flapping (scale in/out too rapidly)
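To see why cool-down periods prevent flapping, here is a small simulation sketch. It is not the Azure autoscale engine: the thresholds, the one-instance step size, and the tick-based cool-down are all assumptions chosen for illustration.

```python
from typing import List


def simulate_autoscale(
    cpu_samples: List[float],
    start_instances: int = 2,
    min_instances: int = 1,
    max_instances: int = 10,
    scale_out_above: float = 70.0,   # assumed scale-out threshold (% CPU)
    scale_in_below: float = 30.0,    # assumed scale-in threshold (% CPU)
    cooldown_ticks: int = 3,         # minimum ticks between two scale actions
) -> List[int]:
    """Return the instance count after each CPU sample (one sample per tick)."""
    instances = start_instances
    ticks_since_change = cooldown_ticks  # allow an action on the first tick
    history = []
    for cpu in cpu_samples:
        if ticks_since_change >= cooldown_ticks:
            if cpu > scale_out_above and instances < max_instances:
                instances += 1
                ticks_since_change = 0
            elif cpu < scale_in_below and instances > min_instances:
                instances -= 1
                ticks_since_change = 0
        ticks_since_change += 1
        history.append(instances)
    return history


noisy_cpu = [80, 20, 80, 20, 80, 20]
print(simulate_autoscale(noisy_cpu, cooldown_ticks=0))  # [3, 2, 3, 2, 3, 2]
print(simulate_autoscale(noisy_cpu, cooldown_ticks=3))  # [3, 3, 3, 2, 2, 2]
```

With no cool-down the instance count changes on every sample; with a three-tick cool-down the same noisy signal causes only two scale actions.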
Azure Functions — Hosting Plans Comparison
| Plan | Scale | Cold Start | Timeout (Default / Max) | VNet Integration | SLA |
|---|---|---|---|---|---|
| Consumption | Auto (unlimited) | ✅ Yes | 5m / 10m | ❌ No | 99.95% |
| Flex Consumption | Auto + pre-provisioned | ✅ Reduced | 30m / Unbounded | ✅ Yes | 99.95% |
| Premium (EP) | Auto | ❌ Pre-warmed | 30m / Unbounded | ✅ Yes | 99.95% |
| Dedicated (App Svc) | Manual / auto | ❌ Always on | 30m / Unbounded | ✅ Yes | 99.95% |
| Container Apps | Auto (KEDA) | ✅ Possible | 30m / Unbounded | ✅ Yes | 99.95% |
Exam Caveats ⚠️:
- Consumption plan cannot use VNet Integration; use Flex Consumption, Premium, or Dedicated if needed
- Premium plan eliminates cold starts via pre-warmed instances
- Durable Functions enables stateful workflows across multiple function executions
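The hosting-plan comparison can be treated as a filter over requirements. The sketch below transcribes the VNet, cold-start, and timeout columns of the table above into a dictionary; the `PLANS` structure and `eligible_plans` function are hypothetical helpers, and Flex Consumption is treated as still subject to cold starts because the table lists them as reduced rather than removed.

```python
from typing import Dict, List

# Traits transcribed from the comparison table above.
# max_timeout_min=None stands for "unbounded".
PLANS: Dict[str, dict] = {
    "Consumption": {"vnet": False, "cold_start": True, "max_timeout_min": 10},
    "Flex Consumption": {"vnet": True, "cold_start": True, "max_timeout_min": None},
    "Premium (EP)": {"vnet": True, "cold_start": False, "max_timeout_min": None},
    "Dedicated (App Service)": {"vnet": True, "cold_start": False, "max_timeout_min": None},
    "Container Apps": {"vnet": True, "cold_start": True, "max_timeout_min": None},
}


def eligible_plans(need_vnet: bool, tolerate_cold_start: bool, runtime_min: float) -> List[str]:
    """Return the plans that satisfy all three requirements, in table order."""
    result = []
    for name, traits in PLANS.items():
        if need_vnet and not traits["vnet"]:
            continue
        if not tolerate_cold_start and traits["cold_start"]:
            continue
        limit = traits["max_timeout_min"]
        if limit is not None and runtime_min > limit:
            continue
        result.append(name)
    return result


# Serverless API, no cold-start tolerance, needs VNet, 20-minute jobs:
print(eligible_plans(need_vnet=True, tolerate_cold_start=False, runtime_min=20))
# ['Premium (EP)', 'Dedicated (App Service)']
```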
Azure Kubernetes Service (AKS)
Key design decisions:
| Decision | Options |
|---|---|
| Network plugin | Kubenet (basic, pods do not get VNet IPs) vs Azure CNI (each pod gets a VNet IP; required for Windows node pools, virtual nodes, and Azure Network Policy) |
| Node pools | System (K8s system services) + User (your workloads) — separate for isolation |
| HA | Availability Zones for node distribution (raises the Standard tier SLA from 99.9% to 99.95%) |
| Autoscaling | Cluster Autoscaler (nodes) + Horizontal Pod Autoscaler (pods) |
| Cluster tier | Free (no SLA) / Standard (99.9%, or 99.95% with Availability Zones) / Premium (Standard SLA + long-term support) |
AKS SLAs:
| Tier | SLA | Notes |
|---|---|---|
| Free | No SLA | Dev/test only |
| Standard | 99.9% | Production recommended |
| Standard + Availability Zones | 99.95% | Mission-critical production |
Exam Caveats ⚠️:
- Azure CNI is required for Azure Network Policy, Windows node pools, and virtual nodes (Calico network policy also works with kubenet)
- Kubenet is simpler but has limitations — use for basic workloads only
- AKS System node pools cannot be scaled to 0 — they must always have at least 1 node
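SLA percentages such as 99.9%, 99.95%, and 99.99% are easier to compare as downtime budgets. The sketch below assumes a 30-day month and, for the composite figure, that the services sit in series (all must be up for the request to succeed). The function names are illustrative.

```python
from typing import Iterable

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def monthly_downtime_minutes(sla_percent: float) -> float:
    """Maximum downtime per 30-day month that still meets the SLA."""
    return MINUTES_PER_30_DAY_MONTH * (100 - sla_percent) / 100


def composite_sla(sla_percents: Iterable[float]) -> float:
    """Availability of a chain of services that must all be up (serial dependency)."""
    availability = 1.0
    for percent in sla_percents:
        availability *= percent / 100
    return availability * 100


for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% allows about {monthly_downtime_minutes(sla):.2f} min/month")

# Two dependent services at 99.95% and 99.99% together fall below either one alone.
print(f"{composite_sla([99.95, 99.99]):.4f}%")
```

The composite result is why adding a dependency to a request path lowers the end-to-end availability unless a redundant path is also added.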
Azure App Service — Deployment Slots
flowchart LR
DEV["👨‍💻 Developer\npushes code"] --> STAGING["🔵 Staging Slot\n(pre-production)"]
STAGING --> TEST["🧪 Testing &\nWarm-up"]
TEST --> SWAP["🔄 Slot Swap\n(zero-downtime deploy)"]
SWAP --> PROD["🟢 Production Slot\n(live traffic)"]
SWAP -.->|"Rollback:\nswap back"| STAGING
Slot behaviour:
- ✅ Slots share the same App Service Plan resources
- 🔁 Swap exchanges slot content and settings with zero downtime
- ⚙️ Some settings are “slot-sticky” (stay with the slot) vs non-sticky (swap with deployment)
- 📊 Traffic routing — send % of traffic to a slot for A/B testing
- Slots per tier: Standard = 5, Premium = 20, Isolated = 20
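The difference between slot-sticky and swapped settings is easiest to see on a toy model. The sketch below represents each slot as a plain dictionary; `swap_slots` and the setting names are hypothetical and only mimic the behaviour described above, not the App Service implementation.

```python
from typing import Dict, Set, Tuple


def swap_slots(production: Dict, staging: Dict, sticky_keys: Set[str]) -> Tuple[Dict, Dict]:
    """Swap content and non-sticky settings; sticky settings stay with their slot."""

    def after_swap(own: Dict, other: Dict) -> Dict:
        # Non-sticky settings arrive with the incoming deployment...
        settings = {k: v for k, v in other["settings"].items() if k not in sticky_keys}
        # ...while sticky settings remain attached to this slot.
        settings.update({k: v for k, v in own["settings"].items() if k in sticky_keys})
        return {"content": other["content"], "settings": settings}

    return after_swap(production, staging), after_swap(staging, production)


prod = {"content": "v1", "settings": {"DB_CONN": "prod-db", "FEATURE_X": "off"}}
stage = {"content": "v2", "settings": {"DB_CONN": "staging-db", "FEATURE_X": "on"}}

new_prod, new_stage = swap_slots(prod, stage, sticky_keys={"DB_CONN"})
print(new_prod)   # v2 content, FEATURE_X "on", but still DB_CONN "prod-db"

# Rollback is simply a second swap.
rolled_prod, rolled_stage = swap_slots(new_prod, new_stage, sticky_keys={"DB_CONN"})
```

Marking the connection string as sticky is what keeps the new build in production pointed at the production database after the swap.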
🏛️ 4.2 Design Application Architecture
Messaging Service Comparison
graph LR
subgraph "MESSAGES (pull/push, reliable delivery)"
SB["📬 Azure Service Bus\n————————\nQueues: point-to-point\nTopics: fan-out pub/sub\nMax message: 100 MB\nOrdered, FIFO, DLQ\nSLA: 99.9%"]
SQ["📋 Azure Storage Queue\n————————\nSimple FIFO queue\nMax message: 64 KB\nHigh volume, low cost\nSLA: 99.9%"]
end
subgraph "EVENTS (fire-and-forget, reactive)"
EG["⚡ Azure Event Grid\n————————\nEvent routing (push)\nFire-and-forget\nNo retention\nSLA: 99.99%"]
EH["📡 Azure Event Hubs\n————————\nEvent streaming (pull)\nKafka-compatible\nRetention: 1–90 days\nMB/s to GB/s throughput\nSLA: 99.95%–99.99%"]
end
Detailed comparison — Event Grid vs Event Hubs:
| Feature | Azure Event Grid | Azure Event Hubs |
|---|---|---|
| Pattern | Event routing (push) | Event streaming (pull/checkpoint) |
| Retention | No long-term retention (delivery retried for up to 24 hours, optional dead-lettering) | 1–90 days |
| Consumer model | Push to subscribers | Pull via consumer groups |
| Throughput | Millions of events/sec | Millions of events/sec |
| Ordering | ❌ | ✅ (within partition) |
| Replay | ❌ | ✅ |
| Kafka compat | ❌ | ✅ |
| SLA | 99.99% | 99.95% (Basic/Standard), 99.99% (Premium/Dedicated) |
| Use when | Trigger serverless functions, webhooks | Log aggregation, telemetry, analytics |
Detailed comparison — Service Bus vs Storage Queue:
| Feature | Azure Service Bus | Azure Storage Queue |
|---|---|---|
| Max message size | 256 KB–100 MB | 64 KB |
| Max queue size | 80 GB | Unlimited (storage limit) |
| Guaranteed ordering (FIFO) | ✅ (with sessions) | ❌ Best-effort |
| Dead-letter queue (DLQ) | ✅ | ❌ |
| Duplicate detection | ✅ | ❌ |
| Transactions | ✅ | ❌ |
| Message lock / peek-lock | ✅ | ✅ |
| Pub/sub topics | ✅ | ❌ |
| SLA | 99.9% | 99.9% |
| Cost | Higher | Lower |
| Use when | Enterprise messaging, ordering, reliability | Simple, high-volume, cheap queuing |
Exam Caveats ⚠️:
- Event Grid = reactive programming trigger (e.g., blob created → trigger a function)
- Event Hubs = high-throughput streaming (think: Apache Kafka use cases, IoT telemetry)
- Storage Queue = simplest, cheapest; use only when Service Bus features are not needed
- Service Bus = enterprise messaging (reliability, ordering, DLQ, AMQP) — use when message delivery guarantee matters
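The four caveats above reduce to two questions: is the payload an event or a message, and which extra guarantees are needed? A minimal sketch of that rule of thumb follows; the function name and parameters are illustrative, and the 64 KB threshold is the Storage Queue message limit from the comparison table.

```python
def choose_messaging(
    is_event: bool,
    needs_replay_or_streaming: bool = False,
    needs_ordering_dlq_or_topics: bool = False,
    message_size_kb: float = 1.0,
) -> str:
    """Pick a messaging service using the rules of thumb from this section."""
    if is_event:
        # Events: streaming/replay points to Event Hubs, reactive routing to Event Grid.
        return "Azure Event Hubs" if needs_replay_or_streaming else "Azure Event Grid"
    # Messages: Service Bus when enterprise features or larger messages are required.
    if needs_ordering_dlq_or_topics or message_size_kb > 64:
        return "Azure Service Bus"
    return "Azure Storage Queue"


print(choose_messaging(is_event=True))                                     # blob created -> function
print(choose_messaging(is_event=True, needs_replay_or_streaming=True))     # IoT telemetry
print(choose_messaging(is_event=False, needs_ordering_dlq_or_topics=True)) # ordered order processing
print(choose_messaging(is_event=False))                                    # simple, cheap queue
```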
API Management (APIM) — SKU Comparison
| Tier | Units | SLA | VNet Integration | Self-hosted Gateway | Approx Cost/mo |
|---|---|---|---|---|---|
| Developer | 1 | No SLA | Internal & External (non-production) | ✅ (non-production) | ~€45 |
| Basic | 1–2 | 99.95% | ❌ | ❌ | ~€145 |
| Standard | 1–4 | 99.95% | ❌ | ❌ | ~€725 |
| Premium | 1–31 per region | 99.95% (99.99% across 2+ zones or regions) | Internal & External | ✅ | ~€3,000+ |
| Consumption | Serverless | 99.95% | ❌ | ❌ | Pay-per-call |
Exam Caveats ⚠️:
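- Premium tier is required in production for Internal VNet mode, multi-region deployments, and the self-hosted gateway (Developer also offers VNet injection and the self-hosted gateway, but with no SLA)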
- Consumption tier has no built-in cache and no VNet support — for low-traffic/dev scenarios
- 99.99% SLA requires Premium deployed across two or more Availability Zones or regions
Azure Managed Redis — Tier Comparison
⚠️ Azure Cache for Redis is retiring. Azure Managed Redis (AMR) is now GA and the recommended replacement. Basic/Standard/Premium retire September 30, 2028; Enterprise/Enterprise Flash retire March 31, 2027. Microsoft recommends migrating now rather than waiting.
Azure Managed Redis is built on Redis Enterprise software (not the community OSS fork). All tiers support clustering, persistence, and active geo-replication by default — enterprise features that were previously gated behind higher tiers in Azure Cache for Redis.
Performance tier selection — the new model:
| Tier | vCPU:Memory Ratio | Storage | Best For | Status |
|---|---|---|---|---|
| Memory Optimized | 1:8 (most memory per vCPU) | RAM only | Memory-intensive workloads; large datasets, lower throughput needs | GA (≤235 GB); Preview (>235 GB) |
| Balanced | 1:4 | RAM only | Standard workloads; healthy mix of memory and compute | GA (≤235 GB); Preview (>235 GB) |
| Compute Optimized | 1:2 (most vCPU per GB) | RAM only | Throughput-intensive, high-performance, mission-critical | GA (≤235 GB); Preview (>235 GB) |
| Flash Optimized | — | 20% RAM + 80% NVMe | Large datasets at lower cost; read-heavy, infrequently accessed data | Preview |
SLA and availability:
| Configuration | SLA |
|---|---|
| High availability disabled (no replication) | No SLA — dev/test only |
| HA enabled (primary + replica across 2 nodes) | 99.99% |
| HA + deployment across 3+ regions with 3+ AZs | 99.999% |
Key architectural details:
- Zone redundant by default when HA is enabled — primary and replica shards placed across AZs automatically
- ~20% memory reserved per instance as a buffer for replication, failover, and active geo-replication operations
- Active geo-replication available on all tiers (was Enterprise-only in Azure Cache for Redis)
- Clustering on by default — clients must support Redis Cluster API; nonclustered option available up to 25 GB
- Flash Optimized: keys always in RAM, values spill to NVMe; well-suited for read-heavy workloads with large values and a “hot” key subset; not suitable for write-heavy or uniformly random access patterns
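For sizing, the reserved-memory and Flash figures above turn into simple arithmetic. The sketch below applies the roughly 20% reserve and the 20% RAM / 80% NVMe split as flat percentages; how Azure actually accounts for the reserve on each SKU is not stated here, so treat the helper names and the flat-percentage model as assumptions.

```python
from typing import Tuple


def usable_memory_gb(sku_memory_gb: float, reserved_percent: float = 20) -> float:
    """Memory left for data after the approximate reserve is set aside."""
    return sku_memory_gb * (100 - reserved_percent) / 100


def flash_split_gb(total_gb: float) -> Tuple[float, float]:
    """Split a Flash Optimized capacity into (RAM, NVMe) using the 20/80 ratio."""
    ram = total_gb * 20 / 100
    return ram, total_gb - ram


print(usable_memory_gb(100))   # 80.0 GB available for data on a 100 GB instance
print(flash_split_gb(1000))    # (200.0, 800.0): keys and hot values in RAM, the rest on NVMe
```

When a dataset is close to the nominal SKU size, the reserve is the reason to pick the next size up.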
Legacy Azure Cache for Redis — tier mapping for migration:
| Legacy Tier | Retiring | Recommended Migration Target |
|---|---|---|
| Basic | Sep 30, 2028 | AMR Balanced (smallest SKU, disable HA for dev/test) |
| Standard | Sep 30, 2028 | AMR Balanced |
| Premium | Sep 30, 2028 | AMR Memory Optimized or Balanced |
| Enterprise | Mar 31, 2027 | AMR Compute Optimized or Balanced |
| Enterprise Flash | Mar 31, 2027 | AMR Flash Optimized |
Exam Caveats ⚠️:
- Active geo-replication is no longer Enterprise-only — it is available on all Azure Managed Redis tiers. The old rule (Enterprise required) applied to Azure Cache for Redis, which is now retiring.
- SLA requires HA to be enabled. Disabling HA halves cost but gives no SLA and risks data loss — dev/test only.
- Flash Optimized is still in Public Preview — avoid it as the exam answer when “production” or “GA” is a constraint.
- Tiers select by workload profile, not by feature gates. In AMR, all tiers get clustering, persistence, and geo-replication; tier choice is about the memory-to-compute ratio you need.
- The exam may still reference Azure Cache for Redis tiers until materials are updated — know both models and watch for context clues (retirement notices, “new deployment”, “Azure Managed Redis”) to determine which tier model a question uses.
🚀 4.3 Design Migration Solutions
Migration Strategy — The 6 Rs (+ Retire)
graph LR
CURRENT["🏢 On-premises\nWorkload"]
CURRENT -->|"Low effort\nLift & Shift"| REHOST["🔁 Rehost\n(IaaS)\nVMs in Azure"]
CURRENT -->|"Minor changes\nfor PaaS"| REPLATFORM["🔼 Replatform\n(Lift & Optimize)\nApp Service, SQL MI"]
CURRENT -->|"Redesign for cloud"| REFACTOR["🏗️ Refactor / Re-architect\nMicroservices, PaaS"]
CURRENT -->|"Rewrite from scratch"| REBUILD["🔨 Rebuild\nCloud-native app"]
CURRENT -->|"Replace with SaaS"| REPLACE["🔄 Replace\nSaaS alternative"]
CURRENT -->|"Keep on-prem"| RETAIN["⏸️ Retain\nNot ready to migrate"]
CURRENT -->|"Decommission"| RETIRE["🗑️ Retire\nNo longer needed"]
| Strategy | Cloud Benefit | Effort | Time to Cloud |
|---|---|---|---|
| Rehost (Lift & Shift) | Low | Low | Fastest |
| Replatform (Lift & Optimize) | Medium | Medium | Moderate |
| Refactor / Rearchitect | High | High | Slower |
| Rebuild | Very High | Very High | Slowest |
| Replace | Variable | Low | Fast |
| Retain | None | None | N/A |
| Retire | Cost eliminated | Minimal | Immediate |
Exam Caveat ⚠️: Exam scenarios frequently impose budget or time constraints. Rehost = fastest, lowest risk. Refactor and Rebuild = greatest long-term cloud benefit, but highest cost and risk.
Azure Migrate — Core Tools
| Tool | Purpose |
|---|---|
| Azure Migrate: Discovery & Assessment | Discover on-prem VMs, assess Azure readiness, estimate costs |
| Azure Migrate: Server Migration | Migrate VMware, Hyper-V, physical servers, and cloud VMs to Azure |
| Azure Database Migration Service (DMS) | Migrate databases to Azure SQL, MySQL, PostgreSQL |
| Azure Data Box | Offline bulk data transfer (40 TB – petabytes) |
| Azure Data Box Disk | Offline transfer (up to 35 TB per disk set) |
| Storage Mover | Migrate file shares to Azure Files or ADLS Gen2 |
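Whether to choose an offline option such as Data Box usually comes down to how long an online transfer would take. A back-of-the-envelope sketch, assuming decimal terabytes and a sustained link utilisation of 80% (both assumptions, as is the function name):

```python
def transfer_days(data_tb: float, link_mbps: float, utilisation: float = 0.8) -> float:
    """Estimated days to move data_tb terabytes over a link of link_mbps megabits/s."""
    bits_to_send = data_tb * 1e12 * 8              # decimal TB -> bits
    effective_bps = link_mbps * 1e6 * utilisation  # usable bits per second
    seconds = bits_to_send / effective_bps
    return seconds / 86_400                        # seconds per day


# 200 TB over a 1 Gbps link at 80% utilisation:
print(f"{transfer_days(200, 1000):.1f} days")   # about 23.1 days
# 20 TB over a 100 Mbps link at 80% utilisation:
print(f"{transfer_days(20, 100):.1f} days")     # about 23.1 days
```

If the estimate runs to weeks and the migration window is shorter, an offline transfer is the more realistic answer.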
Azure Migrate Workflow:
flowchart LR
P1["1️⃣ Create\nAzure Migrate\nProject"] --> P2["2️⃣ Deploy\nAppliance\n(on-premises)"]
P2 --> P3["3️⃣ Discover\nVMs, servers,\ndatabases, apps"]
P3 --> P4["4️⃣ Assess\nReadiness + sizing\n+ cost estimation"]
P4 --> P5["5️⃣ Replicate\n(continuous\nreplication)"]
P5 --> P6["6️⃣ Test\nMigration\n(isolated test)"]
P6 --> P7["7️⃣ Migrate\n(cutover)"]
P7 --> P8["8️⃣ Optimise\n& Monitor"]
🌐 4.4 Design Network Solutions
Connectivity Options Comparison
graph LR
ONPREM["🏢 On-premises"]
subgraph "Over Public Internet (encrypted)"
VPN["🔒 VPN Gateway\nSite-to-Site IPsec\nUp to 10 Gbps\nSLA: 99.9%–99.95%"]
P2S["📱 Point-to-Site VPN\nIndividual client\nUp to 1 Gbps"]
end
subgraph "Private Circuit (not internet)"
ER["🔷 ExpressRoute\nDedicated private circuit\n50 Mbps – 100 Gbps\nSLA: 99.95%"]
ERD["⚡ ExpressRoute Direct\nDirect to Microsoft peering\n10/100 Gbps\nSLA: 99.95%"]
end
subgraph "Azure-to-Azure"
PEER["🔗 VNet Peering\nMicrosoft backbone\nLow latency\nNot transitive"]
VWAN["🌐 Azure Virtual WAN\nManaged hub-spoke\nAny-to-any routing"]
end
ONPREM --> VPN
ONPREM --> P2S
ONPREM --> ER
ONPREM --> ERD
| Option | Private? | Bandwidth | Latency | SLA | Cost |
|---|---|---|---|---|---|
| Site-to-Site VPN | ✅ Encrypted | Up to 10 Gbps | Variable | 99.9%–99.95% | Low |
| ExpressRoute | ✅ True private | 50 Mbps–100 Gbps | Consistent, low | 99.95% | High |
| ExpressRoute + VPN | ✅ Both | ExpressRoute primary | Low + failover | 99.95%+ | Highest |
| VNet Peering | ✅ (Azure backbone) | Limited by VMs | Very low | Depends on VMs | Low |
| Azure Virtual WAN | ✅ | Aggregated | Low | 99.9% | Medium-High |
Exam Caveats ⚠️:
- ExpressRoute = private circuit, NOT over public internet — for sensitive data, regulatory requirements
- VPN Gateway = encrypted over the public internet (still passes through internet infrastructure)
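- VNet Peering is NOT transitive: for spoke-to-spoke traffic, use Azure Virtual WAN, or UDRs that send traffic through a firewall/NVA in the hub
- An ExpressRoute circuit has redundant links to a single peering location but no alternate path if that location or the provider fails; pair it with a second circuit or a Site-to-Site VPN for failover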
Network Security Layering (Defence in Depth)
graph TD
INTERNET["🌍 Internet\nInbound Traffic"]
DDOS["🛡️ Azure DDoS Protection\n(Network Protection tier, ~€2,940/mo for first 100 IPs)"]
AFD["🌐 Azure Front Door / App Gateway\n(WAF, SSL Termination)"]
FW["🔥 Azure Firewall\n(L3-L7, FQDN filtering, threat intel)"]
NSG_SUB["📋 NSG — Subnet Level\n(L3-L4 filter)"]
NSG_NIC["📋 NSG — NIC Level\n(L3-L4 filter)"]
HOST["💻 Host-based (Defender for Endpoint)"]
APP["🔒 Application / Data Layer\n(encryption, Key Vault, auth)"]
INTERNET --> DDOS --> AFD --> FW --> NSG_SUB --> NSG_NIC --> HOST --> APP
Azure Firewall vs NSG:
| Feature | Azure Firewall | NSG |
|---|---|---|
| OSI Layer | L3–L7 | L3–L4 |
| FQDN filtering | ✅ (e.g., allow *.microsoft.com) | ❌ |
| Threat intelligence | ✅ | ❌ |
| TLS inspection | ✅ (Premium) | ❌ |
| Stateful | ✅ | ✅ |
| Scope | Centralised (hub) | Per subnet or NIC |
| SLA | 99.95% (99.99% across 2+ Availability Zones) | N/A (free service) |
| Approx cost | ~€1,100+/month | Free |
Exam Caveats ⚠️:
- NSGs are free but operate only at L3-L4 (IP/port) — cannot filter by domain name
- Azure Firewall is required when you need FQDN-based rules or threat intelligence
- In hub-and-spoke, place Azure Firewall in the hub and route all spoke traffic through it
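To make "L3-L4 only" concrete, the sketch below models NSG-style evaluation: rules are checked in priority order (lowest number first), the first match decides, and traffic that matches nothing is denied. It is deliberately simplified. Real NSG rules also match protocol, destination, direction, and service tags, and the `Rule` class and `is_allowed` function are illustrative names.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    priority: int          # lower number = evaluated first
    source_cidr: str       # e.g. "10.0.1.0/24"
    port: Optional[int]    # None matches any destination port
    allow: bool


def is_allowed(rules: List[Rule], source_ip: str, port: int) -> bool:
    """Return the decision of the first matching rule, or deny if none match."""
    ip = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_range = ip in ipaddress.ip_network(rule.source_cidr)
        port_ok = rule.port is None or rule.port == port
        if in_range and port_ok:
            return rule.allow
    return False  # simplified stand-in for the default deny rule


rules = [
    Rule(priority=100, source_cidr="10.0.1.0/24", port=443, allow=True),
    Rule(priority=4096, source_cidr="0.0.0.0/0", port=None, allow=False),
]
print(is_allowed(rules, "10.0.1.5", 443))  # True: matches the allow rule first
print(is_allowed(rules, "10.0.2.5", 443))  # False: falls through to deny-all
```

Nothing in the rule can express a hostname such as `*.microsoft.com`, which is the gap Azure Firewall's FQDN filtering fills.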
Private Endpoints vs Service Endpoints
| Feature | Service Endpoints | Private Endpoints |
|---|---|---|
| Resource gets private IP in VNet | ❌ (traffic routes via Azure backbone) | ✅ Yes — private IP in your VNet |
| Accessible from on-premises (VPN/ER) | ❌ | ✅ |
| DNS required | ❌ | ✅ (private DNS zone) |
| Cost | Free | ~€7–10/month per endpoint |
| Network path | Azure backbone (still exits VNet) | Stays entirely in VNet |
| SLA impact | None | None (follows service SLA) |
Exam Caveats ⚠️:
- Private Endpoints are required when on-premises resources need to access Azure PaaS services privately (over VPN or ExpressRoute)
- Service Endpoints do NOT give on-premises resources access to Azure PaaS services
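- Private Endpoints need DNS that resolves the service FQDN to the endpoint's private IP, typically a Private DNS Zone linked to the VNet; on-premises clients also need a DNS forwarder or Azure DNS Private Resolver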
Hub-and-Spoke Network Topology
graph TD
HUB["🏛️ HUB VNet\n————————\n🔥 Azure Firewall\n🔒 VPN / ER Gateway\n🖥️ Azure Bastion\n📊 Shared Services\n🔐 AD DS / Entra Domain Services"]
SPOKE1["🔵 Spoke 1 — Production\nApp + DB subnets\nPeered to Hub"]
SPOKE2["🟡 Spoke 2 — Development\nDev workloads\nPeered to Hub"]
SPOKE3["🟢 Spoke 3 — DMZ\nInternet-facing\nPeered to Hub"]
ONPREM["🏢 On-premises\n(via VPN or ExpressRoute)"]
HUB <-->|"VNet Peering"| SPOKE1
HUB <-->|"VNet Peering"| SPOKE2
HUB <-->|"VNet Peering"| SPOKE3
ONPREM <-->|"VPN / ExpressRoute"| HUB
note["⚠️ Spokes do NOT peer\nto each other directly\nAll inter-spoke traffic\nroutes via Hub Firewall"]
Hub-and-Spoke design rules:
- ✅ Spokes peer only to the hub, never directly to each other
- 🔥 Use a User Defined Route (UDR) in each spoke to force traffic through the Hub Firewall
- 🔒 Hub contains all shared security services (firewall, gateway, Bastion, monitoring)
- 💰 Centralising shared services in the hub reduces cost compared to per-spoke duplication
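The design rules above follow from peering being non-transitive. The toy model below treats peerings as undirected edges and shows that spoke-to-spoke traffic only has a path when a UDR sends it through the hub firewall. The VNet names and the `route` function are illustrative, not Azure APIs.

```python
from typing import List, Optional

# Peerings from the topology above: every spoke peers with the hub only.
PEERINGS = {("hub", "spoke1"), ("hub", "spoke2"), ("hub", "spoke3")}


def peered(a: str, b: str) -> bool:
    """Peering is bidirectional but applies only to the two VNets involved."""
    return (a, b) in PEERINGS or (b, a) in PEERINGS


def route(src: str, dst: str, udr_via_hub_firewall: bool) -> Optional[List[str]]:
    """Return the VNet path from src to dst, or None if there is no route."""
    if peered(src, dst):
        return [src, dst]
    # Without a UDR there is no transit: peering is not transitive.
    if udr_via_hub_firewall and peered(src, "hub") and peered("hub", dst):
        return [src, "hub", dst]
    return None


print(route("spoke1", "spoke2", udr_via_hub_firewall=False))  # None
print(route("spoke1", "spoke2", udr_via_hub_firewall=True))   # ['spoke1', 'hub', 'spoke2']
```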
🎯 Domain 4 — Exam Scenario Quick-Reference
| Scenario | Answer |
|---|---|
| Modernise .NET app — minimal code changes, PaaS | Replatform → Azure App Service |
| Run containerised microservices at enterprise scale | AKS (Standard tier + Availability Zones, 99.95% SLA) |
| Serverless API, no cold start tolerance, needs VNet | Azure Functions Premium (EP) Plan |
| Decouple order processing, guarantee message ordering | Azure Service Bus (queues with sessions) |
| IoT telemetry ingestion at millions of events/sec | Azure Event Hubs |
| Route events from blob upload to trigger a function | Azure Event Grid |
| Expose internal APIs to external partners securely | API Management Premium (Internal VNet mode) |
| Lift-and-shift 500 VMware VMs to Azure | Azure Migrate (Server Migration) |
| Migrate 20 TB on-prem SQL Server, minimal downtime | Azure DMS (online migration) to SQL MI |
| Transfer 200 TB of files to Azure, internet too slow | Azure Data Box (offline transfer) |
| Private connection to Azure — not over public internet | ExpressRoute |
| Centrally control all Azure network security (FQDN rules) | Azure Firewall in Hub VNet |
| Allow RDP/SSH to VMs without public IPs | Azure Bastion |
| Block all traffic except from specific subnets | NSG with Deny All + specific Allow rules |
| DDoS protection with analytics and response team | Azure DDoS Network Protection (includes DDoS Rapid Response) |
| Connect multiple Azure regions with any-to-any routing | Azure Virtual WAN (Standard) |
| Storage account access from on-premises via VPN | Private Endpoint for Storage (not Service Endpoint) |