03 — Design Business Continuity Solutions

Official Exam Weight: 15–20%


🗺️ Domain Overview

mindmap
  root((Business Continuity))
    Core Concepts
      RTO – Recovery Time Objective
      RPO – Recovery Point Objective
      SLA – Service Level Agreement
      Composite SLA calculation
    High Availability
      Availability Sets
      Availability Zones
      Load Balancers
      Zone-redundant services
    Backup
      Azure Backup
      Recovery Services Vault
      Backup Vault
      Soft Delete
    Disaster Recovery
      Azure Site Recovery
      Recovery Plans
      Geo-replication
      Auto-Failover Groups

⏱️ 3.1 Core BC/DR Concepts

Key Metrics — Definitions

graph LR
    subgraph BEFORE["⏪ Before Incident"]
        BACKUP["💾 Last backup\nor replica sync"]
    end

    subgraph INCIDENT["💥 Incident Occurs"]
        FAIL["❌ Service\nFails"]
    end

    subgraph WINDOW_RPO["📊 RPO Window"]
        RPO_NOTE["Data written between\nlast backup and failure\n= LOST DATA\n⟵ RPO measures this gap ⟶"]
    end

    subgraph RECOVERY["🔧 Recovery Phase"]
        DETECT["Detect\nincident"]
        RESTORE["Restore\nservice"]
        ONLINE["✅ Service\nback online"]
    end

    subgraph WINDOW_RTO["📊 RTO Window"]
        RTO_NOTE["Time from failure\nto full recovery\n= DOWNTIME\n⟵ RTO measures this gap ⟶"]
    end

    BACKUP -->|"time passes"| FAIL
    FAIL --> DETECT --> RESTORE --> ONLINE
    BACKUP -.->|"RPO gap"| RPO_NOTE
    FAIL -.->|"RTO gap"| RTO_NOTE

| Metric | Full Name | Question It Answers | Lower or Higher? |
|---|---|---|---|
| RPO | Recovery Point Objective | “How much data can we afford to lose?” | Lower = more expensive |
| RTO | Recovery Time Objective | “How long can we be offline?” | Lower = more expensive |
| SLA | Service Level Agreement | “What uptime does Azure guarantee?” | Higher = better |
| MTTR | Mean Time to Repair | “How long to fix on average?” | Lower = better |
| MTBF | Mean Time Between Failures | “How often does it break?” | Higher = better |

Exam Caveats ⚠️:

  • In exam scenarios, lower RPO and RTO = more redundancy = higher cost
  • The exam will often ask you to choose the cheapest solution that meets the RTO/RPO requirement
  • Never confuse RPO (data loss) with RTO (downtime)
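
The "cheapest solution that meets the requirement" rule can be sketched in Python. This is a minimal illustration, not an Azure API: the `DrOption` type, the option names, and every RPO, RTO, and cost figure below are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    rpo_minutes: float   # worst-case data loss this option can deliver
    rto_minutes: float   # worst-case downtime this option can deliver
    monthly_cost: float  # hypothetical relative cost, not real pricing


def cheapest_meeting(options: Iterable[DrOption],
                     required_rpo: float,
                     required_rto: float) -> Optional[DrOption]:
    """Return the lowest-cost option whose RPO and RTO both fit the requirement."""
    eligible = [o for o in options
                if o.rpo_minutes <= required_rpo and o.rto_minutes <= required_rto]
    return min(eligible, key=lambda o: o.monthly_cost) if eligible else None


# Hypothetical catalogue: lower RPO/RTO costs more, as the caveats state.
OPTIONS = [
    DrOption("daily-backup", rpo_minutes=24 * 60, rto_minutes=8 * 60, monthly_cost=50),
    DrOption("async-replication", rpo_minutes=5, rto_minutes=120, monthly_cost=300),
    DrOption("active-active", rpo_minutes=0, rto_minutes=1, monthly_cost=1200),
]

choice = cheapest_meeting(OPTIONS, required_rpo=60, required_rto=240)
print(choice.name if choice else "no option meets the requirement")
```

Note that both constraints are checked independently: an option with an excellent RPO is still rejected if its RTO is too long, which mirrors the RPO-versus-RTO distinction above.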

SLA Calculations

Single-service SLA examples:

| Service | Tier / Config | SLA Uptime | Downtime / Month |
|---|---|---|---|
| Azure VM | Single VM, Premium SSD | 99.9% | ~43 min/month |
| Azure VM | Availability Set (2+ VMs) | 99.95% | ~22 min/month |
| Azure VM | Availability Zones (2+ VMs) | 99.99% | ~4.4 min/month |
| Azure SQL DB | General Purpose | 99.99% | ~4.4 min/month |
| Azure SQL DB | Business Critical | 99.99% | ~4.4 min/month |
| Cosmos DB | Single region | 99.99% | ~4.4 min/month |
| Cosmos DB | Multi-region write | 99.999% | ~26 sec/month |
| App Service | Standard/Premium | 99.95% | ~22 min/month |
| Azure Storage | LRS / ZRS / GRS | 99.9% | ~43 min/month |
| Azure Storage | RA-GRS / RA-GZRS | 99.99% (read) | ~4.4 min/month |
| Azure Kubernetes Service | Standard tier | 99.9% (API server uptime SLA) | ~43 min/month |
| AKS + Availability Zones | Standard tier + AZs | 99.95% (API server uptime SLA) | ~22 min/month |
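
The downtime column follows directly from the SLA percentage. The sketch below assumes a 30-day month, so its results can differ slightly from the rounded figures in the table; `allowed_downtime_minutes` is a helper name introduced here for illustration.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Maximum downtime per period that still satisfies the SLA.

    Assumes a period of `days` full days (30 by default).
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% -> {minutes:.2f} min/month ({minutes * 60:.0f} s)")
```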

Composite SLA — Serial Dependencies

When services depend on each other in series, multiply the SLAs:

graph LR
    WEB["🌐 Web App\nSLA: 99.95%"] --> SQL["🗄️ SQL Database\nSLA: 99.99%"] --> STOR["💾 Storage\nSLA: 99.9%"]

    RESULT["📊 Composite SLA\n= 99.95% × 99.99% × 99.9%\n= 99.84%"]

⚠️ Composite SLA is always lower than the weakest component in a serial chain.
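
The serial multiplication can be checked with a few lines of Python. This is a minimal sketch; `composite_serial_sla` is a name introduced here, and the inputs are the three SLAs from the diagram above.

```python
from math import prod


def composite_serial_sla(slas_percent):
    """Composite SLA (in percent) for services that all must be up."""
    return prod(s / 100 for s in slas_percent) * 100


chain = [99.95, 99.99, 99.9]  # Web App, SQL Database, Storage
composite = composite_serial_sla(chain)
print(f"{composite:.2f}%")

# The composite is below the weakest single component in the chain.
assert composite < min(chain)
```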

Composite SLA — Parallel (Redundant) Services

When services are parallel (either one can serve the request), availability increases:

Parallel availability = 1 - (probability both fail)
= 1 - ((1 - 0.999) × (1 - 0.999))
= 1 - (0.001 × 0.001)
= 1 - 0.000001
= 99.9999%
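
The same calculation generalises to any number of redundant instances. The sketch below assumes the instances fail independently, which the formula above also implies; correlated failures (for example, a shared dependency) would make the real figure lower. `parallel_availability` is a name introduced here.

```python
from math import prod


def parallel_availability(slas_percent):
    """Availability (in percent) when any one instance can serve the request.

    Assumes instance failures are independent of each other.
    """
    p_all_fail = prod(1 - s / 100 for s in slas_percent)
    return (1 - p_all_fail) * 100


print(f"{parallel_availability([99.9, 99.9]):.4f}%")
```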

Exam Caveats ⚠️:

  • Adding redundancy in parallel improves availability
  • Adding dependencies in series always lowers the composite SLA
  • The exam frequently tests composite SLA math with 2–3 services in a dependency chain

🏗️ 3.2 High Availability Design

Availability Sets vs Availability Zones

graph TD
    subgraph "Availability Set"
        direction LR
        DC1["🏢 Single Datacenter"]
        DC1 --> FD1["🔧 Fault Domain 1\n(Rack A)"]
        DC1 --> FD2["🔧 Fault Domain 2\n(Rack B)"]
        DC1 --> FD3["🔧 Fault Domain 3\n(Rack C)"]
        FD1 --> VM1["VM 1"]
        FD2 --> VM2["VM 2"]
        FD3 --> VM3["VM 3"]
    end

    subgraph "Availability Zones"
        direction LR
        REGION["🌍 Azure Region"]
        REGION --> Z1["🏢 Zone 1\n(DC A)"]
        REGION --> Z2["🏢 Zone 2\n(DC B)"]
        REGION --> Z3["🏢 Zone 3\n(DC C)"]
        Z1 --> VMZ1["VM 1"]
        Z2 --> VMZ2["VM 2"]
        Z3 --> VMZ3["VM 3"]
    end

| Feature | Availability Set | Availability Zones |
|---|---|---|
| Protects against | Rack / hardware failure | Entire datacenter failure |
| Scope | Single datacenter | Multiple DCs in same region |
| Fault Domains | 2–3 (separate racks) | 1 per zone (separate DCs) |
| Update Domains | 5–20 (rolling updates) | Independent per zone |
| SLA | 99.95% | 99.99% |
| Extra cost | ❌ Free | ⚠️ Minor inter-zone data transfer cost |
| VM type required | Any | Zone-compatible SKU |

Exam Caveats ⚠️:

  • Availability Sets do NOT protect against datacenter failure — only Availability Zones do
  • You cannot add an existing VM to an Availability Set or move it into a zone in place; the VM must be redeployed
  • Not all regions support Availability Zones — always check regional availability
  • Azure Load Balancer Standard + Availability Zones = zone-redundant load balancing (99.99% SLA)

Zone-Redundant Services SLAs

| Service | Zone-Redundant Mode | SLA |
|---|---|---|
| Azure Load Balancer Standard | Zone-redundant frontend | 99.99% |
| Application Gateway v2 | Zone-redundant deployment | 99.95% |
| Azure SQL Database | Zone-redundant config (BC/GP) | 99.99% |
| Azure Storage ZRS | Automatic across zones | 99.9% |
| Azure Kubernetes Service | Zone-spread node pools | 99.99% |
| Azure Cache for Redis | Zone-redundant (Premium) | 99.9% |
| Azure Service Bus | Zone-redundant (Premium) | 99.9% |

💾 3.3 Design Azure Backup Solutions

Backup Vault Types

graph LR
    subgraph "Recovery Services Vault"
        RSV1["✅ Azure VMs"]
        RSV2["✅ SQL on Azure VM"]
        RSV3["✅ SAP HANA on Azure VM"]
        RSV4["✅ Azure Files"]
        RSV5["✅ On-prem (MARS agent)"]
        RSV6["✅ On-prem VMware/Hyper-V (MABS)"]
    end

    subgraph "Backup Vault (newer)"
        BV1["✅ Azure Managed Disks"]
        BV2["✅ Azure Blob Storage"]
        BV3["✅ Azure Database for PostgreSQL"]
        BV4["✅ Azure Kubernetes Service (preview)"]
    end

Exam Caveat ⚠️: Know which vault type supports which workload. Azure VMs = Recovery Services Vault. Managed Disks / Blobs = Backup Vault.
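
As a memorisation aid, the diagram above can be written as a lookup table. This is only a study sketch built from the workloads listed in the diagram; `VAULT_FOR_WORKLOAD` and `vault_for` are names introduced here, not part of any Azure SDK.

```python
RSV = "Recovery Services Vault"
BV = "Backup Vault"

# Workload -> vault type, copied from the diagram above.
VAULT_FOR_WORKLOAD = {
    "Azure VMs": RSV,
    "SQL on Azure VM": RSV,
    "SAP HANA on Azure VM": RSV,
    "Azure Files": RSV,
    "On-prem (MARS agent)": RSV,
    "On-prem VMware/Hyper-V (MABS)": RSV,
    "Azure Managed Disks": BV,
    "Azure Blob Storage": BV,
    "Azure Database for PostgreSQL": BV,
    "Azure Kubernetes Service": BV,
}


def vault_for(workload: str) -> str:
    """Return the vault type for a workload; raises KeyError if it is not listed."""
    return VAULT_FOR_WORKLOAD[workload]


print(vault_for("Azure VMs"))
print(vault_for("Azure Managed Disks"))
```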

Backup Redundancy Options

| Redundancy | Copies | SLA | Cross-Region Restore? | Use Case |
|---|---|---|---|---|
| LRS | 3 in same DC | | ❌ | Cost-optimised, no geo requirement |
| ZRS | 3 across zones | | ❌ | Zone failure protection |
| GRS | 6 (3+3 paired region) | | ✅ (must be enabled) | Cross-region DR |

Exam Caveat ⚠️: For Cross-Region Restore (CRR), the Recovery Services Vault must use GRS redundancy AND the CRR feature must be explicitly enabled. This approximately doubles storage cost.

Soft Delete Protection

| Feature | Default State | Retention After Delete | Cost |
|---|---|---|---|
| Soft Delete (Azure VMs) | ✅ Always-on (cannot disable) | 14 additional days | Free |
| Soft Delete (SQL / SAP HANA) | Must be enabled | 14 additional days | Free |
| Soft Delete (Azure Files) | Must be enabled | 14 additional days | Free |

Exam Caveats ⚠️:

  • Soft delete protects against accidental deletion and ransomware that deletes backups
  • VM soft delete is always-on — you cannot disable it
  • The 14 additional days of soft-delete retention are free; retention extended beyond 14 days (enhanced soft delete) is billed as backup storage

Enhanced Backup Policy (Hourly Backups)

  • Standard policy: Daily backups only
  • Enhanced policy: Multiple backups per day (the “Hourly” schedule, as frequent as every 4 hours)
  • Required for: SQL Server (log backups), SAP HANA, and Azure VM hourly snapshots
  • Lower RPO achieved with enhanced policy

🔄 3.4 Azure Site Recovery (ASR)

ASR Supported Scenarios

graph LR
    subgraph "Sources"
        VM1["☁️ Azure VMs\n(Region A)"]
        VM2["🖥️ VMware VMs\n(On-premises)"]
        VM3["🖥️ Hyper-V VMs\n(On-premises)"]
        VM4["🖥️ Physical Servers\n(Windows / Linux)"]
    end

    subgraph "Target"
        AZ["☁️ Azure\n(Secondary Region\nor different Availability Zone)"]
    end

    VM1 -->|"Azure-to-Azure"| AZ
    VM2 -->|"VMware-to-Azure"| AZ
    VM3 -->|"Hyper-V-to-Azure"| AZ
    VM4 -->|"Physical-to-Azure"| AZ

ASR Key Metrics

| Metric | Value | Notes |
|---|---|---|
| RPO (Azure-to-Azure) | < 30 seconds | Continuous replication |
| RTO | ~1–2 hours | Time to complete failover + VM boot |
| Crash-consistent snapshot | Every 5 minutes | Data as if power was cut |
| App-consistent snapshot | Every 1–12 hours (configurable) | VSS snapshot, application-aware |
| Replication frequency | Continuous | Not scheduled |
| SLA | 99.9% | ASR service SLA |

Exam Caveats ⚠️:

  • Crash-consistent = no guarantee app is in a clean state (like power failure)
  • App-consistent = VSS-based, ensures app is in a recoverable state (fewer per day, higher RPO)
  • Test Failover uses an isolated VNet — does NOT impact production replication
  • ASR doesn’t back up data — it replicates the VM state for DR purposes (use Azure Backup for data protection)
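
The trade-off between the two snapshot types can be illustrated with a small Python model. It is a sketch of the reasoning only, not the ASR API: `RecoveryPoint`, `latest_recovery_point`, and all timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    taken_at: datetime
    app_consistent: bool  # False means crash-consistent


def latest_recovery_point(points: Iterable[RecoveryPoint],
                          failure_time: datetime,
                          require_app_consistent: bool) -> Optional[RecoveryPoint]:
    """Newest point taken at or before the failure that satisfies the consistency requirement."""
    candidates = [
        p for p in points
        if p.taken_at <= failure_time
        and (p.app_consistent or not require_app_consistent)
    ]
    return max(candidates, key=lambda p: p.taken_at, default=None)


# Hypothetical timeline: frequent crash-consistent points, one older app-consistent point.
failure = datetime(2024, 1, 1, 11, 58)
points = [
    RecoveryPoint(datetime(2024, 1, 1, 8, 0), app_consistent=True),
    RecoveryPoint(datetime(2024, 1, 1, 11, 50), app_consistent=False),
    RecoveryPoint(datetime(2024, 1, 1, 11, 55), app_consistent=False),
]

crash = latest_recovery_point(points, failure, require_app_consistent=False)
app = latest_recovery_point(points, failure, require_app_consistent=True)
crash_loss = failure - crash.taken_at  # data lost if the crash-consistent point is used
app_loss = failure - app.taken_at      # data lost if an app-consistent point is required
print(crash_loss, app_loss)
```

Requiring an app-consistent point trades a larger data-loss window for a cleaner application state, which is the point of the second caveat above.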

ASR Recovery Plans

What Recovery Plans provide:

  • 📋 Ordered failover — define which VMs boot first (e.g., AD → DB → App → Web)
  • 🤖 Automation — include Azure Automation runbooks for scripted steps
  • 🧪 Test failover — DR drills without impacting production
  • 📊 RTO estimation — test and measure actual failover time
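
Ordered failover amounts to starting each tier only after the tiers it depends on. One way to derive such an order is a topological sort, sketched below with Python's standard `graphlib` module (Python 3.9+). This models the idea only; it is not how ASR is configured, and the `depends_on` mapping and `boot_order` helper are assumptions made for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each tier maps to the set of tiers that must be running before it starts,
# following the AD -> DB -> App -> Web example above.
depends_on = {
    "DB": {"AD"},
    "App": {"DB"},
    "Web": {"App"},
}


def boot_order(dependencies):
    """Return tiers in an order where every dependency starts first."""
    return list(TopologicalSorter(dependencies).static_order())


order = boot_order(depends_on)
print(" -> ".join(order))
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies are circular, which is a useful sanity check when planning a failover sequence.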

🎯 Domain 3 — Exam Scenario Quick-Reference

| Scenario | Answer |
|---|---|
| Protect Azure VMs from rack-level failure | Availability Set (99.95% SLA) |
| Protect Azure VMs from full datacenter failure | Availability Zones (99.99% SLA) |
| Lowest-cost multi-VM option that still gets a VM SLA | Availability Set (free, 99.95%) |
| Replicate Azure VMs to another Azure region for DR | Azure Site Recovery (Azure-to-Azure) |
| Back up on-premises files to Azure | MARS Agent + Recovery Services Vault |
| Back up Azure VM with cross-region restore capability | Recovery Services Vault + GRS + enable CRR |
| SQL DB failover without changing connection strings | Auto-Failover Group (same listener endpoint) |
| Cosmos DB survives a regional outage automatically | Enable automatic failover in Cosmos DB multi-region config |
| Protect backup data from accidental deletion | Soft Delete (14-day retention, free) |
| Conduct DR drill without impacting production | ASR Test Failover in isolated VNet |
| App needs 99.99% SLA, currently on a single VM | Redeploy across Availability Zones + Standard Load Balancer |
| Calculate composite SLA for web + SQL + storage | Multiply all SLAs (result always lower than weakest link) |
| Need hourly backups for Azure VM | Enable Enhanced Backup Policy |
| Replicate on-premises Hyper-V VMs to Azure for DR | ASR Hyper-V-to-Azure replication |

📊 HA Tier Quick-Reference Ladder

graph TD
    L1["❌ No redundancy\nSingle point of failure\nNever for production"]
    L2["🟡 Availability Set\n99.95% SLA\nFree — protects against rack failure"]
    L3["🟢 Availability Zones\n99.99% SLA\nProtects against datacenter failure"]
    L4["🔵 Region Pair + ASR / Geo-Replication\nProtects against full region outage\nHigher cost + complexity"]
    L5["🏆 Multi-Region Active-Active\n99.999% SLA (Cosmos DB)\nHighest cost — mission-critical only"]

    L1 --> L2 --> L3 --> L4 --> L5
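
The ladder can be read as a lookup: pick the first rung that covers the failure scope you must survive. The Python sketch below encodes that reading. It assumes each rung also covers every smaller scope and that ladder order matches cost order; `SCOPES`, `LADDER`, and `lowest_tier_for` are names introduced here.

```python
# Failure scopes from smallest to largest.
SCOPES = ["none", "rack", "datacenter", "region"]

# Rungs in ladder order (assumed cheapest first) with the widest scope each covers.
LADDER = [
    ("No redundancy", "none"),
    ("Availability Set", "rack"),
    ("Availability Zones", "datacenter"),
    ("Region Pair + ASR / Geo-Replication", "region"),
    ("Multi-Region Active-Active", "region"),
]


def lowest_tier_for(scope: str) -> str:
    """Return the first rung that protects against the given failure scope."""
    needed = SCOPES.index(scope)  # raises ValueError for an unknown scope
    for name, covers in LADDER:
        if SCOPES.index(covers) >= needed:
            return name
    raise ValueError(f"no tier covers scope {scope!r}")


for scope in ("rack", "datacenter", "region"):
    print(f"{scope}: {lowest_tier_for(scope)}")
```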