03 — Design Business Continuity Solutions
Official Exam Weight: 15–20%
🗺️ Domain Overview
mindmap
  root((Business Continuity))
    Core Concepts
      RTO – Recovery Time Objective
      RPO – Recovery Point Objective
      SLA – Service Level Agreement
      Composite SLA calculation
    High Availability
      Availability Sets
      Availability Zones
      Load Balancers
      Zone-redundant services
    Backup
      Azure Backup
      Recovery Services Vault
      Backup Vault
      Soft Delete
    Disaster Recovery
      Azure Site Recovery
      Recovery Plans
      Geo-replication
      Auto-Failover Groups
⏱️ 3.1 Core BC/DR Concepts
Key Metrics — Definitions
graph LR
subgraph BEFORE["⏪ Before Incident"]
BACKUP["💾 Last backup\nor replica sync"]
end
subgraph INCIDENT["💥 Incident Occurs"]
FAIL["❌ Service\nFails"]
end
subgraph WINDOW_RPO["📊 RPO Window"]
RPO_NOTE["Data written between\nlast backup and failure\n= LOST DATA\n⟵ RPO measures this gap ⟶"]
end
subgraph RECOVERY["🔧 Recovery Phase"]
DETECT["Detect\nincident"]
RESTORE["Restore\nservice"]
ONLINE["✅ Service\nback online"]
end
subgraph WINDOW_RTO["📊 RTO Window"]
RTO_NOTE["Time from failure\nto full recovery\n= DOWNTIME\n⟵ RTO measures this gap ⟶"]
end
BACKUP -->|"time passes"| FAIL
FAIL --> DETECT --> RESTORE --> ONLINE
BACKUP -.->|"RPO gap"| RPO_NOTE
FAIL -.->|"RTO gap"| RTO_NOTE
| Metric | Full Name | Question it Answers | How to Read It |
|---|---|---|---|
| RPO | Recovery Point Objective | “How much data can we afford to lose?” | Lower target = more expensive |
| RTO | Recovery Time Objective | “How long can we be offline?” | Lower target = more expensive |
| SLA | Service Level Agreement | “What uptime does Azure guarantee?” | Higher = better |
| MTTR | Mean Time to Repair | “How long to fix on average?” | Lower = better |
| MTBF | Mean Time Between Failures | “How often does it break?” | Higher = better |
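To make the two windows concrete, the sketch below derives the data-loss window and the downtime from three timestamps. The function name `incident_metrics` and the sample times are illustrative assumptions, not part of any Azure API.

```python
from datetime import datetime, timedelta


def incident_metrics(
    last_backup: datetime, failure: datetime, back_online: datetime
) -> tuple[timedelta, timedelta]:
    """Return (data_loss_window, downtime) for a single incident.

    data_loss_window is what the RPO constrains; downtime is what the RTO constrains.
    """
    if not (last_backup <= failure <= back_online):
        raise ValueError("expected last_backup <= failure <= back_online")
    return failure - last_backup, back_online - failure


# Hypothetical incident: backup at 02:00, failure at 09:30, service restored at 11:00.
loss, downtime = incident_metrics(
    datetime(2024, 1, 1, 2, 0),
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 1, 11, 0),
)
```

In this example 7.5 hours of writes are at risk and the service is down for 1.5 hours. The incident meets its targets only if both values stay within the stated RPO and RTO.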
Exam Caveats ⚠️:
- In exam scenarios, lower RPO and RTO = more redundancy = higher cost
- The exam will often ask you to choose the cheapest solution that meets the RTO/RPO requirement
- Never confuse RPO (data loss) with RTO (downtime)
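The "cheapest solution that meets the requirement" pattern can be sketched as a filter followed by a minimum. The option names, RPO/RTO figures, and costs below are made-up placeholders for illustration, not real Azure numbers or pricing.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DrOption:
    name: str
    rpo_minutes: float   # worst-case data loss the option allows
    rto_minutes: float   # worst-case downtime the option allows
    monthly_cost: float  # hypothetical relative cost, not real pricing


def cheapest_option(
    options: Sequence[DrOption], required_rpo: float, required_rto: float
) -> Optional[DrOption]:
    """Return the cheapest option that satisfies both targets, or None if none does."""
    qualifying = [
        o for o in options
        if o.rpo_minutes <= required_rpo and o.rto_minutes <= required_rto
    ]
    return min(qualifying, key=lambda o: o.monthly_cost, default=None)


# Illustrative catalogue only.
OPTIONS = [
    DrOption("daily-backup", rpo_minutes=24 * 60, rto_minutes=8 * 60, monthly_cost=50),
    DrOption("replication", rpo_minutes=5, rto_minutes=120, monthly_cost=300),
    DrOption("active-active", rpo_minutes=0, rto_minutes=1, monthly_cost=2000),
]
```

With an RPO of 60 minutes and an RTO of 240 minutes, the daily backup fails the RPO test, so the function picks `replication` rather than the more expensive `active-active` option.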
SLA Calculations
Single-service SLA examples:
| Service | Tier / Config | SLA Uptime | Downtime / Month |
|---|---|---|---|
| Azure VM | Single VM, Premium SSD | 99.9% | ~43 min/month |
| Azure VM | Availability Set (2+ VMs) | 99.95% | ~22 min/month |
| Azure VM | Availability Zones (2+ VMs) | 99.99% | ~4.4 min/month |
| Azure SQL DB | General Purpose | 99.99% | ~4.4 min/month |
| Azure SQL DB | Business Critical | 99.99% | ~4.4 min/month |
| Cosmos DB | Single region | 99.99% | ~4.4 min/month |
| Cosmos DB | Multi-region write | 99.999% | ~26 sec/month |
| App Service | Standard/Premium | 99.95% | ~22 min/month |
| Azure Storage | LRS / ZRS / GRS | 99.9% | ~43 min/month |
| Azure Storage | RA-GRS / RA-GZRS | 99.99% (read) | ~4.4 min/month |
| Azure Kubernetes Service | Standard tier, no Availability Zones | 99.9% (API server) | ~43 min/month |
| AKS + Availability Zones | Standard tier + AZs | 99.95% (API server) | ~22 min/month |
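The downtime column follows directly from the SLA percentage. A minimal sketch, assuming a 30-day month (the table's rounded figures may assume a slightly longer average month); the function name is hypothetical.

```python
def monthly_downtime_minutes(sla_percent: float, days_in_month: float = 30.0) -> float:
    """Maximum downtime per month, in minutes, that still satisfies the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Allowed downtime for the SLA levels that appear in the table.
for sla in (99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% -> {monthly_downtime_minutes(sla):.2f} min/month")
```

Each extra "nine" cuts the allowed downtime by a factor of ten, which is why the jump from 99.9% to 99.99% matters far more than the numbers suggest at first glance.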
Composite SLA — Serial Dependencies
When services depend on each other in series, multiply the SLAs:
graph LR
WEB["🌐 Web App\nSLA: 99.95%"] --> SQL["🗄️ SQL Database\nSLA: 99.99%"] --> STOR["💾 Storage\nSLA: 99.9%"]
RESULT["📊 Composite SLA\n= 99.95% × 99.99% × 99.9%\n= 99.84%"]
⚠️ Composite SLA is always lower than the weakest component in a serial chain.
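The serial calculation above can be checked with a few lines of code. This is a sketch under the usual assumption that component failures are independent; `composite_serial_sla` is a hypothetical helper name.

```python
from math import prod
from typing import Iterable


def composite_serial_sla(sla_percents: Iterable[float]) -> float:
    """Composite SLA (in percent) for services that must all be up at once."""
    return prod(s / 100 for s in sla_percents) * 100


# Web App -> SQL Database -> Storage, using the SLAs from the diagram.
chain = [99.95, 99.99, 99.9]
composite = composite_serial_sla(chain)
print(f"{composite:.2f}%")
```

The result rounds to 99.84%, below the 99.9% of the weakest component in the chain.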
Composite SLA — Parallel (Redundant) Services
When services run in parallel (either one can serve the request), availability increases. For example, with two independent deployments at 99.9% each:
Parallel availability = 1 - (probability both fail)
= 1 - ((1 - 0.999) × (1 - 0.999))
= 1 - (0.001 × 0.001)
= 1 - 0.000001
= 99.9999%
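The same arithmetic generalises to any number of redundant deployments. A minimal sketch, assuming failures are independent (real outages can be correlated, which lowers the true figure); the function name is hypothetical.

```python
from typing import Iterable


def composite_parallel_sla(sla_percents: Iterable[float]) -> float:
    """Composite SLA (in percent) when any one of the deployments can serve traffic."""
    p_all_fail = 1.0
    for s in sla_percents:
        p_all_fail *= 1 - s / 100  # probability that this deployment is down
    return (1 - p_all_fail) * 100


# Two redundant deployments at 99.9% each.
pair = composite_parallel_sla([99.9, 99.9])
print(f"{pair:.4f}%")
```

Only the case where every deployment is down at the same time counts as an outage, so each added replica multiplies the failure probability by another small factor.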
Exam Caveats ⚠️:
- Adding redundancy in parallel improves availability
- Adding dependencies in series always lowers the composite SLA
- The exam frequently tests composite SLA math with 2–3 services in a dependency chain
🏗️ 3.2 High Availability Design
Availability Sets vs Availability Zones
graph TD
subgraph "Availability Set"
direction LR
DC1["🏢 Single Datacenter"]
DC1 --> FD1["🔧 Fault Domain 1\n(Rack A)"]
DC1 --> FD2["🔧 Fault Domain 2\n(Rack B)"]
DC1 --> FD3["🔧 Fault Domain 3\n(Rack C)"]
FD1 --> VM1["VM 1"]
FD2 --> VM2["VM 2"]
FD3 --> VM3["VM 3"]
end
subgraph "Availability Zones"
direction LR
REGION["🌍 Azure Region"]
REGION --> Z1["🏢 Zone 1\n(DC A)"]
REGION --> Z2["🏢 Zone 2\n(DC B)"]
REGION --> Z3["🏢 Zone 3\n(DC C)"]
Z1 --> VMZ1["VM 1"]
Z2 --> VMZ2["VM 2"]
Z3 --> VMZ3["VM 3"]
end
| Feature | Availability Set | Availability Zones |
|---|---|---|
| Protects against | Rack / hardware failure | Entire datacenter failure |
| Scope | Single datacenter | Multiple DCs in same region |
| Fault Domains | 2–3 (separate racks) | 1 per zone (separate DCs) |
| Update Domains | 5–20 (rolling updates) | Independent per zone |
| SLA | 99.95% | 99.99% |
| Extra cost | ❌ Free | ⚠️ Minor inter-zone data transfer cost |
| VM type required | Any | Zone-compatible SKU |
Exam Caveats ⚠️:
- Availability Sets do NOT protect against datacenter failure — only Availability Zones do
- An existing VM cannot be added to an Availability Set after creation, and moving it into a zone typically means redeploying it
- Not all regions support Availability Zones — always check regional availability
- Azure Load Balancer Standard + Availability Zones = zone-redundant load balancing (99.99% SLA)
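The effect of fault domains can be illustrated with a simplified placement model. The round-robin assignment below is an assumption for illustration only, and the helper names are hypothetical; Azure's actual placement logic is managed by the platform.

```python
def place_vms(vm_count: int, fault_domains: int = 3, update_domains: int = 5) -> list[dict]:
    """Assign VMs round-robin to fault and update domains (simplified model)."""
    return [
        {
            "name": f"vm{i}",
            "fault_domain": i % fault_domains,
            "update_domain": i % update_domains,
        }
        for i in range(vm_count)
    ]


def survivors(placements: list[dict], failed_fault_domain: int) -> list[str]:
    """Names of the VMs still running after one fault domain (rack) fails."""
    return [vm["name"] for vm in placements if vm["fault_domain"] != failed_fault_domain]


three_vms = place_vms(3)
print(survivors(three_vms, failed_fault_domain=0))
```

With three VMs spread over three fault domains, losing one rack leaves two VMs running. With a single VM, the same failure leaves none, which is why an Availability Set needs at least two instances to earn its SLA.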
Zone-Redundant Services SLAs
| Service | Zone-Redundant Mode | SLA |
|---|---|---|
| Azure Load Balancer Standard | Zone-redundant frontend | 99.99% |
| Application Gateway v2 | Zone-redundant deployment | 99.95% |
| Azure SQL Database | Zone-redundant config (BC/GP) | 99.99% |
| Azure Storage ZRS | Automatic across zones | 99.9% |
| Azure Kubernetes Service | Zone-spread node pools (Standard tier) | 99.95% (API server) |
| Azure Cache for Redis | Zone-redundant (Premium) | 99.9% |
| Azure Service Bus | Zone-redundant (Premium) | 99.9% |
💾 3.3 Design Azure Backup Solutions
Backup Vault Types
graph LR
subgraph "Recovery Services Vault"
RSV1["✅ Azure VMs"]
RSV2["✅ SQL on Azure VM"]
RSV3["✅ SAP HANA on Azure VM"]
RSV4["✅ Azure Files"]
RSV5["✅ On-prem (MARS agent)"]
RSV6["✅ On-prem VMware/Hyper-V (MABS)"]
end
subgraph "Backup Vault (newer)"
BV1["✅ Azure Managed Disks"]
BV2["✅ Azure Blob Storage"]
BV3["✅ Azure Database for PostgreSQL"]
BV4["✅ Azure Kubernetes Service (preview)"]
end
Exam Caveat ⚠️: Know which vault type supports which workload. Azure VMs = Recovery Services Vault. Managed Disks / Blobs = Backup Vault.
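The workload-to-vault mapping from the diagram can be kept as a simple lookup table for revision. The workload keys below are hypothetical labels chosen for this sketch, not Azure resource type names.

```python
RECOVERY_SERVICES_VAULT = "Recovery Services Vault"
BACKUP_VAULT = "Backup Vault"

# Keys are illustrative labels; values follow the diagram above.
VAULT_BY_WORKLOAD = {
    "azure-vm": RECOVERY_SERVICES_VAULT,
    "sql-on-azure-vm": RECOVERY_SERVICES_VAULT,
    "sap-hana-on-azure-vm": RECOVERY_SERVICES_VAULT,
    "azure-files": RECOVERY_SERVICES_VAULT,
    "on-prem-mars": RECOVERY_SERVICES_VAULT,
    "on-prem-mabs": RECOVERY_SERVICES_VAULT,
    "managed-disk": BACKUP_VAULT,
    "blob-storage": BACKUP_VAULT,
    "postgresql": BACKUP_VAULT,
    "aks": BACKUP_VAULT,
}


def vault_for(workload: str) -> str:
    """Return the vault type that protects the given workload label."""
    try:
        return VAULT_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None
```

For example, `vault_for("azure-vm")` returns the Recovery Services Vault, while `vault_for("managed-disk")` returns the Backup Vault.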
Backup Redundancy Options
| Redundancy | Copies | SLA | Cross-Region Restore? | Use Case |
|---|---|---|---|---|
| LRS | 3 in same DC | — | ❌ | Cost-optimised, no geo requirement |
| ZRS | 3 across zones | — | ❌ | Zone failure protection |
| GRS | 6 (3+3 paired region) | — | ✅ (must be enabled) | Cross-region DR |
Exam Caveat ⚠️: For Cross-Region Restore (CRR), the Recovery Services Vault must use GRS redundancy AND the CRR feature must be explicitly enabled. This approximately doubles storage cost.
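The two conditions for Cross-Region Restore combine with a logical AND. A minimal sketch; the function and parameter names are hypothetical and do not correspond to an Azure SDK call.

```python
def can_cross_region_restore(redundancy: str, crr_enabled: bool) -> bool:
    """True only when the vault uses GRS and the CRR feature is switched on."""
    return redundancy.strip().upper() == "GRS" and crr_enabled


# GRS alone is not enough: CRR must also be enabled on the vault.
print(can_cross_region_restore("GRS", crr_enabled=False))
print(can_cross_region_restore("GRS", crr_enabled=True))
```

A vault on LRS or ZRS fails the check regardless of the flag, which matches the redundancy table above.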
Soft Delete Protection
| Feature | Default State | Retention After Delete | Cost |
|---|---|---|---|
| Soft Delete (Azure VMs) | ✅ Enabled by default; can be locked with always-on soft delete (irreversible) | 14 additional days | Free |
| Soft Delete (SQL / SAP HANA) | ✅ Enabled by default (same vault-level setting) | 14 additional days | Free |
| Soft Delete (Azure Files) | Must be enabled | 14 additional days | Free |
Exam Caveats ⚠️:
- Soft delete protects against accidental deletion and ransomware that deletes backups
- Soft delete is enabled by default on new vaults; once always-on soft delete is set on a vault, it cannot be disabled
- The default 14-day soft delete retention is free; retention extended beyond 14 days is billed as backup storage
Enhanced Backup Policy (Hourly Backups)
- Standard policy: One backup per day
- Enhanced policy: Multiple backups per day, as often as every 4 hours
- Required for: Azure VM backups more frequent than daily (SQL Server and SAP HANA in Azure VMs use separate workload policies with log backups)
- Lower RPO achieved with enhanced policy
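Backup frequency sets the worst-case data loss: anything written since the last backup is gone. The sketch below picks the longest (and usually cheapest) interval that still meets an RPO, assuming evenly spaced backups. The interval list and function name are hypothetical, not a list of values the service guarantees.

```python
from typing import Optional, Sequence


def pick_backup_interval(
    allowed_intervals_hours: Sequence[float], required_rpo_hours: float
) -> Optional[float]:
    """Longest allowed interval whose worst-case data loss fits within the RPO.

    With evenly spaced backups, worst-case data loss equals the interval itself.
    Returns None when even the shortest interval is too long.
    """
    qualifying = [i for i in allowed_intervals_hours if i <= required_rpo_hours]
    return max(qualifying, default=None)


# Illustrative schedule options, in hours.
INTERVALS = (4, 6, 8, 12, 24)
print(pick_backup_interval(INTERVALS, required_rpo_hours=10))
```

An RPO of 10 hours selects the 8-hour schedule. An RPO of 2 hours returns `None`, a signal that backup alone cannot meet the target and replication should be considered.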
🔄 3.4 Azure Site Recovery (ASR)
ASR Supported Scenarios
graph LR
subgraph "Sources"
VM1["☁️ Azure VMs\n(Region A)"]
VM2["🖥️ VMware VMs\n(On-premises)"]
VM3["🖥️ Hyper-V VMs\n(On-premises)"]
VM4["🖥️ Physical Servers\n(Windows / Linux)"]
end
subgraph "Target"
AZ["☁️ Azure\n(Secondary Region\nor different Availability Zone)"]
end
VM1 -->|"Azure-to-Azure"| AZ
VM2 -->|"VMware-to-Azure"| AZ
VM3 -->|"Hyper-V-to-Azure"| AZ
VM4 -->|"Physical-to-Azure"| AZ
ASR Key Metrics
| Metric | Value | Notes |
|---|---|---|
| RPO (Azure-to-Azure) | < 30 seconds | Continuous replication |
| RTO | ~1–2 hours | Time to complete failover + VM boot |
| Crash-consistent snapshot | Every 5 minutes | Data as if power was cut |
| App-consistent snapshot | Every 1–12 hours (configurable) | VSS snapshot, application-aware |
| Replication frequency | Continuous | Not scheduled |
| SLA | 99.9% | ASR service SLA |
Exam Caveats ⚠️:
- Crash-consistent = no guarantee app is in a clean state (like power failure)
- App-consistent = VSS-based, ensures app is in a recoverable state (fewer per day, higher RPO)
- Test Failover uses an isolated VNet — does NOT impact production replication
- ASR doesn’t back up data — it replicates the VM state for DR purposes (use Azure Backup for data protection)
ASR Recovery Plans
What Recovery Plans provide:
- 📋 Ordered failover — define which VMs boot first (e.g., AD → DB → App → Web)
- 🤖 Automation — include Azure Automation runbooks for scripted steps
- 🧪 Test failover — DR drills without impacting production
- 📊 RTO estimation — test and measure actual failover time
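Ordered failover and RTO estimation can be modelled as a list of groups that run one after another. The sketch assumes VMs inside a group start in parallel and that the boot times are known; both the plan contents and the timings are hypothetical, and a real drill should rely on measured Test Failover results.

```python
def boot_order(groups: list[dict[str, float]]) -> list[str]:
    """Flatten the plan into the order in which VMs are started."""
    return [name for group in groups for name in group]


def estimate_failover_minutes(groups: list[dict[str, float]]) -> float:
    """Sum of the slowest VM per group: groups are sequential, VMs in a group parallel."""
    return float(sum(max(group.values()) for group in groups if group))


# Hypothetical plan: AD -> DB -> App -> Web, with boot times in minutes.
PLAN = [
    {"ad-dc": 5},
    {"sql-db": 10},
    {"app-1": 7, "app-2": 9},
    {"web-1": 4, "web-2": 4},
]
print(boot_order(PLAN))
print(estimate_failover_minutes(PLAN))
```

The estimate for this plan is 28 minutes (5 + 10 + 9 + 4). Any manual steps or runbooks between groups would add to that figure.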
🎯 Domain 3 — Exam Scenario Quick-Reference
| Scenario | Answer |
|---|---|
| Protect Azure VMs from rack-level failure | Availability Set (99.95% SLA) |
| Protect Azure VMs from full datacenter failure | Availability Zones (99.99% SLA) |
| Lowest-cost option that raises the VM SLA above the single-instance 99.9% | Availability Set (no charge for the set itself, 99.95%, requires 2+ VMs) |
| Replicate Azure VMs to another Azure region for DR | Azure Site Recovery (Azure-to-Azure) |
| Back up on-premises files to Azure | MARS Agent + Recovery Services Vault |
| Back up Azure VM with cross-region restore capability | Recovery Services Vault + GRS + enable CRR |
| SQL DB failover without changing connection strings | Auto-Failover Group (same listener endpoint) |
| Cosmos DB survives a regional outage automatically | Enable automatic failover in Cosmos DB multi-region config |
| Protect backup data from accidental deletion | Soft Delete (14-day retention, free) |
| Conduct DR drill without impacting production | ASR Test Failover in isolated VNet |
| App needs 99.99% SLA, currently on a single VM | Redeploy across Availability Zones + Standard Load Balancer |
| Calculate composite SLA for web + SQL + storage | Multiply all SLAs (result always lower than weakest link) |
| Need more than one backup per day for an Azure VM | Enable Enhanced Backup Policy |
| Replicate on-premises Hyper-V VMs to Azure for DR | ASR Hyper-V-to-Azure replication |
📊 HA Tier Quick-Reference Ladder
graph TD
L1["❌ No redundancy\nSingle point of failure\nNever for production"]
L2["🟡 Availability Set\n99.95% SLA\nFree — protects against rack failure"]
L3["🟢 Availability Zones\n99.99% SLA\nProtects against datacenter failure"]
L4["🔵 Region Pair + ASR / Geo-Replication\nProtects against full region outage\nHigher cost + complexity"]
L5["🏆 Multi-Region Active-Active\n99.999% SLA (Cosmos DB)\nHighest cost — mission-critical only"]
L1 --> L2 --> L3 --> L4 --> L5
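The ladder can be read as a lookup from "largest failure to survive" to "lowest rung that covers it". A minimal sketch; the scope labels and function names are hypothetical.

```python
# Ordered from smallest to largest failure scope, following the ladder above.
HA_LADDER = [
    ("rack", "Availability Set"),
    ("datacenter", "Availability Zones"),
    ("region", "Region Pair + ASR / Geo-Replication"),
]


def minimum_tier(failure_scope: str) -> str:
    """Lowest rung of the ladder that protects against the given failure scope."""
    for scope, tier in HA_LADDER:
        if scope == failure_scope:
            return tier
    raise ValueError(f"unknown failure scope: {failure_scope!r}")


def tiers_covering(failure_scope: str) -> list[str]:
    """All rungs at or above the minimum tier for the given failure scope."""
    scopes = [scope for scope, _ in HA_LADDER]
    if failure_scope not in scopes:
        raise ValueError(f"unknown failure scope: {failure_scope!r}")
    return [tier for _, tier in HA_LADDER[scopes.index(failure_scope):]]


print(minimum_tier("datacenter"))
```

Choosing the minimum tier mirrors the exam's "cheapest solution that meets the requirement" framing: pick the lowest rung that covers the failure, not the highest one available.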