Domain 5: Implement an Instrumentation Strategy
Exam Weight: 5–10% 📁 ← Back to Home
📑 Table of Contents
5.1 Configure Monitoring for a DevOps Environment
Azure Monitor Overview
Azure Monitor is the central monitoring platform for Azure. All telemetry flows through it.
1
2
3
4
5
6
7
8
┌──────────────────────┐
Applications ──────► │ │
VMs/Containers ────► │ Azure Monitor │──► Dashboards
Azure Resources ───► │ (Data Platform) │──► Alerts
Azure DevOps ──────► │ │──► Log Analytics
GitHub ────────────► │ Metrics | Logs | │──► Workbooks
│ Traces | Changes │──► Power BI
└──────────────────────┘
Azure Monitor data types: | Type | Storage | Examples | |——|———|———| | Metrics | Azure Monitor Metrics (time-series) | CPU %, response time, request count | | Logs | Log Analytics workspace | Application logs, audit logs, events | | Traces | Application Insights | Distributed traces across services | | Changes | Change Analysis | Resource configuration changes |
Application Insights
Application Insights is an APM (Application Performance Monitoring) service within Azure Monitor.
Key capabilities: | Feature | Description | |———|————-| | Live Metrics | Real-time stream of performance data | | Application Map | Visual diagram of connected services and their health | | Failures | Exception tracking and failed request analysis | | Performance | Response time analysis, slowest operations | | Users/Sessions | Usage analytics | | Availability Tests | Proactive monitoring from multiple locations | | Smart Detection | AI-based anomaly detection | | Distributed Tracing | End-to-end trace across microservices |
Instrumentation approaches:
Option 1: SDK-based (Code-based)
1
2
3
4
5
// .NET — add Application Insights SDK
builder.Services.AddApplicationInsightsTelemetry(options =>
{
options.ConnectionString = builder.Configuration["APPLICATIONINSIGHTS_CONNECTION_STRING"];
});
1
2
3
4
5
# Python
from applicationinsights import TelemetryClient
tc = TelemetryClient('<instrumentation_key>')
tc.track_event('UserLogin', {'userId': user.id})
tc.flush()
Option 2: Auto-instrumentation (Codeless)
- Enabled via Azure App Service → Application Insights → Enable
- No code changes required
- Supports .NET, Java, Node.js, Python (limited)
Custom telemetry:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// Track custom events
_telemetryClient.TrackEvent("OrderPlaced",
new Dictionary<string, string> { { "orderId", order.Id } },
new Dictionary<string, double> { { "orderValue", order.Total } });
// Track custom metrics
_telemetryClient.TrackMetric("ActiveUsers", userCount);
// Track custom exceptions
try { ... }
catch (Exception ex)
{
_telemetryClient.TrackException(ex,
new Dictionary<string, string> { { "context", "OrderService" } });
throw;
}
VM Insights
Monitors performance and health of Azure VMs and VM Scale Sets.
Collects:
- CPU, memory, disk I/O, network I/O
- Running processes and their performance
- TCP connections between processes
Enable VM Insights:
1
2
3
4
5
6
7
az vm extension set \
--resource-group myRG \
--vm-name myVM \
--name MicrosoftMonitoringAgent \
--publisher Microsoft.EnterpriseCloud.Monitoring \
--settings '{"workspaceId": "<workspace-id>"}' \
--protected-settings '{"workspaceKey": "<workspace-key>"}'
Or use Azure Monitor Agent (AMA) — the modern replacement for Log Analytics agent:
1
2
3
4
5
az vm extension set \
--resource-group myRG \
--vm-name myVM \
--name AzureMonitorWindowsAgent \
--publisher Microsoft.Azure.Monitor
Container Insights
Monitors performance of AKS (Azure Kubernetes Service) and Azure Container Instances.
Collects:
- Node-level: CPU, memory, disk, network
- Pod-level: Resource requests vs limits, restarts
- Container-level: CPU, memory per container
- Kubernetes events (OOMKilled, ImagePullBackOff, etc.)
Enable Container Insights:
1
2
3
4
5
6
# Enable on existing AKS cluster
az aks enable-addons \
--resource-group myRG \
--name myAKS \
--addons monitoring \
--workspace-resource-id <log-analytics-workspace-id>
Storage Insights, Network Insights
Storage Insights:
- Dashboard for Azure Storage accounts (Blob, Files, Queue, Table)
- Tracks: transactions, availability, performance, capacity
- Auto-enabled — no agent needed
Network Insights:
- Topology view of network resources
- Tracks: traffic flows, health, connectivity
- Includes: Network Watcher, Traffic Analytics, Connection Monitor
Configure Monitoring in GitHub
GitHub Insights (Organization level):
- Insights → Pulse: Recent activity summary (PRs, issues, commits)
- Traffic: Repo views, clones, top referrers
- Community: README, contributing guide, code of conduct health
- Dependency graph: All dependencies (enables Dependabot)
Creating custom charts:
- GitHub Projects → Insights tab → Create charts from project data
- Track: items by status, items by assignee, burnup chart, velocity
Alerts for DevOps Events
Azure Pipelines Alerts
1
2
3
4
5
6
7
8
9
10
11
12
13
# Notify on pipeline failure via email
# Configure in: Pipeline → Notifications → Subscriptions
# OR use Service Hooks → Slack/Teams/Email
# In YAML — alert on specific condition
- script: |
curl -X POST $TEAMS_WEBHOOK \
-H 'Content-Type: application/json' \
-d '{"text": "❌ Build FAILED: $(Build.DefinitionName) - $(Build.SourceBranch)"}'
condition: failed()
displayName: 'Notify Teams on failure'
env:
TEAMS_WEBHOOK: $(TeamsWebhookUrl)
Azure Monitor Alert → Azure Pipelines:
- Create an Azure Monitor action group that triggers a webhook
- Webhook calls Azure DevOps REST API to run a pipeline (auto-remediation)
GitHub Actions Alerts
1
2
3
4
5
6
7
8
9
10
11
12
# Notify Slack on workflow failure
- name: Notify Slack on failure
if: failure()
uses: slackapi/slack-github-action@v1.26.0
with:
payload: |
{
"text": "❌ Workflow failed: $ on $",
"channel": "#alerts"
}
env:
SLACK_BOT_TOKEN: $
5.2 Analyze Metrics from Instrumentation
Infrastructure Performance Indicators
| Metric | Concern Level | Description |
|---|---|---|
| CPU % | > 80% sustained | High utilization — scale up or out |
| Memory % | > 85% sustained | Possible memory leak or undersized |
| Disk I/O | Latency > 20ms | Storage bottleneck |
| Disk Space | > 85% used | Risk of running out |
| Network In/Out | Bandwidth saturation | Throttling or cost concerns |
| Network Errors | Any non-zero | Connectivity issues |
Alert rules in Azure Monitor:
1
2
3
4
5
6
7
8
9
10
# Create metric alert for high CPU
az monitor metrics alert create \
--name "High CPU Alert" \
--resource-group myRG \
--scopes "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM" \
--condition "avg Percentage CPU > 80" \
--window-size 5m \
--evaluation-frequency 1m \
--action "/subscriptions/<sub>/resourceGroups/myRG/providers/microsoft.insights/actionGroups/myActionGroup" \
--description "CPU over 80% for 5 minutes"
Application Metrics and Usage
Application Insights — key metrics to monitor:
| Metric | Description | Alert Threshold Example |
|---|---|---|
| Server response time | Average request duration | > 2 seconds |
| Failed requests % | HTTP 5xx / exceptions | > 5% |
| Request rate | Requests per second | Monitor for anomalies |
| Page load time | Client-side load duration | > 3 seconds |
| Browser exceptions | JavaScript errors | Monitor for spikes |
| Availability | % of uptime (from availability tests) | < 99.9% |
Application Map analysis:
- Identify slow dependencies (external APIs, databases)
- Find cascading failures
- Spot services with high error rates
Distributed Tracing with Application Insights
How distributed tracing works:
1
2
3
4
5
6
Request ──► Service A ──► Service B ──► Database
│ │
└── Span 1 └── Span 2
(50ms) (200ms)
Operation ID: abc-123 ← Links all spans together
Correlation headers:
- Application Insights automatically propagates
traceparent(W3C standard) orRequest-Idheaders - Enables end-to-end trace reconstruction across services
Viewing traces:
- Application Insights → Transaction Search → Filter by operation ID
- Application Insights → End-to-end Transaction Details → Waterfall view
Sampling:
- Telemetry can be sampled to reduce costs and volume
- Types: Adaptive sampling (default), Fixed-rate, Ingestion sampling
- Sampling preserves complete traces (not random events within a trace)
1
2
3
4
5
6
7
8
// Configure adaptive sampling
builder.Services.AddApplicationInsightsTelemetry();
builder.Services.Configure<TelemetryConfiguration>(config =>
{
var sampler = new AdaptiveSamplingTelemetryProcessor(config);
sampler.MaxTelemetryItemsPerSecond = 20;
sampler.InitialSamplingPercentage = 100;
});
KQL — Kusto Query Language
KQL is the query language for Log Analytics (and Application Insights, Microsoft Sentinel, ADX).
KQL Basics
// Basic query structure:
// TableName
// | operator1
// | operator2
// View all Application Insights requests in the last hour
requests
| where timestamp > ago(1h)
| order by timestamp desc
| take 100
Core operators:
| Operator | Description | Example |
|---|---|---|
where |
Filter rows | where statusCode == 500 |
project |
Select columns | project name, duration, success |
summarize |
Aggregate | summarize count() by bin(timestamp, 1h) |
order by |
Sort | order by duration desc |
take / limit |
Return N rows | take 10 |
extend |
Add computed column | extend durationSec = duration / 1000 |
join |
Join two tables | | join kind=inner (exceptions) on operation_Id |
render |
Visualize | | render timechart |
parse |
Extract from string | parse message with "User: " userId " logged in" |
Common DevOps KQL Queries
// --- Failed requests in last 24h ---
requests
| where timestamp > ago(24h)
| where success == false
| summarize failedCount = count() by bin(timestamp, 1h), resultCode
| render timechart
// --- Slowest operations ---
requests
| where timestamp > ago(1h)
| summarize
avgDuration = avg(duration),
p95Duration = percentile(duration, 95),
requestCount = count()
by name
| order by p95Duration desc
| take 20
// --- Exception count by type ---
exceptions
| where timestamp > ago(24h)
| summarize count() by type
| order by count_ desc
// --- Azure Pipelines builds (via Log Analytics) ---
AzureDiagnostics
| where Category == "PipelineRuns"
| where status_s == "Failed"
| summarize count() by pipelineName_s, bin(TimeGenerated, 1d)
| render columnchart
// --- Application availability ---
availabilityResults
| where timestamp > ago(7d)
| summarize
availability = round(100.0 * countif(success == 1) / count(), 2)
by location, bin(timestamp, 1d)
| render timechart
// --- Container CPU by pod ---
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| extend PodName = extract("pod=([^,]+)", 1, InstanceName)
| summarize avgCPU = avg(CounterValue) by PodName, bin(TimeGenerated, 5m)
| render timechart
Availability Tests
Proactively monitor your application’s availability from multiple global locations.
Types:
| Type | Description |
|——|————-|
| URL Ping Test | Simple HTTP GET check |
| Standard Test | HTTP with SSL cert validation, custom headers |
| Multi-Step Web Test | Sequence of HTTP requests (deprecated, use custom Track Availability) |
| Custom Track Availability | Call TrackAvailability() from Azure Function or pipeline |
Create via Azure Portal or Bicep:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
resource availabilityTest 'microsoft.insights/webtests@2022-06-15' = {
name: 'ping-test-homepage'
location: location
tags: {
'hidden-link:${appInsights.id}': 'Resource'
}
kind: 'ping'
properties: {
SyntheticMonitorId: 'ping-test-homepage'
Name: 'Homepage Ping Test'
Enabled: true
Frequency: 300 // Every 5 minutes
Timeout: 30
Kind: 'ping'
Locations: [
{ Id: 'emea-nl-ams-azr' } // Amsterdam
{ Id: 'us-va-ash-azr' } // East US
{ Id: 'apac-sg-sin-azr' } // Singapore
]
Configuration: {
WebTest: '<WebTest ...>...'
}
}
}
🧠 Key Exam Tips for Instrumentation
| Scenario | Answer |
|---|---|
| Monitor AKS pod resource usage | Container Insights |
| Monitor Azure VMs CPU and memory | VM Insights |
| Track custom business events in application | Application Insights TrackEvent() |
| Find the slowest database queries in your app | Application Insights → Performance → Dependencies |
| Trace a request across 5 microservices | Distributed Tracing via Application Insights Application Map |
| Query failed requests in the last hour | KQL: requests | where success == false | where timestamp > ago(1h) |
| Get 95th percentile response time | KQL: summarize percentile(duration, 95) |
| Alert when error rate > 5% | Azure Monitor metric alert on requests/failed |
| Monitor app from 5 global locations | Application Insights Availability Tests |
| No code changes — enable monitoring on App Service | Auto-instrumentation (codeless) via App Service portal |
| Group KQL results by hour | | summarize count() by bin(timestamp, 1h) |
| Visualize KQL results as a time-series chart | | render timechart |