🔀 Azure Data Factory
Cloud-scale data integration service — code-free and code-first ETL/ELT pipelines
Table of Contents
- Product Overview
- Core Concepts
- Mapping Data Flows
- Monitoring & Management
- Security
- ADF vs Synapse Pipelines
- Pricing Model
- Common Exam Scenarios
Product Overview
Azure Data Factory (ADF) is a fully managed, serverless data integration and orchestration service. It enables you to create data-driven workflows (pipelines) that orchestrate and automate data movement and transformation across on-premises, cloud, and SaaS sources.
ADF is the primary ETL/ELT tool in Azure, replacing the need for on-premises SSIS packages in most cloud architectures. It also serves as the orchestration layer for Azure Synapse Analytics workloads (Synapse Pipelines is ADF embedded in a Synapse workspace).
flowchart LR
subgraph Sources["Data Sources"]
ODB["On-premises DB\n(SQL, Oracle, SAP)"]
SaaS["SaaS Apps\n(Salesforce, ServiceNow)"]
Cloud["Cloud Stores\n(S3, GCS, REST APIs)"]
Azure["Azure Services\n(Blob, ADLS, SQL)"]
end
subgraph ADF["Azure Data Factory"]
LS["Linked Services\n(connection definitions)"]
DS["Datasets\n(data shape + location)"]
PL["Pipelines\n(workflow + orchestration)"]
ACT["Activities\n(copy, transform, control)"]
IR["Integration Runtime\n(execution engine)"]
TRG["Triggers\n(schedule, event, tumbling)"]
end
subgraph Sinks["Destinations"]
ADLS2["ADLS Gen2 / Blob"]
SQLDW["Azure Synapse / SQL"]
COSMOS["Cosmos DB"]
EH["Event Hubs"]
end
Sources --> LS --> DS --> PL --> ACT --> IR --> Sinks
TRG --> PL
Core Concepts
Linked Services
Connection definitions that store connection strings and credentials for data sources and sinks — analogous to ODBC DSNs. A linked service points to an external store or compute resource.
Datasets
Named references to data within a linked service — they define the structure, location, and format of the data (e.g., a specific Blob container folder, a SQL table, a CSV schema).
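To make the relationship between the two concrete, the sketch below builds a linked service and a dataset that points into it, written as Python dictionaries that mirror ADF's JSON authoring format. The names `BlobStoreLS` and `SalesCsv`, the container, and the folder path are hypothetical, and the connection string is a placeholder.

```python
import json

# Hypothetical linked service: where the data lives and how to connect.
linked_service = {
    "name": "BlobStoreLS",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # Placeholder only; a real secret would not be written inline.
            "connectionString": "<connection-string>"
        },
    },
}

# Hypothetical dataset: which data inside that store, and in what format.
dataset = {
    "name": "SalesCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "raw",
                "folderPath": "sales/2024",
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}

print(json.dumps(dataset, indent=2))
```

The dataset carries no credentials of its own; it only references the linked service by name, so one linked service can back many datasets.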
Activities
The steps inside a pipeline. ADF has three activity categories:
| Category | Examples |
|---|---|
| Data movement | Copy Activity (the main workhorse) |
| Data transformation | Data Flow, Databricks Notebook, HDInsight Hive, Stored Procedure, U-SQL |
| Control flow | ForEach, If Condition, Until, Wait, Execute Pipeline, Set Variable, Web Activity |
Pipelines
A logical grouping of activities that together perform a unit of work. Pipelines support branching, looping, parallelism, and error handling via control flow activities.
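As a rough illustration of branching and error handling, the following sketch defines a pipeline with three activities wired together through `dependsOn`. All activity, dataset, and linked service names are hypothetical, as are the stored procedure name and the webhook URL; the helper `upstream_of` is not part of ADF and exists only to inspect the dictionary.

```python
import json

# Hypothetical pipeline: copy a file, then branch on success or failure.
pipeline = {
    "name": "LoadSalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopySalesCsv",
                "type": "Copy",
                "inputs": [{"referenceName": "SalesCsv", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SalesTable", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "AzureSqlSink"},
                },
            },
            {
                # Runs only when the copy succeeds.
                "name": "MergeIntoFact",
                "type": "SqlServerStoredProcedure",
                "dependsOn": [
                    {"activity": "CopySalesCsv", "dependencyConditions": ["Succeeded"]}
                ],
                "linkedServiceName": {
                    "referenceName": "SalesSqlLS",
                    "type": "LinkedServiceReference",
                },
                "typeProperties": {"storedProcedureName": "dbo.MergeSales"},
            },
            {
                # Error-handling branch: runs only when the copy fails.
                "name": "NotifyOnFailure",
                "type": "WebActivity",
                "dependsOn": [
                    {"activity": "CopySalesCsv", "dependencyConditions": ["Failed"]}
                ],
                "typeProperties": {
                    "url": "https://example.com/alerts",
                    "method": "POST",
                    "body": {"message": "CopySalesCsv failed"},
                },
            },
        ]
    },
}


def upstream_of(pipeline_def, activity_name):
    """Return (activity, conditions) pairs that the named activity waits on."""
    for activity in pipeline_def["properties"]["activities"]:
        if activity["name"] == activity_name:
            return [
                (dep["activity"], dep["dependencyConditions"])
                for dep in activity.get("dependsOn", [])
            ]
    raise KeyError(activity_name)


print(json.dumps(pipeline, indent=2))
```

Activities without a `dependsOn` entry have no ordering constraint between them, which is how parallel branches are expressed in this format.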
Integration Runtime (IR)
The execution infrastructure for ADF activities. This is one of the most exam-tested concepts.
| IR Type | Location | Use Case | Available in Synapse Pipelines? |
|---|---|---|---|
| Azure IR | Azure (managed) | Cloud-to-cloud data movement and Data Flows; no setup required | ✅ Yes |
| Self-hosted IR | Customer premises or VM | Access on-premises or private-network data sources | ✅ Yes (but not shareable — see below) |
| Azure-SSIS IR | Azure (managed) | Lift-and-shift SSIS packages to run natively in ADF | ❌ ADF only |
⚠️ Exam Caveat — IR Type Selection:
- On-premises source → Self-hosted IR (installed on a machine that can reach the source)
- Cloud-to-cloud → Azure IR
- Migrating SSIS packages without rewriting → Azure-SSIS IR (ADF only)
- One self-hosted IR shared across multiple data factories → ADF only (not available in Synapse Pipelines)
⚠️ Exam Caveat — ADF-Only IR Features: Two IR capabilities are exclusive to ADF and unavailable in Synapse Pipelines:
- Shared self-hosted IR: ADF allows one self-hosted IR to be shared (linked) across multiple data factories. Synapse Pipelines does not support this — each workspace must deploy its own self-hosted IR independently.
- Azure-SSIS IR: Only ADF can host an Azure-SSIS IR for running SSIS packages natively. If the scenario involves lifting SSIS packages to the cloud, the answer is ADF, not Synapse Pipelines.
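The selection rules above can be condensed into a small decision function. This is a study aid under simplifying assumptions, not an ADF API: the function name and both parameters are hypothetical, and checking the SSIS case first is a choice made for this sketch.

```python
def choose_integration_runtime(*, source_is_private: bool, runs_ssis_packages: bool) -> str:
    """Return the IR type suggested by the exam-style selection rules."""
    if runs_ssis_packages:
        # Lift-and-shift of SSIS packages without rewriting (ADF only).
        return "Azure-SSIS IR"
    if source_is_private:
        # On-premises or private-network source.
        return "Self-hosted IR"
    # Cloud-to-cloud movement and Data Flows.
    return "Azure IR"


print(choose_integration_runtime(source_is_private=True, runs_ssis_packages=False))
```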
Triggers
Define when a pipeline runs:
| Trigger Type | Description |
|---|---|
| Schedule | Wall-clock recurrence defined by frequency and interval rather than a cron expression (e.g., every day at 02:00 UTC) |
| Tumbling Window | Fixed, non-overlapping time slices; supports dependency chaining |
| Storage Event | Fires when a blob is created or deleted in Azure Blob/ADLS |
| Custom Event | Fires when a custom event arrives via Azure Event Grid |
⚠️ Exam Caveat — Tumbling Window vs Schedule: Tumbling Window triggers add retry policies, concurrency limits, backfill of past windows, and dependencies on other tumbling window triggers or on their own previous window. Each window maps to exactly one pipeline run, and a self-dependency keeps windows processing in order. Schedule triggers offer none of these guarantees. Use Tumbling Window for time-partitioned pipelines where backfill and ordering matter.
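The idea of fixed, non-overlapping time slices can be illustrated in plain Python. The generator below is a conceptual sketch of the slicing only, not ADF's implementation; the function name `tumbling_windows` and the sample dates are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def tumbling_windows(start, end, size):
    """Yield (window_start, window_end) pairs covering [start, end).

    Windows are contiguous and never overlap; a trailing partial
    window is not emitted.
    """
    window_start = start
    while window_start + size <= end:
        yield window_start, window_start + size
        window_start += size


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
windows = list(tumbling_windows(start, start + timedelta(hours=6), timedelta(hours=2)))

for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Setting `start` in the past produces one window per historical slice, which is the backfill behaviour that makes this trigger type suitable for time-partitioned loads.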
Mapping Data Flows
Mapping Data Flows are visually designed, code-free transformations that ADF executes on managed Apache Spark clusters (ADF provisions and scales the clusters, so there is no cluster management on your side). They support:
- Joins, aggregations, pivots, lookups, conditional splits
- Schema drift handling (dynamic schema evolution)
- Data quality and cleansing rules
- Debug mode for interactive testing
⚠️ Exam Caveat: Mapping Data Flows use Spark as the execution engine, so a run typically pays cluster startup time before any data is processed. That makes them a poor fit for small datasets or latency-sensitive scenarios; they are designed for batch transformations on large datasets. For low-latency transformations, use a Stored Procedure activity or an Azure Function activity.
Monitoring & Management
| Feature | Detail |
|---|---|
| Monitor tab | Visual pipeline run history, activity status, duration, errors |
| Azure Monitor integration | Pipeline metrics → Log Analytics, alerts on failure |
| Diagnostic logs | Activity runs, trigger runs, pipeline runs to Log Analytics |
| Email alerts | Configured via Azure Monitor action groups |
| Git integration | ADF supports GitHub or Azure DevOps Git for CI/CD of pipeline definitions |
Security
| Feature | Detail |
|---|---|
| Managed Identity | Preferred for authenticating to Azure services without storing credentials |
| Key Vault integration | Linked service credentials stored in Key Vault; ADF fetches at runtime |
| Managed VNet | ADF managed virtual network for private connectivity to data sources |
| Private Endpoints | Managed private endpoints from ADF managed VNet to sources/sinks |
| RBAC roles | Data Factory Contributor (built-in); a narrower operator role for running and monitoring pipelines is typically defined as a custom role |
| Encryption at rest | AES-256; CMK supported via Key Vault |
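As an illustration of the Key Vault row, the sketch below shows a linked service whose password is a reference to a Key Vault secret rather than an inline value, again as Python dictionaries mirroring the JSON authoring format. The names `CorpKeyVaultLS`, `SalesSqlLS`, and `sales-sql-password` are hypothetical, and the server, database, and vault values are placeholders.

```python
import json

# Hypothetical linked service pointing at the Key Vault itself.
key_vault_ls = {
    "name": "CorpKeyVaultLS",
    "properties": {
        "type": "AzureKeyVault",
        "typeProperties": {"baseUrl": "https://<vault-name>.vault.azure.net"},
    },
}

# Hypothetical SQL linked service: the password is resolved from Key Vault at runtime.
sql_ls = {
    "name": "SalesSqlLS",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": (
                "Server=tcp:<server>.database.windows.net,1433;"
                "Database=<database>;User ID=<user>"
            ),
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": key_vault_ls["name"],
                    "type": "LinkedServiceReference",
                },
                "secretName": "sales-sql-password",
            },
        },
    },
}

print(json.dumps(sql_ls, indent=2))
```

Only the secret's name appears in the definition, so the factory's Git repository never holds the password itself.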
ADF vs Synapse Pipelines
| Aspect | Azure Data Factory | Synapse Pipelines |
|---|---|---|
| Location | Standalone service | Embedded in Synapse workspace |
| Feature parity | Full feature set | Near-identical (shared codebase) |
| Best for | Standalone ETL, cross-workspace orchestration | ETL within a Synapse analytics project |
| Licensing | Separate resource | Included with Synapse workspace |
| Integration | Via linked services to Synapse | Native — pipelines can trigger Spark/SQL pool jobs directly |
| Data sharing across instances | ✅ Share data across data factories | ❌ Not supported |
| Cross-region data flows | ✅ Supported | ❌ Not supported |
| Azure-SSIS IR | ✅ Full support | ❌ Not supported |
| Shared self-hosted IR | ✅ One IR can be linked across multiple factories | ❌ Not supported — each workspace needs its own |
⚠️ Exam Caveat — When ADF Is Required Over Synapse Pipelines: Despite sharing the same underlying engine, there are four scenarios where the answer must be ADF rather than Synapse Pipelines:
- SSIS packages: Azure-SSIS IR only exists in ADF. If the scenario mentions lifting SSIS packages to Azure, Synapse Pipelines cannot do it.
- Shared IR across workloads: If multiple teams or factories need to share one self-hosted IR node, ADF supports linked/shared self-hosted IRs; Synapse does not.
- Cross-region data flows: ADF supports running Data Flows in a different Azure region from the source data; Synapse Pipelines does not.
- Multi-workspace data sharing: ADF pipelines can share datasets and linked services across factories; Synapse workspaces are isolated in this regard.
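The four scenarios above reduce to a simple membership check, sketched here as a memorisation aid. The feature keys and the function `choose_service` are hypothetical labels invented for this example, not identifiers from either product.

```python
# Hypothetical keys for the four ADF-only capabilities listed above.
ADF_ONLY_FEATURES = frozenset(
    {
        "azure_ssis_ir",
        "shared_self_hosted_ir",
        "cross_region_data_flows",
        "cross_factory_sharing",
    }
)


def choose_service(required_features, inside_synapse_workspace):
    """Pick ADF whenever any ADF-only capability is required."""
    if ADF_ONLY_FEATURES & set(required_features):
        return "Azure Data Factory"
    if inside_synapse_workspace:
        return "Synapse Pipelines"
    return "Azure Data Factory"


print(choose_service({"azure_ssis_ir"}, inside_synapse_workspace=True))
```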
Pricing Model
| Component | Billing |
|---|---|
| Orchestration | Per 1,000 activity runs (trigger and debug runs count as well) |
| Copy Activity | Per Data Integration Unit (DIU) hour |
| Data Flow | Per vCore-hour (cluster startup + execution time) |
| Azure IR | Data movement per DIU-hour; pipeline and external activities per execution hour |
| Self-hosted IR | ADF still bills orchestration runs and data-movement hours at self-hosted rates; the customer also pays for the host machine |
| Azure-SSIS IR | Per vCore-hour while running |
⚠️ Exam Caveat: Azure-SSIS IR is billed per hour while running, even if no packages execute. It should be started just before SSIS package execution and stopped immediately after to control costs.
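The cost effect of stopping the Azure-SSIS IR between runs is simple arithmetic, worked through below. The hourly rate is a made-up placeholder rather than a real Azure price, and `ssis_ir_cost` is a hypothetical helper that ignores node size and licensing.

```python
def ssis_ir_cost(hours_running: float, hourly_rate: float) -> float:
    """Cost of an Azure-SSIS IR that is billed for every hour it is running."""
    return hours_running * hourly_rate


HYPOTHETICAL_RATE = 1.00  # currency units per hour; NOT a real Azure price

# Left running all month (24 h x 30 days) versus started for a 2-hour nightly window.
always_on = ssis_ir_cost(24 * 30, HYPOTHETICAL_RATE)
on_demand = ssis_ir_cost(2 * 30, HYPOTHETICAL_RATE)

print(f"always on: {always_on:.2f}, on demand: {on_demand:.2f}")
print(f"saved by stopping when idle: {always_on - on_demand:.2f}")
```

Under these assumptions the on-demand schedule costs one twelfth of the always-on one, whatever the actual rate is.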
Common Exam Scenarios
| Scenario | Answer |
|---|---|
| Move data from on-premises Oracle to ADLS Gen2 | ADF pipeline with Self-hosted IR |
| Lift-and-shift existing SSIS packages to cloud | ADF with Azure-SSIS IR (Synapse cannot do this) |
| Share one self-hosted IR node across multiple pipelines in different factories | ADF shared self-hosted IR (not available in Synapse Pipelines) |
| Schedule-based batch ETL, cloud sources only | ADF pipeline with Azure IR + Schedule trigger |
| Trigger pipeline when a file lands in Blob storage | ADF Storage Event trigger |
| Code-free large-scale data transformation | ADF Mapping Data Flow |
| Store connection string securely in ADF | Key Vault-backed Linked Service |
| ETL within a Synapse Analytics workspace | Synapse Pipelines (same as ADF, when no ADF-only features needed) |
| Incremental data load with time-window ordering | ADF Tumbling Window trigger |
| Authenticate ADF to Azure services without storing credentials | Managed Identity on the ADF linked service |