🔀 Azure Data Factory

Cloud-scale data integration service — code-free and code-first ETL/ELT pipelines


Table of Contents

  1. Product Overview
  2. Core Concepts
    1. Linked Services
    2. Datasets
    3. Activities
    4. Pipelines
    5. Integration Runtime (IR)
    6. Triggers
  3. Mapping Data Flows
  4. Monitoring & Management
  5. Security
  6. ADF vs Synapse Pipelines
  7. Pricing Model
  8. Common Exam Scenarios

Product Overview

Azure Data Factory (ADF) is a fully managed, serverless data integration and orchestration service. It enables you to create data-driven workflows (pipelines) that orchestrate and automate data movement and transformation across on-premises, cloud, and SaaS sources.

ADF is the primary ETL/ELT tool in Azure, replacing the need for on-premises SSIS packages in most cloud architectures. It also serves as the orchestration layer for Azure Synapse Analytics workloads (Synapse Pipelines is ADF embedded in a Synapse workspace).

flowchart LR
    subgraph Sources["Data Sources"]
        ODB["On-premises DB\n(SQL, Oracle, SAP)"]
        SaaS["SaaS Apps\n(Salesforce, ServiceNow)"]
        Cloud["Cloud Stores\n(S3, GCS, REST APIs)"]
        Azure["Azure Services\n(Blob, ADLS, SQL)"]
    end
    subgraph ADF["Azure Data Factory"]
        LS["Linked Services\n(connection definitions)"]
        DS["Datasets\n(data shape + location)"]
        PL["Pipelines\n(workflow + orchestration)"]
        ACT["Activities\n(copy, transform, control)"]
        IR["Integration Runtime\n(execution engine)"]
        TRG["Triggers\n(schedule, event, tumbling)"]
    end
    subgraph Sinks["Destinations"]
        ADLS2["ADLS Gen2 / Blob"]
        SQLDW["Azure Synapse / SQL"]
        COSMOS["Cosmos DB"]
        EH["Event Hubs"]
    end
    Sources --> LS --> DS --> PL --> ACT --> IR --> Sinks
    TRG --> PL

Core Concepts

Linked Services

Connection definitions that store connection strings and credentials for data sources and sinks — analogous to ODBC DSNs. A linked service points to an external store or compute resource.
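
As a rough illustration, a linked service definition can be sketched as a Python dictionary that mirrors the JSON ADF stores. The names `SalesSqlDb`, `CorpKeyVault`, and `sales-sql-password`, along with the server and user in the connection string, are hypothetical placeholders. The password is referenced from Key Vault rather than embedded in the definition.

```python
import json

# Sketch of a linked service definition, written as a Python dict that
# mirrors ADF's JSON format. All names below are hypothetical placeholders.
linked_service = {
    "name": "SalesSqlDb",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            # Connection string without the secret; server, database, and
            # user are made-up values for illustration.
            "connectionString": (
                "Server=tcp:example.database.windows.net,1433;"
                "Database=sales;User ID=etl_user;"
            ),
            # The password is resolved from Key Vault at runtime.
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "CorpKeyVault",
                    "type": "LinkedServiceReference",
                },
                "secretName": "sales-sql-password",
            },
        },
    },
}

print(json.dumps(linked_service, indent=2))
```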

Datasets

Named references to data within a linked service — they define the structure, location, and format of the data (e.g., a specific Blob container folder, a SQL table, a CSV schema).
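
A dataset can be sketched the same way: a small definition that names a linked service and then describes location and format. The example below is an illustrative delimited-text dataset; `RawSalesCsv`, `LandingBlobStorage`, and the container and folder values are hypothetical.

```python
import json

# Sketch of a dataset definition as a Python dict mirroring ADF's JSON.
# Dataset, linked service, container, and folder names are hypothetical.
dataset = {
    "name": "RawSalesCsv",
    "properties": {
        "type": "DelimitedText",
        # The dataset does not hold credentials; it points at a linked service.
        "linkedServiceName": {
            "referenceName": "LandingBlobStorage",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            # Where the data lives inside the linked store.
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "landing",
                "folderPath": "sales/2024",
            },
            # How the data is formatted.
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}

print(json.dumps(dataset, indent=2))
```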

Activities

The steps inside a pipeline. ADF has three activity categories:

| Category | Examples |
| --- | --- |
| Data movement | Copy Activity (the main workhorse) |
| Data transformation | Data Flow, Databricks Notebook, HDInsight Hive, Stored Procedure, U-SQL |
| Control flow | ForEach, If Condition, Until, Wait, Execute Pipeline, Set Variable, Web Activity |

Pipelines

A logical grouping of activities that together perform a unit of work. Pipelines support branching, looping, parallelism, and error handling via control flow activities.
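
To make the grouping concrete, here is a minimal sketch of a two-activity pipeline expressed as a Python dictionary in the shape of ADF's pipeline JSON. The second activity runs only after the first succeeds. All pipeline, activity, and dataset names are hypothetical, and `unresolved_dependencies` is an illustrative helper written for this sketch, not part of any ADF SDK.

```python
# Sketch of a pipeline definition as a Python dict mirroring ADF's JSON.
# Pipeline, activity, and dataset names are hypothetical.
pipeline = {
    "name": "LoadSales",
    "properties": {
        "activities": [
            {
                "name": "CopyRawSales",
                "type": "Copy",
                "inputs": [{"referenceName": "RawSalesCsv", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "CuratedSalesParquet", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "ParquetSink"},
                },
            },
            {
                "name": "WaitBeforeNextStep",
                "type": "Wait",
                # Control flow: run only after the copy succeeds.
                "dependsOn": [
                    {"activity": "CopyRawSales", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {"waitTimeInSeconds": 30},
            },
        ]
    },
}


def unresolved_dependencies(pipeline_def):
    """Return (activity, missing_dependency) pairs for dependsOn entries
    that name an activity not defined in the pipeline."""
    activities = pipeline_def["properties"]["activities"]
    names = {a["name"] for a in activities}
    return [
        (a["name"], dep["activity"])
        for a in activities
        for dep in a.get("dependsOn", [])
        if dep["activity"] not in names
    ]


# An empty list means every dependency points at a defined activity.
print(unresolved_dependencies(pipeline))
```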

Integration Runtime (IR)

The execution infrastructure for ADF activities. This is one of the most exam-tested concepts.

| IR Type | Location | Use Case | Available in Synapse Pipelines? |
| --- | --- | --- | --- |
| Azure IR | Azure (managed) | Cloud-to-cloud data movement and Data Flows; no setup required | ✅ Yes |
| Self-hosted IR | Customer premises or VM | Access on-premises or private-network data sources | ✅ Yes (but not shareable; see below) |
| Azure-SSIS IR | Azure (managed) | Lift-and-shift SSIS packages to run natively in ADF | ❌ No (ADF only) |

⚠️ Exam Caveat — IR Type Selection:

  • On-premises source → Self-hosted IR (installed on a machine that can reach the source)
  • Cloud-to-cloud → Azure IR
  • Migrating SSIS packages without rewriting → Azure-SSIS IR (ADF only)
  • Self-hosted IR can be shared across multiple ADF instances — this sharing capability is ADF only and does not apply to Synapse Pipelines

⚠️ Exam Caveat — ADF-Only IR Features: Two IR capabilities are exclusive to ADF and unavailable in Synapse Pipelines:

  • Shared self-hosted IR: ADF allows one self-hosted IR to be shared (linked) across multiple data factories. Synapse Pipelines does not support this — each workspace must deploy its own self-hosted IR independently.
  • Azure-SSIS IR: Only ADF can host an Azure-SSIS IR for running SSIS packages natively. If the scenario involves lifting SSIS packages to the cloud, the answer is ADF, not Synapse Pipelines.
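
The selection rules above can be condensed into a small decision helper. This is a study aid only: `IRType` and `choose_ir` are hypothetical names introduced for this sketch, and the two boolean inputs simplify real scenarios.

```python
from enum import Enum


class IRType(Enum):
    AZURE = "Azure IR"
    SELF_HOSTED = "Self-hosted IR"
    AZURE_SSIS = "Azure-SSIS IR"


def choose_ir(source_is_private: bool, runs_ssis_packages: bool) -> IRType:
    """Pick an integration runtime type using the exam rules of thumb.

    source_is_private: the data source is on-premises or reachable only
        from inside a private network.
    runs_ssis_packages: existing SSIS packages must run without a rewrite.
    """
    if runs_ssis_packages:
        # Lift-and-shift of SSIS packages needs the Azure-SSIS IR (ADF only).
        return IRType.AZURE_SSIS
    if source_is_private:
        # The runtime must sit on a machine that can reach the source.
        return IRType.SELF_HOSTED
    # Cloud-to-cloud movement and Data Flows need no setup.
    return IRType.AZURE


print(choose_ir(source_is_private=True, runs_ssis_packages=False).value)
print(choose_ir(source_is_private=False, runs_ssis_packages=False).value)
print(choose_ir(source_is_private=False, runs_ssis_packages=True).value)
```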

Triggers

Define when a pipeline runs:

| Trigger Type | Description |
| --- | --- |
| Schedule | Wall-clock recurrence (e.g., every day at 02:00 UTC), defined by a frequency and interval rather than a cron expression |
| Tumbling Window | Fixed-size, contiguous, non-overlapping time slices; supports dependency chaining |
| Storage Event | Fires when a blob is created or deleted in Azure Blob Storage or ADLS Gen2 |
| Custom Event | Fires when a custom event arrives via Azure Event Grid |

⚠️ Exam Caveat — Tumbling Window vs Schedule: Tumbling Window triggers add retry policies, concurrency limits, and window dependencies (including self-dependency). Each window maps to exactly one pipeline run, and past windows can be backfilled; with a self-dependency or a concurrency of 1, windows are processed in order. Schedule triggers offer none of these guarantees. Use Tumbling Window for time-partitioned pipelines where backfill and ordering matter.
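
To illustrate what "fixed, non-overlapping time slices" means in practice, the following sketch enumerates the windows between a start and end time. It is a plain Python model of the concept, not ADF code; `tumbling_windows` is a hypothetical helper, and it assumes a window is only emitted once it has fully elapsed.

```python
from datetime import datetime, timedelta, timezone


def tumbling_windows(start, end, size):
    """Yield (window_start, window_end) pairs covering [start, end) in
    fixed, contiguous, non-overlapping slices. A partial trailing window
    is not emitted, on the assumption that a window is processed only
    after it has fully elapsed."""
    if size <= timedelta(0):
        raise ValueError("size must be positive")
    window_start = start
    while window_start + size <= end:
        yield window_start, window_start + size
        window_start += size


# A start time in the past produces one window per elapsed slice, which is
# the behaviour that makes backfill possible.
start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 3, 30, tzinfo=timezone.utc)
for window_start, window_end in tumbling_windows(start, end, timedelta(hours=1)):
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window's end is the next window's start, so a pipeline that filters on `[window_start, window_end)` neither skips nor double-counts records.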


Mapping Data Flows

Mapping Data Flows are visually designed, code-free transformations that run on Apache Spark clusters provisioned and managed by ADF (no cluster management needed). They support:

  • Joins, aggregations, pivots, lookups, conditional splits
  • Schema drift handling (dynamic schema evolution)
  • Data quality and cleansing rules
  • Debug mode for interactive testing

⚠️ Exam Caveat: Mapping Data Flows use Spark as the execution engine, so every run carries cluster start-up overhead. They are designed for batch transformations on large datasets and are a poor fit for small datasets or latency-sensitive scenarios. For low-latency transformations, use a Stored Procedure activity or an Azure Function activity.


Monitoring & Management

| Feature | Detail |
| --- | --- |
| Monitor tab | Visual pipeline run history, activity status, duration, errors |
| Azure Monitor integration | Pipeline metrics → Log Analytics, alerts on failure |
| Diagnostic logs | Activity runs, trigger runs, pipeline runs to Log Analytics |
| Email alerts | Configured via Azure Monitor action groups |
| Git integration | ADF supports GitHub or Azure DevOps Git for CI/CD of pipeline definitions |

Security

| Feature | Detail |
| --- | --- |
| Managed Identity | Preferred for authenticating to Azure services without storing credentials |
| Key Vault integration | Linked service credentials stored in Key Vault; ADF fetches at runtime |
| Managed VNet | ADF managed virtual network for private connectivity to data sources |
| Private Endpoints | Managed private endpoints from ADF managed VNet to sources/sinks |
| RBAC roles | Data Factory Contributor, Data Factory Operator |
| Encryption at rest | AES-256; CMK supported via Key Vault |

ADF vs Synapse Pipelines

| Aspect | Azure Data Factory | Synapse Pipelines |
| --- | --- | --- |
| Location | Standalone service | Embedded in Synapse workspace |
| Feature parity | Full feature set | Near-identical (shared codebase) |
| Best for | Standalone ETL, cross-workspace orchestration | ETL within a Synapse analytics project |
| Licensing | Separate resource | Included with Synapse workspace |
| Integration | Via linked services to Synapse | Native; pipelines can trigger Spark/SQL pool jobs directly |
| Data sharing across instances | ✅ Share data across data factories | ❌ Not supported |
| Cross-region data flows | ✅ Supported | ❌ Not supported |
| Azure-SSIS IR | ✅ Full support | ❌ Not supported |
| Shared self-hosted IR | ✅ One IR can be linked across multiple factories | ❌ Not supported; each workspace needs its own |

⚠️ Exam Caveat — When ADF Is Required Over Synapse Pipelines: Despite sharing the same underlying engine, there are four scenarios where the answer must be ADF rather than Synapse Pipelines:

  • SSIS packages: Azure-SSIS IR only exists in ADF. If the scenario mentions lifting SSIS packages to Azure, Synapse Pipelines cannot do it.
  • Shared IR across workloads: If multiple teams or factories need to share one self-hosted IR node, ADF supports linked/shared self-hosted IRs; Synapse does not.
  • Cross-region data flows: ADF supports running Data Flows in a different Azure region from the source data; Synapse Pipelines does not.
  • Multi-workspace data sharing: ADF pipelines can share datasets and linked services across factories; Synapse workspaces are isolated in this regard.
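
The four scenarios above amount to a set-membership check: if a requirement falls in the ADF-only set, the answer is ADF. A minimal sketch follows; the feature labels and the `choose_service` helper are hypothetical identifiers chosen for this example, not official names.

```python
# Features these notes list as ADF-only. The string labels are illustrative
# identifiers chosen for this sketch, not official product names.
ADF_ONLY_FEATURES = frozenset({
    "azure_ssis_ir",
    "shared_self_hosted_ir",
    "cross_region_data_flows",
    "cross_factory_data_sharing",
})


def choose_service(required_features):
    """Return "Azure Data Factory" if any required feature is ADF-only,
    otherwise "Synapse Pipelines" (assuming the work already lives in a
    Synapse workspace)."""
    blocking = ADF_ONLY_FEATURES & set(required_features)
    return "Azure Data Factory" if blocking else "Synapse Pipelines"


print(choose_service(["copy_activity", "azure_ssis_ir"]))
print(choose_service(["copy_activity", "mapping_data_flow"]))
```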

Pricing Model

| Component | Billing |
| --- | --- |
| Orchestration | Per activity run (priced per 1,000 runs); execution time is billed separately |
| Copy Activity | Per Data Integration Unit (DIU) hour |
| Data Flow | Per vCore-hour (cluster startup + execution time) |
| Azure IR | Per DIU-hour for data movement |
| Self-hosted IR | Activity runs and data movement hours are billed at self-hosted rates; the customer also pays for the host machine |
| Azure-SSIS IR | Per vCore-hour while running |

⚠️ Exam Caveat: Azure-SSIS IR is billed per hour while running, even if no packages execute. It should be started just before SSIS package execution and stopped immediately after to control costs.
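
A quick back-of-the-envelope calculation shows why the start/stop pattern matters. The sketch below compares an always-on Azure-SSIS IR with one that runs only during a two-hour nightly window. `HOURLY_RATE` is a placeholder value, not a real price, and `azure_ssis_ir_cost` is a hypothetical helper.

```python
def azure_ssis_ir_cost(hours_running: float, hourly_rate: float) -> float:
    """Cost of keeping the IR running, whether or not packages execute."""
    if hours_running < 0 or hourly_rate < 0:
        raise ValueError("inputs must be non-negative")
    return hours_running * hourly_rate


HOURLY_RATE = 1.0  # placeholder rate for the chosen node size, NOT a real price
DAYS = 30

always_on = azure_ssis_ir_cost(24 * DAYS, HOURLY_RATE)
on_demand = azure_ssis_ir_cost(2 * DAYS, HOURLY_RATE)  # 2-hour nightly window

print(f"always on: {always_on:.2f}")
print(f"on demand: {on_demand:.2f}")
print(f"saving:    {1 - on_demand / always_on:.0%}")
```

Whatever the actual rate, running 2 hours a day instead of 24 costs one twelfth as much, because the charge scales linearly with running time.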


Common Exam Scenarios

| Scenario | Answer |
| --- | --- |
| Move data from on-premises Oracle to ADLS Gen2 | ADF pipeline with Self-hosted IR |
| Lift-and-shift existing SSIS packages to cloud | ADF with Azure-SSIS IR (Synapse cannot do this) |
| Share one self-hosted IR node across multiple pipelines in different factories | ADF shared self-hosted IR (not available in Synapse Pipelines) |
| Schedule-based batch ETL, cloud sources only | ADF pipeline with Azure IR + Schedule trigger |
| Trigger pipeline when a file lands in Blob storage | ADF Storage Event trigger |
| Code-free large-scale data transformation | ADF Mapping Data Flow |
| Store connection string securely in ADF | Key Vault-backed Linked Service |
| ETL within a Synapse Analytics workspace | Synapse Pipelines (same as ADF, when no ADF-only features needed) |
| Incremental data load with time-window ordering | ADF Tumbling Window trigger |
| ADF pipeline credentials using managed identity | Managed Identity on ADF linked service |