How to Build an Azure Data Pipeline from Scratch, Step by Step from Ingestion to Reporting
From Zero to Data Hero: Ingestion → Processing → Reporting
End-to-End Explanation
Use case: A retail e-commerce company wants a daily and near-real-time analytics platform that ingests transactional data (on-prem POS), clickstream events (APIs), and CSV product feeds.
Consumers: analysts (Power BI), ML team, product owners.
Design principles & architecture overview
Goal: Design an architecture that is:
Scalable: separate compute from storage; use Databricks/Synapse compute that autoscales.
Reliable & idempotent: use Delta Lake (ACID) for retries; use watermarking for incremental loads.
Observable: logging, metrics, and alerting for each pipeline.
Secure: Secrets in Key Vault, managed identities, private endpoints.
Cost-aware: right-size compute, leverage serverless Synapse where possible.
High-level flow:
Source systems → ADF ingestion (batch and streaming ingestion)
Raw data lands in ADLS Gen2 (Bronze / Delta)
Databricks cleans & enriches to Silver (Delta)
Aggregations / business logic produce Gold (Delta or Synapse tables)
Synapse serves analytical models (dedicated SQL pool or serverless SQL)
Power BI consumes Synapse for dashboards and reports
Monitoring, CI/CD, and data governance wrap the pipeline
Components & how they work
1. Data Sources
On-premises SQL DB: transactional system (orders, customers). Use a Self-Hosted Integration Runtime to move data securely.
APIs: marketing/ad platforms and clickstream sources. ADF can call REST APIs, or use Event Hubs for streaming (a producer sketch follows this list).
Files (CSV, Parquet): Vendor product catalogs or batch feeds. Can be uploaded to ADLS via SFTP connector or mapped drive.
Streaming (optional): IoT or clickstream through Azure Event Hubs or Kafka.
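To make the streaming path concrete, here is a minimal sketch of publishing clickstream events to Event Hubs from Python. It assumes the azure-eventhub package; the connection string and the hub name "clickstream" are hypothetical placeholders.

```python
# Minimal sketch: publishing clickstream events to Azure Event Hubs.
# The connection string and hub name are hypothetical placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUBS_CONNECTION_STRING>",
    eventhub_name="clickstream",
)

with producer:
    # Batch events before sending to keep throughput high and respect
    # the hub's per-message size limits.
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(
        {"user_id": 42, "page": "/cart", "ts": "2024-01-01T00:00:00Z"})))
    producer.send_batch(batch)
```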
2. Azure Data Factory (ADF) - ingestion & orchestration
Roles: schedule triggers, orchestrate data flows, move raw data, and trigger Databricks jobs.
Patterns:
Copy Activity: moves data to ADLS Bronze folder. Use schema drift settings and column mapping.
Mapping Data Flow: for light transformations without writing Spark code (ADF runs them on managed Spark); use for simple cleansing.
Wrangling Data Flow: Power Query-based, for business users to profile and transform CSVs.
Integration Runtime: use a self-hosted IR for on-prem sources or an Azure IR with a managed virtual network for private networking.
Best practices:
Use parameterized pipelines for reuse (env, table).
Implement incremental loads (watermark columns, change tracking); see the watermark sketch after this list.
Keep ingestion idempotent: land raw files with unique run IDs and maintain ingestion logs.
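ADF usually expresses the watermark pattern with a Lookup activity feeding a parameterized Copy Activity. The sketch below shows the equivalent logic in PySpark on Databricks, under assumed names: a control table etl_control.watermarks, an orders source with a last_modified column, and a Key Vault-backed secret scope kv-scope.

```python
# Watermark-based incremental load - a sketch, not a drop-in implementation.
# Assumes a Databricks Spark session (spark, dbutils available).

# 1. Read the last successful watermark from a (hypothetical) control table.
watermark = (spark.read.table("etl_control.watermarks")
             .where("source_table = 'orders'")
             .first()["last_value"])

# 2. Pull only rows changed since the watermark from the source database.
incremental = (spark.read.format("jdbc")
               .option("url", "jdbc:sqlserver://onprem-sql:1433;databaseName=pos")
               .option("dbtable",
                       f"(SELECT * FROM dbo.orders "
                       f"WHERE last_modified > '{watermark}') src")
               .option("user", "etl_user")
               .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
               .load())

# 3. Append to Bronze; after success, advance the watermark in the control
#    table (not shown) so a rerun picks up where this run left off.
incremental.write.format("delta").mode("append").save("/mnt/datalake/bronze/orders/")
```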
3. ADLS Gen2 + Delta Lake (Medallion)
Storage: ADLS Gen2 container with folders for bronze/silver/gold. Use Delta Lake format for ACID, schema evolution, time travel, and fast merges.
Bronze: raw, unaltered ingest (keep source schema + metadata like ingestion_timestamp, source_file, run_id). Good for lineage and reprocessing.
Silver: cleaned and enhanced (type conversions, dedup, surrogate keys, basic joins). This is where business rule enforcement lives.
Gold: aggregated, business-ready tables/views for reporting (denormalized star schemas or aggregated views). Gold may be physically persisted in Synapse dedicated SQL pool for performance.
Partitioning: partition large tables by ingestion_date or event_date; avoid the small-files problem with regular compaction. A Bronze landing sketch follows.
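A minimal sketch of the Bronze landing step, assuming a Databricks Spark session; the paths and the run_id value are illustrative. It adds the lineage metadata described above and partitions by ingestion date.

```python
# Land raw data in Bronze with lineage metadata and date partitioning.
from pyspark.sql import functions as F

raw = spark.read.parquet("/mnt/landing/orders/2024-01-01/")

bronze = (raw
          .withColumn("ingestion_timestamp", F.current_timestamp())
          .withColumn("ingestion_date", F.current_date())
          .withColumn("source_file", F.input_file_name())
          .withColumn("run_id", F.lit("adf-run-0001")))  # hypothetical run ID

(bronze.write.format("delta")
 .mode("append")
 .partitionBy("ingestion_date")
 .save("/mnt/datalake/bronze/orders/"))
```

Partitioning by ingestion_date keeps daily reprocessing cheap: a bad day's partition can be rewritten without touching the rest of the table.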
4. Azure Databricks (Processing & Transformations)
Why Databricks: best suited for large-scale Spark jobs, Delta Lake integration, collaborative notebooks, and ML pipelines.
Jobs: scheduled via ADF or the Databricks Jobs API:
Bronze → Silver transformations (data quality rules, joins, enrichment).
Silver → Gold aggregations (hourly/daily), slowly changing dimension (SCD) logic.
Development workflow: use git integration (GitHub/Azure Repos), CI for notebooks and jobs.
Techniques:
Use Delta MERGE for SCD logic and upserts (see the sketch after this list).
Use Structured Streaming for near-real-time ingestion into Bronze, then micro-batch processing.
Implement schema validation with Great Expectations or custom checks.
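A minimal sketch of the Delta MERGE upsert mentioned above (SCD Type 1), assuming Databricks with the delta-spark package; the paths and the order_id key are illustrative.

```python
# Upsert Bronze changes into Silver with Delta MERGE (SCD Type 1).
from delta.tables import DeltaTable

silver = DeltaTable.forPath(spark, "/mnt/datalake/silver/orders/")
updates = (spark.read.format("delta")
           .load("/mnt/datalake/bronze/orders/")
           .dropDuplicates(["order_id"]))  # one row per key before merging

(silver.alias("t")
 .merge(updates.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()      # overwrite changed rows
 .whenNotMatchedInsertAll()   # insert new rows
 .execute())
```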
5. Azure Synapse Analytics (Warehousing & Serving)
Serverless SQL pool for ad hoc queries over the data lake (good for analysts); a query sketch follows this list.
Dedicated SQL pool (formerly SQL DW) if you need fast, concurrent BI workloads — load Gold tables into dedicated pools.
Synapse Workspace: orchestration notebooks, pipelines, integrated security, and data exploration.
Materialized views or aggregated tables for Power BI performance.
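As a sketch of how an analyst or script hits the serverless pool, the snippet below queries a Gold Delta folder through a Synapse serverless SQL endpoint using pyodbc; the server name, storage URL, and authentication mode are placeholder assumptions.

```python
# Query a Gold Delta folder via Synapse serverless SQL (a sketch;
# endpoint, database, and storage path are placeholders).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;Authentication=ActiveDirectoryInteractive;"
)

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/gold/daily_sales/',
    FORMAT = 'DELTA'
) AS daily_sales;
"""
for row in conn.execute(query):
    print(row)
```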
6. Power BI (Reporting)
Connectivity: DirectQuery to Synapse for fresh data; Import mode for fast dashboards. Use incremental refresh for large datasets.
Models: star schema built from the Gold layer; use calculation groups and row-level security (RLS) if needed.
Deployment: Power BI Service workspaces, deployment pipelines, and lineage views for impact analysis.
7. Security (Key Vault, Managed Identities, Private Endpoints)
Key Vault: store DB passwords, service principals, and secrets. Grant ADF/Databricks/Synapse access via managed identity (see the sketch after this list).
Managed Identity: prefer system-assigned identities for services to access storage and other resources.
Private Endpoints & VNet: use private endpoints for ADLS, Databricks workspace, and Synapse to restrict public access.
RBAC & Access Policies: least privilege principle; separate dev/test/prod.
Encryption: use Customer-Managed Keys (CMK) if required.
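A minimal sketch of the Key Vault + managed identity pattern using azure-identity and azure-keyvault-secrets; the vault URL and secret name are hypothetical.

```python
# Read a secret from Key Vault using the ambient managed identity.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # picks up the managed identity in Azure
client = SecretClient(vault_url="https://my-kv.vault.azure.net/",  # hypothetical
                      credential=credential)

sql_password = client.get_secret("onprem-sql-password").value
```

On Databricks, the equivalent is usually a Key Vault-backed secret scope read with dbutils.secrets.get, so no credential ever lands in notebook code.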
8. Observability & Alerting
Azure Monitor & Log Analytics: collect metrics and logs from ADF, Databricks, Synapse.
ADF Monitoring: pipeline run history and activity logs. Configure alerts for failures and SLA breaches; a failure-query sketch follows this list.
Databricks: job logs, Spark UI integration, cluster metrics. Send alerts on job failures.
Logic Apps / Email / Teams: on alert, send notifications, or invoke remediation playbooks.
Data Quality Dashboard: show failed rows, null counts, skew, throughput.
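As one way to drive that dashboard, the sketch below pulls recent ADF pipeline failures from Log Analytics with azure-monitor-query; the workspace ID is a placeholder, and the ADFPipelineRun table assumes diagnostic settings route ADF logs to the workspace in resource-specific mode.

```python
# Query recent ADF pipeline failures from Log Analytics (a sketch).
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

kql = """
ADFPipelineRun
| where Status == 'Failed'
| summarize failures = count() by PipelineName
"""
response = client.query_workspace(
    workspace_id="<LOG_ANALYTICS_WORKSPACE_ID>",  # placeholder
    query=kql,
    timespan=timedelta(days=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```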
9. CI/CD & Automation
Source Control: store Terraform/ARM templates, Databricks notebooks, ADF JSON in Git.
CI Pipeline: run tests (notebook static checks, unit tests).
CD Pipeline: deploy ARM templates for infra, ADF pipelines, Databricks jobs (via the Databricks CLI or REST API; see the sketch after this list), and Power BI deployment scripts.
Environments: dev/test/staging/prod with parameterized deployments.
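A sketch of the Databricks job deployment step from a CD pipeline, calling the Jobs API 2.1 directly with requests; the workspace URL, token, notebook path, and cluster settings are placeholders (in practice the token comes from the pipeline's secret store).

```python
# Create a Databricks job from a CD pipeline via the Jobs API 2.1 (a sketch).
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
headers = {"Authorization": "Bearer <DATABRICKS_TOKEN>"}     # placeholder

job_spec = {
    "name": "bronze-to-silver-orders",
    "tasks": [{
        "task_key": "transform",
        "notebook_task": {"notebook_path": "/Repos/data/bronze_to_silver"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
}

resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```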
10. Data Governance & Lineage
Data Catalog: Microsoft Purview for metadata, lineage, and data classification.
Lineage: capture source → transformation → target lineage for audits.
Policies: retention, access control, PII masking.
Example daily processing flow
Nightly batch: ADF pipeline triggered at 01:00.
Copy Activity: on-prem orders table (incremental using last_modified watermark) → ADLS /bronze/orders/date=YYYYMMDD/*.parquet (+ ingestion metadata).
Copy: vendor CSVs → ADLS /bronze/vendors/
Databricks job: triggered by ADF after the copy succeeds. Databricks reads Bronze orders for the date, applies dedup/cleaning, enriches with the product master from Silver/cache, and writes to /silver/orders/.
Aggregate job: another Databricks job computes daily sales aggregates, writes to /gold/daily_sales/ (sketch below), and updates the Synapse dedicated pool via PolyBase or COPY INTO.
Power BI refresh: dataset refresh scheduled or incremental refresh querying Synapse; analysts can view dashboards.
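A sketch of the aggregate job's core, assuming Databricks; paths, columns, and the hard-coded date are illustrative (a real job would take the date as a parameter).

```python
# Silver -> Gold: daily sales aggregates for one processing date (a sketch).
from pyspark.sql import functions as F

orders = (spark.read.format("delta").load("/mnt/datalake/silver/orders/")
          .where(F.col("event_date") == "2024-01-01"))

daily_sales = (orders.groupBy("event_date", "product_id")
               .agg(F.sum("amount").alias("total_sales"),
                    F.countDistinct("order_id").alias("order_count")))

# replaceWhere makes the daily rerun idempotent: only this date's
# partition is overwritten.
(daily_sales.write.format("delta").mode("overwrite")
 .option("replaceWhere", "event_date = '2024-01-01'")
 .save("/mnt/datalake/gold/daily_sales/"))
```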
Data quality & testing
Unit tests for transformation logic (PySpark unit tests with pytest; an example follows this list).
Assertion checks: row counts sanity, null thresholds, referential integrity.
Contract tests: enforce schema expectations at ingestion (column types, presence).
Automated reprocessing: track failed files in a recovery folder and retry with root cause logs.
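An example of the pytest pattern for PySpark transformations, runnable locally; clean_orders is a hypothetical function standing in for real transformation logic.

```python
# Local PySpark unit test with pytest (a sketch).
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def clean_orders(df):
    """Hypothetical transformation: drop duplicates, keep non-null IDs."""
    return df.dropDuplicates(["order_id"]).where("order_id IS NOT NULL")

def test_clean_orders_removes_duplicates_and_nulls(spark):
    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (None, "b")], ["order_id", "sku"])
    result = clean_orders(df)
    assert result.count() == 1
```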
Cost optimization & performance tips
Use Databricks job clusters (ephemeral clusters) rather than all-day interactive clusters.
Use serverless Synapse where concurrency is low; use dedicated pool for heavy BI with resource governance.
Partitioning + file sizing: compact small files with Delta OPTIMIZE (sketch after this list).
Cache frequently used tables in Databricks or Synapse materialized views.
Use spot instances where appropriate (Databricks supports spot).
Monitor costs with Azure Cost Management; set budgets and alerts.
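The compaction step from the list above, as it might look on Databricks (OPTIMIZE with Z-ordering is Databricks-specific; the path and column are illustrative):

```python
# Compact small files and co-locate data for a common filter column.
spark.sql("""
  OPTIMIZE delta.`/mnt/datalake/silver/orders/`
  ZORDER BY (customer_id)
""")
```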
Operations & Runbook examples
Pipeline failure: auto-alert → apply the basic retry policy → escalate to on-call if failures continue.
Schema change: detect with ADF schema drift or Databricks schema checks → create a migration plan and test it in dev.
Data breach: rotate Key Vault secrets, disable compromised credentials, notify the security team, and run forensic queries.
Advanced patterns
Streaming architecture: Event Hubs → Azure Stream Analytics or Structured Streaming into Delta tables (streaming Bronze) → incremental enrichments (see the sketch after this list).
Feature store: store ML features in Delta tables; integrate MLflow in Databricks for experiments.
RAG / ML inference: batch or online inference that writes predictions back into Gold for reports.
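A sketch of the streaming Bronze path using Event Hubs' Kafka-compatible endpoint with Structured Streaming on Databricks; the namespace, hub name, paths, and secret scope are placeholders.

```python
# Stream clickstream events from Event Hubs (Kafka endpoint) into Bronze.
from pyspark.sql import functions as F

# Connection string read from a Key Vault-backed secret scope (Databricks).
conn = dbutils.secrets.get("kv-scope", "eventhubs-connection-string")

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers",
                  "mynamespace.servicebus.windows.net:9093")  # placeholder
          .option("subscribe", "clickstream")
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.sasl.mechanism", "PLAIN")
          .option("kafka.sasl.jaas.config",
                  'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
                  f'required username="$ConnectionString" password="{conn}";')
          .load())

(stream.select(F.col("value").cast("string").alias("payload"),
               F.current_timestamp().alias("ingestion_timestamp"))
 .writeStream.format("delta")
 .option("checkpointLocation", "/mnt/datalake/_checkpoints/clickstream/")
 .trigger(availableNow=True)  # catch-up micro-batch; omit for continuous
 .start("/mnt/datalake/bronze/clickstream/"))
```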
Implementation checklist (practical steps)
Create subscription + resource groups for dev/test/prod.
Provision ADLS Gen2 with hierarchical namespace.
Create Key Vault, set policies.
Provision ADF (Git integration) and create self-hosted IR if needed.
Create a Databricks workspace (enable VNet injection for private networking).
Create Synapse workspace (serverless and dedicated as needed).
Define Bronze/Silver/Gold folder structure and naming conventions.
Implement ingestion pipelines in ADF (copy activities).
Develop Databricks notebooks & jobs (bronze→silver→gold).
Configure Synapse tables and decide DirectQuery vs Import for Power BI.
Set up monitoring, alerts, and CI/CD pipelines (DevOps).
Run end-to-end test with sample data, validate SLAs.
Team & roles
Data Engineer: build ETL/ELT, pipelines, Databricks jobs.
Platform Engineer: infra, networking, Key Vault, security.
Analyst / BI: define reporting needs, validate gold models.
Data QA: test suites and data quality.
DevOps: CI/CD pipelines and release management.
Example technologies matrix
Ingestion: Azure Data Factory, Event Hubs
Storage: ADLS Gen2, Delta Lake
Processing: Azure Databricks (PySpark), Databricks Delta
Warehouse: Azure Synapse (Serverless + Dedicated)
BI: Power BI
Security: Azure Key Vault, Managed Identities, Private Endpoints
Monitoring: Azure Monitor, Log Analytics, Application Insights
Governance: Microsoft Purview
Summary & outcomes
This architecture provides a robust, secure, and scalable pipeline that supports both batch and streaming workloads, preserves raw data for audit, gives business-ready outputs for reporting, and supports machine learning needs. With appropriate governance, CI/CD, monitoring, and cost controls, this design can be productionized and operated by cross-functional teams.
