How to Build an Azure Data Pipeline from Scratch, Step by Step from Ingestion to Reporting
From Zero to Data Hero: Ingestion → Processing → Reporting
End-to-End Explanation
Use case: A retail e-commerce company wants a daily and near-real-time analytics platform that ingests transactional data (on-prem POS), clickstream events (APIs), and CSV product feeds.
Consumers: analysts (Power BI), ML team, product owners.
Design principles & architecture overview
Goal: Design an architecture that is:
Scalable: separate compute from storage; use Databricks/Synapse compute that autoscales.
Reliable & idempotent: use Delta Lake (ACID) for retries; use watermarking for incremental loads.
Observable: logging, metrics, and alerting for each pipeline.
Secure: Secrets in Key Vault, managed identities, private endpoints.
Cost-aware: right-size compute, leverage serverless Synapse where possible.
High-level flow:
Source systems → ADF ingestion (batch and streaming ingestion)
Raw data lands in ADLS Gen2 (Bronze / Delta)
Databricks cleans & enriches to Silver (Delta)
Aggregations / business logic produce Gold (Delta or Synapse tables)
Synapse serves analytical models (dedicated SQL pool or serverless SQL)
Power BI consumes Synapse for dashboards and reports
Monitoring, CI/CD, and data governance wrap the pipeline
Components & how they work
1. Data Sources
On-premises SQL DB: transactional system (orders, customers). Use a Self-Hosted Integration Runtime to move data securely.
APIs: marketing/ad platforms and clickstream sources. ADF can call REST APIs, or use Event Hubs for streaming (a producer sketch follows this list).
Files (CSV, Parquet): Vendor product catalogs or batch feeds. Can be uploaded to ADLS via SFTP connector or mapped drive.
Streaming (optional): IoT or clickstream through Azure Event Hubs or Kafka.
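To make the streaming path concrete, here is a minimal sketch of publishing clickstream events to Event Hubs from Python. It assumes the azure-eventhub package; the connection string and the hub name "clickstream" are hypothetical placeholders.

```python
# Minimal sketch: publishing clickstream events to Azure Event Hubs.
# The connection string and hub name are hypothetical placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUBS_CONNECTION_STRING>",
    eventhub_name="clickstream",
)

with producer:
    # Batch events before sending to keep throughput high and respect
    # the hub's per-message size limits.
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(
        {"user_id": 42, "page": "/cart", "ts": "2024-01-01T00:00:00Z"})))
    producer.send_batch(batch)
```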
2. Azure Data Factory (ADF) - ingestion & orchestration
Roles: schedule triggers, orchestrate data flows, move raw data, and trigger Databricks jobs.
Patterns:
Copy Activity: moves data to ADLS Bronze folder. Use schema drift settings and column mapping.
Mapping Data Flow: for light transformations without writing Spark code (ADF runs them on managed Spark); use for simple cleansing.
Wrangling Data Flow: Power Query-based, for business users to profile and transform CSVs.
Integration Runtime: use a self-hosted IR for on-prem sources or an Azure IR with a managed virtual network for private networking.
Best practices:
Use parameterized pipelines for reuse (env, table).
Implement incremental loads (watermark columns, change tracking); see the watermark sketch after this list.
Keep ingestion idempotent: land raw files with unique run IDs and maintain ingestion logs.
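ADF usually expresses the watermark pattern with a Lookup activity feeding a parameterized Copy Activity. The sketch below shows the equivalent logic in PySpark on Databricks, under assumed names: a control table etl_control.watermarks, an orders source with a last_modified column, and a Key Vault-backed secret scope kv-scope.

```python
# Watermark-based incremental load - a sketch, not a drop-in implementation.
# Assumes a Databricks Spark session (spark, dbutils available).

# 1. Read the last successful watermark from a (hypothetical) control table.
watermark = (spark.read.table("etl_control.watermarks")
             .where("source_table = 'orders'")
             .first()["last_value"])

# 2. Pull only rows changed since the watermark from the source database.
incremental = (spark.read.format("jdbc")
               .option("url", "jdbc:sqlserver://onprem-sql:1433;databaseName=pos")
               .option("dbtable",
                       f"(SELECT * FROM dbo.orders "
                       f"WHERE last_modified > '{watermark}') src")
               .option("user", "etl_user")
               .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
               .load())

# 3. Append to Bronze; after success, advance the watermark in the control
#    table (not shown) so a rerun picks up where this run left off.
incremental.write.format("delta").mode("append").save("/mnt/datalake/bronze/orders/")
```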
3. ADLS Gen2 + Delta Lake (Medallion)
Storage: ADLS Gen2 container with folders for bronze/silver/gold. Use Delta Lake format for ACID, schema evolution, time travel, and fast merges.
Bronze: raw, unaltered ingest (keep source schema + metadata like ingestion_timestamp, source_file, run_id). Good for lineage and reprocessing.
Silver: cleaned and enhanced (type conversions, dedup, surrogate keys, basic joins). This is where business rule enforcement lives.
Gold: aggregated, business-ready tables/views for reporting (denormalized star schemas or aggregated views). Gold may be physically persisted in Synapse dedicated SQL pool for performance.
Partitioning: partition large tables by ingestion_date or event_date; avoid the small-files problem with regular compaction. A Bronze landing sketch follows.
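A minimal sketch of the Bronze landing step, assuming a Databricks Spark session; the paths and the run_id value are illustrative. It adds the lineage metadata described above and partitions by ingestion date.

```python
# Land raw data in Bronze with lineage metadata and date partitioning.
from pyspark.sql import functions as F

raw = spark.read.parquet("/mnt/landing/orders/2024-01-01/")

bronze = (raw
          .withColumn("ingestion_timestamp", F.current_timestamp())
          .withColumn("ingestion_date", F.current_date())
          .withColumn("source_file", F.input_file_name())
          .withColumn("run_id", F.lit("adf-run-0001")))  # hypothetical run ID

(bronze.write.format("delta")
 .mode("append")
 .partitionBy("ingestion_date")
 .save("/mnt/datalake/bronze/orders/"))
```

Partitioning by ingestion_date keeps daily reprocessing cheap: a bad day's partition can be rewritten without touching the rest of the table.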
4. Azure Databricks (Processing & Transformations)
Why Databricks: best suited for large-scale Spark jobs, Delta Lake integration, collaborative notebooks, and ML pipelines.
Jobs: scheduled via ADF or the Databricks Jobs API:
Bronze → Silver transformations (data quality rules, joins, enrichment).
Silver → Gold aggregations (hourly/daily), slowly changing dimension (SCD) logic.
Development workflow: use git integration (GitHub/Azure Repos), CI for notebooks and jobs.
Techniques:
Use Delta MERGE for SCD logic and upserts (see the sketch after this list).
Use Structured Streaming for near-real-time ingestion into Bronze, then micro-batch processing.
Implement schema validation with Great Expectations or custom checks.
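A minimal sketch of the Delta MERGE upsert mentioned above (SCD Type 1), assuming Databricks with the delta-spark package; the paths and the order_id key are illustrative.

```python
# Upsert Bronze changes into Silver with Delta MERGE (SCD Type 1).
from delta.tables import DeltaTable

silver = DeltaTable.forPath(spark, "/mnt/datalake/silver/orders/")
updates = (spark.read.format("delta")
           .load("/mnt/datalake/bronze/orders/")
           .dropDuplicates(["order_id"]))  # one row per key before merging

(silver.alias("t")
 .merge(updates.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()      # overwrite changed rows
 .whenNotMatchedInsertAll()   # insert new rows
 .execute())
```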
5. Azure Synapse Analytics (Warehousing & Serving)
Serverless SQL pool for ad hoc queries over the data lake (good for analysts); a query sketch follows this list.
Dedicated SQL pool (formerly SQL DW) if you need fast, concurrent BI workloads — load Gold tables into dedicated pools.
Synapse Workspace: orchestration notebooks, pipelines, integrated security, and data exploration.
Materialized views or aggregated tables for Power BI performance.
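As a sketch of how an analyst or script hits the serverless pool, the snippet below queries a Gold Delta folder through a Synapse serverless SQL endpoint using pyodbc; the server name, storage URL, and authentication mode are placeholder assumptions.

```python
# Query a Gold Delta folder via Synapse serverless SQL (a sketch;
# endpoint, database, and storage path are placeholders).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;Authentication=ActiveDirectoryInteractive;"
)

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/gold/daily_sales/',
    FORMAT = 'DELTA'
) AS daily_sales;
"""
for row in conn.execute(query):
    print(row)
```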
6. Power BI (Reporting)
Connectivity: DirectQuery to Synapse for fresh data; Import mode for fast dashboards. Use incremental refresh for large datasets.
Models: star schema built from the Gold layer; use calculation groups and row-level security (RLS) if needed.
Deployment: Power BI Service workspaces, deployment pipelines, and lineage views for impact analysis.
7. Security (Key Vault, Managed Identities, Private Endpoints)
Key Vault: store DB passwords, service principals, and secrets. Grant ADF/Databricks/Synapse access via managed identity (see the sketch after this list).
Managed Identity: prefer system-assigned identities for services to access storage and other resources.
Private Endpoints & VNet: use private endpoints for ADLS, Databricks workspace, and Synapse to restrict public access.
RBAC & Access Policies: least privilege principle; separate dev/test/prod.
Encryption: use Customer-Managed Keys (CMK) if required.
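A minimal sketch of the Key Vault + managed identity pattern using azure-identity and azure-keyvault-secrets; the vault URL and secret name are hypothetical.

```python
# Read a secret from Key Vault using the ambient managed identity.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()  # picks up the managed identity in Azure
client = SecretClient(vault_url="https://my-kv.vault.azure.net/",  # hypothetical
                      credential=credential)

sql_password = client.get_secret("onprem-sql-password").value
```

On Databricks, the equivalent is usually a Key Vault-backed secret scope read with dbutils.secrets.get, so no credential ever lands in notebook code.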
8. Observability & Alerting
Azure Monitor & Log Analytics: collect metrics and logs from ADF, Databricks, Synapse.
ADF Monitoring: pipeline run history and activity logs. Configure alerts for failures and SLA breaches; a failure-query sketch follows this list.
Databricks: job logs, Spark UI integration, cluster metrics. Send alerts on job failures.
Logic Apps / Email / Teams: on alert, send notifications, or invoke remediation playbooks.
Data Quality Dashboard: show failed rows, null counts, skew, throughput.
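As one way to drive that dashboard, the sketch below pulls recent ADF pipeline failures from Log Analytics with azure-monitor-query; the workspace ID is a placeholder, and the ADFPipelineRun table assumes diagnostic settings route ADF logs to the workspace in resource-specific mode.

```python
# Query recent ADF pipeline failures from Log Analytics (a sketch).
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

kql = """
ADFPipelineRun
| where Status == 'Failed'
| summarize failures = count() by PipelineName
"""
response = client.query_workspace(
    workspace_id="<LOG_ANALYTICS_WORKSPACE_ID>",  # placeholder
    query=kql,
    timespan=timedelta(days=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```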
9. CI/CD & Automation
Source Control: store Terraform/ARM templates, Databricks notebooks, ADF JSON in Git.
CI Pipeline: run tests (notebook static checks, unit tests).
CD Pipeline: deploy ARM templates for infra, ADF pipelines, Databricks jobs (via the Databricks CLI or REST API; see the sketch after this list), and Power BI deployment scripts.
Environments: dev/test/staging/prod with parameterized deployments.
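A sketch of the Databricks job deployment step from a CD pipeline, calling the Jobs API 2.1 directly with requests; the workspace URL, token, notebook path, and cluster settings are placeholders (in practice the token comes from the pipeline's secret store).

```python
# Create a Databricks job from a CD pipeline via the Jobs API 2.1 (a sketch).
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
headers = {"Authorization": "Bearer <DATABRICKS_TOKEN>"}     # placeholder

job_spec = {
    "name": "bronze-to-silver-orders",
    "tasks": [{
        "task_key": "transform",
        "notebook_task": {"notebook_path": "/Repos/data/bronze_to_silver"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
}

resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```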
10. Data Governance & Lineage
Data Catalog: Microsoft Purview for metadata, lineage, and data classification.
Lineage: capture source → transformation → target lineage for audits.
Policies: retention, access control, PII masking.
Example daily processing flow
Nightly batch: ADF pipeline triggered at 01:00.
Copy Activity: on-prem orders table (incremental using last_modified watermark) → ADLS /bronze/orders/date=YYYYMMDD/*.parquet (+ ingestion metadata).
Copy: vendor CSVs → ADLS /bronze/vendors/
Databricks job: triggered by ADF after the copy succeeds. Databricks reads Bronze orders for the date, applies dedup/cleaning, enriches with the product master from Silver/cache, and writes to /silver/orders/.
Aggregate job: another Databricks job computes daily sales aggregates, writes to /gold/daily_sales/ (sketch below), and updates the Synapse dedicated pool via PolyBase or COPY INTO.
Power BI refresh: dataset refresh scheduled or incremental refresh querying Synapse; analysts can view dashboards.
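A sketch of the aggregate job's core, assuming Databricks; paths, columns, and the hard-coded date are illustrative (a real job would take the date as a parameter).

```python
# Silver -> Gold: daily sales aggregates for one processing date (a sketch).
from pyspark.sql import functions as F

orders = (spark.read.format("delta").load("/mnt/datalake/silver/orders/")
          .where(F.col("event_date") == "2024-01-01"))

daily_sales = (orders.groupBy("event_date", "product_id")
               .agg(F.sum("amount").alias("total_sales"),
                    F.countDistinct("order_id").alias("order_count")))

# replaceWhere makes the daily rerun idempotent: only this date's
# partition is overwritten.
(daily_sales.write.format("delta").mode("overwrite")
 .option("replaceWhere", "event_date = '2024-01-01'")
 .save("/mnt/datalake/gold/daily_sales/"))
```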
Data quality & testing
Unit tests for transformation logic (PySpark unit tests with pytest; an example follows this list).
Assertion checks: row counts sanity, null thresholds, referential integrity.
Contract tests: enforce schema expectations at ingestion (column types, presence).
Automated reprocessing: track failed files in a recovery folder and retry with root cause logs.
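An example of the pytest pattern for PySpark transformations, runnable locally; clean_orders is a hypothetical function standing in for real transformation logic.

```python
# Local PySpark unit test with pytest (a sketch).
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def clean_orders(df):
    """Hypothetical transformation: drop duplicates, keep non-null IDs."""
    return df.dropDuplicates(["order_id"]).where("order_id IS NOT NULL")

def test_clean_orders_removes_duplicates_and_nulls(spark):
    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (None, "b")], ["order_id", "sku"])
    result = clean_orders(df)
    assert result.count() == 1
```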
Cost optimization & performance tips
Use Databricks job clusters (ephemeral clusters) rather than all-day interactive clusters.
Use serverless Synapse where concurrency is low; use dedicated pool for heavy BI with resource governance.
Partitioning + file sizing: compact small files with Delta OPTIMIZE (sketch after this list).
Cache frequently used tables in Databricks or Synapse materialized views.
Use spot instances where appropriate (Databricks supports spot).
Monitor costs with Azure Cost Management; set budgets and alerts.
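The compaction step from the list above, as it might look on Databricks (OPTIMIZE with Z-ordering is Databricks-specific; the path and column are illustrative):

```python
# Compact small files and co-locate data for a common filter column.
spark.sql("""
  OPTIMIZE delta.`/mnt/datalake/silver/orders/`
  ZORDER BY (customer_id)
""")
```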
Operations & Runbook examples
Pipeline failure: auto-alert → apply the basic retry policy → escalate to on-call if failures continue.
Schema change: detect with ADF schema drift or Databricks schema checks → create a migration plan and test it in dev.
Data breach: rotate Key Vault secrets, disable compromised credentials, notify the security team, and run forensic queries.
Advanced patterns
Streaming architecture: Event Hubs → Azure Stream Analytics or Structured Streaming into Delta tables (streaming Bronze) → incremental enrichments (see the sketch after this list).
Feature store: store ML features in Delta tables; integrate MLflow in Databricks for experiments.
RAG / ML inference: batch or online inference that writes predictions back into Gold for reports.
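A sketch of the streaming Bronze path using Event Hubs' Kafka-compatible endpoint with Structured Streaming on Databricks; the namespace, hub name, paths, and secret scope are placeholders.

```python
# Stream clickstream events from Event Hubs (Kafka endpoint) into Bronze.
from pyspark.sql import functions as F

# Connection string read from a Key Vault-backed secret scope (Databricks).
conn = dbutils.secrets.get("kv-scope", "eventhubs-connection-string")

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers",
                  "mynamespace.servicebus.windows.net:9093")  # placeholder
          .option("subscribe", "clickstream")
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.sasl.mechanism", "PLAIN")
          .option("kafka.sasl.jaas.config",
                  'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
                  f'required username="$ConnectionString" password="{conn}";')
          .load())

(stream.select(F.col("value").cast("string").alias("payload"),
               F.current_timestamp().alias("ingestion_timestamp"))
 .writeStream.format("delta")
 .option("checkpointLocation", "/mnt/datalake/_checkpoints/clickstream/")
 .trigger(availableNow=True)  # catch-up micro-batch; omit for continuous
 .start("/mnt/datalake/bronze/clickstream/"))
```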
Implementation checklist (practical steps)
Create subscription + resource groups for dev/test/prod.
Provision ADLS Gen2 with hierarchical namespace.
Create Key Vault, set policies.
Provision ADF (Git integration) and create self-hosted IR if needed.
Create a Databricks workspace (enable VNet injection for private networking).
Create Synapse workspace (serverless and dedicated as needed).
Define Bronze/Silver/Gold folder structure and naming conventions.
Implement ingestion pipelines in ADF (copy activities).
Develop Databricks notebooks & jobs (bronze→silver→gold).
Configure Synapse tables and decide DirectQuery vs Import for Power BI.
Set up monitoring, alerts, and CI/CD pipelines (DevOps).
Run end-to-end test with sample data, validate SLAs.
Team & roles
Data Engineer: build ETL/ELT, pipelines, Databricks jobs.
Platform Engineer: infra, networking, Key Vault, security.
Analyst / BI: define reporting needs, validate gold models.
Data QA: test suites and data quality.
DevOps: CI/CD pipelines and release management.
Example technologies matrix
Ingestion: Azure Data Factory, Event Hubs
Storage: ADLS Gen2, Delta Lake
Processing: Azure Databricks (PySpark), Databricks Delta
Warehouse: Azure Synapse (Serverless + Dedicated)
BI: Power BI
Security: Azure Key Vault, Managed Identities, Private Endpoints
Monitoring: Azure Monitor, Log Analytics, Application Insights
Governance: Microsoft Purview
Summary & outcomes
This architecture provides a robust, secure, and scalable pipeline that supports both batch and streaming workloads, preserves raw data for audit, gives business-ready outputs for reporting, and supports machine learning needs. With appropriate governance, CI/CD, monitoring, and cost controls, this design can be productionized and operated by cross-functional teams.
