An end-to-end data engineering solution implementing the Medallion Architecture (Bronze-Silver-Gold) for processing and analyzing trip transaction data using Azure cloud services. This project automates data ingestion, transformation, and analytics workflows with built-in monitoring and email notifications.
- Automated Data Pipeline: Seamless data flow from Azure SQL Database to structured storage layers
- Real-Time Processing: Continuous data ingestion and transformation for up-to-date insights
- Business Intelligence: Advanced analytics on customer behavior, driver performance, and operational metrics
- Scalability: Cloud-native architecture handles growing data volumes efficiently
- Data Quality: Medallion Architecture ensures progressive data refinement and validation
    Azure SQL Database → ADF → Bronze Layer (Raw Data)
                                   ↓
                    Silver Layer (Cleaned & Transformed)
                                   ↓
                      Gold Layer (Business Insights)
                                   ↓
                   Hive Metastore + Email Notifications
| Component | Technology |
|---|---|
| Languages | Python, SQL, PySpark |
| Data Integration | Azure Data Factory (ADF) |
| Data Storage | Azure Data Lake Storage Gen2 (ADLS) |
| Data Processing | Azure Databricks (Apache Spark) |
| Database | Azure SQL Database |
| Orchestration | ADF Pipelines, Logic Apps |
| Storage Format | Delta Lake |
- ✅ ACID Transactions - Reliable data operations with atomicity and consistency
- 📚 Data Versioning - Track changes and enable time travel queries
- 🔒 Schema Enforcement - Ensure data quality and consistency
- ⚡ Optimized Performance - Advanced indexing for faster queries
- 📥 Full & Incremental Loads - Efficient data synchronization
- 🔄 Automated Scheduling - Hands-free pipeline execution
- 📧 Email Notifications - Real-time alerts via Logic Apps
- 📊 Monitoring Dashboard - Track pipeline health and performance
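A minimal PySpark sketch of the Delta Lake behaviors listed above, assuming an illustrative Bronze path and a hypothetical `trip_date` column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided on Databricks

# Hypothetical Bronze path; substitute your storage account and container.
bronze_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/trip_transactions"

# Initial load creates the Delta table and pins its schema.
raw = spark.read.option("header", "true").csv("/tmp/sample_trips.csv")
raw.write.format("delta").mode("append").save(bronze_path)

# Schema enforcement: a later append whose columns don't match this schema
# fails instead of silently corrupting the table.

# Data versioning / time travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(bronze_path)

# Optimized performance (Databricks): compact small files and cluster by a column.
spark.sql(f"OPTIMIZE delta.`{bronze_path}` ZORDER BY (trip_date)")  # trip_date is assumed
```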
- **Ride Analytics**
  - Identify top-rated drivers and peak demand periods
  - Track trip delays and payment statuses
  - Analyze customer travel patterns
- **Driver Performance**
  - Highest number of trips per driver
  - Driver ratings and customer satisfaction metrics
- **Financial Insights**
  - Highest-spending customers
  - Fare trends and revenue analysis
  - Distance-based pricing optimization
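As a hedged sketch of how a few of these insights could be computed in the Gold zone (table and column names below are assumptions, not the project's actual schema):

```python
from pyspark.sql import functions as F

# `spark` is the ambient Databricks session; tables are assumed Silver outputs.
trips = spark.table("silver.trip_transaction_fact")
drivers = spark.table("silver.driver_dimension")

# Driver performance: trip counts and average rating per driver.
driver_stats = (
    trips.groupBy("driver_id")
         .agg(F.count("*").alias("total_trips"),
              F.avg("rating").alias("avg_rating"))
         .join(drivers, "driver_id")
         .orderBy(F.desc("total_trips"))
)

# Financial insight: highest-spending customers by total fares.
top_spenders = (
    trips.groupBy("customer_id")
         .agg(F.sum("fare_amount").alias("total_spend"))
         .orderBy(F.desc("total_spend"))
         .limit(10)
)

driver_stats.write.mode("overwrite").saveAsTable("gold.driver_performance")
top_spenders.write.mode("overwrite").saveAsTable("gold.top_customers")
```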
- Azure Subscription (a free account is sufficient)
- Azure CLI (optional)
- Basic knowledge of SQL and Python
1. **Create Azure Resources**
   - Resource Group
   - Azure SQL Database
   - Azure Data Lake Storage Gen2
   - Azure Databricks Workspace
   - Azure Data Factory
2. **Configure Data Source**
   - Upload sample CSV files to the Bronze folder
   - Execute the PySpark script to populate Azure SQL Database (sketched below)
   - Configure firewall rules for Databricks access
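The repository's PySpark_upload_data_to_DB_Script.py performs the database load; a minimal sketch of the JDBC write it implies, with the server, target table, and credentials as placeholders:

```python
# Read the sample CSVs and push them into Azure SQL Database over JDBC.
# The Microsoft SQL Server JDBC driver ships with the Databricks runtime.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/sample_trips.csv"))  # placeholder path

jdbc_url = ("jdbc:sqlserver://<server>.database.windows.net:1433;"
            "database=<database>;encrypt=true")

(df.write.format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.TripTransactions")  # hypothetical target table
   .option("user", "<sql-user>")
   .option("password", "<sql-password>")       # prefer a secret store in practice
   .mode("append")
   .save())
```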
3. **Set Up Databricks**
   - Create a compute cluster
   - Import notebooks (Bronze/Silver/Gold zones)
   - Configure ADLS access credentials
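One common pattern for the ADLS credentials step is OAuth with a service principal; a sketch, assuming a Databricks secret scope named `adls-scope` with these keys:

```python
# Service-principal (OAuth 2.0) access to ADLS Gen2 from a Databricks notebook.
storage_account = "<storage-account>"  # placeholder
client_id = dbutils.secrets.get("adls-scope", "client-id")        # assumed scope/keys
client_secret = dbutils.secrets.get("adls-scope", "client-secret")
tenant_id = dbutils.secrets.get("adls-scope", "tenant-id")

host = f"{storage_account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{host}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{host}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{host}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{host}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{host}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
```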
4. **Build ADF Pipeline**
   - Create dataflows for data extraction
   - Configure incremental load filters (see the watermark sketch below)
   - Set up Logic Apps for email notifications
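The incremental filter itself is configured inside ADF, but the underlying idea is a watermark comparison. A PySpark sketch of the same logic, assuming a `modified_date` audit column and placeholder paths:

```python
from pyspark.sql import functions as F

bronze_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/trips"  # placeholder
silver_path = "abfss://silver@<storage-account>.dfs.core.windows.net/trips"  # placeholder

# Watermark: the latest timestamp already loaded into Silver.
# (On the very first run, seed the watermark instead of reading it.)
last_loaded = (spark.read.format("delta").load(silver_path)
               .agg(F.max("modified_date")).first()[0])

# Incremental load: only rows newer than the watermark come across.
bronze = spark.read.format("delta").load(bronze_path)
new_rows = bronze.filter(F.col("modified_date") > F.lit(last_loaded))
new_rows.write.format("delta").mode("append").save(silver_path)
```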
5. **Execute & Monitor**
   - Run the pipeline manually or on a schedule
   - Monitor execution in the ADF dashboard
   - Verify data in each layer (Bronze → Silver → Gold)
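A quick way to verify each layer after a run (container names and the Gold table name are illustrative):

```python
# Bronze and Silver live as Delta paths in ADLS; count rows in each.
for layer in ("bronze", "silver"):
    path = f"abfss://{layer}@<storage-account>.dfs.core.windows.net/trips"  # placeholder
    print(layer, spark.read.format("delta").load(path).count())

# Gold insights land in the Hive Metastore, so query them as tables.
print("gold", spark.table("gold.driver_performance").count())  # assumed table name
```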
    ├── Bronze_Zone/
    │   └── Delta_Lake_Validation.py       # Raw data validation
    ├── Silver_Zone/
    │   ├── Customer_Dimension.py          # Customer dimension table
    │   ├── Driver_Dimension.py            # Driver dimension table
    │   ├── Date_Dimension.py              # Date dimension table
    │   ├── Trip_Transaction_Fact.py       # Trip fact table
    │   └── Rewards_Points_Fact.py         # Rewards fact table
    ├── Gold_Zone/
    │   └── Final_Reports.py               # Business insights & analytics
    └── PySpark_upload_data_to_DB_Script.py
- Bronze Layer - Raw data ingestion from Azure SQL Database in Delta format
- Silver Layer - Data cleansing, transformation, and creation of fact/dimension tables
- Gold Layer - Aggregated analytics and business insights stored in Hive Metastore
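A condensed sketch of the Silver-layer cleansing described above, with key and column names assumed:

```python
from pyspark.sql import functions as F

bronze = spark.read.format("delta").load(
    "abfss://bronze@<storage-account>.dfs.core.windows.net/trips")  # placeholder path

# Cleansing: drop exact duplicates, enforce types, discard rows missing keys.
silver = (bronze.dropDuplicates(["trip_id"])  # assumed key column
                .withColumn("fare_amount", F.col("fare_amount").cast("double"))
                .withColumn("trip_date", F.to_date("trip_date"))
                .na.drop(subset=["trip_id", "customer_id", "driver_id"]))

# Assumes a `silver` database already exists in the metastore.
silver.write.format("delta").mode("overwrite").saveAsTable("silver.trip_transaction_fact")
```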
- Multi-Factor Authentication (MFA) enabled
- Least privilege access control
- Budget alerts and cost monitoring
- Soft delete disabled for real-time access
- Secrets management for credentials
- Resource naming conventions
Contributions are welcome! Please feel free to submit issues or pull requests.
This project is licensed under the MIT License.
- Built with Azure cloud services
- Powered by Apache Spark and Delta Lake
- Inspired by modern data engineering practices
Note: Remember to delete Azure resources after use to avoid unnecessary charges. Always monitor your Azure cost analysis dashboard.