
Spark Declarative Pipelines - Comprehensive Data Engineering Examples



A collection of modern Spark Declarative Pipeline (SDP) implementations demonstrating different data processing paradigms. This repository showcases both open-source Spark Declarative Pipelines (SDP) for analytics workloads and Lakeflow Spark Declarative Pipelines (LSDP, formerly called LDP) on Databricks for data processing.

πŸ“ Project Structure

etl-pipelines/
├── README.md                          # This overview document
├── CLAUDE.md                          # Claude Code configuration
└── src/py/
    ├── sdp/                          # OSS Spark Declarative Pipelines examples
    │   ├── README.md                 # Comprehensive SDP documentation
    │   ├── daily_orders/             # E-commerce analytics pipeline
    │   ├── oil_rigs/                 # Industrial IoT monitoring pipeline
    │   └── utils/                    # Shared data generation utilities
    ├── lsdp/                         # Lakeflow Spark Declarative Pipelines (on Databricks)
    │   └── music_analytics/          # Million Song Dataset analytics pipeline
    │       ├── README.md             # Music analytics documentation
    │       ├── images/               # Pipeline visualization assets
    │       └── transformations/      # SDP transformation definitions
    └── generators/                   # Cross-framework data generators

🚀 Getting Started

SDP - Spark Declarative Pipelines

Perfect for batch analytics and data science workloads using PySpark.

# Navigate to SDP examples
cd src/py/sdp

# Install dependencies with UV
uv sync

# Run Daily Orders e-commerce pipeline
python main.py daily-orders

# Run Oil Rigs sensor monitoring pipeline
python main.py oil-rigs

LSDP - Lakeflow Spark Declarative Pipelines (on Databricks)

Ideal for streaming data processing with medallion architecture and data quality validation on the Databricks Platform.

# Navigate to Music Analytics SDP example
cd src/py/lsdp/music_analytics

# Deploy pipeline to Databricks workspace
# Pipeline processes Million Song Dataset with medallion architecture
# See README.md for detailed implementation overview

📊 Use Cases Demonstrated

1. Daily Orders E-commerce Analytics (SDP)

  • Framework: Spark Declarative Pipelines
  • Data: Synthetic e-commerce orders with 20+ product categories
  • Features: Order lifecycle management, sales tax calculations, business analytics
  • Storage: Local Spark warehouse with Parquet files
  • Scale: Development/testing workloads
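
For a sense of how these synthetic orders come together, here is a minimal sketch using Faker with PySpark. The field names, category subset, and flat 8.25% tax rate are illustrative assumptions, not the repo's actual schema.

from faker import Faker
import random
from pyspark.sql import SparkSession

fake = Faker()
spark = SparkSession.builder.appName("daily-orders-sketch").getOrCreate()

# Small subset for illustration; the repo's generator covers 20+ categories.
CATEGORIES = ["Electronics", "Books", "Home & Garden", "Toys"]

def make_order(order_id):
    price = round(random.uniform(5.0, 500.0), 2)
    return {
        "order_id": order_id,
        "customer": fake.name(),
        "category": random.choice(CATEGORIES),
        "price": price,
        "sales_tax": round(price * 0.0825, 2),  # assumed flat rate, illustration only
        "ordered_at": fake.date_time_this_year(),
    }

orders = spark.createDataFrame([make_order(i) for i in range(100)])
orders.groupBy("category").count().show()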

2. Oil Rigs Industrial Monitoring (SDP)

  • Framework: Spark Declarative Pipelines
  • Data: IoT sensor data from Texas oil fields (temperature, pressure, water level)
  • Features: Multi-location monitoring, statistical analysis, interactive visualizations
  • Storage: Local Spark warehouse with time-series data
  • Scale: Sensor analytics and operational monitoring
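
The statistical analysis reduces to a standard grouped aggregation. A minimal sketch with made-up rig IDs and readings (the real pipeline generates these with the shared utilities):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("oil-rigs-sketch").getOrCreate()

# Toy readings (rig, sensor, value); the real pipeline generates these
# with the shared utilities under src/py/sdp/utils.
readings = spark.createDataFrame(
    [
        ("rig_tx_01", "temperature", 98.6),
        ("rig_tx_01", "pressure", 2150.0),
        ("rig_tx_02", "temperature", 101.2),
        ("rig_tx_02", "water_level", 14.8),
    ],
    ["rig_id", "sensor", "value"],
)

# Per-rig, per-sensor summary statistics.
readings.groupBy("rig_id", "sensor").agg(
    F.avg("value").alias("avg"),
    F.min("value").alias("min"),
    F.max("value").alias("max"),
).show()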

3. Music Analytics - Million Song Dataset (on Databricks)

  • Framework: Spark Declarative Pipelines
  • Data: Million Song Dataset with 20 fields of artist, song, and audio features
  • Architecture: Medallion pattern with specialized silver tables and comprehensive gold analytics
  • Silver Layer: Domain-focused tables (songs_metadata_silver, songs_audio_features_silver) with comprehensive data quality validation
  • Gold Layer: 9 advanced analytics tables across temporal, artist, and musical analysis dimensions
  • Analytics: Artist discography analysis, temporal trends, musical characteristics, tempo/time signature patterns, comprehensive artist profiles
  • Storage: Delta tables with Unity Catalog integration and automatic data lineage
  • Scale: Production-ready streaming data processing with Auto Loader and extensive data quality rules
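
To illustrate the bronze-to-silver step, here is a minimal sketch in the @dp decorator style referenced above. The landing path and column names are assumptions, spark is provided by the pipeline runtime, and this runs only inside a Databricks pipeline, not as a standalone script.

from pyspark import pipelines as dp

# Bronze: incremental file ingestion with Auto Loader (landing path is assumed).
@dp.table(comment="Raw Million Song records ingested with Auto Loader")
def songs_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/main/music/landing/")  # hypothetical location
    )

# Silver: domain-focused metadata table; rows are checked against
# expectations and violations surface in pipeline quality metrics.
@dp.table(comment="Validated song metadata")
@dp.expect("valid_tempo", "tempo > 0")
@dp.expect("valid_duration", "duration > 0")
def songs_metadata_silver():
    return spark.readStream.table("songs_bronze").select(
        "artist_name", "title", "year", "tempo", "duration"
    )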

πŸ› οΈ Technologies & Frameworks

Core Technologies

  • PySpark 4.1.0.preview2: Latest Spark features with Python API
  • Spark Declarative Pipelines: Building data processing pipelines with materialized views and streaming tables
  • UV Package Manager: Modern Python dependency management
  • Faker: Realistic synthetic data generation
  • Plotly: Interactive data visualizations
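
As a flavor of the visualization layer, a toy Plotly example (made-up readings standing in for pipeline output):

import pandas as pd
import plotly.express as px

# Made-up hourly averages standing in for pipeline output.
df = pd.DataFrame({"hour": range(24), "avg_temp_f": [95 + h % 6 for h in range(24)]})
fig = px.line(df, x="hour", y="avg_temp_f", title="Average rig temperature by hour (toy data)")
fig.show()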

Architecture Patterns

  • Declarative Pipelines: SDP framework built on Python decorators such as @dp.table (a sketch follows this list)
  • Medallion Architecture: Bronze/Silver/Gold data layers with specialized silver tables for domain separation
  • Materialized Views: Efficient data transformation caching and automatic dependency resolution
  • Data Quality Framework: Comprehensive validation rules with @dp.expect decorators for tempo, duration, and metadata validation
  • Advanced Analytics: Multi-dimensional gold layer tables combining temporal, artist, and musical analysis
  • Shared Utilities: Reusable data generation components across frameworks
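
A minimal sketch of the decorator pattern, assuming the @dp.materialized_view form and borrowing table names from the music example: the framework reads the dependency on songs_metadata_silver straight from the table reference and keeps the view up to date.

from pyspark import pipelines as dp
from pyspark.sql import functions as F

# Gold: aggregated analytics over the silver table; the framework infers the
# dependency on songs_metadata_silver from the table reference below.
@dp.materialized_view(comment="Track counts and average tempo per artist and year")
def artist_year_gold():
    return (
        spark.read.table("songs_metadata_silver")
        .groupBy("artist_name", "year")
        .agg(
            F.count("*").alias("track_count"),
            F.avg("tempo").alias("avg_tempo"),
        )
    )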

📋 Quick Reference

SDP Commands

# Environment setup
cd src/py/sdp && uv sync

# Run pipelines
python main.py daily-orders # E-commerce analytics
python main.py oil-rigs     # IoT sensor monitoring

# Test utilities
uv run sdp-test-orders      # Test order generation
uv run sdp-test-oil-sensors # Test sensor data generation

# Development commands
uv run pytest              # Run tests
uv run black .             # Format code
uv run flake8 .            # Lint code

LSDP Commands

# Navigate to Music Analytics pipeline
cd src/py/lsdp/music_analytics

# View pipeline documentation and architecture
cat README.md

# Deploy to Databricks workspace (requires Databricks environment)
# See transformations/sdp_musical_pipeline.py for implementation

🎯 Learning Objectives

This repository demonstrates:

  1. Framework Comparison: SDP vs LSDP for different use cases and data processing paradigms
  2. Data Generation: Realistic synthetic data creation patterns with Faker library
  3. Pipeline Architecture: Declarative transformations, medallion architecture, and specialized table design
  4. Quality Engineering: Comprehensive data validation with @dp.expect rules and monitoring strategies
  5. Advanced Analytics: Multi-dimensional analysis combining temporal trends, artist profiling, and musical characteristics
  6. Modern Tooling: UV package management, Unity Catalog, Auto Loader, and latest Spark features
  7. Production Patterns: Streaming ingestion, environment management, and scalable deployment workflows

📚 Documentation

Each pipeline ships with its own guide: see src/py/sdp/README.md for the OSS SDP examples and src/py/lsdp/music_analytics/README.md for the Databricks music analytics pipeline.

🔧 Development Setup

Prerequisites

  • Python 3.11+: Required for all frameworks
  • UV Package Manager: Modern dependency management
  • Java 17+: Required by PySpark 4.x (handled automatically)
  • Databricks Workspace: Required for LSDP pipelines on Databricks

Installation

# Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone <repository-url>
cd etl-pipelines

# Setup SDP environment
cd src/py/sdp && uv sync

# Verify installation
uv run python -c "import pyspark; print('PySpark version:', pyspark.__version__)"

💡 Best Practices Demonstrated

Code Organization

  • Centralized Utilities: Shared data generation functions
  • Clear Separation: Framework-specific implementations
  • Configuration Management: Environment-specific settings
  • Comprehensive Testing: Unit tests and validation scripts

Data Engineering

  • Quality First: Built-in data validation and monitoring
  • Scalable Patterns: From development to production
  • Modern Tooling: Latest framework features and best practices
  • Documentation: Comprehensive guides and examples

Pipeline Design

  • Modularity: Reusable components and transformations
  • Observability: Metrics, logging, and monitoring
  • Flexibility: Support for both batch and streaming workloads
  • Maintainability: Clear structure and comprehensive documentation

This repository provides practical examples of modern data engineering patterns, suitable for learning and development. Each framework demonstrates different strengths and use cases in the data processing ecosystem.
