A fully containerized Apache Spark cluster with JupyterLab for distributed data processing and interactive development.
- Overview
- Architecture
- Features
- Prerequisites
- Quick Start
- AWS S3 Support
- Usage
- Access URLs
- Project Structure
- Technologies
This project provides a ready-to-use Apache Spark cluster running in Docker containers, featuring:
- Spark Master node for cluster coordination
- 2 Spark Workers for distributed computation
- JupyterLab for interactive data analysis and development
- Pre-configured networking and volume mounts
Perfect for local development, testing, and learning distributed data processing with Apache Spark.
The cluster consists of 4 Docker containers:
| Service | Container Name | Ports | Resources |
|---|---|---|---|
| JupyterLab | jupyterlab | 8888 (UI), 4040 (Spark UI) | - |
| Spark Master | spark-master | 8080 (UI), 7077 (Master) | - |
| Spark Worker 1 | spark-worker-1 | 8081 (UI) | 1 core, 1GB RAM |
| Spark Worker 2 | spark-worker-2 | 8082 (UI) | 1 core, 1GB RAM |

Total Cluster Capacity: 2 cores, 2GB memory
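To confirm this capacity on a running cluster, the Spark standalone master also exposes its status as JSON on the same port as its web UI. The snippet below is a small sketch using only the Python standard library; the /json endpoint is standard Spark standalone behaviour, not something added by this project:

```python
import json
import urllib.request

# Query the standalone master's JSON status endpoint from the host
# (from inside a container, use http://spark-master:8080/json/ instead).
with urllib.request.urlopen("http://localhost:8080/json/") as resp:
    status = json.load(resp)

# The payload includes a "workers" list plus aggregate core and memory counts.
print("Alive workers:", len(status.get("workers", [])))
print("Total cores:  ", status.get("cores"))
print("Memory (MB):  ", status.get("memory"))
```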
- Dockerized Setup - Easy deployment with Docker Compose
- Apache Spark 3.5.7 - Latest stable version with Hadoop 3
- JupyterLab 4.3.3 - Modern notebook interface for development
- Scalable Architecture - Easy to add more worker nodes
- Shared Workspace - Persistent volume for notebooks and data
- Pre-configured - Ready to run Spark jobs out of the box
- AWS S3 Support - Optional AWS-enabled variant with S3 integration
- Docker (version 20.10+)
- Docker Compose (version 2.0+)
- At least 4GB of available RAM
- 10GB of free disk space
```bash
git clone https://github.com/blnkoff/docker-spark-cluster
cd docker-spark-cluster
```

Download the sample dataset into the shared workspace:

```bash
cd build/workspace && \
mkdir -p data && \
curl -L -o data/customs_data.csv "https://huggingface.co/datasets/halltape/customs_data/resolve/main/customs_data.csv?download=true"
cd ../..
```

Start the cluster:

```bash
docker-compose up -d
```

Check the status of the containers:

```bash
docker-compose ps
```

All containers should be in the "Up" state.
This project includes an AWS-enabled variant with S3 support for reading and writing data from Amazon S3 or S3-compatible storage.
To build the AWS variant with S3 support:
```bash
# Build base images with AWS libraries
./build-base-images.sh aws

# Start the cluster with AWS configuration
cp env.aws.example .env
# Edit .env with your AWS credentials
docker-compose -f docker-compose.aws.yml up -d
```

With the AWS-enabled variant, you can access S3 buckets:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("S3 Example") \
    .master("spark://spark-master:7077") \
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY") \
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY") \
    .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

# Read from S3
df = spark.read.csv("s3a://your-bucket/path/to/file.csv", header=True)
df.show()

# Write to S3
df.write.mode("overwrite").csv("s3a://your-bucket/output/")
```
The AWS variant includes:
- hadoop-aws-3.3.4.jar - Hadoop AWS connector
- aws-java-sdk-bundle-1.12.262.jar - AWS SDK for Java
For more details, see build/docker/spark-base/README.md
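The same S3A settings also work with S3-compatible object stores such as MinIO. The sketch below is illustrative and not specific to this repository: the endpoint URL is a placeholder, and path-style access is usually required for non-AWS services:

```python
from pyspark.sql import SparkSession

# Point S3A at an S3-compatible endpoint; "http://minio:9000" is a placeholder.
spark = (
    SparkSession.builder
    .appName("S3-compatible storage")
    .master("spark://spark-master:7077")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.read.csv("s3a://your-bucket/path/to/file.csv", header=True)
```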
- Open your browser and navigate to: http://localhost:8888
- Enter the token: hello_world
- Open the sample notebook: spark.ipynb
In JupyterLab, create a new notebook and connect to the cluster:
```python
from pyspark.sql import SparkSession

# Create Spark session connected to the cluster
spark = (
    SparkSession
    .builder
    .appName("docker-spark-cluster")
    .master("spark://spark-master:7077")
    .config("spark.submit.deployMode", "client")
    .config("spark.driver.host", "jupyterlab")
    .getOrCreate()
)

# Read the sample dataset
df = spark.read.csv("/opt/workspace/data/customs_data.csv", header=True, inferSchema=True)
df.show()
```
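As an optional follow-up in the same notebook, you can write results back to the shared workspace. This is a minimal sketch reusing the spark and df objects from the block above; the output directory is illustrative, and it relies on /opt/workspace being the shared volume already used by the sample read:

```python
# Write the dataset back out as Parquet to the shared workspace volume.
# The exact output directory is illustrative; anything under /opt/workspace works.
df.write.mode("overwrite").parquet("/opt/workspace/data/customs_data_parquet")

# Read it back to confirm the round trip through the cluster.
parquet_df = spark.read.parquet("/opt/workspace/data/customs_data_parquet")
print(parquet_df.count())
```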
When you are done, stop the session:

```python
# Stop the session when done
spark.stop()
```

To stop the cluster:

```bash
docker-compose down
```

To remove volumes as well:
```bash
docker-compose down -v
```

Once the cluster is running, access the following web interfaces:
| Service | URL | Credentials |
|---|---|---|
| JupyterLab | http://localhost:8888 | Token: hello_world |
| Spark Master UI | http://localhost:8080 | - |
| Spark Worker 1 UI | http://localhost:8081 | - |
| Spark Worker 2 UI | http://localhost:8082 | - |
| Spark Application UI | http://localhost:4040 | - |
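Note: the Spark Application UI on port 4040 is only served while an application is running, i.e. while a SparkSession created from JupyterLab is active.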
```
docker-spark-cluster/
├── build/
│   ├── docker/
│   │   ├── base/              # Base Python image
│   │   ├── spark-base/        # Spark installation (with AWS variant support)
│   │   ├── jupyterlab/        # JupyterLab image
│   │   ├── spark-master/      # Spark master node
│   │   └── spark-worker/      # Spark worker nodes
│   └── workspace/
│       ├── data/              # Datasets directory
│       └── spark.ipynb        # Sample notebook
├── docker-compose.yml         # Standard cluster configuration
├── docker-compose.aws.yml     # AWS-enabled cluster configuration
├── docker-compose.local.yml   # Cluster configuration (build locally)
├── build-base-images.sh       # Base images build script (supports AWS variant)
├── push-to-dockerhub.sh       # Docker Hub push script (supports AWS variant)
├── env.aws.example            # Example AWS credentials file
├── .gitignore
└── README.md
```
- Apache Spark 3.5.7 - Distributed computing framework
- Hadoop 3 - Distributed storage system
- JupyterLab 4.3.3 - Interactive development environment
- Python 3 - Programming language for PySpark
- Docker - Containerization platform
- Docker Compose - Multi-container orchestration
This project is based on HalltapeSparkCluster by halltape. Special thanks for the original implementation and inspiration.