
Commit f7f02a8 (1 parent: 6bf0126)

Add Databricks and benchmark results for most SQL warehouse options

File tree: 18 files changed, +1159 −1 lines


README.md

Lines changed: 0 additions & 1 deletion

@@ -222,7 +222,6 @@ Please help us add more systems and run the benchmarks on more types of VMs:
  - [ ] Azure Synapse
  - [ ] Boilingdata
  - [ ] CockroachDB Serverless
- - [ ] Databricks
  - [ ] DolphinDB
  - [ ] Dremio (without publishing)
  - [ ] DuckDB operating like "Athena" on remote Parquet files

databricks/.env.example

Lines changed: 22 additions & 0 deletions

# Databricks Configuration
# Copy this file to .env and fill in your actual values

# Your Databricks workspace hostname (e.g., dbc-xxxxxxxx-xxxx.cloud.databricks.com)
DATABRICKS_SERVER_HOSTNAME=your-workspace-hostname.cloud.databricks.com

# SQL Warehouse HTTP path (found in your SQL Warehouse settings)
# Uncomment the warehouse size you want to use
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id

# Instance type name for results file naming & results machine type label
databricks_instance_type=Large

# Your Databricks personal access token
DATABRICKS_TOKEN=your-databricks-token

# Unity Catalog and Schema names
DATABRICKS_CATALOG=clickbench_catalog
DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
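
For reference, a minimal sketch of how these values might be consumed from Python. This is not part of the committed code; it assumes the `python-dotenv` and `databricks-sql-connector` packages and simply opens a session against the configured SQL Warehouse.

```python
# Hypothetical usage of the .env values above (not part of the commit).
# Assumes: pip install python-dotenv databricks-sql-connector
import os

from dotenv import load_dotenv  # reads .env into the process environment
from databricks import sql      # Databricks SQL Connector for Python

load_dotenv()

# Open a session against the SQL Warehouse described by the .env file.
with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_catalog()")
        print(cursor.fetchone())
```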

databricks/NOTES.md

Lines changed: 4 additions & 0 deletions

I created each warehouse in the Databricks UI.
Besides the warehouse size, the only other change I made to the default settings was to set the auto-stop (sleep) time to 5 minutes to save money (the 4X-Large warehouse is very expensive).

Once a warehouse was created, I saved its HTTP path to use in the .env file for each run.

databricks/README.md

Lines changed: 47 additions & 0 deletions

## Setup

1. Create a Databricks workspace and SQL Warehouse
2. Generate a personal access token from your Databricks workspace
3. Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
# Edit .env with your actual credentials
```

Required environment variables:
- `DATABRICKS_SERVER_HOSTNAME`: Your workspace hostname (e.g., `dbc-xxxxxxxx-xxxx.cloud.databricks.com`)
- `DATABRICKS_HTTP_PATH`: SQL Warehouse path (e.g., `/sql/1.0/warehouses/your-warehouse-id`)
- `DATABRICKS_TOKEN`: Your personal access token
- `databricks_instance_type`: Instance type name for results file naming, e.g., `2X-Large`
- `DATABRICKS_CATALOG`: Unity Catalog name
- `DATABRICKS_SCHEMA`: Schema name
- `DATABRICKS_PARQUET_LOCATION`: S3 path to the Parquet file

## Running the Benchmark

```bash
./benchmark.sh
```
## How It Works

1. **benchmark.sh**: Entry point that installs dependencies via `uv` and runs the benchmark
2. **benchmark.py**: Orchestrates the full benchmark (see the sketch after this list):
   - Creates the catalog and schema
   - Creates the `hits` table with an explicit schema (including TIMESTAMP conversion)
   - Loads data from the Parquet file using `INSERT INTO` with type conversions
   - Runs all queries via `run.sh`
   - Collects timing metrics from the Databricks REST API
   - Outputs results as JSON in the `results/` directory
3. **run.sh**: Iterates through `queries.sql` and executes each query
4. **query.py**: Executes individual queries and retrieves execution times from the Databricks REST API (`/api/2.0/sql/history/queries/{query_id}`)
5. **queries.sql**: Contains the 43 benchmark queries
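
The table creation and load step described for benchmark.py could look roughly like the following. This is a simplified sketch, not the committed benchmark.py: the `hits` schema is truncated to a few illustrative columns, and `timestamp_seconds` / `date_add` are just one way to express the Unix-timestamp and Unix-date conversions mentioned above.

```python
# Simplified, illustrative sketch of the catalog/schema/table setup and load.
# The real hits table has ~100 columns; the actual SQL lives in benchmark.py.
import os
from databricks import sql

catalog = os.environ["DATABRICKS_CATALOG"]
schema = os.environ["DATABRICKS_SCHEMA"]
parquet = os.environ["DATABRICKS_PARQUET_LOCATION"]

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn, conn.cursor() as cursor:
    cursor.execute(f"CREATE CATALOG IF NOT EXISTS {catalog}")
    cursor.execute(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")

    # Truncated, illustrative schema: EventTime/EventDate arrive as Unix values.
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {catalog}.{schema}.hits (
            WatchID BIGINT,
            EventTime TIMESTAMP,
            EventDate DATE,
            UserID BIGINT
            -- ... remaining ClickBench columns ...
        )
    """)

    # Load from the Parquet file, converting Unix seconds -> TIMESTAMP and
    # Unix days -> DATE (one possible way to express the conversions).
    cursor.execute(f"""
        INSERT INTO {catalog}.{schema}.hits
        SELECT
            WatchID,
            timestamp_seconds(EventTime)      AS EventTime,
            date_add('1970-01-01', EventDate) AS EventDate,
            UserID
            -- ... remaining columns ...
        FROM parquet.`{parquet}`
    """)
```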
## Notes

- Query execution times are pulled from the Databricks REST API, which provides server-side metrics (see the sketch below)
- The data is loaded from a Parquet file with explicit type conversions (Unix timestamps → TIMESTAMP, Unix dates → DATE)
- The benchmark uses the Databricks SQL Connector for Python
- Results include load time, data size, and individual query execution times (3 runs per query)
- Results are saved to `results/{instance_type}.json`
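
A rough sketch of how the server-side timing lookup in query.py might work. The endpoint path is the one named above; the response field accessed here (`metrics.execution_time_ms`) is an assumption about the Query History API, and how the query ID is obtained is left as a parameter.

```python
# Hypothetical sketch of fetching server-side timing for a finished query
# (not the committed query.py). The response fields read below are assumptions.
import os
import requests


def fetch_query_metrics(query_id: str) -> dict:
    """Return the query-history record for `query_id` from the REST API."""
    host = os.environ["DATABRICKS_SERVER_HOSTNAME"]
    token = os.environ["DATABRICKS_TOKEN"]
    response = requests.get(
        f"https://{host}/api/2.0/sql/history/queries/{query_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


# Example: pull an execution time in seconds (field names are assumptions).
info = fetch_query_metrics("some-query-id")
execution_ms = info.get("metrics", {}).get("execution_time_ms")
if execution_ms is not None:
    print(f"server-side execution time: {execution_ms / 1000:.3f} s")
```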
