
Conversation


@conormccarter conormccarter commented Nov 7, 2025

Resolves: #24 (Help wanted: Databricks)

  1. Add Databricks benchmark script
  2. Add results for most Databricks SQL warehouse sizes

@rschu1ze (Member)

Oh, that PR would have been nice to merge :-/

@conormccarter please let us know if you need support from our end to go forward with this PR.

@conormccarter (Author)

Hey @rschu1ze, I will reopen once I update the results! (I realized that I failed to turn off the query cache, which resulted in inaccurate "hot run" times.)
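As a side note, here is a minimal sketch of how result caching can be switched off per session with the Databricks SQL connector. The connection values below are placeholders, and the benchmark script's actual handling may differ:

```
# Sketch: disable Databricks SQL result caching before timing "hot" runs.
# Connection values are placeholders, not this PR's actual configuration.
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxx.cloud.databricks.com",  # placeholder host
    http_path="/sql/1.0/warehouses/xxxx",             # placeholder warehouse path
    access_token="dapi-xxxx",                         # placeholder token
) as connection:
    with connection.cursor() as cursor:
        # use_cached_result controls whether the warehouse may serve
        # previously cached query results; turn it off for benchmarking.
        cursor.execute("SET use_cached_result = false")
        cursor.execute("SELECT COUNT(*) FROM hits")  # example timed query
        print(cursor.fetchone())
```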

@conormccarter conormccarter reopened this Nov 13, 2025
@rschu1ze (Member) left a comment


I got a permission error when I tried to push to this repository:

remote: Permission to prequel-co/ClickBench.git denied to rschu1ze.
fatal: unable to access 'https://github.com/prequel-co/ClickBench.git/': The requested URL returned error: 403

... therefore leaving some comments for now.

@@ -0,0 +1,4 @@
I created each warehouse in the Databricks UI.

Please move the content of this file into README.md

DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet

Some questions here: I set my Databricks hostname, the Databricks HTTP path, the instance type (2X-Small for the free test version), and the token. I didn't touch the CATALOG and SCHEMA variables.

When I ran benchmark.sh, I got this:

Connecting to Databricks; loading the data into clickbench_catalog.clickbench_schema
[WARN] pyarrow is not installed by default since databricks-sql-connector 4.0.0, any arrow specific api (e.g. fetchmany_arrow) and cloud fetch will be disabled. If you need these features, please run pip install pyarrow or pip install databricks-sql-connector[pyarrow] to install
Creating table and loading data from s3://some/path/hits.parquet...
Traceback (most recent call last):
  File "/data/ClickBench/databricks/./benchmark.py", line 357, in <module>
    load_data(run_metadata)
  File "/data/ClickBench/databricks/./benchmark.py", line 289, in load_data
    cursor.execute(load_query)
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/telemetry/latency_logger.py", line 175, in wrapper
    result = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/client.py", line 1260, in execute
    self.active_result_set = self.backend.execute_command(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1058, in execute_command
    execute_response, has_more_rows = self._handle_execute_response(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1265, in _handle_execute_response
    final_operation_state = self._wait_until_command_done(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 957, in _wait_until_command_done
    self._check_command_not_in_error_or_closed_state(op_handle, poll_resp)
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 635, in _check_command_not_in_error_or_closed_st
ate
    raise ServerOperationError(
databricks.sql.exc.ServerOperationError: [UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY] Unsupported data source type for direct query on files: parquet SQLSTATE: 0A000; lin
e 109 pos 13
Attempt to close session raised a local exception: sys.meta_path is None, Python is likely shutting down

(Line 289 ran the INSERT statement; the prior CREATE TABLE was successful.)

Do you have an idea what went wrong? Do I need to set any other variables?

Oh, I should also have mentioned that I set DATABRICKS_PARQUET_LOCATION to https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/hits_compatible/hits.parquet. Is this correct? If yes, I think we can hard-code it.
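For what it's worth, [UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY] is the error Databricks raises when a query selects directly from files (e.g. the parquet.`path` syntax) on a warehouse/catalog combination that doesn't allow it. Below is a minimal sketch of an alternative load query using the read_files table-valued function; the table name and path are placeholders taken from .env.example, and the script's actual query may differ:

```
# Sketch: load via read_files() instead of a direct "parquet.`...`" file query,
# which some Databricks warehouse/catalog combinations reject.
# Table and path names below are placeholders, not the PR's actual values.
load_query = """
    INSERT INTO clickbench_catalog.clickbench_schema.hits
    SELECT * FROM read_files(
        's3://some/path/hits.parquet',  -- placeholder Parquet location
        format => 'parquet'
    )
"""
cursor.execute(load_query)
```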

# Edit .env with your actual credentials
```

Required environment variables:

Lines 12-19 are already covered by the comments in .env.example and are redundant here.

@@ -0,0 +1,22 @@
#!/bin/bash


Please add sudo snap install --classic astral-uv here.

@@ -0,0 +1,109 @@
-- This is not used in the setup script, but is included here for reference.

It's slightly confusing to keep this file around; let's delete it.
