Add Databricks and benchmark results for most SQL warehouse options #683
Conversation
(Force-pushed from d9926ef to f7f02a8.)
Oh, that PR would have been nice to merge :-/ @conormccarter please let us know if you need support from our end to go forward with this PR.

Hey @rschu1ze, I will reopen once I update the results! (I realized that I failed to turn off the query cache, resulting in inaccurate "hot run" times.)
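(For reference, a minimal sketch of what disabling the result cache could look like with the databricks-sql-connector; this is not the PR's actual code, and the connection values and table name are placeholders. `use_cached_result` is the Databricks SQL session parameter that controls result caching.)

```python
# A minimal sketch (placeholder credentials): turn off the Databricks SQL
# result cache for the session so repeated "hot" runs measure execution
# time rather than cache hits.
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/xxxxxxxxxxxx",         # placeholder
    access_token="dapi-placeholder",                      # placeholder
) as conn:
    with conn.cursor() as cursor:
        # use_cached_result is a Databricks SQL session parameter.
        cursor.execute("SET use_cached_result = false")
        # Placeholder table name; the catalog/schema match the log output
        # later in this thread.
        cursor.execute("SELECT COUNT(*) FROM clickbench_catalog.clickbench_schema.hits")
        print(cursor.fetchone())
```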
(Force-pushed from 07774c2 to 062b789.)
rschu1ze left a comment:
I got a permission error when I tried to push to this repository:

remote: Permission to prequel-co/ClickBench.git denied to rschu1ze.
fatal: unable to access 'https://github.com/prequel-co/ClickBench.git/': The requested URL returned error: 403

... therefore I'm leaving some comments for now.
@@ -0,0 +1,4 @@
I created each warehouse in the Databricks UI.
Please move the content of this file into README.md
DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
Some questions here: I set my Databricks hostname, the Databricks HTTP path, the instance type (2X-Small for the free test version), and the token. I didn't touch the CATALOG and SCHEMA variables.
When I ran benchmark.sh, I got this:
Connecting to Databricks; loading the data into clickbench_catalog.clickbench_schema
[WARN] pyarrow is not installed by default since databricks-sql-connector 4.0.0, any arrow specific api (e.g. fetchmany_arrow) and cloud fetch will be disabled. If you need these features, please run pip install pyarrow or pip install databricks-sql-connector[pyarrow] to install
Creating table and loading data from s3://some/path/hits.parquet...
Traceback (most recent call last):
  File "/data/ClickBench/databricks/./benchmark.py", line 357, in <module>
    load_data(run_metadata)
  File "/data/ClickBench/databricks/./benchmark.py", line 289, in load_data
    cursor.execute(load_query)
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/telemetry/latency_logger.py", line 175, in wrapper
    result = func(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/client.py", line 1260, in execute
    self.active_result_set = self.backend.execute_command(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1058, in execute_command
    execute_response, has_more_rows = self._handle_execute_response(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 1265, in _handle_execute_response
    final_operation_state = self._wait_until_command_done(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 957, in _wait_until_command_done
    self._check_command_not_in_error_or_closed_state(op_handle, poll_resp)
  File "/data/ClickBench/databricks/.venv/lib/python3.12/site-packages/databricks/sql/backend/thrift_backend.py", line 635, in _check_command_not_in_error_or_closed_state
    raise ServerOperationError(
databricks.sql.exc.ServerOperationError: [UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY] Unsupported data source type for direct query on files: parquet SQLSTATE: 0A000; line 109 pos 13
Attempt to close session raised a local exception: sys.meta_path is None, Python is likely shutting down
(Line 289 ran the INSERT statement; the prior CREATE TABLE was successful.)
Do you have an idea what went wrong? Do I need to set any other variables?
Oh, I should have mentioned as well that I set DATABRICKS_PARQUET_LOCATION to https://clickhouse-public-datasets.s3.eu-central-1.amazonaws.com/hits_compatible/hits.parquet. Is this correct? If yes, I think we can hard-code it as well.
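(A possible workaround sketch, not this PR's actual benchmark.py: the error suggests the INSERT read the file via the direct-file parquet.`<path>` syntax, which this SQL warehouse rejected, and an https:// URL is not a storage scheme a warehouse can read as a table. One option is the read_files() table-valued function over a staged s3:// copy; all connection values below are placeholders.)

```python
# A hedged sketch, not the PR's benchmark.py: load hits.parquet through
# read_files() instead of the direct-file parquet.`...` syntax. Connection
# values, catalog/schema names, and the staged s3:// path are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/xxxxxxxxxxxx",         # placeholder
    access_token="dapi-placeholder",                      # placeholder
) as conn:
    with conn.cursor() as cursor:
        # Assumes the hits table already exists (the CREATE TABLE step in the
        # log above succeeded) and its column order matches the Parquet file.
        cursor.execute("""
            INSERT INTO clickbench_catalog.clickbench_schema.hits
            SELECT * FROM read_files(
                's3://some/path/hits.parquet',  -- must be s3://, abfss://, gs://, etc.
                format => 'parquet'             -- https:// URLs will not work here
            )
        """)
```

If that reading is right, staging the public hits.parquet into an S3 bucket (or a Unity Catalog volume) the warehouse can read, and pointing DATABRICKS_PARQUET_LOCATION at that copy, would sidestep the https:// limitation.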
# Edit .env with your actual credentials

Required environment variables:
Lines 12-19 are already covered by the comments in .env.example and are redundant.
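(For context, a rough sketch of what the .env.example in question might contain. Only DATABRICKS_SCHEMA, DATABRICKS_PARQUET_LOCATION, and the clickbench_catalog/clickbench_schema defaults appear verbatim in this thread; every other variable name and value below is an assumption.)

```bash
# Warehouse connection -- names assumed, not confirmed by the PR
DATABRICKS_HOST=dbc-xxxxxxxx.cloud.databricks.com
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxxxxxxxxxxx
DATABRICKS_TOKEN=dapi-placeholder
DATABRICKS_WAREHOUSE_SIZE=2X-Small

# Catalog and schema (defaults per the log output above)
DATABRICKS_CATALOG=clickbench_catalog
DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location (appears verbatim in the diff above)
DATABRICKS_PARQUET_LOCATION=s3://some/path/hits.parquet
```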
@@ -0,0 +1,22 @@
#!/bin/bash
Please add sudo snap install --classic astral-uv here.
@@ -0,0 +1,109 @@
-- This is not used in the setup script, but is included here for reference.
It's slightly confusing to keep this file around; let's delete it.
Resolves: #24