Commit 76ecef3

fix(python-binding): complete Python binding CI configuration (#18686)
* fix: enable Python binding CI for manual testing
  - Add inputs to workflow_dispatch for manual CI testing
  - Fix condition logic: change inputs.tag to inputs.version
  - Allow an optional version parameter with a default test-build value
  - This enables actual Python binding compilation in CI
* fix: handle non-semver version strings in CI
  - Allow test-build and other non-semver version strings
  - Use a fallback if regex extraction fails
  - This fixes the CI failure when using the default test-build version
* fix: remove rust-std from maturin rustup-components
  The rust-std component was causing maturin-action to attempt installing cargo-zigbuild, which failed due to the missing musl stdlib. Removing rust-std allows maturin to handle target installation automatically.
* fix: add explicit rustup target installation in maturin action
  Ensures the correct target (x86_64-unknown-linux-gnu) is installed before maturin attempts compilation, preventing musl stdlib errors. Also adds a platform specification to avoid architecture mismatches.
* fix: install cargo-zigbuild manually with correct target
  Manually install cargo-zigbuild with the GNU target instead of letting maturin auto-install with the musl target. This prevents the "can't find crate for std/core" errors when the musl stdlib is unavailable in the Docker container.
* fix: use valid semver for test builds in Python packaging
  When the version input is not valid semver (like "test-build-v4"), fall back to "0.1.0" instead of using the invalid string directly. This fixes TOML parse errors in pyproject.toml, which requires the version to start with a number.
* feat: improve Python binding documentation and fix version strategy
  - Fix version conflict by using git tag-based versioning with a dev suffix for tests
  - Enhance README.md with accurate examples from actual code (context.rs, basic.py)
  - Add a comprehensive API reference table
  - Improve pyproject.toml metadata with a proper description and classifiers
  - All examples verified against the actual binding implementation
  Examples include:
  1. Register external files (parquet/csv/ndjson) and query
  2. Create a table, insert data, and convert to pandas/polars
* update descriptions based on official Databend positioning
* Add storage connection management APIs to Python binding
  - Add create_*_connection methods for S3, Azure Blob, GCS, OSS, COS
  - Add list_connections, describe_connection, drop_connection methods
  - Use CREATE OR REPLACE CONNECTION for easier overwrites
  - Support optional parameters for S3 endpoint_url and region
  - Include comprehensive test coverage for all connection APIs
* Fix auto-initialization and testing issues
  - Handle mutex poisoning in service initialization with proper recovery
  - Implement dynamic port allocation to avoid conflicts in tests
  - Fix test parameter matching for mock assertions
  - Add comprehensive mock-based testing without full service startup
  - Ensure all 10 connection API tests pass successfully
  This enables seamless auto-initialization when creating SessionContext without requiring manual databend.init_service() calls.
* fix: resolve Python binding compilation and permission issues
  - Fix unused imports and compilation errors in bendpy
  - Resolve root-user permission-denied issues by properly configuring the account_admin role
  - Fix warehouse assignment logic for embedded mode using manual cluster creation
  - Bypass meta service connection errors in single-node embedded configuration
  - All Python binding tests now pass successfully
* refactor: add unified embedded mode detection and optimize Python binding architecture
  - Add GlobalConfig::is_embedded_mode() method for unified embedded mode detection
  - Apply embedded mode checks to cluster discovery and session management
  - Simplify the Python binding with comprehensive user permissions instead of manual cluster creation
  - Set virtual warehouse/cluster IDs for embedded mode to bypass warehouse validation
  - Rename basic.py to test_basic.py for pytest discovery
  - Remove unused dependencies from bendpy Cargo.toml
  - Clean up the interpreter factory by removing unnecessary embedded mode checks
* feat: enable automatic initialization in Python binding
  - Remove the requirement for a manual databend.init_embedded() call
  - SessionContext now auto-initializes embedded mode on first use
  - Add a data_path parameter to the SessionContext constructor
  - Update documentation and examples to reflect the simplified API
* ci: add pytest testing to Python binding release pipeline
  - Add a test job that runs on PRs and workflow dispatches
  - Integrate pytest execution into the build_bindings_python action
  - Test both development builds and release wheels
  - Ensure all Python binding functionality is validated before release
* ci: fix Rust toolchain setup for Python binding tests
  - Add proper Rust installation for development builds
  - Separate test steps for development vs release builds
  - Ensure the Rust toolchain is available before maturin develop
* ci: improve version handling in Python binding following pack_deb pattern
  - Use standard bash parameter expansion to remove the v prefix
  - Convert Databend version formats to PEP 440-compatible Python versions
  - Handle nightly (-nightly -> .dev0) and patch (-p1 -> .post1) releases
  - Add clear logging of the version transformation
* fix: resolve Python binding CI failures in PR #18686
  - Add a condition to the linux job so it only runs when a version is provided
  - Replace manual Rust installation with the dtolnay/rust-toolchain action
  - Eliminate duplicate job execution between the test and linux jobs
  - Ensure proper separation between PR testing and release builds
* fix: resolve maturin virtualenv requirement in CI
  - Create a proper virtual environment for the maturin develop command
  - Install the required dependencies (maturin, pytest, pandas, polars, pyarrow) in the venv
  - Activate the virtual environment before running maturin develop
  - Fix the "Couldn't find a virtualenv or conda environment" error
* fix: remove trailing spaces in YAML file to pass lint check
  - Remove trailing whitespace from build_bindings_python action.yml
  - Fix yamllint errors that caused the linux/check CI failure
  - Resolve "trailing spaces" errors on lines 99, 103, 104, 106, 110, 113, 117, 126, 132, 136
* revert: remove unrelated grpc_server changes from Python binding CI fix
* fix: next_port() should not assign duplicated port

---------

Co-authored-by: Zhang Yanpo <drdr.xp@gmail.com>
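As a rough illustration of the connection-management API described above: the method names (create_s3_connection, list_connections, describe_connection, drop_connection) come straight from the commit message, but the parameter names shown here are assumptions, not the verified signatures.

```python
# Hypothetical usage sketch of the connection-management API named in this
# commit message; parameter names are assumed, not taken from the actual binding.
import databend

ctx = databend.SessionContext()  # auto-initializes embedded mode per this PR

# Uses CREATE OR REPLACE CONNECTION under the hood per the commit message;
# endpoint_url and region are described as optional for S3.
ctx.create_s3_connection(
    "my_s3",
    access_key_id="AKIA...",
    secret_access_key="***",
    endpoint_url="https://s3.us-east-1.amazonaws.com",
    region="us-east-1",
)

print(ctx.list_connections())            # enumerate defined connections
print(ctx.describe_connection("my_s3"))  # inspect a single connection
ctx.drop_connection("my_s3")             # remove it
```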
1 parent: 3d49cd5 · commit: 76ecef3


17 files changed: +741 -363 lines changed


.github/actions/build_bindings_python/action.yml

Lines changed: 67 additions & 5 deletions
@@ -15,8 +15,12 @@ runs:
       if: inputs.version
       shell: bash
       run: |
-        VERSION=`echo ${{ inputs.version }} | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+'`
-        echo "building tag and version: git tag: $GIT_TAG version: $VERSION"
+        raw_version="${{ inputs.version }}"
+        # Remove v prefix: v1.2.3-nightly -> 1.2.3-nightly
+        version_no_v=${raw_version/v/}
+        # Convert to Python-compatible version: 1.2.3-nightly -> 1.2.3.dev0, 1.2.3-p1 -> 1.2.3.post1
+        VERSION=$(echo "$version_no_v" | sed 's/-nightly/.dev0/' | sed 's/-p\([0-9]*\)/.post\1/')
+        echo "building version: $raw_version -> $VERSION"
         sed "s#version = \"0.1.0\"#version = \"$VERSION\"#g" Cargo.toml > Cargo.toml.bak
         sed "s#version = \"0.1.0\"#version = \"$VERSION\"#g" pyproject.toml > pyproject.toml.bak
@@ -57,19 +61,77 @@ runs:
         enable-cache: true

     - name: Build wheels
-      if: inputs.tag
+      if: inputs.version
       uses: PyO3/maturin-action@v1
       with:
         rust-toolchain: ${{ steps.toolchain.outputs.RUST_TOOLCHAIN }}
         working-directory: src/bendpy
         target: ${{ inputs.target }}
         manylinux: "2_28"
         # Keep them in one line due to https://github.com/PyO3/maturin-action/issues/153
-        rustup-components: rust-std rustfmt
+        rustup-components: rustfmt
         args: ${{ steps.opts.outputs.BUILD_ARGS }}
-        docker-options: --env RUSTC_WRAPPER=sccache --env SCCACHE_GCS_RW_MODE=READ_WRITE --env SCCACHE_GCS_BUCKET=databend-ci --env SCCACHE_GCS_KEY_PREFIX=cache/sccache/
+        docker-options: --env RUSTC_WRAPPER=sccache --env SCCACHE_GCS_RW_MODE=READ_WRITE --env SCCACHE_GCS_BUCKET=databend-ci --env SCCACHE_GCS_KEY_PREFIX=cache/sccache/ --env MATURIN_NO_AUTO_INSTALL=1
+        container-options: --platform linux/amd64
         before-script-linux: |
           unset RUSTC_WRAPPER
+          # Add the target for the specified architecture
+          rustup target add ${{ inputs.target }}
+          # Install cargo-zigbuild manually to avoid musl target issues
+          cargo install cargo-zigbuild --target ${{ inputs.target }}
           ../../scripts/setup/dev_setup.sh -yb
           uv venv --python=python3.12
           uv sync --all-groups --all-extras
+
+    - name: Setup Rust for development builds
+      if: inputs.version == ''
+      uses: dtolnay/rust-toolchain@master
+      with:
+        toolchain: ${{ steps.toolchain.outputs.RUST_TOOLCHAIN }}
+        components: rustfmt
+
+    - name: Build development wheel and run tests
+      if: inputs.version == ''
+      shell: bash
+      working-directory: src/bendpy
+      run: |
+        echo "Building development wheel for testing..."
+
+        # Create and activate virtual environment
+        uv venv --python python3.12
+        source .venv/bin/activate
+
+        # Install development dependencies
+        uv pip install maturin pytest pandas polars pyarrow
+
+        # Ensure we have a clean environment
+        export PATH="$HOME/.local/bin:$PATH"
+        export PATH="$HOME/.cargo/bin:$PATH"
+
+        echo "Running development build tests..."
+        maturin develop -E test --quiet
+
+        # Run pytest tests
+        echo "Executing pytest tests..."
+        python -m pytest tests/ -v --tb=short
+
+        echo "All Python binding tests passed!"
+
+    - name: Test built wheels
+      if: inputs.version != ''
+      shell: bash
+      working-directory: src/bendpy
+      run: |
+        echo "Testing built wheels..."
+
+        # Install from built wheel for release testing
+        uv venv --python python3.12
+        source .venv/bin/activate
+        uv pip install dist/*.whl
+        uv pip install pytest pandas polars pyarrow
+
+        # Run pytest tests
+        echo "Executing pytest tests..."
+        python -m pytest tests/ -v --tb=short
+
+        echo "All Python binding tests passed!"

.github/workflows/bindings.python.yml

Lines changed: 26 additions & 0 deletions
@@ -3,6 +3,12 @@ name: Bindings Python
 on:
   ## uncomment it when bendpy is enabled
   workflow_dispatch:
+    inputs:
+      version:
+        description: Version to release (optional for testing)
+        required: false
+        type: string
+        default: "test-build"
   pull_request:
     branches:
       - main
@@ -27,7 +33,27 @@ permissions:
   packages: write

 jobs:
+  test:
+    # Run tests on all PRs and workflow dispatches
+    if: github.event_name == 'pull_request' || github.event_name == 'workflow_dispatch'
+    runs-on:
+      - self-hosted
+      - X64
+      - Linux
+      - 4c16g
+      - aws
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+      - uses: ./.github/actions/build_bindings_python
+        with:
+          target: x86_64-unknown-linux-gnu
+          # No version means development build with tests
+
   linux:
+    # Only run for version builds (releases)
+    if: inputs.version
     runs-on:
       - self-hosted
       - "${{ matrix.runner }}"

Cargo.lock

Lines changed: 2 additions & 2 deletions
Generated file; diff not rendered by default.

src/bendpy/Cargo.toml

Lines changed: 2 additions & 2 deletions
@@ -18,20 +18,20 @@ crate-type = ["cdylib"]
 arrow = { workspace = true, features = ["pyarrow"] }
 arrow-schema = { workspace = true }
 ctor = { workspace = true }
+databend-common-base = { workspace = true }
 databend-common-catalog = { workspace = true }
 databend-common-config = { workspace = true }
 databend-common-exception = { workspace = true }
 databend-common-expression = { workspace = true }
 databend-common-license = { workspace = true }
 databend-common-meta-app = { workspace = true }
-databend-common-users = { workspace = true }
+databend-common-tracing = { workspace = true }
 databend-common-version = { workspace = true }
 databend-query = { workspace = true, features = [
     "simd",
     "disable_initial_exec_tls",
 ] }
 pyo3 = { version = "0.24", features = ["generate-import-lib", "abi3-py312"] }
-tempfile = { workspace = true }
 tokio = { workspace = true, features = ["macros", "rt", "rt-multi-thread", "sync"] }
 tokio-stream = { workspace = true }

src/bendpy/README.md

Lines changed: 60 additions & 101 deletions
@@ -1,141 +1,100 @@
 # Databend Python Binding

-This crate intends to build a native python binding.
+Official Python binding for [Databend](https://databend.com) - The AI-Native Data Warehouse.
+
+Databend is the open-source alternative to Snowflake with near 100% SQL compatibility and native AI capabilities. Built in Rust with MPP architecture and S3-native storage, Databend unifies structured tables, JSON documents, and vector embeddings in a single platform.

 ## Installation

 ```bash
 pip install databend
 ```

-## Usage
+## Quick Start

-### Basic:
 ```python
 import databend
-databend.init_service(local_dir = ".databend")
-# or use config
-# databend.init_service( config = "config.toml.sample" )
-
-from databend import SessionContext
-ctx = SessionContext()

-df = ctx.sql("select number, number + 1, number::String as number_p_1 from numbers(8)")
+# Create session (automatically initializes embedded mode)
+ctx = databend.SessionContext()

+# Execute SQL
+df = ctx.sql("SELECT number, number + 1 FROM numbers(5)")
 df.show()
-# convert to pyarrow
-import pyarrow
-df.to_py_arrow()

-# convert to pandas
-import pandas
-df.to_pandas()
+# Convert to pandas/polars
+pandas_df = df.to_pandas()
+polars_df = df.to_polars()
 ```

-### Register external table:
+## Examples

-***supported functions:***
-- register_parquet
-- register_ndjson
-- register_csv
-- register_tsv
+### 1. Register External Files and Query

 ```python
+import databend

-ctx.register_parquet("pa", "/home/sundy/dataset/hits_p/", pattern = ".*.parquet")
-ctx.sql("select * from pa limit 10").collect()
-```
-
-### Tenant separation:
+ctx = databend.SessionContext()

-Tenant has it's own catalog and tables
+# Register external files
+ctx.register_parquet("pa", "/home/dataset/hits_p/", pattern=".*.parquet")
+ctx.register_csv("users", "/path/to/users.csv")
+ctx.register_ndjson("logs", "/path/to/logs/", pattern=".*.jsonl")

-```python
-ctx = SessionContext(tenant = "your_tenant_name")
+# Query external data
+result = ctx.sql("SELECT * FROM pa LIMIT 10").collect()
+print(result)
 ```

-## Development
+### 2. Create Table, Insert and Select

-Setup virtualenv:
+```python
+import databend

-```shell
-uv sync
-```
+ctx = databend.SessionContext()

-Activate venv:
+# Create table
+ctx.sql("CREATE TABLE aa (a INT, b STRING, c BOOL, d DOUBLE)").collect()

-```shell
-source .venv/bin/activate
-````
+# Insert data
+ctx.sql("INSERT INTO aa SELECT number, number, true, number FROM numbers(10)").collect()
+ctx.sql("INSERT INTO aa SELECT number, number, true, number FROM numbers(10)").collect()

-Install `maturin`:
+# Query and convert to pandas
+df = ctx.sql("SELECT sum(a) x, max(b) y, max(d) z FROM aa WHERE c").to_pandas()
+print(df.values.tolist())  # [[90.0, "9", 9.0]]

-```shell
-pip install "maturin[patchelf]"
+# Query and convert to polars
+df_polars = ctx.sql("SELECT sum(a) x, max(b) y, max(d) z FROM aa WHERE c").to_polars()
+print(df_polars.to_pandas().values.tolist())  # [[90.0, "9", 9.0]]
 ```

-Build bindings:
+## API Reference
+
+| Method | Description | Example |
+|--------|-------------|---------|
+| `SessionContext(tenant=None, data_path=".databend")` | Create session context (auto-initializes) | `ctx = databend.SessionContext()` |
+| `ctx.sql(sql)` | Execute SQL and return DataFrame | `df = ctx.sql("SELECT * FROM table")` |
+| `df.show(num=20)` | Display DataFrame results | `df.show()` |
+| `df.collect()` | Collect DataFrame as DataBlocks | `blocks = df.collect()` |
+| `df.to_pandas()` | Convert to Pandas DataFrame | `pdf = df.to_pandas()` |
+| `df.to_polars()` | Convert to Polars DataFrame | `pldf = df.to_polars()` |
+| `df.to_py_arrow()` | Convert to PyArrow batches | `batches = df.to_py_arrow()` |
+| `df.to_arrow_table()` | Convert to PyArrow Table | `table = df.to_arrow_table()` |
+| `ctx.register_parquet(name, path, pattern=None)` | Register Parquet files | `ctx.register_parquet("data", "/path/")` |
+| `ctx.register_csv(name, path, pattern=None)` | Register CSV files | `ctx.register_csv("users", "/users.csv")` |
+| `ctx.register_ndjson(name, path, pattern=None)` | Register NDJSON files | `ctx.register_ndjson("logs", "/logs/")` |
+| `ctx.register_tsv(name, path, pattern=None)` | Register TSV files | `ctx.register_tsv("data", "/data.tsv")` |

-```shell
-uvx maturin develop
-```
+## Development

-Run tests:
+```bash
+# Setup environment
+uv sync
+source .venv/bin/activate

-```shell
+# Run tests
 uvx maturin develop -E test
+pytest tests/
 ```

-Build API docs:
-
-```shell
-uvx maturin develop -E docs
-uvx pdoc databend
-```
-
-## Service configuration
-
-> Note:
-
-**`databend.init_service` must be initialized before `SessionContext`**
-
-**`databend.init_service` must be called only once**
-
-
-- By default, you can init the service by a local directory, then data & catalogs will be stored inside the directory.
-```
-import databend
-
-databend.init_service(local_dir = ".databend")
-```
-
-- You can also init by file
-
-```
-import databend
-databend.init_service( config = "config.toml.sample" )
-```
-
-- And by config str
-```
-import databend
-
-databend.init_service(config = """
-[meta]
-embedded_dir = "./.databend/"
-
-# Storage config.
-[storage]
-# fs | s3 | azblob | obs | oss
-type = "fs"
-allow_insecure = true
-
-[storage.fs]
-data_path = "./.databend/"
-""")
-```
-
-Read more about configs of databend in [docs](https://docs.databend.com/guides/deploy/deploy/production/metasrv-deploy)
-
-## More
-Databend python api is inspired by [arrow-datafusion-python](https://github.com/apache/arrow-datafusion-python), thanks for their great work.
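Since the commit renames basic.py to test_basic.py for pytest discovery, a test built from the README examples above might look roughly like the following; this is an illustrative sketch, not the repository's actual test file.

```python
# Illustrative pytest-style sketch based only on the README examples above;
# not the repository's actual test_basic.py.
import databend


def test_numbers_to_pandas():
    ctx = databend.SessionContext()  # auto-initializes embedded mode
    df = ctx.sql("SELECT number, number + 1 FROM numbers(5)").to_pandas()
    assert len(df) == 5


def test_create_insert_select():
    ctx = databend.SessionContext()
    ctx.sql("CREATE TABLE aa (a INT, b STRING, c BOOL, d DOUBLE)").collect()
    ctx.sql("INSERT INTO aa SELECT number, number, true, number FROM numbers(10)").collect()
    ctx.sql("INSERT INTO aa SELECT number, number, true, number FROM numbers(10)").collect()
    df = ctx.sql("SELECT sum(a) x, max(b) y, max(d) z FROM aa WHERE c").to_pandas()
    # Expected result documented in the README: [[90.0, "9", 9.0]]
    assert df.values.tolist() == [[90.0, "9", 9.0]]
```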
