
Commit 2ad5e7e

further discretize pages
1 parent 9633c25 commit 2ad5e7e

File tree: 7 files changed (+829 -256 lines)

docs/integrations/language-clients/python/advanced-inserting.md

Lines changed: 91 additions & 0 deletions
@@ -65,4 +65,95 @@ In most cases, it is unnecessary to override the write format for a data type, b
| Variant | object | | At this time all variants are inserted as Strings and parsed by the ClickHouse server |
| Dynamic | object | | Warning -- at this time any inserts into a Dynamic column are persisted as a ClickHouse String |

### Specialized insert methods {#specialized-insert-methods}

ClickHouse Connect provides specialized insert methods for common data formats:

- `insert_df` -- Insert a Pandas DataFrame. Instead of a Python Sequence of Sequences `data` argument, the second parameter of this method requires a `df` argument that must be a Pandas DataFrame instance. ClickHouse Connect automatically processes the DataFrame as a column-oriented datasource, so the `column_oriented` parameter is not required or available.
- `insert_arrow` -- Insert a PyArrow Table. ClickHouse Connect passes the Arrow table unmodified to the ClickHouse server for processing, so only the `database` and `settings` arguments are available in addition to `table` and `arrow_table`.
- `insert_df_arrow` -- Insert an Arrow-backed Pandas DataFrame or a Polars DataFrame. ClickHouse Connect automatically determines whether the DataFrame is a Pandas or Polars type. For Pandas, each column's dtype backend is validated to be Arrow-based, and an error is raised if any are not.

:::note
A NumPy array is a valid Sequence of Sequences and can be used as the `data` argument to the main `insert` method, so a specialized method is not required.
:::
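For example, a two-dimensional NumPy array can be passed directly to `insert` (a minimal sketch, assuming a `users` table with matching columns):

```python
import clickhouse_connect
import numpy as np

client = clickhouse_connect.get_client()

# A 2D NumPy array is a valid Sequence of Sequences, so the main insert method accepts it
data = np.array([[1, "Alice", 25], [2, "Bob", 30], [3, "Joe", 28]], dtype=object)
client.insert("users", data, column_names=["id", "name", "age"])
```
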
#### Pandas DataFrame insert {#pandas-dataframe-insert}

```python
import clickhouse_connect
import pandas as pd

client = clickhouse_connect.get_client()

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Joe"],
    "age": [25, 30, 28],
})

client.insert_df("users", df)
```

#### PyArrow Table insert {#pyarrow-table-insert}

```python
import clickhouse_connect
import pyarrow as pa

client = clickhouse_connect.get_client()

arrow_table = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Joe"],
    "age": [25, 30, 28],
})

client.insert_arrow("users", arrow_table)
```

#### Arrow-backed DataFrame insert (pandas 2.x) {#arrow-backed-dataframe-insert-pandas-2}

```python
import clickhouse_connect
import pandas as pd

client = clickhouse_connect.get_client()

# Convert to Arrow-backed dtypes for better performance
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Joe"],
    "age": [25, 30, 28],
}).convert_dtypes(dtype_backend="pyarrow")

client.insert_df_arrow("users", df)
```

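#### Polars DataFrame insert {#polars-dataframe-insert}

As noted above, `insert_df_arrow` also accepts a Polars DataFrame directly; a minimal sketch, assuming the same `users` table:

```python
import clickhouse_connect
import polars as pl

client = clickhouse_connect.get_client()

# Polars DataFrames are already Arrow-backed, so no dtype conversion is needed
df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Joe"],
    "age": [25, 30, 28],
})

client.insert_df_arrow("users", df)
```
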
## File inserts {#file-inserts}

The `clickhouse_connect.driver.tools` package includes the `insert_file` method that allows inserting data directly from the file system into an existing ClickHouse table. Parsing is delegated to the ClickHouse server. `insert_file` accepts the following parameters:

| Parameter | Type | Default | Description |
|--------------|-----------------|-------------------|---------------------------------------------------------------------------------------------------------------------------|
| client | Client | *Required* | The `driver.Client` used to perform the insert |
| table | str | *Required* | The ClickHouse table to insert into. The full table name (including database) is permitted. |
| file_path | str | *Required* | The native file system path to the data file |
| fmt | str | CSV, CSVWithNames | The ClickHouse Input Format of the file. CSVWithNames is assumed if `column_names` is not provided |
| column_names | Sequence of str | *None* | A list of column names in the data file. Not required for formats that include column names |
| database | str | *None* | Database of the table. Ignored if the table is fully qualified. If not specified, the insert will use the client database |
| settings | dict | *None* | See [settings description](driver-api.md#settings-argument). |
| compression | str | *None* | A recognized ClickHouse compression type (zstd, lz4, gzip) used for the Content-Encoding HTTP header |

For files with inconsistent data or date/time values in an unusual format, settings that apply to data imports (such as `input_format_allow_errors_num` and `input_format_allow_errors_ratio`) are recognized for this method.

```python
import clickhouse_connect
from clickhouse_connect.driver.tools import insert_file

client = clickhouse_connect.get_client()
insert_file(client, 'example_table', 'my_data.csv',
            settings={'input_format_allow_errors_ratio': .2,
                      'input_format_allow_errors_num': 5})
```
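
For a data file without a header row, the format and column names can be passed explicitly (a sketch, assuming a hypothetical headerless `my_data_no_header.csv`):

```python
import clickhouse_connect
from clickhouse_connect.driver.tools import insert_file

client = clickhouse_connect.get_client()

# Plain CSV with no header row, so supply the column names explicitly
insert_file(client, 'example_table', 'my_data_no_header.csv',
            fmt='CSV',
            column_names=['id', 'name', 'age'])
```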

docs/integrations/language-clients/python/advanced-querying.md

Lines changed: 269 additions & 2 deletions
@@ -32,6 +32,18 @@ Note that `QueryContext`s are not thread safe, but a copy can be obtained in a m
## Streaming queries {#streaming-queries}

The ClickHouse Connect Client provides multiple methods for retrieving data as a stream (implemented as a Python generator):

- `query_column_block_stream` -- Returns query data in blocks as a sequence of columns using native Python objects
- `query_row_block_stream` -- Returns query data in blocks of rows using native Python objects
- `query_rows_stream` -- Returns query data as a sequence of rows using native Python objects
- `query_np_stream` -- Returns each ClickHouse block of query data as a NumPy array
- `query_df_stream` -- Returns each ClickHouse block of query data as a Pandas DataFrame
- `query_arrow_stream` -- Returns query data as PyArrow RecordBatches
- `query_df_arrow_stream` -- Returns each ClickHouse block of query data as an Arrow-backed Pandas DataFrame or a Polars DataFrame, depending on the `dataframe_library` kwarg (default is `"pandas"`)

Each of these methods returns a `StreamContext` object that must be opened via a `with` statement to start consuming the stream.

### Data blocks {#data-blocks}

ClickHouse Connect processes all data from the primary `query` method as a stream of blocks received from the ClickHouse server. These blocks are transmitted in the custom "Native" format to and from ClickHouse. A "block" is simply a sequence of columns of binary data, where each column contains an equal number of data values of the specified data type. (As a columnar database, ClickHouse stores this data in a similar form.) The size of a block returned from a query is governed by two user settings that can be set at several levels (user profile, user, session, or query). They are:

@@ -73,8 +85,6 @@ The `query_np_stream` method returns each block as a two-dimensional NumPy array.

The `query_df_stream` method returns each ClickHouse block as a two-dimensional Pandas DataFrame. Here's an example which shows that the `StreamContext` object can be used as a context in a deferred fashion (but only once).

```python
df_stream = client.query_df_stream('SELECT * FROM hits')
column_names = df_stream.source.column_names
with df_stream:
    for df in df_stream:
        <do something with the pandas DataFrame>
```

The `query_df_arrow_stream` method returns each ClickHouse block as a DataFrame with a PyArrow dtype backend. This method supports both Pandas (2.x or later) and Polars DataFrames via the `dataframe_library` parameter (defaults to `"pandas"`). Each iteration yields a DataFrame converted from PyArrow record batches, providing better performance and memory efficiency for certain data types.

Finally, the `query_arrow_stream` method returns a ClickHouse `ArrowStream` formatted result as a `pyarrow.ipc.RecordBatchStreamReader` wrapped in a `StreamContext`. Each iteration of the stream returns a PyArrow RecordBatch.

### Streaming examples {#streaming-examples}

#### Stream rows {#stream-rows}

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream large result sets row by row
with client.query_rows_stream("SELECT number, number * 2 as doubled FROM system.numbers LIMIT 100000") as stream:
    for row in stream:
        print(row)  # Process each row
# Output:
# (0, 0)
# (1, 2)
# (2, 4)
# ....
```

#### Stream row blocks {#stream-row-blocks}

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream in blocks of rows (more efficient than row-by-row)
with client.query_row_block_stream("SELECT number, number * 2 FROM system.numbers LIMIT 100000") as stream:
    for block in stream:
        print(f"Received block with {len(block)} rows")
# Output:
# Received block with 65409 rows
# Received block with 34591 rows
```

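#### Stream column blocks {#stream-column-blocks}

`query_column_block_stream` works the same way, but each block arrives as a sequence of columns rather than rows (a sketch under the same assumptions as above):

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream in blocks of columns; each block is a sequence of column value sequences
with client.query_column_block_stream("SELECT number, number * 2 FROM system.numbers LIMIT 100000") as stream:
    for block in stream:
        # block[0] holds every value of the first column in this block
        print(f"Received block with {len(block)} columns of {len(block[0])} values")
```
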
#### Stream Pandas DataFrames {#stream-pandas-dataframes}

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream query results as Pandas DataFrames
with client.query_df_stream("SELECT number, toString(number) AS str FROM system.numbers LIMIT 100000") as stream:
    for df in stream:
        # Process each DataFrame block
        print(f"Received DataFrame with {len(df)} rows")
        print(df.head(3))
# Output:
# Received DataFrame with 65409 rows
#    number str
# 0       0   0
# 1       1   1
# 2       2   2
# Received DataFrame with 34591 rows
#    number    str
# 0   65409  65409
# 1   65410  65410
# 2   65411  65411
```

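#### Stream NumPy arrays {#stream-numpy-arrays}

`query_np_stream` follows the same pattern; a minimal sketch (each iteration yields one ClickHouse block as a two-dimensional NumPy array):

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream query results as NumPy arrays, one per ClickHouse block
with client.query_np_stream("SELECT number, number * 2 AS doubled FROM system.numbers LIMIT 100000") as stream:
    for np_block in stream:
        print(f"Received NumPy array with shape {np_block.shape}")
```
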
#### Stream Arrow batches {#stream-arrow-batches}

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream query results as Arrow record batches
with client.query_arrow_stream("SELECT * FROM large_table") as stream:
    for arrow_batch in stream:
        # Process each Arrow batch
        print(f"Received Arrow batch with {arrow_batch.num_rows} rows")
# Output:
# Received Arrow batch with 65409 rows
# Received Arrow batch with 34591 rows
```

## NumPy, Pandas, and Arrow queries {#numpy-pandas-and-arrow-queries}

ClickHouse Connect provides specialized query methods for working with NumPy, Pandas, and Arrow data structures. These methods allow you to retrieve query results directly in these popular data formats without manual conversion.

### NumPy queries {#numpy-queries}

The `query_np` method returns query results as a NumPy array instead of a ClickHouse Connect `QueryResult`.

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Query returns a NumPy array
np_array = client.query_np("SELECT number, number * 2 AS doubled FROM system.numbers LIMIT 5")

print(type(np_array))
# Output:
# <class "numpy.ndarray">

print(np_array)
# Output:
# [[0 0]
#  [1 2]
#  [2 4]
#  [3 6]
#  [4 8]]
```

### Pandas queries {#pandas-queries}

The `query_df` method returns query results as a Pandas DataFrame instead of a ClickHouse Connect `QueryResult`.

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Query returns a Pandas DataFrame
df = client.query_df("SELECT number, number * 2 AS doubled FROM system.numbers LIMIT 5")

print(type(df))
# Output: <class "pandas.core.frame.DataFrame">

print(df)
# Output:
#    number  doubled
# 0       0        0
# 1       1        2
# 2       2        4
# 3       3        6
# 4       4        8
```

### PyArrow queries {#pyarrow-queries}

The `query_arrow` method returns query results as a PyArrow Table. It utilizes the ClickHouse `Arrow` format directly, so it only accepts three arguments in common with the main `query` method: `query`, `parameters`, and `settings`. There is also one additional argument, `use_strings`, which determines whether the Arrow Table will render ClickHouse String types as strings (if True) or bytes (if False).

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Query returns a PyArrow Table
arrow_table = client.query_arrow("SELECT number, toString(number) AS str FROM system.numbers LIMIT 3")

print(type(arrow_table))
# Output:
# <class "pyarrow.lib.Table">

print(arrow_table)
# Output:
# pyarrow.Table
# number: uint64 not null
# str: string not null
# ----
# number: [[0,1,2]]
# str: [["0","1","2"]]
```

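To retrieve ClickHouse String columns as Arrow binary instead, pass `use_strings=False` (a minimal sketch; the resulting Arrow schema may vary by server version and settings):

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# With use_strings=False, ClickHouse String columns arrive as Arrow binary
binary_table = client.query_arrow(
    "SELECT toString(number) AS str FROM system.numbers LIMIT 3",
    use_strings=False
)
print(binary_table.schema.field("str").type)
# Expected output: binary
```
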
### Arrow-backed DataFrames {#arrow-backed-dataframes}

ClickHouse Connect supports fast, memory-efficient DataFrame creation from Arrow results via the `query_df_arrow` and `query_df_arrow_stream` methods. These are thin wrappers around the Arrow query methods and perform zero-copy conversions to DataFrames where possible:

- `query_df_arrow`: Executes the query using the ClickHouse `Arrow` output format and returns a DataFrame.
  - For `dataframe_library='pandas'`, returns a pandas 2.x DataFrame using Arrow-backed dtypes (`pd.ArrowDtype`). This requires pandas 2.x and leverages zero-copy buffers where possible for excellent performance and low memory overhead.
  - For `dataframe_library='polars'`, returns a Polars DataFrame created from the Arrow table (`pl.from_arrow`), which is similarly efficient and can be zero-copy depending on the data.
- `query_df_arrow_stream`: Streams results as a sequence of DataFrames (pandas 2.x or Polars) converted from Arrow stream batches.

#### Query to Arrow-backed DataFrame {#query-to-arrow-backed-dataframe}

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Query returns a Pandas DataFrame with Arrow dtypes (requires pandas 2.x)
df = client.query_df_arrow(
    "SELECT number, toString(number) AS str FROM system.numbers LIMIT 3",
    dataframe_library="pandas"
)

print(df.dtypes)
# Output:
# number    uint64[pyarrow]
# str       string[pyarrow]
# dtype: object

# Or use Polars
polars_df = client.query_df_arrow(
    "SELECT number, toString(number) AS str FROM system.numbers LIMIT 3",
    dataframe_library="polars"
)
print(polars_df.dtypes)
# Output:
# [UInt64, String]

# Streaming into batches of DataFrames (Polars shown)
with client.query_df_arrow_stream(
    "SELECT number, toString(number) AS str FROM system.numbers LIMIT 100000",
    dataframe_library="polars"
) as stream:
    for df_batch in stream:
        print(f"Received {type(df_batch)} batch with {len(df_batch)} rows and dtypes: {df_batch.dtypes}")
# Output:
# Received <class 'polars.dataframe.frame.DataFrame'> batch with 65409 rows and dtypes: [UInt64, String]
# Received <class 'polars.dataframe.frame.DataFrame'> batch with 34591 rows and dtypes: [UInt64, String]
```

#### Notes and caveats {#notes-and-caveats}

- Arrow type mapping: When returning data in Arrow format, ClickHouse maps types to the closest supported Arrow types. Some ClickHouse types do not have a native Arrow equivalent and are returned as raw bytes in Arrow fields (usually `BINARY` or `FIXED_SIZE_BINARY`).
  - Examples: `IPv4` is represented as Arrow `UINT32`; `IPv6` and large integers (`Int128`/`UInt128`/`Int256`/`UInt256`) are often represented as `FIXED_SIZE_BINARY`/`BINARY` with raw bytes.
  - In these cases, the DataFrame column will contain byte values backed by the Arrow field; it is up to the client code to interpret/convert those bytes according to ClickHouse semantics.
  - Unsupported Arrow data types (e.g., UUID/ENUM as true Arrow types) are not emitted; values are represented using the closest supported Arrow type (often as binary bytes) for output.
- Pandas requirement: Arrow-backed dtypes require pandas 2.x. For older pandas versions, use `query_df` (non-Arrow) instead.
- Strings vs binary: The `use_strings` option (when supported by the server setting `output_format_arrow_string_as_string`) controls whether ClickHouse `String` columns are returned as Arrow strings or as binary.

#### Mismatched ClickHouse/Arrow type conversion examples {#mismatched-clickhousearrow-type-conversion-examples}

When ClickHouse returns columns as raw binary data (e.g., `FIXED_SIZE_BINARY` or `BINARY`), it is the responsibility of application code to convert these bytes to appropriate Python types. The examples below illustrate that some conversions are feasible using DataFrame library APIs, while others may require pure Python approaches like `struct.unpack` (which sacrifice performance but maintain flexibility).

`Date` columns can arrive as `UINT16` (days since the Unix epoch, 1970-01-01). Converting inside the DataFrame is efficient and straightforward:

```python
# Polars
df = df.with_columns(pl.col("event_date").cast(pl.Date))

# Pandas
df["event_date"] = pd.to_datetime(df["event_date"], unit="D")
```

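Similarly, `IPv4` values arrive as Arrow `UINT32` integers. One possible pure-Python conversion using the standard library `ipaddress` module (a sketch, assuming a hypothetical `ip_col` column in a pandas DataFrame):

```python
import ipaddress

# Reinterpret each packed UINT32 value as a dotted-quad IPv4 string
df["ip_col"] = [str(ipaddress.ip_address(v)) for v in df["ip_col"].to_list()]
```
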
Columns like `Int128` can arrive as `FIXED_SIZE_BINARY` with raw bytes. Polars provides native support for 128-bit integers:

```python
# Polars - native support
df = df.with_columns(pl.col("data").bin.reinterpret(dtype=pl.Int128, endianness="little"))
```

As of NumPy 2.3 there is no public 128-bit integer dtype, so we must fall back to pure Python and can do something like:

```python
# Assuming we have a pandas DataFrame with an Int128 column of dtype fixed_size_binary[16][pyarrow]

print(df)
# Output:
#   str_col                                        int_128_col
# 0    num1  b'\x15}\xda\xeb\x18ZU\x0fn\x05\x01\x00\x00\x00...
# 1    num2  b'\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
# 2    num3  b'\x15\xdfp\x81r\x9f\x01\x00\x00\x00\x00\x00\x...

print([int.from_bytes(n, byteorder="little") for n in df["int_128_col"].to_list()])
# Output:
# [1234567898765432123456789, 8, 456789123456789]
```

The key takeaway: application code must handle these conversions based on the capabilities of the chosen DataFrame library and the acceptable performance trade-offs. When DataFrame-native conversions aren't available, pure Python approaches remain an option.

## Read formats {#read-formats}

Read formats control the data types of values returned from the client `query`, `query_np`, and `query_df` methods. (The `raw_query` and `query_arrow` methods do not modify incoming data from ClickHouse, so format control does not apply.) For example, if the read format for a UUID is changed from the default `native` format to the alternative `string` format, a ClickHouse query of a `UUID` column will be returned as string values (using the standard 8-4-4-4-12 RFC 4122 format) instead of Python UUID objects.
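
A sketch of switching the UUID read format globally, using the `set_read_format` helper from `clickhouse_connect.datatypes.format`:

```python
import clickhouse_connect
from clickhouse_connect.datatypes.format import set_read_format

client = clickhouse_connect.get_client()

# Return ClickHouse UUID columns as strings instead of Python UUID objects
set_read_format('UUID', 'string')

result = client.query("SELECT generateUUIDv4() AS u")
print(type(result.result_rows[0][0]))
# <class 'str'>
```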
