ClickHouse Connect provides specialized insert methods for common data formats:

- `insert_df` -- Insert a Pandas DataFrame. Instead of a Python sequence-of-sequences `data` argument, the second parameter of this method requires a `df` argument that must be a Pandas DataFrame instance. ClickHouse Connect automatically processes the DataFrame as a column-oriented data source, so the `column_oriented` parameter is not required or available.
- `insert_arrow` -- Insert a PyArrow Table. ClickHouse Connect passes the Arrow table unmodified to the ClickHouse server for processing, so only the `database` and `settings` arguments are available in addition to `table` and `arrow_table`.
- `insert_df_arrow` -- Insert an Arrow-backed Pandas DataFrame or a Polars DataFrame. ClickHouse Connect automatically determines whether the DataFrame is a Pandas or Polars type. For Pandas, each column's dtype backend is validated to be Arrow-based, and an error is raised if any are not.

:::note
A NumPy array is a valid Sequence of Sequences and can be used as the `data` argument to the main `insert` method, so a specialized method is not required.
:::
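To make the note concrete, here is a local sketch (no server involved; the commented `insert` call assumes a hypothetical `users` table):

```python
import numpy as np

# a 2-D NumPy array of objects behaves as a sequence of sequences
arr = np.array([[1, "Alice"], [2, "Bob"]], dtype=object)
rows = [list(row) for row in arr]
print(rows)
# Output:
# [[1, 'Alice'], [2, 'Bob']]

# so the array can be passed directly as the data argument:
# client.insert("users", arr, column_names=["id", "name"])
```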
For example, inserting an Arrow-backed Pandas DataFrame with `insert_df_arrow`:

```python
import pandas as pd
import clickhouse_connect

client = clickhouse_connect.get_client()

# Convert to Arrow-backed dtypes for better performance
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Joe"],
    "age": [25, 30, 28],
}).convert_dtypes(dtype_backend="pyarrow")

client.insert_df_arrow("users", df)
```

## File inserts {#file-inserts}
The `clickhouse_connect.driver.tools` package includes the `insert_file` method that allows inserting data directly from the file system into an existing ClickHouse table. Parsing is delegated to the ClickHouse server. `insert_file` accepts the following parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| client | Client | *Required* | The `driver.Client` used to perform the insert |
| table | str | *Required* | The ClickHouse table to insert into. The full table name (including database) is permitted. |
| file_path | str | *Required* | The native file system path to the data file |
| fmt | str | CSV, CSVWithNames | The ClickHouse Input Format of the file. CSVWithNames is assumed if `column_names` is not provided |
| column_names | Sequence of str | *None* | A list of column names in the data file. Not required for formats that include column names |
| database | str | *None* | Database of the table. Ignored if the table is fully qualified. If not specified, the insert will use the client database |
| settings | dict | *None* | See [settings description](driver-api.md#settings-argument). |
| compression | str | *None* | A recognized ClickHouse compression type (zstd, lz4, gzip) used for the Content-Encoding HTTP header |

For files with inconsistent data or date/time values in an unusual format, settings that apply to data imports (such as `input_format_allow_errors_num` and `input_format_allow_errors_ratio`) are recognized for this method.

```python
import clickhouse_connect
from clickhouse_connect.driver.tools import insert_file

client = clickhouse_connect.get_client()

# table name and file path are illustrative
insert_file(client, 'example_table', 'example.csv')
```

Note that `QueryContext`s are not thread safe, but a copy can be obtained in a multi-threaded environment.

## Streaming queries {#streaming-queries}
The ClickHouse Connect Client provides multiple methods for retrieving data as a stream (implemented as a Python generator):

- `query_column_block_stream` -- Returns query data in blocks as a sequence of columns using native Python objects
- `query_row_block_stream` -- Returns query data in blocks as a sequence of rows using native Python objects
- `query_rows_stream` -- Returns query data as a sequence of rows using native Python objects
- `query_np_stream` -- Returns each ClickHouse block of query data as a NumPy array
- `query_df_stream` -- Returns each ClickHouse block of query data as a Pandas DataFrame
- `query_arrow_stream` -- Returns query data as PyArrow RecordBatches
- `query_df_arrow_stream` -- Returns each ClickHouse block of query data as an Arrow-backed Pandas DataFrame or a Polars DataFrame, depending on the `dataframe_library` keyword argument (default is `"pandas"`)

Each of these methods returns a `StreamContext` object that must be opened via a `with` statement to start consuming the stream.

### Data blocks {#data-blocks}
ClickHouse Connect processes all data from the primary `query` method as a stream of blocks received from the ClickHouse server. These blocks are transmitted in the custom "Native" format to and from ClickHouse. A "block" is simply a sequence of columns of binary data, where each column contains an equal number of data values of the specified data type. (As a columnar database, ClickHouse stores this data in a similar form.) The size of a block returned from a query is governed by two user settings that can be set at several levels (user profile, user, session, or query).

The `query_np_stream` method returns each block as a two-dimensional NumPy array.

The `query_df_stream` method returns each ClickHouse Block as a two-dimensional Pandas DataFrame. Here's an example which shows that the `StreamContext` object can be used as a context in a deferred fashion (but only once).
```python
df_stream = client.query_df_stream('SELECT * FROM hits')
column_names = df_stream.source.column_names
with df_stream:
    for df in df_stream:
        <do something with the pandas DataFrame>
```
The `query_df_arrow_stream` method returns each ClickHouse Block as a DataFrame with PyArrow dtype backend. This method supports both Pandas (2.x or later) and Polars DataFrames via the `dataframe_library` parameter (defaults to `"pandas"`). Each iteration yields a DataFrame converted from PyArrow record batches, providing better performance and memory efficiency for certain data types.
Finally, the `query_arrow_stream` method returns a ClickHouse `ArrowStream` formatted result as a `pyarrow.ipc.RecordBatchStreamReader` wrapped in a `StreamContext`. Each iteration of the stream returns a PyArrow `RecordBatch`.
### Streaming examples {#streaming-examples}
#### Stream rows {#stream-rows}
```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream large result sets row by row
with client.query_rows_stream("SELECT number, number * 2 AS doubled FROM system.numbers LIMIT 100000") as stream:
    for row in stream:
        print(row)  # Process each row
# Output:
# (0, 0)
# (1, 2)
# (2, 4)
# ....
```

#### Stream row blocks {#stream-row-blocks}
```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream in blocks of rows (more efficient than row-by-row)
with client.query_row_block_stream("SELECT number, number * 2 FROM system.numbers LIMIT 100000") as stream:
    for block in stream:
        # Each block is a sequence of rows
        print(f"Received block with {len(block)} rows")
```

#### Stream DataFrames {#stream-dataframes}

```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream query results as Pandas DataFrame blocks
with client.query_df_stream("SELECT number, toString(number) AS str FROM system.numbers LIMIT 100000") as stream:
    for df in stream:
        # Process each DataFrame block
        print(f"Received DataFrame with {len(df)} rows")
        print(df.head(3))
# Output:
# Received DataFrame with 65409 rows
#    number str
# 0       0   0
# 1       1   1
# 2       2   2
# Received DataFrame with 34591 rows
#    number    str
# 0   65409  65409
# 1   65410  65410
# 2   65411  65411
```

#### Stream Arrow batches {#stream-arrow-batches}
```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Stream query results as Arrow record batches
with client.query_arrow_stream("SELECT * FROM large_table") as stream:
    for arrow_batch in stream:
        # Process each Arrow batch
        print(f"Received Arrow batch with {arrow_batch.num_rows} rows")
# Output:
# Received Arrow batch with 65409 rows
# Received Arrow batch with 34591 rows
```

## NumPy, Pandas, and Arrow queries {#numpy-pandas-and-arrow-queries}
ClickHouse Connect provides specialized query methods for working with NumPy, Pandas, and Arrow data structures. These methods allow you to retrieve query results directly in these popular data formats without manual conversion.
### NumPy queries {#numpy-queries}
The `query_np` method returns query results as a NumPy array instead of a ClickHouse Connect `QueryResult`.
```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Query returns a NumPy array
np_array = client.query_np("SELECT number, number * 2 AS doubled FROM system.numbers LIMIT 5")

print(type(np_array))
# Output:
# <class 'numpy.ndarray'>

print(np_array)
# Output:
# [[0 0]
#  [1 2]
#  [2 4]
#  [3 6]
#  [4 8]]
```

### Pandas queries {#pandas-queries}
209
+
210
+
The `query_df` method returns query results as a Pandas DataFrame instead of a ClickHouse Connect `QueryResult`.
```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Query returns a Pandas DataFrame
df = client.query_df("SELECT number, number * 2 AS doubled FROM system.numbers LIMIT 5")

print(type(df))
# Output:
# <class 'pandas.core.frame.DataFrame'>

print(df)
# Output:
#    number  doubled
# 0       0        0
# 1       1        2
# 2       2        4
# 3       3        6
# 4       4        8
```

### PyArrow queries {#pyarrow-queries}
233
+
234
+
The `query_arrow` method returns query results as a PyArrow Table. It uses the ClickHouse `Arrow` output format directly, so it accepts only three arguments in common with the main `query` method: `query`, `parameters`, and `settings`. It also takes a `use_strings` argument, which determines whether the Arrow Table renders ClickHouse String types as strings (if `True`) or bytes (if `False`).
```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Query returns a PyArrow Table
arrow_table = client.query_arrow("SELECT number, toString(number) AS str FROM system.numbers LIMIT 3")

print(arrow_table)
```

### Arrow-backed DataFrame queries {#arrow-backed-dataframe-queries}

ClickHouse Connect supports fast, memory-efficient DataFrame creation from Arrow results via the `query_df_arrow` and `query_df_arrow_stream` methods. These are thin wrappers around the Arrow query methods and perform zero-copy conversions to DataFrames where possible:
- `query_df_arrow`: Executes the query using the ClickHouse `Arrow` output format and returns a DataFrame.
  - For `dataframe_library='pandas'`, returns a pandas 2.x DataFrame using Arrow-backed dtypes (`pd.ArrowDtype`). This requires pandas 2.x and leverages zero-copy buffers where possible for excellent performance and low memory overhead.
  - For `dataframe_library='polars'`, returns a Polars DataFrame created from the Arrow table (`pl.from_arrow`), which is similarly efficient and can be zero-copy depending on the data.
- `query_df_arrow_stream`: Streams results as a sequence of DataFrames (pandas 2.x or Polars) converted from Arrow stream batches.

#### Query to Arrow-backed DataFrame {#query-to-arrow-backed-dataframe}
```python
import clickhouse_connect

client = clickhouse_connect.get_client()

# Query returns a Pandas DataFrame with Arrow dtypes (requires pandas 2.x)
df = client.query_df_arrow(
    "SELECT number, toString(number) AS str FROM system.numbers LIMIT 3",
    dataframe_library="pandas"
)

print(df.dtypes)
# Output:
# number    uint64[pyarrow]
# str       string[pyarrow]
# dtype: object

# Or use Polars
polars_df = client.query_df_arrow(
    "SELECT number, toString(number) AS str FROM system.numbers LIMIT 3",
    dataframe_library="polars"
)
print(polars_df.dtypes)
# Output:
# [UInt64, String]

# Streaming into batches of DataFrames (Polars shown)
with client.query_df_arrow_stream(
    "SELECT number, toString(number) AS str FROM system.numbers LIMIT 100000",
    dataframe_library="polars"
) as stream:
    for df_batch in stream:
        print(f"Received {type(df_batch)} batch with {len(df_batch)} rows and dtypes: {df_batch.dtypes}")
# Output:
# Received <class 'polars.dataframe.frame.DataFrame'> batch with 65409 rows and dtypes: [UInt64, String]
# Received <class 'polars.dataframe.frame.DataFrame'> batch with 34591 rows and dtypes: [UInt64, String]
```

#### Notes and caveats {#notes-and-caveats}
- Arrow type mapping: When returning data in Arrow format, ClickHouse maps types to the closest supported Arrow types. Some ClickHouse types do not have a native Arrow equivalent and are returned as raw bytes in Arrow fields (usually `BINARY` or `FIXED_SIZE_BINARY`).
  - Examples: `IPv4` is represented as Arrow `UINT32`; `IPv6` and large integers (`Int128`/`UInt128`/`Int256`/`UInt256`) are often represented as `FIXED_SIZE_BINARY`/`BINARY` with raw bytes.
  - In these cases, the DataFrame column will contain byte values backed by the Arrow field; it is up to the client code to interpret and convert those bytes according to ClickHouse semantics.
  - Unsupported Arrow data types (e.g., UUID/ENUM as true Arrow types) are not emitted; values are represented using the closest supported Arrow type (often as binary bytes) for output.
- Pandas requirement: Arrow-backed dtypes require pandas 2.x. For older pandas versions, use `query_df` (non-Arrow) instead.
- Strings vs binary: The `use_strings` option (backed by the server setting `output_format_arrow_string_as_string`) controls whether ClickHouse `String` columns are returned as Arrow strings or as binary.

#### Mismatched ClickHouse/Arrow type conversion examples {#mismatched-clickhousearrow-type-conversion-examples}
When ClickHouse returns columns as raw binary data (e.g., `FIXED_SIZE_BINARY` or `BINARY`), it is the responsibility of application code to convert these bytes to appropriate Python types. The examples below illustrate that some conversions are feasible using DataFrame library APIs, while others may require pure Python approaches like `struct.unpack` (which sacrifice performance but maintain flexibility).
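For instance, an `IPv4` column delivered as Arrow `UINT32` can be decoded with the standard library `ipaddress` module (a sketch with illustrative integer values in place of an actual query result):

```python
import ipaddress

# raw unsigned 32-bit integers, as they might appear in a UINT32 Arrow column
raw_values = [3232235777, 167772161]
ips = [str(ipaddress.IPv4Address(v)) for v in raw_values]
print(ips)
# Output:
# ['192.168.1.1', '10.0.0.1']
```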
`Date` columns can arrive as `UINT16` (days since the Unix epoch, 1970-01-01). Converting inside the DataFrame is efficient and straightforward.

Large integers such as `Int128`, by contrast, arrive as raw little-endian bytes (`FIXED_SIZE_BINARY`), so a pure Python conversion with `int.from_bytes` is required:

```python
# df is an Arrow-backed DataFrame whose "int_128_col" column holds raw little-endian bytes
print([int.from_bytes(n, byteorder="little") for n in df["int_128_col"].to_list()])
# Output:
# [1234567898765432123456789, 8, 456789123456789]
```

The key takeaway: application code must handle these conversions based on the capabilities of the chosen DataFrame library and the acceptable performance trade-offs. When DataFrame-native conversions aren't available, pure Python approaches remain an option.
## Read formats {#read-formats}
Read formats control the data types of values returned from the client `query`, `query_np`, and `query_df` methods. (The `raw_query` and `query_arrow` methods do not modify incoming data from ClickHouse, so format control does not apply.) For example, if the read format for a UUID is changed from the default `native` format to the alternative `string` format, a ClickHouse query of a `UUID` column will return string values (using the standard 8-4-4-4-12 RFC 4122 format) instead of Python UUID objects.