
Commit ca3605d

various minor improvements to integrations docs
1 parent 8526d25 commit ca3605d


7 files changed (+96, -67 lines changed)


docs/_snippets/_gather_your_details_http.mdx

Lines changed: 8 additions & 7 deletions
@@ -4,17 +4,18 @@ import Image from '@theme/IdealImage';
 
 To connect to ClickHouse with HTTP(S) you need this information:
 
-- The HOST and PORT: typically, the port is 8443 when using TLS or 8123 when not using TLS.
+| Parameter(s)              | Description |
+|---------------------------|-------------|
+|`HOST` and `PORT`          | Typically, the port is 8443 when using TLS or 8123 when not using TLS. |
+|`DATABASE NAME`            | Out of the box, there is a database named `default`, use the name of the database that you want to connect to.|
+|`USERNAME` and `PASSWORD`  | Out of the box, the username is `default`. Use the username appropriate for your use case. |
 
-- The DATABASE NAME: out of the box, there is a database named `default`, use the name of the database that you want to connect to.
-
-- The USERNAME and PASSWORD: out of the box, the username is `default`. Use the username appropriate for your use case.
-
-The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. Select the service that you will connect to and click **Connect**:
+The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console.
+Select a service and click **Connect**:
 
 <Image img={cloud_connect_button} size="md" alt="ClickHouse Cloud service connect button" border />
 
-Choose **HTTPS**, and the details are available in an example `curl` command.
+Choose **HTTPS**. Connection details are displayed in an example `curl` command.
 
 <Image img={connection_details_https} size="md" alt="ClickHouse Cloud HTTPS connection details" border/>
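The `curl` command displayed by the console follows this general shape. This is a minimal sketch against the HTTP interface; the hostname and password below are placeholders, not values from the commit:

```bash
# Placeholder HOST, PORT, USERNAME and PASSWORD: substitute the values
# shown for your own service in the Cloud console.
curl --user 'default:your_password' \
  --data-binary 'SELECT 1' \
  'https://your-instance.clickhouse.cloud:8443/?database=default'
```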

docs/_snippets/_gather_your_details_native.md

Lines changed: 8 additions & 7 deletions
@@ -4,13 +4,14 @@ import Image from '@theme/IdealImage';
 
 To connect to ClickHouse with native TCP you need this information:
 
-- The HOST and PORT: typically, the port is 9440 when using TLS, or 9000 when not using TLS.
-
-- The DATABASE NAME: out of the box there is a database named `default`, use the name of the database that you want to connect to.
-
-- The USERNAME and PASSWORD: out of the box the username is `default`. Use the username appropriate for your use case.
-
-The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. Select the service that you will connect to and click **Connect**:
+| Parameter(s)              | Description |
+|---------------------------|-------------|
+| `HOST` and `PORT`         | Typically, the port is 9440 when using TLS, or 9000 when not using TLS. |
+| `DATABASE NAME`           | Out of the box there is a database named `default`, use the name of the database that you want to connect to. |
+| `USERNAME` and `PASSWORD` | Out of the box the username is `default`. Use the username appropriate for your use case. |
+
+The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console.
+Select the service that you will connect to and click **Connect**:
 
 <Image img={cloud_connect_button} size="md" alt="ClickHouse Cloud service connect button" border/>
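For the native protocol, an equivalent connectivity check with `clickhouse-client` looks roughly like this; the host and credentials are placeholders, and port 9440 implies TLS, so `--secure` is passed:

```bash
# Placeholder values: substitute the HOST, PORT, USERNAME and PASSWORD
# for your own service.
clickhouse-client --host your-instance.clickhouse.cloud \
  --port 9440 --secure \
  --user default --password 'your_password' \
  --query 'SELECT 1'
```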

docs/integrations/data-ingestion/etl-tools/airbyte-and-clickhouse.md

Lines changed: 8 additions & 4 deletions
@@ -30,7 +30,9 @@ Please note that the Airbyte source and destination for ClickHouse are currently
 
 <a href="https://www.airbyte.com/" target="_blank">Airbyte</a> is an open-source data integration platform. It allows the creation of <a href="https://airbyte.com/blog/why-the-future-of-etl-is-not-elt-but-el" target="_blank">ELT</a> data pipelines and is shipped with more than 140 out-of-the-box connectors. This step-by-step tutorial shows how to connect Airbyte to ClickHouse as a destination and load a sample dataset.
 
-## 1. Download and run Airbyte {#1-download-and-run-airbyte}
+<VerticalStepper headerLevel="h2">
+
+## Download and run Airbyte {#1-download-and-run-airbyte}
 
 1. Airbyte runs on Docker and uses `docker-compose`. Make sure to download and install the latest versions of Docker.
 
@@ -50,7 +52,7 @@ Please note that the Airbyte source and destination for ClickHouse are currently
 Alternatively, you can sign up and use <a href="https://docs.airbyte.com/deploying-airbyte/on-cloud" target="_blank">Airbyte Cloud</a>
 :::
 
-## 2. Add ClickHouse as a destination {#2-add-clickhouse-as-a-destination}
+## Add ClickHouse as a destination {#2-add-clickhouse-as-a-destination}
 
 In this section, we will show how to add a ClickHouse instance as a destination.
 
@@ -80,7 +82,7 @@ GRANT CREATE ON * TO my_airbyte_user;
 ```
 :::
 
-## 3. Add a dataset as a source {#3-add-a-dataset-as-a-source}
+## Add a dataset as a source {#3-add-a-dataset-as-a-source}
 
 The example dataset we will use is the <a href="https://clickhouse.com/docs/getting-started/example-datasets/nyc-taxi/" target="_blank">New York City Taxi Data</a> (on <a href="https://github.com/toddwschneider/nyc-taxi-data" target="_blank">Github</a>). For this tutorial, we will use a subset of this dataset which corresponds to the month of Jan 2022.
 
@@ -98,7 +100,7 @@ The example dataset we will use is the <a href="https://clickhouse.com/docs/gett
 
 3. Congratulations! You have now added a source file in Airbyte.
 
-## 4. Create a connection and load the dataset into ClickHouse {#4-create-a-connection-and-load-the-dataset-into-clickhouse}
+## Create a connection and load the dataset into ClickHouse {#4-create-a-connection-and-load-the-dataset-into-clickhouse}
 
 1. Within Airbyte, select the "Connections" page and add a new connection
 
@@ -170,3 +172,5 @@ The example dataset we will use is the <a href="https://clickhouse.com/docs/gett
 Now that the dataset is loaded on your ClickHouse instance, you can create a new table and use more suitable ClickHouse data types (<a href="https://clickhouse.com/docs/getting-started/example-datasets/nyc-taxi/" target="_blank">more details</a>).
 
 8. Congratulations - you have successfully loaded the NYC taxi data into ClickHouse using Airbyte!
+
+</VerticalStepper>
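The third hunk above ends at the `GRANT CREATE ON * TO my_airbyte_user;` statement from the destination step. As a rough sketch, that user could be created and granted from a shell before configuring the Airbyte destination; the host, port, and password here are illustrative:

```bash
# Illustrative only: host, port and password are placeholders.
clickhouse-client --host localhost --port 9000 --multiquery --query "
  CREATE USER IF NOT EXISTS my_airbyte_user IDENTIFIED BY 'my_password';
  GRANT CREATE ON * TO my_airbyte_user;
"
```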

docs/integrations/data-ingestion/etl-tools/dlt-and-clickhouse.md

Lines changed: 8 additions & 5 deletions
@@ -24,7 +24,9 @@ pip install "dlt[clickhouse]"
 
 ## Setup guide {#setup-guide}
 
-### 1. Initialize the dlt Project {#1-initialize-the-dlt-project}
+<VerticalStepper headerLevel="h3">
+
+### Initialize the dlt Project {#1-initialize-the-dlt-project}
 
 Start by initializing a new `dlt` project as follows:
 ```bash
@@ -42,7 +44,7 @@ pip install -r requirements.txt
 
 or with `pip install dlt[clickhouse]`, which installs the `dlt` library and the necessary dependencies for working with ClickHouse as a destination.
 
-### 2. Setup ClickHouse Database {#2-setup-clickhouse-database}
+### Setup ClickHouse Database {#2-setup-clickhouse-database}
 
 To load data into ClickHouse, you need to create a ClickHouse database. Here's a rough outline of what you should do:
 
@@ -60,7 +62,7 @@ GRANT SELECT ON INFORMATION_SCHEMA.COLUMNS TO dlt;
 GRANT CREATE TEMPORARY TABLE, S3 ON *.* TO dlt;
 ```
 
-### 3. Add credentials {#3-add-credentials}
+### Add credentials {#3-add-credentials}
 
 Next, set up the ClickHouse credentials in the `.dlt/secrets.toml` file as shown below:
 
@@ -78,8 +80,7 @@ secure = 1 # Set to 1 if using HTTPS, else 0.
 dataset_table_separator = "___" # Separator for dataset table names from dataset.
 ```
 
-:::note
-HTTP_PORT
+:::note HTTP_PORT
 The `http_port` parameter specifies the port number to use when connecting to the ClickHouse server's HTTP interface. This is different from the default port 9000, which is used for the native TCP protocol.
 
 You must set `http_port` if you are not using external staging (i.e. you don't set the staging parameter in your pipeline). This is because the built-in ClickHouse local storage staging uses the <a href="https://github.com/ClickHouse/clickhouse-connect">clickhouse-connect</a> library, which communicates with ClickHouse over HTTP.
@@ -94,6 +95,8 @@ You can pass a database connection string similar to the one used by the `clickh
 destination.clickhouse.credentials="clickhouse://dlt:Dlt*12345789234567@localhost:9000/dlt?secure=1"
 ```
 
+</VerticalStepper>
+
 ## Write disposition {#write-disposition}
 
 All [write dispositions](https://dlthub.com/docs/general-usage/incremental-loading#choosing-a-write-disposition)
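For the credentials step referenced in these hunks, the `.dlt/secrets.toml` file can be written from a shell along the following lines. This is a sketch with placeholder host, password, and ports, using the credential keys the dlt ClickHouse destination documents:

```bash
# Placeholder values: adapt host, password and ports to your own service.
mkdir -p .dlt
cat > .dlt/secrets.toml <<'EOF'
[destination.clickhouse.credentials]
database = "dlt"
username = "dlt"
password = "your_password"
host = "localhost"
port = 9440       # native TCP port
http_port = 8443  # HTTP interface port; required without external staging
secure = 1        # Set to 1 if using HTTPS, else 0.
EOF
```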

docs/integrations/data-ingestion/etl-tools/nifi-and-clickhouse.md

Lines changed: 12 additions & 7 deletions
@@ -33,20 +33,23 @@ import CommunityMaintainedBadge from '@theme/badges/CommunityMaintained';
 
 <a href="https://nifi.apache.org/" target="_blank">Apache NiFi</a> is an open-source workflow management software designed to automate data flow between software systems. It allows the creation of ETL data pipelines and is shipped with more than 300 data processors. This step-by-step tutorial shows how to connect Apache NiFi to ClickHouse as both a source and destination, and to load a sample dataset.
 
-## 1. Gather your connection details {#1-gather-your-connection-details}
+<VerticalStepper headerLevel="h2">
+
+## Gather your connection details {#1-gather-your-connection-details}
+
 <ConnectionDetails />
 
-## 2. Download and run Apache NiFi {#2-download-and-run-apache-nifi}
+## Download and run Apache NiFi {#2-download-and-run-apache-nifi}
 
-1. For a new setup, download the binary from https://nifi.apache.org/download.html and start by running `./bin/nifi.sh start`
+For a new setup, download the binary from https://nifi.apache.org/download.html and start by running `./bin/nifi.sh start`
 
-## 3. Download the ClickHouse JDBC driver {#3-download-the-clickhouse-jdbc-driver}
+## Download the ClickHouse JDBC driver {#3-download-the-clickhouse-jdbc-driver}
 
 1. Visit the <a href="https://github.com/ClickHouse/clickhouse-java/releases" target="_blank">ClickHouse JDBC driver release page</a> on GitHub and look for the latest JDBC release version
 2. In the release version, click on "Show all xx assets" and look for the JAR file containing the keyword "shaded" or "all", for example, `clickhouse-jdbc-0.5.0-all.jar`
 3. Place the JAR file in a folder accessible by Apache NiFi and take note of the absolute path
 
-## 4. Add `DBCPConnectionPool` Controller Service and configure its properties {#4-add-dbcpconnectionpool-controller-service-and-configure-its-properties}
+## Add `DBCPConnectionPool` Controller Service and configure its properties {#4-add-dbcpconnectionpool-controller-service-and-configure-its-properties}
 
 1. To configure a Controller Service in Apache NiFi, visit the NiFi Flow Configuration page by clicking on the "gear" button
 
@@ -90,7 +93,7 @@ import CommunityMaintainedBadge from '@theme/badges/CommunityMaintained';
 
 <Image img={nifi08} size="lg" border alt="Controller Services list showing enabled ClickHouse JDBC service" />
 
-## 5. Read from a table using the `ExecuteSQL` processor {#5-read-from-a-table-using-the-executesql-processor}
+## Read from a table using the `ExecuteSQL` processor {#5-read-from-a-table-using-the-executesql-processor}
 
 1. Add an `ExecuteSQL` processor, along with the appropriate upstream and downstream processors
 
@@ -115,7 +118,7 @@ import CommunityMaintainedBadge from '@theme/badges/CommunityMaintained';
 
 <Image img={nifi12} size="lg" border alt="FlowFile content viewer showing query results in formatted view" />
 
-## 6. Write to a table using `MergeRecord` and `PutDatabaseRecord` processor {#6-write-to-a-table-using-mergerecord-and-putdatabaserecord-processor}
+## Write to a table using `MergeRecord` and `PutDatabaseRecord` processor {#6-write-to-a-table-using-mergerecord-and-putdatabaserecord-processor}
 
 1. To write multiple rows in a single insert, we first need to merge multiple records into a single record. This can be done using the `MergeRecord` processor
 
@@ -153,3 +156,5 @@ import CommunityMaintainedBadge from '@theme/badges/CommunityMaintained';
 <Image img={nifi15} size="sm" border alt="Query results showing row count in the destination table" />
 
 5. Congratulations - you have successfully loaded your data into ClickHouse using Apache NiFi!
+
+</VerticalStepper>
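For the JDBC driver and `DBCPConnectionPool` steps, a rough sketch of staging the shaded JAR and the kind of values the controller service then needs. The driver version, paths, and host are placeholders rather than values from the commit:

```bash
# Illustrative only: version, paths and HOST are placeholders.
# After downloading clickhouse-jdbc-0.5.0-all.jar from the releases page:
mkdir -p /opt/nifi/extra-jars
mv ~/Downloads/clickhouse-jdbc-0.5.0-all.jar /opt/nifi/extra-jars/

# Values then entered in the DBCPConnectionPool controller service:
#   Database Connection URL     : jdbc:clickhouse:https://HOST:8443/default
#   Database Driver Class Name  : com.clickhouse.jdbc.ClickHouseDriver
#   Database Driver Location(s) : /opt/nifi/extra-jars/clickhouse-jdbc-0.5.0-all.jar
```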

docs/integrations/data-ingestion/etl-tools/vector-to-clickhouse.md

Lines changed: 45 additions & 33 deletions
@@ -17,46 +17,56 @@ import PartnerBadge from '@theme/badges/PartnerBadge';
 
 <PartnerBadge/>
 
-Being able to analyze your logs in real time is critical for production applications. Have you ever wondered if ClickHouse is good at storing and analyzing log data? Just checkout <a href="https://eng.uber.com/logging/" target="_blank">Uber's experience</a> with converting their logging infrastructure from ELK to ClickHouse.
+Being able to analyze your logs in real time is critical for production applications.
+Have you ever wondered if ClickHouse is good at storing and analyzing log data?
+Just check out [Uber's experience](https://eng.uber.com/logging/) with converting their logging infrastructure from ELK to ClickHouse.
 
-This guide shows how to use the popular data pipeline <a href="https://vector.dev/docs/about/what-is-vector/" target="_blank">Vector</a> to tail an Nginx log file and send it to ClickHouse. The steps below would be similar for tailing any type of log file. We will assume you already have ClickHouse up and running and Vector installed (no need to start it yet though).
+This guide shows you how to use the popular data pipeline [Vector](https://vector.dev/docs/about/what-is-vector/) to tail an Nginx log file and send it to ClickHouse.
+The steps below would be similar for tailing any type of log file.
+We will assume you already have ClickHouse up and running and Vector installed (no need to start it yet though).
 
-## 1. Create a database and table {#1-create-a-database-and-table}
+<VerticalStepper headerLevel="h2">
 
-Let's define a table to store the log events:
+## Create a database and table {#1-create-a-database-and-table}
 
-1. We will start with a new database named `nginxdb`:
-    ```sql
-    CREATE DATABASE IF NOT EXISTS nginxdb
-    ```
+Define a table to store the log events:
 
-2. For starters, we are just going to insert the entire log event as a single string. Obviously this is not a great format for performing analytics on the log data, but we will figure that part out below using ***materialized views***.
-    ```sql
-    CREATE TABLE IF NOT EXISTS nginxdb.access_logs (
-        message String
-    )
-    ENGINE = MergeTree()
-    ORDER BY tuple()
-    ```
-    :::note
-    There is not really a need for a primary key yet, so that is why **ORDER BY** is set to **tuple()**.
-    :::
+1. Begin with a new database named `nginxdb`:
 
-## 2. Configure Nginx {#2--configure-nginx}
+    ```sql
+    CREATE DATABASE IF NOT EXISTS nginxdb
+    ```
+
+2. Insert the entire log event as a single string. Obviously this is not a great format for performing analytics on the log data, but we will figure that part out below using ***materialized views***.
+
+    ```sql
+    CREATE TABLE IF NOT EXISTS nginxdb.access_logs (
+        message String
+    )
+    ENGINE = MergeTree()
+    ORDER BY tuple()
+    ```
+
+    :::note
+    **ORDER BY** is set to **tuple()** (an empty tuple) as there is no need for a primary key yet.
+    :::
+
+## Configure Nginx {#2--configure-nginx}
 
 We certainly do not want to spend too much time explaining Nginx, but we also do not want to hide all the details, so in this step we will provide you with enough details to get Nginx logging configured.
 
 1. The following `access_log` property sends logs to `/var/log/nginx/my_access.log` in the **combined** format. This value goes in the `http` section of your `nginx.conf` file:
-    ```bash
-    http {
-        include /etc/nginx/mime.types;
-        default_type application/octet-stream;
-        access_log /var/log/nginx/my_access.log combined;
-        sendfile on;
-        keepalive_timeout 65;
-        include /etc/nginx/conf.d/*.conf;
-    }
-    ```
+
+    ```bash
+    http {
+        include /etc/nginx/mime.types;
+        default_type application/octet-stream;
+        access_log /var/log/nginx/my_access.log combined;
+        sendfile on;
+        keepalive_timeout 65;
+        include /etc/nginx/conf.d/*.conf;
+    }
+    ```
 
 2. Be sure to restart Nginx if you had to modify `nginx.conf`.
 
@@ -67,7 +77,7 @@ We certainly do not want to spend too much time explaining Nginx, but we also do
 192.168.208.1 - - [12/Oct/2021:03:31:49 +0000] "GET / HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
 ```
 
-## 3. Configure Vector {#3-configure-vector}
+## Configure Vector {#3-configure-vector}
 
 Vector collects, transforms and routes logs, metrics, and traces (referred to as **sources**) to lots of different vendors (referred to as **sinks**), including out-of-the-box compatibility with ClickHouse. Sources and sinks are defined in a configuration file named **vector.toml**.
 
@@ -95,7 +105,7 @@ Vector collects, transforms and routes logs, metrics, and traces (referred to as
 ```
 <Image img={vector01} size="lg" border alt="View ClickHouse logs in table format" />
 
-## 4. Parse the Logs {#4-parse-the-logs}
+## Parse the Logs {#4-parse-the-logs}
 
 Having the logs in ClickHouse is great, but storing each event as a single string does not allow for much data analysis. Let's see how to parse the log events using a materialized view.
 
@@ -180,4 +190,6 @@ Having the logs in ClickHouse is great, but storing each event as a single strin
 The lesson above stored the data in two tables, but you could change the initial `nginxdb.access_logs` table to use the **Null** table engine - the parsed data will still end up in the `nginxdb.access_logs_view` table, but the raw data will not be stored in a table.
 :::
 
-**Summary:** By using Vector, which only required a simple install and quick configuration, we can send logs from an Nginx server to a table in ClickHouse. By using a clever materialized view, we can parse those logs into columns for easier analytics.
+</VerticalStepper>
+
+> By using Vector, which only requires a simple install and quick configuration, you can send logs from an Nginx server to a table in ClickHouse. By using a materialized view, you can parse those logs into columns for easier analytics.
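The `vector.toml` discussed in the Configure Vector step is unchanged by this commit. As a sketch under the tutorial's assumptions (a local Nginx log, the `nginxdb.access_logs` table created above, and a placeholder ClickHouse endpoint), a minimal configuration could be generated and started like this:

```bash
# Placeholder endpoint: point it at your own ClickHouse HTTP interface.
cat > vector.toml <<'EOF'
[sources.nginx_logs]
type = "file"
include = ["/var/log/nginx/my_access.log"]
read_from = "beginning"

[sinks.clickhouse]
type = "clickhouse"
inputs = ["nginx_logs"]
endpoint = "http://localhost:8123"
database = "nginxdb"
table = "access_logs"
skip_unknown_fields = true
EOF

vector --config ./vector.toml
```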

docs/integrations/data-ingestion/google-dataflow/dataflow.md

Lines changed: 7 additions & 4 deletions
@@ -15,18 +15,21 @@ import ClickHouseSupportedBadge from '@theme/badges/ClickHouseSupported';
 
 [Google Dataflow](https://cloud.google.com/dataflow) is a fully managed stream and batch data processing service. It supports pipelines written in Java or Python and is built on the Apache Beam SDK.
 
-There are two main ways to use Google Dataflow with ClickHouse, both are leveraging [`ClickHouseIO Apache Beam connector`](/integrations/apache-beam):
+There are two main ways to use Google Dataflow with ClickHouse, both of which leverage the [`ClickHouseIO` Apache Beam connector](/integrations/apache-beam).
+These are:
+- [Java runner](#1-java-runner)
+- [Predefined templates](#2-predefined-templates)
 
-## 1. Java runner {#1-java-runner}
-The [Java Runner](./java-runner) allows users to implement custom Dataflow pipelines using the Apache Beam SDK `ClickHouseIO` integration. This approach provides full flexibility and control over the pipeline logic, enabling users to tailor the ETL process to specific requirements.
+## Java runner {#1-java-runner}
+The [Java runner](./java-runner) allows users to implement custom Dataflow pipelines using the Apache Beam SDK `ClickHouseIO` integration. This approach provides full flexibility and control over the pipeline logic, enabling users to tailor the ETL process to specific requirements.
 However, this option requires knowledge of Java programming and familiarity with the Apache Beam framework.
 
 ### Key features {#key-features}
 - High degree of customization.
 - Ideal for complex or advanced use cases.
 - Requires coding and understanding of the Beam API.
 
-## 2. Predefined templates {#2-predefined-templates}
+## Predefined templates {#2-predefined-templates}
 ClickHouse offers [predefined templates](./templates) designed for specific use cases, such as importing data from BigQuery into ClickHouse. These templates are ready-to-use and simplify the integration process, making them an excellent choice for users who prefer a no-code solution.
 
 ### Key features {#key-features-1}
