Commit c28761a

Merge pull request #4646 from ClickHouse/Blargian-patch-443620
Improvement: update guide to discuss inserting data from the command line
2 parents e62f244 + f5242d5 commit c28761a


docs/guides/inserting-data.md

Lines changed: 136 additions & 31 deletions
@@ -11,34 +11,6 @@ doc_type: 'guide'
import postgres_inserts from '@site/static/images/guides/postgres-inserts.png';
import Image from '@theme/IdealImage';

## Inserting into ClickHouse vs. OLTP databases {#inserting-into-clickhouse-vs-oltp-databases}

As an OLAP (Online Analytical Processing) database, ClickHouse is optimized for high performance and scalability, allowing potentially millions of rows to be inserted per second.
@@ -143,16 +115,149 @@ The native protocol does allow query progress to be easily tracked.

See [HTTP Interface](/interfaces/http) for further details.
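
For illustration, here is a minimal sketch of an insert over the HTTP interface using `curl` (this assumes a server listening on the default HTTP port 8123, the default user with no password, and the `helloworld.my_first_table` table from the basic example below):

```bash
# The INSERT statement is passed URL-encoded in the `query` parameter;
# the rows to insert are streamed as the POST body in the chosen format (CSV here).
echo '103,"Inserted over HTTP","2024-11-13 21:00:00",0.577' | \
  curl 'http://localhost:8123/?query=INSERT%20INTO%20helloworld.my_first_table%20FORMAT%20CSV' --data-binary @-
```

Because this is plain HTTP, the same pattern works from any environment where `curl` (or any other HTTP client) is available.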

## Basic example {#basic-example}

You can use the familiar `INSERT INTO TABLE` command with ClickHouse. Let's insert some data into the table that we created in the start guide ["Creating Tables in ClickHouse"](./creating-tables).

```sql
INSERT INTO helloworld.my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!', now(), -1.0),
    (102, 'Insert a lot of rows per batch', yesterday(), 1.41421),
    (102, 'Sort your data based on your commonly-used queries', today(), 2.718),
    (101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159)
```

To verify that this worked, we'll run the following `SELECT` query:

```sql
SELECT * FROM helloworld.my_first_table
```

Which returns:

```response
user_id  message                                              timestamp            metric
    101  Hello, ClickHouse!                                   2024-11-13 20:01:22       -1
    101  Granules are the smallest chunks of data read        2024-11-13 20:01:27  3.14159
    102  Insert a lot of rows per batch                       2024-11-12 00:00:00  1.41421
    102  Sort your data based on your commonly-used queries   2024-11-13 00:00:00    2.718
```
## Loading data from Postgres {#loading-data-from-postgres}

For loading data from Postgres, users can use:

- `ClickPipes`, an ETL tool specifically designed for PostgreSQL database replication. This is available in two forms:
  - ClickHouse Cloud - through our [managed ingestion service](/integrations/clickpipes/postgres).
  - Self-managed - via the [PeerDB open-source project](https://github.com/PeerDB-io/peerdb).
- The [PostgreSQL table engine](/integrations/postgresql#using-the-postgresql-table-engine), to read data directly as shown in previous examples. This is typically appropriate if batch replication based on a known watermark, e.g. a timestamp, is sufficient, or if it's a one-off migration. This approach can scale to tens of millions of rows; a minimal sketch is shown after this list. Users looking to migrate larger datasets should consider multiple requests, each dealing with a chunk of the data. Staging tables can be used for each chunk prior to its partitions being moved to a final table, which allows failed requests to be retried. For further details on this bulk-loading strategy, see here.
- Data can be exported from PostgreSQL in CSV format and then inserted into ClickHouse either from local files or via object storage using table functions (see the sketch after this list).
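
As referenced in the list above, both approaches can be driven directly from `clickhouse-client`. Here is a minimal sketch of reading from Postgres with the `postgresql()` table function (an ad-hoc counterpart to the table engine) and of loading a CSV export from object storage with the `s3()` table function; the connection details, table names, and watermark column are hypothetical placeholders rather than part of this guide:

```bash
# Batch-replicate rows from Postgres, filtering on a known watermark column
# so the job can be re-run per chunk.
clickhouse-client --query "
  INSERT INTO my_target_table
  SELECT *
  FROM postgresql('postgres-host:5432', 'my_database', 'my_source_table', 'my_user', 'my_password')
  WHERE updated_at > '2024-11-01 00:00:00'
"

# Load a CSV export that was uploaded to object storage.
clickhouse-client --query "
  INSERT INTO my_target_table
  SELECT *
  FROM s3('https://my-bucket.s3.amazonaws.com/export/my_source_table.csv', 'CSV')
"
```

Both sketches assume the ClickHouse server can reach the source directly; for larger migrations, run one such statement per chunk into a staging table, as described above.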

:::note Need help inserting large datasets?
If you need help inserting large datasets or encounter any errors when importing data into ClickHouse Cloud, please contact us at support@clickhouse.com and we can assist.
:::

## Inserting data from the command line {#inserting-data-from-command-line}

**Prerequisites**
- You have [installed](/install) ClickHouse
- `clickhouse-server` is running
- You have access to a terminal with `wget`, `zcat`, and `curl`

In this example you'll see how to insert a CSV file into ClickHouse from the command line using `clickhouse-client` in batch mode. For more information and further examples of inserting data this way, see ["Batch mode"](/interfaces/cli#batch-mode).

We'll be using the [Hacker News dataset](/getting-started/example-datasets/hacker-news) for this example, which contains 28 million rows of Hacker News data.

<VerticalStepper headerLevel="h3">

### Download the CSV {#download-csv}

Run the following command to download a CSV version of the dataset from our public S3 bucket:

```bash
wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.csv.gz
```

At 4.6GB and 28 million rows, this compressed file should take 5-10 minutes to download.

### Create the table {#create-table}

With `clickhouse-server` running, you can create an empty table with the following schema directly from the command line using `clickhouse-client` in batch mode:

```bash
clickhouse-client <<'_EOF'
CREATE TABLE hackernews(
    `id` UInt32,
    `deleted` UInt8,
    `type` Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    `by` LowCardinality(String),
    `time` DateTime,
    `text` String,
    `dead` UInt8,
    `parent` UInt32,
    `poll` UInt32,
    `kids` Array(UInt32),
    `url` String,
    `score` Int32,
    `title` String,
    `parts` Array(UInt32),
    `descendants` Int32
)
ENGINE = MergeTree
ORDER BY id
_EOF
```

If there are no errors, the table has been created successfully. In the command above, single quotes are used around the heredoc delimiter (`_EOF`) to prevent any shell interpolation. Without them it would be necessary to escape the backticks around the column names.
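
To illustrate that point, here is a minimal sketch (using a throwaway table name that is not part of this guide) of what the unquoted form would require:

```bash
# With an unquoted delimiter the shell treats backticks as command substitution,
# so each backtick must be escaped with a backslash.
clickhouse-client <<_EOF
CREATE TABLE escape_demo (\`id\` UInt32)
ENGINE = MergeTree
ORDER BY id
_EOF
```

Quoting the delimiter, as in the step above, avoids the need for any escaping.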

### Insert the data from the command line {#insert-data-via-cmd}

Next, run the command below to insert the data from the file you downloaded earlier into your table:

```bash
zcat < hacknernews.csv.gz | clickhouse-client --query "INSERT INTO hackernews FORMAT CSV"
```

As our data is compressed, we first need to decompress the file using a tool like `gzip`, `zcat`, or similar, and then pipe the decompressed data into `clickhouse-client` with the appropriate `INSERT` statement and `FORMAT`.

:::note
When inserting data with `clickhouse-client`, it is possible to let ClickHouse handle the decompression for you on insert using the `COMPRESSION` clause. ClickHouse can automatically detect the compression type from the file extension, but you can also specify it explicitly.

The query to insert would then look like this:

```bash
clickhouse-client --query "INSERT INTO hackernews FROM INFILE 'hacknernews.csv.gz' COMPRESSION 'gzip' FORMAT CSV;"
```
:::
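
As a small companion sketch to the note above, if you rely on the automatic detection it mentions, the `COMPRESSION` clause can simply be omitted:

```bash
# The compression type is inferred from the .gz file extension.
clickhouse-client --query "INSERT INTO hackernews FROM INFILE 'hacknernews.csv.gz' FORMAT CSV;"
```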

When the data has finished inserting, you can run the following command to see the number of rows in the `hackernews` table:

```bash
clickhouse-client --query "SELECT formatReadableQuantity(count(*)) FROM hackernews"
28.74 million
```

### Inserting data via the command line with curl {#insert-using-curl}

In the previous steps you first downloaded the CSV file to your local machine using `wget`. It is also possible to insert the data directly from the remote URL with a single command.

Run the following command to truncate the `hackernews` table so that you can insert the data again without the intermediate step of downloading it to your local machine:

```bash
clickhouse-client --query "TRUNCATE hackernews"
```

Now run:

```bash
curl https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.csv.gz | zcat | clickhouse-client --query "INSERT INTO hackernews FORMAT CSV"
```

You can now run the same command as before to verify that the data was inserted again:

```bash
clickhouse-client --query "SELECT formatReadableQuantity(count(*)) FROM hackernews"
28.74 million
```

</VerticalStepper>
