Commit 58194db

Merge pull request #4498 from Blargian/revise_sharding_replication_docs
Scaling docs: edits to make the guides flow easier
2 parents: e60dcac + e5c8adf

File tree: 3 files changed (+37, -24 lines)


docs/deployment-guides/replication-sharding-examples/02_2_shards_1_replica.md

Lines changed: 16 additions & 14 deletions

````diff
@@ -557,11 +557,7 @@ SHOW DATABASES;
 
 ## Create a table on the cluster {#creating-a-table}
 
-Now that the database has been created, you will create a distributed table.
-Distributed tables are tables which have access to shards located on different
-hosts and are defined using the `Distributed` table engine. The distributed table
-acts as the interface across all the shards in the cluster.
-
+Now that the database has been created, you will create a table.
 Run the following query from any of the host clients:
 
 ```sql
@@ -608,8 +604,6 @@ SHOW TABLES IN uk;
 └─────────────────────┘
 ```
 
-## Insert data into a distributed table {#inserting-data}
-
 Before we insert the UK price paid data, let's perform a quick experiment to see
 what happens when we insert data into an ordinary table from either host.
 
@@ -622,7 +616,7 @@ CREATE TABLE test.test_table ON CLUSTER cluster_2S_1R
 `id` UInt64,
 `name` String
 )
-ENGINE = ReplicatedMergeTree
+ENGINE = MergeTree()
 ORDER BY id;
 ```
 
````
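As a reviewer's aside, the revised experiment in the hunks above boils down to the following sketch. The `INSERT` statements and sample values are assumptions added for illustration; only the `CREATE TABLE` fragment is visible in this diff.

```sql
-- Ordinary, non-replicated table created on every host in the cluster.
CREATE TABLE test.test_table ON CLUSTER cluster_2S_1R
(
    `id` UInt64,
    `name` String
)
ENGINE = MergeTree()
ORDER BY id;

-- Hypothetical rows: run the first INSERT while connected to host 1
-- and the second while connected to host 2.
INSERT INTO test.test_table VALUES (1, 'inserted on host 1');
INSERT INTO test.test_table VALUES (2, 'inserted on host 2');

-- MergeTree() neither replicates nor forwards data, so each host
-- returns only the row it stored locally.
SELECT * FROM test.test_table;
```

With the engine switched to `MergeTree()`, nothing is shared between the hosts, which sets up the need for the distributed table introduced in the next hunks.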

````diff
@@ -654,16 +648,18 @@ SELECT * FROM test.test_table;
 -- └────┴────────────────────┘
 ```
 
-You will notice that only the row that was inserted into the table on that
+You will notice that unlike with a `ReplicatedMergeTree` table only the row that was inserted into the table on that
 particular host is returned and not both rows.
 
-To read the data from the two shards we need an interface which can handle queries
+To read the data across the two shards, we need an interface which can handle queries
 across all the shards, combining the data from both shards when we run select queries
-on it, and handling the insertion of data to the separate shards when we run insert queries.
+on it or inserting data to both shards when we run insert queries.
 
-In ClickHouse this interface is called a distributed table, which we create using
+In ClickHouse this interface is called a **distributed table**, which we create using
 the [`Distributed`](/engines/table-engines/special/distributed) table engine. Let's take a look at how it works.
 
+## Create a distributed table {#create-distributed-table}
+
 Create a distributed table with the following query:
 
 ```sql
@@ -674,8 +670,12 @@ ENGINE = Distributed('cluster_2S_1R', 'test', 'test_table', rand())
 In this example, the `rand()` function is chosen as the sharding key so that
 inserts are randomly distributed across the shards.
 
-Now query the distributed table from either host and you will get back
-both of the rows which were inserted on the two hosts:
+Now query the distributed table from either host, and you will get back
+both of the rows which were inserted on the two hosts, unlike in our previous example:
+
+```sql
+SELECT * FROM test.test_table_dist;
+```
 
 ```sql
 ┌─id─┬─name───────────────┐
@@ -694,6 +694,8 @@ ON CLUSTER cluster_2S_1R
 ENGINE = Distributed('cluster_2S_1R', 'uk', 'uk_price_paid_local', rand());
 ```
 
+## Insert data into a distributed table {#inserting-data-into-distributed-table}
+
 Now connect to either of the hosts and insert the data:
 
 ```sql
````
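Taken together, the added lines introduce the `Distributed` table as the query interface for the cluster. A condensed sketch of the flow follows; the `ON CLUSTER` and `AS` clauses and the `INSERT` values are assumptions, since only the engine arguments and the final `SELECT` appear in these hunks.

```sql
-- Distributed table acting as the interface across both shards.
-- rand() is the sharding key, so inserts are spread randomly across the shards.
CREATE TABLE test.test_table_dist ON CLUSTER cluster_2S_1R
AS test.test_table
ENGINE = Distributed('cluster_2S_1R', 'test', 'test_table', rand());

-- Hypothetical row: writes go through the distributed table and are routed
-- to one of the shards according to the sharding key.
INSERT INTO test.test_table_dist VALUES (3, 'inserted via the distributed table');

-- Reads through the distributed table combine the rows stored on both shards.
SELECT * FROM test.test_table_dist;
```

The last hunk then applies the same pattern to the UK price paid data, pointing a distributed table at `uk.uk_price_paid_local` just before the relocated "Insert data into a distributed table" heading.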

docs/deployment-guides/replication-sharding-examples/03_2_shards_2_replicas.md

Lines changed: 20 additions & 9 deletions

````diff
@@ -586,12 +586,9 @@ SHOW DATABASES;
 └────────────────────┘
 ```
 
-## Create a distributed table on the cluster {#creating-a-table}
+## Create a table on the cluster {#creating-a-table}
 
-Now that the database has been created, next you will create a distributed table.
-Distributed tables are tables which have access to shards located on different
-hosts and are defined using the `Distributed` table engine. The distributed table
-acts as the interface across all the shards in the cluster.
+Now that the database has been created, next you will create a table with replication.
 
 Run the following query from any of the host clients:
 
@@ -663,14 +660,16 @@ SHOW TABLES IN uk;
 
 ## Insert data into a distributed table {#inserting-data-using-distributed}
 
-To insert data into the distributed table, `ON CLUSTER` cannot be used as it does
+To insert data into the table, `ON CLUSTER` cannot be used as it does
 not apply to DML (Data Manipulation Language) queries such as `INSERT`, `UPDATE`,
 and `DELETE`. To insert data, it is necessary to make use of the
 [`Distributed`](/engines/table-engines/special/distributed) table engine.
+As you learned in the [guide](/architecture/horizontal-scaling) for setting up a cluster with 2 shards and 1 replica, distributed tables are tables which have access to shards located on different
+hosts and are defined using the `Distributed` table engine.
+The distributed table acts as the interface across all the shards in the cluster.
 
 From any of the host clients, run the following query to create a distributed table
-using the existing table we created previously with `ON CLUSTER` and use of the
-`ReplicatedMergeTree`:
+using the existing replicated table we created in the previous step:
 
 ```sql
 CREATE TABLE IF NOT EXISTS uk.uk_price_paid_distributed
````
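For anyone reviewing the end state rather than the wording, the pattern these hunks describe is a replicated local table fronted by a distributed table. The sketch below is a hedged reconstruction: the column list is abbreviated, and the cluster name `cluster_2S_2R` plus everything after the truncated `CREATE TABLE ... uk_price_paid_distributed` line are assumptions modelled on the 2-shard-1-replica guide rather than taken from this diff.

```sql
-- Replicated table on every host; each shard keeps two copies of its data.
-- The bare ReplicatedMergeTree form assumes the shard/replica macros configured
-- earlier in the guide; the columns are a shortened, illustrative schema.
CREATE TABLE uk.uk_price_paid_local ON CLUSTER cluster_2S_2R
(
    `price` UInt32,
    `date` Date,
    `town` LowCardinality(String)
)
ENGINE = ReplicatedMergeTree
ORDER BY (town, date);

-- Distributed table used as the single entry point for INSERT and SELECT,
-- since ON CLUSTER does not apply to DML queries such as INSERT.
CREATE TABLE IF NOT EXISTS uk.uk_price_paid_distributed ON CLUSTER cluster_2S_2R
AS uk.uk_price_paid_local
ENGINE = Distributed('cluster_2S_2R', 'uk', 'uk_price_paid_local', rand());
```

Inserts issued against `uk.uk_price_paid_distributed` are then split across the two shards, and the replicated engine keeps both copies within each shard in sync.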
````diff
@@ -749,4 +748,16 @@ SELECT count(*) FROM uk.uk_price_paid_local;
 └──────────┘
 ```
 
-</VerticalStepper>
+</VerticalStepper>
+
+## Conclusion {#conclusion}
+
+The advantage of this cluster topology with 2 shards and 2 replicas is that it provides both scalability and fault tolerance.
+Data is distributed across separate hosts, reducing storage and I/O requirements per node, while queries are processed in parallel across both shards for improved performance and memory efficiency.
+Critically, the cluster can tolerate the loss of one node and continue serving queries without interruption, as each shard has a backup replica available on another node.
+
+The main disadvantage of this cluster topology is the increased storage overhead—it requires twice the storage capacity compared to a setup without replicas, as each shard is duplicated.
+Additionally, while the cluster can survive a single node failure, losing two nodes simultaneously may render the cluster inoperable, depending on which nodes fail and how shards are distributed.
+This topology strikes a balance between availability and cost, making it suitable for production environments where some level of fault tolerance is required without the expense of higher replication factors.
+
+To learn how ClickHouse Cloud processes queries, offering both scalability and fault-tolerance, see the section ["Parallel Replicas"](/deployment-guides/parallel-replicas).
````
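The new conclusion's fault-tolerance claim rests on each shard having a second replica on another node. A quick way to sanity-check the layout the guide ends up with is to query `system.clusters` from any host; the cluster name `cluster_2S_2R` is an assumption here, as it never appears in this diff.

```sql
-- Lists every shard/replica pair this node knows about for the cluster.
-- For a 2-shard, 2-replica topology, expect four rows:
-- shard_num 1 and 2, each appearing with replica_num 1 and 2.
SELECT
    cluster,
    shard_num,
    replica_num,
    host_name
FROM system.clusters
WHERE cluster = 'cluster_2S_2R'
ORDER BY shard_num, replica_num;
```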

docs/deployment-guides/replication-sharding-examples/_snippets/_working_example.mdx

Lines changed: 1 addition & 1 deletion

````diff
@@ -2,5 +2,5 @@
 The following steps will walk you through setting up the cluster from
 scratch. If you prefer to skip these steps and jump straight to running the
 cluster, you can obtain the example
-files from the [examples repository](https://github.com/ClickHouse/examples/tree/main/docker-compose-recipes)
+files from the examples repository ['docker-compose-recipes' directory](https://github.com/ClickHouse/examples/tree/main/docker-compose-recipes/recipes).
 :::
````
