= Ensuring reliable etcd performance and scalability
:context: etcd-performance
toc::[]
To ensure optimal performance with etcd, it is important to understand the conditions that affect performance, including node scaling, leader election, log replication, tuning, latency, network jitter, peer round trip time, database size, and Kubernetes API transaction rates.

* link:https://docs.redhat.com/en/documentation/assisted_installer_for_openshift_container_platform/2024/html/installing_openshift_container_platform_with_the_assisted_installer/expanding-the-cluster#installing-control-plane-node-healthy-cluster_expanding-the-cluster[Expanding the cluster]
* xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]

By using the `etcdctl` CLI, you can monitor the latency for reaching consensus as experienced by etcd. You must identify one of the etcd pods and then retrieve the endpoint health.

This procedure, which validates and monitors cluster health, can be run only on an active cluster.

.Prerequisites

* During planning for cluster deployment, you completed the disk and network tests.

.Procedure
. Enter the following command:
+
[source,terminal]
----
# oc get pods -n openshift-etcd -l app=etcd
----
+
.Example output
[source,terminal]
----
NAME      READY   STATUS    RESTARTS   AGE
etcd-m0   4/4     Running   4          8h
etcd-m1   4/4     Running   4          8h
etcd-m2   4/4     Running   4          8h
----

. Enter the following command. To better understand the etcd latency for consensus, run this command in a watch cycle for a few minutes and observe that the reported values remain below the ~66 ms threshold. The closer the consensus time is to 100 ms, the more likely the cluster is to experience service-affecting events and instability.
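+
For example, a health check along the following lines reports each endpoint and the time that the check took, which reflects the latency for reaching consensus. This is a minimal sketch: the pod name `etcd-m0` is taken from the previous output and the `etcdctl` container name is an assumption, so adjust both for your cluster; you can wrap the command in `watch` to observe it over a few minutes:
+
[source,terminal]
----
# oc exec -n openshift-etcd -c etcdctl etcd-m0 -- etcdctl endpoint health -w table
----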
= Determining the size of the etcd database and understanding its effects
The size of the etcd database has a direct impact on the time to complete the etcd defragmentation process. {product-title} automatically runs the etcd defragmentation on one etcd member at a time when it detects at least 45% fragmentation. During the defragmentation process, the etcd member cannot process any requests. On small etcd databases, the defragmentation process happens in less than a second. With larger etcd databases, the disk latency directly affects the defragmentation time, causing additional latency, because operations are blocked while the defragmentation happens.

The size of the etcd database is also a factor to consider when network partitions isolate a control plane node for a period of time and the control plane needs to resynchronize after communication is re-established.

Few options exist for controlling the size of the etcd database, because it depends on the Operators and applications in the system. When you consider the latency range under which the system will operate, account for the effects of synchronization and defragmentation relative to the size of the etcd database.

The magnitude of the effects is specific to the deployment. While a defragmentation runs, the transaction rate degrades because the etcd member cannot accept updates during the defragmentation process. Similarly, the time for etcd re-synchronization of a large database with a high change rate affects both the transaction rate and the transaction latency on the system.

Consider the following two examples for the type of impacts to plan for.

Example of the effect of etcd defragmentation based on database size:: Writing an etcd database of 1 GB to a slow 7200 RPM disk at 80 Mbit/s takes about 1 minute and 40 seconds. In such a scenario, the defragmentation process takes at least this long, if not longer, to complete.

Example of the effect of database size on etcd synchronization:: If 10% of a 1 GB etcd database changes during the disconnection of one of the control plane nodes, the resync needs to transfer at least 100 MB. Transferring 100 MB over a 1 Gbps link takes 800 ms. On clusters with regular Kubernetes API transactions, the larger the etcd database size, the more network instabilities will cause control plane instabilities.

You can determine the size of an etcd database by using the {product-title} console or by running commands in the `etcdctl` tool.

.Procedure

* To find the database size in the {product-title} console, go to the *etcd* dashboard and view the plot that reports the size of the etcd database.

* To find the database size by using the `etcdctl` tool, enter two commands:

.. Enter the following command to list the pods:
+
[source,terminal]
----
# oc get pods -n openshift-etcd -l app=etcd
----
+
.Example output
[source,terminal]
----
NAME      READY   STATUS    RESTARTS   AGE
etcd-m0   4/4     Running   4          22h
etcd-m1   4/4     Running   4          22h
etcd-m2   4/4     Running   4          22h
----

.. Enter the following command and view the database size in the output:
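+
For example, a status query along the following lines reports the database size in the `DB SIZE` column. This is a minimal sketch: the pod name `etcd-m0` and the `etcdctl` container name are assumptions, so substitute the values from your cluster:
+
[source,terminal]
----
# oc exec -n openshift-etcd -c etcdctl etcd-m0 -- etcdctl endpoint status -w table
----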
= Determining the Kubernetes API transaction rate for your environment
When you are using stretched control planes, the Kubernetes API transaction rate depends on the characteristics of the particular deployment. Specifically, it depends on the following combined factors:

* The etcd disk latency
* The etcd round trip time
* The size of objects that are being written to the API

As a result, when you use stretched control planes, cluster administrators must test the environment to determine the sustained transaction rate that is possible for the environment. The `kube-burner` tool is useful for that purpose. The binary includes a wrapper for testing OpenShift clusters: `kube-burner-ocp`. You can use `kube-burner-ocp` to test cluster or node density. To test the control plane, `kube-burner-ocp` has three workload profiles: `cluster-density`, `cluster-density-v2`, and `cluster-density-ms`. Each workload profile creates a series of resources that are designed to load the control plane. For more information about each profile, see the `kube-burner-ocp` workload documentation.

.Procedure

. Enter a command to create and delete resources. The following example shows a command that creates and deletes resources within 20 minutes:
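+
The following sketch shows the general shape of such a command. It assumes that the `kube-burner-ocp` binary is installed and that you are logged in to the cluster; the workload profile, iteration count, and churn settings are example values only, and the available flags can vary between versions:
+
[source,terminal]
----
$ kube-burner-ocp cluster-density-v2 --iterations=3 --churn-duration=20m --qps=20 --burst=20
----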
. During the run, observe the API performance dashboard in the {product-title} console: click *Observe* -> *Dashboards*, and from the *Dashboards* menu, click *API Performance*.
+
On the dashboard, notice how the control plane responds during load and the 99th percentile transaction rate that it can achieve for the execution of various verbs and the request rates by read and write. Use this information, together with knowledge of your organization's workload, to determine the load that your organization can put on the clusters for the specific stretched control plane deployment.

An etcd cluster is sensitive to disk latency. To understand the disk latency that etcd experiences in your control plane environment, run the `fio` test suite.
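For example, you can run the containerized `fio` check that the cloud-bulldozer project publishes against the etcd data directory. This is a minimal sketch: it assumes that you can run `podman` as root on the control plane node and that the `quay.io/cloud-bulldozer/etcd-perf` image is reachable from your environment:

[source,terminal]
----
# podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/cloud-bulldozer/etcd-perf
----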
Make sure that the final report classifies the disk as appropriate for etcd, as shown in the following example:

[source,terminal]
----
...
99th percentile of fsync is 5865472 ns
99th percentile of the fsync is within the recommended threshold: - 20 ms, the disk can be used to host etcd
----
When a high latency disk is used, a message states that the disk is not recommended for etcd, as shown in the following example:

[source,terminal]
----
...
99th percentile of fsync is 15865472 ns
99th percentile of the fsync is greater than the recommended value which is 20 ms, faster disks are recommended to host etcd for better performance
----
When a cluster deployment spans multiple data centers and uses etcd disks that do not meet the recommended latency, the chances of service-affecting failures increase and the amount of network latency that the control plane can sustain is dramatically reduced.

etcd is a consistent, distributed key-value store that operates as a cluster of replicated nodes. Following the Raft algorithm, etcd elects one node as the leader and the others as followers. The leader maintains the current state of the system and ensures that the followers are up to date.
The leader node is responsible for log replication. It handles incoming write transactions from the client and writes a Raft log entry that it then broadcasts to the followers.
//diagram goes here
When an etcd client, such as `kube-apiserver`, connects to an etcd member and requests an action that requires a quorum, such as writing a value, and the contacted member is a follower, the follower returns a message indicating that the transaction must be sent to the leader.
//second diagram goes here
When the etcd client requests an action that requires a quorum from the leader, the leader keeps the client connection open while it writes the entry to its local Raft log, broadcasts the entry to the followers, and waits for a majority of the followers to acknowledge that they committed the entry without failures. Only then does the leader send the acknowledgment to the etcd client and close the session. If the followers report failures and a majority fails to reach consensus, the leader returns an error message to the client and closes the session.
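To see which member currently holds the leader role in a running cluster, you can query the status of all members from one of the etcd pods. This is a minimal sketch: the pod name `etcd-m0` and the `etcdctl` container name are assumptions, and the leader is reported in the `IS LEADER` column:

[source,terminal]
----
# oc exec -n openshift-etcd -c etcdctl etcd-m0 -- etcdctl endpoint status --cluster -w table
----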