Commit 6b7ce0d (parent 3275c6b)

Author: northjhuang
Commit message: add suspend in workflow disruption template

21 files changed: +636 −76 lines

playbook/README.md

Lines changed: 33 additions & 12 deletions
@@ -33,6 +33,19 @@ Supports enabling `etcd Overload Protection` and `APF Flow Control` [APF Rate Li
 | `inject-stress-list-qps` | `int` | "100" | QPS per stress test Pod |
 | `inject-stress-total-duration` | `string` | "30s" | Total test duration (e.g. 30s, 5m) |
 
+**Recommended Parameters for TKE Clusters**
+
+| Cluster Level | resource-create-object-size-bytes | resource-create-object-count | resource-create-qps | inject-stress-concurrency | inject-stress-list-qps |
+|---------------|-----------------------------------|------------------------------|---------------------|---------------------------|------------------------|
+| L5    | 10000  | 100   | 10  | 6  | 200 |
+| L50   | 10000  | 300   | 10  | 6  | 200 |
+| L100  | 50000  | 500   | 20  | 6  | 200 |
+| L200  | 100000 | 1000  | 50  | 9  | 200 |
+| L500  | 100000 | 1000  | 50  | 12 | 200 |
+| L1000 | 100000 | 3000  | 50  | 12 | 300 |
+| L3000 | 100000 | 6000  | 500 | 18 | 500 |
+| L5000 | 100000 | 10000 | 500 | 21 | 500 |
+
 **etcd Overload Protection & Enhanced APF**
 
 Tencent Cloud TKE team has developed these core protection features:
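The recommended parameters added above map a cluster level to a full set of stress-test settings. A minimal sketch encoding that table as a lookup (the `workflow_params` helper is hypothetical, not part of the playbook):

```python
# Hypothetical helper: encode the recommended TKE stress-test parameters
# from the table above as a lookup keyed by cluster level.
RECOMMENDED = {
    # level: (object-size-bytes, object-count, create-qps, concurrency, list-qps)
    "L5":    (10000, 100, 10, 6, 200),
    "L50":   (10000, 300, 10, 6, 200),
    "L100":  (50000, 500, 20, 6, 200),
    "L200":  (100000, 1000, 50, 9, 200),
    "L500":  (100000, 1000, 50, 12, 200),
    "L1000": (100000, 3000, 50, 12, 300),
    "L3000": (100000, 6000, 500, 18, 500),
    "L5000": (100000, 10000, 500, 21, 500),
}

def workflow_params(level: str) -> dict:
    """Return the workflow parameter set recommended for a cluster level."""
    size, count, qps, concurrency, list_qps = RECOMMENDED[level]
    return {
        "resource-create-object-size-bytes": size,
        "resource-create-object-count": count,
        "resource-create-qps": qps,
        "inject-stress-concurrency": concurrency,
        "inject-stress-list-qps": list_qps,
    }
```

Each key matches a parameter name from the table, so the dict can be turned directly into `-p name=value` arguments when submitting the workflow.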
@@ -56,31 +69,39 @@ Supported versions:
 **playbook**: `workflow/coredns-disruption-scenario.yaml`
 
 This scenario simulates coredns service disruption by:
-1. Scaling coredns Deployment replicas to 0
-2. Maintaining zero replicas for specified duration
-3. Restoring original replica count
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to ensure the cluster is available for testing
+2. **Component Shutdown**: Log in to the Argo Web UI, click the `coredns-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the coredns Deployment down to 0 replicas
+3. **Service Validation**: While coredns is down, verify whether your services are affected by the disruption
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the original coredns Deployment replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
 ## kubernetes-proxy Disruption
 
 **playbook**: `workflow/kubernetes-proxy-disruption-scenario.yaml`
 
 This scenario simulates kubernetes-proxy service disruption by:
-1. Scaling kubernetes-proxy Deployment replicas to 0
-2. Maintaining zero replicas for specified duration
-3. Restoring original replica count
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to ensure the cluster is available for testing
+2. **Component Shutdown**: Log in to the Argo Web UI, click the `kubernetes-proxy-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the kubernetes-proxy Deployment down to 0 replicas
+3. **Service Validation**: While kubernetes-proxy is down, verify whether your services are affected by the disruption
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the original kubernetes-proxy Deployment replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
 ## Namespace Deletion Protection
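The two `RESUME` steps in each scenario drive a scale-down/restore cycle on the target Deployment. A minimal sketch of that underlying pattern as kubectl invocations, assuming coredns runs in `kube-system` and originally had 2 replicas (both assumptions, chosen for illustration):

```python
# Illustrative only: the scale-down/restore cycle that the suspend-1 and
# suspend-2 RESUME steps drive, expressed as kubectl commands. The
# "kube-system" namespace and the restore count of 2 are assumptions.
def scale_cmd(deployment: str, namespace: str, replicas: int) -> str:
    """Build the kubectl command that sets a Deployment's replica count."""
    return (f"kubectl -n {namespace} "
            f"scale deployment/{deployment} --replicas={replicas}")

shutdown = scale_cmd("coredns", "kube-system", 0)  # what suspend-1 triggers
restore = scale_cmd("coredns", "kube-system", 2)   # what suspend-2 triggers
```

In practice the workflow records the original replica count before scaling down, so the recovery step restores whatever value was observed rather than a hard-coded number.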
@@ -140,10 +161,10 @@ kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.ya
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
-| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
-| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
-| `cluster-id` | `string` | `<CLUSTER_ID>` | Target cluster ID |
+| `region` | `string` | "" | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
+| `secret-id` | `string` | "" | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
+| `secret-key` | `string` | "" | Tencent Cloud API secret key |
+| `cluster-id` | `string` | "" | Target cluster ID |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Secret name containing target cluster kubeconfig |
 
 **Notes**
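Since this change makes the four credential parameters default to the empty string instead of `<PLACEHOLDER>` values, a submission should fail fast when any of them is left blank. A hypothetical pre-flight check (not part of the playbook):

```python
# Hypothetical pre-flight check (not part of the playbook): region,
# secret-id, secret-key and cluster-id now default to "" and must all
# be filled in before the master-component drill is submitted.
REQUIRED = ("region", "secret-id", "secret-key", "cluster-id")

def missing_params(params: dict) -> list:
    """Return the required parameters that are still empty or absent."""
    return [k for k in REQUIRED if not params.get(k, "")]
```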

playbook/README_zh.md

Lines changed: 27 additions & 12 deletions
@@ -33,6 +33,19 @@
 | `inject-stress-list-qps` | `int` | "100" | QPS per stress test `Pod` |
 | `inject-stress-total-duration` | `string` | "30s" | Total stress-test duration (e.g. 30s, 5m) |
 
+**Recommended Stress-Test Parameters for TKE Clusters**
+
+| Cluster Level | resource-create-object-size-bytes | resource-create-object-count | resource-create-qps | inject-stress-concurrency | inject-stress-list-qps |
+|---------------|-----------------------------------|------------------------------|---------------------|---------------------------|------------------------|
+| L5    | 10000  | 100   | 10  | 6  | 200 |
+| L50   | 10000  | 300   | 10  | 6  | 200 |
+| L100  | 50000  | 500   | 20  | 6  | 200 |
+| L200  | 100000 | 1000  | 50  | 9  | 200 |
+| L500  | 100000 | 1000  | 50  | 12 | 200 |
+| L1000 | 100000 | 3000  | 50  | 12 | 300 |
+| L3000 | 100000 | 6000  | 500 | 18 | 500 |
+| L5000 | 100000 | 10000 | 500 | 21 | 500 |
+
 **etcd Overload Protection & Enhanced APF Rate Limiting**
 
 On top of the community version, the Tencent Cloud TKE team has developed the following core protection features:
@@ -56,31 +69,33 @@
 **playbook**: `workflow/coredns-disruption-scenario.yaml`
 
 This scenario constructs a `coredns` service disruption as follows:
-1. Scale the `coredns` Deployment replicas down to `0`
-2. Hold the replica count at `0` for the specified duration
-3. Restore the original replica count
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to ensure the cluster is available for the drill
+2. **Component Shutdown**: Log in to the Argo Web UI, click the `coredns-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the `coredns` Deployment down to `0` replicas
+3. **Service Validation**: While `coredns` is down, verify whether your services are affected by the `coredns` outage
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the `coredns` Deployment replicas
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Name of the target cluster kubeconfig secret; if empty, the drill runs against the current cluster |
 
 ## kubernetes-proxy Disruption
 
 **playbook**: `workflow/kubernetes-proxy-disruption-scenario.yaml`
 
 This scenario constructs a `kubernetes-proxy` service disruption as follows:
-1. Scale the `kubernetes-proxy` Deployment replicas down to `0`
-2. Hold the replica count at `0` for the specified duration
-3. Restore the original replica count
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to ensure the cluster is available for the drill
+2. **Component Shutdown**: Log in to the Argo Web UI, click the `kubernetes-proxy-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the `kubernetes-proxy` Deployment down to `0` replicas
+3. **Service Validation**: While `kubernetes-proxy` is down, verify whether your services are affected by the `kubernetes-proxy` outage
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the `kubernetes-proxy` Deployment replicas
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Name of the target cluster kubeconfig secret; if empty, the drill runs against the current cluster |
 
 ## Namespace Deletion Protection
@@ -139,10 +154,10 @@ kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.ya
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
-| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID; obtain it from [API Key Management](https://console.cloud.tencent.com/cam/capi) in the console |
-| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
-| `cluster-id` | `string` | `<CLUSTER_ID>` | Drill target cluster ID |
+| `region` | `string` | "" | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
+| `secret-id` | `string` | "" | Tencent Cloud API secret ID; obtain it from [API Key Management](https://console.cloud.tencent.com/cam/capi) in the console |
+| `secret-key` | `string` | "" | Tencent Cloud API secret key |
+| `cluster-id` | `string` | "" | Drill target cluster ID |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Name of the target cluster kubeconfig secret |
 
 **Notes**
