
Commit 013c3ae

Make cleanOutServer step mandatory in drain documentation (#355)

2 parents: 628f913 + 6244d57

1 file changed: docs/Manual/Deployment/Kubernetes/Drain.md (+31, −17 lines)
````diff
@@ -238,17 +238,18 @@ POST /_db/_system/_api/replication/clusterInventory
 }
 ```
 
-Check that for all collections the attribute `"allInSync"` has
-the value `true`. Note that it is necessary to do this for all databases!
+Check that for all collections the attributes `"isReady"` and `"allInSync"`
+both have the value `true`. Note that it is necessary to do this for all
+databases!
 
 Here is a shell command which makes this check easy:
 
 ```bash
-curl -k https://arangodb.9hoeffer.de:8529/_db/_system/_api/replication/clusterInventory --user root: | jq . | grep '"allInSync"' | sort | uniq -c
+curl -k https://arangodb.9hoeffer.de:8529/_db/_system/_api/replication/clusterInventory --user root: | jq . | grep '"isReady"\|"allInSync"' | sort | uniq -c
 ```
 
-If all these checks are performed and are okay, the cluster is ready to
-run a risk-free drain operation.
+If all these checks are performed and are okay, then it is safe to
+continue with the clean out and drain procedure as described below.
 
 {% hint 'danger' %}
 If there are some collections with `replicationFactor` set to
````
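The `sort | uniq -c` pipeline in the hunk above counts how often each truth value occurs, so a healthy cluster prints only `true` lines. A minimal local illustration of that counting step, run on made-up sample lines instead of a live `curl`:

```bash
# Simulate the interesting lines of a clusterInventory response;
# on a healthy cluster only "true" values should show up.
printf '%s\n' \
  '"allInSync": true,' \
  '"isReady": true,' \
  '"allInSync": true,' \
  '"isReady": true,' \
  | grep '"isReady"\|"allInSync"' | sort | uniq -c
```

This prints a count of 2 for each attribute; any line showing `false` in real output means the cluster is not yet in sync and the drain must wait.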
````diff
@@ -274,13 +275,14 @@ below, the procedure should also work without this.
 Finally, one should **not run a rolling upgrade or restart operation**
 at the time of a node drain.
 
-## Clean out a DBserver manually (optional)
+## Clean out a DBserver manually
 
-In this step we clean out a _DBServer_ manually, before even issuing the
-`kubectl drain` command. This step is optional, but can speed up things
-considerably. Here is why:
+In this step we clean out a _DBServer_ manually, **before issuing the
+`kubectl drain` command**. Previously, we denoted this step as optional,
+but for safety reasons we now consider it mandatory, since it is nearly
+impossible to reliably choose a sufficiently long grace period.
 
-If this step is not performed, we must choose
+Furthermore, if this step is not performed, we must choose
 the grace period long enough to avoid any risk, as explained in the
 previous section. However, this has a disadvantage which has nothing to
 do with ArangoDB: We have observed that some k8s internal services like
````
````diff
@@ -308,10 +310,10 @@ POST /_admin/cluster/cleanOutServer
 {"server":"DBServer0006"}
 ```
 
-(please compare the above output of the `/_admin/cluster/health` API).
 The value of the `"server"` attribute should be the name of the DBserver
 which is the one in the pod which resides on the node that shall be
-drained next. This uses the UI short name, alternatively one can use the
+drained next. This uses the UI short name (`ShortName` in the
+`/_admin/cluster/health` API); alternatively, one can use the
 internal name, which corresponds to the pod name. In our example, the
 pod name is:
 
````
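The hunk above distinguishes the UI short name (`ShortName`) from the internal name. With jq installed, the mapping between the two can be read straight off a `/_admin/cluster/health` response. A sketch run on a trimmed, made-up sample instead of a live `curl` (the ids mirror this page's examples):

```bash
# Print "internal-id  short-name" pairs from a (sample) health response.
# A real run would pipe `curl -k .../_admin/cluster/health --user root:`
# into the same jq filter.
cat <<'EOF' | jq -r '.Health | to_entries[] | "\(.key)  \(.value.ShortName)"'
{"Health": {"PRMR-wbsq47rz": {"ShortName": "DBServer0006", "Role": "DBServer"}}}
EOF
```

Either column is accepted by `cleanOutServer`, and the internal id is also the pod name suffix used by the operator.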
````diff
@@ -328,6 +330,12 @@ could use the body:
 {"server":"PRMR-wbsq47rz"}
 ```
 
+You can use this command line to achieve this:
+
+```bash
+curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/cleanOutServer --user root: -d '{"server":"PRMR-wbsq47rz"}'
+```
+
 The API call will return immediately with a body like this:
 
 ```JSON
````
````diff
@@ -360,6 +368,12 @@ GET /_admin/cluster/queryAgencyJob?id=38029195
 }
 ```
 
+Use this command line to check progress:
+
+```bash
+curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/queryAgencyJob?id=38029195 --user root:
+```
+
 It indicates that the job is still ongoing (`"Pending"`). As soon as
 the job has completed, the answer will be:
 
````
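Instead of re-running the progress check by hand, the two curl calls added in this commit can be wrapped in a small polling helper. This is only a sketch, not part of the commit: it assumes jq is installed, that `cleanOutServer` answers with an `id` field, and that `queryAgencyJob` reports a `status` that stays `"Pending"` until it flips to `"Finished"`, as the surrounding text describes.

```bash
ENDPOINT=https://arangodb.9hoeffer.de:8529   # example endpoint from this page

# Look up the status of one agency job.
job_status() {
  curl -sk "$ENDPOINT/_admin/cluster/queryAgencyJob?id=$1" --user root: | jq -r .status
}

# Block until the given clean-out job is done.
wait_for_job() {
  while [ "$(job_status "$1")" != "Finished" ]; do
    echo "job $1 still pending..."
    sleep 10
  done
  echo "job $1 finished"
}

# Typical use (hypothetical server name from this page):
#   id=$(curl -sk "$ENDPOINT/_admin/cluster/cleanOutServer" --user root: \
#          -d '{"server":"PRMR-wbsq47rz"}' | jq -r .id)
#   wait_for_job "$id"
```

Only after `wait_for_job` returns should the actual `kubectl drain` be issued; at that point the server holds no data and the grace period no longer has to cover data movement.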
````diff
@@ -391,8 +405,8 @@ completely risk-free, even with a small grace period.
 ## Performing the drain
 
 After all above [checks before a node drain](#things-to-check-in-arangodb-before-a-node-drain)
-have been done successfully, it is safe to perform the drain
-operation, similar to this command:
+and the [manual clean out of the DBServer](#clean-out-a-dbserver-manually)
+have been done successfully, it is safe to perform the drain operation, similar to this command:
 
 ```bash
 kubectl drain gke-draintest-default-pool-394fe601-glts --delete-local-data --ignore-daemonsets --grace-period=300
````
````diff
@@ -402,12 +416,12 @@ As described above, the options `--delete-local-data` for ArangoDB and
 `--ignore-daemonsets` for other services have been added. A `--grace-period` of
 300 seconds has been chosen because for this example we are confident that all the data on our _DBServer_ pod
 can be moved to a different server within 5 minutes. Note that this is
-**not saying** that 300 seconds will always be enough, regardless of how
+**not saying** that 300 seconds will always be enough. Regardless of how
 much data is stored in the pod, your mileage may vary: moving a terabyte
 of data can take considerably longer!
 
-If the optional step of
-[cleaning out a DBserver manually](#clean-out-a-dbserver-manually-optional)
+If the highly recommended step of
+[cleaning out a DBserver manually](#clean-out-a-dbserver-manually)
 has been performed beforehand, the grace period can easily be reduced to 60
 seconds - at least from the perspective of ArangoDB, since the server is already
 cleaned out, so it can be dropped readily and there is still no risk.
````
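The warning above (300 seconds is not a universal number) can be made concrete with back-of-the-envelope arithmetic. The helper below is hypothetical and not from the commit: it estimates a grace period from the amount of data on the server and an assumed move throughput, both of which you must measure for your own cluster.

```bash
# Rough grace-period estimate: data to move (GB) / throughput (GB/s),
# plus 50% slack. Both inputs are assumptions, not measured values.
estimate_grace_period() {
  data_gb=$1; throughput_gb_per_s=$2
  awk -v d="$data_gb" -v t="$throughput_gb_per_s" \
      'BEGIN { printf "%d\n", d / t * 1.5 }'
}

estimate_grace_period 10 0.05     # 10 GB at 50 MB/s -> 300 seconds
estimate_grace_period 1000 0.05   # 1 TB at 50 MB/s -> 30000 seconds (~8 hours)
```

The terabyte case illustrates why the manual clean out is now mandatory: no practical `--grace-period` covers it reliably, whereas an already cleaned-out server can be dropped within a short, fixed grace period.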
