@@ -238,17 +238,18 @@ POST /_db/_system/_api/replication/clusterInventory
 }
 ```
 
-Check that for all collections the attribute `"allInSync"` has
-the value `true`. Note that it is necessary to do this for all databases!
+Check that for all collections the attributes `"isReady"` and `"allInSync"`
+both have the value `true`. Note that it is necessary to do this for all
+databases!
 
 Here is a shell command which makes this check easy:
 
 ```bash
-curl -k https://arangodb.9hoeffer.de:8529/_db/_system/_api/replication/clusterInventory --user root: | jq . | grep ' "allInSync"' | sort | uniq -c
+curl -k https://arangodb.9hoeffer.de:8529/_db/_system/_api/replication/clusterInventory --user root: | jq . | grep ' "isReady"\| "allInSync"' | sort | uniq -c
 ```
 
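As a variant of the check above, one can let `jq` do the aggregation itself instead of counting `grep` matches. This is only a sketch: it assumes the inventory response carries a top-level `collections` array whose entries contain the `isReady` and `allInSync` flags, and it is demonstrated here against a mocked, abbreviated response rather than the live `curl` output:

```bash
# Mocked, abbreviated clusterInventory response (illustration only);
# in practice, pipe the curl output from above into this jq filter.
cat <<'EOF' | jq '[.collections[] | .isReady and .allInSync] | all'
{"collections":[{"isReady":true,"allInSync":true},{"isReady":true,"allInSync":true}]}
EOF
```

If the filter prints `true`, every collection in this database is in sync; remember to repeat the check for every database.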
-If all these checks are performed and are okay, the cluster is ready to
-run a risk-free drain operation.
+If all these checks are performed and are okay, then it is safe to
+continue with the clean out and drain procedure as described below.
 
 {% hint 'danger' %}
 If there are some collections with `replicationFactor` set to
@@ -274,13 +275,14 @@ below, the procedure should also work without this.
 Finally, one should **not run a rolling upgrade or restart operation**
 at the time of a node drain.
 
-## Clean out a DBserver manually (optional)
+## Clean out a DBserver manually
 
-In this step we clean out a _DBServer_ manually, before even issuing the
-`kubectl drain` command. This step is optional, but can speed up things
-considerably. Here is why:
+In this step we clean out a _DBServer_ manually, **before issuing the
+`kubectl drain` command**. We previously described this step as optional,
+but for safety reasons we now consider it mandatory, since it is nearly
+impossible to choose the grace period reliably long enough.
 
-If this step is not performed, we must choose
+Furthermore, if this step is not performed, we must choose
 the grace period long enough to avoid any risk, as explained in the
 previous section. However, this has a disadvantage which has nothing to
 do with ArangoDB: We have observed that some k8s internal services like
@@ -308,10 +310,10 @@ POST /_admin/cluster/cleanOutServer
 {"server":"DBServer0006"}
 ```
 
-(please compare the above output of the `/_admin/cluster/health` API).
 The value of the `"server"` attribute should be the name of the DBserver
 which is the one in the pod which resides on the node that shall be
-drained next. This uses the UI short name, alternatively one can use the
+drained next. This uses the UI short name (`ShortName` in the
+`/_admin/cluster/health` API); alternatively, one can use the
 internal name, which corresponds to the pod name. In our example, the
 pod name is:
 
@@ -328,6 +330,12 @@ could use the body:
 {"server":"PRMR-wbsq47rz"}
 ```
 
+You can use this command line to achieve this:
+
+```bash
+curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/cleanOutServer --user root: -d '{"server":"PRMR-wbsq47rz"}'
+```
+
 The API call will return immediately with a body like this:
 
 ```JSON
@@ -360,6 +368,12 @@ GET /_admin/cluster/queryAgencyJob?id=38029195
 }
 ```
 
+Use this command line to check progress:
+
+```bash
+curl -k https://arangodb.9hoeffer.de:8529/_admin/cluster/queryAgencyJob?id=38029195 --user root:
+```
+
 It indicates that the job is still ongoing (`"Pending"`). As soon as
 the job has completed, the answer will be:
 
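Rather than re-running the query by hand, the completion check can be scripted. The following is only a sketch: the `job_status` helper is hypothetical, the `status` field and its `"Pending"`/`"Finished"` values are taken from the example responses in the surrounding text, and the live polling loop is shown only as a comment since the endpoint and credentials depend on your deployment:

```bash
# Hypothetical helper: extract the "status" field from a queryAgencyJob response.
job_status() {
  jq -r '.status'
}

# In a live deployment one could poll until the job reports "Finished", e.g.:
#   while [ "$(curl -sk 'https://arangodb.9hoeffer.de:8529/_admin/cluster/queryAgencyJob?id=38029195' \
#       --user root: | job_status)" != "Finished" ]; do
#     sleep 10
#   done

# Illustration with a mocked response:
echo '{"error":false,"id":"38029195","status":"Pending"}' | job_status
```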
@@ -391,8 +405,8 @@ completely risk-free, even with a small grace period.
 ## Performing the drain
 
 After all above [checks before a node drain](#things-to-check-in-arangodb-before-a-node-drain)
-have been done successfully, it is safe to perform the drain
-operation, similar to this command:
+and the [manual clean out of the DBServer](#clean-out-a-dbserver-manually)
+have been done successfully, it is safe to perform the drain operation, similar to this command:
 
 ```bash
 kubectl drain gke-draintest-default-pool-394fe601-glts --delete-local-data --ignore-daemonsets --grace-period=300
@@ -402,12 +416,12 @@ As described above, the options `--delete-local-data` for ArangoDB and
 `--ignore-daemonsets` for other services have been added. A `--grace-period` of
 300 seconds has been chosen because for this example we are confident that all the data on our _DBServer_ pod
 can be moved to a different server within 5 minutes. Note that this is
-**not saying** that 300 seconds will always be enough, regardless of how
+**not saying** that 300 seconds will always be enough. Regardless of how
 much data is stored in the pod, your mileage may vary; moving a terabyte
 of data can take considerably longer!
 
-If the optional step of
-[cleaning out a DBserver manually](#clean-out-a-dbserver-manually-optional)
+If the highly recommended step of
+[cleaning out a DBserver manually](#clean-out-a-dbserver-manually)
 has been performed beforehand, the grace period can easily be reduced to 60
 seconds - at least from the perspective of ArangoDB, since the server is already
 cleaned out, so it can be dropped readily and there is still no risk.