Parallel cluster formation and the classic peer discovery race condition with the DNS discovery mechanism on Amazon ECS #14813
Community Support Policy
RabbitMQ version used: 4.1.4
Erlang version used: 27.3.x
Operating system (distribution) used: Linux: bottlerocket-aws-ecs-2-aarch64-v1.49.0-713f44ce
How is RabbitMQ deployed? Community Docker image
rabbitmq-diagnostics status output: (not provided)
Logs from node 1 (with sensitive values edited out): ip-10-59-135-109. Standalone node.
Logs from node 2 (if applicable, with sensitive values edited out): ip-10-59-80-177. The only node able to join a cluster.
Logs from node 3 (if applicable, with sensitive values edited out): ip-10-59-125-51. The leader node? This is weird: {'rabbit@ip-10-59-149-24',['rabbit@ip-10-59-149-24'],1761269529969988,true}
rabbitmq.conf: (not provided)
Steps to deploy RabbitMQ cluster: AWS ECS deployment
Steps to reproduce the behavior in question: Change the desired count from 3 to 1, and back to 3. The cluster will randomly end up as three one-node clusters, or as one two-node cluster plus a standalone node.
advanced.config: (not provided)
What problem are you trying to solve? This is the version showing in the admin console. We are trying to form a 3-node cluster on AWS ECS using the community Docker image rabbitmq:4-management as the base image.
Replies: 2 comments
We won't troubleshoot your AWS deployment.

Your screenshots demonstrate a classic example of a split cluster: during the initial cluster formation, some booting nodes discover their peers while others do not, depending on their boot order and progress. This scenario is so well known that it has a dedicated doc section in the Cluster Formation guide.

For the DNS peer discovery mechanism, a randomized delay (each node waits for a random number of seconds between 1 and, say, 10, before starting) should work well. RabbitMQ used to have a setting for that, but it was removed because alternatives considered to be better were introduced (this is covered in the doc section above).

Alternatively, you can deploy one node at a time during the initial cluster formation. That should eliminate the fundamental problem.
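The randomized delay suggested above could be sketched roughly as a small wrapper that sleeps for a random interval and then hands over to the server process. This is only an illustration, not something RabbitMQ ships: the function names `startup_delay` and `delayed_exec` are hypothetical, the 1 to 10 second range is taken from the reply, and it assumes a Python interpreter is available in the container image, which the community image may not provide.

```python
import os
import random
import time


def startup_delay(min_s: float = 1.0, max_s: float = 10.0) -> float:
    """Pick a random delay, in seconds, between min_s and max_s (inclusive)."""
    if min_s < 0 or max_s < min_s:
        raise ValueError("expected 0 <= min_s <= max_s")
    return random.uniform(min_s, max_s)


def delayed_exec(command):
    """Sleep for a random interval, then replace this process with `command`.

    `command` is a list such as ["rabbitmq-server"]; the exact command is an
    assumption and depends on how the image's entrypoint is set up.
    """
    delay = startup_delay()
    print(f"delaying node start by {delay:.1f}s", flush=True)
    time.sleep(delay)
    # execvp replaces the current process, so signals sent to the container
    # reach the server directly instead of this wrapper.
    os.execvp(command[0], command)
```

A wrapper script would call `delayed_exec(["rabbitmq-server"])` as its last step. Because each node draws its own delay, nodes started at the same moment are less likely to run peer discovery simultaneously, although a random delay reduces the chance of a split rather than ruling it out.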
I don't know how you use ECS, but if you have access to the EC2 instance metadata API endpoint, then you can use the […]. If that endpoint is not available on ECS in your case but the node name is known, then use the classic config option. Finally, you can use the etcd or Consul mechanisms anywhere you can provision etcd or Consul with a stable hostname. So that's four more alternatives besides injecting a delay.
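For the classic config option mentioned above, the `rabbitmq.conf` entries are a backend selection line followed by one numbered line per node. A minimal sketch that renders those lines from a list of node names follows; the helper name `render_classic_config` and the node names in the usage note are hypothetical, and it assumes the node names stay stable across task restarts.

```python
def render_classic_config(nodes):
    """Return rabbitmq.conf lines listing `nodes` for classic config peer discovery."""
    if not nodes:
        raise ValueError("at least one node name is required")
    lines = ["cluster_formation.peer_discovery_backend = classic_config"]
    # Entries are numbered from 1: cluster_formation.classic_config.nodes.1, .2, ...
    for index, node in enumerate(nodes, start=1):
        lines.append(f"cluster_formation.classic_config.nodes.{index} = {node}")
    return "\n".join(lines)
```

Calling `render_classic_config(["rabbit@node1.example.internal", "rabbit@node2.example.internal"])` yields three lines that can be appended to `rabbitmq.conf`. Every node should be given the same list.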