Parallel cluster formation and the classic peer discovery race condition with the DNS discovery mechanism on Amazon ECS #14813

jeremy-xu-resolver · 2025-10-24T15:30:28Z

Community Support Policy

I have read RabbitMQ's Community Support Policy
I run RabbitMQ 4.x, the only series currently covered by community support
I promise to provide all relevant information (versions, logs from all nodes, rabbitmq-diagnostics output, detailed reproduction steps)

RabbitMQ version used

4.1.4

Erlang version used

27.3.x

Operating system (distribution) used

Linux: bottlerocket-aws-ecs-2-aarch64-v1.49.0-713f44ce

How is RabbitMQ deployed?

Community Docker image

Logs from node 1 (with sensitive values edited out)

ip-10-59-135-109. Standalone Node


2025-10-24 01:36:46.319314+00:00 [debug] <0.208.0> Peer discovery: retrying to create/sync cluster in 1000 ms (0 attempts left)
--
2025-10-24 01:36:47.320187+00:00 [info] <0.208.0> Addresses discovered via A records of rabbitmq.rabbitmq4-2520.core.local: 10.59.80.177, 10.59.125.51
2025-10-24 01:36:47.321196+00:00 [info] <0.208.0> Addresses discovered via AAAA records of rabbitmq.rabbitmq4-2520.core.local:
2025-10-24 01:36:47.321315+00:00 [debug] <0.208.0> Peer discovery: backend returned the following configuration:
2025-10-24 01:36:47.321315+00:00 [debug] <0.208.0>   {ok,{['rabbit@ip-10-59-80-177','rabbit@ip-10-59-125-51'],disc}}
2025-10-24 01:36:47.321523+00:00 [debug] <0.208.0> Peer discovery: peer node arguments: #{args =>
2025-10-24 01:36:47.321523+00:00 [debug] <0.208.0>                                            ["-setcookie",
2025-10-24 01:36:47.321523+00:00 [debug] <0.208.0>                                             "/var/lib/rabbitmq/mnesia/.erlang.cookie",
2025-10-24 01:36:47.321523+00:00 [debug] <0.208.0>                                             "-boot","start_sasl","-hidden"],
2025-10-24 01:36:47.321523+00:00 [debug] <0.208.0>                                        name => "rabbit-4258-39",
2025-10-24 01:36:47.321523+00:00 [debug] <0.208.0>                                        connection => standard_io,
2025-10-24 01:36:47.321523+00:00 [debug] <0.208.0>                                        wait_boot => infinity}
2025-10-24 01:36:47.490196+00:00 [debug] <0.208.0> Peer discovery: using temporary hidden node 'rabbit-4258-39@ip-10-59-135-109' to query discovered peers properties
=PROGRESS REPORT==== 24-Oct-2025::01:36:47.496387 ===
supervisor: {local,inet_gethost_native_sup}
started: [{pid,<0.95.0>},{mfa,{inet_gethost_native,init,[[]]}}]
=PROGRESS REPORT==== 24-Oct-2025::01:36:47.500299 ===
supervisor: {local,kernel_safe_sup}
started: [{pid,<0.94.0>},
{id,inet_gethost_native_sup},
{mfargs,{inet_gethost_native,start_link,[]}},
{restart_type,temporary},
{significant,false},
{shutdown,1000},
{child_type,worker}]
=DEBUG REPORT==== 24-Oct-2025::01:36:47.508906 ===
Peer discovery: temporary hidden node 'rabbit-4258-39@ip-10-59-135-109' queries properties from node 'rabbit@ip-10-59-80-177'
=DEBUG REPORT==== 24-Oct-2025::01:36:47.510833 ===
Peer discovery: temporary hidden node 'rabbit-4258-39@ip-10-59-135-109' queries properties from node 'rabbit@ip-10-59-125-51'
2025-10-24 01:36:47.512065+00:00 [debug] <0.208.0> Peer discovery: sorted list of nodes and their properties considered to create/sync the cluster:
2025-10-24 01:36:47.512065+00:00 [debug] <0.208.0>   - {'rabbit@ip-10-59-125-51',['rabbit@ip-10-59-80-177','rabbit@ip-10-59-125-51'],1761269643650318,true}
2025-10-24 01:36:47.512065+00:00 [debug] <0.208.0>   - {'rabbit@ip-10-59-80-177',['rabbit@ip-10-59-80-177','rabbit@ip-10-59-125-51'],1761269725550302,true}
2025-10-24 01:36:47.512308+00:00 [debug] <0.208.0> Peer discovery: not satisfyied with discovered peers: the list does not contain this node
2025-10-24 01:36:47.512369+00:00 [error] <0.208.0> Peer discovery: could not discover and join another node; proceeding as a standalone node

Logs from node 2 (if applicable, with sensitive values edited out)

ip-10-59-80-177. The only node that was able to join a cluster


2025-10-24 01:35:38.241532+00:00 [debug] <0.208.0> Peer discovery: using temporary hidden node 'rabbit-3490-38@ip-10-59-80-177' to query discovered peers properties
--
=PROGRESS REPORT==== 24-Oct-2025::01:35:38.247794 ===
supervisor: {local,inet_gethost_native_sup}
started: [{pid,<0.95.0>},{mfa,{inet_gethost_native,init,[[]]}}]
=PROGRESS REPORT==== 24-Oct-2025::01:35:38.251494 ===
supervisor: {local,kernel_safe_sup}
started: [{pid,<0.94.0>},
{id,inet_gethost_native_sup},
{mfargs,{inet_gethost_native,start_link,[]}},
{restart_type,temporary},
{significant,false},
{shutdown,1000},
{child_type,worker}]
=DEBUG REPORT==== 24-Oct-2025::01:35:38.263795 ===
Peer discovery: temporary hidden node 'rabbit-3490-38@ip-10-59-80-177' queries properties from node 'rabbit@ip-10-59-125-51'
2025-10-24 01:35:38.265243+00:00 [debug] <0.340.0> Peer discovery: temporary hidden node 'rabbit-3490-38@ip-10-59-80-177' queries properties from node 'rabbit@ip-10-59-80-177'
=DEBUG REPORT==== 24-Oct-2025::01:35:38.265243 ===
Peer discovery: temporary hidden node 'rabbit-3490-38@ip-10-59-80-177' queries properties from node 'rabbit@ip-10-59-80-177'
2025-10-24 01:35:38.265715+00:00 [debug] <0.208.0> Peer discovery: sorted list of nodes and their properties considered to create/sync the cluster:
2025-10-24 01:35:38.265715+00:00 [debug] <0.208.0>   - {'rabbit@ip-10-59-125-51',['rabbit@ip-10-59-125-51'],1761269643650318,true}
2025-10-24 01:35:38.265715+00:00 [debug] <0.208.0>   - {'rabbit@ip-10-59-80-177',['rabbit@ip-10-59-80-177'],1761269725550302,true}
2025-10-24 01:35:38.266196+00:00 [info] <0.208.0> Peer discovery: node 'rabbit@ip-10-59-125-51' selected for auto-clustering
2025-10-24 01:35:38.266275+00:00 [debug] <0.208.0> Peer discovery: trying to acquire lock
2025-10-24 01:35:38.266310+00:00 [info] <0.208.0> Peer discovery: will try to lock with peer discovery backend rabbit_peer_discovery_dns
2025-10-24 01:35:38.266349+00:00 [debug] <0.208.0> Peer discovery: rabbit_peer_discovery:lock/0 returned not_supported
2025-10-24 01:35:38.266415+00:00 [debug] <0.208.0> Peer discovery: no lock acquired
2025-10-24 01:35:38.266469+00:00 [info] <0.208.0> DB: checking if `rabbit@ip-10-59-80-177` can join cluster using remote node `rabbit@ip-10-59-125-51`
2025-10-24 01:35:38.266502+00:00 [debug] <0.208.0> Feature flags: CHECKING COMPATIBILITY between nodes `rabbit@ip-10-59-80-177` and `rabbit@ip-10-59-125-51`; consider node `rabbit@ip-10-59-80-177` as virgin
2025-10-24 01:35:38.266595+00:00 [notice] <0.208.0> Feature flags: checking nodes `rabbit@ip-10-59-80-177` and `rabbit@ip-10-59-125-51` compatibility...
2025-10-24 01:35:38.271668+00:00 [debug] <0.208.0> Feature flags: collecting inventory on nodes: ['rabbit@ip-10-59-80-177']

Logs from node 3 (if applicable, with sensitive values edited out)

ip-10-59-125-51. The leader node?

This is weird:

{'rabbit@ip-10-59-149-24',['rabbit@ip-10-59-149-24'],1761269529969988,true}

Does this mean there is a node at ip-10-59-149-24? The cluster never had a node with that IP.


2025-10-24 01:34:40.075517+00:00 [debug] <0.208.0> Peer discovery: sorted list of nodes and their properties considered to create/sync the cluster:
--
2025-10-24 01:34:40.075517+00:00 [debug] <0.208.0>   - {'rabbit@ip-10-59-149-24',['rabbit@ip-10-59-149-24'],1761269529969988,true}
2025-10-24 01:34:40.075770+00:00 [debug] <0.208.0> Peer discovery: not satisfyied with discovered peers: the list does not contain this node
2025-10-24 01:34:40.075855+00:00 [debug] <0.208.0> Peer discovery: retrying to create/sync cluster in 1000 ms (0 attempts left)
2025-10-24 01:34:41.076953+00:00 [info] <0.208.0> Addresses discovered via A records of rabbitmq.rabbitmq4-2520.core.local: 10.59.149.24
2025-10-24 01:34:41.077593+00:00 [info] <0.208.0> Addresses discovered via AAAA records of rabbitmq.rabbitmq4-2520.core.local:
2025-10-24 01:34:41.077671+00:00 [debug] <0.208.0> Peer discovery: backend returned the following configuration:
2025-10-24 01:34:41.077671+00:00 [debug] <0.208.0>   {ok,{['rabbit@ip-10-59-149-24'],disc}}
2025-10-24 01:34:41.077847+00:00 [debug] <0.208.0> Peer discovery: peer node arguments: #{args =>
2025-10-24 01:34:41.077847+00:00 [debug] <0.208.0>                                            ["-setcookie",
2025-10-24 01:34:41.077847+00:00 [debug] <0.208.0>                                             "/var/lib/rabbitmq/mnesia/.erlang.cookie",
2025-10-24 01:34:41.077847+00:00 [debug] <0.208.0>                                             "-boot","start_sasl","-hidden"],
2025-10-24 01:34:41.077847+00:00 [debug] <0.208.0>                                        name => "rabbit-4162-38",
2025-10-24 01:34:41.077847+00:00 [debug] <0.208.0>                                        connection => standard_io,
2025-10-24 01:34:41.077847+00:00 [debug] <0.208.0>                                        wait_boot => infinity}
2025-10-24 01:34:41.236680+00:00 [debug] <0.208.0> Peer discovery: using temporary hidden node 'rabbit-4162-38@ip-10-59-125-51' to query discovered peers properties
=PROGRESS REPORT==== 24-Oct-2025::01:34:41.243102 ===
supervisor: {local,inet_gethost_native_sup}
started: [{pid,<0.95.0>},{mfa,{inet_gethost_native,init,[[]]}}]
=PROGRESS REPORT==== 24-Oct-2025::01:34:41.247104 ===
supervisor: {local,kernel_safe_sup}
started: [{pid,<0.94.0>},
{id,inet_gethost_native_sup},
{mfargs,{inet_gethost_native,start_link,[]}},
{restart_type,temporary},
{significant,false},
{shutdown,1000},
{child_type,worker}]
2025-10-24 01:34:41.252250+00:00 [debug] <0.208.0> Peer discovery: sorted list of nodes and their properties considered to create/sync the cluster:
2025-10-24 01:34:41.252250+00:00 [debug] <0.208.0>   - {'rabbit@ip-10-59-149-24',['rabbit@ip-10-59-149-24'],1761269529969988,true}
2025-10-24 01:34:41.252489+00:00 [debug] <0.208.0> Peer discovery: not satisfyied with discovered peers: the list does not contain this node
2025-10-24 01:34:41.252553+00:00 [error] <0.208.0> Peer discovery: could not discover and join another node; proceeding as a standalone node

rabbitmq.conf

See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location

cluster_formation.peer_discovery_backend = dns
cluster_formation.node_cleanup.only_log_warning = false
#
log.file.level = debug
log.console.level = debug

Steps to deploy RabbitMQ cluster

AWS ECS deployment

Steps to reproduce the behavior in question

Change the desired count from 3 to 1, then back to 3. The cluster randomly ends up as either three one-node clusters or one two-node cluster plus a standalone node.
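
When reproducing this, it may help to look at what the discovery name resolves to at the moment a node boots. The "the list does not contain this node" lines in the logs above appear when the node's own address is missing from the A records it discovered. The following is only a minimal sketch of such a check: `contains_addr` and `resolve_a_records` are hypothetical helpers, not part of RabbitMQ, and the sketch assumes `getent` and `awk` are available in the container.

```shell
# Hypothetical helper: succeeds if the address given as $1 appears in the
# whitespace-separated list of addresses given as $2.
contains_addr() {
  own="$1"
  for addr in $2; do
    if [ "$addr" = "$own" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: prints the unique IPv4 addresses that the name in $1
# currently resolves to. Assumes getent is present in the image.
resolve_a_records() {
  getent ahostsv4 "$1" | awk '{ print $1 }' | sort -u
}
```

For example, `contains_addr "$OWN_IP" "$(resolve_a_records "$DNS_HOSTNAME")"` would fail for a node whose address has not yet been published under the discovery name. Here `OWN_IP` is an assumed variable holding the task's own address.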

What problem are you trying to solve?

This is the version showing in the admin console

We are trying to form a 3-node cluster on AWS ECS using the community Docker image rabbitmq:4-management as the base image.
We build our own Docker image by copying in rabbitmq.conf and .erlang.cookie. A cluster-startup.sh script injects the node name and waits for a random delay before the node starts. The cluster name is set from the command line through the management HTTP API:
curl -sSf -X PUT -u "$USERNAME:$PASSWORD" "$MANAGEMENT_URL/api/cluster-name" -d "$name"

export RABBITMQ_NODENAME="rabbit@$(echo "$NAME" | cut -d. -f1)"
echo "cluster_formation.dns.hostname = ${DNS_HOSTNAME}" >> /etc/rabbitmq/rabbitmq.conf

: "${STARTUP_DELAY:=0}"
echo "Waiting ${STARTUP_DELAY}s before starting"
sleep "$STARTUP_DELAY"

Answered by michaelklishin

Nov 20, 2025

michaelklishin · 2025-11-20T00:31:43Z

Maintainer

We won't troubleshoot your AWS deployment.

Your screenshots demonstrate a classic example of a split cluster: during the initial cluster formation, some booting nodes discover their peers while others do not depending on their boot order/progress.

This scenario is so well known that it has a dedicated doc section in the Cluster Formation guide.

For the DNS peer discovery mechanism, a randomized delay (each node waits for a random number of seconds between 1 and, say, 10, before starting) should work well. RabbitMQ used to have a setting for that, but it was removed because alternatives considered to be better were introduced (this is covered in the doc section above).

Alternatively you can deploy one node at a time during the initial cluster formation. That should eliminate the fundamental problem.
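
As an illustration of the randomized delay described above, here is a minimal sketch for a startup script like the one in the question. `random_between` is a hypothetical helper, and the 1 to 10 second range is just the example range mentioned above. The sketch assumes `od`, `tr` and `/dev/urandom` exist in the container.

```shell
# Hypothetical helper: prints a random integer between $1 and $2 inclusive.
# Reads two bytes from /dev/urandom, so nodes that boot within the same
# second are still likely to get different values.
random_between() {
  min="$1"
  max="$2"
  n=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
  echo $(( min + n % (max - min + 1) ))
}
```

In the question's script, `STARTUP_DELAY=$(random_between 1 10)` could then replace the fixed default of 0 before the existing `sleep "$STARTUP_DELAY"` line. A delay only makes the race less likely. Deploying one node at a time avoids it altogether.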

0 replies

michaelklishin · 2025-11-20T02:14:47Z

Maintainer

I don't know how you use ECS, but if you have access to the EC2 instance metadata API endpoint, you can use the rabbitmq_peer_discovery_aws plugin, which is maintained by a few RabbitMQ core team members at AWS.

If that endpoint is not available on ECS in your case but the node name is known, then use the classic config option.
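
For the classic config option, a startup script could generate the node list in the same way the question's script appends `cluster_formation.dns.hostname`. The sketch below is only an illustration: `render_classic_config` is a hypothetical helper, and the node names passed to it are placeholders that would have to be known in advance.

```shell
# Hypothetical helper: prints rabbitmq.conf lines for the classic config
# peer discovery backend, one numbered entry per node name passed in.
render_classic_config() {
  echo "cluster_formation.peer_discovery_backend = classic_config"
  i=1
  for node in "$@"; do
    echo "cluster_formation.classic_config.nodes.$i = $node"
    i=$((i + 1))
  done
}
```

For example, `render_classic_config rabbit@node1 rabbit@node2 rabbit@node3 >> /etc/rabbitmq/rabbitmq.conf` would append three entries. The existing `cluster_formation.peer_discovery_backend = dns` line would need to be removed first so that the file selects only one backend.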

Finally, you can use etcd or Consul mechanisms anywhere where you can provision etcd or Consul with a stable hostname.

So that's four more alternatives besides injecting a delay.

0 replies
