[Questions] Classic queue overdelivers to closed channel with global prefetch #13229
Unanswered · gomoripeti asked this question in Questions · Replies: 0 comments
Community Support Policy
RabbitMQ version used
other (please specify)
Erlang version used
26.2.x
Operating system (distribution) used
ubuntu
How is RabbitMQ deployed?
Debian package
rabbitmq-diagnostics status output
See https://www.rabbitmq.com/docs/cli to learn how to use rabbitmq-diagnostics
Logs from node 1 (with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 2 (if applicable, with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 3 (if applicable, with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
rabbitmq.conf
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location
Steps to deploy RabbitMQ cluster
deb package
Steps to reproduce the behavior in question
_
advanced.config
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find advanced.config file location
Application code
# PASTE CODE HERE, BETWEEN BACKTICKS
Kubernetes deployment file
What problem are you trying to solve?
Note: this issue happens when a channel uses global QoS, which is a deprecated feature (it is, however, still permitted on latest main and 4.0.x).
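For context, a channel opts into the channel-wide ("global") prefetch with `basic.qos` and the `global` flag set to true. A minimal sketch with the Erlang AMQP 0.9.1 client, assuming `Channel` is an already open channel process:

```erlang
%% Minimal sketch (assumes the amqp_client library and an open channel).
-include_lib("amqp_client/include/amqp_client.hrl").

set_global_prefetch(Channel) ->
    %% With global = true the prefetch_count applies to the whole channel,
    %% i.e. all consumers on the channel share one unacked-message budget.
    #'basic.qos_ok'{} =
        amqp_channel:call(Channel, #'basic.qos'{prefetch_count = 10,
                                                global         = true}).
```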
When AMQP 0.9.1 global QoS is used, the queue process asks the limiter process of the consumer channel whether it can deliver the next message. However, if the limiter process does not exist or is shutting down, `can_send` allows the delivery. This puzzling behaviour results in the queue delivering messages to a dead consumer, beyond the original prefetch count, until it runs out of credit (200 by default). While the queue process is executing `run_message_queue` in a loop, it does not handle `DOWN` messages from terminated consumer channels. The many `can_send` calls also add extra cost if the queue and channel processes are on different nodes, and the unacknowledged messages have to be loaded and kept in memory, so the queue process memory gets bloated.
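To illustrate the mechanism described above, here is a rough sketch of the fallback pattern in question. It is not the actual `rabbit_limiter.erl` code, and every name other than `can_send` and `ExitValue` is made up:

```erlang
%% Illustrative sketch only -- not the real rabbit_limiter.erl code.
%% The queue asks the channel's limiter whether one more message may be
%% delivered. If the limiter process is already gone (the consumer
%% channel died), the exit raised by the call is swallowed and ExitValue
%% is returned instead. With ExitValue = true the queue keeps delivering
%% to the dead consumer until it runs out of credit.
can_send(LimiterPid, QueuePid, AckRequired) ->
    ExitValue = true,  %% the value the questions below ask about flipping to false
    try
        gen_server2:call(LimiterPid, {can_send, QueuePid, AckRequired}, infinity)
    catch
        exit:_ -> ExitValue
    end.
```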
The problem we have observed in production is a classic queue getting into a "bad" state: it has a long (~10K) internal gen_server2 message queue, it holds a lot of messages (let's say 100K), and more and more of them go into the unacked state although the consuming client side has received none. Normally there are 60 consumers on the queue (each from a separate connection/channel). The client application tries to scale out by adding a few more consumers, but `basic.consume` always times out, the connections are closed and the subscription is retried. Because the queue is always far behind on processing the `basic.consume` and channel-down events, it can never recover (or is very slow to recover after all the clients are stopped from retrying). The stacktrace of the queue process is dominantly:
My questions:
1. Should the `ExitValue = true` in rabbitmq-server/deps/rabbit/src/rabbit_limiter.erl (line 221 in a87036b) be `false`, to prevent sending messages to dead consumers? (I think that `true` is a leftover from 2008, when the limiter process was shut down when it was not used, whereas today the limiter is always running next to the channel process and is merely marked by the queue as `dormant` if unused.)
2. Does `ExitValue = true` affect any other code path than AMQP 0.9.1 consumers with global qos (e.g. AMQP 1.0 consumers)?
3. Would a PR changing `ExitValue` from `true` to `false` be accepted?