My challenges deploying RabbitMQ Quorum Queues
How I've addressed issues like nodes getting stuck in the Terminating state and queues not being replicated properly
TL;DR
Mirrored Queues have been deprecated, and it is recommended to use Quorum Queues instead, which rely on the Raft consensus algorithm.
To use Quorum Queues effectively, your cluster should have at least 3 nodes. If your cluster has fewer than 3 nodes, or if not all nodes are running, Quorum Queues will not function properly.
If you later change the number of nodes in your cluster, you must grow the queue replicas onto the new nodes yourself; existing Quorum Queues will not rebalance onto them automatically.
RabbitMQ and Quorum Queues
RabbitMQ has been around for a while and is one of the best message brokers available. As a Python developer, I’ve been using it for years alongside Celery to run background tasks, and more recently, as a broker in scenarios where I utilize domain events.
Over the years, I’ve encountered some challenges when deploying RabbitMQ in production, with the most recent one involving the new type of queue introduced with RabbitMQ 3.8—Quorum Queues (which, spoiler alert, are a replacement for the old Mirrored Queues).
When running a RabbitMQ cluster in production, there are several critical considerations, but I’d like to highlight two key points:
You should have more than one node running.
Your queues should be replicated so that if one node goes down, your queues continue to function.
To achieve queue replication, I had been using RabbitMQ’s Mirrored Queues, but they were deprecated starting with RabbitMQ 3.9. As a result, I had to transition to using the new Quorum Queues.
The RabbitMQ quorum queue is a modern queue type, which implements a durable, replicated FIFO queue based on the Raft consensus algorithm.
Quorum queues are designed to be safer and provide simpler, well-defined failure-handling semantics that users should find easier to reason about when designing and operating their systems.
First Challenge: Insufficient Nodes
My RabbitMQ cluster was already up and running, but the queues were durable and not replicated. There were 2 nodes running in this cluster.
Enabling the Quorum Queue feature is straightforward: you just need to set the x-queue-type argument to quorum. However, since my queues were durable, I had to recreate them, because you cannot simply change the type of an existing queue.
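From Python, the declaration looks roughly like this. The helper name is mine, but the call follows pika's BlockingChannel.queue_declare signature:

```python
# Sketch of declaring a quorum queue from Python. The helper name is
# mine; the call matches pika's BlockingChannel.queue_declare signature.
QUORUM_ARGS = {"x-queue-type": "quorum"}  # selects the quorum queue type

def declare_quorum_queue(channel, name):
    """Declare a durable quorum queue on the given AMQP channel.

    The queue type is fixed at declaration time: an existing classic
    queue cannot be converted in place, it has to be deleted and
    re-declared with the quorum argument.
    """
    return channel.queue_declare(
        queue=name,
        durable=True,           # quorum queues are always durable
        arguments=QUORUM_ARGS,  # cannot be changed after declaration
    )
```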
After adding this configuration to my queues, I encountered an unexpected issue: the nodes in my Kubernetes cluster were getting stuck in the Terminating state and were never fully terminated. After some debugging, I discovered that, unlike the old Mirrored Queues, Quorum Queues use the Raft Consensus Algorithm, which was causing this behavior.
Consensus is a fundamental problem in fault-tolerant distributed systems. Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, they stop making progress (but will never return an incorrect result).
The first issue I encountered was that my cluster only had 2 nodes. With 2 replicas, losing one leaves a single node, which is not a majority, so the cluster cannot elect a new leader. As a result, the terminating node was never allowed to complete its shutdown, because that would cost the queues their quorum.
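The quorum arithmetic explains what I was seeing. This is standard Raft majority math, not RabbitMQ-specific code:

```python
# Raft quorum math: a queue with n replicas needs a majority
# (floor(n / 2) + 1) of them online to elect a leader.

def quorum_size(replicas: int) -> int:
    """Smallest majority of `replicas` nodes."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """How many replicas can go down before the queue loses quorum."""
    return replicas - quorum_size(replicas)
```

With 2 replicas, tolerated_failures(2) is 0, so neither node can safely go away; with 3 it is 1, which is why 3 nodes is the recommended minimum, and with 5 it is 2, matching the example in the quote above.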
The solution? Well, I increased the cluster from 2 nodes to 3. Problem solved, right? Unfortunately, no.
Second Challenge: Queue Replication
After increasing the number of nodes in the cluster, I noticed two issues:
The queues were being replicated, but only across 2 nodes—not all of them.
A node was still getting stuck in the Terminating state.
After some investigation, I realized that because the cluster was originally created with 2 nodes and the queues were durable, nothing told RabbitMQ that the queues should now be replicated across 3 nodes instead of 2. As a result, each queue had been randomly assigned to 2 of the nodes. Due to the Raft algorithm, a node could not be terminated, since for some of the queues that would leave only one replica online, leading to the same problem as before.
The solution? Ensure that all nodes have replicas of all queues. Fortunately, this only needs to be done once, and RabbitMQ CLI provides a command for it:
rabbitmq-queues grow <node> all
This command adds a new replica on the given node for all quorum queues.
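To verify the fix, I wanted to see which queues were still under-replicated. The sketch below assumes you have already fetched the JSON list from the management HTTP API's GET /api/queues endpoint; quorum queues report their replica nodes in a members field there, and the helper name is mine:

```python
# Sketch: flag quorum queues whose replica set is smaller than the
# cluster. `queues` is the parsed JSON list from GET /api/queues on
# the management HTTP API (the "members" field is an assumption based
# on how quorum queues report replicas there).

def under_replicated(queues, cluster_size):
    """Return names of quorum queues with fewer replicas than nodes."""
    return [
        q["name"]
        for q in queues
        if q.get("type") == "quorum"
        and len(q.get("members", [])) < cluster_size
    ]
```

Running this before and after rabbitmq-queues grow makes it easy to confirm that every quorum queue now has a replica on every node.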
Conclusion
All of this information is available in the RabbitMQ documentation, but it’s scattered across different sections, making it difficult to find everything in one place. Quorum Queues are definitely more reliable and faster than the old Mirrored Queues, but adding more nodes to the cluster is not as straightforward as it used to be.