Design A Distributed Queue
A distributed messaging queue, similar to RabbitMQ or Amazon SQS.
Requirements:
1. High Throughput
2. As much persistence as possible without compromising on high throughput
3. Support for multiple consumers and producers
4. Once an item is consumed, it can be deleted; only one consumer can receive any given item
We can create multiple queues and add and remove items from each one of them by
specifying the queue id or the queue name.
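To make these requirements concrete, here is a minimal in-memory sketch (in Python, with names invented for illustration) of the interface such a queue service could expose. Popping under a lock is what gives the “only one consumer per item” guarantee; persistence and distribution come later.

import collections
import threading

class SimpleQueueService:
    """In-memory sketch of the queue API: multiple named queues, many
    producers and consumers, and each item delivered to exactly one consumer."""

    def __init__(self):
        self._queues = collections.defaultdict(collections.deque)
        self._lock = threading.Lock()

    def enqueue(self, queue_name, item):
        # Any producer can write to any named queue.
        with self._lock:
            self._queues[queue_name].append(item)

    def dequeue(self, queue_name):
        # The item is removed as it is handed out, so no two consumers
        # can ever receive the same item.
        with self._lock:
            q = self._queues[queue_name]
            return q.popleft() if q else None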
Let’s now say we have 3 machines and one queue to create and manage. We have a
large volume of data, so we want this queue to be sharded. How can we manage this?
We have a lot of producers writing to this queue and a lot of consumers pulling from
the queue. How do we handle the high throughput?
Well, the logical thing to do would be to partition the queue into 3 separate queues, one
on each machine. Each producer can write to any partition, and consumers can
consume from any partition.
For huge workloads (think Facebook scale), you can partition this queue over 100
machines and achieve a lot of scale.
Keeping the queue balanced means writing and reading evenly from all the machines.
If one machine gets starved of items, then its readers will be starved as well.
The challenge really becomes writing and reading evenly from all partitions. So we
need some sort of a balancing approach, or some sort of coordinator.
Let’s say we are creating a Queue Q1. We create 3 partitions - P1, P2 and P3 - each on
a different machine. We will need some sort of Queue Manager that keeps track of
the partitions.
Let’s say that we have such a program running on one of the three machines.
When a producer needs to write to the Queue, it asks the Queue Manager for the IP
address of a machine it can write to. Let’s say the Queue Manager returns the IP
address of Machine C (where P3 is located). The Producer can now establish a
persistent TCP/IP connection and start writing to P3.
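As a rough sketch of this handshake (the wire format, ports and addresses here are made up for illustration), the producer first asks the Queue Manager which partition to use, then opens its own persistent connection to that machine and streams records directly:

import json
import socket

def lookup_partition(manager_host, manager_port, queue_name):
    # Ask the (hypothetical) Queue Manager which partition this producer
    # should write to; it replies with the host and port of that machine.
    with socket.create_connection((manager_host, manager_port)) as s:
        request = {"op": "get_partition", "queue": queue_name}
        s.sendall(json.dumps(request).encode() + b"\n")
        reply = json.loads(s.makefile().readline())
    return reply["host"], reply["port"]

def produce(queue_name, records, manager_addr=("10.0.0.1", 9000)):
    host, port = lookup_partition(manager_addr[0], manager_addr[1], queue_name)
    # Persistent connection straight to the partition machine (e.g. Machine C
    # for P3); the Queue Manager never sees the data itself.
    with socket.create_connection((host, port)) as conn:
        for record in records:
            conn.sendall(json.dumps({"queue": queue_name, "data": record}).encode() + b"\n")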
An even simpler design would be for the producer to just hand the data off to the Queue Manager and let the Queue Manager take care of the rest. The problem with that approach is that the Queue Manager becomes a bottleneck: all of the data has to go through it, and there’s only so much throughput this one machine can handle.
So now we’ve established that the producers will directly connect to the partition
machine and write the data.
So far, we have had only one producer. Now, let’s say another producer wants to write to the Queue. The Queue Manager can send this producer to a different machine so that the previous partition is not hogged and the other partitions are also written to.
This way, the Queue Manager keeps sending new producers to different partitions,
distributing the throughput across partitions.
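A sketch of how the Queue Manager could hand out partitions (the partition addresses are placeholders): it simply cycles through the partition list, so each new producer lands on a different machine.

import itertools

class QueueManager:
    """Keeps track of a queue's partitions and assigns each new producer
    to one of them in round-robin order, so no single machine is hogged."""

    def __init__(self, partitions):
        self._partitions = list(partitions)
        self._assigner = itertools.cycle(self._partitions)

    def get_partition(self, producer_id):
        partition = next(self._assigner)
        print(f"producer {producer_id} -> {partition}")
        return partition

manager = QueueManager(["machineA:9001", "machineB:9001", "machineC:9001"])  # P1, P2, P3
manager.get_partition("web-server-1")  # machineA:9001
manager.get_partition("web-server-2")  # machineB:9001
manager.get_partition("web-server-3")  # machineC:9001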
The above setup has a flaw, though - can you spot it? Spend some time thinking
about it.
Here it is: this model of assigning a single partition per producer works well only if the producers are homogeneous and greater in number than the queue partitions. For load that doesn’t have a lot of bursts, and where there are many producer servers, this will work well. But if there is even one “power producer” that produces data at a very high rate, its partition will get hogged.
This model breaks down as soon as we have different producers producing very different quantities of data.
What can we do to solve this? Well, one solution is for each producer to write to
multiple partitions. Each producer can do a round-robin write to different partitions.
For example, RabbitMQ has client libraries that the producer machine will install.
Let’s say the producer is a Web server and the web server needs to write lots of JSON
objects to the queue. The web server will install your Queue’s (MyQueue) client
library. This library will have functions to connect to the queue and write to it. For
example:
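(The exact API is not spelled out here, so the following is a minimal Python sketch; the partition addresses and internals are purely illustrative.)

import itertools
import json

class MyQueueManager:
    """Client library installed on the producer (e.g. the web server).
    connect() discovers the queue's partitions; enqueue() spreads writes
    across them in round-robin order."""

    def __init__(self, queue_name, partitions):
        self._queue_name = queue_name
        self._next_partition = itertools.cycle(partitions)

    @classmethod
    def connect(cls, queue_name):
        # A real library would fetch the partition list from the Queue Manager;
        # it is hard-coded here to keep the sketch self-contained.
        return cls(queue_name, ["10.0.0.2:9001", "10.0.0.3:9001", "10.0.0.4:9001"])

    def enqueue(self, obj):
        partition = next(self._next_partition)
        payload = json.dumps(obj)
        # A real client would send `payload` over a persistent connection to
        # `partition`; printing just shows where the round-robin decision happens.
        print(f"enqueue to {partition}: {payload}")

# Usage from the web server:
queue = MyQueueManager.connect("MyQueue")
queue.enqueue({"user_id": 42, "event": "page_view"})
queue.enqueue({"user_id": 7, "event": "click"})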
The enqueue function and the MyQueueManager library should handle round-robin writes to different partitions. All of this is abstracted away from the end user, i.e. the producer.
So now we have seen two approaches for writing to the Queue - assign one partition to a producer, or assign multiple partitions to the producer. In reality, different situations will make different approaches more suitable.
If we have 1000 producers and only 3 machines, then it might make sense to assign
one partition per producer, because evenly spreading so many producers might result
in good load balancing anyway.
In our library, we can give a configurable property for this. The developer can
configure it according to their situation and customize the load balancing.
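For instance (the property name and values below are made up), the client library could accept the strategy as a configuration parameter:

class MyQueueClientConfig:
    """Hypothetical configuration object for the client library; the developer
    picks whichever load-balancing strategy suits their workload."""

    STRATEGIES = ("single_partition", "round_robin")

    def __init__(self, queue_name, strategy="round_robin"):
        if strategy not in self.STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy}")
        self.queue_name = queue_name
        self.strategy = strategy

# Thousands of small, homogeneous producers -> one partition per producer.
config = MyQueueClientConfig("MyQueue", strategy="single_partition")

# A few bursty "power producers" -> spread each producer across partitions.
config = MyQueueClientConfig("MyQueue", strategy="round_robin")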
From the discussion in the non-distributed version, we know that until the data is flushed to disk, it is susceptible to being lost if the process crashes or the machine goes down.
Now, if a hard drive fails, we have suddenly lost all contents in a queue.
We turn to a time-tested method: replication. That is the only way to safeguard
against total machine failure.
This means that we need to replicate each partition onto multiple machines.
Let’s look at how exactly the replication happens. Each partition has a leader machine. The producer connects to the leader and sends it data. The leader writes the data to its own partition and to all of its synchronous replicas. Synchronous replicas are those that are updated synchronously, in lockstep with the leader replica. Let’s say we have 1 synchronous replica. This ensures that every write lands on 2 machines right away.
In order to lose data, now 2 machines have to go down at the same time, which is
much less likely. To make this even less likely, the administrator can also do things
like ensuring the two machines are connected to different power supplies or different
network routers - so that they don’t have a common cause of failure.
We can also have normal, asynchronous replicas, which are synchronized after the leader returns an ack to the producer. These replicas are better for throughput, because the write is propagated asynchronously and the producer does not have to wait for it. Each partition now has one leader, its synchronous replicas, and its asynchronous replicas.
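A sketch of the leader’s write path under these assumptions: the write is applied locally and on every synchronous replica before the producer gets its ack, and only afterwards pushed to the asynchronous replicas in the background (replicas are modelled as plain lists to keep the sketch self-contained).

import queue
import threading

class PartitionLeader:
    """Sketch of one partition's leader with synchronous and asynchronous replicas."""

    def __init__(self, sync_replicas, async_replicas):
        self.log = []                          # the leader's own copy of the partition
        self.sync_replicas = sync_replicas     # updated before acking the producer
        self.async_replicas = async_replicas   # updated in the background, after the ack
        self._backlog = queue.Queue()
        threading.Thread(target=self._drain_async, daemon=True).start()

    def write(self, record):
        self.log.append(record)                # local write
        for replica in self.sync_replicas:
            replica.append(record)             # the data now lives on 2+ machines
        self._backlog.put(record)              # propagate lazily to the rest
        return "ack"                           # producer only waits for the sync copies

    def _drain_async(self):
        while True:
            record = self._backlog.get()
            for replica in self.async_replicas:
                replica.append(record)

sync_copy, async_copy = [], []
leader = PartitionLeader(sync_replicas=[sync_copy], async_replicas=[async_copy])
print(leader.write({"order_id": 1}))           # "ack" once the leader and sync replica have it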
After adding fault tolerance, this queueing system can now scale to many machines.
We can auto-scale the queue and add more partitions as the load increases. To add another partition on a new machine, we simply create the new partition, register it with the Queue Manager, and point producers to it.
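To sketch that scale-out step (names again made up), adding a partition amounts to registering the new machine’s partition with the Queue Manager, after which it can be handed out to producers like any other:

class QueueMetadata:
    """Sketch of the Queue Manager's view of one queue: a growable
    list of partition addresses."""

    def __init__(self, name, partitions):
        self.name = name
        self.partitions = list(partitions)

    def add_partition(self, address):
        # A new machine comes up, its partition is registered, and the
        # Queue Manager starts assigning producers to it as well.
        self.partitions.append(address)

q1 = QueueMetadata("Q1", ["machineA:9001", "machineB:9001", "machineC:9001"])
q1.add_partition("machineD:9001")  # scale out: P4 on a new machine
print(q1.partitions)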