Saurav Prateek’s
Partitioning in Distributed Systems
Diving deep into the Partitioning schemes in Distributed Systems along with Rebalancing strategies
Partitioning Schemes - Primary Indexes
This chapter discusses how the data is actually partitioned into
multiple Shards and the Partitioning Schemes responsible for
Sharding the databases.
Sharding is the process of breaking up the data into partitions. This process is also
known as Partitioning. The main idea behind sharding is to scale our systems. We
can assume that each piece of data is located at exactly one shard or partition. So,
every Shard behaves as an independent database of its own.
Suppose we have a key-value datastore and we are planning to shard it. The major
goal is to partition the datastore in such a way that the queries are distributed
evenly across multiple shards. Suppose our system receives X queries
every hour and has 10 shards. Then, ideally, each shard should handle X/10 queries.
This means the system should be able to handle 10 times the load handled by a
single shard.
Sharding by Range
Let's assume we have a datastore consisting of the records of students enrolled in
a CS course. Since a large number of students enroll in the course, we have
partitioned the details across different machines. Every machine stores the
student details falling within a certain range of names.
We can observe that the key ranges are not evenly spaced. For example,
Machine 1 holds names starting with the letters {A, B, C} while Machine 5 holds
names starting with {T, U, V, W, X, Y, Z}. Since the main goal is to distribute the
data evenly across the partitions, the range boundaries are adapted to the data:
the number of students whose names start with {A, B, C} is likely to be roughly
equal to the number of students whose names start with {T, U, V, W, X, Y, Z}.
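As a rough illustration, here is a minimal sketch of how a range-based shard lookup might work. The exact letter boundaries for Machines 2 to 4 are not given above, so the ones below are assumptions purely for illustration.

```python
# Hypothetical letter ranges, mirroring the split described above:
# Machine 1 owns {A, B, C} and Machine 5 owns {T ... Z}; the middle
# machines' boundaries are assumed for this sketch.
SHARD_RANGES = [
    ("Machine 1", "A", "C"),
    ("Machine 2", "D", "G"),
    ("Machine 3", "H", "L"),
    ("Machine 4", "M", "S"),
    ("Machine 5", "T", "Z"),
]

def shard_for_name(name: str) -> str:
    """Return the machine whose letter range covers the student's name."""
    first_letter = name[0].upper()
    for machine, start, end in SHARD_RANGES:
        if start <= first_letter <= end:
            return machine
    raise ValueError(f"No shard covers the key {name!r}")

print(shard_for_name("Alice"))   # Machine 1
print(shard_for_name("Victor"))  # Machine 5
```

A nice property of range partitioning is that range scans stay cheap: all students whose names fall between two keys live on at most a few adjacent shards.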
Let’s take our previous data store of students as an example. This time the student
records have a different Primary Index: we store them in shards on the basis of the
Timestamp at which they take up the course. Say the students registering for the
course on Day 1 are stored in the 1st shard; in this way we get a distribution of one
shard per day. Now suppose that on a certain day there was a discount on the course
and a large number of students signed up on that particular day. In this case, one
shard will handle a huge number of writes on that day while the rest of the shards sit
idle. This is the problem of a Skewed Load, where one shard becomes a Hot-Spot.
Sharding by Hash
We previously saw the problem of Skewed Load when we partitioned the data by
range on the Primary Key. To avoid this we can use a Hash Function to determine
the partition of a given key.
Now the Hash Function will distribute the records evenly and randomly across the
shards. In our previous access pattern we stored the records of all students signing up
on one day on a particular shard. With hashing in the picture, students registering on
the same day will be sent to different shards. Since the registration timestamp is used
as the Primary Index here, its value is passed through the hash function and the
resulting hash value determines the shard the record is sent to. Although the date of
registration is the same, the registration timestamp is different for every write
request, so the hash function generates different hash values. This avoids the
existence of Hot-Spots in our architecture.
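A minimal sketch of this idea, assuming a cluster of 10 shards and using Python's hashlib for a stable hash; the timestamps below are made up for illustration.

```python
import hashlib

NUM_SHARDS = 10  # assumed shard count for this sketch

def shard_for_key(primary_key: str) -> int:
    """Hash the primary index value and map it onto one of the shards."""
    digest = hashlib.md5(primary_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Two students registering on the same day still land on different shards,
# because their full registration timestamps (the primary index) differ.
print(shard_for_key("2024-04-01T09:15:27|alice"))
print(shard_for_key("2024-04-01T18:42:03|bob"))
```

The trade-off is that we lose efficient range queries on the key: adjacent timestamps are now scattered across all the shards.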
Partitioning Schemes - Secondary Indexes
This chapter discusses the Partitioning Schemes in Databases that
involve Secondary Indexes. We will look at the Partitioning Schemes
that deal with Secondary Indexes and also explore the concept of
Local and Global Secondary Indexes.
“A Secondary Index does not identify a record uniquely but rather is a way
of searching for occurrences of a particular value.”
There are two ways by which we partition a data-store with Secondary Indexes.
Sharding by Document
In this partitioning scheme every partition is completely independent. Each
partition maintains its own set of secondary indexes. So whenever we need to write
a record we only need to update the content of the partition/shard that deals with
the Primary Index of the record we are writing.
Let’s take the previous data-model (the data-model we used in Part-1) of Student
details. The Primary Index was the Name of the students. Now, apart from the name
of the students, we also need to search the records on the basis of the Course
taken up by the students and their Graduation Year. Hence our current data-model
has the Name as its Primary Index, and the Course and Graduation Year as
Secondary Indexes.
Since every partition maintains its own secondary indexes, we can also call them
Local Indexes or Local Secondary Indexes. Now our data-store will look like this.
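A minimal sketch of document-partitioned (local) secondary indexes, assuming just two shards split on the Name ranges {A to N} and {O to Z}; the student records are made up. Each shard keeps its own Course and Graduation-Year indexes, so a write touches only one shard, while a read by secondary index has to ask every shard (scatter/gather).

```python
from collections import defaultdict

class Shard:
    """One partition: records keyed by Name, plus its own local secondary indexes."""
    def __init__(self):
        self.records = {}                    # name -> record
        self.by_course = defaultdict(set)    # local secondary index on Course
        self.by_year = defaultdict(set)      # local secondary index on Graduation Year

    def write(self, name, course, year):
        self.records[name] = {"course": course, "year": year}
        self.by_course[course].add(name)
        self.by_year[year].add(name)

shards = [Shard(), Shard()]

def shard_for(name):
    # Assumed range rule: names A-N on shard 0, names O-Z on shard 1.
    return shards[0] if name[0].upper() <= "N" else shards[1]

def write(name, course, year):
    # A write only updates the one shard that owns the primary key.
    shard_for(name).write(name, course, year)

def find_by_course(course):
    # A read on a secondary index must query every shard and merge the results.
    return set().union(*(s.by_course[course] for s in shards))

write("Alice", "Networks", 2024)
write("Victor", "Networks", 2025)
print(find_by_course("Networks"))   # {'Alice', 'Victor'}
```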
Sharding by Term
In the previous Partitioning Scheme every partition/shard maintained its own
Secondary Index, which can also be termed a Local Index. In this approach there is
no concept of a Local Index; instead we deal with a Global Index.
The Global Index or Global Secondary Index for the data-model will look like this.
We can observe that every secondary index term holds the values (Primary Indexes
of records) from all the partitions, hence we call it global. Now, this Global Index
alone can be huge and must itself be partitioned.
We can perform a range partitioning of the Global Index. The first partition can hold
all the secondary index terms which start with the letters {A to N} and the second
partition can hold the terms which start with the letters {O to Z}. After partitioning
the Global Index, our data-model will look like this.
Suppose we need to query all the students who opted for the Networks course. We
can simply look for the partition that holds the term “Networks”, which in our case
is Partition-1. We can then query Partition-1 to get the keys of all the students who
opted for the Networks course.
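A minimal sketch of such a term-partitioned global index, assuming the {A to N} / {O to Z} split described above; the record names are made up. Each term (for example a course name) maps to the primary keys of matching records from all data partitions, and a query only needs to touch the one index partition that owns the term.

```python
# Two index partitions: terms starting with A-N and terms starting with O-Z.
global_index = {"A-N": {}, "O-Z": {}}

def index_partition_for(term: str) -> dict:
    return global_index["A-N"] if term[0].upper() <= "N" else global_index["O-Z"]

def index_record(term: str, primary_key: str):
    # The global index collects primary keys from *all* data partitions.
    index_partition_for(term).setdefault(term, set()).add(primary_key)

def query(term: str) -> set:
    # Only the single index partition owning the term is consulted.
    return index_partition_for(term).get(term, set())

index_record("Networks", "Alice")    # Alice may live on any data partition
index_record("Networks", "Victor")   # Victor may live on another partition
print(query("Networks"))             # {'Alice', 'Victor'}
```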
The Global Secondary Index maintained by the partitions might take some time to
get updated once a new record is added, since the update process is often
asynchronous.
Rebalancing the Partitions - A naive approach
The chapter discusses the concept of Rebalancing in detail and a
Hash-modulo based strategy along with its major drawback.
Over time, a number of things can change in our system:
● What if the size of the database starts growing fast and we need to add more
disks to store the data?
● What if the load on the system increases and we need to add an extra machine
to scale our system?
● What if an existing machine/node crashes due to some unforeseen hardware
issue?
All of the above changes will require the existing data-items to move across the
nodes. This process of moving a data-item from one node to another is called
Rebalancing.
Let’s take a fairly simple example to understand this. Suppose we have a database
holding about 10,000 data items which are partitioned over 100 nodes such that
every node holds about 100 data-items. The current architecture would look like
this.
Scenario 1
Suppose one of the nodes in the cluster above crashes, say the node holding the
data-items with IDs 3601 to 3700. These data-items now need to be shifted to the
other available nodes. We could distribute them over the existing nodes in such a
way that the number of data-items held by every node stays almost even. We are
aiming for an even distribution of data-items across the available nodes. There are
multiple schemes to perform this rebalancing, which we will discuss in the
upcoming sections.
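As a rough sketch of that even redistribution, assuming it was the node holding IDs 3601 to 3700 (call it Node-37) that went down, the orphaned data-items could be handed out round-robin so that no remaining node receives more than one extra item.

```python
# Remaining nodes after the failure (Node-37 is assumed to be the crashed one).
remaining_nodes = [n for n in range(1, 101) if n != 37]
orphaned_items = range(3601, 3701)      # the 100 data-items that must move

# Round-robin assignment keeps the per-node counts nearly even:
# every remaining node receives at most one extra data-item.
reassignment = {
    item: remaining_nodes[i % len(remaining_nodes)]
    for i, item in enumerate(orphaned_items)
}
print(reassignment[3601], reassignment[3650])   # -> 1 51
```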
Scenario 2
Suppose the number of users of our system has grown and we are now planning to
add one more node/machine to our cluster to distribute the incoming queries.
Since our query throughput has increased, we are planning to scale our system. We
had 100 nodes previously and have now added an extra node, Node-101, to the
existing cluster.
A naive way of assigning a data-item to a node is to hash its key and take the result
modulo the number of nodes. For example, with 10 nodes in the cluster, we pass
every key through a hash function and then perform modulo 10 on the resultant
hash value to get the ID of the node to which the key will be directed.
Let’s observe the scenario where the number of nodes in the cluster is decreased
from 10 to 9, and the scenario where it is increased from 10 to 11. In both cases the
value of hash(key) mod N changes for almost every key, so the vast majority of
data-items have to be moved to a different node. This is the major drawback of the
hash-modulo approach.
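The extent of the problem is easy to measure: a minimal sketch that counts how many of a set of synthetic keys end up on a different node when the cluster size moves away from 10.

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"key-{i}" for i in range(10_000)]   # synthetic keys for the experiment

for new_n in (9, 11):
    moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, new_n))
    print(f"10 -> {new_n} nodes: {moved / len(keys):.0%} of keys change node")

# Roughly 90% of the keys map to a different node in either direction,
# so almost the entire dataset would have to be shuffled across the cluster.
```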
Strategies for Rebalancing the Partitions
The chapter discusses Fixed and Dynamic Partitioning as strategies for
Rebalancing, along with their advantages and pitfalls.
In this chapter we will look at some rebalancing strategies which are efficient and
are used by many database services nowadays.
Fixed Partitioning
In the previous chapter we discussed a strategy which was quite inefficient and
involved very frequent transfer of data across nodes. To solve that problem, in this
scheme we split our database into a large number of partitions and then assign
multiple partitions to each node. For this, the number of partitions must be much
larger than the number of existing nodes in the system.
Suppose our database has N nodes, then we can split our database into 10N
partitions and further assign 10 partitions to every single node.
In this configuration the number of partitions remains constant and the assignment
of keys to the partition also remains the same. The only thing that changes is the
assignment of partitions to the nodes.
Let’s take a look at this strategy through an example. Suppose our database initially
had 4 Nodes { N0, N1, N2, N3 } and has been split into 20 partitions { P0, P1, …,
P19 }, which remain fixed throughout. Hence the initial configuration looks
somewhat like this.
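A minimal sketch of this fixed-partitioning configuration, assuming a round-robin initial assignment of the 20 partitions to the 4 nodes. Keys always map to partitions, never directly to nodes, so adding a node only moves whole partitions around.

```python
import hashlib

NUM_PARTITIONS = 20   # fixed for the lifetime of the database

def partition_for_key(key: str) -> int:
    # The key-to-partition mapping is stable and never changes.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Assumed initial layout: partitions spread round-robin over N0..N3,
# so each node owns 5 partitions.
assignment = {p: f"N{p % 4}" for p in range(NUM_PARTITIONS)}

# When a new node N4 joins, it takes over a few whole partitions from the
# existing nodes. Only those partitions move; no key is ever re-hashed.
for p in (0, 5, 10, 15):
    assignment[p] = "N4"

key = "Alice"
print(assignment[partition_for_key(key)])   # node currently serving Alice's partition
```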
As the number of partitions is fixed, the size of each partition grows proportionally
as the total amount of data grows. Hence it is difficult to choose the right number of
partitions up front when the total size of the data in the system tends to be highly
variable. For example, if the total amount of data in the system is approximately
500 GB and the number of partitions decided initially is 100, then each partition
would be roughly 5 GB in size.
● If each partition is very large, rebalancing and recovery from node failures
become slow and expensive.
● If each partition is very small, the large number of partitions incurs too much
overhead.
The best performance is achieved when the size of the partitions is just right,
neither too big nor too small.
Dynamic Partitioning
In the previous scheme of Fixed Partitioning there was the difficulty of choosing the
number of partitions up front, especially for systems whose total amount of data is
highly variable. With Dynamic Partitioning, partitions are created and adjusted as
the data changes: when a partition grows beyond a configured size threshold, it is
split into two partitions so that roughly half of the data ends up on each side of the
split. Similarly, if lots of data is deleted from a partition and it shrinks below the
threshold, it can be merged with an adjacent partition.
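A minimal sketch of the splitting half of this idea, using a hypothetical record-count threshold (a real system would typically split on partition size in bytes). When a partition grows past the threshold, it is split at its median key into two roughly equal halves.

```python
SPLIT_THRESHOLD = 4   # hypothetical maximum number of records per partition

# Each partition covers a half-open key range [lo, hi); None means unbounded.
partitions = [{"lo": None, "hi": None, "records": {}}]

def partition_for(key):
    for p in partitions:
        if (p["lo"] is None or p["lo"] <= key) and (p["hi"] is None or key < p["hi"]):
            return p
    raise ValueError("no partition covers " + key)

def insert(key, record):
    p = partition_for(key)
    p["records"][key] = record
    if len(p["records"]) > SPLIT_THRESHOLD:
        # Split the oversized partition at its median key.
        keys = sorted(p["records"])
        mid = keys[len(keys) // 2]
        low = {k: v for k, v in p["records"].items() if k < mid}
        high = {k: v for k, v in p["records"].items() if k >= mid}
        partitions.remove(p)
        partitions.append({"lo": p["lo"], "hi": mid, "records": low})
        partitions.append({"lo": mid, "hi": p["hi"], "records": high})

for name in ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"]:
    insert(name, {"course": "Networks"})

print([(p["lo"], p["hi"], len(p["records"])) for p in partitions])
# The single initial partition splits once it exceeds 4 records.
```

Merging works the same way in reverse: when a partition shrinks below a lower threshold, its records and key range are folded into an adjacent partition.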