Partitioning Schemes in Databases Part-1 | Primary Indexes

Introduction

Most of us have heard of Sharding or Partitioning of databases. Even if you haven't, this edition covers it from the basics, and we will also explore some interesting partitioning schemes that distributed systems use to partition their huge databases. We will cover this topic across two editions.

Sharding is the process of breaking data up into partitions, which is why it is also known as Partitioning. The main idea behind sharding is to scale our systems. Each piece of data lives in exactly one shard or partition, so every shard behaves as an independent database of its own.

Suppose we have a database holding an enormous amount of data. Obviously we can't store all of it on a single server or machine. What we can do is split the data into smaller chunks, known as Shards or Partitions, and store them on independent machines. Since multiple machines now hold different partitions of the data, each machine can execute the queries for its own partition independently and in parallel. This helps in scaling the query throughput.
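
To make the idea concrete, here is a minimal sketch in Python (mine, not from the article) of a key-value store split across shards. The `shard_for_key` routing rule is a hypothetical placeholder; the sections below discuss the real ways to implement it.

```python
# Minimal sketch of a sharded key-value store. Each shard is modelled
# as an independent in-memory dict, standing in for an independent
# database running on its own machine.

class ShardedStore:
    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def shard_for_key(self, key):
        # Hypothetical placeholder routing rule; range-based and
        # hash-based versions are covered later in this edition.
        return hash(key) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for_key(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for_key(key)].get(key)
```

Each shard serves only its own keys, so traffic for different keys can be handled by different machines in parallel.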

There can also be a scenario where a single query touches data from multiple shards. In that case the process becomes more complex, since we need to query each machine holding a relevant shard and then join the responses received from those machines to build the final result.
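
As a rough illustration (again my own sketch, not the article's), a cross-shard query can be fanned out to all shards in parallel and the partial results joined afterwards; `shards` and `predicate` here are illustrative stand-ins for real shard servers and query logic.

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, predicate):
    """Fan a query out to every shard in parallel and join the results."""
    def query_one(shard):
        # Each shard scans only its own records.
        return [v for v in shard.values() if predicate(v)]

    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(query_one, shards))

    # Join the per-shard responses into the final response.
    return [record for part in partial_results for record in part]
```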

Suppose we have a key-value datastore that we plan to shard. The major goal is to partition the datastore in such a way that queries are distributed evenly across the shards. If our system receives X queries every hour and has 10 shards, then ideally each shard should handle X/10 queries. In other words, the system should be able to handle 10 times the load that a single shard can handle.

But in real life this rarely happens; some skew gets introduced into our system. If we haven't sharded the data efficiently or intelligently, it might happen that the majority of queries are handled by a single shard or a small group of shards. A shard that handles a disproportionate share of the load is called a Hot-Spot.


Sharding by Primary Index

Suppose we have a key-value datastore where we always access a record by its Primary Key. We can shard the datastore on the basis of the Primary Index.

Sharding by Range

Let's assume we have a datastore consisting of the records of students enrolled in a CS course. Since a large number of students enroll in the course, we have partitioned the records across different machines. Every machine stores the details of students whose names fall in a certain alphabetical range.

Every shard holds a slice of the student records. Shard 1 holds the details of students whose names start with A, B, or C. Hence, if we later need to query the details of a student named Alex, we can simply send the request to Machine 1.

We can observe that the key ranges are not evenly spaced. For example, Machine 1 holds names starting with the letters {A, B, C} while Machine 5 holds names starting with {T, U, V, W, X, Y, Z}. This is deliberate: since the main goal is to distribute the data evenly across partitions, the boundaries are chosen so that the number of students whose names start with {A, B, C} is roughly the same as the number whose names start with {T, U, V, W, X, Y, Z}.

Moreover, within each shard we can keep the student records sorted by name. This reduces the lookup time within a shard and also makes range queries much more efficient.
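
Here is a small sketch of both ideas together (my own, with made-up range boundaries): route a name to its shard with `bisect`, and run a range query against the sorted records inside a shard with two binary searches.

```python
import bisect

# Hypothetical exclusive upper bounds for shards 0-3; shard 4 takes the
# rest. So shard 0 covers A-C, shard 1 covers D-G, ..., shard 4 covers T-Z.
RANGE_BOUNDS = ["D", "H", "M", "T"]

def shard_for_name(name):
    # bisect_right finds the first bound greater than the first letter,
    # which is exactly the shard whose range contains the name.
    return bisect.bisect_right(RANGE_BOUNDS, name[0].upper())

def range_query(sorted_shard, lo, hi):
    # Records inside a shard are kept sorted, so a range scan is just
    # two binary searches plus one contiguous slice.
    left = bisect.bisect_left(sorted_shard, lo)
    right = bisect.bisect_right(sorted_shard, hi)
    return sorted_shard[left:right]

shard_0 = ["Alex", "Amy", "Bob", "Carol"]      # hypothetical contents
print(shard_for_name("Alex"))                  # -> 0, i.e. Machine 1
print(range_query(shard_0, "Alex", "Bob"))     # -> ['Alex', 'Amy', 'Bob']
```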

Hotspots in Range-based Sharding

One of the major drawbacks of range-based partitioning is that some access patterns can lead to hotspots.

Let's take our previous datastore of students as an example. This time the student records have a different primary index: we are assigning them to shards on the basis of the timestamp at which they sign up for the course. Say the students registering on Day 1 are stored in the 1st shard, giving us a distribution of one shard per day. Now suppose that on a certain day there was a discount on the course and a large number of students signed up on that particular day. In this case, one shard will handle a huge number of writes on that day while the rest of the shards sit idle.
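
A toy simulation (mine, with made-up numbers) makes the skew visible: with one shard per registration day, every signup from the discount day lands on the same shard.

```python
from collections import Counter
from datetime import date

def shard_for_timestamp(ts, start):
    # Range scheme from the example above: one shard per day.
    return (ts - start).days

start = date(2023, 1, 1)
# Made-up workload: 10 signups per day, except 1000 on the discount
# day (January 3rd).
signups = [date(2023, 1, d) for d in (1, 2, 4, 5) for _ in range(10)]
signups += [date(2023, 1, 3)] * 1000

load = Counter(shard_for_timestamp(ts, start) for ts in signups)
print(load)  # shard 2 takes 1000 writes; every other shard takes 10
```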

Let’s see how Hash based sharding can help in avoiding this issue of Hotspots.

Sharding by Hash

We previously saw the problem of skewed load when we partitioned the data by range on the primary key. To avoid this, we can use a hash function to determine the partition of a given key.

A good hash function distributes the records evenly and pseudo-randomly across the shards. In our previous access pattern, all the records of students signing up on one day were stored on a particular shard. With hashing in the picture, students registering on the same day will be spread across different shards. Since the registration time is used as the primary index here, the index value is passed through the hash function and the resulting hash value determines the target shard. Although the date of registration is the same, the registration timestamp differs for every write request, and hence the hash function generates different values. This avoids the existence of Hot-Spots in our architecture.
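
A minimal sketch of hash-based routing (my own; the article does not prescribe a particular hash function, so this one uses `hashlib.md5` as a stable, process-independent choice):

```python
import hashlib

NUM_SHARDS = 10  # illustrative shard count

def shard_for_key(key, num_shards=NUM_SHARDS):
    # Use a stable hash rather than Python's built-in hash(), which is
    # randomized per process, so a key maps to the same shard on every
    # machine and across restarts.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Same registration date, different timestamps -> likely different shards.
for ts in ("2023-01-03T09:15:01", "2023-01-03T09:15:02", "2023-01-03T09:15:03"):
    print(ts, "->", shard_for_key(ts))
```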

There is one major drawback of distributing records by the hash of the key: we lose the ability to perform efficient range-based queries. Records that were previously stored adjacent to each other are now scattered across the shards. A range query might therefore have to be sent to many shards, and we then need to join the responses from all of them to build the final response.
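
Under hash sharding a range scan degenerates into exactly this scatter-gather. A sketch of what that might look like (assuming dict-like shard stores, as in the earlier sketches):

```python
def range_query_hashed(shards, lo, hi):
    # Hashing destroys key ordering, so any shard may hold keys in
    # [lo, hi]; we must scan them all and merge the results.
    matches = []
    for shard in shards:
        matches.extend(k for k in shard if lo <= k <= hi)
    return sorted(matches)  # restore the ordering the client expects
```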


Concluding

We discussed partitioning schemes that involve only Primary Indexes. We looked at two schemes (range-based and hash-based) for partitioning the data and discussed their advantages and drawbacks in detail. This was Part-1 in the series on Partitioning Schemes in databases. The second part will deal with the scenarios where Secondary Indexes come into the picture along with Primary Indexes. Stay tuned!

Meanwhile, do Like and Share this edition with your peers, and subscribe to this Newsletter so that you get notified when I come up with more content in the future. Share this Newsletter with anyone who might benefit from this content.

Until next time, Dive Deep and Keep Learning!
