Partitioning Schemes in Databases Part-1 | Primary Indexes
Introduction
Most of us have heard of Sharding or Partitioning of databases. Even if you haven't, I will cover it from the basics in this edition, and we will explore some interesting partitioning schemes that distributed systems use to partition their huge databases. We will cover this topic in two editions.
Sharding is the process of breaking up data into partitions, which is why it is also known as Partitioning. The main idea behind sharding is to scale our systems. Each piece of data belongs to exactly one shard or partition, so every shard behaves as an independent database of its own.
Suppose we have a large database holding an enormous amount of data. Obviously we can't store all of it on a single server or machine. What we can do is split the data into smaller chunks, known as Shards or Partitions, and store them on independent machines. Since multiple machines now hold different partitions of the data, they can all execute queries on their own partitions independently and in parallel. This helps in scaling the query throughput.
There can also be a scenario where a single query deals with data from multiple shards. In that case the process becomes more complex, since we need to query the different machines holding those shards and then join the responses received from them to build the final result.
Suppose we have a key-value datastore and we are planning to shard it. The major goal is to partition the datastore in such a way that queries are distributed evenly across the shards. If our system receives X queries every hour and has 10 shards, then ideally each shard should handle X/10 of them. This means the system should be able to handle 10 times the load a single shard can.
But in real life this does not happen; some skewness gets introduced in our system. If we haven't sharded the data efficiently or intelligently, it might happen that the majority of queries are handled by a single shard or a small group of shards. A shard that handles a disproportionate share of the load is called a Hot-Spot.
Sharding by Primary Index
Suppose we have a key-value datastore where we always access a record by its Primary Key. We can shard the datastore on the basis of the Primary Index.
Sharding by Range
Let's assume we have a datastore consisting of the records of students enrolled in a CS course. Since a large number of students enroll in the course, we have partitioned the records across different machines. Every machine stores the details of students whose names fall within a certain range.
Every shard holds some student details. Shard 1 holds the details of students whose names start with A, B, or C. Hence, if we later need to query the details of a student named Alex, we can simply send the request to Machine 1.
Notice that the key ranges are not evenly spaced: Machine 1 holds names starting with the letters {A, B, C} while Machine 5 holds names starting with {T, U, V, W, X, Y, Z}. This is deliberate. Since the main goal is to distribute the data evenly across partitions, the boundaries are chosen so that the number of students whose names start with {A, B, C} is roughly equal to the number whose names start with {T, U, V, W, X, Y, Z}.
Moreover, within each shard we can keep the student records sorted by name. This reduces the lookup time within a shard and also makes range queries much more efficient.
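The routing step above can be sketched in a few lines. This is a minimal illustration, not a production router; the shard boundaries, the `shard_for` helper, and the five-shard layout are all hypothetical, chosen to match the {A, B, C} / {T, U, V, W, X, Y, Z} example.

```python
# Range-based shard routing: map a student's name to the shard
# whose key range contains it. Boundaries below are hypothetical.
import bisect

# First letter covered by each shard: shard 0 covers A-C,
# shard 1 covers D-G, ..., shard 4 covers T-Z.
SHARD_LOWER_BOUNDS = ["A", "D", "H", "M", "T"]

def shard_for(name: str) -> int:
    """Return the index of the shard holding this student's record."""
    # bisect_right finds the first boundary strictly greater than
    # the key, so subtracting 1 gives the shard whose range covers it.
    return bisect.bisect_right(SHARD_LOWER_BOUNDS, name[0].upper()) - 1

print(shard_for("Alex"))   # 0  (A-C range, Machine 1 in the example)
print(shard_for("Tanya"))  # 4  (T-Z range, Machine 5 in the example)
```

Because the lookup only inspects the boundary list, routing stays cheap no matter how many records each shard holds.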
Hotspots in Range based Sharding
One of the major drawbacks of Range based partitioning is that some access patterns can lead to hotspots.
Let's take our previous datastore of students as an example, but this time with a different Primary Index: we shard the records by the Timestamp at which a student takes up the course. Say the students registering on Day-1 are stored in the 1st shard, giving us a distribution of one shard per day. Now suppose the course is discounted on a certain day and a large number of students sign up on that particular day. One shard will then handle a huge number of writes on that day while the rest of the shards sit idle.
Let’s see how Hash based sharding can help in avoiding this issue of Hotspots.
Sharding by Hash
We previously saw the problem of skewed load when we partitioned the data by ranges of the Primary Key. To avoid this, we can use a Hash Function to determine the partition of a given key.
A good hash function will distribute the records evenly and pseudo-randomly across the shards. In our previous access pattern, the records of students signing up on one day all landed on a single shard. With hashing in the picture, students registering on the same day are spread across different shards. Since the registration timestamp is the Primary Index here, its value is passed through the hash function and the resulting hash determines the target shard. The registration date may be the same, but the exact timestamp differs for each write request, so the hash function produces different values. This prevents Hot-Spots from forming in our architecture.
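A minimal sketch of hash-based shard assignment, assuming a fixed shard count of 10. It uses a stable hash (`hashlib.md5`) rather than Python's built-in `hash()`, which is randomized per process and would route the same key to different shards across restarts.

```python
# Hash-based shard assignment: the hash of the primary key
# (a registration timestamp here) picks the shard.
import hashlib

NUM_SHARDS = 10  # hypothetical shard count

def shard_for(key: str) -> int:
    """Map a primary-key value to a shard deterministically."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Registrations on the same day but at different timestamps
# hash to different values, spreading the writes across shards.
for ts in ["2022-05-01T09:15:00", "2022-05-01T09:15:01", "2022-05-01T09:15:02"]:
    print(ts, "->", shard_for(ts))
```

Note the `% NUM_SHARDS` step: real systems often replace this simple modulo with consistent hashing so that changing the shard count does not remap nearly every key.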
There is one major drawback of distributing records by the hash of the key: we lose the ability to perform efficient range-based queries. Records that were previously stored adjacent to each other are now scattered across the shards. A range query might therefore have to be sent to multiple shards, and we then need to join the responses from all of them to build the final result.
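This scatter-gather pattern can be sketched as follows. The in-memory `shards` list and the `range_query` helper are toy stand-ins for real shard servers, used only to show the shape of the fan-out and merge.

```python
# Scatter-gather range query over hash-sharded data: adjacent keys
# live on different shards, so every shard is queried and the
# partial results are merged. Data below is illustrative only.
import heapq

# Toy "shards": each holds (key, value) records sorted by key.
shards = [
    [("alex", 1), ("meera", 4)],
    [("bob", 2), ("tanya", 5)],
    [("carol", 3)],
]

def range_query(lo: str, hi: str):
    """Fetch records with lo <= key <= hi from every shard, then merge."""
    # Scatter: run the sub-query against each shard independently.
    partials = [
        [(k, v) for k, v in shard if lo <= k <= hi]
        for shard in shards
    ]
    # Gather: merge the sorted partial results into one sorted answer.
    return list(heapq.merge(*partials))

print(range_query("a", "c"))  # [('alex', 1), ('bob', 2)]
```

Under range-based sharding the same query would touch only the one or two shards owning that key range; under hashing it must touch all of them, which is exactly the efficiency we give up.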
Concluding
We discussed partitioning schemes that involve only Primary Indexes. We looked at two schemes (Range based and Hash based) to partition the data, and discussed their advantages and drawbacks in detail. This was Part-1 in the series on Partitioning Schemes in databases. The second part will deal with the scenarios where Secondary Indexes come into the picture along with the Primary Indexes. Stay tuned!
Meanwhile, do Like and Share this edition among your peers, and subscribe to this Newsletter so that you get notified when I come up with more content in the future. Share this Newsletter with anyone who might benefit from this content.
Until next time, Dive Deep and Keep Learning!