Data Distribution
Introduction
Shared Disk vs. Shared Nothing
Shared Nothing Challenges
Shared Nothing Distribution Strategies
ClustrixDB Basics
Distribution Concepts Overview
Concepts Example
Representation
Slice
Replica
Consistent Hashing
Slicing
Re-Slicing For Growth
Single Key vs. Independent Index Distribution
Single Key Approach
Independent Index Key Approach
Cache Efficiency
Distribution Key Imbalances
Introduction
Shared Disk vs. Shared Nothing
Distributed database systems fall into two major categories of data storage architectures: (1) shared disk and (2) shared nothing.
Shared disk approaches suffer from several architectural limitations inherent in coordinating access to a single central resource. In
such systems, as the number of nodes in the cluster increases, so does the coordination overhead. While some workloads can scale
well with shared disk (e.g. small working sets dominated by heavy reads), most workloads tend to scale very poorly -- especially
workloads with significant write load.
ClustrixDB uses the shared nothing approach because it's the only known approach that allows for large-scale distributed systems.
Shared Nothing Challenges
In order to build a scalable shared nothing database system, one must solve two fundamental problems:
1. Split a large data set across a number of individual nodes.
2. Evaluate queries efficiently against data that is spread across many nodes.
Shared Nothing Distribution Strategies
Within shared nothing architectures, most databases fall into the following categories:
1. Table-level distribution. The most basic approach, where an entire table is assigned to a node. The database does not split
the table. Such systems cannot handle very large tables.
2. Single-key-per-table distribution (a.k.a. index colocation, or single-key sharding). The most common approach, and the preferred
method for most distributed databases (e.g. MySQL Cluster, MongoDB, etc.). In this approach, the table is split into multiple
chunks using a single key (user id, for example). All indexes associated with the chunk are maintained (co-located) with the
primary key.
3. Independent index distribution. The strategy used by ClustrixDB. In this approach, each index has its own distribution. This is
required to support a broad range of distributed query evaluation plans.
ClustrixDB Basics
ClustrixDB has a fine-grained approach to data distribution. The following summarizes the basic concepts and terminology
used by our system. Notice that unlike many other systems, ClustrixDB uses a per-index distribution strategy.
Distribution Concepts Overview
Representation: Each table contains one or more indexes. Internally, ClustrixDB refers to these indexes as representations of
the table. Each representation has its own distribution key (a.k.a. a partition key or a shard key), meaning that
ClustrixDB uses multiple independent keys to slice the data in one table. This is in contrast to most other
distributed database systems, which use a single key to slice the data in one table.

Each table must have a primary key. If the user does not define a primary key, ClustrixDB will automatically
create a hidden primary key. The base representation contains all of the columns within the table, ordered by
the primary key. Non-base representations contain a subset of the columns within the table.

Slice: ClustrixDB breaks each representation into a collection of logical slices using consistent hashing.
By using consistent hashing, ClustrixDB can split individual slices without having to rehash the entire
representation.

Replica: ClustrixDB maintains multiple copies of data for fault tolerance and availability. There are at least two
physical replicas of each logical slice, stored on separate nodes.
ClustrixDB supports configuring the number of replicas per representation. For example, a user may require
three replicas for the base representation of a table, and only two replicas for the other representations of
that table.
Concepts Example
create table example (
    id bigint primary key,
    col1 integer,
    col2 integer,
    col3 varchar(64),
    key k1 (col2),
    key k2 (col3, col1)
);
Table: example

id  col1  col2  col3
1   16    36    january
2   17    35    february
3   18    34    march
4   19    33    april
5   20    32    may
Representation
ClustrixDB will organize the above schema into three representations: one for the main table (the base representation, organized by
the primary key), and two more representations, each organized by an index key.

In the representations below, the leading columns are the ordering key for each representation. Note that the representations for the
secondary indexes include the primary key columns.
Table: example (base representation)

id  col1  col2  col3
1   16    36    january
2   17    35    february
3   18    34    march
4   19    33    april
5   20    32    may

k1 representation

col2  id
32    5
33    4
34    3
35    2
36    1

k2 representation

col3      col1  id
april     19    4
february  17    2
january   16    1
march     18    3
may       20    5
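To make the mapping concrete, here is a short sketch (in Python, purely illustrative and not ClustrixDB code) that derives the tuples of each representation from the base rows of the example table:

# Illustrative sketch: how the base rows fan out into per-representation tuples.
# Models the example schema above; not ClustrixDB internals.

rows = [
    (1, 16, 36, "january"),
    (2, 17, 35, "february"),
    (3, 18, 34, "march"),
    (4, 19, 33, "april"),
    (5, 20, 32, "may"),
]

# Base representation: all columns, ordered by the primary key (id).
base_rep = sorted(rows, key=lambda r: r[0])

# k1 representation: (col2, id), ordered by col2.
k1_rep = sorted((col2, rid) for rid, _, col2, _ in rows)

# k2 representation: (col3, col1, id), ordered by (col3, col1).
k2_rep = sorted((col3, col1, rid) for rid, col1, _, col3 in rows)

print(k1_rep)  # [(32, 5), (33, 4), (34, 3), (35, 2), (36, 1)]
print(k2_rep)  # [('april', 19, 4), ('february', 17, 2), ('january', 16, 1), ...]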
Slice
ClustrixDB will then split each representation into one or more logical slices. Slices are determined by applying consistent hashing to the representation's distribution key (see Consistent Hashing below).
(Diagram: each representation is split into slices; the base and k2 representations are sliced in the same way as the k1 representation shown below.)

k1 representation

slice 1         slice 2
col2  id        col2  id
32    5         33    4
34    3         36    1
35    2
Replica
To ensure fault tolerance and availability, ClustrixDB maintains multiple copies of data. ClustrixDB uses the following rules to
place replicas (copies of slices) within the cluster:
Each logical slice is implemented by two or more physical replicas. The default protection factor is configurable at a per-
representation level.
Replica placement is based on balance for size, reads, and writes.
No two replicas for the same slice can exist on the same node.
ClustrixDB can create new replicas online, without suspending or blocking writes to the slice.
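The sketch below (Python, illustrative only) models these placement rules as a simple greedy assignment: for each slice, pick the least-loaded nodes while never placing two replicas of the same slice on the same node. The real rebalancer also balances on reads and writes, not just size, so this is a simplified model of the constraints rather than ClustrixDB's actual algorithm.

# Simplified model of the placement rules above: choose the least-loaded nodes
# for each slice's replicas and never reuse a node for the same slice.
# Illustrative only; not ClustrixDB's rebalancer.

def place_replicas(slices, nodes, replicas=2):
    """slices: {slice_id: size_in_bytes}; nodes: list of node names."""
    load = {n: 0 for n in nodes}      # bytes assigned to each node so far
    placement = {}                    # slice_id -> list of nodes holding replicas
    for slice_id, size in sorted(slices.items(), key=lambda kv: -kv[1]):
        chosen = []
        for node in sorted(load, key=load.get):
            if node in chosen:        # rule: no two replicas on the same node
                continue
            chosen.append(node)
            load[node] += size
            if len(chosen) == replicas:
                break
        placement[slice_id] = chosen
    return placement

print(place_replicas({"s1": 700, "s2": 650, "s3": 300}, ["node1", "node2", "node3"]))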
(Diagram: the replicas of each slice are spread across the nodes of the cluster. For example, base representation slice 1 has replica A on one node and replica B on another node, and the same holds for the remaining base representation and k2 slices.)
Consistent Hashing
ClustrixDB uses consistent hashing for data distribution. Consistent hashing allows ClustrixDB to dynamically redistribute data
without having to rehash the entire data set.
Slicing
ClustrixDB hashes each distribution key to a 64-bit number space. We then divide the space into ranges. Each range is then owned
by a specific slice. The table below illustrates how consistent hashing assigns specific keys to specific slices.
Slice  Hash Range  Keys
1      min-100     H, Z, J
2      101-200     A, F
3      201-max     X, K, R
ClustrixDB then assigns slices to available nodes in the cluster to balance data capacity and data access.
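A minimal sketch of the range-ownership lookup just described: keys are hashed into a 64-bit space, and each slice owns a contiguous range of that space. The hash function and range boundaries below are illustrative assumptions, not the ones ClustrixDB uses.

# Minimal sketch of hash-range ownership: a key belongs to the slice whose
# range contains the key's 64-bit hash. Hash and boundaries are illustrative.
import bisect
import hashlib

MAX64 = 2**64 - 1

def hash64(key):
    return int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")

# Each slice owns the range ending at its (inclusive) upper bound.
slice_bounds = [MAX64 // 3, 2 * MAX64 // 3, MAX64]   # slices 1, 2, 3

def slice_for(key):
    return bisect.bisect_left(slice_bounds, hash64(key)) + 1   # 1-based slice id

for k in ["H", "Z", "A", "X"]:
    print(k, "-> slice", slice_for(k))

Splitting a slice only subdivides that slice's range (for example, 101-200 becomes 101-150 and 151-200); keys hashing into the other ranges never move, which is why the entire data set does not need to be rehashed.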
Re-Slicing For Growth
As the data set grows, ClustrixDB will automatically and incrementally re-slice the dataset one or more slices at a time. We currently
base our re-slicing thresholds on data set size. If a slice exceeds a maximum size, the system will automatically break it up into two
or more smaller slices.
For example, imagine that one of our slices grew beyond the preset threshold:
Slice  Hash Range  Keys           Size
1      min-100     H, Z, J        768MB
2      101-200     A, F, U, O, S  1354MB
3      201-max     X, K, R, Y     800MB
Our rebalancer process will automatically detect the above condition and schedule a slice-split operation. The system will break up
the hash range into two new slices:
Slice  Hash Range  Keys        Size
1      min-100     H, Z, J     768MB
4      101-150     A, F        670MB
5      151-200     U, O, S     684MB
3      201-max     X, K, R, Y  800MB
Note that the system does not have to modify slices 1 and 3. Our technique allows for very large data reorganizations to proceed in
small chunks.
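The following sketch (Python, with illustrative assumptions for the threshold and bookkeeping) models that split step: only the oversized slice's hash range is divided, and the other slices keep their ranges untouched.

# Illustrative sketch of a slice split: divide only the oversized slice's
# hash range; leave the other slices alone. Not the actual rebalancer code.

MAX_SLICE_BYTES = 1 * 1024**3   # assumed 1GB threshold, for illustration only

def split_oversized(slices, next_id):
    """slices: {slice_id: (lo, hi, size_bytes)} -> new mapping after splits."""
    out = {}
    for sid, (lo, hi, size) in slices.items():
        if size <= MAX_SLICE_BYTES:
            out[sid] = (lo, hi, size)       # untouched, like slices 1 and 3
            continue
        mid = (lo + hi) // 2
        out[next_id] = (lo, mid, size // 2)                  # e.g. new slice 4
        out[next_id + 1] = (mid + 1, hi, size - size // 2)   # e.g. new slice 5
        next_id += 2
    return out

slices = {1: (0, 100, 768 << 20), 2: (101, 200, 1354 << 20), 3: (201, 300, 800 << 20)}
print(split_oversized(slices, next_id=4))   # slice 2 becomes slices 4 and 5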
Single Key vs. Independent Index Distribution
It's easy to see why table-level distribution provides very limited scalability. Imagine a schema dominated by one or two very large
tables (billions of rows). Adding nodes to the system does not help in such cases since a single node must be able to accommodate
the entire table.
Why does ClustrixDB use independent index distribution rather than a single-key approach? The answer is two-fold:
1. Independent index distribution allows for a much broader range of distributed query plans that scale with cluster node count.
2. Independent index distribution requires strict support within the system to guarantee that indexes stay consistent with each
other and the main table. Many systems do not provide the strict guarantees required to support index consistency.
Let's examine a specific use case to compare and contrast the two approaches. Imagine a bulletin board application where different
topics are grouped by threads, and users are able to post into different topics. Our bulletin board service has become popular, and
we now have billions of thread posts, hundreds of thousands of threads, and millions of users.
Let's also assume that the primary workload for our bulletin board consists of the following two access patterns:
1. Retrieve all of the posts for a particular thread, in post id order.
2. For a specific user, retrieve the last 10 posts by that user.
We could imagine a single large table which contains all of the posts in our application with the following simplified schema:
-- Example schema for the posts table.
create table thread_posts (
    post_id bigint,
    thread_id bigint,
    user_id bigint,
    posted_on timestamp,
    contents text,
    primary key (thread_id, post_id),
    key (user_id, posted_on)
);
-- Use case 1: Retrieve all posts for a particular thread in post id order.
-- desired access path: primary key (thread_id, post_id)
select *
from thread_posts
where thread_id = 314
order by post_id
;
-- Use case 2: For a specific user, retrieve the last 10 posts by that user.
-- desired access path: key (user_id, posted_on)
select *
from thread_posts
where user_id = 546
order by posted_on desc
limit 10
;
Single Key Approach
With the single key approach, we are faced with a dilemma: which key do we choose to distribute the posts table? As you can see from
the table below, we cannot choose a single key which will result in good scalability across both use cases.
Distribution key: thread_id

Use case 1 (posts in a thread): Queries which include the thread_id will perform well. Requests for a specific thread get routed to a
single node within the cluster. When the number of threads and posts increases, we simply add more nodes to the cluster to add
capacity.

Use case 2 (top 10 posts by user): Queries which do not include the thread_id, like the query for the last 10 posts by a specific user,
must be evaluated on every node which contains the thread_posts table. In other words, the system must broadcast the query request
because the relevant posts can reside on any node.

Distribution key: user_id

Use case 1 (posts in a thread): Queries which do not include the user_id result in a broadcast. As with use case 2 under the thread_id
key, we lose system scalability when we have to broadcast.

Use case 2 (top 10 posts by user): Queries which include a user_id get routed to a single node. Each node will contain an ordered set
of posts for a user. The system can scale by avoiding broadcasts.
One possibility with such a system would be to maintain a separate table containing the user_id and posted_on columns, and have the
application manually maintain this index table.

However, that means the application must now issue multiple writes and accept responsibility for data consistency between the two
tables. And what if we need to add more indexes? The approach simply doesn't scale. One of the advantages of a database is
automatic index management.
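To illustrate that burden, the sketch below shows what application-managed index maintenance looks like with single-key sharding: every post insert requires a second write to a hypothetical user_posts index table, and the application owns the consistency between the two. The table name and the DB-API style connection are assumptions for illustration only.

# Sketch of application-managed index maintenance under single-key sharding.
# "user_posts" is a hypothetical, manually maintained index table; conn is any
# DB-API style connection. Illustrative only.

def insert_post(conn, post):
    cur = conn.cursor()
    try:
        # Write 1: the posts table, sharded by thread_id.
        cur.execute(
            "insert into thread_posts (post_id, thread_id, user_id, posted_on, contents)"
            " values (%s, %s, %s, %s, %s)",
            (post["post_id"], post["thread_id"], post["user_id"],
             post["posted_on"], post["contents"]))
        # Write 2: the manually maintained index table, sharded by user_id.
        cur.execute(
            "insert into user_posts (user_id, posted_on, thread_id, post_id)"
            " values (%s, %s, %s, %s)",
            (post["user_id"], post["posted_on"], post["thread_id"], post["post_id"]))
        conn.commit()   # the application, not the database, keeps the two in sync
    except Exception:
        conn.rollback()
        raise

With independent index distribution, the second write and the consistency bookkeeping are handled inside the database, as described in the next section.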
Independent Index Key Approach
ClustrixDB will automatically create independent distributions which satisfy both use cases. The DBA can choose to distribute the
base representation (primary key) by thread_id and the secondary index by user_id. The system will automatically manage both the
table and its secondary indexes with full ACID guarantees.
Cache Efficiency
Unlike other systems which use master-slave pairs for data fault tolerance, ClustrixDB distributes the data in a more fine-grained
manner, as explained in the sections above. This approach allows ClustrixDB to increase cache efficiency by not sending reads to
secondary replicas.
Consider the following example. Assume a cluster of 2 nodes and 2 slices A and B, with secondary copies A' and B'.
Reads allowed from both primary and secondary replicas: each node has to cache the contents of both A and B. Assuming 32GB of
cache per node, the total effective cache of the system is 32GB.

Reads limited to the primary replica only: node 1 is responsible for A only, and node 2 is responsible for B only. Assuming 32GB of
cache per node, the total effective cache footprint becomes 64GB, double that of the opposing model.
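The arithmetic behind this comparison, written out (values taken from the example above):

# Effective cache under the two read policies in the example above.
NODE_CACHE_GB = 32
NODES = 2

# Reads served by any replica: every node ends up caching every slice,
# so the distinct cached data is capped at a single node's cache.
effective_any_replica_gb = NODE_CACHE_GB                 # 32

# Reads served only by the primary replica: each node caches only the
# slices it is primary for, so the caches hold disjoint data.
effective_primary_only_gb = NODE_CACHE_GB * NODES        # 64

print(effective_any_replica_gb, effective_primary_only_gb)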
Distribution Key Imbalances
With some datasets, it's possible for the distribution of keys within a non-unique secondary index to cause a cluster imbalance. In
such cases ClustrixDB automatically detects the distribution imbalance and schedules an online redistribution action.
(Diagram: a non-unique index distributed on its indexed column alone. Slice 1 holds the entries (Jan, 3), (Jan, 8), (Jan, 6), and (Feb, 9), while another slice holds far fewer entries, such as (Nov, 43).)
In the above example, a single value (Jan) represents 50% of all the values in the index. By default, ClustrixDB starts with the
indexed column as the distribution key for the representation. However, since the dataset skews heavily toward one value, slice 1 gets
a disproportionate number of entries.
The rebalancer process will notice that the minimum slice size for the representation is 10MB while the maximum is 30MB. Such a
gap indicates a data imbalance within the dataset, which leads to a lumpy distribution of data within the cluster.
In order to fix the above condition, the rebalancer will schedule a redistribute operation for the representation. It will begin adding
columns from the primary key to the representation's distribution key. Since the primary key must be unique by definition, we know
that a balanced distribution is possible for the dataset, provided we consume enough of the primary key.
For our example, the rebalancer will add the id column to the distribution key. Now that we hash over (col1, id), we get a much
better distribution of data across our slices.
(Diagram: after redistributing on (col1, id), the entries, including the Jan rows, are spread evenly across the slices.)
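A minimal sketch of the idea, under assumed thresholds: detect skew by comparing the smallest and largest slice sizes, then widen the distribution key with primary-key columns so that rows sharing an index value hash independently. The ratio and the hash below are illustrative assumptions, not the rebalancer's actual policy.

# Sketch of skew detection and distribution-key widening. The ratio and the
# hash function are illustrative assumptions, not ClustrixDB's policy.
import hashlib

def is_skewed(slice_sizes_mb, ratio=2.0):
    """Flag a representation whose largest slice dwarfs its smallest."""
    return max(slice_sizes_mb) > ratio * min(slice_sizes_mb)

def dist_hash(row, key_columns):
    """Hash the row's distribution-key columns into a 64-bit value."""
    material = "|".join(str(row[c]) for c in key_columns).encode()
    return int.from_bytes(hashlib.sha1(material).digest()[:8], "big")

rows = [{"col1": "Jan", "id": 3}, {"col1": "Jan", "id": 8},
        {"col1": "Jan", "id": 6}, {"col1": "Feb", "id": 9},
        {"col1": "Nov", "id": 43}]

print(is_skewed([10, 30]))   # True: 30MB max vs 10MB min calls for redistribution

# Distributing on (col1) alone sends every "Jan" row to the same slice.
print({r["id"]: dist_hash(r, ["col1"]) % 4 for r in rows})

# After widening the key to (col1, id), rows sharing col1 hash independently.
print({r["id"]: dist_hash(r, ["col1", "id"]) % 4 for r in rows})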