

Data Distribution
Introduction
Shared Disk vs. Shared Nothing
Shared Nothing Challenges
Shared Nothing Distribution Strategies
ClustrixDB Basics
Distribution Concepts Overview
Concepts Example
Representation
Slice
Replica
Consistent Hashing
Slicing
Re-Slicing For Growth
Single Key vs. Independent Index Distribution
Single Key Approach
Independent Index Key Approach
Cache Efficiency
Distribution Key Imbalances

Introduction

Shared Disk vs. Shared Nothing
Distributed database systems fall into two major categories of data storage architectures: (1) shared disk and (2) shared nothing.

[Diagram: Shared Disk Architecture vs. Shared Nothing Architecture]

Shared disk approaches suffer from several architectural limitations inherent in coordinating access to a single central resource. In
such systems, as the number of nodes in the cluster increases, so does the coordination overhead. While some workloads can scale
well with shared disk (e.g. small working sets dominated by heavy reads), most workloads tend to scale very poorly -- especially
workloads with significant write load.

ClustrixDB uses the shared nothing approach because it is the only known approach that scales to large distributed database systems.


Shared Nothing Challenges
In order to build a scalable shared nothing database system, one must solve two fundamental problems:

1. Split a large data set across a number of individual nodes.
2. Create an evaluation model that can take advantage of the distributed data environment.

This document explains how ClustrixDB distributes data sets across a large number of independent nodes and provides the reasoning behind some of our architectural decisions.

Shared Nothing Distribution Strategies
Within shared nothing architectures, most databases fall into the following categories:

1. Table-level distribution. The most basic approach, where an entire table is assigned to a node. The database does not split
the table. Such systems cannot handle very large tables.
2. Single-key-per-table distribution (a.k.a. index colocation or single-key sharding). The most common approach, and the preferred method for most distributed databases (e.g. MySQL Cluster, MongoDB, etc.). In this approach, the table is split into multiple chunks using a single key (a user id, for example). All indexes associated with a chunk are maintained (co-located) with the primary key.
3. Independent index distribution. The strategy used by ClustrixDB. In this approach, each index has its own distribution.
Required to support a broad range of distributed query evaluation plans. 

ClustrixDB Basics
ClustrixDB has a fine-grained approach to data distribution. The following table summarizes the basic concepts and terminology
used by our system. Notice that unlike many other systems, ClustrixDB uses a per-index distribution strategy.

Distribution Concepts Overview

ClustrixDB Distribution Concepts

Representation
Each table contains one or more indexes. Internally, ClustrixDB refers to these indexes as representations of the table. Each representation has its own distribution key (a.k.a. a partition key or a shard key), meaning that ClustrixDB uses multiple independent keys to slice the data in one table. This is in contrast to most other distributed database systems, which use a single key to slice the data in one table.

Each table must have a primary key. If the user does not define a primary key, ClustrixDB will automatically create a hidden primary key. The base representation contains all of the columns within the table, ordered by the primary key. Non-base representations contain a subset of the columns within the table.

Slice
ClustrixDB breaks each representation into a collection of logical slices using consistent hashing. By using consistent hashing, ClustrixDB can split individual slices without having to rehash the entire representation.

Replica
ClustrixDB maintains multiple copies of data for fault tolerance and availability. There are at least two physical replicas of each logical slice, stored on separate nodes.

ClustrixDB supports configuring the number of replicas per representation. For example, a user may require three replicas for the base representation of a table and only two replicas for the other representations of that table.

Concepts Example

Consider the following example:

create table example (
    id      bigint  primary key,
    col1    integer,
    col2    integer,
    col3    varchar(64),
    key k1 (col2),
    key k2 (col3, col1)
);

We populate our table with the following data:

Table: example

id col1 col2 col3
1  16   36   january
2  17   35   february
3  18   34   march
4  19   33   april
5  20   32   may

Representation

ClustrixDB will organize the above schema into three representations. One for the main table (the base representation, organized by
the primary key), followed by two more representations, each organized by the index keys.

In the diagrams below, the leading columns of each representation form its ordering key. Note that the representations for the secondary indexes include the primary key columns.

Table: example

base representation (ordered by the primary key: id)

id col1 col2 col3
1  16   36   january
2  17   35   february
3  18   34   march
4  19   33   april
5  20   32   may

k1 representation (index on col2)

col2 id
32   5
33   4
34   3
35   2
36   1

k2 representation (index on col3, col1)

col3     col1 id
april    19   4
february 17   2
january  16   1
march    18   3
may      20   5
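
To make the mapping concrete, here is a minimal Python sketch (illustrative only, not ClustrixDB internals) that derives the three representations above from the raw rows; the tuple orderings mirror the keys shown in the tables.

```python
# Illustrative model only (not ClustrixDB internals): build the base, k1, and
# k2 representations of the `example` table from its rows.

rows = [
    {"id": 1, "col1": 16, "col2": 36, "col3": "january"},
    {"id": 2, "col1": 17, "col2": 35, "col3": "february"},
    {"id": 3, "col1": 18, "col2": 34, "col3": "march"},
    {"id": 4, "col1": 19, "col2": 33, "col3": "april"},
    {"id": 5, "col1": 20, "col2": 32, "col3": "may"},
]

# Base representation: every column, ordered by the primary key (id).
base = sorted((r["id"], r["col1"], r["col2"], r["col3"]) for r in rows)

# k1 representation: indexed column plus the primary key, ordered by (col2, id).
k1 = sorted((r["col2"], r["id"]) for r in rows)

# k2 representation: indexed columns plus the primary key, ordered by (col3, col1, id).
k2 = sorted((r["col3"], r["col1"], r["id"]) for r in rows)

print(k2)  # [('april', 19, 4), ('february', 17, 2), ('january', 16, 1), ...]
```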

Slice

ClustrixDB will then split each representation into one or more logical slices. When slicing, ClustrixDB uses the following rules:


- We apply a consistent hashing algorithm on the representation's key.
- We distribute each representation independently. Refer to Single Key vs. Independent Index Distribution below for an in-depth examination of the reasoning behind this design.
- The number of slices can vary between representations of the same table.
- We split slices based on size.
- Users may configure the initial slice count of each representation. By default, each representation starts with one slice per node.

base representation slices

slice 1 (id, col1, col2, col3): (2, 17, 35, february), (4, 19, 33, april)
slice 2 (id, col1, col2, col3): (1, 16, 36, january), (5, 20, 32, may)
slice 3 (id, col1, col2, col3): (3, 18, 34, march)

k1 representation slices

slice 1 (col2, id): (32, 5), (34, 3), (35, 2)
slice 2 (col2, id): (33, 4), (36, 1)

k2 representation slices

slice 1 (col3, col1, id): (april, 19, 4), (march, 18, 3)
slice 2 (col3, col1, id): (february, 17, 2)
slice 3 (col3, col1, id): (january, 16, 1)
slice 4 (col3, col1, id): (may, 20, 5)

Replica

To ensure fault tolerance and availability, ClustrixDB maintains multiple copies of data. ClustrixDB uses the following rules to place replicas (copies of slices) within the cluster:

- Each logical slice is implemented by two or more physical replicas. The default protection factor is configurable at a per-representation level.
- Replica placement is based on balance for size, reads, and writes.
- No two replicas for the same slice can exist on the same node.
- ClustrixDB can create new replicas online, without suspending or blocking writes to the slice.
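
As an illustration of these placement rules, the sketch below shows a hypothetical greedy placer that balances by bytes stored and never co-locates two replicas of the same slice; the real rebalancer also weighs reads and writes, and its actual algorithm is not shown here.

```python
# Hypothetical greedy replica placement (illustrative only): place each slice's
# replicas on the least-loaded nodes that do not already hold that slice.
# The ClustrixDB rebalancer also balances reads and writes; that is not modeled.

def place_replicas(slice_sizes_mb, nodes, replicas=2):
    load = {n: 0 for n in nodes}            # MB assigned to each node so far
    placement = {}                          # slice -> nodes holding a replica
    for slc, size in sorted(slice_sizes_mb.items(), key=lambda kv: -kv[1]):
        chosen = []
        for node in sorted(nodes, key=lambda n: load[n]):
            if node in chosen:              # rule: never two replicas of a slice on one node
                continue
            chosen.append(node)
            load[node] += size
            if len(chosen) == replicas:
                break
        placement[slc] = chosen
    return placement

sizes = {"base s1": 300, "base s2": 280, "base s3": 150,
         "k2 s1": 90, "k2 s2": 60, "k2 s3": 60, "k2 s4": 50}
print(place_replicas(sizes, nodes=["node1", "node2", "node3", "node4"]))
```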

Sample data distribution within a 4-node cluster

node 1
  k2 slice 1 replica A        (col3, col1, id): (april, 19, 4), (march, 18, 3)
  k2 slice 2 replica B        (col3, col1, id): (february, 17, 2)
  base rep slice 3 replica A  (id, col1, col2, col3): (3, 18, 34, march)
  base rep slice 2 replica B  (id, col1, col2, col3): (1, 16, 36, january), (5, 20, 32, may)

node 2
  k2 slice 3 replica A        (col3, col1, id): (january, 16, 1)
  k2 slice 1 replica B        (col3, col1, id): (april, 19, 4), (march, 18, 3)
  base rep slice 1 replica A  (id, col1, col2, col3): (2, 17, 35, february), (4, 19, 33, april)

node 3
  k2 slice 2 replica A        (col3, col1, id): (february, 17, 2)
  k2 slice 4 replica B        (col3, col1, id): (may, 20, 5)
  base rep slice 2 replica A  (id, col1, col2, col3): (1, 16, 36, january), (5, 20, 32, may)
  base rep slice 3 replica B  (id, col1, col2, col3): (3, 18, 34, march)

node 4
  k2 slice 4 replica A        (col3, col1, id): (may, 20, 5)
  k2 slice 3 replica B        (col3, col1, id): (january, 16, 1)
  base rep slice 1 replica B  (id, col1, col2, col3): (2, 17, 35, february), (4, 19, 33, april)

Consistent Hashing
ClustrixDB uses consistent hashing for data distribution. Consistent hashing allows ClustrixDB to dynamically redistribute data
without having to rehash the entire data set.

Slicing
ClustrixDB hashes each distribution key to a 64-bit number space. We then divide the space into ranges. Each range is then owned
by a specific slice. The table below illustrates how consistent hashing assigns specific keys to specific slices. 

Slice  Hash Range  Key Values
1      min-100     H, Z, J
2      101-200     A, F
3      201-max     X, K, R

ClustrixDB then assigns slices to the available nodes in the cluster, balancing for data capacity and data access.
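
A minimal sketch of this range-based scheme, assuming a 64-bit hash space split into contiguous ranges; the hash function and the range boundaries here are placeholders, not ClustrixDB's.

```python
import bisect
import hashlib

# Range-based consistent hashing, as a placeholder model: hash each distribution
# key into a 64-bit space, then find the slice whose range owns that point.
# The hash function and range boundaries are illustrative, not ClustrixDB's.

MAX64 = 2**64 - 1

def h64(key):
    """Map a key to a 64-bit integer (placeholder hash)."""
    return int.from_bytes(hashlib.sha1(repr(key).encode()).digest()[:8], "big")

# Each slice owns a contiguous range; we store each range's inclusive upper bound.
upper_bounds = [MAX64 // 3, 2 * MAX64 // 3, MAX64]   # slices 1, 2, 3

def slice_for(key):
    return bisect.bisect_left(upper_bounds, h64(key)) + 1   # 1-based slice id

for key in ["H", "Z", "J", "A", "F", "X"]:
    print(key, "-> slice", slice_for(key))
```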


Re-Slicing For Growth
As the data set grows, ClustrixDB will automatically and incrementally re-slice the dataset one or more slices at a time. We currently
base our re-slicing thresholds on data set size. If a slice exceeds a maximum size, the system will automatically break it up into two
or more smaller slices. 

For example, imagine that one of our slices grew beyond the preset threshold:

Slice  Hash Range  Key Values     Size
1      min-100     H, Z, J        768MB
2      101-200     A, F, U, O, S  1354MB (too large)
3      201-max     X, K, R, Y     800MB

Our rebalancer process will automatically detect the above condition and schedule a slice-split operation. The system will break up
the hash range into two new slices:

Slice  Hash Range  Key Values     Size
1      min-100     H, Z, J        768MB
2      101-200     A, F, U, O, S  1354MB (too large; replaced by slices 4 and 5)
4      101-150     A, F           670MB
5      151-200     U, O, S        684MB
3      201-max     X, K, R, Y     800MB

Note that the system does not have to modify slices 1 and 3. Our technique allows very large data reorganizations to proceed in small chunks.
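
The incremental split can be sketched roughly as follows; the threshold and the midpoint split are assumptions for illustration, not the rebalancer's actual policy.

```python
# Illustrative slice split (not the rebalancer's actual policy): when a slice
# exceeds a size threshold, cut its hash range roughly in half and rehash only
# that slice's rows; every other slice keeps its range and data untouched.

SPLIT_THRESHOLD_MB = 1024

def maybe_split(slices):
    """slices: list of dicts with 'lo', 'hi', 'size_mb' keys."""
    out = []
    for s in slices:
        if s["size_mb"] > SPLIT_THRESHOLD_MB:
            mid = (s["lo"] + s["hi"]) // 2
            out.append({"lo": s["lo"], "hi": mid, "size_mb": s["size_mb"] // 2})
            out.append({"lo": mid + 1, "hi": s["hi"], "size_mb": s["size_mb"] - s["size_mb"] // 2})
        else:
            out.append(s)
    return out

slices = [{"lo": 0,   "hi": 100, "size_mb": 768},
          {"lo": 101, "hi": 200, "size_mb": 1354},   # too large, will be split
          {"lo": 201, "hi": 300, "size_mb": 800}]
print(maybe_split(slices))
```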

Single Key vs. Independent Index Distribution
It's easy to see why table-level distribution provides very limited scalability. Imagine a schema dominated by one or two very large
tables (billions of rows). Adding nodes to the system does not help in such cases since a single node must be able to accommodate
the entire table.  
Why does ClustrixDB use independent index distribution rather than a single-key approach? The answer is two-fold:
1. Independent index distribution allows for a much broader range of distributed query plans that scale with cluster node count.
2. Independent index distribution requires strict guarantees within the system to keep indexes consistent with each other and with the main table. Many systems do not provide the guarantees required to support index consistency.

Let's examine a specific use case to compare and contrast the two approaches. Imagine a bulletin board application where different
topics are grouped by threads, and users are able to post into different topics. Our bulletin board service has become popular, and
we now have billions of thread posts, hundreds of thousands of threads, and millions of users.


Let's also assume that the primary workload for our bulletin board consists of the following two access patterns:

1. Retrieve all posts for a particular thread, in post id order.
2. For a specific user, retrieve the last 10 posts by that user.

We could imagine a single large table which contains all of the posts in our application with the following simplified schema:

-- Example schema for the posts table.
 
create table thread_posts (
    post_id     bigint,
    thread_id   bigint,
    user_id     bigint,
    posted_on   timestamp,
    contents    text,
    primary key (thread_id, post_id),
    key (user_id, posted_on)
);
 
-- Use case 1: Retrieve all posts for a particular thread in post id order.
-- desired access path: primary key (thread_id, post_id)
 
select * 
 from thread_posts 
where thread_id = 314
order by post_id
;
 
-- Use case 2: For a specific user, retrieve the last 10 posts by that user.
-- desired access path: key (user_id, posted_on)
 
select *
  from thread_posts
 where user_id = 546
 order by posted_on desc
 limit 10
;

Single Key Approach
With the single key approach, we are faced with a dilemma: which key do we choose to distribute the posts table by? As you can see from the table below, we cannot choose a single key that results in good scalability across both use cases.

Distribution key: thread_id

Use case 1 (posts in a thread): Queries which include the thread_id will perform well. Requests for a specific thread get routed to a single node within the cluster. When the number of threads and posts increases, we simply add more nodes to the cluster to add capacity.

Use case 2 (top 10 posts by user): Queries which do not include the thread_id, such as the query for the last 10 posts by a specific user, must be evaluated on every node which contains the thread_posts table. In other words, the system must broadcast the query request because the relevant posts can reside on any node.

Distribution key: user_id

Use case 1 (posts in a thread): Queries which do not include the user_id result in a broadcast. As with use case 2 under the thread_id key, we lose system scalability when we have to broadcast.

Use case 2 (top 10 posts by user): Queries which include a user_id get routed to a single node. Each node will contain an ordered set of posts for a user. The system can scale by avoiding broadcasts.

One possibility with such a system would be to maintain a separate table keyed by user_id and posted_on, and have the application manually maintain this index table.

However, that means the application must now issue multiple writes and accept responsibility for data consistency between the two tables. And what if we need to add more indexes? The approach simply doesn't scale; one of the advantages of a database is automatic index management.

Independent Index Key Approach
ClustrixDB will automatically create independent distributions which satisfy both use cases. The DBA can have the base representation (primary key) distributed by thread_id and the secondary index distributed by user_id. The system will automatically manage both the table and its secondary indexes with full ACID guarantees.
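
To see why this matters for routing, the hypothetical sketch below models a router that picks whichever representation's distribution key is bound by the query, so both use cases touch a single slice instead of broadcasting; the names and the hash are illustrative, not ClustrixDB's.

```python
# Hypothetical router (illustrative names and hash): under independent index
# distribution, each representation has its own distribution key, so a query
# that binds any representation's key routes to one slice instead of broadcasting.

REPRESENTATIONS = {
    "base (thread_id, post_id)": ("thread_id",),   # base rep distributed by thread_id
    "index (user_id, posted_on)": ("user_id",),    # secondary rep distributed by user_id
}

def route(bound_columns, n_slices=12):
    """Return (representation, slice) when some rep's key is fully bound, else broadcast."""
    for rep, dist_key in REPRESENTATIONS.items():
        if all(col in bound_columns for col in dist_key):
            point = hash(tuple(bound_columns[col] for col in dist_key))
            return rep, point % n_slices
    return "broadcast to all nodes", None

print(route({"thread_id": 314}))   # use case 1: lands on one slice of the base rep
print(route({"user_id": 546}))     # use case 2: lands on one slice of the user_id index
```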

For a more detailed explanation, consult our Evaluation Model section.

Cache Efficiency
Unlike other systems, which use master-slave pairs for data fault tolerance, ClustrixDB distributes data in a more fine-grained manner, as explained in the sections above. This approach allows ClustrixDB to increase cache efficiency by not sending reads to secondary replicas.

Consider the following example. Assume a cluster of 2 nodes and 2 slices A and B, with secondary copies A' and B'. 

Read from both copies (Node 1: A, B'; Node 2: B, A')

If we allow reads from both primary and secondary replicas, then each node will have to cache the contents of both A and B. Assuming 32GB of cache per node, the total effective cache of the system is 32GB.

Read from primary copy only (Node 1: A, B'; Node 2: B, A')

By limiting reads to the primary replica only, we make node 1 responsible for A only and node 2 responsible for B only. Assuming 32GB of cache per node, the total effective cache footprint becomes 64GB, double that of the opposing model.
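
The arithmetic behind this comparison, as a rough model (a sketch, not a measurement of ClustrixDB):

```python
# Rough model of effective cache: reading from every replica means each node
# eventually caches the whole working set; reading only from primary replicas
# lets each node cache a disjoint portion, so capacity grows with node count.

def effective_cache_gb(nodes, cache_gb_per_node, primary_only):
    return nodes * cache_gb_per_node if primary_only else cache_gb_per_node

print(effective_cache_gb(2, 32, primary_only=False))  # 32 (reads hit both replicas)
print(effective_cache_gb(2, 32, primary_only=True))   # 64 (reads hit primaries only)
```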

Distribution Key Imbalances
With some datasets, it's possible for the distribution of keys within a non-unique secondary index to cause a cluster imbalance. In
such cases ClustrixDB automatically detects the distribution imbalance and schedules an online redistribution action. 

Example of secondary key imbalance

Node 1, slice 1 (col1, id): (Jan, 1), (Jan, 5), (Jan, 3), (Jan, 8), (Jan, 6), (Feb, 9)
Node 2, slice 2 (col1, id): (Mar, 4), (Apr, 2), (Nov, 43)
Node 3, slice 3 (col1, id): (Sep, 10), (Dec, 11)

In the above example, we see that a single value (Jan) represents roughly 50% of all the values in the index. By default, ClustrixDB starts with the indexed column as the distribution key for the representation. However, since the dataset skews heavily toward one value, slice 1 gets a disproportionate number of entries.

The rebalancer process will notice that the minimum slice size per unit of hash range is 10MB while the maximum is 30MB. Such a spread indicates a data imbalance within the dataset, which leads to a lumpy distribution of data within the cluster.

slice  hash range  slice size  slice size / hash range
1      min-100     100MB       10MB
2      101-200     300MB       30MB
3      201-max     150MB       15MB

In order to fix the above condition, the rebalancer will schedule a redistribute operation for the representation. It will begin adding columns from the primary key to the representation's distribution key. Since the primary key is unique by definition, we know that a balanced distribution is possible for the dataset, provided we include enough of the primary key.

For our example, the rebalancer will add the id column to the distribution key. Now that we hash over (col1, id), we get a much
better distribution of data across our slices.
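
As a hedged sketch of the detection step, the snippet below flags a representation when the spread in slice size per unit of hash range grows too wide and then extends the distribution key; the ratio threshold and units are illustrative, not ClustrixDB's actual values.

```python
# Illustrative imbalance check (threshold and units are placeholders): compare
# slice size per unit of hash range; if the spread is too wide, extend the
# distribution key with primary-key columns and redistribute the representation.

IMBALANCE_RATIO = 2.0

def is_imbalanced(slices):
    """slices: list of (size_mb, hash_range_units) tuples."""
    densities = [size / units for size, units in slices]
    return max(densities) / min(densities) > IMBALANCE_RATIO

slices = [(100, 10), (300, 10), (150, 10)]   # densities: 10, 30, 15 MB per unit
if is_imbalanced(slices):
    # e.g. extend the key from (col1) to (col1, id), then rehash this representation
    print("schedule redistribute: extend distribution key with primary-key columns")
```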

Secondary key imbalance fixed

Node 1, slice 1 (col1, id): (Jan, 1), (Jan, 5), (Jan, 6), (Feb, 9)
Node 2, slice 2 (col1, id): (Mar, 4), (Apr, 2), (Jan, 3), (Nov, 43)
Node 3, slice 3 (col1, id): (Sep, 10), (Dec, 11), (Jan, 8)

