Data Distribution
Introduction
Shared Disk vs. Shared Nothing
Shared Nothing Challenges
Shared Nothing Distribution Strategies
ClustrixDB Basics
Distribution Concepts Overview
Concepts Example
Representation
Slice
Replica
Consistent Hashing
Slicing
Re-Slicing For Growth
Single Key vs. Independent Index Distribution
Single Key Approach
Independent Index Key Approach
Cache Efficiency
Distribution Key Imbalances
Introduction
Shared Disk vs. Shared Nothing
Distributed database systems fall into two major categories of data storage architectures: (1) shared disk and (2) shared nothing.
Shared disk approaches suffer from several architectural limitations inherent in coordinating access to a single central resource. In
such systems, as the number of nodes in the cluster increases, so does the coordination overhead. While some workloads can scale
well with shared disk (e.g. small working sets dominated by heavy reads), most workloads tend to scale very poorly -- especially
workloads with significant write load.
ClustrixDB uses the shared nothing approach because it's the only known approach that allows for large-scale distributed systems.
Shared Nothing Challenges
In order to build a scalable shared nothing database system, one must solve two fundamental problems:
1. Split a large data set across a number of individual nodes.
2. Evaluate queries efficiently against data that is spread across many nodes.
Shared Nothing Distribution Strategies
Within shared nothing architectures, most databases fall into the following categories:
1. Table-level distribution. The most basic approach, where an entire table is assigned to a node. The database does not split
the table. Such systems cannot handle very large tables.
2. Single-key-per-table distribution (a.k.a. index colocation, or single-key sharding). The most common approach, and the preferred
method for most distributed databases (e.g. MySQL Cluster, MongoDB, etc.). In this approach, the table is split into multiple
chunks using a single key (user id, for example). All indexes associated with the chunk are maintained (co-located) with the
primary key.
3. Independent index distribution. The strategy used by ClustrixDB. In this approach, each index has its own distribution. This is
required to support a broad range of distributed query evaluation plans.
ClustrixDB Basics
ClustrixDB has a fine-grained approach to data distribution. The following summarizes the basic concepts and terminology
used by our system. Notice that unlike many other systems, ClustrixDB uses a per-index distribution strategy.
Distribution Concepts Overview
Representation: Each table contains one or more indexes. Internally, ClustrixDB refers to these indexes as representations of
the table. Each representation has its own distribution key (a.k.a. a partition key or a shard key), meaning that
ClustrixDB uses multiple independent keys to slice the data in one table. This is in contrast to most other
distributed database systems, which use a single key to slice the data in one table.

Each table must have a primary key. If the user does not define a primary key, ClustrixDB will automatically
create a hidden primary key. The base representation contains all of the columns within the table, ordered by
the primary key. Non-base representations contain a subset of the columns within the table.

Slice: ClustrixDB breaks each representation into a collection of logical slices using consistent hashing.
By using consistent hashing, ClustrixDB can split individual slices without having to rehash the entire
representation.

Replica: ClustrixDB maintains multiple copies of data for fault tolerance and availability. There are at least two
physical replicas of each logical slice, stored on separate nodes.
ClustrixDB supports configuring the number of replicas per representation. For example, a user may require
three replicas for the base representation of a table, and only two replicas for the other representations of
that table.
Concepts Example
create table example (
    id bigint primary key,
    col1 integer,
    col2 integer,
    col3 varchar(64),
    key k1 (col2),
    key k2 (col3, col1)
);
Table: example

id  col1  col2  col3
1   16    36    january
2   17    35    february
3   18    34    march
4   19    33    april
5   20    32    may
Representation
ClustrixDB will organize the above schema into three representations: one for the main table (the base representation, organized by
the primary key), and two more representations, each organized by an index key.

In the representations below, the leading columns are the ordering key for each representation. Note that the representations for the
secondary indexes include the primary key columns.
Table: example (base representation)

id  col1  col2  col3
1   16    36    january
2   17    35    february
3   18    34    march
4   19    33    april
5   20    32    may

k1 representation

col2  id
32    5
33    4
34    3
35    2
36    1

k2 representation

col3      col1  id
april     19    4
february  17    2
january   16    1
march     18    3
may       20    5
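To make the mapping concrete, here is a short sketch (in Python, purely illustrative and not ClustrixDB code) that derives the tuples of each representation from the base rows of the example table:

# Illustrative sketch: how the base rows fan out into per-representation tuples.
# Models the example schema above; not ClustrixDB internals.

rows = [
    (1, 16, 36, "january"),
    (2, 17, 35, "february"),
    (3, 18, 34, "march"),
    (4, 19, 33, "april"),
    (5, 20, 32, "may"),
]

# Base representation: all columns, ordered by the primary key (id).
base_rep = sorted(rows, key=lambda r: r[0])

# k1 representation: (col2, id), ordered by col2.
k1_rep = sorted((col2, rid) for rid, _, col2, _ in rows)

# k2 representation: (col3, col1, id), ordered by (col3, col1).
k2_rep = sorted((col3, col1, rid) for rid, col1, _, col3 in rows)

print(k1_rep)  # [(32, 5), (33, 4), (34, 3), (35, 2), (36, 1)]
print(k2_rep)  # [('april', 19, 4), ('february', 17, 2), ('january', 16, 1), ...]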
Slice
ClustrixDB will then split each representation into one or more logical slices. Slices are determined by applying consistent hashing to the representation's distribution key (see Consistent Hashing below).
(Diagram: each representation is split into slices; the base and k2 representations are sliced in the same way as the k1 representation shown below.)

k1 representation

slice 1         slice 2
col2  id        col2  id
32    5         33    4
34    3         36    1
35    2
Replica
To ensure fault tolerance and availability, ClustrixDB maintains multiple copies of data. ClustrixDB uses the following rules to
place replicas (copies of slices) within the cluster:
Each logical slice is implemented by two or more physical replicas. The default protection factor is configurable at a per-
representation level.
Replica placement is based on balance for size, reads, and writes.
No two replicas for the same slice can exist on the same node.
ClustrixDB can create new replicas online, without suspending or blocking writes to the slice.
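The sketch below (Python, illustrative only) models these placement rules as a simple greedy assignment: for each slice, pick the least-loaded nodes while never placing two replicas of the same slice on the same node. The real rebalancer also balances on reads and writes, not just size, so this is a simplified model of the constraints rather than ClustrixDB's actual algorithm.

# Simplified model of the placement rules above: choose the least-loaded nodes
# for each slice's replicas and never reuse a node for the same slice.
# Illustrative only; not ClustrixDB's rebalancer.

def place_replicas(slices, nodes, replicas=2):
    """slices: {slice_id: size_in_bytes}; nodes: list of node names."""
    load = {n: 0 for n in nodes}      # bytes assigned to each node so far
    placement = {}                    # slice_id -> list of nodes holding replicas
    for slice_id, size in sorted(slices.items(), key=lambda kv: -kv[1]):
        chosen = []
        for node in sorted(load, key=load.get):
            if node in chosen:        # rule: no two replicas on the same node
                continue
            chosen.append(node)
            load[node] += size
            if len(chosen) == replicas:
                break
        placement[slice_id] = chosen
    return placement

print(place_replicas({"s1": 700, "s2": 650, "s3": 300}, ["node1", "node2", "node3"]))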
(Diagram: the replicas of each slice are spread across the nodes of the cluster. For example, base representation slice 1 has replica A on one node and replica B on another node, and the same holds for the remaining base representation and k2 slices.)
Consistent Hashing
ClustrixDB uses consistent hashing for data distribution. Consistent hashing allows ClustrixDB to dynamically redistribute data
without having to rehash the entire data set.
Slicing
ClustrixDB hashes each distribution key to a 64-bit number space. We then divide the space into ranges. Each range is then owned
by a specific slice. The table below illustrates how consistent hashing assigns specific keys to specific slices.
Slice  Hash Range  Keys
1      min-100     H, Z, J
2      101-200     A, F
3      201-max     X, K, R
ClustrixDB then assigns slices to available nodes in the cluster to balance data capacity and data access.
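A minimal sketch of the range-ownership lookup just described: keys are hashed into a 64-bit space, and each slice owns a contiguous range of that space. The hash function and range boundaries below are illustrative assumptions, not the ones ClustrixDB uses.

# Minimal sketch of hash-range ownership: a key belongs to the slice whose
# range contains the key's 64-bit hash. Hash and boundaries are illustrative.
import bisect
import hashlib

MAX64 = 2**64 - 1

def hash64(key):
    return int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")

# Each slice owns the range ending at its (inclusive) upper bound.
slice_bounds = [MAX64 // 3, 2 * MAX64 // 3, MAX64]   # slices 1, 2, 3

def slice_for(key):
    return bisect.bisect_left(slice_bounds, hash64(key)) + 1   # 1-based slice id

for k in ["H", "Z", "A", "X"]:
    print(k, "-> slice", slice_for(k))

Splitting a slice only subdivides that slice's range (for example, 101-200 becomes 101-150 and 151-200); keys hashing into the other ranges never move, which is why the entire data set does not need to be rehashed.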
Re-Slicing For Growth
As the data set grows, ClustrixDB will automatically and incrementally re-slice the dataset one or more slices at a time. We currently
base our re-slicing thresholds on data set size. If a slice exceeds a maximum size, the system will automatically break it up into two
or more smaller slices.
For example, imagine that one of our slices grew beyond the preset threshold:
Slice  Hash Range  Keys           Size
1      min-100     H, Z, J        768MB
2      101-200     A, F, U, O, S  1354MB
3      201-max     X, K, R, Y     800MB
Our rebalancer process will automatically detect the above condition and schedule a slice-split operation. The system will break up
the hash range into two new slices:
Slice  Hash Range  Keys        Size
1      min-100     H, Z, J     768MB
4      101-150     A, F        670MB
5      151-200     U, O, S     684MB
3      201-max     X, K, R, Y  800MB
Note that the system does not have to modify slices 1 and 3. Our technique allows for very large data reorganizations to proceed in
small chunks.
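The following sketch (Python, with illustrative assumptions for the threshold and bookkeeping) models that split step: only the oversized slice's hash range is divided, and the other slices keep their ranges untouched.

# Illustrative sketch of a slice split: divide only the oversized slice's
# hash range; leave the other slices alone. Not the actual rebalancer code.

MAX_SLICE_BYTES = 1 * 1024**3   # assumed 1GB threshold, for illustration only

def split_oversized(slices, next_id):
    """slices: {slice_id: (lo, hi, size_bytes)} -> new mapping after splits."""
    out = {}
    for sid, (lo, hi, size) in slices.items():
        if size <= MAX_SLICE_BYTES:
            out[sid] = (lo, hi, size)       # untouched, like slices 1 and 3
            continue
        mid = (lo + hi) // 2
        out[next_id] = (lo, mid, size // 2)                  # e.g. new slice 4
        out[next_id + 1] = (mid + 1, hi, size - size // 2)   # e.g. new slice 5
        next_id += 2
    return out

slices = {1: (0, 100, 768 << 20), 2: (101, 200, 1354 << 20), 3: (201, 300, 800 << 20)}
print(split_oversized(slices, next_id=4))   # slice 2 becomes slices 4 and 5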
Single Key vs. Independent Index Distribution
It's easy to see why table-level distribution provides very limited scalability. Imagine a schema dominated by one or two very large
tables (billions of rows). Adding nodes to the system does not help in such cases since a single node must be able to accommodate
the entire table.
Why does ClustrixDB use independent index distribution rather than a single-key approach? The answer is two-fold:
1. Independent index distribution allows for a much broader range of distributed query plans that scale with cluster node count.
2. Independent index distribution requires strict support within the system to guarantee that indexes stay consistent with each
other and the main table. Many systems do not provide the strict guarantees required to support index consistency.
Let's examine a specific use case to compare and contrast the two approaches. Imagine a bulletin board application where different
topics are grouped by threads, and users are able to post into different topics. Our bulletin board service has become popular, and
we now have billions of thread posts, hundreds of thousands of threads, and millions of users.
Let's also assume that the primary workload for our bulletin board consists of the following two access patterns:
1. Retrieve all of the posts for a particular thread, in post id order.
2. For a specific user, retrieve the last 10 posts by that user.
We could imagine a single large table which contains all of the posts in our application with the following simplified schema:
-- Example schema for the posts table.
create table thread_posts (
    post_id bigint,
    thread_id bigint,
    user_id bigint,
    posted_on timestamp,
    contents text,
    primary key (thread_id, post_id),
    key (user_id, posted_on)
);
-- Use case 1: Retrieve all posts for a particular thread in post id order.
-- desired access path: primary key (thread_id, post_id)
select *
from thread_posts
where thread_id = 314
order by post_id
;
-- Use case 2: For a specific user, retrieve the last 10 posts by that user.
-- desired access path: key (user_id, posted_on)
select *
from thread_posts
where user_id = 546
order by posted_on desc
limit 10
;
Single Key Approach
With the single key approach, we are faced with a dilemma: which key do we choose to distribute the posts table? As you can see from
the table below, we cannot choose a single key which will result in good scalability across both use cases.
Distribution key: thread_id

Use case 1 (posts in a thread): Queries which include the thread_id will perform well. Requests for a specific thread get routed to a
single node within the cluster. When the number of threads and posts increases, we simply add more nodes to the cluster to add
capacity.

Use case 2 (top 10 posts by user): Queries which do not include the thread_id, like the query for the last 10 posts by a specific user,
must be evaluated on every node which contains the thread_posts table. In other words, the system must broadcast the query request
because the relevant posts can reside on any node.

Distribution key: user_id

Use case 1 (posts in a thread): Queries which do not include the user_id result in a broadcast. As with use case 2 under the thread_id
key, we lose system scalability when we have to broadcast.

Use case 2 (top 10 posts by user): Queries which include a user_id get routed to a single node. Each node will contain an ordered set
of posts for a user. The system can scale by avoiding broadcasts.
One possibility with such a system would be to maintain a separate table containing the user_id and posted_on columns, and have the
application manually maintain this index table.

However, that means the application must now issue multiple writes and accept responsibility for data consistency between the two
tables. And what if we need to add more indexes? The approach simply doesn't scale. One of the advantages of a database is
automatic index management.
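To illustrate that burden, the sketch below shows what application-managed index maintenance looks like with single-key sharding: every post insert requires a second write to a hypothetical user_posts index table, and the application owns the consistency between the two. The table name and the DB-API style connection are assumptions for illustration only.

# Sketch of application-managed index maintenance under single-key sharding.
# "user_posts" is a hypothetical, manually maintained index table; conn is any
# DB-API style connection. Illustrative only.

def insert_post(conn, post):
    cur = conn.cursor()
    try:
        # Write 1: the posts table, sharded by thread_id.
        cur.execute(
            "insert into thread_posts (post_id, thread_id, user_id, posted_on, contents)"
            " values (%s, %s, %s, %s, %s)",
            (post["post_id"], post["thread_id"], post["user_id"],
             post["posted_on"], post["contents"]))
        # Write 2: the manually maintained index table, sharded by user_id.
        cur.execute(
            "insert into user_posts (user_id, posted_on, thread_id, post_id)"
            " values (%s, %s, %s, %s)",
            (post["user_id"], post["posted_on"], post["thread_id"], post["post_id"]))
        conn.commit()   # the application, not the database, keeps the two in sync
    except Exception:
        conn.rollback()
        raise

With independent index distribution, the second write and the consistency bookkeeping are handled inside the database, as described in the next section.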
Independent Index Key Approach
ClustrixDB will automatically create independent distributions which satisfy both use cases. The DBA can choose to distribute the
base representation (primary key) by thread_id and the secondary index by user_id. The system will automatically manage both the
table and its secondary indexes with full ACID guarantees.
Cache Efficiency
Unlike other systems which use master-slave pairs for data fault tolerance, ClustrixDB distributes the data in a more fine-grained
manner, as explained in the sections above. This approach allows ClustrixDB to increase cache efficiency by not sending reads to
secondary replicas.
Consider the following example. Assume a cluster of 2 nodes and 2 slices A and B, with secondary copies A' and B'.
Reads allowed from both primary and secondary replicas: each node has to cache the contents of both A and B. Assuming 32GB of
cache per node, the total effective cache of the system is 32GB.

Reads limited to the primary replica only: node 1 is responsible for A only, and node 2 is responsible for B only. Assuming 32GB of
cache per node, the total effective cache footprint becomes 64GB, double that of the opposing model.
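The arithmetic behind this comparison, written out (values taken from the example above):

# Effective cache under the two read policies in the example above.
NODE_CACHE_GB = 32
NODES = 2

# Reads served by any replica: every node ends up caching every slice,
# so the distinct cached data is capped at a single node's cache.
effective_any_replica_gb = NODE_CACHE_GB                 # 32

# Reads served only by the primary replica: each node caches only the
# slices it is primary for, so the caches hold disjoint data.
effective_primary_only_gb = NODE_CACHE_GB * NODES        # 64

print(effective_any_replica_gb, effective_primary_only_gb)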
Distribution Key Imbalances
With some datasets, it's possible for the distribution of keys within a non-unique secondary index to cause a cluster imbalance. In
such cases ClustrixDB automatically detects the distribution imbalance and schedules an online redistribution action.
(Diagram: a non-unique index distributed on its indexed column alone. Slice 1 holds the entries (Jan, 3), (Jan, 8), (Jan, 6), and (Feb, 9), while another slice holds far fewer entries, such as (Nov, 43).)
In the above example, a single value (Jan) represents 50% of all the values in the index. By default, ClustrixDB starts with the
indexed column as the distribution key for the representation. However, since the dataset skews heavily toward one value, slice 1 gets
a disproportionate number of entries.
The rebalancer process will notice that the minimum slice size for the representation is 10MB while the maximum is 30MB. Such a
gap indicates a data imbalance within the dataset, which leads to a lumpy distribution of data within the cluster.
In order to fix the above condition, the rebalancer will schedule a redistribute operation for the representation. It will begin adding
columns from the primary key to the representation's distribution key. Since the primary key must be unique by definition, we know
that a balanced distribution is possible for the dataset, provided we consume enough of the primary key.
For our example, the rebalancer will add the id column to the distribution key. Now that we hash over (col1, id), we get a much
better distribution of data across our slices.
(Diagram: after redistributing on (col1, id), the entries, including the Jan rows, are spread evenly across the slices.)
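A minimal sketch of the idea, under assumed thresholds: detect skew by comparing the smallest and largest slice sizes, then widen the distribution key with primary-key columns so that rows sharing an index value hash independently. The ratio and the hash below are illustrative assumptions, not the rebalancer's actual policy.

# Sketch of skew detection and distribution-key widening. The ratio and the
# hash function are illustrative assumptions, not ClustrixDB's policy.
import hashlib

def is_skewed(slice_sizes_mb, ratio=2.0):
    """Flag a representation whose largest slice dwarfs its smallest."""
    return max(slice_sizes_mb) > ratio * min(slice_sizes_mb)

def dist_hash(row, key_columns):
    """Hash the row's distribution-key columns into a 64-bit value."""
    material = "|".join(str(row[c]) for c in key_columns).encode()
    return int.from_bytes(hashlib.sha1(material).digest()[:8], "big")

rows = [{"col1": "Jan", "id": 3}, {"col1": "Jan", "id": 8},
        {"col1": "Jan", "id": 6}, {"col1": "Feb", "id": 9},
        {"col1": "Nov", "id": 43}]

print(is_skewed([10, 30]))   # True: 30MB max vs 10MB min calls for redistribution

# Distributing on (col1) alone sends every "Jan" row to the same slice.
print({r["id"]: dist_hash(r, ["col1"]) % 4 for r in rows})

# After widening the key to (col1, id), rows sharing col1 hash independently.
print({r["id"]: dist_hash(r, ["col1", "id"]) % 4 for r in rows})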