Parallel & Distributed Databases: CS 561 - Spring 2012, WPI, Mohamed Eltabakh

The document discusses parallel and distributed databases. It describes how in distributed databases, data is stored across multiple machines running database management systems, requiring new approaches to distributed transactions, query processing, and other database functions. Key topics covered include different architectures for parallel databases, common parallel algorithms for database operations like scans, joins, and sorting, and factors that influence the performance of parallel query processing like data partitioning.

CS 561 - SPRING 2012

WPI, MOHAMED ELTABAKH

PARALLEL & DISTRIBUTED DATABASES
INTRODUCTION
In a centralized database:
Data is located in one place (one server)
All DBMS functionalities are done by that server
Enforcing ACID properties of transactions
Concurrency control, recovery mechanisms
Answering queries
In distributed databases:
Data is stored in multiple places (each is running a DBMS)
New notion of distributed transactions
DBMS functionalities are now distributed over many machines
Revisit how these functionalities work in a distributed environment
WHY DISTRIBUTED DATABASES
Data is too large
Applications are by nature distributed
Bank with many branches
Chain of retail stores with many locations
Library with many branches
Benefit from distributed and parallel processing
Faster response time for queries
PARALLEL VS. DISTRIBUTED DATABASES
Distributed processing usually implies parallel processing (not vice versa)
Can have parallel processing on a single machine

Assumptions about architecture
Parallel Databases
Machines are physically close to each other, e.g., in the same server room
Machines are connected by dedicated high-speed LANs and switches
Communication cost is assumed to be small
Can be a shared-memory, shared-disk, or shared-nothing architecture
Distributed Databases
Machines can be far from each other, e.g., on different continents
Can be connected using a public network, e.g., the Internet
Communication cost and problems cannot be ignored
Usually a shared-nothing architecture
PARALLEL DATABASE & PARALLEL PROCESSING
WHY PARALLEL PROCESSING
Scanning 1 Terabyte at 10 MB/s takes about 1.2 days
With 1,000 x parallelism, the same 1 Terabyte scan takes about 1.5 minutes
[Figure: bandwidth of a single 10 MB/s scan vs. a 1,000 x parallel scan]
Divide a big problem into many smaller ones to be solved in parallel
Increase bandwidth (in our case, decrease query response time)
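The scan-time figures on this slide follow from simple arithmetic. A minimal sketch, assuming decimal units (1 Terabyte = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even split of the work over the parallel units; the function name `scan_seconds` is illustrative only:

```python
def scan_seconds(data_bytes: float, bytes_per_sec: float, parallel_units: int = 1) -> float:
    """Time to scan data_bytes when the work is split evenly over parallel_units."""
    return data_bytes / (bytes_per_sec * parallel_units)

TERABYTE = 10**12            # assumption: decimal units
TEN_MB_PER_SEC = 10 * 10**6  # assumption: 1 MB = 10^6 bytes

sequential = scan_seconds(TERABYTE, TEN_MB_PER_SEC)
parallel = scan_seconds(TERABYTE, TEN_MB_PER_SEC, 1000)

print(f"sequential: {sequential / 86400:.2f} days")
print(f"1,000-way parallel: {parallel / 60:.2f} minutes")
```

Under these assumptions the sequential scan takes 100,000 seconds (about 1.16 days) and the parallel scan 100 seconds, which is in line with the rounded figures quoted on the slide.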
DIFFERENT ARCHITECTURES
Three possible architectures for passing information
Shared-memory
Shared-disk
Shared-nothing
1- SHARED-MEMORY ARCHITECTURE
Every processor has its own disk
Single memory address-space for all processors
Reading or writing to far memory can be slightly more expensive
Every processor can have its own local memory and cache as well
2- SHARED-DISK ARCHITECTURE
Every processor has its own memory (not accessible by others)
All machines can access all disks in the system
Number of disks does not necessarily match the number of processors
3- SHARED-NOTHING ARCHITECTURE
Most common architecture nowadays
Every machine has its own memory and disk
Many cheap machines (commodity hardware)
Communication is done through a high-speed network and switches
Usually machines are organized in a hierarchy
Machines on the same rack
Then racks are connected through high-speed switches
Scales better
Easier to build
Lower cost
TYPES OF PARALLELISM
Pipeline Parallelism (Inter-operator parallelism)
Tasks are ordered (or partially ordered), and different machines perform different tasks
Partitioned Parallelism (Intra-operator parallelism)
A task is divided over all machines to run in parallel
[Figure: pipeline parallelism chains sequential operators with an order between them; partitioned parallelism runs a sequential operator on each partition of the data]
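The two kinds of parallelism can be contrasted in a small sketch. It is an illustration only: Python generators stand in for pipelined operators (they model the dataflow, not true concurrent execution), threads stand in for machines, and the operator names `scan` and `select_even` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def scan(rows):
    # First operator: produces rows one at a time.
    for row in rows:
        yield row

def select_even(rows):
    # Second operator: consumes rows as soon as the scan produces them.
    for row in rows:
        if row % 2 == 0:
            yield row

def pipeline(rows):
    # Inter-operator (pipeline) parallelism: different operators chained in order.
    return list(select_even(scan(rows)))

def partitioned(rows, m):
    # Intra-operator (partitioned) parallelism: the same operator runs on m partitions.
    partitions = [rows[i::m] for i in range(m)]
    with ThreadPoolExecutor(max_workers=m) as pool:
        results = list(pool.map(lambda part: [r for r in part if r % 2 == 0], partitions))
    return sorted(r for part in results for r in part)

data = list(range(10))
print(pipeline(data))
print(partitioned(data, 3))
```

Both calls return the same rows; they differ only in how the work is divided.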
IDEAL SCALABILITY SCENARIO
Speed-Up
More resources mean proportionally less time for a given amount of data.

Scale-Up
If resources are increased in proportion to the increase in data size, time stays constant.
[Figure: speed-up plot, throughput (Xact/sec.) vs. degree of parallelism, with the ideal curve]
[Figure: scale-up plot, response time (sec./Xact) vs. degree of parallelism, with the ideal curve]
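Both metrics can be written as simple ratios of elapsed times. The sketch below uses the usual textbook formulation, which the slide states only in words; the function names and the timings are hypothetical and chosen to show the ideal cases.

```python
def speed_up(time_small_system: float, time_large_system: float) -> float:
    """Same data on both systems; the ideal value equals the resource ratio m."""
    return time_small_system / time_large_system

def scale_up(time_small_problem: float, time_large_problem: float) -> float:
    """Data and resources both grown m times; the ideal value is 1.0."""
    return time_small_problem / time_large_problem

# Hypothetical timings in seconds, for illustration only.
m = 4
print(speed_up(100.0, 100.0 / m))  # ideal linear speed-up with m = 4
print(scale_up(100.0, 100.0))      # ideal scale-up: time unchanged
```

A measured speed-up below m, or a scale-up below 1.0, indicates overhead such as communication or skewed partitions.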
PARTITIONING OF DATA
To partition a relation R over m machines:
Range partitioning
Hash-based partitioning
Round-robin partitioning
[Figure: key ranges A...E, F...J, K...N, O...S, T...Z illustrating the three partitioning schemes]
Shared-nothing architecture is sensitive to partitioning
Good partitioning depends on what operations are common
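The three partitioning schemes can be sketched as follows. This is a minimal illustration, not a DBMS implementation: `zlib.crc32` is an assumed stand-in for the system's hash function, and the sample names and split points are made up to match the A...E, F...J, K...N, O...S, T...Z ranges.

```python
import bisect
import zlib

def round_robin_partition(rows, m):
    # Tuple number k goes to machine k mod m, regardless of its content.
    parts = [[] for _ in range(m)]
    for k, row in enumerate(rows):
        parts[k % m].append(row)
    return parts

def hash_partition(rows, m, key):
    # crc32 is a deterministic stand-in for the DBMS hash function (assumption).
    parts = [[] for _ in range(m)]
    for row in rows:
        parts[zlib.crc32(str(key(row)).encode()) % m].append(row)
    return parts

def range_partition(rows, upper_bounds, key):
    # upper_bounds: sorted inclusive upper bounds of all ranges except the last.
    parts = [[] for _ in range(len(upper_bounds) + 1)]
    for row in rows:
        parts[bisect.bisect_left(upper_bounds, key(row))].append(row)
    return parts

names = ["Adams", "Gray", "Knuth", "Stone", "Young", "Baker"]
first_letter = lambda name: name[0]
print(range_partition(names, ["E", "J", "N", "S"], first_letter))
print(round_robin_partition(names, 3))
print(hash_partition(names, 3, first_letter))
```

Round-robin gives the most even spread, while range and hash partitioning keep tuples with equal keys together, which is what the later algorithms rely on.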
PARALLEL ALGORITHMS FOR DBMS OPERATIONS
PARALLEL SCAN σ_c(R)
Relation R is partitioned over m machines
Each partition of R holds around |R|/m tuples
Each machine scans its own partition and applies the selection condition c
If data are partitioned using round robin or a hash function (over the entire tuple)
The result relation is expected to be well distributed over all nodes
All partitions will be scanned
If data are range partitioned or hash partitioned (on the selection column)
The result relation may be clustered on a few nodes
Only a few partitions need to be touched
Parallel Projection is also straightforward
All partitions will be touched
Not sensitive to how data is partitioned
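The range-partitioned case of the parallel scan can be sketched as below, assuming the selection condition c is a range predicate `low <= value <= high` on the partitioning column. The loop is a sequential stand-in for machines that would work in parallel, and all names and values are hypothetical.

```python
def parallel_range_scan(partitions, bounds, low, high):
    """partitions[i] holds values in the inclusive range bounds[i] = (lo_i, hi_i)."""
    result, touched = [], []
    for i, (part, (lo_i, hi_i)) in enumerate(zip(partitions, bounds)):
        if hi_i < low or lo_i > high:
            continue  # this partition cannot contain matching tuples, skip it
        touched.append(i)
        # Each touched machine would run this local scan in parallel.
        result.extend(v for v in part if low <= v <= high)
    return result, touched

partitions = [[1, 5, 9], [12, 15], [21, 27, 30], [33, 38]]
bounds = [(0, 9), (10, 19), (20, 30), (31, 40)]
rows, touched = parallel_range_scan(partitions, bounds, 14, 22)
print(rows, touched)
```

Only the partitions whose range overlaps the predicate are scanned. With round-robin partitioning no such pruning is possible, so every partition would be touched.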
PARALLEL DUPLICATE ELIMINATION
If the relation is range or hash partitioned
Identical tuples are in the same partition
So, eliminate duplicates in each partition independently
If the relation is round-robin partitioned
Re-partition the relation using a hash function
So every machine creates m partitions and sends the i-th partition to machine i
Machine i can now perform the duplicate elimination
Same idea applies to Set Operations (Union, Intersect, Except)
But apply the same partitioning to both relations R & S
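The round-robin case can be sketched as a two-step shuffle. This is an illustration under stated assumptions: `zlib.crc32` over the tuple's text form stands in for the hash function, nested lists stand in for network transfers, and the function names are made up.

```python
import zlib

def bucket(row, m):
    # Deterministic hash over the entire tuple; crc32 is a stand-in (assumption).
    return zlib.crc32(repr(row).encode()) % m

def parallel_distinct(partitions):
    m = len(partitions)
    # Step 1: every machine splits its local tuples into m outgoing partitions.
    outgoing = [[[] for _ in range(m)] for _ in range(m)]
    for src, part in enumerate(partitions):
        for row in part:
            outgoing[src][bucket(row, m)].append(row)
    # Step 2: machine i receives the i-th partition from every machine
    # and eliminates duplicates locally.
    result = []
    for i in range(m):
        received = [row for src in range(m) for row in outgoing[src][i]]
        result.append(sorted(set(received)))
    return result

round_robin = [[("a", 1), ("b", 2)], [("a", 1), ("c", 3)], [("b", 2), ("a", 1)]]
print(parallel_distinct(round_robin))
```

Because identical tuples hash to the same machine, the local results never overlap and their union is the duplicate-free relation.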
PARALLEL JOIN R(X,Y) ⋈ S(Y,Z)
Re-partition R and S on the join attribute Y (natural join or equi-join)
Hash-based or range-based partitioning

Each machine i receives the i-th partitions from all machines (from R and S)
Each machine can locally join the partitions it has
Depending on the partition sizes of R and S, the local joins can be hash joins or merge joins
[Figure: hash partitioning of the original relations (R, then S): an input buffer is read from disk, hash function h routes tuples to B-1 output buffers out of B main-memory buffers, and the B-1 partitions are written to disk]
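The re-partition-then-join-locally idea can be sketched as follows, with a hash join chosen as the local join. Assumptions are labeled in the comments: Python's built-in `hash()` stands in for the partitioning hash function (it is deterministic for integers), lists stand in for machines, and the relation contents are made up.

```python
from collections import defaultdict

def repartition(partitions, m, key_index):
    # Every machine sends each tuple to machine hash(Y) mod m.
    # hash() is a stand-in for the partitioning function (assumption).
    received = [[] for _ in range(m)]
    for part in partitions:
        for row in part:
            received[hash(row[key_index]) % m].append(row)
    return received

def local_hash_join(r_part, s_part):
    # Build a hash table on R's Y values, then probe it with S.
    table = defaultdict(list)
    for x, y in r_part:
        table[y].append(x)
    return [(x, y, z) for y, z in s_part for x in table.get(y, [])]

def parallel_join(r_partitions, s_partitions):
    m = len(r_partitions)
    r_new = repartition(r_partitions, m, key_index=1)  # R(X, Y): Y is column 1
    s_new = repartition(s_partitions, m, key_index=0)  # S(Y, Z): Y is column 0
    # Each machine joins only the partitions it received.
    return [local_hash_join(r_new[i], s_new[i]) for i in range(m)]

R = [[("x1", 1), ("x2", 2)], [("x3", 1)]]
S = [[(1, "z1")], [(2, "z2"), (3, "z3")]]
print(parallel_join(R, S))
```

Since both relations are partitioned with the same function on Y, matching tuples always meet on the same machine, so no cross-machine comparison is needed.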
PARALLEL SORTING
Range-based
Re-partition R based on ranges into m partitions
Machine i receives the i-th partitions from all machines and sorts that partition
The entire R is now sorted
Skewed data is an issue
Apply a sampling phase first
Ranges can be of different widths
Merge-based
Each node sorts its own data
All nodes start sending their sorted data (one block at a time) to a single machine
This machine applies the merge step of merge-sort as the data arrive
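Both sorting strategies can be sketched in a few lines. This is a simplified illustration: the way split points are picked from the sample is one possible choice and not the method prescribed by the slides, lists stand in for machines, and `heapq.merge` plays the role of the single merging machine.

```python
import bisect
import heapq
import random

def range_sort(partitions, sample_size=4, seed=0):
    m = len(partitions)
    rng = random.Random(seed)
    # Sampling phase: pick split points from a sample so skewed data still balances.
    sample = sorted(v for part in partitions
                    for v in rng.sample(part, min(sample_size, len(part))))
    splitters = [sample[(i * len(sample)) // m] for i in range(1, m)] if sample else []
    # Re-partition: each value goes to the machine that owns its range.
    received = [[] for _ in range(len(splitters) + 1)]
    for part in partitions:
        for v in part:
            received[bisect.bisect_right(splitters, v)].append(v)
    # Each machine sorts locally; reading the machines in order gives sorted R.
    return [sorted(part) for part in received]

def merge_based_sort(partitions):
    # Each node sorts its own data, then one machine merges the sorted streams.
    return list(heapq.merge(*(sorted(part) for part in partitions)))

data = [[9, 3, 7], [1, 8, 2], [6, 5, 4]]
print(range_sort(data))
print(merge_based_sort(data))
```

The range-based variant leaves the result distributed over the machines, whereas the merge-based variant funnels everything through one machine, which can become a bottleneck.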
COMPLEX PARALLEL QUERY PLANS
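[Figure: query plan over relations A, B, R, and S; sub-plans are assigned to sites 1-4 and sites 5-8, and the top operator runs on sites 1-8]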
All previous examples are intra-operator parallelism
Complex queries can have inter-operator parallelism
Different machines perform different tasks
PERFORMANCE OF PARALLEL ALGORITHMS
In many cases, parallel algorithms reach their expected lower bound (or come close to it)
If the degree of parallelism is m, then the parallel cost is 1/m of the sequential cost
Cost mostly refers to the query's response time
Example
Parallel selection or projection is 1/m of the sequential cost
[Figure: ideal throughput (Xact/sec.) and ideal response time (sec./Xact) vs. degree of parallelism]
PERFORMANCE OF PARALLEL ALGORITHMS (CONT'D)
Total disk I/Os (sum over all machines) of parallel algorithms can be larger than those of the sequential counterpart
But we get the benefit of the work being done in parallel
Example
Merge-sort join (serial case) has I/O cost = 3(B(R) + B(S))
Merge-sort join (parallel case) has total (sum) I/O cost = 5(B(R) + B(S))
Taking the parallelism into account, the cost per machine = 5(B(R) + B(S)) / m
B(R) and B(S) are the numbers of pages of relations R and S
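The cost formulas above can be evaluated for concrete sizes. A minimal worked example, assuming hypothetical page counts B(R) = 1000 and B(S) = 500, m = 10 machines, and perfectly even partitioning:

```python
def serial_join_io(b_r: int, b_s: int) -> int:
    # Serial merge-sort join: 3(B(R) + B(S)) page I/Os.
    return 3 * (b_r + b_s)

def parallel_join_total_io(b_r: int, b_s: int) -> int:
    # Parallel merge-sort join, summed over all machines: 5(B(R) + B(S)).
    return 5 * (b_r + b_s)

def parallel_join_io_per_machine(b_r: int, b_s: int, m: int) -> float:
    # With even partitioning each machine does 1/m of the total work.
    return parallel_join_total_io(b_r, b_s) / m

b_r, b_s, m = 1000, 500, 10  # hypothetical values
print(serial_join_io(b_r, b_s))
print(parallel_join_total_io(b_r, b_s))
print(parallel_join_io_per_machine(b_r, b_s, m))
```

With these numbers the parallel plan performs 7500 I/Os in total against 4500 for the serial plan, yet each machine performs only 750, which is what determines the response time.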
OPTIMIZING PARALLEL ALGORITHMS
Best serial plan != the best parallel one
Trivial counter-example:
Table partitioned over two nodes, with a local secondary index at each node
Range query: selects all data of node 1 and 1% of node 2.
Node 1 should do a scan of its partition.
Node 2 should use its secondary index.

[Figure: the node holding N..Z runs a table scan; the node holding A..M runs an index scan]
Different optimization algorithms for parallel plans (more candidate plans)
Different machines may perform the same operation but using different plans
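The counter-example can be expressed as a per-node access-path decision. This is a deliberately simplified sketch: a real optimizer compares estimated costs, whereas here a single hypothetical selectivity threshold of 10% (not a value from the slides) decides between the two plans.

```python
def choose_access_path(matching_fraction: float, threshold: float = 0.1) -> str:
    # threshold is a hypothetical cut-off, not a value from the source.
    # A scan wins when most of the partition qualifies; the index wins otherwise.
    return "table scan" if matching_fraction > threshold else "index scan"

# Fraction of each node's partition selected by the range query.
selectivity = {"node 1": 1.0, "node 2": 0.01}
plans = {node: choose_access_path(frac) for node, frac in selectivity.items()}
print(plans)
```

The same logical operation ends up with a different physical plan on each node, which a single serial plan cannot express.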
SUMMARY OF PARALLEL DATABASES
Three possible architectures
Shared-memory
Shared-disk
Shared-nothing (the most common one)
Parallel algorithms
Intra-operator
Scans, projections, joins, sorting, set operators, etc.
Inter-operator
Distributing different operators in a complex query to different nodes
Partitioning and data layout are important and affect performance
Optimization of parallel algorithms is a challenge
