CC - Lecture 8-Final
DATA-INTENSIVE TECHNOLOGIES
FOR CLOUD COMPUTING
DATE 26/10/2018
TRENDS
Massive data
Thousands to millions of cores
Consolidated data centers
Shift from the clock-rate race to multicore and on to many-core processors…
Cheap hardware
Failures are the norm
VM based systems
DISTRIBUTED DATA STORAGE
CLOUD DATA STORES (NO-SQL)
Schema-less:
Shared nothing architecture
Elasticity
Sharding
Asynchronous replication
http://nosqlpedia.com/wiki/Survey_distributed_databases
CLOUD DATA STORES (NO-SQL)
Schema-less:
“Tables” don’t have a pre-defined schema. Records have a variable number of fields, which can differ from record to record. Record contents and semantics are enforced by the applications.
Shared nothing architecture
Instead of using a common storage pool (e.g., SAN), each server uses only
its own local storage. This allows storage to be accessed at local disk speeds
instead of network speeds, and it allows capacity to be increased by adding
more nodes. Cost is also reduced since commodity hardware can be used.
Elasticity
Both storage and server capacity can be added on-the-fly by merely adding more servers. No downtime is required. When a new node is added, the database automatically begins assigning it data to store and requests to fulfill.
http://nosqlpedia.com/wiki/Survey_distributed_databases
CLOUD DATA STORES (NO-SQL)
Sharding
Instead of viewing the storage as a monolithic space, records are partitioned into
shards. Usually, a shard is small enough to be managed by a single server, though
shards are usually replicated. Sharding can be automatic (e.g., an existing shard
splits when it gets too big), or applications can assist in data sharding by
assigning each record a partition ID (a minimal sketch of this follows below).
Asynchronous replication
Compared to RAID storage (mirroring and/or striping) or synchronous
replication, NoSQL databases employ asynchronous replication. This allows
writes to complete more quickly since they don’t depend on extra network traffic.
One side effect of this strategy is that data is not immediately replicated and
could be lost in certain windows. Also, locking is usually not available to protect
all copies of a specific unit of data.
BASE instead of ACID
NoSQL databases emphasize performance and availability. This requires prioritizing the components of the CAP theorem (described elsewhere), which tends to make true ACID transactions impractical.
http://nosqlpedia.com/wiki/Survey_distributed_databases
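A minimal sketch of the application-assisted sharding described above, assuming a fixed list of shard servers and hashing on the record's partition ID; SHARD_SERVERS and shard_for are illustrative names, not part of any particular NoSQL store:

import hashlib

# Hypothetical set of shard servers; in a real store this mapping is
# maintained by the cluster, not hard-coded.
SHARD_SERVERS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(partition_id: str) -> str:
    """Map a record's partition ID to one of the shards.

    Hashing the partition ID spreads records across shards; records that
    share a partition ID always land on the same shard.
    """
    digest = hashlib.md5(partition_id.encode("utf-8")).hexdigest()
    return SHARD_SERVERS[int(digest, 16) % len(SHARD_SERVERS)]

# Example: all records of one customer share a partition ID,
# so they are stored (and replicated) together.
record = {"partition_id": "customer-42", "order": 1017, "total": 99.5}
print(shard_for(record["partition_id"]))

Records that share a partition ID always land on the same shard, which keeps related data together; automatically sharded stores instead split a shard when it grows too large.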
ACID VS BASE
ACID:
Strong consistency
Isolation
Focus on “commit”
Nested transactions
Availability?
Conservative (pessimistic)
Difficult evolution (e.g. schema)

BASE:
Weak consistency – stale data OK
Availability first
Best effort
Approximate answers OK
Aggressive (optimistic)
Simpler!
Faster
Easier evolution
Fay Chang et al., “Bigtable: A Distributed Storage System for Structured Data”.
AMAZON DYNAMO
Problem        Technique            Advantage
Partitioning   Consistent Hashing   Incremental Scalability
DeCandia, G., et al. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07), Stevenson, Washington, USA, October 14-17, 2007. ACM, 205-220.
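The table above lists consistent hashing as Dynamo's partitioning technique. A minimal sketch of a consistent-hash ring with virtual nodes (node names and the vnode count are illustrative; Dynamo's production implementation differs in detail):

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class ConsistentHashRing:
    """Place nodes on a hash ring; each key is owned by the first node
    clockwise from the key's hash. Adding a node only moves the keys that
    fall between it and its predecessor (incremental scalability)."""

    def __init__(self, nodes, vnodes=8):
        self._ring = []                    # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node, vnodes)

    def add_node(self, node, vnodes=8):
        for i in range(vnodes):            # virtual nodes smooth the load
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        h = _hash(key)
        idx = bisect.bisect_right(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
print(ring.owner("user:1234"))             # the same key always maps to the same node

Because a new node takes over only the keys between itself and its predecessor on the ring, capacity can be added incrementally without rehashing the whole key space, which is the “incremental scalability” advantage in the table.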
NO-SQL DATA STORES
http://nosqlpedia.com/wiki/Survey_distributed_databases
GOOGLE DISK FARM
Early days…
…today
MOTIVATION
Need for a scalable DFS
Large distributed data-intensive applications
ASSUMPTIONS – ENVIRONMENT
Commodity Hardware
inexpensive
Component Failure
the norm rather than the exception
TBs of Space
must support TBs of space
DESIGN
Design factors
Failures are common (built from inexpensive commodity components)
Files
large (multi-GB)
mutation principally via appending new data
low-overhead atomicity essential
Co-design applications and file system API
Sustained bandwidth more critical than low latency
ARCHITECTURE
(Figure: GFS architecture, showing metadata flows between clients and the master, and data flows between clients and the chunkservers.)
Master
Manages namespace/metadata
Manages chunk creation, replication, placement
Performs snapshot operation to create a duplicate of a file or directory tree
Performs checkpointing and logging of changes to metadata
Chunkservers
Stores chunk data and checksum for each block
On startup/failure recovery, reports chunks to master
Periodically reports a subset of its chunks to the master (to detect chunks that are no longer needed)
ARCHITECTURE
Operations
Replica placement
New chunk and replica creation
Load balancing
Unused storage reclamation
IMPLEMENTATION – CONSISTENCY MODEL
Relaxed consistency model
Two types of mutations
Writes
Cause data to be written at an application-specified file offset
Record appends
Operations that append data to a file
Cause data to be appended atomically at least once
Offset chosen by GFS, not by the client
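A minimal sketch of the at-least-once record append semantics described above, using an in-memory stand-in for a chunk instead of the real GFS client API (FakeChunk and append_at_least_once are illustrative names):

import uuid

class FakeChunk:
    """In-memory stand-in for a GFS chunk (illustrative only)."""
    def __init__(self):
        self.data = bytearray()

    def record_append(self, record: bytes) -> int:
        offset = len(self.data)        # GFS, not the client, picks the offset
        self.data += record
        return offset

def append_at_least_once(chunk: FakeChunk, payload: bytes, retries: int = 3) -> int:
    """Record appends are atomic and happen at least once: on a failure the
    client simply retries, which may leave duplicate records in the chunk.
    Tagging records with an ID lets readers skip duplicates."""
    record = uuid.uuid4().bytes + payload
    for _ in range(retries):
        try:
            return chunk.record_append(record)
        except IOError:
            continue                    # retry; an earlier attempt may already be in the file
    raise IOError("record append failed after retries")

chunk = FakeChunk()
print(append_at_least_once(chunk, b"log line 1"))   # prints the offset chosen by the chunk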
IMPLEMENTATION – LEASES AND MUTATION ORDER
Master uses leases to maintain a consistent mutation order among
replicas
MUTATION OPERATIONS
Primary replica
Holds lease assigned by master (60 sec. default)
Assigns a serial order for all mutation operations performed on the replicas
Write operation
1-2: client obtains replica locations and the identity of the primary replica
3: client pushes data to all replicas (stored in an LRU buffer by the chunkservers holding the replicas)
4: client issues the update request to the primary
5: primary forwards/performs the write request
6: primary receives replies from the replicas
7: primary replies to the client
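A minimal sketch of the control flow in steps 3-7 above, assuming toy in-memory replica objects rather than the real GFS protocol: data is pushed to every replica first, and the primary then assigns a single serial order that all replicas apply.

class Replica:
    """In-memory stand-in for a chunk replica (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.buffer = {}      # data pushed by clients, keyed by data ID (LRU cache in real GFS)
        self.chunk = {}       # offset -> (serial number, data) after mutations are applied

    def push_data(self, data_id, data):
        self.buffer[data_id] = data                  # step 3: data flow, decoupled from control flow

    def apply(self, serial_no, data_id, offset):
        self.chunk[offset] = (serial_no, self.buffer.pop(data_id))

class Primary(Replica):
    """The primary holds the lease and assigns one serial order to all mutations."""
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id, offset):
        serial = self.next_serial                    # step 5: one global order for this chunk
        self.next_serial += 1
        self.apply(serial, data_id, offset)
        for s in self.secondaries:                   # forward to secondaries in the same order
            s.apply(serial, data_id, offset)
        return serial                                # step 7: reply to the client

secondaries = [Replica("r1"), Replica("r2")]
primary = Primary("p", secondaries)
for r in [primary] + secondaries:                    # step 3: client pushes data to every replica
    r.push_data("d1", b"hello")
primary.write("d1", offset=0)                        # step 4: client issues the write to the primary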
CONSISTENCY GUARANTEES
METADATA MANAGEMENT
Logical structure:
pathname   lock   chunk list
/home      read   Chunk4400488,…
/save             Chunk8ffe07783,…
Namespace
Logically a mapping from pathname to chunk list
Allows concurrent file creation in same directory
Read/write locks prevent conflicting operations
File deletion by renaming to a hidden name; removed during regular scan
Operation log
Historical record of metadata changes
Kept on multiple remote machines
Checkpoint created when log exceeds threshold
When checkpointing, switch to new log and create checkpoint in separate thread
Recovery made from most recent checkpoint and subsequent log
Snapshot
Revokes outstanding leases on the chunks in the file/directory
Logs the operation
Duplicates the metadata (not the chunks!) for the source
On the first client write to a chunk (required for the client to gain access to the chunk):
A reference count > 1 indicates a duplicated chunk
A new chunk is created and the chunk list of the duplicate is updated
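A minimal sketch of how pathname-based read/write locking allows concurrent file creation in the same directory: creating a file takes read locks on the ancestor directories and a write lock only on the new pathname. The function name is illustrative:

def locks_for_create(path: str):
    """Which namespace locks a GFS-style master would take to create `path`:
    read locks on every ancestor directory name and a write lock on the full
    pathname itself. Because the parent directory only gets a read lock,
    two files can be created in the same directory concurrently."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return {"read": ancestors, "write": [path]}

print(locks_for_create("/home/user/foo"))
# {'read': ['/home', '/home/user'], 'write': ['/home/user/foo']}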
CHUNK/REPLICA MANAGEMENT
Placement
On chunkservers with below-average disk space utilization
Limit number of “recent” creations on a chunkserver (since access traffic will follow)
Spread replicas across racks (for reliability)
Reclamation
Chunks become garbage when the file of which they are a part is deleted
A lazy strategy (garbage collection) is used: no attempt is made to reclaim chunks at the time of deletion
In periodic “HeartBeat” messages the chunkserver reports to the master a subset of its current chunks
The master identifies which reported chunks are no longer accessible (i.e., are garbage)
The chunkserver then reclaims the garbage chunks
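A minimal sketch of the HeartBeat-driven garbage collection above: the master compares the chunks a chunkserver reports against the chunk lists of live files and returns the orphans. The data structures and function name are illustrative:

def find_orphaned_chunks(reported_chunks, master_chunk_lists):
    """Lazy, GFS-style garbage collection: the chunkserver's HeartBeat reports
    a subset of the chunks it holds, and the master replies with the ones no
    file references any more. The chunkserver is then free to delete them."""
    live = set()
    for chunk_list in master_chunk_lists.values():   # chunk lists of all non-deleted files
        live.update(chunk_list)
    return [c for c in reported_chunks if c not in live]

master_chunk_lists = {"/home/a": ["c1", "c2"], "/home/b": ["c3"]}
print(find_orphaned_chunks(["c2", "c3", "c9"], master_chunk_lists))   # ['c9'] is garbage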
PERFORMANCE
BENEFITS AND LIMITATIONS
Simple design with single master
Fault tolerance
Custom designed
Limited security
SECTOR
File System    GFS/HDFS                              Lustre                                Sector
Architecture   Cluster-based, asymmetric, parallel   Cluster-based, asymmetric, parallel   Cluster-based, asymmetric, parallel
Microsoft DryadLINQ
EXECUTION OVERVIEW
Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
MAPREDUCE
1. The MapReduce library in the user program first shards the input files
into M pieces of typically 16 megabytes to 64 megabytes (MB) per
piece. It then starts up many copies of the program on a cluster of
machines.
2. One of the copies of the program is special: the master. The rest are
workers that are assigned work by the master. There are M map tasks
and R reduce tasks to assign. The master picks idle workers and assigns
each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the
corresponding input shard. It parses key/value pairs out of the input data
and passes each pair to the user-defined Map function. The intermediate
key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
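A minimal sketch of the partitioning function mentioned in step 4; hash(key) mod R is the typical default, and R and the hash function here are illustrative choices:

from zlib import crc32

R = 4   # number of reduce tasks (illustrative)

def partition(key: str, num_reduces: int = R) -> int:
    """Partitioning function in the hash(key) mod R style: every occurrence
    of a key goes to the same one of the R reduce regions."""
    return crc32(key.encode("utf-8")) % num_reduces

print(partition("car"), partition("foo"))   # each key always maps to one region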
WORD COUNT
Input         Mapping                   Shuffling                     Sorting          Reducing
foo car bar   foo,1  car,1  bar,1       bar,1  bar,1                  bar,<1,1>        bar,2
foo bar foo   foo,1  bar,1  foo,1       car,1  car,1  car,1  car,1    car,<1,1,1,1>    car,4
car car car   car,1  car,1  car,1       foo,1  foo,1  foo,1           foo,<1,1,1>      foo,3
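A minimal sketch of word count as user-defined Map and Reduce functions, with the shuffle/sort step simulated in memory rather than by a real MapReduce runtime:

from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for every word in the input line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    # Map phase
    mapped = [pair for line in lines for pair in map_fn(line)]
    # Shuffle/sort phase: group values by key
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)
    # Reduce phase
    return dict(reduce_fn(w, c) for w, c in sorted(groups.items()))

print(word_count(["foo car bar", "foo bar foo", "car car car"]))
# {'bar': 2, 'car': 4, 'foo': 3}

Running it on the three input lines above reproduces the counts in the table: bar 2, car 4, foo 3.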
HADOOP & DRYADLINQ
Apache Hadoop: a master node (Job Tracker, Name Node) schedules map (M) and reduce (R) tasks on the data/compute nodes; data is stored as blocks in HDFS.
Microsoft DryadLINQ: standard LINQ and DryadLINQ operations are translated by the DryadLINQ compiler into a Directed Acyclic Graph (DAG) based execution plan (vertex = execution task, edge = communication path) that runs on the Dryad execution engine.
(Figure: side-by-side architecture diagrams of Hadoop and DryadLINQ.)
Judy Qiu, “Cloud Technologies and Their Applications,” Indiana University Bloomington, March 26, 2010.
Feature | Programming Model | Data Storage | Communication | Scheduling & Load Balancing
Hadoop | MapReduce | HDFS | TCP | Data locality; rack-aware dynamic task scheduling through a global queue; natural load balancing
Dryad | DAG-based execution flows | Windows shared directories (Cosmos) | Shared files / TCP pipes / shared-memory FIFO | Data locality; network-topology-based run-time graph optimizations; static scheduling
Twister | Iterative MapReduce | Shared file system / local disks | Content Distribution Network / direct TCP | Data locality based static scheduling
MapReduceRole4Azure | MapReduce | Azure Blob Storage | TCP through Azure Blob Storage (direct TCP) | Dynamic scheduling through a global queue; good natural load balancing
(Figure: total time vs. standard deviation for randomly distributed inhomogeneous data; DryadLINQ SWG, Hadoop SWG, and Hadoop SWG on VM.)
Inhomogeneity of data does not have a significant effect when the sequence lengths are randomly distributed.
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on IDataPlex (32 nodes).
INHOMOGENEOUS DATA PERFORMANCE
(Figure: skewed distributed inhomogeneous data, mean 400, dataset size 10000; Total Time (s) vs. Standard Deviation for DryadLINQ SWG, Hadoop SWG, and Hadoop SWG on VM.)
SEQUENCE ASSEMBLY PERFORMANCE
OTHER ABSTRACTIONS
All-pairs
DAG
Wavefront
APPLICATIONS
APPLICATION CATEGORIES
1. Synchronous
Easiest to parallelize. E.g., SIMD
2. Asynchronous
Evolve dynamically in time, with different members using different evolution algorithms.
3. Loosely Synchronous
Middle ground: dynamically evolving members, synchronized now and then. E.g., iterative MapReduce
4. Pleasingly Parallel
5. Meta problems
BioInformatics
Sequence Alignment: SmithWaterman-GOTOH all-pairs alignment
Sequence Assembly: Cap3, CloudBurst
Data mining: MDS, GTM & Interpolations
(Figure: all-pairs alignment as MapReduce. Row blocks 1..N of sequences (1-100, 101-200, 201-300, 301-400, …) are processed by map tasks M1 … M(N*(N+1)/2); Reduce 1 … Reduce N collect the row blocks and write hdfs://.../rowblock_1.out … hdfs://.../rowblock_N.out.)
WORKFLOWS
Represent and manage complex distributed scientific
computations
Composition and representation
Mapping to resources (data as well as compute)
Execution and provenance capturing
Type of workflows
Sequence of tasks, DAGs, cyclic graphs, hierarchical
workflows (workflows of workflows)
Data Flows vs Control flows
Interactive workflows
LEAD – LINKED ENVIRONMENTS FOR DYNAMIC DISCOVERY
Based on WS-BPEL and
SOA infrastructure
PEGASUS AND DAGMAN
Pegasus
Resource, data discovery
Mapping computation to resources
Orchestrate data transfers
Publish results
Graph optimizations
DAGMan
Submits tasks to execution resources
Monitors the execution
Retries in case of failure
Maintains dependencies
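A minimal sketch, not Pegasus or DAGMan itself, of how a DAG-based engine runs workflow tasks in dependency order and retries failures; the tasks and dependency graph are illustrative:

from graphlib import TopologicalSorter   # Python 3.9+

def run_workflow(tasks, dependencies, max_retries=2):
    """Run a DAG of tasks in dependency order, retrying failed tasks,
    roughly what a DAGMan-style engine does. `tasks` maps a name to a
    callable; `dependencies` maps a name to the names it depends on."""
    order = TopologicalSorter(dependencies)
    for name in order.static_order():              # respects all dependencies
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise RuntimeError(f"task {name} failed after retries")

# Toy workflow: stage-in -> align -> assemble -> publish
tasks = {
    "stage_in": lambda: print("transfer input data"),
    "align":    lambda: print("run alignment"),
    "assemble": lambda: print("run assembly"),
    "publish":  lambda: print("publish results"),
}
dependencies = {
    "align": {"stage_in"},
    "assemble": {"align"},
    "publish": {"assemble"},
}
run_workflow(tasks, dependencies)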
CONCLUSION
Scientific analysis is moving more and more towards Clouds
and related technologies
There are many cutting-edge technologies in industry that we can use to facilitate data-intensive computing.
Motivation
Developing easy-to-use, efficient software frameworks to facilitate data-intensive computing
Thank You !!!