
Lecture 8

DATA-INTENSIVE TECHNOLOGIES
FOR CLOUD COMPUTING

Date: 26/10/2018

1
TRENDS
 Massive data
 Thousands to millions of cores
 Consolidated data centers
 Shift from the clock-rate battle, to multicore, to many-core…

 Cheap hardware
 Failures are the norm

 VM based systems

 Making systems accessible (easy to use)

 More people requiring large-scale data processing
 Shift from academia to industry…
2
MOVING TOWARDS..
 Distributed File Systems
 HDFS, etc..
 Distributed Key-Value stores
 Data intensive parallel application frameworks
 MapReduce
 High level languages
 Science in the clouds

3
DISTRIBUTED DATA STORAGE
4
CLOUD DATA STORES (NO-SQL)
 Schema-less:
 Shared nothing architecture

 Elasticity

 Sharding

 Asynchronous replication

 BASE instead of ACID

http://nosqlpedia.com/wiki/Survey_distributed_databases
CLOUD DATA STORES (NO-SQL)
 Schema-less:
 “Tables” don’t have a pre-defined schema. Records have a variable number of
fields that can vary from record to record. Record contents and semantics are
enforced by applications (see the sketch at the end of this slide).
 Shared nothing architecture
 Instead of using a common storage pool (e.g., SAN), each server uses only
its own local storage. This allows storage to be accessed at local disk speeds
instead of network speeds, and it allows capacity to be increased by adding
more nodes. Cost is also reduced since commodity hardware can be used.
 Elasticity
 Both storage and server capacity can be added on-the-fly by merely adding
more servers. No downtime is required. When a new node is added, the
database begins giving it something to do and requests to fulfill.
6

http://nosqlpedia.com/wiki/Survey_distributed_databases
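To make the schema-less point concrete, here is a minimal illustrative sketch (in Python, not tied to any particular NoSQL product): records in one logical "table" carry different fields, and an application-level rule supplies the semantics. The field names are invented for the example.

```python
# Illustrative sketch: records in the same logical "table" need not share a
# schema; the application decides which fields are required and how to
# interpret them.

users = [
    {"id": "u1", "name": "Alice", "email": "alice@example.org"},
    {"id": "u2", "name": "Bob", "last_login": "2018-10-20", "tags": ["admin"]},
    {"id": "u3"},  # minimal record; missing fields are simply absent
]

def display_name(record):
    # Application-level rule: fall back to the id when no name is stored.
    return record.get("name", record["id"])

for r in users:
    print(display_name(r), sorted(r.keys()))
```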
CLOUD DATA STORES (NO-SQL)
 Sharding
 Instead of viewing the storage as a monolithic space, records are partitioned into
shards. Usually, a shard is small enough to be managed by a single server, though
shards are usually replicated. Sharding can be automatic (e.g., an existing shard
splits when it gets too big), or applications can assist in data sharding by
assigning each record a partition ID.
 Asynchronous replication
 Compared to RAID storage (mirroring and/or striping) or synchronous
replication, NoSQL databases employ asynchronous replication. This allows
writes to complete more quickly since they don’t depend on extra network traffic.
One side effect of this strategy is that data is not immediately replicated and
could be lost in certain windows. Also, locking is usually not available to protect
all copies of a specific unit of data.
 BASE instead of ACID
 NoSQL databases emphasize performance and availability. This requires
prioritizing the components of the CAP theorem (described elsewhere), which
tends to make true ACID transactions implausible.
7
http://nosqlpedia.com/wiki/Survey_distributed_databases
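As a rough illustration of the sharding bullet above, the sketch below assigns each record to a shard by hashing its partition key. The shard count and key names are assumptions made for this example; real stores also handle shard splitting, replication, and rebalancing when nodes join or leave.

```python
import hashlib

NUM_SHARDS = 8  # assumed number of shards for this sketch

def shard_for(partition_key: str) -> int:
    """Map a record's partition key to one of NUM_SHARDS shards (hash-based sharding)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for key in ["user:1001", "user:1002", "order:42"]:
    print(key, "-> shard", shard_for(key))
```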
ACID VS BASE

ACID                                   BASE
‹ Strong consistency                   ‹ Weak consistency – stale data OK
‹ Isolation                            ‹ Availability first
‹ Focus on “commit”                    ‹ Best effort
‹ Nested transactions                  ‹ Approximate answers OK
‹ Availability?                        ‹ Aggressive (optimistic)
‹ Conservative (pessimistic)           ‹ Simpler!
‹ Difficult evolution (e.g. schema)    ‹ Faster
                                       ‹ Easier evolution

ACID = Atomicity, Consistency, Isolation, and Durability
BASE = Basically Available, Soft state, Eventual consistency
8


GOOGLE BIGTABLE
 Data Model
 A sparse, distributed, persistent multidimensional sorted map

 Indexed by a row key, column key, and a timestamp


 A table contains column families
 Column keys are grouped into column families
 Row ranges are stored as tablets (sharding)
 Supports single-row transactions
 Uses the Chubby distributed lock service to manage masters and tablet locks
 Based on GFS
 Supports running scripts and MapReduce
(A toy in-memory sketch of this data model follows this slide.)

Fay Chang, et al. “Bigtable: A Distributed Storage System for Structured Data”.
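A toy, in-memory sketch of the data model described above: a sparse map indexed by (row key, column key, timestamp), kept sorted by row key so that row ranges (tablets) can be scanned. The class and method names are hypothetical and only illustrate the model, not Bigtable's actual API.

```python
from bisect import insort

class ToyTable:
    """Sparse map: (row key, column family:qualifier, timestamp) -> value, sorted by row key."""

    def __init__(self):
        self._rows = {}      # row key -> {column key -> [(timestamp, value), ...]}
        self._row_keys = []  # sorted row keys, enabling range scans over row ranges

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            insort(self._row_keys, row)
            self._rows[row] = {}
        cell = self._rows[row].setdefault(column, [])
        cell.append((timestamp, value))
        cell.sort(reverse=True)  # newest version first

    def get(self, row, column):
        versions = self._rows.get(row, {}).get(column, [])
        return versions[0] if versions else None  # latest (timestamp, value)

    def scan(self, start_row, end_row):
        # Row-range scan: the access pattern that tablets (shards) serve.
        for key in self._row_keys:
            if start_row <= key < end_row:
                yield key, self._rows[key]

t = ToyTable()
t.put("com.cnn.www", "anchor:cnnsi.com", 3, "CNN")
t.put("com.cnn.www", "contents:html", 5, "<html>...</html>")
print(t.get("com.cnn.www", "anchor:cnnsi.com"))  # (3, 'CNN')
```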
AMAZON DYNAMO
Problem: Partitioning
  Technique: Consistent hashing
  Advantage: Incremental scalability

Problem: High availability for writes
  Technique: Vector clocks with reconciliation during reads
  Advantage: Number of versions is decoupled from update rates

Problem: Handling temporary failures
  Technique: Sloppy quorum and hinted handoff
  Advantage: Provides high availability and durability guarantee when some of the replicas are not available

Problem: Recovering from permanent failures
  Technique: Anti-entropy using Merkle trees
  Advantage: Synchronizes divergent replicas in the background

Problem: Membership and failure detection
  Technique: Gossip-based membership protocol and failure detection
  Advantage: Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

10

DeCandia, G., et al. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on
Operating Systems Principles (Stevenson, Washington, USA, October 14–17, 2007). SOSP ’07. ACM, 205–220.
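The partitioning row above relies on consistent hashing. Below is a minimal sketch of a hash ring with virtual nodes; it omits Dynamo's replication, preference lists, and hinted handoff, and the node names are made up for the example.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hashing ring in the spirit of Dynamo's partitioning scheme."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets several virtual points on the ring to smooth the load.
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's hash to the first virtual node.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for k in ["cart:17", "cart:18", "session:9f3"]:
    print(k, "->", ring.node_for(k))
```

Adding or removing a node only remaps the keys adjacent to its virtual points, which is what gives the "incremental scalability" advantage in the table.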
NO-SQL DATA STORES

11

http://nosqlpedia.com/wiki/Survey_distributed_databases
GOOGLE DISK FARM
[Figure: Google's disk farm hardware – early days vs. today]

12
MOTIVATION
 Need for a scalable DFS
 Large distributed data-intensive applications

 High data processing needs

 Performance, Reliability, Scalability and Availability

 More than traditional DFS

13
ASSUMPTIONS –
ENVIRONMENT
 Commodity Hardware
 inexpensive

 Component Failure
 the norm rather than the exception

 TBs of Space
 must support TBs of space

14
DESIGN

 Design factors
 Failures are common (built from inexpensive commodity components)
 Files
 large (multi-GB)
 mutation principally via appending new data
 low-overhead atomicity essential
 Co-design applications and file system API
 Sustained bandwidth more critical than low latency

 File structure
 Files are divided into 64 MB chunks
 Each chunk is identified by a 64-bit handle
 Chunks are replicated (default: 3 replicas)
 Chunks are divided into 64 KB blocks
 Each block has a 32-bit checksum
(A short sketch of this chunk/block arithmetic follows this slide.)
15
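A small sketch of the chunk and block arithmetic above: mapping a byte offset to a chunk index and computing a 32-bit checksum per 64 KB block. CRC32 is used here only for illustration; the slide states that each block carries a 32-bit checksum, not which algorithm GFS uses.

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in the design above
BLOCK_SIZE = 64 * 1024          # 64 KB blocks, each protected by a 32-bit checksum

def chunk_index(file_offset: int) -> int:
    """Which chunk a byte offset falls into (computed client-side before asking
    the master for that chunk's handle and replica locations)."""
    return file_offset // CHUNK_SIZE

def block_checksums(chunk_data: bytes):
    """One 32-bit CRC per 64 KB block; a sketch of a chunkserver-side integrity check."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

print(chunk_index(200 * 1024 * 1024))              # offset 200 MB -> chunk index 3
print(len(block_checksums(b"x" * (256 * 1024))))   # 256 KB of data -> 4 block checksums
```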
ARCHITECTURE

[Figure: GFS architecture – clients exchange metadata with the master and read/write data directly from chunkservers]

 Master
 Manages namespace/metadata
 Manages chunk creation, replication, placement
 Performs snapshot operation to create duplicate of file or directory
tree
 Performs checkpointing and logging of changes to metadata

 Chunkservers
 Stores chunk data and checksum for each block
 On startup/failure recovery, reports chunks to master
 Periodically reports a subset of its chunks to the master (to detect chunks
that are no longer needed)
16
ARCHITECTURE

 Contact single master


 Obtain chunk locations
 Contact one of chunkservers
 Obtain data
17
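A hedged sketch of the read path on this slide: the client asks the single master only for chunk locations (metadata) and then reads the data from a chunkserver. The classes and method names below are hypothetical stand-ins, not the real GFS or HDFS client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class FakeMaster:
    """Hypothetical stand-in for the GFS master: serves metadata only."""
    def __init__(self, table):
        self._table = table  # (filename, chunk index) -> (chunk handle, replica names)
    def lookup(self, filename, chunk_index):
        return self._table[(filename, chunk_index)]

class FakeChunkserver:
    """Hypothetical stand-in for a chunkserver: holds the chunk bytes."""
    def __init__(self, chunks):
        self._chunks = chunks  # chunk handle -> bytes
    def read_chunk(self, handle, offset, length):
        return self._chunks[handle][offset:offset + length]

def read(master, chunkservers, filename, offset, length):
    index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(filename, index)   # 1. metadata from the master
    server = chunkservers[replicas[0]]                   # 2. pick one replica (often the closest)
    return server.read_chunk(handle, offset % CHUNK_SIZE, length)  # 3. data from a chunkserver

cs = FakeChunkserver({"h1": b"hello gfs"})
master = FakeMaster({("/data/log", 0): ("h1", ["cs-1"])})
print(read(master, {"cs-1": cs}, "/data/log", 6, 3))     # b'gfs'
```

The point of the split is that the master never sits on the data path, which keeps a single master from becoming a throughput bottleneck.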
MASTER
 Metadata
 Three types
 File & chunk namespaces
 Mapping from files to chunks
 Locations of chunks’ replicas
 Replicated on multiple remote machines
 Kept in memory

 Operations
 Replica placement
 New chunk and replica creation
 Load balancing
 Unused storage reclamation
IMPLEMENTATION –
CONSISTENCY MODEL
 Relaxed consistency model
 Two types of mutations
 Writes
 Cause data to be written at an application-specified file offset
 Record appends
 Operations that append data to a file
 Cause data to be appended atomically at least once
 Offset chosen by GFS, not by the client

 States of a file region after a mutation


 Consistent
 All clients see the same data, regardless which replicas they read from
 Defined
 consistent + all clients see what the mutation writes in its entirety
 Undefined
 consistent, but it may not reflect what any one mutation has written
 Inconsistent
 Clients see different data at different times

19
IMPLEMENTATION –
LEASES AND MUTATION ORDER
 Master uses leases to maintain a consistent mutation order among
replicas

 Primary is the chunkserver who is granted a chunk lease

 All others containing replicas are secondaries

 Primary defines a mutation order between mutations

 All secondaries follow this order

20
MUTATION OPERATIONS
 Primary replica
 Holds lease assigned by master (60 sec. default)
 Assigns serial order for all mutation operations
performed on replicas

 Write operation
 1-2: client obtains replica locations and identity of
primary replica
 3: client pushes data to replicas (stored in LRU
buffer by chunk servers holding replicas)
 4: client issues update request to primary
 5: primary forwards/performs write request
 6: primary receives replies from replicas
 7: primary replies to client

 Record append operation


 Performed atomically (one byte sequence)
 At-least-once semantics
 Append location chosen by GFS and returned to client
 Extension to step 5:
 If record fits in current chunk: write record and tell replicas the offset
 If record exceeds chunk: pad the chunk, reply to client to use next chunk

21
CONSISTENCY GUARANTEES
[Figure: mutation outcomes across primary and replicas – defined, consistent (but undefined), and inconsistent regions]


 Write
 Concurrent writes may be consistent but undefined
 Write operations that are large or cross chunk boundaries
are subdivided by client into individual writes
 Concurrent writes may become interleaved
 Record append
 Atomically, at-least-once semantics
 Client retries failed operation
 After successful retry, replicas are defined
in region of append but may have
intervening undefined regions
 Application safeguards
 Use record append rather than write
 Insert checksums in record headers to detect fragments
 Insert sequence numbers to detect duplicates

22
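The application safeguards listed above (checksums in record headers, sequence numbers) can be sketched as a small framing format. The format below is invented purely for illustration: a CRC lets a reader skip padding or fragments, and the sequence number lets it drop duplicates produced by at-least-once record appends.

```python
import json
import zlib

def frame(seq, payload: bytes) -> bytes:
    # Invented framing: 4-byte header length, JSON header, then the payload.
    header = json.dumps({"seq": seq, "len": len(payload),
                         "crc": zlib.crc32(payload)}).encode()
    return len(header).to_bytes(4, "big") + header + payload

def parse(buf: bytes, seen):
    hlen = int.from_bytes(buf[:4], "big")
    header = json.loads(buf[4:4 + hlen])
    payload = buf[4 + hlen:4 + hlen + header["len"]]
    if zlib.crc32(payload) != header["crc"]:
        return None          # corrupted fragment or padding: skip it
    if header["seq"] in seen:
        return None          # duplicate from a retried append: skip it
    seen.add(header["seq"])
    return payload

seen = set()
rec = frame(1, b"event-42")
print(parse(rec, seen), parse(rec, seen))   # b'event-42' None
```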
METADATA MANAGEMENT
Logical structure:

pathname          lock    chunk list
/home             read    Chunk4400488, …
/save                     Chunk8ffe07783, …
/home/user/foo    write   Chunk6254ee0, …
/home/user        read    Chunk88f703, …
 Namespace
 Logically a mapping from pathname to chunk list
 Allows concurrent file creation in same directory
 Read/write locks prevent conflicting operations (see the locking sketch after this slide)
 File deletion by renaming to a hidden name; removed during regular scan
 Operation log
 Historical record of metadata changes
 Kept on multiple remote machines
 Checkpoint created when log exceeds threshold
 When checkpointing, switch to new log and create checkpoint in separate thread
 Recovery made from most recent checkpoint and subsequent log
 Snapshot
 Revokes leases on chunks in file/directory
 Log operation
 Duplicate metadata (not the chunks!) for the source
 On first client write to chunk:
 Required for client to gain access to chunk
 Reference count > 1 indicates a duplicated chunk
 Create a new chunk and update chunk list for duplicate

23
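A minimal sketch of the namespace locking scheme above: an operation takes read locks on every ancestor path and a read or write lock on the leaf, so two creations in the same directory only share read locks and therefore do not conflict. The helper below just enumerates the locks an operation would need; actually acquiring them is left out.

```python
def locks_needed(pathname, write=False):
    """Return the (path, lock mode) pairs a namespace operation must hold."""
    parts = pathname.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf_mode = "write" if write else "read"
    return [(p, "read") for p in ancestors] + [(pathname, leaf_mode)]

# Creating /home/user/foo: read locks on /home and /home/user, write lock on the leaf.
print(locks_needed("/home/user/foo", write=True))
# A concurrent creation of /home/user/bar shares only read locks on the prefix,
# so the two operations can proceed at the same time.
print(locks_needed("/home/user/bar", write=True))
```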
CHUNK/REPLICA MANAGEMENT
 Placement
 On chunkservers with below-average disk space utilization
 Limit number of “recent” creations on a chunkserver (since access traffic will follow)
 Spread replicas across racks (for reliability)

 Reclamation
 Chunks become garbage when the file of which they are a part is deleted
 A lazy strategy (garbage collection) is used: no attempt is made to reclaim chunks at the time of deletion
 In periodic “HeartBeat” messages, each chunkserver reports a subset of its current chunks to the master
 Master identifies which reported chunks are no longer accessible (i.e., are garbage)
 Chunkserver reclaims garbage chunks

 Stale replica detection


 Master assigns a version number to each chunk/replica
 Version number incremented each time a lease is granted
 Replicas on failed chunkservers will not have the current version number
 Stale replicas removed as part of garbage collection

24
PERFORMANCE

25
BENEFITS AND LIMITATIONS
 Simple design with single master
 Fault tolerance

 Custom designed

 Only viable in a specific environment

 Limited security

26
SECTOR

27
File system comparison: GFS/HDFS vs. Lustre vs. Sector

Architecture
  GFS/HDFS: cluster-based, asymmetric, parallel
  Lustre: cluster-based, asymmetric, parallel
  Sector: cluster-based, asymmetric, parallel

Communication
  GFS/HDFS: RPC/TCP
  Lustre: network independence
  Sector: UDT

Naming
  GFS/HDFS: central metadata server
  Lustre: central metadata server
  Sector: multiple metadata masters

Synchronization
  GFS/HDFS: write-once-read-many, locks on object leases
  Lustre: hybrid locking mechanism using leases, distributed lock manager
  Sector: general purpose I/O

Consistency and replication
  GFS/HDFS: server-side replication, asynchronous replication, checksum
  Lustre: server-side metadata replication, client-side caching, checksum
  Sector: server-side replication

Fault tolerance
  GFS/HDFS: failure as norm
  Lustre: failure as exception
  Sector: failure as norm

Security
  GFS/HDFS: N/A
  Lustre: authentication, authorization
  Sector: security-server-based authentication, authorization
28
DATA INTENSIVE PARALLEL
PROCESSING FRAMEWORKS
29
MAPREDUCE
 General purpose massive data analysis in brittle
environments
 Commodity clusters
 Clouds

 Efficiency, Scalability, Redundancy, Load Balance, Fault


Tolerance
 Apache Hadoop
 HDFS

 Microsoft DryadLINQ

30
EXECUTION OVERVIEW

31

Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
MAPREDUCE
1. The MapReduce library in the user program first shards the input files
into M pieces of typically 16 megabytes to 64 megabytes (MB) per
piece. It then starts up many copies of the program on a cluster of
machines.
2. One of the copies of the program is special: the master. The rest are
workers that are assigned work by the master. There are M map tasks
and R reduce tasks to assign. The master picks idle workers and assigns
each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the
corresponding input shard. It parses key/value pairs out of the input data
and passes each pair to the user-defined Map function. The intermediate
key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into
R regions by the partitioning function. The locations of these buffered
pairs on the local disk are passed back to the master, who is responsible
for forwarding these locations to the reduce workers.
32


MAPREDUCE
5. When a reduce worker is notified by the master about these
locations, it uses remote procedure calls to read the buffered data
from the local disks of the map workers. When a reduce worker has
read all intermediate data, it sorts it by the intermediate keys so that
all occurrences of the same key are grouped together. If the amount
of intermediate data is too large to fit in memory, an external sort is
used.
6. The reduce worker iterates over the sorted intermediate data and for
each unique intermediate key encountered, it passes the key and the
corresponding set of intermediate values to the user's Reduce
function. The output of the Reduce function is appended to a final
output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the
master wakes up the user program. At this point, the MapReduce call in the
user program returns back to the user code.
33
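The following is a tiny, single-process sketch of steps 1–7: shard the input, run one map task per shard, partition the intermediate pairs into R regions, then sort each region by key and reduce. It ignores distribution, local disks, and fault tolerance, and reuses the word count example from the next slides.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=2):
    # "Map phase": one map task per input shard, buffering intermediate pairs
    # into R partitions (the role of the partitioning function).
    partitions = [defaultdict(list) for _ in range(R)]
    for shard in inputs:
        for key, value in map_fn(shard):
            partitions[hash(key) % R][key].append(value)
    # "Reduce phase": each reduce task sorts its region by key and calls the
    # user's reduce function once per unique key.
    output = {}
    for region in partitions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output

def word_map(line):
    for word in line.split():
        yield word, 1

def word_reduce(word, counts):
    return sum(counts)

print(run_mapreduce(["foo car bar", "foo bar foo", "car car car"], word_map, word_reduce))
# e.g. {'bar': 2, 'car': 4, 'foo': 3} (key order may vary)
```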
WORD COUNT
Input:
  foo car bar
  foo bar foo
  car car car

Mapping (one (word, 1) pair per word):
  (foo,1) (car,1) (bar,1)
  (foo,1) (bar,1) (foo,1)
  (car,1) (car,1) (car,1)

Shuffling (group by key):
  foo: 1, 1, 1
  bar: 1, 1
  car: 1, 1, 1, 1

Reducing (sum per key):
  foo, 3
  bar, 2
  car, 4

34
WORD COUNT
Input:
  foo car bar
  foo bar foo
  car car car

Mapping:
  (foo,1) (car,1) (bar,1)
  (foo,1) (bar,1) (foo,1)
  (car,1) (car,1) (car,1)

Shuffling and sorting (grouped and sorted by key):
  bar, <1, 1>
  car, <1, 1, 1, 1>
  foo, <1, 1, 1>

Reducing:
  bar, 2
  car, 4
  foo, 3

35
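The same word count can be run under Hadoop Streaming, which pipes input lines to a mapper on stdin and the sorted intermediate pairs to a reducer, both as plain scripts. The sketch below is hedged: the single-file map/reduce switch and the file name are conventions chosen for this example, not anything mandated by Hadoop.

```python
#!/usr/bin/env python
# wordcount.py (hypothetical name) - Hadoop Streaming style mapper and reducer.
# Streaming feeds lines on stdin and collects tab-separated key/value pairs on
# stdout; the framework performs the shuffle/sort between the two phases.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    current, total = None, 0
    for line in sys.stdin:              # reducer input arrives already sorted by key
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Locally it can be smoke-tested with a shell pipeline such as `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where the `sort` step mimics the framework's shuffle/sort.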
HADOOP & DRYADLINQ
Apache Hadoop
[Figure: master node with Job Tracker and Name Node; data/compute nodes run map (M) and
reduce (R) tasks over replicated data blocks stored in HDFS]
 Apache implementation of Google's MapReduce
 The Hadoop Distributed File System (HDFS) manages the data
 Map/Reduce tasks are scheduled based on data locality in HDFS (replicated data blocks)

Microsoft DryadLINQ
[Figure: standard LINQ and DryadLINQ operations are translated by the DryadLINQ compiler into a
Directed Acyclic Graph (DAG) based execution flow, run by the Dryad execution engine; a vertex is
an execution task, an edge is a communication path]
 Dryad processes the DAG, executing vertices on compute clusters
 LINQ provides a query interface for structured data
 Provides Hash, Range, and Round-Robin partition patterns

Both provide job creation, resource management, and fault tolerance through re-execution of failed tasks/vertices.
36

Judy Qiu, “Cloud Technologies and Their Applications”, Indiana University Bloomington, March 26, 2010
Feature comparison: programming model / data storage / communication / scheduling & load balancing

Hadoop
  Programming model: MapReduce
  Data storage: HDFS
  Communication: TCP
  Scheduling & load balancing: data locality, rack-aware dynamic task scheduling through a global queue, natural load balancing

Dryad
  Programming model: DAG based execution flows
  Data storage: Windows shared directories (Cosmos)
  Communication: shared files / TCP pipes / shared memory FIFO
  Scheduling & load balancing: data locality / network-topology-based run-time graph optimizations, static scheduling

Twister
  Programming model: Iterative MapReduce
  Data storage: shared file system / local disks
  Communication: Content Distribution Network / direct TCP
  Scheduling & load balancing: data locality based static scheduling

MapReduceRoles4Azure
  Programming model: MapReduce
  Data storage: Azure Blob Storage
  Communication: TCP through Azure Blob Storage (direct TCP)
  Scheduling & load balancing: dynamic scheduling through a global queue, good natural load balancing

MPI
  Programming model: variety of topologies
  Data storage: shared file systems
  Communication: low latency communication channels
  Scheduling & load balancing: available processing capabilities / user controlled
37
Feature comparison: failure handling / monitoring / language support / execution environment

Hadoop
  Failure handling: re-execution of map and reduce tasks
  Monitoring: web-based monitoring UI, API
  Language support: Java; executables are supported via Hadoop Streaming; PigLatin
  Execution environment: Linux cluster, Amazon Elastic MapReduce, FutureGrid

Dryad
  Failure handling: re-execution of vertices
  Monitoring: monitoring support for execution graphs
  Language support: C# + LINQ (through DryadLINQ)
  Execution environment: Windows HPCS cluster

Twister
  Failure handling: re-execution of iterations
  Monitoring: API to monitor the progress of jobs
  Language support: Java; executables via Java wrappers
  Execution environment: Linux cluster, FutureGrid

MapReduceRoles4Azure
  Failure handling: re-execution of map and reduce tasks
  Monitoring: API, web-based monitoring UI
  Language support: C#
  Execution environment: Windows Azure Compute, Windows Azure Local Development Fabric

MPI
  Failure handling: program-level checkpointing
  Monitoring: minimal support for task-level monitoring
  Language support: C, C++, Fortran, Java, C#
  Execution environment: Linux/Windows cluster
38
INHOMOGENEOUS DATA
PERFORMANCE
[Figure: Randomly distributed inhomogeneous data – mean: 400, dataset size: 10000.
X-axis: standard deviation (0–300); Y-axis: time (s), roughly 1500–1900;
series: DryadLINQ SWG, Hadoop SWG, Hadoop SWG on VM.]

Inhomogeneity of the data does not have a significant effect when the sequence
lengths are randomly distributed.
39
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataPlex (32 nodes)
INHOMOGENEOUS DATA PERFORMANCE
[Figure: Skewed distributed inhomogeneous data – mean: 400, dataset size: 10000.
X-axis: standard deviation (0–300); Y-axis: total time (s), roughly 0–6,000;
series: DryadLINQ SWG, Hadoop SWG, Hadoop SWG on VM.]

This shows the natural load balancing of Hadoop MapReduce's dynamic task
assignment using a global pipeline, in contrast to DryadLINQ's static
assignment.
40
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataPlex (32 nodes)
MAPREDUCEROLES4AZURE

41
SEQUENCE ASSEMBLY PERFORMANCE

42
OTHER ABSTRACTIONS
 All-pairs
 DAG
 Wavefront

43
APPLICATIONS
44
APPLICATION CATEGORIES
1. Synchronous
 Easiest to parallelize. E.g., SIMD
2. Asynchronous
 Entities evolve dynamically in time and with different evolution algorithms.
3. Loosely Synchronous
 Middle ground: dynamically evolving members, synchronized now and then.
E.g., iterative MapReduce
4. Pleasingly Parallel
5. Meta problems
45

GC Fox, et al. Parallel Computing Works. http://www.netlib.org/utk/lsi/pcwLSI/text/node25.html#props


APPLICATIONS

 Bioinformatics
 Sequence alignment
 SmithWaterman-GOTOH (SW-G) all-pairs alignment
 Sequence assembly
 Cap3
 CloudBurst
 Data mining
 MDS, GTM & interpolations

[Figure: all-pairs block decomposition for SW-G alignment – sequences are split into row/column
blocks (1–100, 101–200, 201–300, 301–400, …, N); map tasks M1, M2, …, M(N*(N+1)/2) compute the
block-pair alignments, and Reduce task i collects row block i into hdfs://.../rowblock_i.out.
A task-enumeration sketch follows this slide.]

46
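A small sketch of how the block-pair tasks in the figure above can be enumerated: with N row/column blocks, the symmetric all-pairs computation needs N*(N+1)/2 block-pair map tasks (only the lower triangle is computed) and one reduce output per row block. The output naming simply reuses the elided hdfs://.../rowblock_<i>.out pattern from the figure.

```python
def all_pairs_tasks(num_blocks):
    """Enumerate (row block, column block) map tasks and per-row reduce outputs."""
    map_tasks = [(i, j) for i in range(1, num_blocks + 1) for j in range(1, i + 1)]
    reduce_outputs = {i: f"hdfs://.../rowblock_{i}.out" for i in range(1, num_blocks + 1)}
    return map_tasks, reduce_outputs

tasks, outputs = all_pairs_tasks(4)
print(len(tasks))   # 10 block pairs = 4*5/2
print(tasks)        # [(1, 1), (2, 1), (2, 2), (3, 1), ...]
print(outputs[2])   # hdfs://.../rowblock_2.out
```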
WORKFLOWS
 Represent and manage complex distributed scientific
computations
 Composition and representation
 Mapping to resources (data as well as compute)
 Execution and provenance capturing

 Type of workflows
 Sequence of tasks, DAGs, cyclic graphs, hierarchical
workflows (workflows of workflows)
 Data Flows vs Control flows
 Interactive workflows

47
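As a minimal illustration of DAG-style workflows, the sketch below declares tasks with data dependencies and runs them in topological order. The task names are invented; real workflow systems such as Pegasus, DAGMan, or LEAD add resource mapping, data staging, provenance capture, and retry on failure on top of this core idea.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Task -> set of tasks it depends on (a DAG, not a cyclic graph).
workflow = {
    "fetch_reads": set(),
    "fetch_reference": set(),
    "align": {"fetch_reads", "fetch_reference"},
    "assemble": {"align"},
    "publish": {"assemble"},
}

def run(task_name):
    print("running", task_name)   # placeholder for submitting the task to a resource

# Execute tasks in an order that respects every dependency edge.
for task in TopologicalSorter(workflow).static_order():
    run(task)
```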
LEAD – LINKED ENVIRONMENTS FOR
DYNAMIC DISCOVERY
 Based on WS-BPEL and
SOA infrastructure

48
PEGASUS AND DAGMAN
 Pegasus
 Resource and data discovery
 Maps computation to resources
 Orchestrates data transfers
 Publishes results
 Graph optimizations

 DAGMan
 Submits tasks to execution resources
 Monitors the execution
 Retries in case of failure
 Maintains dependencies
49
CONCLUSION
 Scientific analysis is moving more and more towards Clouds
and related technologies
 Many cutting-edge technologies are available in industry that we can
use to facilitate data-intensive computing.

 Motivation
 Developing easy-to-use efficient software frameworks to
facilitate data intensive computing

50
 Thank You !!!

51
