BDS Session 2
Janardhanan PS
[email protected]
Context
Topics for today
Serial Computing
• Software written for serial computation:
✓ A problem is broken into a discrete series of instructions
✓ Instructions are executed sequentially one after another
✓ Executed on a single processor
✓ Only one instruction may execute at any moment in time
✓ A single data store - memory and disk
Parallel Computing
Spectrum of Parallelism
Pseudo-parallel (intra-processor) → Parallel → Distributed
Multi-processor Vs Multi-computer systems
[Diagrams: UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access) architectures]
Interconnection Networks
Classification based on Instruction and Data Parallelism
• The term 'stream' refers to a sequence or flow of either instructions or data operated on by the computer.
• In the complete cycle of instruction execution, a flow of instructions from main memory to the CPU is established. This flow of instructions is called the instruction stream.
• Similarly, operands flow bi-directionally between the processor and memory. This flow of operands is called the data stream.
Flynn’s Taxonomy
                         Instruction Streams
                         Single                         Multiple
Data Streams   Single    SISD                           MISD
                         (uniprocessors, pipelining)    (uncommon; fault tolerance)
               Multiple  SIMD                           MIMD
(A SISD vs SIMD code sketch follows.)
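As an illustration of the SISD and SIMD cells above, here is a minimal sketch; NumPy is an assumed dependency, not something the slides prescribe. The list comprehension processes one data item per instruction, while the NumPy expression applies one operation across the whole data stream.

```python
# A minimal sketch contrasting SISD-style and SIMD-style execution.
import numpy as np

a = list(range(1_000_000))

# SISD flavour: a single instruction stream touching one data item at a time.
out = [x * 2 for x in a]

# SIMD flavour: one operation applied across a whole data stream at once;
# NumPy dispatches to vectorized (often hardware-SIMD) kernels.
out_vec = np.asarray(a) * 2

assert out_vec[123] == out[123]
```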
Some basic concepts (esp. for programming in Big Data Systems)
» Coupling
» Tight - SIMD, MISD shared memory systems
» Loose - NOW, distributed systems, no shared memory
» Speedup
» how much faster can a program run when given N processors as opposed to 1 processor — T(1) / T(N)
» We will study Amdahl’s Law and Gustafson’s Law
» Parallelism of a program
» Compare time spent in computations to time spent in communication via shared memory or message passing
» Granularity
» Average number of compute instructions executed before communication is needed across processors
» Note:
» If granularity is coarse, use distributed systems; else use tightly coupled multi-processors/computers
» Potentially high parallelism does not lead to high speedup if granularity is too small, as overheads become high (see the sketch after this list)
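A minimal sketch of the speedup and granularity bookkeeping above, using assumed timings, and using a compute-to-communication time ratio as a stand-in for the instruction-count definition of granularity:

```python
# Illustrative numbers only - not measurements from any real system.
def speedup(t1: float, tn: float) -> float:
    """T(1) / T(N): how much faster with N processors than with 1."""
    return t1 / tn

# Granularity proxy: time computing vs. time communicating per phase.
compute_per_phase = 50e-3    # 50 ms of computation per phase (assumed)
comm_per_phase = 20e-3       # 20 ms of communication per phase (assumed)
granularity = compute_per_phase / comm_per_phase

print(speedup(100.0, 12.5))  # 8.0x, e.g. for an assumed run on N processors
print(granularity)           # 2.5 - fairly fine-grained: communication
                             # overhead will limit how far speedup can go
```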
Comparing Parallel and Distributed Systems
Distributed System                                        | Parallel System
Independent, autonomous systems connected in a            | Computer system with several processing units
network, accomplishing specific tasks                     | attached to it
Coordination is possible between connected                | A common shared memory can be directly accessed
computers, each with its own memory and CPU               | by every processing unit
Loose coupling of computers connected in a network,       | Tight coupling of processing resources, used for
providing access to data and remotely located resources   | solving a single, complex problem
Programs have coarse-grain parallelism                    | Programs may demand fine-grain parallelism
Motivation for parallel / distributed systems (1)
• Inherently distributed applications
• e.g. a financial transaction involving 2 or more parties
• Better scale by creating multiple smaller parallel tasks instead of one complex task
• e.g. evaluate an aggregate over 6 months of data
• Processors getting cheaper and networks faster
• e.g. processor speed 2x / 1.5 years, network traffic 2x / year, processors limited by energy consumption
• Better scale using replication or partitioning of storage
• e.g. replicated media servers for faster access, or replicated/partitioned shards in search engines
• Access to shared remote resources
• e.g. a remote central DB
• Increased performance/cost ratio compared to special parallel systems
• e.g. a search engine running on a Network-of-Workstations
Motivation for parallel / distributed systems (2)
• Better reliability, because there is less chance of multiple failures across cluster nodes
• Be careful about Integrity: consistent state of a resource across concurrent access
• Incremental scalability
• Add more nodes in a cluster to scale up (resize the cluster)
• e.g. clusters in Cloud services, autoscaling in AWS
• Offload computing closer to the user for scalability and better resource usage
• e.g. Edge computing servers
Example: Netflix
reference: https://medium.com/refraction-tech-everything/how-netflix-works-the-hugely-simplified-complex-stuff-that-happens-every-time-you-hit-play-3a40c9be254b
Distributed network of content caching servers
This would be a P2P network if you were using BitTorrent for free
Examples
Techniques for High Volume Data Processing
Method: Cluster computing
Description: A collection of computers, homogeneous or heterogeneous, using commodity components, running open source or proprietary software, communicating via message passing
Usage: Commonly used in Big Data Systems, such as Hadoop

Method: Massively Parallel Processing (MPP)
Description: Typically proprietary Distributed Shared Memory machines with integrated storage, e.g. EMC Greenplum (PostgreSQL on an MPP)
Usage: May be used in traditional Data Warehouses, Data processing appliances

Method: High-Performance Computing (HPC)
Description: Known to offer high performance and scalability by using in-memory computing
Usage: Used to develop specialty and custom scientific applications for research, where results are more valuable than cost
Topics for today
Limits of Parallelism
• A parallel program has some sequential / serial code and significant parallelized code
Amdahl’s Law
For a program with serial fraction f running on N processors, the speedup is S(N) = T(1) / T(N) = 1 / (f + (1 - f) / N); as N grows, S(N) approaches 1/f.
Amdahl’s Law - Example
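The original example's numbers are not preserved here, so the sketch below uses assumed values (f = 10% serial): the speedup creeps toward, but never exceeds, the 1/f cap.

```python
# Amdahl's Law with assumed example numbers (f = 10% serial fraction).
def amdahl_speedup(f: float, n: int) -> float:
    return 1.0 / (f + (1.0 - f) / n)    # S(N) = 1 / (f + (1-f)/N)

f = 0.10
for n in (1, 4, 16, 64, 256):
    print(n, round(amdahl_speedup(f, n), 2))
# 1 1.0, 4 3.08, 16 6.4, 64 8.77, 256 9.66 - creeping toward the 1/f = 10 cap
```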
Limitations in speedup
Super-linear speedup
Partitioning a data-parallel program to run across multiple processors may lead to a better cache hit rate.*
Possible in some workloads without much other overhead, esp. in embarrassingly parallel applications.
Why Amdahl’s Law is such bad news
Speedup plot
[Plot: Speedup for 1, 4, 16, 64, and 256 processors; T1 / TN = 1 / (f + (1-f)/N); x-axis: serial fraction f from 0.00% to 26.00%, y-axis: speedup from 0 to 256]
But wait - maybe we are missing something
Gustafson-Barsis Law
Let W be the execution workload of the program before adding resources, and let f be the sequential fraction of the workload.
So W = f * W + (1 - f) * W
Let W(N) be the larger execution workload after adding N processors: the parallelizable work can increase N times.
So W(N) = f * W + N * (1 - f) * W
The theoretical speedup in latency of the whole task, executed in the same fixed time T:
S(N) = (T * W(N)) / (T * W) = W(N) / W = (f * W + N * (1 - f) * W) / W
S(N) = f + (1 - f) * N
S(N) is not limited by f as N scales.
(Remember this when we discuss programming in Session 5. A comparison sketch follows.)
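A minimal sketch contrasting the two laws with the same assumed serial fraction f = 10%: under Gustafson-Barsis the problem size grows with N, so the speedup is not capped at 1/f.

```python
# Gustafson-Barsis with the same assumed f = 10%; workload grows with N.
def gustafson_speedup(f: float, n: int) -> float:
    return f + (1.0 - f) * n            # S(N) = f + (1-f) * N

f = 0.10
for n in (4, 16, 64, 256):
    print(n, round(gustafson_speedup(f, n), 2))
# 4 3.7, 16 14.5, 64 57.7, 256 230.5 - unlike Amdahl, not limited by 1/f
```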
Topics for today
Memory access models
» Shared memory
» Multiple tasks on different processors access a common address space, in UMA or NUMA architectures
» Conceptually easier for programmers
» Think of writing a voting algorithm - it is trivial because everyone is in the same room, i.e. writing the same variable (see the sketch below)
[Diagram: processors P, P, P attached to one shared memory]
» Distributed memory
» Multiple tasks – executing a single program – access data from separate (and isolated) address spaces (i.e. separate virtual memories)
» How will this remote access happen?
[Diagram: processor/memory (P/M) nodes connected over a network]
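A minimal sketch of the shared-memory model in the voting spirit above, using Python threads; the names (votes, cast_vote) and the lock discipline are illustrative, not from the slides.

```python
# Several threads "vote" by updating one variable in a shared address space.
import threading

votes = 0                      # shared variable, visible to every thread
lock = threading.Lock()        # guards concurrent access (integrity!)

def cast_vote(n_votes: int) -> None:
    global votes
    for _ in range(n_votes):
        with lock:             # without the lock, updates can be lost
            votes += 1

threads = [threading.Thread(target=cast_vote, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(votes)                   # 4000: every thread wrote the same variable
```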
Shared Memory Model: Implications for Architecture
Distributed memory and message passing
• In a Distributed Memory model, data has to be moved across Virtual Memories:
✓ i.e. a data item in VMem1 produced by task T1 has to be “communicated” to task T2 so that
✓ T2 can make a copy of the same in VMem2 and use it.
[Diagram: processors P1 and P2 with separate memories M1 and M2]
Computing model for message passing
Communication model for message passing
Process 1: send local variable X, tagged with a message id, as a message
Process 2: receive the message with that id and store it as local variable Y
(A runnable sketch follows.)
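A minimal sketch of this send/receive exchange using Python's multiprocessing.Queue as the message channel; the message id and variable names mirror the slide, but the API choice is an assumption.

```python
# Process 1 sends its local X as a tagged message; process 2 copies it into Y.
from multiprocessing import Process, Queue

def process_1(q: Queue) -> None:
    X = 42                         # local variable in process 1's address space
    q.put(("msg-1", X))            # send X as a message tagged with an id

def process_2(q: Queue) -> None:
    msg_id, Y = q.get()            # receive the message and copy into local Y
    print(msg_id, Y)               # Y is process 2's own copy of the data

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=process_1, args=(q,))
    p2 = Process(target=process_2, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()
```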
Distributed Memory Model: Implications for Architecture
* One can create a shared memory view using message passing and vice versa
Message Passing Model – Separate Address Spaces
• Use of separate address spaces complicates programming
• But this complication is usually restricted to one or two phases:
✓ Partitioning the input data
✓ Improves locality - computation closer to data
✓ Each process is enabled to access data from within its address space, which in turn is likely to be mapped to the memory hierarchy of the processor in which the process is running
✓ Merging / Collecting the output data
✓ This is required if each task is producing outputs that have to be combined
[Diagram: data items X, Y, Z partitioned across Processors A, B, C; outputs X’, Y’, Z’ collected on Processor A]
Remember granularity?
We will see the example of Hadoop map-reduce, where data is partitioned, outputs are communicated over messages, and merged to get the final answer.
Message Passing Primitives
» Important: A message can be received in the OS buffer but may not have been delivered to the application buffer. This is where a distributed message ordering logic can come in.

1. Send, Sync, Blocking - Returns only after data is sent from the kernel buffer. Easiest to program but longest wait.
2. Send, Async, Blocking - Returns after data is copied to the kernel buffer but not yet sent. A handle is returned to check send status.
3. Send, Sync, Non-blocking - Same as (2) but with no handle.
4. Send, Async, Non-blocking - Returns immediately with a handle. Complex to program but minimum wait.
5. Receive, Sync, Blocking - Returns after the application gets the data.
6. Receive, Sync, Non-blocking - Returns immediately with the data or a handle to check status. More efficient.

(A blocking vs non-blocking receive sketch follows.)
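A minimal sketch of the blocking vs non-blocking receive distinction (rows 5 and 6), using a local socket pair so it is self-contained; real distributed systems would use network sockets or an MPI-style library.

```python
# Blocking vs non-blocking receive on a local socket pair.
import socket

a, b = socket.socketpair()
a.sendall(b"hello")

# Blocking receive: returns only once data is available to the application.
data = b.recv(16)
print(data)

# Non-blocking receive: returns immediately; if nothing has been delivered
# yet, the call raises BlockingIOError instead of waiting.
b.setblocking(False)
try:
    more = b.recv(16)
except BlockingIOError:
    more = None                 # nothing delivered yet - check again later
print(more)
```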
Topics for today
Data Access Strategies: Partition
• Strategy:
✓ Partition data – typically, equally – across the nodes of the (distributed) system (a hash-partition sketch follows)
• Cost:
✓ Network access and merge cost when a query needs to go across partitions
• Advantage(s):
✓ Works well if the task/algorithm is (mostly) data parallel
✓ Works well when there is Locality of Reference within a partition
• Concerns
✓ Merging data fetched from multiple partitions
✓ Partition balancing
✓ Row vs Columnar layouts - which improves locality of reference?
✓ Will study shards and partitions in Hadoop and MongoDB
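A minimal sketch of the partition strategy, hash-partitioning assumed records across simulated nodes; N_NODES, partition(), and the sample data are illustrative only, not any particular system's API.

```python
import zlib

N_NODES = 4
nodes = [dict() for _ in range(N_NODES)]      # one store per simulated node

def partition(key: str) -> int:
    """Stable hash of the key -> partition (node) index."""
    return zlib.crc32(key.encode()) % N_NODES

for key, value in [("alice", 1), ("bob", 2), ("carol", 3)]:
    nodes[partition(key)][key] = value        # each record lives on one node

# A single-key query touches one partition; an aggregate must visit all
# partitions and merge the results - the cross-partition cost noted above.
print(nodes[partition("bob")]["bob"])
print(sum(sum(n.values()) for n in nodes))
```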
Data Access Strategies: Replication
• Strategy:
✓ Replicate all data across nodes of the (distributed) system
• Cost:
✓ Higher storage cost
• Advantage(s):
✓ All data accessed from local disk: no (runtime) communication on the network
✓ High performance with parallel access
✓ Fail over across replicas
• Concerns
✓ Keep replicas in sync — various consistency models between readers and writers
✓ Will study in depth for MongoDB (a replica failover-read sketch follows)
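A minimal sketch of failover reads against replicas; the replicas list and read() are illustrative stand-ins, not a real client API.

```python
# Every replica holds the same data; reads fail over past dead replicas.
replicas = [
    {"ok": False, "data": {"x": 1}},   # assume this replica is down
    {"ok": True,  "data": {"x": 1}},   # the same data is on every replica
]

def read(key: str):
    for r in replicas:                  # try replicas in order
        if r["ok"]:                     # fail over past unreachable ones
            return r["data"][key]       # any live replica can serve the read
    raise RuntimeError("no replica available")

print(read("x"))
# Writes are the hard part: every replica must be kept in sync, which is
# where the consistency models mentioned above come in.
```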
Data Access Strategies: (Dynamic) Communication
• Strategy:
✓ Communicate (at runtime) only the data that is required (a minimal sketch follows)
• Cost:
✓ High network cost when the system is loosely coupled and the data set to be exchanged is large
• Advantage(s):
✓ Minimal communication cost when only a small portion of the data is actually required by each node
• Concerns
✓ Requires a highly available and performant network
✓ Works best with fairly independent parallel data processing
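A minimal sketch of fetching only the required items instead of moving the whole dataset; remote_store and fetch() stand in for a network call.

```python
# Instead of copying a remote dataset, a node asks only for the keys it needs.
remote_store = {f"key{i}": i for i in range(1_000_000)}   # lives on another node

def fetch(key: str) -> int:
    """Stand-in for a network request returning one item, not the dataset."""
    return remote_store[key]

# Only two items cross the (simulated) network, not a million.
needed = ["key7", "key42"]
print(sum(fetch(k) for k in needed))
```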
Data Access Strategies – Networked Storage
Topics for today
Computer Cluster - Definition
Cluster - Objectives
• A computer cluster is typically built for one of the following two reasons:
✓ High Performance - referred to as compute clusters
✓ High Availability - achieved via redundancy
An off-the-shelf or custom load balancer or reverse proxy can be configured to serve the use case.
Hadoop nodes form a cluster for performance (independent Map/Reduce jobs are started on multiple nodes) and availability (data is replicated on multiple nodes for fault tolerance).
Most Big Data systems run in a cluster configuration for performance and availability.
Clusters – Peer to Peer computation
Client-Server vs. Peer-to-Peer
• Client-Server Computation
✓ A server node performs the core computation – business logic in the case of applications
✓ Client nodes request such computation
✓ At the programming level this is referred to as the request-response model
✓ Email, network file servers, …
• Peer-to-Peer Computation:
✓ All nodes are peers, i.e. they perform core computations and may act as client or server for each other.
✓ BitTorrent, some multi-player games, clusters
Cloud and Clusters
Motivation for using Clusters (1)
Motivation for using Clusters (2)
• Scale-out clusters with commodity workstations as nodes are suitable for software environments that are resilient:
✓ i.e. individual nodes may fail, but
✓ middleware and software will enable computations to keep running (and keep services available) for end users
✓ for instance, the back-ends of Google and Facebook use this model.
• On the other hand, (public) cloud infrastructure is typically built as clusters of servers
✓ due to the higher reliability of individual servers – used as nodes – compared to that of workstations as nodes.
Typical cluster components
[Diagram: typical cluster components - parallel applications running across cluster nodes; reference: http://www.cloudbus.org/papers/SSI-CCWhitePaper.pdf]
Example cluster: Hadoop
• A job is divided into tasks
• Considers every task as either a Map or a Reduce
• Tasks are assigned to a set of nodes (the cluster)
• Special control nodes manage the nodes for resource management, setup, monitoring, data transfer, failover etc.
• Hadoop clients work with these control nodes to get the job done (a word-count sketch of the Map/Reduce task types follows)
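A minimal word-count sketch of the Map and Reduce task types that Hadoop schedules across a cluster; here both phases and the shuffle run locally, purely for illustration.

```python
# Word count in the map-reduce style: map emits (word, 1) pairs, the shuffle
# groups them by key, and reduce sums each group.
from collections import defaultdict
from typing import Iterable, Tuple

def map_task(line: str) -> Iterable[Tuple[str, int]]:
    for word in line.split():
        yield (word, 1)                    # emit intermediate key-value pairs

def reduce_task(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    return (word, sum(counts))             # combine all values for one key

lines = ["big data systems", "big systems"]
groups = defaultdict(list)
for line in lines:                         # map phase (parallel across nodes)
    for word, n in map_task(line):
        groups[word].append(n)             # shuffle: group by key
print([reduce_task(w, c) for w, c in groups.items()])
```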
Summary
Next Session:
Big Data Analytics and Systems