RAM, PRAM, and LogP Models
Why models?
What is a machine model?
An abstraction that describes the operation of a machine, allowing a value (cost) to be associated with each machine operation.
The running time of an algorithm is the number of instructions executed; its memory requirement is the number of memory cells used.
The RAM model (with asymptotic analysis) often gives relatively realistic results.
Illustration of PRAM
A single program is executed in MIMD mode.
[Figure: processors P1, P2, P3, ..., Pp driven by a common clock (CLK), all connected to a shared memory.]
All processors operate synchronously (with unbounded shared memory and unbounded local memory).
Communication in PRAM
Exclusive Read (ER) / Exclusive Write (EW): at most one processor may access a given memory location in a step.
Concurrent Read (CR): several processors can simultaneously read from the same memory location.
Concurrent Write (CW): several processors may write to the same memory location in a step; a rule (e.g., all must write a common value, or an arbitrary writer wins) resolves the conflict.
Model A is computationally stronger than model B if and only if any algorithm written for B runs unchanged on A.
EREW <= CREW <= CRCW (common) <= CRCW (arbitrary)
Target Architectures
Hypercube SIMD model
2D-mesh SIMD model
UMA multiprocessor model
Hypercube multicomputer model
REDUCTION
Cost-optimal PRAM algorithm for the reduction problem
Complexity: O(log n) (using n/2 processors)
Example for n = 8 and p = 4 processors
[Figure: binary-tree reduction of a0 ... a7. In step j=0, processors P0-P3 each add one pair of elements; in step j=1, P0 and P2 combine partial results; in step j=2, P0 produces the final sum.]
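To make the steps concrete, here is a minimal sequential C sketch of the reduction tree above (not from the slides). Each pass of the outer loop corresponds to one synchronous PRAM step j; on a PRAM, all iterations of the inner loop would run in parallel, one per processor.

#include <stdio.h>

/* Tree reduction of n values (n assumed to be a power of two). */
int reduce_sum( int a[], int n )
{
    int stride, i;
    for( stride = 1; stride < n; stride *= 2 )       /* step j: stride = 2^j */
        for( i = 0; i + stride < n; i += 2 * stride )
            a[i] += a[i + stride];                   /* done by one processor */
    return a[0];
}

int main( void )
{
    int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };           /* a0 ... a7 */
    printf( "%d\n", reduce_sum( a, 8 ));             /* prints 36 */
    return 0;
}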
PRAM strengths
Natural extension of the RAM model
Simple and easy to understand
Communication and synchronization issues are hidden.
PRAM weaknesses
Model inaccuracies
Unbounded local memory (registers)
All operations take unit time
Unaccounted costs
Non-local memory access: latency, bandwidth, memory-access contention
The algorithm designer is misled into using interprocessor communication without hesitation.
Synchronized processors
No local memory
PRAM variations
Bounded memory PRAM, PRAM(m)
In a given step, only m memory accesses can be serviced.
BPRAM(BLOCK PRAM)
L units for the first message, B units for each subsequent message
PRAM summary
The RAM model is widely used.
PRAM is simple and easy to understand.
The model has never reached much beyond the algorithms community, but it is becoming more important as threaded programming grows in popularity.
BSP (Bulk Synchronous Parallel) Model
Proposed by Leslie Valiant of Harvard University
Developed by W. F. McColl of Oxford University
Illustration of BSP
[Figure: p nodes, each performing local computation (w), connected by a communication network; a barrier synchronizes the nodes between supersteps.]
Three Parameters
w parameter
Maximum computation time within each superstep; the computation within a superstep takes at most w cycles.
g parameter
Number of cycles needed to communicate a unit message when all processors are involved in communication - a measure of network bandwidth:
g = (total number of local operations performed by all processors in one second) / (total number of words delivered by the communication network in one second)
In an h-relation, every processor sends or receives at most h messages; such a communication operation takes gh cycles.
l parameter
Barrier synchronization takes l cycles.
BSP Program
A BSP computation consists of S supersteps. A superstep is a sequence of steps followed by a barrier synchronization.
Any remote memory access takes effect at the barrier - loosely synchronous.
BSP Program
[Figure: timeline for processors P1-P4. Within superstep 1, each processor computes, then communicates; a barrier ends the superstep, after which superstep 2 begins.]
Time complexity
Valiant's original: cost of a superstep = max{w, gh, l}
McColl's revised: cost of a superstep = w + gh + l
Summed over all S supersteps: total cost = Ntot + Ncom * g + S * l (Ntot: total computation; Ncom: total communication in h-relations).
Algorithm design
Aim for Ntot = T / p (perfect balance of the total work T over p processors), and minimize Ncom and S.
Example
Inner product with 8 processors
Superstep 1
Computation: local sums, w = 2N/8
Communication: a 1-relation (procs. 0, 2, 4, 6 -> procs. 1, 3, 5, 7)
Superstep 2
Computation: one addition, w = 1
Communication: a 1-relation (procs. 1, 5 -> procs. 3, 7)
Superstep 3
Computation: one addition, w = 1
Communication: a 1-relation (proc. 3 -> proc. 7)
Example (cont'd)
Superstep 4
Computation: one addition, w = 1
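As a sketch of how these supersteps might look in BSPlib (introduced later in these notes): assume each process already holds its slice of the two vectors in xs and ys of length len (hypothetical names), and that the number of processes is a power of two. The send pattern 0,2,4,6 -> 1,3,5,7, then 1,5 -> 3,7, then 3 -> 7 is exactly the one shown above.

double bsp_inner( double *xs, double *ys, int len )
{
    int ii;
    double sum = 0.0, in = 0.0;
    bsp_pushregister( &in, sizeof( double ));
    bsp_sync();
    for( ii = 0; ii < len; ii++ )                /* superstep 1: local sum */
        sum += xs[ii] * ys[ii];
    for( ii = 1; ii < bsp_nprocs(); ii *= 2 ) {
        if( bsp_pid() % (2 * ii) == ii - 1 )     /* sender this round */
            bsp_put( bsp_pid() + ii, &sum, &in, 0, sizeof( double ));
        bsp_sync();                              /* barrier ends the superstep */
        if( bsp_pid() % (2 * ii) == 2 * ii - 1 ) /* receiver accumulates */
            sum += in;
    }
    bsp_popregister( &in );
    return sum;        /* the complete sum ends up on process nprocs-1 */
}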
Programming Example
All sums using the logarithmic technique
Calculate the partial sums of p integers stored on p processors in log p supersteps. Example with 4 processors.
LogP model
PRAM model: shared memory.
Common MPP organization: complete computers connected by a network. LogP attempts to capture the characteristics of such an organization, including its limited network capacity.
There is no consensus on a programming model, so the model should not enforce one.
LogP
[Figure: P processor/memory modules attached to an interconnection network of limited capacity (at most L/g messages in flight to or from any processor).]
L (latency): time to send a (small) message between modules.
o (overhead): time a processor spends sending or receiving a message.
g (gap): minimum time between successive sends or receives at a processor (the reciprocal of per-processor bandwidth).
P: number of processor/memory modules.
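A toy worked example of LogP accounting (parameter values are assumptions, not from the slides): one small message costs o + L + o cycles end to end, and successive sends from one processor must be at least max(o, g) cycles apart.

#include <stdio.h>

int main( void )
{
    int L = 6, o = 2, g = 4;   /* assumed latency, overhead, gap in cycles */
    int n = 5;                 /* messages sent back to back */
    int one = o + L + o;       /* send overhead + latency + receive overhead */
    /* sends leave at times 0, g, 2g, ...; the last message arrives one
       full message time after its send */
    int burst = (n - 1) * (g > o ? g : o) + o + L + o;
    printf( "one message: %d cycles\n", one );
    printf( "%d messages: %d cycles\n", n, burst );
    return 0;
}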
LogP philosophy
Think about:
the mapping of a task onto P processors
computation within a processor, its cost, and balance
communication between processors, its cost, and balance
given a characterization of processor and network performance.
Do not think about what happens inside the network.
BSP differs from LogP in three ways:
LogP uses a form of message passing based on pairwise synchronization.
LogP adds an extra parameter representing the overhead involved in sending a message - and this applies to every communication!
LogP defines g in local terms: it regards the network as having a finite capacity and treats g as the minimal permissible gap between message sends from a single process. In both models g is the reciprocal of the available per-processor network bandwidth, but BSP takes a global view of g while LogP takes a local view.
When analyzing the performance of the LogP model, it is often necessary (or convenient) to use barriers.
Message overhead is present but decreasing; the only remaining overhead is transferring the message from user space to a system buffer.
LogP + barriers - overhead = BSP.
Each model can efficiently simulate the other.
BSP can be regarded as a generalization of the PRAM model: if the BSP architecture has a small value of g (g = 1), it can be regarded as a PRAM. Hashing can be used to achieve efficient automatic memory management.
The value of l determines the degree of parallel slackness required to achieve optimal efficiency; l = g = 1 corresponds to an idealized PRAM, where no slackness is required.
BSPlib
Supports an SPMD style of programming.
The library is available in C and FORTRAN.
Implementations were available (several years ago) for:
Cray T3E
IBM SP2
SGI PowerChallenge
Convex Exemplar
Hitachi SR2001
Various workstation clusters
BSPlib
Initialization functions:
bsp_init() - simulate dynamic processes
bsp_begin() - start of SPMD code
bsp_end() - end of SPMD code
Enquiry functions:
bsp_pid() - find my process id
bsp_nprocs() - number of processes
bsp_time() - local time
Synchronization functions:
bsp_sync() - barrier synchronization
DRMA functions:
bsp_pushregister() - make a region globally visible
bsp_popregister() - remove global visibility
bsp_put() - push to remote memory
bsp_get() - pull from remote memory
BSPlib
BSMP functions:
bsp_set_tag_size() - choose tag size
bsp_send() - send to remote queue
bsp_get_tag() - match tag with message
bsp_move() - fetch from queue
High-performance functions:
bsp_hpput(), bsp_hpget(), bsp_hpmove() - unbuffered versions of the communication primitives
BSPlib Examples
#include <stdio.h>
#include "bsp.h"

int nprocs; /* global variable */

void spmd_part( void )
{
    bsp_begin( nprocs );
    printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
    bsp_end();
}

int main( int argc, char *argv[] )
{
    bsp_init( spmd_part, argc, argv );
    nprocs = ReadInteger();   /* number of processes chosen at run time */
    spmd_part();
    return 0;
}
BSPlib Examples
#include <stdio.h>
#include "bsp.h"

int main( void )
{
    int ii;
    bsp_begin( bsp_nprocs());
    /* print in process order: only one process prints per superstep */
    for( ii = 0; ii < bsp_nprocs(); ii++ ) {
        if( bsp_pid() == ii )
            printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
        fflush( stdout );
        bsp_sync();
    }
    bsp_end();
    return 0;
}
BSPlib Examples
All sums version 1 (lg(p) supersteps):

int bsp_allsums1( int x )
{
    int ii, left, right;
    bsp_pushregister( &left, sizeof( int ));
    bsp_sync();
    right = x;
    for( ii = 1; ii < bsp_nprocs(); ii *= 2 ) {
        /* pass the running sum ii processes to the right */
        if( bsp_pid() + ii < bsp_nprocs())
            bsp_put( bsp_pid() + ii, &right, &left, 0, sizeof( int ));
        bsp_sync();
        if( bsp_pid() >= ii )
            right = left + right;
    }
    bsp_popregister( &left );
    return( right );
}
BSPlib Examples
All sums version 2 (one superstep):

#include <stdlib.h>

int bsp_allsums2( int x )
{
    int ii, result;
    int *array = calloc( bsp_nprocs(), sizeof( int ));
    if( array == NULL )
        bsp_abort( "Unable to allocate %d element array", bsp_nprocs());
    bsp_pushregister( array, bsp_nprocs() * sizeof( int ));
    bsp_sync();
    /* deposit x in slot bsp_pid() of every higher-numbered process */
    for( ii = bsp_pid(); ii < bsp_nprocs(); ii++ )
        bsp_put( ii, &x, array, bsp_pid() * sizeof( int ), sizeof( int ));
    bsp_sync();
    result = array[0];
    for( ii = 1; ii <= bsp_pid(); ii++ )
        result += array[ii];
    bsp_popregister( array );
    free( array );
    return( result );
}
Both are based on pairwise synchronization rather than barrier synchronization.
No simple cost model for performance prediction.
No simple means of examining the global state.
BSP could be implemented using a small, carefully chosen subset of MPI subroutines.
Conclusion
BSP is a computational model of parallel computing based on the concept of supersteps. BSP does not exploit locality of reference in assigning processes to processors.
Data Parallelism
Independent tasks apply same operation to different elements of a data set
for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor
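In C, the same data-parallel loop might be written as below; the OpenMP pragma (an assumption - any data-parallel mechanism would do) tells the compiler that the iterations are independent and may be split across processors.

#include <stdio.h>

int main( void )
{
    double a[100], b[100], c[100];
    int i;
    for( i = 0; i < 100; i++ ) { b[i] = i; c[i] = 2 * i; }
    /* every iteration touches different elements, so no synchronization
       is needed inside the loop */
    #pragma omp parallel for
    for( i = 0; i < 100; i++ )
        a[i] = b[i] + c[i];
    printf( "a[99] = %f\n", a[99] );
    return 0;
}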
Functional Parallelism
Independent tasks apply different operations to different data elements
a ← 2
b ← 3
m ← (a + b) / 2
s ← (a^2 + b^2) / 2
v ← s - m^2
The first and second statements can execute concurrently, as can the third and fourth statements (see the sketch below). Speedup: limited by the number of concurrent sub-tasks.
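A minimal sketch of this functional parallelism using POSIX threads (pthreads is an assumption; the slides do not prescribe a mechanism). The third and fourth statements run in different threads because they are independent; the fifth must wait for both.

#include <pthread.h>
#include <stdio.h>

static double a = 2.0, b = 3.0, m, s;   /* first and second statements */

static void *calc_m( void *arg ) { m = (a + b) / 2.0; return NULL; }
static void *calc_s( void *arg ) { s = (a * a + b * b) / 2.0; return NULL; }

int main( void )
{
    pthread_t tm, ts;
    pthread_create( &tm, NULL, calc_m, NULL );   /* third statement */
    pthread_create( &ts, NULL, calc_s, NULL );   /* fourth statement */
    pthread_join( tm, NULL );
    pthread_join( ts, NULL );
    printf( "v = %f\n", s - m * m );             /* fifth: v = s - m^2 */
    return 0;
}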
Another example
The Sieve of Eratosthenes - a classical prime-finding algorithm. We want to find the prime numbers less than or equal to some positive integer n. Let us work through the example for n = 30.
Sieve of Eratosthenes
[Figure: successive slides mark off the multiples of 2, 3, and 5 among 2 ... 30, leaving the primes.]
Data parallelism approach
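A sequential C sketch of the sieve for n = 30 (not from the slides). In the data-parallel version, the inner marking loop is the data-parallel step: the array is split into segments, and each processor strikes out the multiples of the current prime within its own segment.

#include <stdio.h>
#include <string.h>

#define N 30

int main( void )
{
    char composite[N + 1];
    int k, m;
    memset( composite, 0, sizeof composite );
    for( k = 2; k * k <= N; k++ )         /* next unmarked value: 2, 3, 5 */
        if( !composite[k] )
            for( m = k * k; m <= N; m += k )
                composite[m] = 1;         /* data-parallel marking step */
    for( k = 2; k <= N; k++ )
        if( !composite[k] )
            printf( "%d ", k );           /* 2 3 5 7 11 13 17 19 23 29 */
    printf( "\n" );
    return 0;
}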