RAM, PRAM, and LogP models

Why models?
What is a machine model?
An abstraction that describes the operation of a machine and allows a value (cost) to be associated with each machine operation.

Why do we need models?


Make it easy to reason about algorithms
Hide machine implementation details so that general results applying to a broad class of machines can be obtained
Analyze the achievable complexity bounds (time, space, etc.)
Analyze the maximum parallelism
Models are directly related to algorithms.

RAM (random access machine) model


Memory consists of an infinite array of memory cells.
Instructions are executed sequentially, one at a time.
All instructions take unit time:
Load/store
Arithmetic
Logic

Running time of an algorithm is the number of instructions executed. Memory requirement is the number of memory cells used in the algorithm.

RAM (random access machine) model


The RAM model is the basis of algorithm analysis for sequential algorithms, although it is not perfect, for the following reasons:
Memory is not infinite
Not all memory accesses take the same time
Not all arithmetic operations take the same time
Instruction pipelining is not taken into consideration

The RAM model (with asymptotic analysis) often gives relatively realistic results.

PRAM (Parallel RAM)


An unbounded collection of processors; each processor has an infinite number of registers.
An unbounded collection of shared memory cells.
All processors can access all memory cells in unit time (when there is no memory conflict).
All processors execute PRAM instructions synchronously (some processors may idle).
Each PRAM instruction executes in a 3-phase cycle:
Read from a shared memory cell (if needed)
Computation
Write to a shared memory cell (if needed)

PRAM (Parallel RAM)


The only way processors exchange data is through the shared memory.
Parallel time complexity: the number of synchronous steps in the algorithm
Space complexity: the number of shared memory cells used
Parallelism: the number of processors used

Illustration of PRAM
Single program executed in MIMD mode; each processor has a unique index.
[Figure: p processors P1, P2, P3, ..., Pp connected to a single shared memory and driven by a common clock (CLK)]
All processors operate synchronously (with infinite shared memory and infinite local memory).

Communication in PRAM


PRAM further refinement


PRAMs are further classified based on how memory conflicts are resolved.
Read
Exclusive Read (ER): all processors can simultaneously read only from distinct memory locations (not the same location).
What if two processors want to read from the same location?
Concurrent Read (CR): all processors can simultaneously read from any memory location, including the same one.

PRAM further refinement


PRAMs are further classified based on how memory conflicts are resolved.
Write
Exclusive Write (EW): all processors can simultaneously write only to distinct memory locations (not the same location).
Concurrent Write (CW): all processors can simultaneously write to any memory location, including the same one.
Common CW: a concurrent write is allowed only if all processors write the same value to the location.
Random CW: one of the written values is picked at random.
Priority CW: processors have priorities; the value from the highest-priority processor wins.
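As a purely illustrative sketch (not from the original slides), the three concurrent-write rules could be simulated as below; the function name resolve_cw and the lower-pid-has-higher-priority convention are assumptions.

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* values[i] is the value processor i tries to write to one shared cell. */
int resolve_cw(const int values[], int p, const char *rule)
{
    if (strcmp(rule, "common") == 0) {
        for (int i = 1; i < p; i++)
            assert(values[i] == values[0]);   /* common CW: every written value must agree */
        return values[0];
    }
    if (strcmp(rule, "random") == 0)
        return values[rand() % p];            /* random CW: an arbitrary writer wins */
    return values[0];                         /* priority CW: lowest pid = highest priority wins */
}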

PRAM model variations


EREW, CREW, CRCW (common), CRCW (random), CRCW (priority)
Which model is closer to practical SMP machines?

Model A is computationally stronger than model B if and only if any algorithm written for B runs unchanged on A.
EREW <= CREW <= CRCW (common) <= CRCW (random)

PRAM algorithm example


SUM: add N numbers stored in memory M[0 .. N-1]
Sequential SUM algorithm (O(N) complexity):
for (i = 0; i < N; i++) sum = sum + M[i];

PRAM SUM algorithm?

PRAM SUM algorithm


Which PRAM model?
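The slide's diagram is not reproduced here; below is a minimal sketch of the usual pairwise (tree) summation, assuming N is a power of two. On a PRAM, every iteration of the inner loop is one processor acting in the same synchronous step, and no cell is read or written by two processors at once, so the EREW model suffices.

/* Pairwise summation: processor i/(2*stride) handles index i in each step. */
void pram_sum(int M[], int N)                         /* assumes N is a power of two */
{
    for (int stride = 1; stride < N; stride *= 2) {   /* log2(N) synchronous steps */
        for (int i = 0; i < N; i += 2 * stride)       /* all iterations run in parallel */
            M[i] = M[i] + M[i + stride];              /* each cell touched by one processor */
    }
    /* M[0] now holds the sum: N/2 processors, O(log N) time, speedup N / log N */
}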

PRAM SUM algorithm complexity


Time complexity?
Number of processors needed?
Speedup (vs. the sequential program)?

Basic Parallel Algorithms


3 elementary problems are considered:
Reduction
Broadcast
Prefix sums

Target Architectures
Hypercube SIMD model
2D-mesh SIMD model
UMA multiprocessor model
Hypercube multicomputer

REDUCTION

Cost-optimal PRAM algorithm for the reduction problem
Cost-optimal PRAM algorithm complexity: O(log n), using n div 2 processors
Example for n = 8 and p = 4 processors:
[Figure: reduction tree over a0 .. a7; at step j=0 processors P0, P1, P2, P3 each add one pair, at step j=1 P0 and P2 combine the partial sums, and at step j=2 P0 produces the final result]

Reduction using Hyper-cube SIMD


[Figure: reduction on a hypercube SIMD machine; in each step partial sums are combined across one hypercube dimension, e.g. P4..P7 send to P0..P3, then P2 and P3 send to P0 and P1, then P1 sends to P0]
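A minimal sketch of this dimension-by-dimension scheme is shown below, assuming p = 2^d processors; send() and recv() are hypothetical nearest-neighbour primitives, not part of any real library.

/* Fold partial sums across one hypercube dimension per step. */
int hypercube_reduce(int pid, int d, int local_value)
{
    for (int dim = d - 1; dim >= 0; dim--) {    /* d = log2(p) steps */
        int partner = pid ^ (1 << dim);         /* neighbour across dimension dim */
        if (pid > partner) {
            send(partner, local_value);         /* upper half passes its partial sum ... */
            break;                              /* ... and drops out of the reduction */
        }
        local_value += recv(partner);           /* lower half accumulates */
    }
    return local_value;                         /* processor 0 ends up with the total */
}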

Reduction using 2-D mesh


Reduction using 2-D mesh


Example: compute the total sum on a 4*4 mesh
Stage 1 [figure not reproduced]


Reduction using 2-D mesh


Stage 2 [figure not reproduced]

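A minimal sketch of one common way to realize the two stages (fold each row into column 0, then fold column 0 into node (0,0)); whether this matches the slides' figures exactly is an assumption, and send_left()/recv_right()/send_up()/recv_down() are hypothetical nearest-neighbour primitives.

/* Total-sum reduction on an n x n mesh in 2(n-1) nearest-neighbour steps. */
int mesh_reduce(int row, int col, int n, int local_value)
{
    /* Stage 1: fold every row leftward into column 0 */
    for (int c = n - 1; c > 0; c--) {
        if (col == c)               send_left(local_value);
        else if (col == c - 1)      local_value += recv_right();
    }
    /* Stage 2: fold column 0 upward into row 0 */
    if (col == 0) {
        for (int r = n - 1; r > 0; r--) {
            if (row == r)           send_up(local_value);
            else if (row == r - 1)  local_value += recv_down();
        }
    }
    return local_value;             /* node (0,0) now holds the total */
}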

Parallel search algorithm


A PRAM with P processors holds N unsorted numbers (P <= N). Does x occur among the N numbers?
p_0 holds x initially, and p_0 must know the answer at the end.
PRAM algorithm:
Step 1: inform every processor what x is
Step 2: every processor checks its N/P numbers and sets a flag
Step 3: check whether any flag is set to 1

Parallel search algorithm


PRAM algorithm:
Step 1: inform every processor what x is
Step 2: every processor checks its N/P numbers and sets a flag
Step 3: check whether any flag is set to 1

EREW: O(log P) for step 1, O(N/P) for step 2, O(log P) for step 3
CREW: O(1) for step 1, O(N/P) for step 2, O(log P) for step 3
CRCW (common): O(1) for step 1, O(N/P) for step 2, O(1) for step 3
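A minimal sketch of the CRCW (common) variant, with the P-fold parallelism simulated by the outer loop over pid; the function name and the block partition (N assumed divisible by P) are illustrative only.

/* Every processor that finds x writes the same value 1 to the shared flag,
   which the common-CRCW rule allows in a single O(1) step. */
int parallel_search(const int M[], int N, int P, int x)
{
    int flag = 0;                              /* shared cell, read by p_0 at the end */
    for (int pid = 0; pid < P; pid++)          /* conceptually, all pids at once */
        for (int i = pid * (N / P); i < (pid + 1) * (N / P); i++)
            if (M[i] == x)
                flag = 1;                      /* concurrent common write */
    return flag;                               /* 1 if x occurs, 0 otherwise */
}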

PRAM strengths
A natural extension of the RAM model
Simple and easy to understand
Communication and synchronization issues are hidden

Can be used as a benchmark:
If an algorithm performs badly in the PRAM model, it will perform badly in reality.
A good PRAM program may not be practical, though.

PRAM weaknesses
Model inaccuracies:
Unbounded local memory (registers)
All operations take unit time

Unaccounted costs:
Non-local memory access
Latency
Bandwidth
Memory access contention

Pros & Cons


Pros
Simple and clean semantics
The majority of theoretical parallel algorithms are specified with the PRAM model
Captures the complexity of a parallel algorithm
Independent of the communication network topology

Pros & Cons (Contd)


Cons
Not realistic:
The communication model is too powerful
The algorithm designer is misled into using interprocessor communication without hesitation
Synchronized processors
No local memory

PRAM variations
Bounded memory PRAM, PRAM(m)
In a given step, only m memory accesses can be serviced.

Bounded number of processors PRAM


Any problem that can be solved by a p-processor PRAM in t steps can be solved by a p'-processor PRAM (p' <= p) in t' = O(tp/p') steps.

LPRAM(Local Memory PRAM)


Accessing global memory costs L units; local and remote memory have different access costs.
Any algorithm that runs on a p-processor PRAM can run on an LPRAM with at most a factor-of-L loss.

BPRAM (Block PRAM)
L units for the first message of a block, B units for each subsequent message.

PRAM summary
The RAM model is widely used.
PRAM is simple and easy to understand.
The PRAM model never spread far beyond the algorithms community, but it is becoming more important as threaded programming gains popularity.

BSP Model
BSP(Bulk Synchronous Parallel) Model
Proposed by Leslie Valiant of Harvard University
Developed by W. F. McColl of Oxford University

BSP Computer Model


Distributed-memory architecture
3 components:
Node
Processor
Local memory
Router (communication network)
Point-to-point message passing (or shared variables)
Barrier synchronizing facility
Synchronizes all processors or a subset

Illustration of BSP
[Figure: nodes (computation cost w) connected through a communication network (cost g), with a barrier synchronization facility (cost l)]

Three Parameters
w parameter
Maximum computation time within each superstep; a computation operation takes at most w cycles.

g parameter
Number of cycles needed to communicate a unit message when all processors are involved in communication (network bandwidth):
g = (total number of local operations performed by all processors in one second) / (total number of words delivered by the communication network in one second)
h is the relation coefficient; a communication operation realizing an h-relation takes g*h cycles.

l parameter
Barrier synchronization takes l cycles.

BSP Program
A BSP computation consists of S supersteps.
A superstep is a sequence of steps followed by a barrier synchronization.
Within a superstep, remote memory accesses take effect only at the barrier - loosely synchronous.

BSP Program
[Figure: processes P1..P4 across superstep 1 and superstep 2; each superstep consists of computation, then communication, then a barrier synchronization]

BSP Algorithm Design


2 variables for time complexity:
N: number of time steps (local operations) per superstep
h-relation: each node sends and receives at most h messages; communication within a superstep takes g*h time steps

Time complexity of a superstep:
Valiant's original (overlapped communication and computation): max{w, g*h, l}
McColl's revised: N + g*h + l; summed over all S supersteps, T = N_tot + N_com * g + S * l

Algorithm design
Keep N_tot = T/p (T being the sequential time), and minimize N_com and S.

Example
Inner product with 8 processors
Superstep 1
Computation: local sum, w = 2N/8
Communication: 1-relation (procs. 0, 2, 4, 6 -> procs. 1, 3, 5, 7)

Superstep 2
Computation: one addition, w = 1
Communication: 1-relation (procs. 1, 5 -> procs. 3, 7)

Superstep 3
Computation: one addition, w = 1
Communication: 1-relation (proc. 3 -> proc. 7)

Example (Contd)
Superstep 4
Computation: one addition, w = 1

Total execution time = 2N/p + log p * (g + l + 1), with p = 8 processors

g log p: communication overhead; l log p: synchronization overhead
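As a purely illustrative calculation with made-up parameter values (p = 8, N = 1024, g = 4, l = 64, none of which come from the slides), the estimate gives 2*1024/8 + 3*(4 + 64 + 1) = 256 + 207 = 463 cycles.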

BSP Programming Library


A more abstract level of programming than message passing.
BSPlib library: Oxford BSP library + Green BSP.
One-sided DRMA (direct remote memory access).
Shared-memory operations: dynamic shared variables through registration.
BSMP (bulk synchronous message passing): combines small messages into a larger one.
Additional features: message reordering.

BSP Programming Library (Contd)


Extensions: the Paderborn University BSP library supports collective communication and group management facilities.
Cost model for performance analysis and prediction.
Good for debugging: the global state is visible at superstep boundaries.

Programming Example
All sums (prefix sums) using the logarithmic technique.
Calculate the partial sums of p integers stored on p processors.
log p supersteps; illustrated for 4 processors.

BSP Program Example


int bsp_allsum(int x)
{
    int i, left, right;
    bsp_push_register(&left, sizeof(int));
    bsp_sync();
    right = x;
    for (i = 1; i < bsp_nprocs(); i *= 2) {
        if (bsp_pid() + i < bsp_nprocs())              /* guard against writing past the last process */
            bsp_put(bsp_pid() + i, &right, &left, 0, sizeof(int));
        bsp_sync();
        if (bsp_pid() >= i)
            right = left + right;
    }
    bsp_pop_register(&left);
    return right;
}

All Sums Using Logarithmic Technique


[Figure: prefix sums over the values 1, 2, 3, 4 computed with the logarithmic technique; the last processor ends with the total 10]

LogP model
PRAM model: shared memory.
[Figure: P processor/memory nodes connected by a network]
Common MPP organization: complete machines connected by a network. LogP attempts to capture the characteristics of such an organization.

Deriving LogP model


Processing: powerful microprocessor, large DRAM, cache => P
Communication:
+ significant latency => L
+ limited bandwidth => g
+ significant overhead => o (on both the sending and receiving ends)
+ limited capacity
No consensus on topology => the model should not exploit network structure
No consensus on programming model => the model should not enforce one

LogP
[Figure: P processor/memory modules attached to an interconnection network; each message incurs overhead o at the sender and receiver, latency L in the network, successive sends are separated by gap g, and the network carries a limited volume of at most L/g messages to or from any one processor]

L: latency in sending a (small) message between modules
o: overhead felt by the processor on sending or receiving a message
g: gap between successive sends or receives by one processor (the reciprocal of per-processor bandwidth)
P: number of processor/memory modules

Using the model


[Timing diagram: a send incurs overhead o, the message travels for latency L, the receive incurs overhead o; successive sends from one processor are separated by gap g]

Send n messages from proc to proc in time 2o + L + g(n-1)


Each processor spends o*n cycles on overhead and has (g - o)(n - 1) + L compute cycles available.
Sending n messages from one processor to many takes the same time.
Sending n messages from many processors to one also takes the same time, but all but L/g of the processors block, so fewer compute cycles are available.

Using the model


Two processors send n words to each other:
2o + L + g(n-1)

This assumes no network contention, so it can under-estimate the communication time.
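As a purely illustrative calculation with made-up parameter values (L = 6, o = 2, g = 4, n = 5 words, none of which come from the slides), the estimate gives 2*2 + 6 + 4*(5 - 1) = 26 cycles.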

LogP philosophy
Think about: the mapping of a task onto P processors; the computation within a processor, its cost, and its balance; the communication between processors, its cost, and its balance - all given a characterization of processor and network performance.
Do not think about what happens inside the network.

Develop optimal broadcast algorithm based on the LogP model


Broadcast a single datum to P-1 processors
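The slide's broadcast tree is not reproduced; below is a minimal sketch of the greedy idea normally used here, under the assumptions that g >= o and that a message issued at time t is received at t + o + L + o. The function name and the simulation style are illustrative, not from the slides.

/* Earliest time each of P processors can receive the broadcast datum. */
void logp_broadcast_times(int P, int L, int o, int g, int recv_time[])
{
    int next_send[P];                          /* earliest time each holder may issue its next send */
    int holders = 1;
    recv_time[0] = 0;                          /* the source holds the datum at time 0 */
    next_send[0] = 0;
    for (int i = 1; i < P; i++) {
        int best = 0;                          /* greedily pick the holder that can deliver soonest */
        for (int h = 1; h < holders; h++)
            if (next_send[h] < next_send[best])
                best = h;
        recv_time[i] = next_send[best] + o + L + o;  /* send overhead + latency + receive overhead */
        next_send[best] += g;                  /* the sender waits a gap g before its next send */
        next_send[i] = recv_time[i];           /* the new holder may begin sending immediately */
        holders++;
    }
}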

Strengths of the LogP model


Simple: only 4 parameters.
Can easily be used to guide algorithm development, especially for communication routines.
The model has been used to analyze many collective communication algorithms.

Weaknesses of the LogP model


Accurate only at a very low level (close to the machine instruction level); inaccurate for more practical communication systems with layers of protocols (e.g. TCP/IP).
Many variations - the LogP family of models: LogGP, LoGPC, pLogP, etc.
These variations make the model more accurate, but also more complex.

BSP vs. LogP

BSP differs from LogP in three ways:
LogP uses a form of message passing based on pairwise synchronization.
LogP adds an extra parameter representing the overhead involved in sending a message - and it applies to every communication!
LogP defines g in local terms: it regards the network as having a finite capacity and treats g as the minimal permissible gap between message sends from a single process. In both models g is the reciprocal of the available per-processor network bandwidth; BSP takes a global view of g, LogP a local one.

BSP vs. LogP

When analyzing performance in the LogP model, it is often necessary (or convenient) to use barriers.
Message overhead is present but decreasing; the only remaining overhead is transferring the message between user space and a system buffer.
LogP + barriers - overhead = BSP.
Each model can efficiently simulate the other.

BSP vs. PRAM

BSP can be regarded as a generalization of the PRAM model.
If the BSP architecture has a small value of g (g = 1), it can be regarded as a PRAM; hashing can be used to automatically achieve efficient memory management.

The value of l determines the degree of parallel slackness required to achieve optimal efficiency; l = g = 1 corresponds to the idealized PRAM, where no slackness is required.

BSPlib

Supports an SPMD style of programming.
The library is available in C and FORTRAN.
Implementations were available (several years ago) for: Cray T3E, IBM SP2, SGI PowerChallenge, Convex Exemplar, Hitachi SR2001, and various workstation clusters.

Allows direct remote memory access or message passing.
Includes support for unbuffered messages for high-performance computing.

BSPlib
Initialization functions:
bsp_init() - simulate dynamic processes
bsp_begin() - start of SPMD code
bsp_end() - end of SPMD code
Enquiry functions:
bsp_pid() - find my process id
bsp_nprocs() - number of processes
bsp_time() - local time
Synchronization functions:
bsp_sync() - barrier synchronization
DRMA functions:
bsp_pushregister() - make a region globally visible
bsp_popregister() - remove global visibility
bsp_put() - push to remote memory
bsp_get() - pull from remote memory

BSPlib

BSMP functions:
bsp_set_tag_size() - choose tag size
bsp_send() - send to a remote queue
bsp_get_tag() - match tag with message
bsp_move() - fetch from the queue

High-performance functions:
bsp_hpput(), bsp_hpget(), bsp_hpmove()
These are unbuffered versions of the communication primitives.

Halt functions:
bsp_abort() - one process halts all

BSPlib Examples

Static Hello World:

void main( void )
{
    bsp_begin( bsp_nprocs());
    printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
    bsp_end();
}

Dynamic Hello World:

int nprocs;   /* global variable */

void spmd_part( void )
{
    bsp_begin( nprocs );
    printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
    bsp_end();
}

void main( int argc, char *argv[] )
{
    bsp_init( spmd_part, argc, argv );
    nprocs = ReadInteger();
    spmd_part();
}

BSPlib Examples

Serialized printing of Hello World (shows synchronization):

void main( void )
{
    int ii;
    bsp_begin( bsp_nprocs());
    for( ii = 0; ii < bsp_nprocs(); ii++ ) {
        if( bsp_pid() == ii )
            printf( "Hello BSP from %d of %d\n", bsp_pid(), bsp_nprocs());
        fflush( stdout );
        bsp_sync();
    }
    bsp_end();
}

BSPlib Examples

All sums version 1 ( lg( p ) supersteps ):

int bsp_allsums1( int x )
{
    int ii, left, right;
    bsp_pushregister( &left, sizeof( int ));
    bsp_sync();
    right = x;
    for( ii = 1; ii < bsp_nprocs(); ii *= 2 ) {
        if( bsp_pid() + ii < bsp_nprocs())
            bsp_put( bsp_pid() + ii, &right, &left, 0, sizeof( int ));
        bsp_sync();
        if( bsp_pid() >= ii )
            right = left + right;
    }
    bsp_popregister( &left );
    return( right );
}

BSPlib Examples

All sums version 2 (one superstep):

int bsp_allsums2( int x )
{
    int ii, result, *array = calloc( bsp_nprocs(), sizeof(int));
    if( array == NULL )
        bsp_abort( "Unable to allocate %d element array", bsp_nprocs());
    bsp_pushregister( array, bsp_nprocs()*sizeof( int));
    bsp_sync();
    for( ii = bsp_pid(); ii < bsp_nprocs(); ii++ )
        bsp_put( ii, &x, array, bsp_pid()*sizeof(int), sizeof(int));
    bsp_sync();
    result = array[0];
    for( ii = 1; ii <= bsp_pid(); ii++ )
        result += array[ii];
    bsp_popregister( array );
    free( array );
    return( result );
}

BSPlib vs. PVM and/or MPI


MPI/PVM are widely implemented and widely used.
Both have HUGE APIs!
Both may be inefficient on (distributed) shared-memory systems, where communication and synchronization are decoupled (true for DSM machines with one-sided communication).
Both are based on pairwise synchronization rather than barrier synchronization.
Neither offers a simple cost model for performance prediction.
Neither offers a simple means of examining the global state.
BSP could be implemented using a small, carefully chosen subset of MPI subroutines.

Conclusion
BSP is a computational model of parallel computing based on the concept of supersteps.
BSP does not use locality of reference for the assignment of processes to processors.

Predictability is defined in terms of three parameters.


BSP is a generalization of PRAM.

BSP = LogP + barriers - overhead


BSPlib has a much smaller API as compared to MPI/PVM.

Data Parallelism
Independent tasks apply same operation to different elements of a data set
for i ← 0 to 99 do
    a[i] ← b[i] + c[i]
endfor

The iterations can be performed concurrently.
Speedup: potentially p-fold, where p = number of processors.

Functional Parallelism
Independent tasks apply different operations to different data elements
a ← 2
b ← 3
m ← (a + b) / 2
s ← (a² + b²) / 2
v ← s - m²

The first and second statements can run concurrently, as can the third and fourth.
Speedup: limited by the number of concurrent sub-tasks.

Another example
The Sieve of Eratosthenes - a classical prime-finding algorithm.
We want to find the prime numbers less than or equal to some positive integer n; the example uses n = 30.


Sieve of Eratosthenes


Serial pseudo-code (the slide's pseudo-code image is not reproduced; a minimal sketch follows)

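Since the slide's pseudo-code is an image, here is a minimal serial sketch of the classical method for n = 30 (the constant and output format are illustrative):

#include <stdio.h>
#include <string.h>

#define N 30

int main(void)
{
    char composite[N + 1];
    memset(composite, 0, sizeof composite);
    for (int k = 2; k * k <= N; k++)              /* only sieve with k up to sqrt(n) */
        if (!composite[k])
            for (int m = k * k; m <= N; m += k)
                composite[m] = 1;                 /* mark every multiple of k */
    for (int i = 2; i <= N; i++)
        if (!composite[i])
            printf("%d ", i);                     /* prints 2 3 5 7 11 13 17 19 23 29 */
    printf("\n");
    return 0;
}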

Sieve of Eratosthenes

Control parallel solution


Sieve of Eratosthenes
Data parallelism approach

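A minimal sketch of the data-parallel idea, assuming the range 2..n is block-partitioned across p processors and each processor marks the multiples of the current prime k inside its own block (the helper name, the block layout, and the simulation of parallelism by the pid loop are all assumptions):

/* Mark the multiples of prime k; composite[] has n+1 entries, shared by all. */
void mark_multiples(char composite[], int n, int p, int k)
{
    int block = (n - 1) / p + 1;                     /* elements of 2..n per processor */
    for (int pid = 0; pid < p; pid++) {              /* conceptually, all pids at once */
        int lo = 2 + pid * block;
        int hi = lo + block - 1 < n ? lo + block - 1 : n;
        int first = ((lo + k - 1) / k) * k;          /* first multiple of k that is >= lo */
        if (first < k * k)
            first = k * k;                           /* smaller multiples were marked by smaller primes */
        for (int m = first; m <= hi; m += k)
            composite[m] = 1;
    }
}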

Data parallelism (Contd)

