Shared Memory Programming:
Lecture 6
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr15/
Outline
• Parallel Programming with Threads
• Parallel Programming with OpenMP
• See parlab.eecs.berkeley.edu/2012bootcampagenda
• 2 OpenMP lectures (slides and video) by Tim Mattson
• openmp.org/wp/resources/
• computing.llnl.gov/tutorials/openMP/
• portal.xsede.org/online-training
• www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf
• Slides on OpenMP derived from: U.Wisconsin tutorial, which in
turn were from LLNL, NERSC, U. Minn, and OpenMP.org
• See tutorial by Tim Mattson and Larry Meadows presented at
SC08, at OpenMP.org; includes programming exercises
• (There are other Shared Memory Models: CILK, TBB…)
• Performance comparison
• Summary
02/05/2015 CS267 Lecture 6
Parallel Programming with Threads
Recall Programming Model 1: Shared Memory
• Program is a collection of threads of control.
• Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common
blocks, or global heap.
• Threads communicate implicitly by writing and reading shared
variables.
• Threads coordinate by synchronizing on shared variables
[Figure: threads P0, P1, …, Pn, each with private memory (e.g., i = 2, 5, 8), communicating through a shared memory holding s; one thread writes s = ..., another reads y = ..s...]
Signature:
int pthread_create(pthread_t *,
                   const pthread_attr_t *,
                   void * (*)(void *),
                   void *);
Example call:
errcode = pthread_create(&thread_id, &thread_attribute,
                         &thread_fun, &fun_arg);
#include <pthread.h>
#include <stdio.h>

void *SayHello(void *arg) {
    printf("Hello, world!\n");
    return NULL;
}

int main() {
    pthread_t threads[16];
    int tn;
    for (tn = 0; tn < 16; tn++) {
        pthread_create(&threads[tn], NULL, SayHello, NULL);
    }
    for (tn = 0; tn < 16; tn++) {
        pthread_join(threads[tn], NULL);
    }
    return 0;
}
Others:
• pthread_t me; me = pthread_self();
• Allows a pthread to obtain its own identifier
• pthread_t thread; pthread_detach(thread);
• Informs the library that the thread’s exit status will not be needed by
subsequent pthread_join calls, resulting in better thread performance.
For more information consult the library or the man pages, e.g.,
man -k pthread
02/05/2015 Kathy Yelick
Recall Data Race Example
static int s = 0;
Thread 1 and Thread 2 both update the shared variable s concurrently.
• Pitfalls
• Data race bugs are very nasty to find because they can be
intermittent
• Deadlocks are usually easier, but can also be intermittent
Introduction to OpenMP
• What is OpenMP?
• Open specification for Multi-Processing, latest version 4.0, July 2013
• “Standard” API for defining multi-threaded shared-memory programs
• openmp.org – Talks, examples, forums, etc.
• See parlab.eecs.berkeley.edu/2012bootcampagenda
• 2 OpenMP lectures (slides and video) by Tim Mattson
• computing.llnl.gov/tutorials/openMP/
• portal.xsede.org/online-training
• www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf
• High-level API
• Preprocessor (compiler) directives ( ~ 80% )
• Library Calls ( ~ 19% )
• Environment Variables ( ~ 1% )
• OpenMP will:
• Allow a programmer to separate a program into serial regions and
parallel regions, rather than T concurrently-executing threads.
• Hide stack management
• Provide synchronization constructs
int main() {
  // Do this part in parallel
  printf("Hello, World!\n");
  return 0;
}

int main() {
  omp_set_num_threads(16);
  // Do this part in parallel
  #pragma omp parallel
  {
    printf("Hello, World!\n");
  }
  return 0;
}
• Preprocessor calculates loop
bounds for each thread directly
from serial source
• Barrier directives
#pragma omp barrier
Shared Memory Hardware and Memory Consistency
Basic Shared Memory Architecture
• Processors all connected to a large shared memory
• Where are caches?
[Figure: processors P1, P2, …, Pn connected through an interconnect to a shared memory]
[Figure: bus-based version (P1…Pn, each with a cache ($), on a shared bus with memory and I/O devices) and the cache coherence problem: memory holds u = 5; (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7; later reads by P1 and P2 may still see u = 5 in their caches]
• Things to note:
• Processors could see different values for u after event 3
• With write back caches, value written back to memory depends on
happenstance of which cache flushes or writes back value when
• How to fix with a bus: Coherence Protocol
• Use bus to broadcast writes or invalidations
• Simple protocols rely on presence of broadcast medium
• Bus not scalable beyond about 64 processors (max)
• Capacity, bandwidth limitations
02/05/2015 Slide source: John Kubiatowicz
Scalable Shared Memory: Directories
[Figure: processors P0–P3, each with a cache; each memory block has a directory entry (data plus a sharer list, e.g., p1, p2) recording which caches hold a copy]
Snoopy Cache-Coherence Protocols
[Figure: processors P0…Pn, each with a cache ($) whose lines carry State, Address, and Data fields; a bus-snoop unit on each cache watches the shared memory bus, so a memory operation from Pn is visible to every cache]
• Memory bus is a broadcast medium
• Caches contain information on which addresses they store
• Cache Controller “snoops” all transactions on the bus
• A transaction is a relevant transaction if it involves a cache block currently
contained in this cache
• Take action to ensure coherence
• invalidate, update, or supply value
• Many possible designs (see CS252 or CS258)
Assume:
• 1 GHz processor w/o cache
• => 4 GB/s inst BW per processor (32-bit)
• => 1.2 GB/s data BW at 30% load-store
• Suppose 98% inst hit rate and 95% data hit rate
• => 80 MB/s inst BW per processor
• => 60 MB/s data BW per processor
• => 140 MB/s combined BW per processor
• Assuming 1 GB/s bus bandwidth
• => 8 processors will saturate the bus
[Figure: processors with caches on a shared memory bus with I/O; 5.2 GB/s demand per processor without caches vs. 140 MB/s with caches]
• Gigaplane bus (256 data, 41 address, 83 MHz)
• Up to 16 processor and/or memory-I/O cards
• Coherent bus interface/switch
• L1 not coherent, L2 shared
[Figure: CPU/mem cards (two processors, each with an L2 cache ($2), a memory controller (MIU), and 1-, 2-, or 4-way interleaved DRAM) and I/O cards (bus interface to SBUS slots, 2 FiberChannel, 100bT, SCSI, PCI buses) attached to the Gigaplane bus]
Directory Based Memory/Cache Coherence
• Keep Directory to keep track of which memory stores latest
copy of data
• Directory, like cache, may keep information such as:
• Valid/invalid
• Dirty (inconsistent with memory)
• Shared (in other caches)
• When a processor executes a write operation to shared
data, basic design choices are:
• With respect to memory:
• Write through cache: do the write in memory as well as cache
• Write back cache: wait and do the write later, when the item is flushed
• With respect to other cached copies
• Update: give all other processors the new value
• Invalidate: all other processors remove from cache
• See CS252 or CS258 for details
SGI Altix 3000
Sequential Consistency Example
[Figure: example programs and interleavings for Threads A, B, C, and D]
Simultaneous Multi-threading ...
[Figure: issue-slot utilization over cycles for an 8-way superscalar; from Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”]
02/05/2015 Slide source: John Kubiatowicz