
Shared Memory Programming:
Threads and OpenMP

Lecture 6
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr15/

CS267 Lecture 6 1
Outline
• Parallel Programming with Threads
• Parallel Programming with OpenMP
• See parlab.eecs.berkeley.edu/2012bootcampagenda
• 2 OpenMP lectures (slides and video) by Tim Mattson
• openmp.org/wp/resources/
• computing.llnl.gov/tutorials/openMP/
• portal.xsede.org/online-training
• www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf
• Slides on OpenMP derived from: U. Wisconsin tutorial, which in
turn was derived from LLNL, NERSC, U. Minn, and OpenMP.org
• See tutorial by Tim Mattson and Larry Meadows presented at
SC08, at OpenMP.org; includes programming exercises
• (There are other Shared Memory Models: CILK, TBB…)
• Performance comparison
• Summary
02/05/2015 CS267 Lecture 6 2
Parallel Programming with Threads

CS267 Lecture 6 3
Recall Programming Model 1: Shared Memory
• Program is a collection of threads of control.
• Can be created dynamically, mid-execution, in some languages
• Each thread has a set of private variables, e.g., local stack variables
• Also a set of shared variables, e.g., static variables, shared common
blocks, or global heap.
• Threads communicate implicitly by writing and reading shared
variables.
• Threads coordinate by synchronizing on shared variables

[Figure: shared memory holding s, accessed by statements s = ... and y = ..s...; each processor P0, P1, ..., Pn also has private memory holding its own i (e.g., i: 2, i: 5, i: 8)]
02/05/2015 CS267 Lecture 6 4


Shared Memory Programming
Several Thread Libraries/systems
• PTHREADS is the POSIX Standard
• Relatively low level
• Portable but possibly slow; relatively heavyweight
• OpenMP standard for application level programming
• Support for scientific programming on shared memory
• openmp.org
• TBB: Thread Building Blocks
• Intel
• CILK: Language of the C “ilk”
• Lightweight threads embedded into C
• Java threads
• Built on top of POSIX threads
• Object within Java language
02/05/2015 CS267 Lecture 6 5
Common Notions of Thread Creation
• cobegin/coend
    cobegin
      job1(a1);
      job2(a2);
    coend
  • Statements in block may run in parallel
  • cobegins may be nested
  • Scoped, so you cannot have a missing coend
• fork/join
    tid1 = fork(job1, a1);
    job2(a2);
    join tid1;
  • Forked procedure runs in parallel
  • Wait at join point if it’s not finished
• future
    v = future(job1(a1));
    … = …v…;
  • Future expression evaluated in parallel
  • Attempt to use return value will wait
• Cobegin cleaner than fork, but fork is more general
• Futures require some compiler (and likely hardware) support

02/05/2015 CS267 Lecture 6 6


Overview of POSIX Threads
• POSIX: Portable Operating System Interface
• Interface to Operating System utilities
• PThreads: The POSIX threading interface
• System calls to create and synchronize threads
• Should be relatively uniform across UNIX-like OS
platforms
• PThreads contain support for
• Creating parallelism
• Synchronizing
• No explicit support for communication, because
shared memory is implicit; a pointer to shared data is
passed to a thread

02/05/2015 CS267 Lecture 6 7


Forking Posix Threads

Signature:
int pthread_create(pthread_t *,
const pthread_attr_t *,
void * (*)(void *),
void *);
Example call:
errcode = pthread_create(&thread_id, &thread_attribute,
                         &thread_fun, &fun_arg);

• thread_id is the thread id or handle (used to halt, etc.)


• thread_attribute various attributes
• Standard default values obtained by passing a NULL pointer
• Sample attributes: minimum stack size, priority
• thread_fun the function to be run (takes and returns void*)
• fun_arg an argument can be passed to thread_fun when it starts
• errcode will be set nonzero if the create operation fails
02/05/2015 CS267 Lecture 6 8
Simple Threading Example
void* SayHello(void *foo) {
    printf( "Hello, world!\n" );
    return NULL;
}
(Compile using gcc -lpthread)

int main() {
pthread_t threads[16];
int tn;
for(tn=0; tn<16; tn++) {
pthread_create(&threads[tn], NULL, SayHello, NULL);
}
for(tn=0; tn<16 ; tn++) {
pthread_join(threads[tn], NULL);
}
return 0;
}

02/05/2015 CS267 Lecture 6 9


Loop Level Parallelism
• Many scientific applications have parallelism in loops
• With threads:
… my_stuff [n][n];
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
… pthread_create (update_cell[i][j], …,
my_stuff[i][j]);

• But overhead of thread creation is nontrivial


• update_cell should have a significant amount of work
• 1/p-th of total work if possible

02/05/2015 CS267 Lecture 6 10
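
Instead of one thread per cell, a common approach is to create p threads and give each a contiguous block of rows, roughly 1/p of the total work. A minimal sketch, not from the slides; the names worker and NTHREADS, and the simple increment standing in for update_cell, are illustrative:

    #include <pthread.h>

    #define N 1024
    #define NTHREADS 4

    static double my_stuff[N][N];            /* shared grid */

    /* Each thread updates a contiguous block of rows: ~1/p of the work */
    void* worker(void *arg) {
        long tid = (long)arg;
        int rows_per_thread = N / NTHREADS;
        int start = tid * rows_per_thread;
        int end   = (tid == NTHREADS - 1) ? N : start + rows_per_thread;
        for (int i = start; i < end; i++)
            for (int j = 0; j < N; j++)
                my_stuff[i][j] += 1.0;       /* stand-in for update_cell */
        return NULL;
    }

    int main() {
        pthread_t threads[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void*)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);
        return 0;
    }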


Some More Pthread Functions
• pthread_yield();
• Informs the scheduler that the thread is willing to yield its quantum,
requires no arguments.
• pthread_exit(void *value);
• Exit thread and pass value to joining thread (if exists)
• pthread_join(pthread_t *thread, void **result);
• Wait for specified thread to finish. Place exit value into *result.

Others:
• pthread_t me; me = pthread_self();
  • Allows a pthread to obtain its own identifier
• pthread_t thread; pthread_detach(thread);
  • Informs the library that the thread’s exit status will not be needed by
    subsequent pthread_join calls, resulting in better thread performance.
For more information consult the library or the man pages, e.g.,
man -k pthread
02/05/2015 Kathy Yelick 11
Recall Data Race Example

static int s = 0;

Thread 1 Thread 2

for i = 0, n/2-1 for i = n/2, n-1


s = s + f(A[i]) s = s + f(A[i])

• Problem is a race condition on variable s in the program


• A race condition or data race occurs when:
- two processors (or two threads) access the same
variable, and at least one does a write.
- The accesses are concurrent (not synchronized) so
they could happen simultaneously

02/05/2015 CS267 Lecture 6 12
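
One standard fix, sketched below (not from the slides): each thread accumulates a private partial sum and updates the shared s only once, under a mutex, so the read-modify-write of s is never concurrent. The helper names (sum_range, range_t) and the choice of f are illustrative:

    #include <pthread.h>

    #define N 1000
    static double A[N];
    static double s = 0.0;                       /* shared sum */
    static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

    static double f(double x) { return x * x; }

    typedef struct { int lo, hi; } range_t;

    /* Private partial sum, then one locked update of the shared s */
    void* sum_range(void *arg) {
        range_t *r = (range_t*)arg;
        double local = 0.0;
        for (int i = r->lo; i < r->hi; i++)
            local += f(A[i]);
        pthread_mutex_lock(&s_lock);
        s += local;
        pthread_mutex_unlock(&s_lock);
        return NULL;
    }

    int main() {
        pthread_t t1, t2;
        range_t r1 = {0, N/2}, r2 = {N/2, N};
        for (int i = 0; i < N; i++) A[i] = 1.0;
        pthread_create(&t1, NULL, sum_range, &r1);
        pthread_create(&t2, NULL, sum_range, &r2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;                                /* s == N here */
    }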


Basic Types of Synchronization: Barrier
Barrier -- global synchronization
• Especially common when running multiple copies of
the same function in parallel
• SPMD “Single Program Multiple Data”
• simple use of barriers -- all threads hit the same one
work_on_my_subgrid();
barrier;
read_neighboring_values();
barrier;
• more complicated -- barriers on branches (or loops)
if (tid % 2 == 0) {
work1();
barrier
} else { barrier }
• barriers are not provided in all thread libraries
02/05/2015 CS267 Lecture 6 13
Creating and Initializing a Barrier
• To (dynamically) initialize a barrier, use code similar to
this (which sets the number of threads to 3):
pthread_barrier_t b;
pthread_barrier_init(&b,NULL,3);

• The second argument specifies an attribute object for


finer control; using NULL yields the default attributes.

• To wait at a barrier, a process executes:


pthread_barrier_wait(&b);

02/05/2015 CS267 Lecture 6 14
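
Putting the two calls together, a minimal sketch of a two-phase computation on systems that provide POSIX barriers (the thread function and phase labels are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3
    static pthread_barrier_t b;

    /* No thread starts phase 2 until every thread has finished phase 1 */
    void* phased_worker(void *arg) {
        long tid = (long)arg;
        printf("thread %ld: phase 1\n", tid);
        pthread_barrier_wait(&b);
        printf("thread %ld: phase 2\n", tid);
        return NULL;
    }

    int main() {
        pthread_t threads[NTHREADS];
        pthread_barrier_init(&b, NULL, NTHREADS);   /* count = 3, default attrs */
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, phased_worker, (void*)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);
        pthread_barrier_destroy(&b);
        return 0;
    }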


Basic Types of Synchronization: Mutexes
Mutexes -- mutual exclusion aka locks
• threads are working mostly independently
• need to access common data structure
lock *l = alloc_and_init(); /* shared */
acquire(l);
access data
release(l);
• Locks only affect processors using them:
• If a thread accesses the data without doing the
acquire/release, locks by others will not help
• Java and other languages have lexically scoped
synchronization, i.e., synchronized methods/blocks
• Can’t forget to say “release”
• Semaphores generalize locks to allow k threads
simultaneous access; good for limited resources
02/05/2015 CS267 Lecture 6 15
Mutexes in POSIX Threads
• To create a mutex:
#include <pthread.h>
pthread_mutex_t amutex = PTHREAD_MUTEX_INITIALIZER;
// or pthread_mutex_init(&amutex, NULL);
• To use it:
pthread_mutex_lock(&amutex);
pthread_mutex_unlock(&amutex);
• To deallocate a mutex
int pthread_mutex_destroy(pthread_mutex_t *mutex);
• Multiple mutexes may be held, but can lead to problems:
thread1          thread2
lock(a)          lock(b)
lock(b)          lock(a)      → deadlock
• Deadlock results if both threads acquire one of their locks,
so that neither can acquire the second
02/05/2015 CS267 Lecture 6 16
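
One common way to avoid the deadlock above, sketched here as an illustration (not from the slides), is to have every thread acquire the mutexes in the same global order:

    #include <pthread.h>

    static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;
    static int shared_x = 0, shared_y = 0;

    /* Both threads take the locks in the same order (a before b),
       so the lock(a)/lock(b) vs lock(b)/lock(a) deadlock cannot occur */
    void* thread1(void *arg) {
        pthread_mutex_lock(&a);
        pthread_mutex_lock(&b);
        shared_x++; shared_y++;
        pthread_mutex_unlock(&b);
        pthread_mutex_unlock(&a);
        return NULL;
    }

    void* thread2(void *arg) {
        pthread_mutex_lock(&a);          /* same order as thread1 */
        pthread_mutex_lock(&b);
        shared_y--; shared_x--;
        pthread_mutex_unlock(&b);
        pthread_mutex_unlock(&a);
        return NULL;
    }

    int main() {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }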
Summary of Programming with Threads
• POSIX Threads are based on OS features
• Can be used from multiple languages (need appropriate header)
• Familiar language for most of program
• Ability to share data is convenient

• Pitfalls
• Data race bugs are very nasty to find because they can be
intermittent
• Deadlocks are usually easier, but can also be intermittent

• Researchers look at transactional memory as an alternative


• OpenMP is commonly used today as an alternative

02/05/2015 CS267 Lecture 6 17


Parallel Programming in OpenMP

CS267 Lecture 6 18
Introduction to OpenMP
• What is OpenMP?
• Open specification for Multi-Processing, latest version 4.0, July 2013
• “Standard” API for defining multi-threaded shared-memory programs
• openmp.org – Talks, examples, forums, etc.
• See parlab.eecs.berkeley.edu/2012bootcampagenda
• 2 OpenMP lectures (slides and video) by Tim Mattson
• computing.llnl.gov/tutorials/openMP/
• portal.xsede.org/online-training
• www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf

• High-level API
• Preprocessor (compiler) directives ( ~ 80% )
• Library Calls ( ~ 19% )
• Environment Variables ( ~ 1% )

02/05/2015 CS267 Lecture 6 19
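
A minimal sketch showing all three API levels at once (the file name and thread counts are arbitrary): the pragma is a compiler directive, omp_set_num_threads is a library call, and OMP_NUM_THREADS is the corresponding environment variable, used when no library call overrides it. With gcc, compile with -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    int main() {
        omp_set_num_threads(4);              /* library call */

        #pragma omp parallel                 /* compiler directive */
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());

        return 0;
    }

Example usage: gcc -fopenmp hello.c, then OMP_NUM_THREADS=4 ./a.out (the environment variable sets the default thread count if the omp_set_num_threads call is removed).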


A Programmer’s View of OpenMP
• OpenMP is a portable, threaded, shared-memory
programming specification with “light” syntax
• Exact behavior depends on OpenMP implementation!
• Requires compiler support (C, C++ or Fortran)

• OpenMP will:
• Allow a programmer to separate a program into serial regions and
parallel regions, rather than managing T concurrently-executing threads.
• Hide stack management
• Provide synchronization constructs

• OpenMP will not:


• Parallelize automatically
• Guarantee speedup
• Provide freedom from data races

02/05/2015 CS267 Lecture 6 20


Motivation – OpenMP

int main() {

// Do this part in parallel

printf( "Hello, World!\n" );

return 0;
}

02/05/2015 CS267 Lecture 6 21


Motivation – OpenMP

int main() {

omp_set_num_threads(16);

// Do this part in parallel


#pragma omp parallel
{
printf( "Hello, World!\n" );
}

return 0;
}

02/05/2015 CS267 Lecture 6 22


Programming Model – Concurrent Loops
• OpenMP easily parallelizes loops
• Requires: No data dependencies
(reads/write or write/write pairs)
between iterations!

• Preprocessor calculates loop
  bounds for each thread directly
  from serial source

    #pragma omp parallel for
    for( i=0; i < 25; i++ )
    {
        printf("Foo");
    }
02/05/2015 CS267 Lecture 6 23
Programming Model – Loop Scheduling
• schedule clause determines how loop iterations are
divided among the thread team; no one best way
• static([chunk]) divides iterations statically between
threads (default if no hint)
• Each thread receives [chunk] iterations, rounding as necessary to
account for all iterations
• Default [chunk] is ceil( # iterations / # threads )
• dynamic([chunk]) allocates [chunk] iterations per thread,
allocating an additional [chunk] iterations when a thread
finishes
• Forms a logical work queue, consisting of all loop iterations
• Default [chunk] is 1
• guided([chunk]) allocates dynamically, but [chunk] is
exponentially reduced with each allocation

02/05/2015 CS267 Lecture 6 24
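
An illustrative sketch of the three schedule kinds (the function and loop bodies are made up; only the clauses matter):

    /* Illustrative sketch of the three schedule kinds on simple loops */
    void scale_loops(double *x, int n) {
        int i;

        /* static: iterations split into equal contiguous chunks up front;
           good when every iteration does about the same work */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < n; i++)
            x[i] *= 0.5;

        /* dynamic with chunk 4: threads grab 4 iterations at a time from a
           logical work queue; good when iteration cost varies */
        #pragma omp parallel for schedule(dynamic, 4)
        for (i = 0; i < n; i++)
            x[i] += 1.0 / (i + 1);

        /* guided: like dynamic, but the chunk size shrinks as work runs out */
        #pragma omp parallel for schedule(guided)
        for (i = 0; i < n; i++)
            x[i] -= 0.25;
    }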


Programming Model – Data Sharing
• Parallel programs often employ two types of data
  • Shared data, visible to all threads, similarly named
  • Private data, visible to a single thread (often stack-allocated)

• PThreads:
  • Global-scoped variables are shared
  • Stack-allocated variables are private

• OpenMP:
  • shared variables are shared
  • private variables are private

    // shared, globals
    int bigdata[1024];

    void* foo(void* bar) {
      // private, stack
      int tid;

      #pragma omp parallel \
          shared ( bigdata ) \
          private ( tid )
      {
        /* Calculation goes here */
      }
    }
02/05/2015 CS267 Lecture 6 25
Programming Model - Synchronization
• OpenMP Synchronization

• OpenMP Critical Sections
  • Named or unnamed
  • No explicit locks / mutexes
      #pragma omp critical
      {
        /* Critical code here */
      }

• Barrier directives
      #pragma omp barrier

• Explicit Lock functions
  • When all else fails – may require flush directive
      omp_set_lock( lock l );
      /* Code goes here */
      omp_unset_lock( lock l );

• Single-thread regions within parallel regions
  • master, single directives
      #pragma omp single
      {
        /* Only executed once */
      }

02/05/2015 CS267 Lecture 6 26
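
A small sketch combining several of these constructs, assuming the usual parallel-sum pattern (names and loop body are illustrative): each thread reduces into a private variable, a critical section merges the partial sums, a barrier ensures all merges are done, and a single region prints the result.

    #include <stdio.h>

    int main() {
        double sum = 0.0;
        int n = 1000;

        #pragma omp parallel
        {
            double local = 0.0;

            #pragma omp for
            for (int i = 0; i < n; i++)
                local += 1.0 / (i + 1);

            /* Unnamed critical section: one thread at a time updates sum */
            #pragma omp critical
            { sum += local; }

            /* Barrier: every thread must finish its critical update
               before any thread reads the final sum below */
            #pragma omp barrier

            /* Only one thread prints the result */
            #pragma omp single
            { printf("sum = %f\n", sum); }
        }
        return 0;
    }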


Microbenchmark: Grid Relaxation (Stencil)

for( t=0; t < t_steps; t++) {

  #pragma omp parallel for \
      shared(grid,x_dim,y_dim) private(x,y)
  for( x=0; x < x_dim; x++) {
    for( y=0; y < y_dim; y++) {
      grid[x][y] = /* avg of neighbors */
    }
  }
  // Implicit Barrier Synchronization

  temp_grid = grid;
  grid = other_grid;
  other_grid = temp_grid;
}

02/05/2015 CS267 Lecture 6 27


Microbenchmark: Structured Grid
• ocean_dynamic – Traverses entire ocean, row-by-row,
  assigning row iterations to threads with dynamic
  scheduling.                                              (OpenMP)

• ocean_static – Traverses entire ocean, row-by-row,
  assigning row iterations to threads with static
  scheduling.                                              (OpenMP)

• ocean_squares – Each thread traverses a square-shaped
  section of the ocean. Loop-level scheduling not used;
  loop bounds for each thread are determined explicitly.   (OpenMP)

• ocean_pthreads – Each thread traverses a square-shaped
  section of the ocean. Loop bounds for each thread are
  determined explicitly.                                   (PThreads)

02/05/2015 CS267 Lecture 6 28


Microbenchmark: Ocean

02/05/2015 CS267 Lecture 6 29


Microbenchmark: Ocean

02/05/2015 CS267 Lecture 6 30


Evaluation
• OpenMP scales to 16-processor systems
• Was overhead too high?
• In some cases, yes (when too little work per processor)
• Did compiler-generated code compare to hand-written code?
• Yes!
• How did the loop scheduling options affect performance?
• dynamic or guided scheduling helps loops with variable
iteration runtimes
• static or predicated scheduling more appropriate for shorter
loops

• OpenMP is a good tool to parallelize (at least some!)


applications

02/05/2015 CS267 Lecture 6 31


OpenMP Summary
• OpenMP is a compiler-based technique to create
concurrent code from (mostly) serial code
• OpenMP can enable (easy) parallelization of loop-based
code
• Lightweight syntactic language extensions

• OpenMP performs comparably to manually-coded


threading
• Scalable
• Portable

• Not a silver bullet for all (more irregular) applications

• Lots of detailed tutorials/manuals on-line


02/05/2015 CS267 Lecture 6 32
Extra Slides

CS267 Lecture 6 33
Shared Memory Hardware and Memory Consistency

CS267 Lecture 6 42
Basic Shared Memory Architecture
• Processors all connected to a large shared memory
• Where are caches?

P1 P2 Pn

interconnect

memory

• Now take a closer look at structure, costs, limits,


programming
02/05/2015 CS267 Lecture 6 43
What About Caching???

[Figure: processors P1 ... Pn, each with its own cache ($), connected by a bus to memory and I/O devices]

• Want high performance for shared memory: Use Caches!


• Each processor has its own cache (or multiple caches)
• Place data from memory into cache
• Writeback cache: don’t send all writes over bus to memory
• Caches reduce average latency
• Automatic replication closer to processor
• More important to multiprocessor than uniprocessor: latencies longer
• Normal uniprocessor mechanisms to access data
• Loads and Stores form very low-overhead communication primitive
• Problem: Cache Coherence!
02/05/2015 Slide source: John Kubiatowicz
Example Cache Coherence Problem

[Figure: processors P1, P2, P3, each with a cache, on a shared bus with memory and I/O devices; memory initially holds u:5, two processors read u into their caches (events 1 and 2), one processor then writes u=7 in its cache (event 3), and later reads of u (events 4 and 5) may return different values]

• Things to note:
• Processors could see different values for u after event 3
• With write back caches, value written back to memory depends on
happenstance of which cache flushes or writes back value when
• How to fix with a bus: Coherence Protocol
• Use bus to broadcast writes or invalidations
• Simple protocols rely on presence of broadcast medium
• Bus not scalable beyond about 64 processors (max)
• Capacity, bandwidth limitations
02/05/2015 Slide source: John Kubiatowicz
Scalable Shared Memory: Directories
[Figure: processors, each with a cache, connected by an interconnection network to memory; each memory block has a directory entry holding presence bits and a dirty bit]

• k processors
• With each cache-block in memory: k presence-bits, 1 dirty-bit
• With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit

• Every memory block has associated directory information


• keeps track of copies of cached blocks and their states
• on a miss, find directory entry, look it up, and communicate only with the nodes that
have copies if necessary
• in scalable networks, communication with directory and copies is through network
transactions
• Each Reader recorded in directory
• Processor asks permission of memory before writing:
• Send invalidation to each cache with read-only copy
• Wait for acknowledgements before returning permission for writes
02/05/2015 Slide source: John Kubiatowicz
Intuitive Memory Model
• Reading an address should return the last
value written to that address
• Easy in uniprocessors
• except for I/O
• Cache coherence problem in MPs is more
pervasive and more performance critical
• More formally, this is called sequential
consistency:
“A multiprocessor is sequentially consistent if the result
of any execution is the same as if the operations of all
the processors were executed in some sequential
order, and the operations of each individual processor
appear in this sequence in the order specified by its
program.” [Lamport, 1979]
02/05/2015 CS267 Lecture 6 47
Sequential Consistency Intuition
• Sequential consistency says the machine behaves as if
it does the following

P0 P1 P2 P3

memory

02/05/2015 CS267 Lecture 6 48


Memory Consistency Semantics

What does this imply about program behavior?


• No process ever sees “garbage” values, i.e., average of 2 values
• Processors always see values written by some processor
• The value seen is constrained by program order on all
processors
• Time always moves forward
• Example: spin lock
  • P1 writes data=1, then writes flag=1
  • P2 waits until flag=1, then reads data
  • If P2 sees the new value of flag (=1), it must see the
    new value of data (=1)

    initially: flag = 0, data = 0

    P1                    P2
    data = 1              10: if flag=0, goto 10
    flag = 1              … = data

    If P2 reads flag      Then P2 may read data
          0                       1
          0                       0
          1                       1
02/05/2015 CS267 Lecture 6 49
Are Caches “Coherent” or Not?
• Coherence means different copies of same location have same
value, incoherent otherwise:
• p1 and p2 both have cached copies of data (= 0)
• p1 writes data=1
• May “write through” to memory
• p2 reads data, but gets the “stale” cached copy
• This may happen even if it read an updated value of another
variable, flag, that came from memory
[Figure: memory holds data = 0; after p1's write, p1's cache holds data = 1 while p2's cache still holds the stale data = 0]
02/05/2015 CS267 Lecture 6 50
Snoopy Cache-Coherence Protocols

[Figure: processors P0 ... Pn, each with a cache ($) holding state, address, and data for its blocks; each cache snoops the memory bus, over which memory operations from any Pn travel to the memory modules]
• Memory bus is a broadcast medium
• Caches contain information on which addresses they store
• Cache Controller “snoops” all transactions on the bus
• A transaction is a relevant transaction if it involves a cache block currently
contained in this cache
• Take action to ensure coherence
• invalidate, update, or supply value
• Many possible designs (see CS252 or CS258)

02/05/2015 CS267 Lecture 6 51


Limits of Bus-Based Shared Memory

Assume:
• 1 GHz processor w/o cache
  => 4 GB/s inst BW per processor (32-bit)
  => 1.2 GB/s data BW at 30% load-store
• Suppose 98% inst hit rate and 95% data hit rate
  => 80 MB/s inst BW per processor
  => 60 MB/s data BW per processor
  => 140 MB/s combined BW per processor
• Assuming 1 GB/s bus bandwidth
  => 8 processors will saturate bus

[Figure: processors with caches sharing a bus to memory and I/O; 5.2 GB/s demand per processor without caches vs. 140 MB/s with caches]

02/05/2015 CS267 Lecture 6


Sample Machines
• Intel Pentium Pro Quad
  • Coherent
  • 4 processors
  [Figure: P-Pro modules, each with CPU, interrupt controller, 256-KB L2 $, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with PCI bridges, PCI buses and I/O cards, memory controller, MIU, and 1-, 2-, or 4-way interleaved DRAM]

• Sun Enterprise server
  • Coherent
  • Up to 16 processor and/or memory-I/O cards
  [Figure: CPU/mem cards (two processors P with $ and $2 each, plus memory controller) and I/O cards (bus interface, SBUS slots, SCSI, 100bT, 2 FiberChannel) connected through a bus interface/switch to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]

• IBM Blue Gene/L
  • L1 not coherent, L2 shared
02/05/2015 CS267 Lecture 6 53
Directory Based Memory/Cache Coherence
• Keep Directory to keep track of which memory stores latest
copy of data
• Directory, like cache, may keep information such as:
• Valid/invalid
• Dirty (inconsistent with memory)
• Shared (in other caches)
• When a processor executes a write operation to shared
data, basic design choices are:
• With respect to memory:
• Write through cache: do the write in memory as well as cache
• Write back cache: wait and do the write later, when the item is flushed
• With respect to other cached copies
• Update: give all other processors the new value
• Invalidate: all other processors remove from cache
• See CS252 or CS258 for details
02/05/2015 CS267 Lecture 6 54
SGI Altix 3000

• A node contains up to 4 Itanium 2 processors and 32GB of memory


• Network is SGI’s NUMAlink, the NUMAflex interconnect technology.
• Uses a mixture of snoopy and directory-based coherence
• Up to 512 processors that are cache coherent (global address space
is possible for larger machines)

02/05/2015 CS267 Lecture 6


Sharing: A Performance Problem
• True sharing
• Frequent writes to a variable can create a bottleneck
• OK for read-only or infrequently written data
• Technique: make copies of the value, one per processor, if this
is possible in the algorithm
• Example problem: the data structure that stores the
freelist/heap for malloc/free
• False sharing
• Cache block may also introduce artifacts
• Two distinct variables in the same cache block
• Technique: allocate data used by each processor contiguously,
or at least avoid interleaving in memory
• Example problem: an array of ints, one written frequently by
each processor (many ints per cache line)

02/05/2015 CS267 Lecture 6
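
A minimal sketch of the padding/alignment technique for the array-of-counters example, assuming 64-byte cache lines (the struct name, alignment value, and OpenMP loop are illustrative, not from the slides):

    #include <omp.h>

    #define NTHREADS 8

    /* One counter per thread, each forced onto its own (assumed 64-byte)
       cache line, so frequent writes by different threads never share a line */
    struct padded_counter {
        _Alignas(64) long count;
    };

    static struct padded_counter counters[NTHREADS];

    void count_events(long n) {
        #pragma omp parallel num_threads(NTHREADS)
        {
            int tid = omp_get_thread_num();
            for (long i = 0; i < n; i++)
                counters[tid].count++;   /* no false sharing with neighbors */
        }
    }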


Cache Coherence and Sequential Consistency
• There is a lot of hardware/work to ensure coherent caches
• Never more than 1 version of data for a given address in caches
• Data is always a value written by some processor
• But other HW/SW features may break sequential consistency (SC):
• The compiler reorders/removes code (e.g., your spin lock, see next slide)
• The compiler allocates a register for flag on Processor 2 and spins on that
register value without ever completing
• Write buffers (place to store writes while waiting to complete)
• Processors may reorder writes to merge addresses (not FIFO)
• Write X=1, Y=1, X=2 (second write to X may happen before Y’s)
• Prefetch instructions cause read reordering (read data before flag)
• The network reorders the two write messages.
• The write to flag is nearby, whereas data is far away.
• Some of these can be prevented by declaring variables “volatile”
• Most current commercial SMPs give up SC
• A correct program on a SC processor may be incorrect on one that is not

02/05/2015 CS267 Lecture 6 57


Example: Coherence not Enough
P1 P2
/* Assume initial value of A and flag is 0 */
A = 1; while (flag == 0); /*spin idly*/
flag = 1; print A;

• Intuition not guaranteed by coherence


• expect memory to respect order between accesses to
different locations issued by a given process
• to preserve orders among accesses to same location by different
processes
• Coherence is not enough!
• pertains only to single location
• Need statement about ordering between multiple locations.

[Figure: conceptual picture of processors P1 ... Pn sharing a single memory]
02/05/2015 Slide source: John Kubiatowicz
Programming with Weaker Memory Models than SC
• Possible to reason about machines with fewer
properties, but difficult
• Some rules for programming with these models
• Avoid race conditions
• Use system-provided synchronization primitives
• At the assembly level, may use “fences” (or analogs)
directly
• The high level language support for these differs
• Built-in synchronization primitives normally include the
necessary fence operations
• lock (), … only one thread at a time allowed here…. unlock()
• Region between lock/unlock called critical region
• For performance, need to keep critical region short

02/05/2015 CS267 Lecture 6 59
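
As an illustration of the same idea at the language level (not covered in the slides), C11 atomics let the flag/data example from earlier be written with explicit release/acquire ordering, which plays the role of the fences mentioned above:

    #include <stdatomic.h>

    static int data = 0;
    static atomic_int flag = 0;

    /* Producer: the release store on flag orders the earlier plain write
       to data before it, so any thread that sees flag == 1 with an
       acquire load also sees data == 1 */
    void producer(void) {
        data = 1;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    /* Consumer: spin with acquire loads; the release/acquire pairing
       provides the ordering a bare spin loop does not */
    int consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                            /* spin idly */
        return data;                     /* guaranteed to read 1 */
    }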


What to Take Away?
• Programming shared memory machines
• May allocate data in large shared region without too many
worries about where
• Memory hierarchy is critical to performance
• Even more so than on uniprocessors, due to coherence traffic
• For performance tuning, watch sharing (both true and false)
• Semantics
• Need to lock access to shared variable for read-modify-write
• Sequential consistency is the natural semantics
• Write race-free programs to get this
• Architects worked hard to make this work
• Caches are coherent with buses or directories
• No caching of remote data on shared address space machines
• But compiler and processor may still get in the way
• Non-blocking writes, read prefetching, code motion…
• Avoid races or use machine-specific fences carefully
02/05/2015 CS267 Lecture 6 60
Extra Slides

CS267 Lecture 6 61
Sequential Consistency Example

Processor 1        Processor 2        One Consistent Serial Order

LD1 A → 5          LD5 B → 2          LD1 A → 5
LD2 B → 7          …                  LD2 B → 7
ST1 A,6            LD6 A → 6          LD5 B → 2
…                  ST4 B,21           ST1 A,6
LD3 A → 6          …                  LD6 A → 6
LD4 B → 21         LD7 A → 6          ST4 B,21
ST2 B,13           …                  LD3 A → 6
ST3 B,4            LD8 B → 4          LD4 B → 21
                                      LD7 A → 6
                                      ST2 B,13
                                      ST3 B,4
                                      LD8 B → 4

02/05/2015                            Slide source: John Kubiatowicz
Multithreaded Execution
• Multitasking operating system:
• Gives “illusion” that multiple things happening at same time
• Switches at a coarse-grained time quanta (for instance: 10ms)
• Hardware Multithreading: multiple threads share
processor simultaneously (with little OS help)
• Hardware does switching
• HW for fast thread switch in small number of cycles
• much faster than OS switch which is 100s to 1000s of clocks
• Processor duplicates independent state of each thread
• e.g., a separate copy of register file, a separate PC, and for running
independent programs, a separate page table
• Memory shared through the virtual memory mechanisms, which already
support multiple processes
• When to switch between threads?
• Alternate instruction per thread (fine grain)
• When a thread is stalled, perhaps for a cache miss, another thread can
be executed (coarse grain)

02/05/2015 Slide source: John Kubiatowicz


Thread Scheduling
[Figure: timeline of a main thread and threads A, B, C, D running and completing at different times]

• Once created, when will a given thread run?


• It is up to the Operating System or hardware, but it will run eventually,
even if you have more threads than cores
• But – scheduling may be non-ideal for your application
• Programmer can provide hints or affinity in some cases
• E.g., create exactly P threads and assign to P cores
• Can provide user-level scheduling for some systems
• Application-specific tuning based on programming model
• Work in the ParLAB on making user-level scheduling easy to do (Lithe)

02/05/2015 Slide source: John Kubiatowicz


What about combining ILP and TLP?

• TLP and ILP exploit two different kinds of


parallel structure in a program
• Could a processor oriented at ILP benefit from
exploiting TLP?
• functional units are often idle in data path designed for ILP
because of either stalls or dependences in the code
• TLP used as a source of independent instructions that might
keep the processor busy during stalls
• TLP can be used to occupy functional units that would otherwise lie
idle when insufficient ILP exists
• Called “Simultaneous Multithreading”
• Intel renamed this “Hyperthreading”

02/05/2015 Slide source: John Kubiatowicz


Quick Recall: Many Resources IDLE!

[Figure: utilization of functional units for an 8-way superscalar, showing that many resources sit idle. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism"]

02/05/2015                            Slide source: John Kubiatowicz
Simultaneous Multi-threading ...

One thread, 8 units vs. two threads, 8 units

[Figure: per-cycle occupancy of 8 functional units (M M FX FX FP FP BR CC) over cycles 1-9; one thread leaves many slots empty, two threads fill more of them]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch,
CC = Condition Codes

02/05/2015                            Slide source: John Kubiatowicz
Power 5 dataflow ...

• Why only two threads?


• With 4, one of the shared resources (physical registers,
cache, memory bandwidth) would be prone to bottleneck
• Cost:
• The Power5 core is about 24% larger than the Power4 core
because of the addition of SMT support
02/05/2015
