Beyond The Critical Section
Introduction Tony Albrecht Senior Programmer for Pandemic Studios Brisbane Email: Tony.Albrecht0(at)gmail(dot)com
Overview Justify myself Start at the bottom Continue from the top Quick look in the middle
Parallel Programming: Why? Moore’s Law Limits to sequential CPUs – parallel processing is how we avoid those limits. Programs  must  be parallel to get Moore level speedups. Applies to programming in general.
Moore’s Law
“Waaaah!” “Parallel programming is hard.” “My code already runs incredibly fast – it doesn’t need to go any faster.” “It’s impossible to parallelise this algorithm.” “Only the rendering pipeline needs to be parallel.” “That’s only for supercomputers.”
Console trends
So? ~2011: a ~6TFlop machine Next console will have between 64 and 128 processors, 4 to 8GB of memory 128 processors!!!!
How can we utilise 100+ CPUs? Start now Design Implement Iterate Learn
The Problems Race conditions
Race Condition Example x++ x++ x=0 x=? Thread A Thread B
Race Condition Example R1 = 0 x=0 Thread A Thread B
Race Condition Example R1 = 0+1 x=0 Thread A Thread B
Race Condition Example R1 = 1 R1 = 0 x=0 Thread A Thread B
Race Condition Example R1 = 1 R1 = 0+1 x=1 Thread A Thread B
Race Condition Example Solution requires atomics or locking. R1 = 1 R1 = 1 x=1 Thread A Thread B
Atomics Atomic operations are uninterruptable, singular operations Get/Set Inc/Dec (Add/Sub) Compare And Swap Plus other variations
Compare And Swap CAS(memory, oldValue, newValue) if (memory == oldValue)   memory = newValue; Surprisingly useful. Simple locking primitive: while (CAS(&lock, 0, 1) != 0)   ;
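The CAS-based spinlock above can be sketched in portable C++ with `std::atomic` standing in for a platform CAS intrinsic (the `CasLock`/`LockedCount` names are hypothetical, for illustration only):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// A minimal spinlock built on compare-and-swap: spin until we
// flip the flag from 0 (unlocked) to 1 (locked).
struct CasLock {
    std::atomic<int> flag{0};
    void Lock() {
        int expected = 0;
        // compare_exchange rewrites 'expected' on failure, so re-arm it each try.
        while (!flag.compare_exchange_weak(expected, 1))
            expected = 0;
    }
    void Unlock() { flag.store(0); }
};

// Demo: many threads increment a shared counter under the lock.
int LockedCount(int threads, int perThread) {
    CasLock lock;
    int count = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < perThread; ++i) {
                lock.Lock();
                ++count;            // protected region
                lock.Unlock();
            }
        });
    for (auto& th : pool) th.join();
    return count;
}
```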
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1
Race Condition Solution A AtomicInc(x) AtomicInc(x) x=0 Thread A Thread B x=1 x=2
Locking Used to serialise access to code. Like a key to a coffee shop toilet  one key,  one toilet,  queue for access. Lock()/Unlock() … Code… Lock(); // protected region Unlock(); ...more code…
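In C++ the Lock()/Unlock() pattern above is usually expressed with `std::mutex` and a scope guard; a minimal sketch (`ProtectedIncrement`/`RunThreads` are hypothetical names):

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::mutex gLock;
int gCounter = 0;

// ...Code...
void ProtectedIncrement() {
    std::lock_guard<std::mutex> guard(gLock); // Lock()
    ++gCounter;                               // protected region
}                                             // Unlock() on scope exit
// ...more code...

int RunThreads(int threads, int perThread) {
    gCounter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([=] {
            for (int i = 0; i < perThread; ++i) ProtectedIncrement();
        });
    for (auto& th : pool) th.join();
    return gCounter;
}
```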
Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unlock A Lock A x++ Unlock A
The Problems Race conditions Deadlocks
Deadlock “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.” — Kansas Legislature Deadlock can occur when 2 or more processes require resource(s) held by another.
Deadlock Thread 1   Thread 2 Generally can be considered to be a logic error Can be painfully subtle and rare. Lock A Lock B Lock B Lock A Unlock A Unlock B
The Problems Race conditions Deadlocks Read/write tearing
Read/write tearing More than one thread writing to the same memory at the same time. The more data, the more likely. Solve with synchronisation primitives. “AAAAAAAA” “BBBBBBBB” “AAAABBBB”
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion
Priority Inversion Consider  threads with different priorities Low priority thread holds a shared resource High priority thread tries to acquire that resource High priority thread is blocked by the low Medium priority threads will execute at the expense of the low  and  the high threads.
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem
The ABA problem Thread 1 reads ‘A’ from memory. Thread 2 modifies memory value ‘A’ to ‘B’ and back to ‘A’ again. Thread 1 resumes and assumes nothing has changed (using CAS) Often associated with dynamically allocated memory
Consider a list and a thread pool… head a c b … ..
Thread A about to CAS head from a to b head a c b … .. CAS(&head->next,a,b);
Thread B: deq a & b head c … .. a b a & b are released into thread-local pools
Thread B enq a – reused head a c … .. b a is added back
Thread A executes CAS head a c … .. b CAS(&head->next,a,b);
Thread A executes CAS successfully! head a c … .. b CAS(&head->next,a,b);
ABA Solution Tag each pointer with a count Each time you use the ptr, inc the tag Must do it atomically
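The tag-plus-pointer idea can be sketched by packing a value and a tag into one 64-bit word so a single CAS checks and updates both; every successful swap bumps the tag, so a stale A→B→A observer fails. (`TaggedSlot` and its members are hypothetical names; a 32-bit index stands in for the pointer.)

```cpp
#include <atomic>
#include <cstdint>

// Tag-stamped CAS: pack a 32-bit "pointer" (here an index) and a
// 32-bit tag into one 64-bit word so both change atomically.
struct TaggedSlot {
    std::atomic<uint64_t> word{0};

    static uint64_t Pack(uint32_t value, uint32_t tag) {
        return (uint64_t(tag) << 32) | value;
    }
    // Succeeds only if BOTH value and tag match; bumps the tag on success.
    bool CompareAndSwap(uint32_t oldValue, uint32_t newValue, uint32_t oldTag) {
        uint64_t expected = Pack(oldValue, oldTag);
        return word.compare_exchange_strong(expected, Pack(newValue, oldTag + 1));
    }
    uint32_t Value() const { return uint32_t(word.load()); }
    uint32_t Tag()   const { return uint32_t(word.load() >> 32); }
};
```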
The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem Thread scheduling problems
Convoy/Stampede Convoy Multiple threads restricted by a bottleneck. Stampede Multiple threads being started at once.
Higher Level Locking Primitives SpinLock Mutex Barrier RWlock Semaphore
SpinLock Loop until a value is set. No OS overhead with thread management Doesn’t sleep thread Handy if you will never wait for long. Very bad if you need to wait for a long time Can embed sleep() or Yield() But these can be perilous
Mutex Mutual Exclusion A simple lock/unlock primitive Otherwise known as a CriticalSection Used to serialise access to code. Often overused. More than just a spinlock  can release thread Be aware of overhead
Barrier Will block until ‘n’ threads signal it Useful for ensuring that all threads have finished a particular task.
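A minimal barrier can be sketched with a mutex and condition variable (the `Barrier`/`SumAfterBarrier` names are hypothetical; C++20's `std::barrier` would be the production choice):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal single-use barrier: blocks until 'n' threads have signalled.
class Barrier {
    std::mutex m;
    std::condition_variable cv;
    int remaining;
public:
    explicit Barrier(int n) : remaining(n) {}
    void SignalAndWait() {
        std::unique_lock<std::mutex> lock(m);
        if (--remaining == 0)
            cv.notify_all();
        else
            cv.wait(lock, [&] { return remaining == 0; });
    }
};

// Demo: nobody proceeds past the barrier until all three workers are done.
int SumAfterBarrier() {
    int results[3] = {0, 0, 0};
    Barrier barrier(3);
    std::vector<std::thread> pool;
    for (int t = 0; t < 3; ++t)
        pool.emplace_back([&, t] {
            results[t] = t + 1;       // "Do stuff"
            barrier.SignalAndWait();  // "Signal"
        });
    for (auto& th : pool) th.join();
    return results[0] + results[1] + results[2]; // "Use results"
}
```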
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Done
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Signal
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Do other stuff
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Done Calculating
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Signal Done
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(1) Use results Do stuff More code Signal
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff Calc pi
Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff
RWLock Allows many readers But exclusive writing Writing blocks writers and readers. Writing waits until all readers have finished.
Semaphore Generalisation of mutex Allows  ‘c’ threads access  to critical code at once. Basically an atomic integer Wait() will block if value == 0; then dec & cont. Signal() increments value (allows a waiting thread to unblock) Conceptually,  Mutexes stop other threads from running code Semaphores tell other threads to run code
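The atomic-integer view of a semaphore above can be sketched with a mutex and condition variable (hypothetical `Semaphore` class; C++20 offers `std::counting_semaphore`):

```cpp
#include <condition_variable>
#include <mutex>

// Counting semaphore: Wait() blocks while the count is zero, then
// decrements; Signal() increments and wakes a waiting thread.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count;
public:
    explicit Semaphore(int c) : count(c) {}
    void Wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return count > 0; });
        --count;
    }
    void Signal() {
        { std::lock_guard<std::mutex> lock(m); ++count; }
        cv.notify_one();
    }
    bool TryWait() { // non-blocking probe, handy for testing
        std::lock_guard<std::mutex> lock(m);
        if (count == 0) return false;
        --count;
        return true;
    }
};
```

With an initial count of `c`, up to `c` threads pass `Wait()` before anyone blocks, which is exactly the "allow 'c' threads into the critical code" behaviour.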
Parallel Patterns Why patterns? A set of templates to aid design A common language Aids education Provides a familiar base to start implementation
So, how do we start? Analyse your problem Identify tasks that can execute concurrently Identify data local to each task Identify task order/schedule Analyse dependencies between tasks. Consider the HW you are running on
Problem Decomposition Problem From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive From “Patterns for Parallel Programming”
Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive Task Parallelism Divide and Conquer Geometric Decomposition Recursive Data Pipeline Event-Based Coordination From “Patterns for Parallel Programming”
Task Parallelism Task dominant, linear Functionally driven problem Many tasks that may depend on each other Try to minimise dependencies Key elements: Tasks Dependencies Schedule
Divide and Conquer Task dominant, recursive Problem solved by splitting it into smaller sub-problems and solving them independently. Generally it’s easy to take a sequential Divide and Conquer implementation and parallelise it.
Geometric Decomposition Data dominant, linear Decompose the data into chunks Solve for chunks independently. Beware of edge dependencies.
Recursive Data Pattern Data dominant, recursive Operations on trees, lists, graphs Dependencies can often prohibit parallelism Often requires tricky recasting of the problem, i.e. operate on all tree elements in parallel More work, but distributed across more cores
Pipeline Pattern Data flow dominant, linear Sets of data flowing through a sequence of stages Each stage is independent Easy to understand - simple, dedicated code
Event-Based Coordination Data flow, recursive Groups of semi-independent tasks interacting in an irregular fashion. Tasks sending events to other tasks which send tasks… Can be highly complex Tricky to load balance
Supporting Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
Program Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
SPMD Single Program, Multiple Data Single source code image running on multiple threads Very common Easy to maintain Easy to understand
Master/Worker Dominant force is the need to dynamically load balance Tasks are highly variable, i.e. duration/cost Program structure doesn’t map onto loops Cores vary in performance. “Bag of Tasks” Master sets up tasks and waits for completion Workers grab a task from the queue, execute it and then grab the next one.
Loop Parallelism Dominated by computationally expensive loops Split iterations of the loop out to threads Be careful of memory use and process granularity
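The loop-splitting idea can be sketched as follows: each thread takes a contiguous chunk of iterations and accumulates into its own slot, so granularity stays coarse and no locking is needed (`ParallelSum` is a hypothetical name):

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// Split the iterations of an expensive loop across threads, each
// working on its own contiguous chunk with thread-local accumulation.
long long ParallelSum(const std::vector<int>& data, int threads) {
    std::vector<long long> partial(threads, 0);
    std::vector<std::thread> pool;
    size_t chunk = (data.size() + threads - 1) / threads;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end = std::min(begin + chunk, data.size());
            for (size_t i = begin; i < end; ++i)
                partial[t] += data[i];   // no shared writes, no lock
        });
    for (auto& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```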
Fork/Join The number of concurrent tasks varies over the life of the execution. Complex or recursive relations between tasks Either Direct task/core mapping Thread pool
Supporting Data Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
Shared Data Required when At least one data structure is accessed by multiple tasks At least one task modifies the shared data The tasks potentially need to use the modified value. Solutions Serialise execution – mutual exclusion Noninterfering sets of operations RWlocks
Distributed Array How can we distribute an array across many threads? Used in Geometric Decomposition Break array into thread specific parts Maximise locality per thread Be wary of cache line overlap Keep data distribution coarse
Shared Queue Extremely valuable construct  Fundamental part of Master/Worker (“Bag of Tasks”) Must be consistent and work with many competing threads. Must be as efficient as possible Preferably lock free
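A correct (if lock-based, not lock-free) shared queue is a small amount of code; a sketch with hypothetical names, suitable as a starting point before moving to a lock-free version:

```cpp
#include <mutex>
#include <queue>

// A simple locked shared queue (the "Bag of Tasks"): consistent with
// many competing threads, though it serialises access.
template <typename T>
class SharedQueue {
    std::mutex m;
    std::queue<T> q;
public:
    void Push(const T& item) {
        std::lock_guard<std::mutex> lock(m);
        q.push(item);
    }
    bool TryPop(T& out) { // returns false when the bag is empty
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return false;
        out = q.front();
        q.pop();
        return true;
    }
};
```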
Lock free programming Locks  Simple, easy to use and implement But serialise code execution Lock Free Tricky to implement and debug
Lock Free linked list Lock free linked list (ordered) Easily generalised to other container classes Stacks Queues Relatively simple to understand
Adding a node to a list head a c tail b
Adding a node: Step 1 head a c tail b Find where to insert
Adding a node: Step 2 head a c tail b newNode->Next = prev->Next;
Adding a node: Step 3 head a c tail b prev->Next = newNode;
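The three insertion steps above, written out in single-threaded form (a minimal hypothetical node type, sorted ascending with a sentinel head):

```cpp
// Node for a singly linked list, kept sorted by key.
struct Node {
    int key;
    Node* next;
};

void Insert(Node* head, Node* newNode) {
    // Step 1: find where to insert
    Node* prev = head;
    while (prev->next && prev->next->key < newNode->key)
        prev = prev->next;
    // Step 2: newNode->Next = prev->Next;
    newNode->next = prev->next;
    // Step 3: prev->Next = newNode;
    prev->next = newNode;
}
```

With a single thread this is trivially correct; the next slides show why it breaks once two threads run it concurrently.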
Extending to multiple threads What could go wrong?
Add ‘b’ and ‘c’ concurrently head a d tail b c Find where to insert
Add ‘b’ and ‘c’ concurrently head a d tail b c newNode->Next = prev->Next;
Add ‘b’ and ‘c’ concurrently head a d tail b c prev->Next = newNode;
Add ‘b’ and ‘c’ concurrently head a d tail b c
Extending to multiple threads What could go wrong? Add another node between a & c a or c could be deleted A concurrent read could reach a dangling pointer. Any number of multiples of the above If anything can go wrong, it will. So, how do we make it thread safe? Let’s examine some solutions
Coarse Grained Locking Lock the list for each add or remove Also lock for reads (find, iterators) Will effectively serialise the list Only one thread at a time can access the list. Correctness at the expense of performance.
A concrete example 10 producers Add 500 random numbers in a tight loop 10 consumers Remove the 500 numbers in a tight loop Each in its own thread 21 threads Running on PS3 using SNTuner to profile
Coarse Grain head a c tail b
Step 1: Lock list b head a c tail
Step 2 & 3:Find then Insert  b head a c tail
Step 4:Unlock head a c tail b
Coarse Grained locking Wide green bars are active locks Little blips are adds or removes Execution took 416ms (profiling significantly impacts performance)
Fine Grained Locking Add and Remove only affects neighbours Give each Node a lock (So, creating a node creates a mutex) Lock only neighbours when adding or removing. When iterating along the list you must lock/unlock as you go.
Fine Grained Locking head a c tail b
Fine Grained Locking a c tail b head
Fine Grained Locking c tail b head a
Fine Grained Locking head tail b a c
Fine Grained Locking head a c tail b
Fine Grained Locking Blocking is much longer – due to overhead in creating a mutex Very slow > 1200ms Better solution would have been to have a pool of mutexes that could be used
Optimistic Locking Search without locking Lock nodes once found, then validate them Valid if you can navigate to it from head. If invalid, search from head again.
Optimistic: Add(“g”) head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 2: Lock head a c d tail m g f k
Step 3: Validate head a c d tail m g f k
Step 3: Validate head a c d tail m g f k
Step 3: Validate - FAIL head a tail m g d f k
Step 3a: Validate (retry) head a e tail m g d f k
Step 3a: Validate (success) head a e tail m g d f k
Step 4: Add head a e tail m g d f k
Step 5: Unlock head a e tail f k m g d
Optimistic Caveat We can’t delete nodes immediately Another thread could be reading it Can’t rely on memory not being changed. Use deferred garbage collection Delete in a ‘safe’ part of a frame. Or use intrusive lists (supply your own nodes) Find() requires validation (Locks).
Delete Caveat: Validate head a e tail m g d f k
Delete Caveat: delete ‘d’ head a e tail m g f k d
Delete Caveat: Validate head a e tail m g f k d
Delete Caveat: Valid! head a e tail m g f k d
Optimistic Synchronisation ~540ms Most time was spent validating Plus there was overhead in creating a mutex per node for the lock. Again, a pool of mutexes would benefit.
Lazy Synchronization Attempt to speed up Optimistic Validation Store a deleted flag in each node Find() is then lock free Just check the deleted flag on success.
Lazy: Add(“g”) head a c d tail f k m g
Step 1: Search head a c d tail f k m g
Step 1a: Search (delete c) head a c d tail f k m g
Step 1b: Search (lock) head d tail f k m g a c
Step 1c: Search (mark) head d tail f k m g a c
Step 2d: lock (skip/unlock) head a c d tail m g f k
Step 3: Add/Validate head a d tail m g c f k
Step 4: Unlock head a d tail f k m g c
Lazy Synchronisation Still need to keep the deleted nodes. Faster than Optimistic Still serialises.
Lazy Synchronisation ~330ms
Lock free (Non-Blocking) Can’t we just modify Lazy Sync to use CAS?
Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next;  |  prev->next=b;
Delete ‘a’ and add ‘b’ concurrently head a c tail b head->next=a->next;  |  prev->next=b;
Delete ‘a’ and add ‘b’ concurrently head a c tail b Effectively deletes ‘a’ and ‘b’.
Introducing the AtomicMarkedPtr<> Wrapper on a uint32 Encapsulates an atomic pointer and a flag Allows testing of a flag and updating of a pointer atomically. Use LSB for the flag AtomicMarkedPtr<Node> next; next.CompareAndSet(eValue, nValue, eFlag, nFlag);
AtomicMarkedPtr<> We can now use CAS to set a pointer  and  check a flag in a single atomic action. ie. check deleted status and change pointer at same time. class Node { public: Node(); AtomicMarkedPtr<Node> m_Next; T m_Data; int32 m_Key; };
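The idea above can be sketched portably: since aligned pointers always have a zero low bit, the mark can live there and pointer + flag update in a single CAS. (This is my illustrative reconstruction of such a class, not the talk's actual implementation.)

```cpp
#include <atomic>
#include <cstdint>

// Marked pointer: the flag occupies the LSB of an aligned pointer,
// so flag test and pointer update happen in one atomic action.
template <typename T>
class AtomicMarkedPtr {
    std::atomic<uintptr_t> word{0};
    static uintptr_t Pack(T* p, bool mark) {
        return reinterpret_cast<uintptr_t>(p) | uintptr_t(mark);
    }
public:
    void Set(T* p, bool mark) { word.store(Pack(p, mark)); }
    T* Ptr() const { return reinterpret_cast<T*>(word.load() & ~uintptr_t(1)); }
    bool Mark() const { return (word.load() & 1) != 0; }
    // Succeeds only if BOTH the expected pointer and expected mark match.
    bool CompareAndSet(T* ePtr, T* nPtr, bool eMark, bool nMark) {
        uintptr_t expected = Pack(ePtr, eMark);
        return word.compare_exchange_strong(expected, Pack(nPtr, nMark));
    }
};
```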
Lock Free: Remove ‘d’ head a c d tail f k m Start loop:
Step 1: Find ‘d’ head a c tail f k m pred curr succ d if(!InternalFind(‘d’)) continue;
Step 2: Mark ‘d’ head a c tail f k m pred curr succ d if(!curr->next->CAS(succ,succ,false,true)) continue;
Step 3: Skip ‘d’ head a c tail f k m pred curr succ d pred->next->CAS(curr,succ,false,false);
LockFree: InternalFind() Finds pred and curr Skips marked nodes. Consider the list at Step 2 in previous example and, lets introduce a second thread calling InternalFind();
Second InternalFind() head a c tail f k m pred curr succ pred curr succ d
If succ is marked… head a c tail f k m pred curr succ pred curr succ d
… Skip it head a c tail f k m pred curr succ pred curr succ d
Lock Free Synchronisation No blocking at all List is always in a consistent state. Faster threads help out slower ones.
Lock free Full thread usage ~60ms High thread coherency
Performance comparison
Real world considerations Cost of locking Context switching Memory coherency/latency Size/granularity of tasks
Advice Build a set of lock free containers Design around data flow Minimise locking You can have more than ‘n’ threads on an ‘n’ core machine Profile, profile, profile.
References Patterns for Parallel Programming – T. Mattson et al. The Art of Multiprocessor Programming – M. Herlihy and N. Shavit http://www.top500.org/ Flow Based Programming - http://www.jpaulmorrison.com/fbp/index.shtml http://www.valvesoftware.com/publications/2007/GDC2007_SourceMulticore.pdf http://www.netrino.com/node/202 http://blogs.intel.com/research/2007/08/what_makes_parallel_programmin.php The Little Book of Semaphores - http://www.greenteapress.com/semaphores/ My Blog: 7DOF - http://seven-degrees-of-freedom.blogspot.com/
TrsLabs Consultants - DeFi, WEb3, Token ListingTrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token Listing
Trs Labs
 
TrsLabs - Leverage the Power of UPI Payments
TrsLabs - Leverage the Power of UPI PaymentsTrsLabs - Leverage the Power of UPI Payments
TrsLabs - Leverage the Power of UPI Payments
Trs Labs
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
Foundations of Cybersecurity - Google Certificate
Foundations of Cybersecurity - Google CertificateFoundations of Cybersecurity - Google Certificate
Foundations of Cybersecurity - Google Certificate
VICTOR MAESTRE RAMIREZ
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
AI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdfAI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdf
Precisely
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Web and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in RajpuraWeb and Graphics Designing Training in Rajpura
Web and Graphics Designing Training in Rajpura
Erginous Technology
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
The Changing Compliance Landscape in 2025.pdf
The Changing Compliance Landscape in 2025.pdfThe Changing Compliance Landscape in 2025.pdf
The Changing Compliance Landscape in 2025.pdf
Precisely
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
TrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token ListingTrsLabs Consultants - DeFi, WEb3, Token Listing
TrsLabs Consultants - DeFi, WEb3, Token Listing
Trs Labs
 
TrsLabs - Leverage the Power of UPI Payments
TrsLabs - Leverage the Power of UPI PaymentsTrsLabs - Leverage the Power of UPI Payments
TrsLabs - Leverage the Power of UPI Payments
Trs Labs
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
Foundations of Cybersecurity - Google Certificate
Foundations of Cybersecurity - Google CertificateFoundations of Cybersecurity - Google Certificate
Foundations of Cybersecurity - Google Certificate
VICTOR MAESTRE RAMIREZ
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
AI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdfAI You Can Trust: The Critical Role of Governance and Quality.pdf
AI You Can Trust: The Critical Role of Governance and Quality.pdf
Precisely
 
Ad

Parallel Programming: Beyond the Critical Section

  • 23. Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unlock A Lock A x++ Unlock A
  • 24. Race Condition Solution B x=0 Thread A Thread B Lock A x++ Unlock A Lock A x++ Unlock A
  • 25. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
  • 26. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
  • 27. Race Condition Solution B x=0 Thread A Thread B x=1 Lock A x++ Unlock A Lock A x++ Unlock A
  • 28. Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unlock A Lock A x++ Unlock A
  • 29. Race Condition Solution B x=0 Thread A Thread B x=1 x=2 Lock A x++ Unlock A Lock A x++ Unlock A
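Both fixes for the lost-update race can be sketched in modern C++ (a minimal sketch; the function names are mine, and the slides' Lock()/Unlock() is mapped to std::mutex):

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Solution A: an atomic increment. The hardware guarantees the
// read-modify-write is indivisible, so no update is lost.
int increment_with_atomics(int threads, int iterations) {
    std::atomic<int> x{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i)
                x.fetch_add(1);               // AtomicInc(x)
        });
    for (auto& th : pool) th.join();
    return x.load();
}

// Solution B: a lock serialises the x++ so only one thread at a
// time executes the protected region.
int increment_with_lock(int threads, int iterations) {
    int x = 0;
    std::mutex lock;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<std::mutex> guard(lock); // Lock A
                ++x;                                     // x++
            }                                            // Unlock A
        });
    for (auto& th : pool) th.join();
    return x;
}
```

The atomic version is usually cheaper: it never sleeps the thread and contends only on the one cache line holding x.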
  • 30. The Problems Race conditions Deadlocks
  • 31. Deadlock “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.” — Kansas Legislature Deadlock can occur when two or more processes each require a resource held by another.
  • 32. Deadlock Thread 1 Thread 2 Generally can be considered to be a logic error Can be painfully subtle and rare. Lock A Lock B Lock B Lock A Unlock A Unlock B
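One standard escape from the A/B lock-order trap on the slide is to acquire both locks in a single deadlock-free step. A minimal sketch (function names are mine) using C++17's std::scoped_lock, which internally orders the acquisitions so the argument order no longer matters:

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex A, B;
int shared_value = 0;

// The deadlock-prone version would be: thread 1 takes A then B while
// thread 2 takes B then A; if each grabs its first lock before the
// other's second, both block forever. std::scoped_lock avoids this
// by locking both mutexes together with a deadlock-avoidance scheme.
void safe_transfer_1() {
    std::scoped_lock both(A, B);
    ++shared_value;
}

void safe_transfer_2() {
    std::scoped_lock both(B, A);   // reversed order is now safe
    ++shared_value;
}

int run_transfers() {
    std::thread t1(safe_transfer_1), t2(safe_transfer_2);
    t1.join();
    t2.join();
    return shared_value;
}
```

The simpler, equally valid discipline: always acquire locks in one global order everywhere in the codebase.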
  • 33. The Problems Race conditions Deadlocks Read/write tearing
  • 34. Read/write tearing More than one thread writing to the same memory at the same time. The more data, the more likely it is to occur. Solve with synchronisation primitives. “AAAAAAAA” “BBBBBBBB” “AAAABBBB”
  • 35. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion
  • 36. Priority Inversion Consider threads with different priorities Low priority thread holds a shared resource High priority thread tries to acquire that resource High priority thread is blocked by the low priority thread Medium priority threads will then execute at the expense of both the low and the high priority threads.
  • 37. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem
  • 38. The ABA problem Thread 1 reads ‘A’ from memory. Thread 2 modifies memory value ‘A’ to ‘B’ and back to ‘A’ again. Thread 1 resumes and assumes nothing has changed (using CAS) Often associated with dynamically allocated memory
  • 39. Consider a list and a thread pool… head a c b … ..
  • 40. Thread A about to CAS head from a to b head a c b … .. CAS(&head->next,a,b);
  • 41. Thread B: dequeues ‘a’ & ‘b’ head c … .. a b ‘a’ & ‘b’ are released into thread local pools
  • 42. Thread B enqueues ‘a’ - reused head a c … .. b ‘a’ is added back
  • 43. Thread A executes CAS head a c … .. b CAS(&head->next,a,b);
  • 44. Thread A executes CAS successfully! head a c … .. b CAS(&head->next,a,b);
  • 45. ABA Solution Tag each pointer with a count Each time you use the ptr, inc the tag Must do it atomically
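The tag-and-count fix can be sketched by packing a value and a version tag into one word and CASing them together (a minimal sketch with names of my choosing; real lock-free code would pack a pointer the same way). Even if the value goes A to B and back to A, the tag has advanced, so a stale CAS fails:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Pack a 32-bit value and a 32-bit tag into one 64-bit atomic word.
struct TaggedValue {
    std::atomic<uint64_t> word{0};

    static uint64_t pack(uint32_t value, uint32_t tag) {
        return (uint64_t(tag) << 32) | value;
    }
    uint32_t value() const { return uint32_t(word.load() & 0xFFFFFFFFu); }
    uint32_t tag()   const { return uint32_t(word.load() >> 32); }

    // Succeeds only if both the value AND the tag still match;
    // on success the tag is bumped, invalidating stale observers.
    bool compare_and_set(uint32_t expectValue, uint32_t expectTag,
                         uint32_t newValue) {
        uint64_t expected = pack(expectValue, expectTag);
        uint64_t desired  = pack(newValue, expectTag + 1);
        return word.compare_exchange_strong(expected, desired);
    }
};
```

A thread that read (value, tag) before another thread cycled the value A-B-A will fail its CAS, because the tag no longer matches, which is exactly the detection the slide calls for.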
  • 46. The Problems Race conditions Deadlocks Read/write tearing Priority Inversion The ABA Problem Thread scheduling problems
  • 47. Convoy/Stampede Convoy Multiple threads restricted by a bottleneck. Stampede Multiple threads being started at once.
  • 48. Higher Level Locking Primitives SpinLock Mutex Barrier RWlock Semaphore
  • 49. SpinLock Loop until a value is set. No OS overhead with thread management Doesn’t sleep thread Handy if you will never wait for long. Very bad if you need to wait for a long time Can embed sleep() or Yield() But these can be perilous
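The while(CAS(&lock,0,1)) primitive from the earlier slide becomes, in C++ (a minimal sketch; the helper function is mine), a spin on std::atomic_flag — test_and_set atomically sets the flag and returns its previous value, so the loop spins until the flag was clear:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    // Spin until we observe the flag clear; no OS involvement,
    // no sleeping the thread — exactly the trade-off the slide notes.
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

int count_with_spinlock(int threads, int iterations) {
    SpinLock lock;
    int x = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++x;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return x;
}
```

Because waiters burn CPU while spinning, this only pays off when the critical section is a handful of instructions, as the slide warns.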
  • 50. Mutex Mutual Exclusion A simple lock/unlock primitive Otherwise known as a CriticalSection Used to serialise access to code. Often overused. More than just a spinlock can release thread Be aware of overhead
  • 51. Barrier Will block until ‘n’ threads signal it Useful for ensuring that all threads have finished a particular task.
  • 52. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff
  • 53. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Calculating
  • 54. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Done
  • 55. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(3) Use results Do stuff Signal
  • 56. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Do other stuff
  • 57. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Calculating
  • 58. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Done Calculating
  • 59. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(2) Use results Do stuff Signal Done
  • 60. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(1) Use results Do stuff More code Signal
  • 61. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff Calc pi
  • 62. Barrier example Thread 1 Thread 2 Thread 3 Thread 4 Barrier(0) Use results Do stuff
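The Barrier(n) walkthrough above can be sketched with a counting barrier (a minimal sketch, names mine; C++20 offers std::barrier and std::latch, but a mutex-plus-condition-variable version shows the mechanics): workers signal as they finish, and a waiter blocks until n signals have arrived.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class Barrier {
    std::mutex m;
    std::condition_variable cv;
    int remaining;
public:
    explicit Barrier(int n) : remaining(n) {}
    void signal() {                       // a worker has finished
        std::lock_guard<std::mutex> g(m);
        if (--remaining == 0) cv.notify_all();
    }
    void wait() {                         // block until all have signalled
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return remaining == 0; });
    }
};

int run_workers(int n) {
    Barrier done(n);
    std::vector<int> results(n, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([&, i] {
            results[i] = i * i;           // "Do stuff"
            done.signal();
        });
    done.wait();                          // "Use results" only after all signal
    int sum = 0;
    for (int r : results) sum += r;
    for (auto& th : workers) th.join();
    return sum;
}
```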
  • 63. RWLock Allows many readers But exclusive writing Writing blocks writers and readers. Writing waits until all readers have finished.
  • 64. Semaphore Generalisation of mutex Allows ‘c’ threads access to critical code at once. Basically an atomic integer Wait() will block if value == 0; then dec & cont. Signal() increments value (allows a waiting thread to unblock) Conceptually, Mutexes stop other threads from running code Semaphores tell other threads to run code
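The Wait()/Signal() semantics on this slide sketch out as follows (a minimal sketch; the Value() helper is mine, added purely for illustration):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// A counting semaphore: Wait() blocks while the value is 0, then
// decrements and continues; Signal() increments the value, possibly
// unblocking a waiter. With an initial count of 1 it behaves as a mutex.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int value;
public:
    explicit Semaphore(int c) : value(c) {}
    void Wait() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return value > 0; });
        --value;
    }
    void Signal() {
        std::lock_guard<std::mutex> g(m);
        ++value;
        cv.notify_one();
    }
    int Value() {           // illustration helper, not part of the slide's API
        std::lock_guard<std::mutex> g(m);
        return value;
    }
};
```

Initialised with c, it admits c threads into the critical code at once, matching the slide's generalisation of a mutex.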
  • 65. Parallel Patterns Why patterns? A set of templates to aid design A common language Aids education Provides a familiar base to start implementation
  • 66. So, how do we start? Analyse your problem Identify tasks that can execute concurrently Identify data local to each task Identify task order/schedule Analyse dependencies between tasks. Consider the HW you are running on
  • 67. Problem Decomposition Problem From “Patterns for Parallel Programming”
  • 68. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow From “Patterns for Parallel Programming”
  • 69. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive From “Patterns for Parallel Programming”
  • 70. Problem Decomposition Problem Organise By Tasks Organise By Data Decomposition Organise By Data Flow Linear Recursive Linear Recursive Linear Recursive Task Parallelism Divide and Conquer Geometric Decomposition Recursive Data Pipeline Event-Based Coordination From “Patterns for Parallel Programming”
  • 71. Task Parallelism Task dominant, linear Functionally driven problem Many tasks that may depend on each other Try to minimise dependencies Key elements: Tasks Dependencies Schedule
  • 72. Divide and Conquer Task Dominant, recursive Problem solved by splitting it into smaller sub-problems and solving them independently. Generally its easy to take a sequential Divide and Conquer implementation and parallelise it.
  • 73. Geometric Decomposition Data dominant, linear Decompose the data into chunks Solve for chunks independently. Beware of edge dependencies.
  • 74. Recursive Data Pattern Data dominant, recursive Operations on trees, lists, graphs Dependencies can often prohibit parallelism Often requires tricky recasting of problem ie operate on all tree elements in parallel More work, but distributed across more cores
  • 75. Pipeline Pattern Data flow dominant, linear Sets of data flowing through a sequence of stages Each stage is independent Easy to understand - simple, dedicated code
  • 76. Event-Based Coordination Data flow, recursive Groups of semi-independent tasks interacting in an irregular fashion. Tasks sending events to other tasks which send tasks… Can be highly complex Tricky to load balance
  • 77. Supporting Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 78. Program Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 79. SPMD Single Program, Multiple Data Single source code image running on multiple threads Very common Easy to maintain Easy to understand
  • 80. Master/Worker Dominant force is the need to dynamically load balance Tasks are highly variable ie duration/cost Program structure doesn’t map onto loops Cores vary in performance. “ Bag of Tasks” Master sets up tasks and waits for completion Workers grab task from queue, execute and then grab the next one.
  • 81. Loop Parallelism Dominated by computationally expensive loops Split iterations of the loop out to threads Be careful of memory use and process granularity
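A minimal loop-parallelism sketch (names mine): split the iteration range into contiguous chunks, one per thread. Coarse chunks keep granularity high and avoid threads writing to adjacent elements that share a cache line.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

void parallel_for(std::vector<int>& data, int threads) {
    std::vector<std::thread> pool;
    size_t chunk = (data.size() + threads - 1) / threads;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end   = std::min(begin + chunk, data.size());
            for (size_t i = begin; i < end; ++i)
                data[i] *= 2;   // stand-in for the expensive loop body
        });
    for (auto& th : pool) th.join();
}
```

No locking is needed because each thread owns a disjoint slice of the array.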
  • 82. Fork/Join The number of concurrent tasks varies over the life of the execution. Complex or recursive relations between tasks Either Direct task/core mapping Thread pool
  • 83. Supporting Data Structures SPMD Master/Worker Loop Parallelism Fork/Join Program Structures Data Structures Shared Data Distributed Array Shared Queue
  • 84. Shared Data Required when At least one data structure is accessed by multiple tasks At least one task modifies the shared data The tasks potentially need to use the modified value. Solutions Serialise execution – mutual exclusion Noninterfering sets of operations RWlocks
  • 85. Distributed Array How can we distribute an array across many threads? Used in Geometric Decomposition Break array into thread specific parts Maximise locality per thread Be wary of cache line overlap Keep data distribution coarse
  • 86. Shared Queue Extremely valuable construct Fundamental part of Master/Worker (“Bag of Tasks”) Must be consistent and work with many competing threads. Must be as efficient as possible Preferably lock free
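A starting-point sketch of a shared queue for the Master/Worker "Bag of Tasks" (names mine): not lock free, as the slide prefers, but consistent under many competing threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class SharedQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<T> q;
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> g(m);
            q.push(std::move(item));
        }                                  // release the lock before waking a worker
        cv.notify_one();
    }
    T pop() {                              // blocks until an item is available
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        T item = std::move(q.front());
        q.pop();
        return item;
    }
};
```

Workers loop on pop() grabbing tasks; the master push()es work and load balancing falls out naturally, since fast workers simply pop more often.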
  • 87. Lock free programming Locks Simple, easy to use and implement But serialise code execution Lock Free Tricky to implement and debug
  • 88. Lock Free linked list Lock free linked list (ordered) Easily generalised to other container classes Stacks Queues Relatively simple to understand
  • 89. Adding a node to a list head a c tail b
  • 90. Adding a node: Step 1 head a c tail b Find where to insert
  • 91. Adding a node: Step 2 head a c tail b newNode->Next = prev->Next;
  • 92. Adding a node: Step 3 head a c tail b prev->Next = newNode;
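The three steps above, for a single-threaded ordered list (a minimal sketch; the sentinel-head convention and names are mine): find the insertion point, point the new node at prev's successor, then point prev at the new node.

```cpp
#include <cassert>

struct Node {
    int key;
    Node* next;
};

// head is a sentinel node that is never removed.
void insert(Node* head, Node* newNode) {
    Node* prev = head;
    while (prev->next && prev->next->key < newNode->key)
        prev = prev->next;              // Step 1: find where to insert
    newNode->next = prev->next;         // Step 2: newNode->Next = prev->Next
    prev->next = newNode;               // Step 3: prev->Next = newNode
}
```

Note the step order matters even single-threaded: swinging prev->next first would lose the rest of the list.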
  • 93. Extending to multiple threads What could go wrong?
  • 94. Add ‘b’ and ‘c’ concurrently head a d tail b c Find where to insert
  • 95. Add ‘b’ and ‘c’ concurrently head a d tail b c newNode->Next = prev->Next;
  • 96. Add ‘b’ and ‘c’ concurrently head a d tail b c prev->Next = newNode;
  • 97. Add ‘b’ and ‘c’ concurrently head a d tail b c
  • 98. Extending to multiple threads What could go wrong? Add another node between ‘a’ & ‘c’ ‘a’ or ‘c’ could be deleted A concurrent read could reach a dangling pointer. Any combination of the above If anything can go wrong, it will. So, how do we make it thread safe? Let’s examine some solutions
  • 99. Coarse Grained Locking Lock the list for each add or remove Also lock for reads (find, iterators) Will effectively serialise the list Only one thread at a time can access the list. Correctness at the expense of performance .
  • 100. A concrete example 10 producers Add 500 random numbers in a tight loop 10 consumers Remove the 500 numbers in a tight loop Each in its own thread 21 threads Running on PS3 using SNTuner to profile
  • 101. Coarse Grain head a c tail b
  • 102. Step 1: Lock list b head a c tail
  • 103. Step 2 & 3:Find then Insert b head a c tail
  • 104. Step 4:Unlock head a c tail b
  • 105. Coarse Grained locking Wide green bars are active locks Little blips are adds or removes Execution took 416ms (profiling significantly impacts performance)
  • 106. Fine Grained Locking Add and Remove only affects neighbours Give each Node a lock (So, creating a node creates a mutex) Lock only neighbours when adding or removing. When iterating along the list you must lock/unlock as you go.
  • 107. Fine Grained Locking head a c tail b
  • 108. Fine Grained Locking a c tail b head
  • 109. Fine Grained Locking c tail b head a
  • 110. Fine Grained Locking head tail b a c
  • 111. Fine Grained Locking head tail b a c
  • 112. Fine Grained Locking head a c tail b
  • 113. Fine Grained Locking Blocking is much longer – due to overhead in creating a mutex Very slow > 1200ms Better solution would have been to have a pool of mutexes that could be used
  • 114. Optimistic Locking Search without locking Lock nodes once found, then validate them Valid if you can navigate to it from head. If invalid, search from head again.
  • 115. Optimistic: Add(“g”) head a c d tail f k m g
  • 116. Step 1: Search head a c d tail f k m g
  • 117. Step 1: Search head a c d tail f k m g
  • 118. Step 1: Search head a c d tail f k m g
  • 119. Step 1: Search head a c d tail f k m g
  • 120. Step 1: Search head a c d tail f k m g
  • 121. Step 1: Search head a c d tail f k m g
  • 122. Step 2: Lock head a c d tail m g f k
  • 123. Step 3: Validate head a c d tail m g f k
  • 124. Step 3: Validate head a c d tail m g f k
  • 125. Step 3: Validate - FAIL head a tail m g d f k
  • 126. Step 3a: Validate (retry) head a e tail m g d f k
  • 127. Step 3a: Validate (retry) head a e tail m g d f k
  • 128. Step 3a: Validate (retry) head a e tail m g d f k
  • 129. Step 3a: Validate (retry) head a e tail m g d f k
  • 130. Step 3a: Validate (success) head a e tail m g d f k
  • 131. Step 4: Add head a e tail m g d f k
  • 132. Step 5: Unlock head a e tail f k m g d
  • 133. Optimistic Caveat We can’t delete nodes immediately Another thread could be reading it Can’t rely on memory not being changed. Use deferred garbage collection Delete in a ‘safe’ part of a frame. Or use invasive lists (supply own nodes) Find() requires validation (Locks).
  • 134. Delete Caveat: Validate head a e tail m g d f k
  • 135. Delete Caveat: Validate head a e tail m g d f k
  • 136. Delete Caveat: delete ‘d’ head a e tail m g f k d
  • 137. Delete Caveat: Validate head a e tail m g f k d
  • 138. Delete Caveat: Validate head a e tail m g f k d
  • 139. Delete Caveat: Valid! head a e tail m g f k d
  • 140. Optimistic Synchronisation ~540ms Most time was spent validating Plus there was overhead in creating a mutex per node for the lock. Again, a pool of mutexes would benefit.
  • 141. Lazy Synchronization Attempt to speed up Optimistic Validation Store a deleted flag in each node Find() is then lock free Just check the deleted flag on success.
  • 142. Lazy: Add(“g”) head a c d tail f k m g
  • 143. Step 1: Search head a c d tail f k m g
  • 144. Step 1: Search head a c d tail f k m g
  • 145. Step 1: Search head a c d tail f k m g
  • 146. Step 1a: Search (delete c) head a c d tail f k m g
  • 147. Step 1a: Search (delete c) head a c d tail f k m g
  • 148. Step 1a: Search (delete c) head a c d tail f k m g
  • 149. Step 1a: Search (delete c) head a c d tail f k m g
  • 150. Step 1b: Search (lock) head d tail f k m g a c
  • 151. Step 1c: Search (mark) head d tail f k m g a c
  • 152. Step 2d: lock (skip/unlock) head a c d tail m g f k
  • 153. Step 3: Add/Validate head a d tail m g c f k
  • 154. Step 4: Unlock head a d tail f k m g c
  • 155. Lazy Synchronisation Still need to keep the deleted nodes. Faster than Optimistic Still serialises.
  • 157. Lock free (Non-Blocking) Can’t we just modify Lazy Sync to use CAS?
  • 158. Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next; | prev->next=b;
  • 159. Delete ‘a’ and add ‘b’ concurrently head a c tail b prev->next=curr->next; | prev->next=b;
  • 160. Delete ‘a’ and add ‘b’ concurrently head a c tail b head->next=a->next; | prev->next=b;
  • 161. Delete ‘a’ and add ‘b’ concurrently head a c tail b Effectively deletes ‘a’ and ‘b’.
  • 162. Introducing the AtomicMarkedPtr<> Wrapper on uint32 Encapsulates an atomic pointer and a flag Allows testing of a flag and updating of a pointer atomically. Use LSB for the flag AtomicMarkedPtr<Node> next; next->CompareAndSet(eValue, nValue,eFlag, nFlag);
  • 163. AtomicMarkedPtr<> We can now use CAS to set a pointer and check a flag in a single atomic action. ie. check deleted status and change pointer at same time. class Node { public: Node(); AtomicMarkedPtr<Node> m_Next; T m_Data; int32 m_Key; };
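The slide assumes an AtomicMarkedPtr<> implementation; a minimal sketch of the idea (method names follow the slide, internals are mine) steals the pointer's least significant bit, which is free whenever nodes have at least 2-byte alignment, so one CAS tests the deleted flag and swings the pointer atomically:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

template <typename T>
class AtomicMarkedPtr {
    std::atomic<uintptr_t> word{0};

    static uintptr_t pack(T* p, bool mark) {
        // LSB holds the mark; valid because T* is at least 2-byte aligned.
        return reinterpret_cast<uintptr_t>(p) | (mark ? 1u : 0u);
    }
public:
    void Set(T* p, bool mark) { word.store(pack(p, mark)); }
    T*   Ptr()  const { return reinterpret_cast<T*>(word.load() & ~uintptr_t(1)); }
    bool Mark() const { return (word.load() & 1) != 0; }

    // Succeeds only if the pointer AND the flag both still match —
    // one atomic action, as the slide requires.
    bool CompareAndSet(T* ePtr, T* nPtr, bool eMark, bool nMark) {
        uintptr_t expected = pack(ePtr, eMark);
        return word.compare_exchange_strong(expected, pack(nPtr, nMark));
    }
};
```

This is why marking a node as deleted (Step 2 of the lock-free remove) defeats concurrent inserters: their CAS carries the unmarked expectation and fails once the mark is set.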
  • 164. Lock Free: Remove ‘d’ head a c d tail f k m Start loop:
  • 165. Step 1: Find ‘d’ head a c tail f k m pred curr succ d if(!InternalFind(‘d’)) continue;
  • 166. Step 2: Mark ‘d’ head a c tail f k m pred curr succ d if(!curr->next->CAS(succ,succ,false,true)) continue;
  • 167. Step 3: Skip ‘d’ head a c tail f k m pred curr succ d pred->next->CAS(curr,succ,false,false);
  • 168. LockFree: InternalFind() Finds pred and curr Skips marked nodes. Consider the list at Step 2 in the previous example, and let’s introduce a second thread calling InternalFind();
  • 169. Second InternalFind() head a c tail f k m pred curr succ pred curr succ d
  • 170. If succ is marked… head a c tail f k m pred curr succ pred curr succ d
  • 171. … Skip it head a c tail f k m pred curr succ pred curr succ d
  • 172. Lock Free Synchronisation No blocking at all List is always in a consistent state. Faster threads help out slower ones.
  • 173. Lock free Full thread usage ~60ms High thread coherency
  • 175. Real world considerations Cost of locking Context switching Memory coherency/latency Size/granularity of tasks
  • 176. Advice Build a set of lock free containers Design around data flow Minimise locking You can have more than ‘n’ threads on an ‘n’ core machine Profile, profile, profile.
  • 177. References Patterns for Parallel Programming – T. Mattson et al. The Art of Multiprocessor Programming – M. Herlihy and N. Shavit http://www.top500.org/ Flow Based Programming - http://www.jpaulmorrison.com/fbp/index.shtml http://www.valvesoftware.com/publications/2007/GDC2007_SourceMulticore.pdf http://www.netrino.com/node/202 http://blogs.intel.com/research/2007/08/what_makes_parallel_programmin.php The Little Book of Semaphores - http://www.greenteapress.com/semaphores/ My Blog: 7DOF - http://seven-degrees-of-freedom.blogspot.com/