Transactional Memory
Yuuki Takano
ytakanoster@gmail.com
Why Transactional Memory?
• locks are difficult to manage
• deadlock
• starvation
• priority inversion
• lock convoy
• transactional memory mitigates these problems
Deadlock
[Figure: timeline of Thread 1 and Thread 2. One thread holds Lock A and the other holds Lock B; each then tries to acquire the lock the other holds (A and B, respectively) and fails.]
Starvation
[Figure: timeline of two high-priority threads, one acquiring Lock A and one acquiring Lock B, and a low-priority thread that needs both A and B. The low-priority thread tries to acquire A and fails; later it locks A, tries to acquire B and fails, and releases A.]
Priority Inversion
[Figure: timeline of a high-priority thread and a low-priority thread. The low-priority thread is holding the lock; the high-priority thread tries to acquire it and fails.]
Lock Convoy
[Figure: a scheduler and threads Thread1, Thread2, Thread3, ..., ThreadN. Labels: 1. contention; 2. event; 3. contention (spin lock); 4. reschedule; 4. acquire (Thread2).]
high overhead when there are many threads
Complexity of Multithread Programming
[Figure: in the ideal world, algorithm + data structure gives simple source code, which is what we actually want. In the real world, algorithm + data structure + parallelism gives a parallel algorithm and a parallel data structure, and the source code becomes complicated, buggy, and difficult to maintain.]
Lock and Transactional Memory
• Lock
• execute the critical section exclusively
• only one thread enters the critical section at a time
• Transactional Memory
• execute the critical section speculatively
• multiple threads enter the same critical section simultaneously
• conflicts are detected both while executing the critical section and at the end of the critical section
Spin-lock by Atomic Operation
• CAS (compare-and-swap)
• the compare and the swap are performed atomically
• other atomic operations: test-and-set, fetch-and-add, etc.
• a spin-lock can be built from CAS or test-and-set
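volatile int locked = 0; // 0: free, 1: held

void lock_spin(void) {
    // __sync_lock_test_and_set atomically stores 1 and returns the old value,
    // so the loop exits only when locked was 0 (if locked is 0, set 1)
    while (__sync_lock_test_and_set(&locked, 1)) {
        while (locked) ; // busy-wait until the lock looks free
    }
}

void unlock_spin(void) {
    __sync_lock_release(&locked); // atomically store 0
}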
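The slide names CAS but its code uses the GCC test-and-set builtin. As a minimal sketch of the CAS variant, assuming C11 `<stdatomic.h>` is available, the same lock could look like this. The names `cas_spinlock`, `cas_lock`, `cas_trylock` and `cas_unlock` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for this sketch: 0 = free, 1 = held. */
typedef struct { atomic_int locked; } cas_spinlock;

/* One CAS attempt: if locked is 0, set it to 1. Returns true on success. */
static bool cas_trylock(cas_spinlock *l) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void cas_lock(cas_spinlock *l) {
    while (!cas_trylock(l)) {
        /* Spin on a plain load so waiting threads do not hammer the
           cache line with CAS operations. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;
    }
}

static void cas_unlock(cas_spinlock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The inner read-only loop mirrors the `while (locked) ;` loop of the slide: the atomic operation is retried only once the lock looks free.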
Syntax of Transactional Memory
atomic, retry, orElse
atomic {
    // transaction
    if (q.size() == 0) {
        // roll back and retry:
        // the transaction is restarted when
        // its read-set is updated
        retry;
    }
    … // do something
} orElse {
    // alternative, executed when the block above calls retry
}
Software Transactional Memory
• TL2
• Dave Dice, Ori Shalev, and Nir Shavit, Transactional Locking II, 20th International Symposium on Distributed Computing, DISC 2006
• LSA
• Torvald Riegel, Pascal Felber, and Christof Fetzer, A Lazy Snapshot Algorithm with Eager Validation, 20th International Symposium on Distributed Computing, DISC 2006
• LogTM
• Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and David A. Wood, LogTM: Log-based Transactional Memory, HPCA 2006: 254-265
• DEUCE
• Guy Korland, Nir Shavit, and Pascal Felber, Noninvasive Java Concurrency with Deuce STM, MultiProg 2010
• etc.
Summary of TL2
• prepare a shared variable called the global version clock
• associate memory regions with version numbers
• update the version numbers when writing
• detect conflicts when reading and writing by comparing the transaction's snapshot of the global clock with the version numbers of the memory it accesses
• retry the transaction when a conflict is detected
• otherwise commit
TL2 - Variables
Global Memory
• global version clock
• variable 1, version number 1, write-lock 1
• variable 2, version number 2, write-lock 2
• variable 3, version number 3, write-lock 3
Thread Local Memory (thread 1)
• read-version number 1
• write-version number 1
• read-set 1
• write-set 1
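The variables above could be laid out in C roughly as follows. This is a sketch under assumptions, not the layout of any particular TL2 implementation: the type names `tvar`, `write_entry` and `tx`, the fixed capacity `SET_MAX`, and the choice to pack the version number and the write-lock bit into one word are all made up for illustration. It also shows the first step of a transaction, sampling the global version clock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SET_MAX 16   /* arbitrary capacity, chosen for this sketch */

/* Global memory: one version clock shared by all threads. */
static atomic_uint_fast64_t global_clock;

/* Global memory: a variable with its version number and write-lock.
   Assumption: both are packed as (version number << 1) | lock bit. */
typedef struct {
    atomic_intptr_t value;
    atomic_uint_fast64_t vlock;
} tvar;

typedef struct { tvar *addr; intptr_t value; } write_entry;

/* Thread-local memory: one descriptor per running transaction. */
typedef struct {
    uint_fast64_t rv;                 /* read-version number  */
    uint_fast64_t wv;                 /* write-version number */
    tvar *read_set[SET_MAX];          /* addresses that were loaded */
    size_t n_reads;
    write_entry write_set[SET_MAX];   /* addresses and values to store */
    size_t n_writes;
} tx;

/* Start a transaction: copy the global version clock into the
   thread-local read-version number and empty both sets. */
static void tx_begin(tx *t) {
    t->rv = atomic_load(&global_clock);
    t->wv = 0;
    t->n_reads = 0;
    t->n_writes = 0;
}
```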
TL2 - Algorithm (1)
transaction {
    load var1;
    load var2;
    …
    store var3;
}
1. load the global version clock and store it in the thread-local read-version number.
TL2 - Algorithm (2)
2. run the transaction through a speculative execution
TL2 - Algorithm (3)
2.1. log the addresses that are read to the read-set
TL2 - Algorithm (4)
2.2. log the addresses and values that are written to the write-set
TL2 - Algorithm (5)
[Figure: the read-set of thread 1 holds pointers to variables 1, 2 and 3; the write-set holds a pointer to variable 3 and the value of variable 3, because variable 3 is both stored and loaded.]
Note that if a variable being read already appears in the write-set, the value must be taken from the write-set to avoid a read-after-write hazard.
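A minimal sketch of this rule, with version checks left out on purpose: stores go only to the write-set, and a load looks in the write-set before it looks at shared memory. The names `tx`, `write_entry`, `tx_store` and `tx_load`, and the capacity of 16 entries, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { intptr_t *addr; intptr_t value; } write_entry;

typedef struct {
    write_entry write_set[16];   /* arbitrary capacity for this sketch */
    size_t n_writes;
} tx;

/* Log the address and value in the write-set. Shared memory is not
   touched until commit. Returns false if the write-set is full. */
static bool tx_store(tx *t, intptr_t *addr, intptr_t value) {
    for (size_t i = 0; i < t->n_writes; i++) {
        if (t->write_set[i].addr == addr) {   /* overwrite earlier entry */
            t->write_set[i].value = value;
            return true;
        }
    }
    if (t->n_writes == 16)
        return false;
    t->write_set[t->n_writes++] = (write_entry){ addr, value };
    return true;
}

/* Read-after-write: prefer the value this transaction already wrote.
   Version validation is omitted in this sketch. */
static intptr_t tx_load(const tx *t, const intptr_t *addr) {
    for (size_t i = 0; i < t->n_writes; i++)
        if (t->write_set[i].addr == addr)
            return t->write_set[i].value;
    return *addr;
}
```

A linear scan is used only to keep the sketch short; a real implementation would likely use a faster lookup structure.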
TL2 - Algorithm (6)
2.3. check that the variable was not modified when loading it: make sure that its version number is less than or equal to (<=) the read-version number.
if modified, abort the transaction
TL2 - Algorithm (7)
2.4. check that the write-lock of the loaded variable is free
if locked, abort the transaction
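Steps 2.1, 2.3 and 2.4 can be sketched as a single transactional load. The code below is an illustration under assumptions: the version number and the write-lock bit are packed into one word, the lock word is sampled before and after the load (a detail the slides do not spell out), and the write-set lookup is left out. The names `tvar`, `tx` and `tx_load` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    atomic_intptr_t value;
    atomic_uint_fast64_t vlock;   /* (version number << 1) | lock bit */
} tvar;

typedef struct {
    uint_fast64_t rv;             /* read-version number */
    tvar *read_set[16];           /* arbitrary capacity for this sketch */
    size_t n_reads;
} tx;

/* Returns false when the transaction has to abort. */
static bool tx_load(tx *t, tvar *v, intptr_t *out) {
    uint_fast64_t before = atomic_load(&v->vlock);
    intptr_t val = atomic_load(&v->value);
    uint_fast64_t after = atomic_load(&v->vlock);

    if (after & 1)
        return false;             /* 2.4: the write-lock is held */
    if (before != after)
        return false;             /* changed while it was being loaded */
    if ((after >> 1) > t->rv)
        return false;             /* 2.3: version newer than rv */
    if (t->n_reads == 16)
        return false;             /* sketch only: treat overflow as abort */

    t->read_set[t->n_reads++] = v;   /* 2.1: log the address */
    *out = val;
    return true;
}
```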
TL2 - Algorithm (8)
3. acquire the write-locks of the variables in the write-set using a bounded spin lock
if a write-lock cannot be acquired, abort the transaction
TL2 - Algorithm (9)
4. increment the global version clock (CAS operation) and store the result in the write-version number.
TL2 - Algorithm (10)
5.1. validate the read-set: check that the variables were not modified after they were loaded by making sure that their version numbers are less than or equal to (<=) the read-version number.
if modified, abort the transaction
TL2 - Algorithm (11)
5.2. check that the write-locks of the variables in the read-set are free (not held by another thread)
if locked, abort the transaction
TL2 - Algorithm (12)
5.3. in the special case where read-version number + 1 = write-version number (rv + 1 = wv), it is not necessary to validate the read-set

TL2 - Algorithm (13)
6.1. commit the values of the write-set

TL2 - Algorithm (14)
6.2. update the version numbers with the write-version number
TL2 - Algorithm (15)
6.3. release the write-locks
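Steps 3 to 6 can be put together as one commit function. This is a sketch under stated assumptions rather than the TL2 reference code: the version number and the lock bit share one word, each address appears at most once in the write-set, the bounded spin of step 3 is reduced to a single attempt, and `atomic_fetch_add` stands in for the CAS-based increment of step 4. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    atomic_intptr_t value;
    atomic_uint_fast64_t vlock;   /* (version number << 1) | lock bit */
} tvar;
typedef struct { tvar *addr; intptr_t value; } write_entry;
typedef struct {
    uint_fast64_t rv, wv;
    tvar *read_set[16];        size_t n_reads;
    write_entry write_set[16]; size_t n_writes;
} tx;

static atomic_uint_fast64_t global_clock;

static bool in_write_set(const tx *t, const tvar *v) {
    for (size_t i = 0; i < t->n_writes; i++)
        if (t->write_set[i].addr == v) return true;
    return false;
}

/* Clear the lock bit of the first n write-set entries. */
static void release_locks(tx *t, size_t n) {
    for (size_t i = 0; i < n; i++)
        atomic_fetch_and(&t->write_set[i].addr->vlock, ~(uint_fast64_t)1);
}

/* Returns true on commit, false if the transaction must abort. */
static bool tx_commit(tx *t) {
    /* 3. acquire the write-locks (single attempt instead of bounded spin) */
    for (size_t i = 0; i < t->n_writes; i++) {
        tvar *v = t->write_set[i].addr;
        uint_fast64_t old = atomic_load(&v->vlock);
        if ((old & 1) ||
            !atomic_compare_exchange_strong(&v->vlock, &old, old | 1)) {
            release_locks(t, i);
            return false;
        }
    }
    /* 4. increment the global clock and keep the result as wv */
    t->wv = atomic_fetch_add(&global_clock, 1) + 1;

    /* 5. validate the read-set, unless rv + 1 == wv (5.3) */
    if (t->rv + 1 != t->wv) {
        for (size_t i = 0; i < t->n_reads; i++) {
            tvar *v = t->read_set[i];
            uint_fast64_t w = atomic_load(&v->vlock);
            bool locked_by_other = (w & 1) && !in_write_set(t, v);
            if (locked_by_other || (w >> 1) > t->rv) {
                release_locks(t, t->n_writes);
                return false;
            }
        }
    }
    /* 6.1 to 6.3: write values, set version to wv, release the lock */
    for (size_t i = 0; i < t->n_writes; i++) {
        tvar *v = t->write_set[i].addr;
        atomic_store(&v->value, t->write_set[i].value);
        atomic_store(&v->vlock, t->wv << 1);   /* lock bit cleared */
    }
    return true;
}
```

Storing `wv << 1` in the last loop performs steps 6.2 and 6.3 with one store, which is possible only because of the packed lock word assumed here.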
Hardware Transactional Memory
• use the CPU cache to detect conflicts
• modify the cache coherence protocol to achieve transactional memory
Cache Coherence
• MESI protocol
• There are 4 states
• Modified, Exclusive, Shared, Invalid
MESI Modified State
• the cache line is dirty and must be written back to main memory
• not shared with other CPUs
MESI Exclusive State
• the cache line is not modified
• not shared with other CPUs
MESI Shared State
• the cache line is not modified
• shared with other CPUs
MESI Invalid State
• the cache line holds no meaningful data
MESI Exclusive Load
1. request an exclusive load
2. the other cache writes the line back if it is modified
3. the other cache changes the state of its line to invalid
4. load the line in the exclusive state
MESI Shared Load
1. request a shared load
2. the other cache writes the line back if it is modified
3. the other cache changes the state of its line to shared
4. load the line in the shared state
MESI Eviction
1. write back if modified
2. discard
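The state changes listed above can be sketched as a tiny software model of two caches. This is only an illustration: a single `int` stands in for a whole cache line, bus messages are replaced by direct function calls, and the names `cache_line`, `snoop_exclusive_load`, `snoop_shared_load`, `fill` and `evict` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

typedef struct {
    mesi_state state;
    int data;            /* one int stands in for a whole cache line */
} cache_line;

/* Snoop side: another CPU requested an exclusive load of this line. */
static void snoop_exclusive_load(cache_line *line, int *main_memory) {
    if (line->state == MODIFIED)
        *main_memory = line->data;   /* 2. write back if modified */
    line->state = INVALID;           /* 3. change state to invalid */
}

/* Snoop side: another CPU requested a shared load of this line. */
static void snoop_shared_load(cache_line *line, int *main_memory) {
    if (line->state == INVALID)
        return;                      /* nothing cached, nothing to do */
    if (line->state == MODIFIED)
        *main_memory = line->data;   /* 2. write back if modified */
    line->state = SHARED;            /* 3. change state to shared */
}

/* Requester side: 4. load the line in the requested state. */
static void fill(cache_line *line, const int *main_memory, bool exclusive) {
    line->data = *main_memory;
    line->state = exclusive ? EXCLUSIVE : SHARED;
}

/* Eviction: write back if modified, then discard. */
static void evict(cache_line *line, int *main_memory) {
    if (line->state == MODIFIED)
        *main_memory = line->data;
    line->state = INVALID;
}
```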
Transactional Cache Coherence (1)
• prepare a transactional bit in each cache line
• 0: not in a transaction
• 1: in a transaction
Transactional Cache Coherence (2)
• for a transactional line in the shared or exclusive state: abort the transaction if the MESI protocol invalidates the line
Transactional Cache Coherence (3)
• for a transactional line in the modified state: discard the modified value and abort the transaction if the MESI protocol invalidates or evicts the line
Transactional Cache Coherence (4)
• abort the transaction if the MESI protocol evicts a transactional line, because the cache coherence protocol cannot detect conflicts on an evicted line
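The abort rules above can be sketched by adding the transactional bit to a modeled cache line. This is an assumption-laden illustration, not a description of any real CPU: one function handles both invalidation and eviction, a single `int` stands in for the line, and the names `tx_cache_line` and `drop_line` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

typedef struct {
    mesi_state state;
    bool tx_bit;         /* 0: not in transaction, 1: in transaction */
    int data;            /* one int stands in for a whole cache line */
} tx_cache_line;

/* Called when the coherence protocol invalidates or evicts a line.
   Returns true if the running transaction must abort. */
static bool drop_line(tx_cache_line *line, int *main_memory) {
    bool must_abort = line->tx_bit;

    /* An ordinary dirty line is written back as usual. A transactional
       modified line holds a speculative value, so it is discarded. */
    if (line->state == MODIFIED && !line->tx_bit)
        *main_memory = line->data;

    line->state = INVALID;
    line->tx_bit = false;
    return must_abort;
}
```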
Problems
Problem (1)
• infinite loop in a transaction
• checking variable versions inside loops to detect this would reduce performance significantly
• requirement of closed memory management
• in languages like C and C++, code outside a transaction can refer to and update variables used in a transaction
• the compiler or the runtime environment should take care of this
Problem (2)
atomic {
    …
    launchMissile();
    …
}
Missiles may be launched many times!
I/O in a transaction must cause an abort
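The effect can be reproduced without a real transactional memory. In the sketch below, `run_atomic` is a hypothetical stand-in for a runtime that re-executes the transaction body until it commits; the number of forced aborts is passed in so the behavior is deterministic. Memory writes could be rolled back on each abort, but the side effect of `launch_missile` cannot.

```c
#include <assert.h>

static int missiles_launched;   /* counts the irreversible side effect */

static void launch_missile(void) {
    missiles_launched++;
}

/* Hypothetical stand-in for a transactional runtime: the body is
   executed again after every abort. The first `conflicts` attempts are
   forced to abort. Returns the number of attempts that were needed. */
static int run_atomic(void (*body)(void), int conflicts) {
    int attempts = 0;
    for (;;) {
        body();
        attempts++;
        if (attempts > conflicts)
            return attempts;    /* commit succeeded */
        /* abort: buffered memory writes are dropped here, but I/O that
           already happened cannot be undone */
    }
}

static void transaction_body(void) {
    launch_missile();
}
```

With two forced conflicts the body runs three times, so three missiles are launched for one logical transaction.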
Problem (3)
• livelock
Implementation
Software Transactional Memory (STM) in Haskell
• Haskell provides STM in its concurrency library (the Control.Concurrent.STM module)
• the STM monad is provided to achieve STM
• example implementation
• https://gist.github.com/ytakano/228b68ef099c7bdd2f2c
Hardware Transactional Memory (HTM): Intel TSX
• HTM is available from Haswell
• Intel TSX HLE
• xacquire and xrelease instruction prefixes
• Intel TSX RTM
• xbegin and xend instructions
Intel TSX RTM
xbegin ABORT
. . .
xend
ABORT:
// fallback
if the transaction keeps aborting, execution must go to fallback code (such as a spin lock)
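Lock by using tsx-tools
https://github.com/andikleen/tsx-tools
#include <immintrin.h> // _xbegin, _xabort, _xend, _mm_pause (compile with -mrtm)
#define RTM_MAX_RETRY 8 // assumed value; not given on the slide

volatile int lock = 0;

void rtm_lock(void) {
    for (int i = 0; i < RTM_MAX_RETRY; i++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (!lock)
                return; // successfully started the transaction
            _xabort(0xff); // the fallback lock is held: abort explicitly
        }
        // aborted by the _xabort above, and not inside a nested transaction
        if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff &&
            !(status & _XABORT_NESTED)) {
            while (lock) _mm_pause(); // busy-wait
        } else if (!(status & _XABORT_RETRY)) {
            break; // the hardware reports that a retry is unlikely to succeed
        }
    }
    while (__sync_lock_test_and_set(&lock, 1)) { // fall back to a spin-lock
        while (lock) _mm_pause(); // busy-wait
    }
}
lock by using Intel TSX RTM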
Unlock by using tsx-tools
https://github.com/andikleen/tsx-tools
void rtm_unlock(void) {
    if (lock) {
        __sync_lock_release(&lock); // the spin-lock fallback was taken
    } else {
        _xend(); // commit the transaction
    }
}
unlock by using Intel TSX RTM
Performance of Intel TSX
• Intel says that code using a coarse-grained lock with TSX can be comparable to code using fine-grained locks
• easy to write code that scales with the number of cores
Applying Intel® TSX: Fine Grain Behavior at Coarse Grain Effort
[Figure: two plots of scaling versus threads, one for an application with a coarse grain lock and one for the application re-written with finer grain locks. Curves: Coarse Grain Lock, Coarse Grain Lock + Intel® TSX, Fine Grain Locks, Fine Grain Locks + Intel® TSX. Caption: an example of secondary benefits of Intel® TSX.]
from Intel Developer Forum 2012
Intel® TSX Can Enable Simpler Scalable Algorithms
Lock-Free Algorithm
• Don’t use critical section locks
• Developer manages concurrency
• Very difficult to get correct & optimize
– Constrain data structure selection
– Highly contended atomic operations
Lock-Based + Intel® TSX
• Use critical section locks for ease
• Let hardware extract concurrency
• Enables algorithm simplification
– Flexible data structure selection
– Equivalent data structure lock-free algorithm very hard to verify
[Figure: "Real World Example". Two plots of Ops/sec versus threads, one for a state of the art lock-free algorithm and one for a TSX lock based algorithm.]
from Intel Developer Forum 2012
EOF

More Related Content

What's hot (20)

PPTX
Transactional Memory
Smruti Sarangi
 
PPTX
Semaphore
Arafat Hossan
 
PPT
Virtual memory
Muhammad Farooq
 
PPTX
Block cipher modes of operation
harshit chavda
 
PPTX
Semaphore
LakshmiSamivel
 
PPTX
Kernels and its types
ARAVIND18MCS1004
 
PPTX
Process synchronization in Operating Systems
Ritu Ranjan Shrivastwa
 
PDF
Operating System-Ch4.processes
Syaiful Ahdan
 
PPTX
Memory Management
lavanya marichamy
 
PPTX
Critical section problem in operating system.
MOHIT DADU
 
PPTX
Linux I2C
KaidenYu
 
PPTX
greedy algorithm Fractional Knapsack
Md. Musfiqur Rahman Foysal
 
PPT
Memory management
Vishal Singh
 
PDF
Semaphores
Mohd Arif
 
PDF
Address/Thread/Memory Sanitizer
Platonov Sergey
 
PPTX
Bottom half in linux kernel
KrishnaPrasad630
 
PPTX
Transposition cipher techniques
SHUBHA CHATURVEDI
 
PDF
Classical encryption techniques
Dr.Florence Dayana
 
PPT
Linux Crash Dump Capture and Analysis
Paul V. Novarese
 
PPT
Operating Systems - "Chapter 5 Process Synchronization"
Ra'Fat Al-Msie'deen
 
Transactional Memory
Smruti Sarangi
 
Semaphore
Arafat Hossan
 
Virtual memory
Muhammad Farooq
 
Block cipher modes of operation
harshit chavda
 
Semaphore
LakshmiSamivel
 
Kernels and its types
ARAVIND18MCS1004
 
Process synchronization in Operating Systems
Ritu Ranjan Shrivastwa
 
Operating System-Ch4.processes
Syaiful Ahdan
 
Memory Management
lavanya marichamy
 
Critical section problem in operating system.
MOHIT DADU
 
Linux I2C
KaidenYu
 
greedy algorithm Fractional Knapsack
Md. Musfiqur Rahman Foysal
 
Memory management
Vishal Singh
 
Semaphores
Mohd Arif
 
Address/Thread/Memory Sanitizer
Platonov Sergey
 
Bottom half in linux kernel
KrishnaPrasad630
 
Transposition cipher techniques
SHUBHA CHATURVEDI
 
Classical encryption techniques
Dr.Florence Dayana
 
Linux Crash Dump Capture and Analysis
Paul V. Novarese
 
Operating Systems - "Chapter 5 Process Synchronization"
Ra'Fat Al-Msie'deen
 

Viewers also liked (20)

PDF
Implementing STM in Java
Misha Kozik
 
PPTX
Motion capture technology
ARUN S L
 
PPT
Motion capture technology
Anvesh Ranga
 
PDF
Rolling with the Times: Using wheels, pbr, and Twine for Distributing and Ins...
doughellmann
 
PDF
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Kinson Chan
 
PDF
Multi-core Parallelization in Clojure - a Case Study
elliando dias
 
PDF
Transactional Memory for Smalltalk
Lukas Renggli
 
PDF
Understanding Hardware Transactional Memory
C4Media
 
PDF
いいかげんな人のためのTransactional Memory Primer
Yuto Hayamizu
 
PDF
20161110 cmlee opnfv_colorado1.0_분석
Cheolmin Lee
 
PPTX
Motion capture technology
Arun MK
 
PPTX
Motion Capturing Technology
Murlidhar Sarda
 
PPTX
Motion capture
Aswanth Talaseela
 
DOCX
Motion capture technology
Anvesh Ranga
 
PDF
DSL in Clojure
Misha Kozik
 
PPT
Secure shell ppt
sravya raju
 
PDF
visible light communication
Hossam Zein
 
PPT
55 New Features in Java 7
Boulder Java User's Group
 
PPTX
Rain technology
Yamuna Devi
 
PPTX
MultiTouch
Rishabha Garg
 
Implementing STM in Java
Misha Kozik
 
Motion capture technology
ARUN S L
 
Motion capture technology
Anvesh Ranga
 
Rolling with the Times: Using wheels, pbr, and Twine for Distributing and Ins...
doughellmann
 
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Kinson Chan
 
Multi-core Parallelization in Clojure - a Case Study
elliando dias
 
Transactional Memory for Smalltalk
Lukas Renggli
 
Understanding Hardware Transactional Memory
C4Media
 
いいかげんな人のためのTransactional Memory Primer
Yuto Hayamizu
 
20161110 cmlee opnfv_colorado1.0_분석
Cheolmin Lee
 
Motion capture technology
Arun MK
 
Motion Capturing Technology
Murlidhar Sarda
 
Motion capture
Aswanth Talaseela
 
Motion capture technology
Anvesh Ranga
 
DSL in Clojure
Misha Kozik
 
Secure shell ppt
sravya raju
 
visible light communication
Hossam Zein
 
55 New Features in Java 7
Boulder Java User's Group
 
Rain technology
Yamuna Devi
 
MultiTouch
Rishabha Garg
 
Ad

Similar to Transactional Memory (20)

PPTX
Paper_Scalable database logging for multicores
Hyo jeong Lee
 
PDF
Fluentd vs. Logstash for OpenStack Log Management
NTT Communications Technology Development
 
PDF
MongoDB World 2019: MongoDB Read Isolation: Making Your Reads Clean, Committe...
MongoDB
 
PDF
Erlang Lightning Talk
GiltTech
 
PDF
Building a Distributed Message Log from Scratch
Tyler Treat
 
ODP
Java memory model
Michał Warecki
 
PDF
Column Stride Fields aka. DocValues
Lucidworks (Archived)
 
PDF
Column Stride Fields aka. DocValues
Lucidworks (Archived)
 
PDF
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
lucenerevolution
 
PDF
Raft After ScyllaDB 5.2: Safe Topology Changes
ScyllaDB
 
PDF
Lecture 6 Kernel Debugging + Ports Development
Mohammed Farrag
 
PPT
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Hsien-Hsin Sean Lee, Ph.D.
 
PPTX
Memory model
MingdongLiao
 
PDF
Prerequisite knowledge for shared memory concurrency
Viller Hsiao
 
PDF
Demystifying MySQL Replication Crash Safety
Jean-François Gagné
 
PDF
Introduction to Apache Kafka
Shiao-An Yuan
 
PPTX
Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Sneeker Yeh
 
PDF
Optimizing Parallel Reduction in CUDA : NOTES
Subhajit Sahu
 
PDF
Profiling the logwriter and database writer
Enkitec
 
PDF
GCC LTO
Wang Hsiangkai
 
Paper_Scalable database logging for multicores
Hyo jeong Lee
 
Fluentd vs. Logstash for OpenStack Log Management
NTT Communications Technology Development
 
MongoDB World 2019: MongoDB Read Isolation: Making Your Reads Clean, Committe...
MongoDB
 
Erlang Lightning Talk
GiltTech
 
Building a Distributed Message Log from Scratch
Tyler Treat
 
Java memory model
Michał Warecki
 
Column Stride Fields aka. DocValues
Lucidworks (Archived)
 
Column Stride Fields aka. DocValues
Lucidworks (Archived)
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
lucenerevolution
 
Raft After ScyllaDB 5.2: Safe Topology Changes
ScyllaDB
 
Lecture 6 Kernel Debugging + Ports Development
Mohammed Farrag
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Hsien-Hsin Sean Lee, Ph.D.
 
Memory model
MingdongLiao
 
Prerequisite knowledge for shared memory concurrency
Viller Hsiao
 
Demystifying MySQL Replication Crash Safety
Jean-François Gagné
 
Introduction to Apache Kafka
Shiao-An Yuan
 
Dead Lock Analysis of spin_lock() in Linux Kernel (english)
Sneeker Yeh
 
Optimizing Parallel Reduction in CUDA : NOTES
Subhajit Sahu
 
Profiling the logwriter and database writer
Enkitec
 
Ad

More from Yuuki Takano (16)

PDF
アクターモデル
Yuuki Takano
 
PDF
π計算
Yuuki Takano
 
PDF
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
Yuuki Takano
 
PDF
リアクティブプログラミング
Yuuki Takano
 
PDF
Tutorial of SF-TAP Flow Abstractor
Yuuki Takano
 
PDF
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
Yuuki Takano
 
PDF
CUDAメモ
Yuuki Takano
 
PDF
【やってみた】リーマン多様体へのグラフ描画アルゴリズムの実装【実装してみた】
Yuuki Takano
 
PDF
SF-TAP: L7レベルネットワークトラフィック解析器
Yuuki Takano
 
PDF
MindYourPrivacy: Design and Implementation of a Visualization System for Thir...
Yuuki Takano
 
PDF
SF-TAP: 柔軟で規模追従可能なトラフィック解析基盤の設計
Yuuki Takano
 
PDF
Measurement Study of Open Resolvers and DNS Server Version
Yuuki Takano
 
PPTX
Security workshop 20131220
Yuuki Takano
 
PDF
Security workshop 20131213
Yuuki Takano
 
PDF
Security workshop 20131127
Yuuki Takano
 
PDF
A Measurement Study of Open Resolvers and DNS Server Version
Yuuki Takano
 
アクターモデル
Yuuki Takano
 
π計算
Yuuki Takano
 
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
Yuuki Takano
 
リアクティブプログラミング
Yuuki Takano
 
Tutorial of SF-TAP Flow Abstractor
Yuuki Takano
 
SF-TAP: Scalable and Flexible Traffic Analysis Platform (USENIX LISA 2015)
Yuuki Takano
 
CUDAメモ
Yuuki Takano
 
【やってみた】リーマン多様体へのグラフ描画アルゴリズムの実装【実装してみた】
Yuuki Takano
 
SF-TAP: L7レベルネットワークトラフィック解析器
Yuuki Takano
 
MindYourPrivacy: Design and Implementation of a Visualization System for Thir...
Yuuki Takano
 
SF-TAP: 柔軟で規模追従可能なトラフィック解析基盤の設計
Yuuki Takano
 
Measurement Study of Open Resolvers and DNS Server Version
Yuuki Takano
 
Security workshop 20131220
Yuuki Takano
 
Security workshop 20131213
Yuuki Takano
 
Security workshop 20131127
Yuuki Takano
 
A Measurement Study of Open Resolvers and DNS Server Version
Yuuki Takano
 

Recently uploaded (20)

PPTX
Simple and concise overview about Quantum computing..pptx
mughal641
 
PDF
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
PDF
OpenInfra ID 2025 - Are Containers Dying? Rethinking Isolation with MicroVMs.pdf
Muhammad Yuga Nugraha
 
PDF
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
PPTX
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
PDF
SalesForce Managed Services Benefits (1).pdf
TechForce Services
 
PDF
Brief History of Internet - Early Days of Internet
sutharharshit158
 
PPTX
AI Code Generation Risks (Ramkumar Dilli, CIO, Myridius)
Priyanka Aash
 
PDF
State-Dependent Conformal Perception Bounds for Neuro-Symbolic Verification
Ivan Ruchkin
 
PPTX
Agile Chennai 18-19 July 2025 | Emerging patterns in Agentic AI by Bharani Su...
AgileNetwork
 
PDF
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
PDF
Build with AI and GDG Cloud Bydgoszcz- ADK .pdf
jaroslawgajewski1
 
PPTX
Agile Chennai 18-19 July 2025 Ideathon | AI Powered Microfinance Literacy Gui...
AgileNetwork
 
PDF
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
PDF
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
PPTX
Farrell_Programming Logic and Design slides_10e_ch02_PowerPoint.pptx
bashnahara11
 
PDF
NewMind AI Weekly Chronicles – July’25, Week III
NewMind AI
 
PPTX
AVL ( audio, visuals or led ), technology.
Rajeshwri Panchal
 
PPTX
Earn Agentblazer Status with Slack Community Patna.pptx
SanjeetMishra29
 
PPTX
The Future of AI & Machine Learning.pptx
pritsen4700
 
Simple and concise overview about Quantum computing..pptx
mughal641
 
OFFOFFBOX™ – A New Era for African Film | Startup Presentation
ambaicciwalkerbrian
 
OpenInfra ID 2025 - Are Containers Dying? Rethinking Isolation with MicroVMs.pdf
Muhammad Yuga Nugraha
 
MASTERDECK GRAPHSUMMIT SYDNEY (Public).pdf
Neo4j
 
Applied-Statistics-Mastering-Data-Driven-Decisions.pptx
parmaryashparmaryash
 
SalesForce Managed Services Benefits (1).pdf
TechForce Services
 
Brief History of Internet - Early Days of Internet
sutharharshit158
 
AI Code Generation Risks (Ramkumar Dilli, CIO, Myridius)
Priyanka Aash
 
State-Dependent Conformal Perception Bounds for Neuro-Symbolic Verification
Ivan Ruchkin
 
Agile Chennai 18-19 July 2025 | Emerging patterns in Agentic AI by Bharani Su...
AgileNetwork
 
GDG Cloud Munich - Intro - Luiz Carneiro - #BuildWithAI - July - Abdel.pdf
Luiz Carneiro
 
Build with AI and GDG Cloud Bydgoszcz- ADK .pdf
jaroslawgajewski1
 
Agile Chennai 18-19 July 2025 Ideathon | AI Powered Microfinance Literacy Gui...
AgileNetwork
 
How ETL Control Logic Keeps Your Pipelines Safe and Reliable.pdf
Stryv Solutions Pvt. Ltd.
 
Research-Fundamentals-and-Topic-Development.pdf
ayesha butalia
 
Farrell_Programming Logic and Design slides_10e_ch02_PowerPoint.pptx
bashnahara11
 
NewMind AI Weekly Chronicles – July’25, Week III
NewMind AI
 
AVL ( audio, visuals or led ), technology.
Rajeshwri Panchal
 
Earn Agentblazer Status with Slack Community Patna.pptx
SanjeetMishra29
 
The Future of AI & Machine Learning.pptx
pritsen4700
 

Transactional Memory

  • 2. Why Transactional Memory? • lock is difficult to manage. • deadlock • starvation • priority Inversion • lock convoy • transactional memory mitigates these problems 2
  • 3. Deadlock 3 t Thread 1 Thread 2 Lock B Lock A try to acquire A and fail try to acquire B and fail
  • 4. Starvation 4 t High PriorityThread (acquire A) High PriorityThread (acquire B) Lock B Lock A Low PriorityThread (acquire A and B) try to acquire A and fail Lock A try to acuire B and fail Lock A Release A
  • 5. Priority Inversion 5 t High PriorityThread Low PriorityThread acquiring lock try to acquire and fail
  • 6. Lock Convoy 6 Scheduler Thread1 Thread2 Thread3 ThreadN 1. contention Thread2 4. acquire 2. event 3. contention (spin lock) 4. reschedule high overhead when many threads
  • 7. Complexity of Multithread Programming 7 algorithm data structure ideal world algorithm data structure parallelism parallel algorithm parallel data structure real world complicated source code simple source code buggy difficult to maintain actually we want
  • 8. Lock and Transactional Memory • Lock • execute critical section exclusively • only one code enter the critical section • Transactional Memory • execute critical section speculatively • multiple codes enter same critical section simultaneously • conflicts are detected both while executing critical section and the end of critical section 8
  • 9. Spin-lock by Atomic Operation • CAS (compare-and-swap) • compare and swap are performed atomically • test-and-set, compare-and-add, etc… • spin-lock is achieved by using CAS 9 int locked; lock_spin() { while (__sync_lock_test_and_set(&locked, 1)) { while (locked) ; // busy-wait } } unlock_spin() { __sync_lock_release(&locked); } if locked is 0, set 1
  • 10. Syntax of Transactional Memory atomic, retry, orElse 10 atomic { // transaction if (q.size() == 0) { // rollback and retry // transactions is restarted when // read-set is updated retry; } … // do something } orElse { // detect rollback and retry }
  • 12. Software Transactional Memory • TL2 • Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II , 20th International Conference on Distributed Computing , DISC 2006 • LSA • Torvald Riegel, Pascal Felber, and Christof Fetzer, A Lazy Snapshot Algorithm with Eager Validation , 20th International Conference on Distributed Computing, DISC 2006 • LogTM • Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, David A. Wood, LogTM: log- based transactional memory , HPCA 2006: 254-265 • DEUCE • Guy Korland, Nir Shavit and Pascal Felber, Noninvasive Java Concurrency with Deuce STM , MultiProg 2010 12 etc
  • 13. Summary of TL2 • prepare a variable called global clock • associate memory regions with version numbers • update version numbers when writing • detect conflicts when reading and writing by comparing the global clock with memory version number • retry transaction when detecting conflicts • otherwise commit 13
  • 14. TL2 - Variables 14 global version clock variable 1 version number 1 Global Memory Thread Local Memory read-version number 1 write-lock 1 variable 2 version number 2 write-lock 2 variable 3 version number 3 write-lock 3 write-version number 1 thread 1 read-set 1 write-set 1
  • 15. TL2 - Algorithm (1) 15 global version clock variable 1 version number 1 Global Memory Thread Local Memory read-version number 1 write-lock 1 variable 2 version number 2 write-lock 2 variable 3 version number 3 write-lock 3 write-version number 1 thread 1 read-set 1 write-set 1 transaction { load var1; load var2; … store var3; } 1. load the global version clock and store it in a thread local read-version number. 1.
  • 16. TL2 - Algorithm (2) 16 global version clock variable 1 version number 1 Global Memory Thread Local Memory read-version number 1 write-lock 1 variable 2 version number 2 write-lock 2 variable 3 version number 3 write-lock 3 write-version number 1 thread 1 read-set 1 write-set 12. run through a speculative execution transaction { load var1; load var2; … store var3; } 2. run
  • 17. TL2 - Algorithm (3) 17 global version clock variable 1 version number 1 Global Memory Thread Local Memory read-version number 1 write-lock 1 variable 2 version number 2 write-lock 2 variable 3 version number 3 write-lock 3 write-version number 1 thread 1 read-set 1 write-set 1 2.1. log read addresses to the read-set transaction { load var1; load var2; … store var3; } 2. log read-set
  • 18. TL2 - Algorithm (4) 18 global version clock variable 1 version number 1 Global Memory Thread Local Memory read-version number 1 write-lock 1 variable 2 version number 2 write-lock 2 variable 3 version number 3 write-lock 3 write-version number 1 thread 1 read-set 1 write-set 1 2.2. log write addresses and values to the write-set transaction { load var1; load var2; … store var3; } 2.2 log write-set
  • 19. TL2 - Algorithm (5) 19 global version clock variable 1 version number 1 Global Memory Thread Local Memory write-lock 1 variable 2 version number 2 write-lock 2 variable 3 version number 3 write-lock 3 thread 1 pointer to 1 pointer to 2 read-set 1 pointer to 3 write-set 1 pointer to 3 value of 3 variable 3 is stored and loaded Note that if a variable in the read-set already appears in the write-set, refer to the variable in the write-set from to avoid read-after-write hazard.
  • 20. TL2 - Algorithm (6) 20 global version clock variable 1 version number 1 Global Memory Thread Local Memory read-version number 1 write-lock 1 variable 2 version number 2 write-lock 2 variable 3 version number 3 write-lock 3 write-version number 1 thread 1 read-set 1 write-set 1 2.3. check variables are not modified when loading. make sure that version numbers are less than the read-version number. transaction { load var1; load var2; … store var3; } <= if modified, abort transaction
  • 21. TL2 - Algorithm (7) 21 global version clock variable 1 version number 1 Global Memory Thread Local Memory read-version number 1 write-lock 1 variable 2 version number 2 write-lock 2 variable 3 version number 3 write-lock 3 write-version number 1 thread 1 read-set 1 write-set 1 2.4. check write-locks are free? transaction { load var1; load var2; … store var3; } free? free? if locked, abort transaction
  • 22. TL2 - Algorithm (8): step 3, acquire the write-locks of the variables in the write-set using a bounded spin lock. If a write-lock cannot be acquired, abort the transaction.
  • 23. TL2 - Algorithm (9): step 4, increment the global version clock (with a CAS operation) and store the result as the write-version number.
  • 24. TL2 - Algorithm (10): step 5.1, validate the read-set: check that the variables have not been modified since they were loaded, i.e. that their version numbers are not greater than the read-version number. If one has been modified, abort the transaction.
  • 25. TL2 - Algorithm (11): step 5.2, check that the write-locks of the variables in the read-set are free (not held by another thread). If one is locked, abort the transaction.
  • 26. TL2 - Algorithm (12): step 5.3, in the special case where read-version number + 1 = write-version number (rv + 1 = wv), it is not necessary to validate the read-set.

  • 27. TL2 - Algorithm (13): step 6.1, commit the values in the write-set to global memory.

  • 28. TL2 - Algorithm (14): step 6.2, update the version numbers of the written variables to the write-version number.
  • 29. TL2 - Algorithm (15): step 6.3, release the write-locks.
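The TL2 steps above can be sketched in C11 as follows. This is a minimal sketch under stated assumptions, not the reference TL2 implementation: the names `tvar`, `tx`, `tx_begin`, `tx_load`, `tx_store`, `tx_commit`, and the capacity `MAX_SET` are invented for illustration, the bounded spin lock of step 3 is reduced to a single attempt per lock, the clock is advanced with an atomic fetch-and-add instead of an explicit CAS loop, and the version number and write-lock are kept in separate words as drawn on the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SET 8  /* assumed capacity of the read-set and the write-set */

typedef struct {
    atomic_ulong version;  /* version number of the variable */
    atomic_bool  lock;     /* write-lock of the variable */
    atomic_long  value;
} tvar;

typedef struct {
    unsigned long rv;                                    /* read-version number */
    tvar *rset[MAX_SET]; size_t nr;                      /* read-set */
    tvar *wvar[MAX_SET]; long wval[MAX_SET]; size_t nw;  /* write-set */
} tx;

static atomic_ulong global_clock;  /* global version clock, starts at 0 */

void tvar_init(tvar *v, long value) {
    atomic_init(&v->version, 0);
    atomic_init(&v->lock, false);
    atomic_init(&v->value, value);
}

/* Step 1: sample the global version clock into the read-version number. */
void tx_begin(tx *t) {
    t->rv = atomic_load(&global_clock);
    t->nr = t->nw = 0;
}

static long *find_write(tx *t, tvar *v) {
    for (size_t i = 0; i < t->nw; i++)
        if (t->wvar[i] == v) return &t->wval[i];
    return NULL;
}

/* Step 2.2: log address and value to the write-set. false means abort. */
bool tx_store(tx *t, tvar *v, long value) {
    long *slot = find_write(t, v);
    if (slot) { *slot = value; return true; }
    if (t->nw == MAX_SET) return false;
    t->wvar[t->nw] = v;
    t->wval[t->nw++] = value;
    return true;
}

/* Steps 2.1, 2.3, 2.4: load a variable. false means abort. */
bool tx_load(tx *t, tvar *v, long *out) {
    long *slot = find_write(t, v);  /* read-after-write: use the write-set */
    if (slot) { *out = *slot; return true; }
    if (t->nr == MAX_SET) return false;
    bool locked_before = atomic_load(&v->lock);
    long value = atomic_load(&v->value);
    bool locked_after = atomic_load(&v->lock);       /* lock first, then version */
    unsigned long version = atomic_load(&v->version);
    if (locked_before || locked_after || version > t->rv) return false;
    t->rset[t->nr++] = v;
    *out = value;
    return true;
}

/* Steps 3 to 6: commit. false means abort. */
bool tx_commit(tx *t) {
    size_t held = 0;
    unsigned long wv;
    for (; held < t->nw; held++) {  /* step 3: one attempt per write-lock */
        bool expected = false;
        if (!atomic_compare_exchange_strong(&t->wvar[held]->lock, &expected, true))
            goto fail;
    }
    wv = atomic_fetch_add(&global_clock, 1) + 1;  /* step 4 */
    assert(wv > t->rv);
    if (t->rv + 1 != wv) {                        /* step 5.3: skip if rv + 1 = wv */
        for (size_t i = 0; i < t->nr; i++) {
            tvar *v = t->rset[i];
            bool foreign = atomic_load(&v->lock) && !find_write(t, v);  /* 5.2 */
            if (foreign || atomic_load(&v->version) > t->rv) goto fail; /* 5.1 */
        }
    }
    for (size_t i = 0; i < t->nw; i++) {
        atomic_store(&t->wvar[i]->value, t->wval[i]);  /* step 6.1 */
        atomic_store(&t->wvar[i]->version, wv);        /* step 6.2 */
        atomic_store(&t->wvar[i]->lock, false);        /* step 6.3 */
    }
    return true;
fail:
    while (held > 0) atomic_store(&t->wvar[--held]->lock, false);
    return false;
}
```

A caller would wrap `tx_begin`, the loads and stores, and `tx_commit` in a loop and restart from `tx_begin` whenever any of them returns false.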
  • 31. Hardware Transactional Memory • use the CPU cache to detect conflicts • modify the cache coherence algorithm to achieve transactional memory
  • 32. Cache Coherence • MESI protocol • there are 4 states: Modified, Exclusive, Shared, Invalid
  • 33. MESI Modified State: the cache line is dirty (it must be written back) and is not shared with the other CPU. [Figure: CPU0 and CPU1, each with its own cache (cache 0, cache 1), connected to main memory]
  • 34. MESI Exclusive State: the cache line is not modified and is not shared with the other CPU.
  • 35. MESI Shared State: the cache line is not modified and is shared with the other CPU.
  • 36. MESI Invalid State: the cache line holds no meaningful data.
  • 37. MESI Exclusive Load: 1. request an exclusive load 2. the other cache writes the line back if it is modified 3. the other cache changes its state to invalid 4. load the line in the exclusive state
  • 38. MESI Shared Load: 1. request a shared load 2. the other cache writes the line back if it is modified 3. the other cache changes its state to shared 4. load the line in the shared state
  • 39. MESI eviction: 1. write the line back if it is modified 2. discard the line
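The load and eviction steps above can be modelled as a small state machine. The following is a minimal sketch that follows the simplified two-cache picture on the slides, with one cache line per cache and a single memory word; the names `cache_line`, `snoop`, `cache_read`, `cache_write`, and `cache_evict` are hypothetical, and a real MESI implementation has more transitions (for example, a read miss with no other sharer normally loads in the exclusive state).

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef struct {
    mesi_state state;
    int        data;
} cache_line;

int main_memory;  /* the single memory word that both caches map */

/* The other cache reacts to a request: steps 2 and 3 on the slides. */
static void snoop(cache_line *other, bool exclusive_request) {
    if (other->state == INVALID) return;
    if (other->state == MODIFIED) main_memory = other->data;  /* write back */
    other->state = exclusive_request ? INVALID : SHARED;
}

/* Shared load: on a miss, fetch the line in the shared state. */
int cache_read(cache_line *self, cache_line *other) {
    if (self->state == INVALID) {
        snoop(other, false);
        self->data = main_memory;
        self->state = SHARED;
    }
    return self->data;
}

/* Exclusive load followed by a local write, which makes the line modified. */
void cache_write(cache_line *self, cache_line *other, int value) {
    if (self->state == INVALID || self->state == SHARED) {
        bool had_data = (self->state == SHARED);
        snoop(other, true);
        assert(other->state == INVALID);  /* no other copy may remain */
        if (!had_data) self->data = main_memory;
        self->state = EXCLUSIVE;
    }
    self->data = value;
    self->state = MODIFIED;
}

/* Eviction: write back if modified, then discard. */
void cache_evict(cache_line *self) {
    if (self->state == MODIFIED) main_memory = self->data;
    self->state = INVALID;
}
```

In this model a write by CPU0 leaves main memory stale until CPU1 reads the line or CPU0 evicts it, which matches the "dirty, must write back" note on the Modified slide.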
  • 40. Transactional Cache Coherence (1): add a transactional bit to each cache line (0: not in transaction, 1: in transaction). [Figure: CPU0 and CPU1, each with its own cache, connected to main memory]
  • 41. Transactional Cache Coherence (2): abort the transaction if the MESI protocol invalidates a transactional entry that is in the shared or exclusive state.
  • 42. Transactional Cache Coherence (3): discard the modified value and abort the transaction if the MESI protocol invalidates or evicts a transactional entry that is in the modified state.
  • 43. Transactional Cache Coherence (4): abort the transaction if the MESI protocol evicts a transactional entry, because the cache coherence protocol cannot detect conflicts on an evicted line.
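The abort rules above can be condensed into one handler that runs whenever a cache line is lost. This is a sketch of the rules as stated on the slides, not a description of any real CPU: `tx_cache_line` and `lose_line` are hypothetical names, and invalidation and eviction are treated as the same event because the slides abort in both cases.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef struct {
    mesi_state state;
    bool       tx;    /* transactional bit: true while in a transaction */
    int        data;
} tx_cache_line;

/* Called when the MESI protocol invalidates or evicts the line.
   Returns true when the running transaction must abort. */
bool lose_line(tx_cache_line *line, int *main_memory) {
    /* the transactional bit is only meaningful on a valid line */
    assert(!(line->tx && line->state == INVALID));
    bool must_abort = line->tx;
    if (line->state == MODIFIED && !line->tx)
        *main_memory = line->data;  /* ordinary write-back */
    /* a transactional modified value is discarded, never written back */
    line->state = INVALID;
    line->tx = false;
    return must_abort;
}
```

Because a speculative value is never written back, an aborted transaction leaves main memory exactly as it was before the transaction started.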
  • 45. Problem (1) • infinite loops in transactions • validating variable versions inside loops to detect them would reduce performance significantly • closed memory management is required • in languages like C and C++, code outside a transaction can read and update variables used inside a transaction • the compiler or runtime environment has to take care of this
  • 46. Problem (2): atomic { … launchMissile(); … } Missiles may be launched many times! I/O in a transaction must cause an abort.
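The missile problem can be reproduced with a toy retry loop. The sketch below is an illustration only: `try_commit` is a hypothetical stand-in for a real commit that fails a given number of times, and `launch_missile` merely counts calls. It shows that rolling back memory does not undo a side effect that has already happened.

```c
#include <assert.h>
#include <stdbool.h>

int launches;  /* counts an irreversible side effect */

static void launch_missile(void) { launches++; }

/* Hypothetical stand-in for commit: the first `conflicts` attempts fail. */
static bool try_commit(int attempt, int conflicts) {
    return attempt >= conflicts;
}

/* Runs the body of atomic { launchMissile(); } until it commits.
   Returns the number of times the body was executed. */
int run_atomic(int conflicts) {
    assert(conflicts >= 0);
    for (int attempt = 0; ; attempt++) {
        launch_missile();  /* executed speculatively on every attempt */
        if (try_commit(attempt, conflicts))
            return attempt + 1;
        /* on abort, memory effects are rolled back, the launch is not */
    }
}
```

With two conflicts the body runs three times, so three launches happen for a single logical transaction.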
  • 49. Software Transactional Memory (STM) in Haskell • Haskell provides STM in its concurrency library (Control.Concurrent.STM) • the STM monad is provided to implement transactions • example implementation: https://gist.github.com/ytakano/228b68ef099c7bdd2f2c
  • 50. Hardware Transactional Memory (HTM) Intel TSX • HTM is available from Haswell • Intel TSX HLE: xacquire and xrelease instruction prefixes • Intel TSX RTM: xbegin and xend instructions
  • 51. Intel TSX RTM
      xbegin ABORT
      . . .
      xend
      ABORT:
      // fallback if aborted
    A transaction is not guaranteed to commit, so the code must sometimes go to fallback code (such as a spin lock).
  • 52. Lock by using tsx-tools (https://github.com/andikleen/tsx-tools): lock by using Intel TSX RTM
      #include <immintrin.h>  // _xbegin, _xabort, _mm_pause; build with -mrtm
      // RTM_MAX_RETRY: retry limit, to be defined by the caller
      volatile int lock = 0;
      void rtm_lock(void) {
        for (int i = 0; i < RTM_MAX_RETRY; i++) {
          unsigned status = _xbegin();
          if (status == _XBEGIN_STARTED) {
            if (!lock) return;  // successfully started
            _xabort(0xff);      // lock is held: abort explicitly
          }
          if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff &&
              !(status & _XABORT_NESTED)) {
            while (lock) _mm_pause();  // busy-wait
          } else if (!(status & _XABORT_RETRY)) {
            break;
          }
        }
        while (__sync_lock_test_and_set(&lock, 1)) {  // fallback to spin-lock
          while (lock) _mm_pause();  // busy-wait
        }
      }
  • 53. Unlock by using tsx-tools (https://github.com/andikleen/tsx-tools): unlock by using Intel TSX RTM
      void rtm_unlock(void) {
        if (lock) {
          __sync_lock_release(&lock);  // lock was taken by the spin-lock fallback
        } else {
          _xend();                     // commit the transaction
        }
      }
  • 54. Performance of Intel TSX • Intel says that code using coarse-grained locks with TSX can perform comparably to code using fine-grained locks • this makes it easy to write code that scales across cores
  • 55. Applying Intel® TSX: Fine Grain Behavior at Coarse Grain Effort (from Intel Developer Forum 2012). [Figure: two plots of scaling against threads, titled "Application with Coarse Grain Lock" and "Application re-written with Finer Grain Locks"; curves: Coarse Grain Lock, Coarse Grain Lock + Intel® TSX, Fine Grain Locks, Fine Grain Locks + Intel® TSX; note: "An example of secondary benefits of Intel® TSX"]
  • 56. Intel® TSX Can Enable Simpler Scalable Algorithms: Enabling Simpler Algorithms (Real World Example, from Intel Developer Forum 2012) • Lock-Free Algorithm: don't use critical section locks; developer manages concurrency; very difficult to get correct and optimize (constrains data structure selection, highly contended atomic operations) • Lock-Based + Intel® TSX: use critical section locks for ease; let hardware extract concurrency; enables algorithm simplification (flexible data structure selection; the equivalent data structure lock-free algorithm is very hard to verify) [Figure: two plots of ops/sec against threads, titled "State of the art lock-free algorithm" and "TSX lock based algorithm"]