Dynamic Simultaneous Multithreaded Architecture
Figure 1: Organization of the DSMT microarchitecture (Next_PC logic, Scheduler, Execution Unit, Load/Store unit, MOB, multiple Contexts, Loop Detection Unit with good/bad loop and DSMT mode indications, TCIU, D-cache, L2 Cache, and Main Memory).
run-time. Data obtained during the non-speculative execution phase of DSMT is used as a hint to speculate the posterior behavior of multiple threads. In contrast to other similar architectures, DSMT employs a simple mechanism based on state bits to keep track of inter-thread dependencies in registers and memory, synchronize thread execution, and recover from misspeculation. Moreover, DSMT utilizes a novel greedy policy to choose those sections of code that provide the highest performance based on their past execution history. To assess the performance of DSMT, a new cycle-accurate, execution-driven simulator called DSMTSim was developed. This simulator is capable of reproducing in detail the complex dynamic behavior of DSMT.

This paper is organized as follows. Section 2 describes architectures closely related to DSMT. Section 3 provides a detailed description of the DSMT microarchitecture, including details of how threads are generated and spawned and the mechanisms used to keep track of inter-thread dependencies in registers and memory. The discussion of DSMTSim and the simulation results are presented in Section 4. Finally, Section 5 concludes the paper and discusses future expansions to DSMT.

2. Related Work

The proposed architecture for DSMT was drawn from a plethora of related works that studied different ways of exploiting both ILP and TLP from a single program [1, 2, 6, 11, 14, 20, 23, 29]. These proposals share many similarities in how thread-level speculation is supported and basically differ in how much of this support is provided in hardware versus software. Since static techniques for exploiting TLP using binary annotators or parallelizing compilers have already been discussed in [6, 22, 19], this section briefly describes architectures similar to DSMT that dynamically exploit TLP.

In the Clustered Multithreaded Architecture (CMA), a control speculation mechanism dynamically identifies threads from different iterations of a loop. These threads are then executed concurrently on several thread units [9, 11]. Thread units are interconnected through a ring topology, and iterations are allocated to thread units based on their execution order. Each thread unit has its own physical registers, register map table, instruction queue, functional units, local memory, and reorder buffer. Inter-thread data dependencies through registers and memory are predicted with the help of a history table called the loop iteration table. When a speculative thread is created, its logical register file and its register map table are copied from its predecessor. At the same time, the increment predictor initializes any live and predictable register. When a thread finishes, its output predictions are verified and mispredictions are handled by selective re-execution. Inter-thread memory dependence speculation is performed by means of a multi-value cache. This cache memory stores, for each address, as many different data words as the number of thread units.
Figure 2: Structure of the TCIU and the multiple Contexts (Iteration Counter, Continuation register, LSST with Conf, Op, Rd, and Immd fields, Register Read Confidence Table, Join bits J0..JN-1, D_Anchor bits, and per-context PC, V and S bits, and registers with D, L, and R bits).
CMA shares many similarities with DSMT, especially dynamic thread generation and a ring topology used to communicate register values among contexts. However, the main differences between the two architectures are the following: (1) CMA requires special hardware mechanisms such as the multi-value cache, (2) CMA does not exploit nested loops, and (3) CMA is scalable but not based on an SMT core. Moreover, the performance results of CMA, as well as those of some other previous architectures derived from the same research, were obtained with a trace simulator called ATOM [23, 9, 10, 11]. These results showed the potential performance benefit that can be obtained by exploiting only loops. However, since multithreading exhibits dynamic behavior, trace simulators will not accurately reproduce misspeculations [2]. Moreover, as will be shown later in this paper, frequent misspeculations cause significant degradation in performance. Our study of DSMT is based on a cycle-accurate, execution-based simulator capable of accurately reproducing the effect of misspeculations on the processor's performance.

The work closest to ours is the Dynamic Multithreaded Architecture (DMT) [1]. DMT is designed around an SMT processor core. DMT also generates threads dynamically at run time and is capable of executing in parallel loops, procedures, and the code after the procedure. To relax the limitations imposed by register and memory dependencies, thread-level dataflow and data value prediction are used. A spawned thread uses as its input the register context from the thread that spawned it. Data speculation on the inputs to a thread allows new speculative threads to immediately start execution. Control logic keeps a list of the thread order and the starting PC of each thread. A thread stops fetching instructions when it reaches the start of the next thread in the order list. If for some reason a thread never reaches this point, it is considered misspeculated and consequently squashed. Threads communicate through registers and memory. Communication between threads is one way only and dictated by their order. Loads are issued to memory speculatively, assuming that there are no dependencies with stores from previous threads. However, since threads do not wait for their inputs to be ready, data misspeculation is common. DMT uses selective recovery on misspeculated instructions, which is initiated as soon as the correct input is available. Trace buffers outside the main pipeline hold all speculative instructions and their results. During recovery, instructions are fetched from the trace buffers and re-dispatched into the execution pipeline.

The major differences between DMT and DSMT are that (1) DMT exploits procedures and loop continuations, (2) DMT employs multiple levels of speculation, and (3) DMT employs more complex mechanisms for recovering from misspeculations. DMT and DSMT are complementary approaches to exploiting TLP. In particular, given that DMT exploits procedures and loop continuation code, integer applications will benefit more from the DMT architecture. In contrast, numerical applications consisting mainly of loops will execute more efficiently on DSMT.

3. DSMT Microarchitecture

Figure 1 shows the organization of the DSMT microarchitecture. Its core consists of a generic superscalar processor organized into six pipelined stages: Fetch, Decode/Dispatch, Issue, Execute, Write-back, and Commit. The Fetch stage fetches a block of instructions from a thread in the usual manner, but can also fetch instructions from different threads based on the scheduling policy offered by the Scheduler. To support simultaneous execution of multiple threads, each thread has its own set of Instruction Queue (IQ), Reorder Buffer (ROB), and Context. Each Context represents the state of a thread, and the multiple Contexts are also interfaced to the Thread Creation and Initiation Unit (TCIU), which controls how threads are cloned and executed. It also contains the Loop Detection Unit, which is responsible for detecting loops and supplying target addresses so that multiple threads can be cloned by the TCIU. The following subsections highlight the functionality of the various components.

3.1 Loop Detection Unit

During the execution of a program, DSMT operates in either DSMT or non-DSMT mode. In non-DSMT mode, there is only a single thread of execution and thus the processor behaves as a superscalar processor. When a loop is detected, the processor enters the pre-DSMT mode, and later, if it is determined that multithreaded execution will improve performance, it enters the full-DSMT mode. During pre-DSMT mode, the processor detects live registers and the information required to speculatively predict register values. In full-DSMT mode, the overlapped execution of loop iterations occurs. Moreover, there is always a single non-speculative context, which is the only thread permitted to clone speculative threads. This policy guarantees precise interrupts and reduces the complexity that would be required to control multiple speculative stages.

Whenever a taken backward-branch instruction is detected, its branch and target addresses are recorded in the Loop Detection Unit. Later, if another branch instruction with the same branch target address is found, the processor enters the pre-DSMT mode. The Loop Detection Unit consists of a specially modified BTB augmented with additional fields to facilitate loop identification. In addition to the typical fields found in a modern superscalar processor's BTB (e.g., branch address, target address, and branch prediction information), it contains the following information: (1) a flag indicating that the target address of this branch is the starting address of a loop; (2) the number of iterations that this loop has executed in the past (i.e., the number of consecutive taken branches); and (3) type information indicating whether this is a "good" or "bad" loop for speculative execution based on its previous behavior.

The Loop Detection Unit also contains a field that provides feedback on how loops behaved in their previous pre- and full-DSMT modes of execution. Four criteria are used to determine whether a loop is "good" or "bad" for speculative execution: (1) the number of iterations a loop executes, (2) the number of contexts currently available, (3) how much overlapped execution cloned threads exhibit, and (4) thread run-length. The first two criteria determine the potential TLP in the cloned threads. However, even loops with a large number of iterations may exhibit very low ILP during execution. This could be caused by the presence of a large number of inter-thread dependencies in the loop, or by frequent misspeculation of loop iterations. In this case, the third criterion is used to determine the effectiveness of the DSMT execution mode. This criterion is based on an Instructions-per-Clock (IPC) measure obtained during the execution of a loop in DSMT mode. On the other hand, the fourth criterion, together with the first two, indicates how sustainable the overlapped execution can be. The first three criteria are combined to form the sustained IPC (SIPC) measure for a loop. A loop is labeled as good or bad based on a "break even" policy, where the observed IPC during DSMT execution is compared against the observed IPC for the non-DSMT execution of the same loop measured during pre-DSMT mode. If the measured IPC breaks even, then the loop is labeled as a "good" loop for speculative execution. This way, we can guarantee that the DSMT execution mode will result in performance as good as or better than the non-DSMT mode of execution.

Nested loops provide opportunities for DSMT to select the appropriate thread granularity. For these loops, the SIPC measure is also used to select the particular loop in the nested loop structure that provides the best performance. The control of nested loop execution is handled by a special stack structure associated with the Loop Detection Unit. This mechanism stores the branch and target addresses of a loop in a stack of loops. Inner loops are stored at the bottom of the stack and outer loops at the top. When a new loop is detected, its branch and target addresses are compared with the corresponding addresses stored at the top of the stack. If the new loop's branch and target addresses are in the range of addresses stored at the top of the stack, the loop is pushed onto the stack. Later, during DSMT mode, the stack of loops is accessed based on the SIPC value obtained during the execution of each nesting level. The loop with the highest SIPC in the nest is chosen, and all others are discarded.
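As a concrete illustration, the loop capture and the break-even labeling described above can be condensed into a small software model. This is only a sketch of the policy as stated, not the hardware implementation; the table layout, method names, and the direct comparison of raw IPC values are assumptions.

```python
# Sketch of the Loop Detection Unit's BTB-like loop table and the
# "break even" good/bad labeling policy of Section 3.1.
# Field names and method signatures are illustrative assumptions.

class LoopEntry:
    def __init__(self, branch_pc, target_pc):
        self.branch_pc = branch_pc   # address of the taken backward branch
        self.target_pc = target_pc   # loop start (branch target address)
        self.iterations = 0          # consecutive taken branches observed
        self.good = False            # good/bad label for speculative execution

class LoopDetectionUnit:
    def __init__(self):
        self.table = {}              # keyed by branch target address

    def observe_taken_backward_branch(self, branch_pc, target_pc):
        """Record a taken backward branch; seeing the same target again
        triggers pre-DSMT mode (returns True)."""
        entry = self.table.get(target_pc)
        if entry is None:
            self.table[target_pc] = LoopEntry(branch_pc, target_pc)
            return False             # first sighting: just record it
        entry.iterations += 1
        return True                  # same target seen again -> pre-DSMT

    def label_loop(self, target_pc, dsmt_ipc, non_dsmt_ipc):
        """Break-even policy: the loop is 'good' only if the IPC observed
        in DSMT mode at least matches the non-DSMT IPC for the same loop."""
        entry = self.table[target_pc]
        entry.good = dsmt_ipc >= non_dsmt_ipc
        return entry.good
```

In this model a loop becomes a candidate on the second taken branch to the same target, and stays labeled "good" only while its DSMT-mode IPC breaks even with the IPC observed during pre-DSMT mode.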
4
Table 2: DSMT Functional Unit Configuration Table 1: DSMT Configuration
3.2 Thread Creation and Initiation Unit and Multiple Contexts

The structure of the TCIU and the multiple contexts are shown in Figure 2. When the Loop Detection Unit detects a loop, using the policies described earlier, it latches the target address of the thread to be cloned into the Continuation register and sets the M-bit to indicate the processor is in pre-DSMT mode. The TCIU also has a set of anchor bits called D_Anchor and R_Anchor bits, which are updated at the start of the second iteration with the status of the D and R bits (explained shortly), respectively, of the non-speculative context just before threads are cloned. These anchor bits provide a means of (1) speculating whether cloned threads should read the registers of their own context or the registers of other contexts and (2) generating a new set of bits for future speculation. Afterwards, the TCIU sends a signal to all other units indicating that the full-DSMT mode of execution has started.

In full-DSMT mode, the thread cloning process starts by copying the target address of a loop in the Continuation register to the PCs of each cloned context. To "jump start" each thread, immediate values of instructions of the form addi rd,rd,#immd are predicted and stored in the Loop Stride Speculation Table (LSST). The compiler generally uses rd as an induction variable or to access data in regular patterns. The speculation used by DSMT is to predict the contents of a source register rd in the LSST as rd=rd+iteration*immd, where iteration is the current iteration number. To avoid blind speculation on rd, each entry in the LSST is associated with confidence bits based on 2-bit saturating counters. Next, the values stored in the non-speculative register file are copied to the Registers in each new context, and the Valid (V) and Speculative (S) mode bits are set. The V-bit indicates that the context is valid and, therefore, the Fetch unit can start fetching instructions from its PC. The S-bit, when set, indicates the context is running speculatively. Therefore, a single non-speculative context owns the precise state of the processor in its private register file.

Each context is also equipped with a register file that provides a distinct logical view of its state, allowing fast register access within a context. The multiple contexts are interconnected in a ring fashion, and the Head and Tail registers of the TCIU determine the first and the last thread currently running on the DSMT processor.

To keep track of inter-thread register dependencies, each register is associated with a set of utility bits (in addition to the usual ROB entry tags and Busy bits found in the register file of modern superscalar processors). They are:

Ready (R) bit – When set, it indicates that some instruction(s) logically preceding this one in the thread's program order has committed a value to the register; otherwise, no value has been committed to the register and there are no instruction(s) in the local ROB that will commit to this register. R bits reflect whether registers can be read from their own context or speculatively read from a predecessor context. This flag also indicates that successor contexts need look no further than this context to get the value they need.

Dependency (D) bit – Keeps track of registers that have inter-thread dependencies. When a register is read, if its R-bit is zero and there are no other instructions in the ROB that will commit to the register, a check is then made to see if its R_Anchor bit is set. If both of these conditions are true (i.e., the register was not written in the current context but another context previously wrote to it), an inter-thread dependence exists for the register and the D bit is set. These bits will serve as D_Anchor bits to facilitate speculation on how registers are accessed.

Load (L) bit – When set, it indicates that the register has been speculatively read from a predecessor context. When an instruction attempts to read a register in its own context, the L-bit is set if its R-bit is zero and there is no instruction in the ROB that will commit to this register. If it is later determined that the register was actually written in a context while a successor thread's L-bit is set, then all successor threads are squashed. Since these register reads are speculative, a confidence value based on a 2-bit saturating counter is associated with each register and stored in the Register Read Confidence Table.

Each context has an associated Join (J) bit, which is set to indicate that an iteration has completed. This bit is used to synchronize multiple threads that have reached this state. At this point, speculative contexts must wait until the non-speculative context commits its results and transfers the non-speculative register values to a new context. Transferring the state from the non-speculative context comprises copying the value of each register in the non-speculative context's register file to the new non-speculative context. However, the copying process skips those register values that were identified as live during the execution of the new non-speculative context.
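The LSST speculation described earlier in this subsection, rd = rd + iteration*immd gated by a 2-bit saturating confidence counter, can be sketched as follows. This is an illustrative model; the entry format, the training rule, and the predict-only-at-confidence-two threshold are assumptions.

```python
# Sketch of Loop Stride Speculation Table (LSST) prediction for
# addi rd,rd,#immd induction updates, with a 2-bit saturating
# confidence counter per entry (the threshold choice is an assumption).

class LSSTEntry:
    def __init__(self, immd):
        self.immd = immd        # stride observed for this register
        self.conf = 0           # 2-bit saturating counter (0..3)

class LSST:
    PREDICT_THRESHOLD = 2       # assumed: predict only when conf >= 2

    def __init__(self):
        self.entries = {}       # keyed by destination register rd

    def train(self, rd, immd):
        """Observe one addi rd,rd,#immd; reset on a changed stride."""
        e = self.entries.get(rd)
        if e is None or e.immd != immd:
            self.entries[rd] = LSSTEntry(immd)   # new or changed stride
        else:
            e.conf = min(3, e.conf + 1)          # saturate upward

    def predict(self, rd, base_value, iteration):
        """Jump-start a cloned context: rd = rd + iteration * immd,
        but only when confidence is high enough; otherwise no prediction."""
        e = self.entries.get(rd)
        if e is None or e.conf < self.PREDICT_THRESHOLD:
            return None          # blind speculation avoided
        return base_value + iteration * e.immd
```

For example, after three observations of the same stride 8 on register 4, a context cloned for iteration 3 would receive the value base + 3*8.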
At this point, there are two important implementation details in the DSMT architecture that need to be mentioned. First, before threads can be cloned, a couple of loop iterations must be executed in non-DSMT mode to establish the contents of the Continuation register, the LSST entries, and the D_Anchor and R_Anchor bits. Second, all cloned threads execute speculatively. Thus, when the non-speculative thread completes, its successor thread becomes the new non-speculative thread. Therefore, D_Anchor and R_Anchor are updated with the values of the R and D bits of the just-completed, non-speculative thread, and the Head register is updated to point to the new non-speculative thread.

3.3 Resolution of Inter-thread Dependencies in Registers and Memory

Register dependencies between iterations are resolved by speculatively accessing registers based on the D_Anchor bits. If the R-bit is set for a register, the register value can be read directly from its own context. Otherwise, first-level speculation, called register dependence speculation, is performed based on its D_Anchor bit to determine which thread the value should be read from. For example, when an instruction in a thread tries to read its own register with its R-bit equal to zero, the D_Anchor bit for the register is checked. If the D_Anchor bit is set, it indicates that previous executions of the iterations had an inter-thread dependency on the register. Therefore, the speculation assumes that this inter-thread dependency will likely exist in the current execution of speculative threads, and the register, when ready (i.e., R-bit = 1), is read from its immediate predecessor thread. If the register dependence speculation turns out to be wrong, due to the dynamic behavior of loop iterations, and the immediate predecessor thread does not generate the register value, second-level speculation is used. This involves searching back for the last thread that generated a value for the register.

On the other hand, if the D_Anchor bit of a register is zero, it indicates that previous executions of the loop iterations did not have an inter-thread dependence on the register. However, since the dynamic behavior of loop iterations may still have changed the register through an inter-thread dependence, the speculation used is to assume that predecessors may have modified the register, and the register is read from the last thread that wrote to it. If no predecessor threads have modified the register, it is read from its own context.

Since the DSMT processor relies very heavily on speculation, the cloning and speculative execution of threads require a method to detect and squash threads when misspeculation occurs. The detection of misspeculation is performed when registers are written in each context during the Commit stage. Whenever a thread writes to a register, the L bits of the successor threads are checked to see if any thread has read the register earlier. If so, that thread and all of its successor threads are squashed and reinitiated.

In order to ensure proper ordering and yet maximize the overlapped execution of the cloned threads, a new iteration is initiated in a context whenever any other completes. Thus, when a thread completes a single iteration, that context will set the appropriate J-bit in the TCIU. Since the just-completed iteration has properly updated its R and D bits, these bits become the new set of anchor bits. In addition, the just-completed thread's successor now becomes the non-speculative thread. Therefore, the TCIU can reinitiate the next iteration by appropriately changing the Head and Tail registers and cloning a new thread.

In DSMT, loads from different threads, either speculative or non-speculative, can be executed speculatively. However, only the non-speculative threads are allowed to perform stores. To ensure that the sequential semantics is not violated, the Memory Dataflow Resolution Table (MDRT) is used. Load/store operations are kept in one of the Load/Store Queues (LSQs) according to their tag. In addition to allowing loads to bypass stores and forwarding values from stores to loads that have been disambiguated, the LSQ also acts as a buffer so that stores can speculatively commit locally. This prevents uncommitted stores from speculative threads from blocking the ROB. Special logic selects load/store operations, giving priority to those corresponding to the non-speculative context, and forwards them to the memory subsystem. The MDRT checks these operations to ensure the correct state of the memory. The MDRT is a fully associative buffer, with each entry containing a valid (V) bit, a word address (addr), and a value. In addition, each thread has an L-bit and an S-bit indicating whether the memory word has been loaded or stored, respectively.

Loads proceed normally for non-speculative threads. However, speculative threads performing a load check to see if an entry exists for the addr. If none exists, an entry is allocated for the addr. If an entry is found, its L-bit is set and the load is allowed to proceed with its execution. A store is not allowed to update the memory unless it is from a non-speculative thread and it is at the head of its ROB. This guarantees that the precise state of the processor can be maintained. When a non-speculative thread performs a store, it sets the S-bit for that thread. In addition, it checks to see if other threads have their L bits set. If any thread has read this memory value too early, the thread and all of its successor threads are squashed.
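The two levels of register dependence speculation described in Section 3.3 reduce to a small decision routine over the ring of contexts. The sketch below is a simplified model: the data layout, the linear scan over predecessor contexts, and the omission of squash handling are all assumptions.

```python
# Sketch of the register-read resolution of Section 3.3: contexts are
# ordered predecessor-first; R bits and the D_Anchor bit decide where a
# source register value is read from. A simplified software model.

def resolve_register_read(reg, ctx_index, contexts, d_anchor):
    """Return the index of the context the register should be read from.
    contexts[i][reg] is a dict with 'R' (ready) and 'written' flags.
    The non-speculative (head) context never takes the speculative paths."""
    own = contexts[ctx_index][reg]
    if own['R']:
        return ctx_index                      # value produced locally
    if d_anchor[reg]:
        # First-level speculation: previous iterations showed an
        # inter-thread dependence, so read the immediate predecessor.
        return ctx_index - 1
    # D_Anchor clear: assume predecessors may still have modified the
    # register, and fall back to the last one that actually wrote it.
    for i in range(ctx_index - 1, -1, -1):
        if contexts[i][reg]['written']:
            return i
    return ctx_index                          # nobody wrote it: own context
```

For a thread in context 2 whose register is not ready, a set D_Anchor bit selects context 1; a clear D_Anchor bit selects whichever predecessor last wrote the register, or the thread's own context if none did.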
Figure 3: DSMT performance based on the ICount2.8-modified fetch policy (speedup for 2, 4, and 8 contexts over the 24 Livermore kernels, with geometric mean).
Figure 4: Performance with unlimited number of fetch ports.
4. Simulation Environment

To evaluate the performance of DSMT, DSMTSim was implemented. DSMTSim is a cycle-accurate, execution-driven simulator capable of operating in different execution modes: (a) fast, in-order simulation and (b) detailed wide-issue, out-of-order, multiple-context simulation. The fast simulation mode allows quickly placing the simulator in a particular section of a benchmark code, skipping non-representative parts like those corresponding to initialization. During fast-mode simulation, instructions are read directly from memory and executed in sequence. Conversely, in the detailed simulation mode, the entire memory hierarchy and all the pipeline stages of the simulator are exercised. The simulator loads a binary program into its internal memory, using the detailed simulation mode, and then simulates, cycle by cycle, all the processing performed by the main pipeline.

DSMTSim executes PISA binaries generated by SimpleScalar's GCC. However, except for the memory model and the syscall support, DSMTSim is a completely new simulator that shares very little with SimpleScalar's sim-outorder simulator. One of the main advantages of DSMTSim is that it reproduces in detail all the dynamic events that DSMT's multithreaded execution mode generates. Specifically, DSMTSim processes instructions out of order, spawns new threads, synchronizes threads, and flushes and recovers from misspeculated threads. At the intra-thread level, branches are predicted and executed speculatively. Also, register values are generated and sent from producer instructions to consumers on-the-fly, at either the intra-thread or inter-thread level. On-the-fly value passing closely resembles the action performed by real superscalar processors, and is also used as a means of checking the correct operation of the tagging and out-of-order execution mechanisms included in the simulator. In this way, correct manipulation of data values is ensured during speculative execution.

5. Performance Simulation Results

Our simulation study of DSMT was based on the Livermore loops and SPEC95 benchmarks. The Livermore loops consist of 24 core calculations used in familiar numerical algorithms, such as matrix multiplication, Cholesky conjugate gradient, Monte Carlo search, etc. Both sets of benchmarks were compiled without modifications to the original C source code, using GCC with the -O3 optimization flag turned on. In this simulation, the I-cache is two-ported [27] and the fetching policy used by the Scheduler (see Figure 1) is similar to ICount2.8 [24]. The ICount2.8 policy employs two fetch ports, each able to fetch up to eight instructions per clock cycle. Using this policy, priority is given to the context with the lowest number of instructions (ICount) in its decode, rename, and issue stages. However, SMT's original ICount2.8 policy was slightly modified for DSMT (called ICount2.8-modified) by selecting the non-speculative thread first, and then choosing among the remaining speculative threads the thread with the lowest ICount. DSMTSim's simulation parameters and the functional unit configuration were based on Tables 1 and 2. Also, each context was allowed to issue up to four instructions per clock cycle.

We first ran the Livermore loops to assess the effectiveness of DSMT on benchmarks that contain mainly loops. Figure 3 shows the speedup obtained by the DSMT processor with 2, 4, and 8 contexts when compared to a single-context, 8-wide issue superscalar processor. In the graph, GMean represents the geometric mean of the speedup obtained over all 24 kernels. As the figure shows, the maximum speedup obtained by DSMT was on average 34%, 84%, and 100% with 2, 4, and 8 contexts, respectively. Figure 4 shows the performance of an ideal case where the number of fetch ports is equal to the number of contexts. The difference in performance between the two configurations (ideal and ICount2.8-modified) is on average 0%, 5%, and 3% for 2, 4, and 8 contexts, respectively. Therefore, the performance obtained using two fetch ports with the ICount2.8-modified policy is very close to that obtained by the ideal configuration.
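The ICount2.8-modified fetch selection described above can be modeled in a few lines. This is a sketch of the policy as stated, not DSMTSim code; the thread-record fields and the tie-breaking order are assumptions.

```python
# Sketch of ICount2.8-modified fetch selection: with two fetch ports,
# the non-speculative thread always fetches first, then the remaining
# port(s) go to the speculative thread(s) with the fewest in-flight
# instructions (lowest ICount), as in the original ICount policy.

def select_fetch_threads(threads, ports=2):
    """threads: list of dicts with 'id', 'speculative', and 'icount'
    (instructions in the decode, rename, and issue stages)."""
    selected = []
    # Priority 1: the single non-speculative thread always gets a port.
    non_spec = [t for t in threads if not t['speculative']]
    selected.extend(non_spec[:1])
    # Priority 2: fill the remaining ports with speculative threads in
    # ascending ICount order.
    spec = sorted((t for t in threads if t['speculative']),
                  key=lambda t: t['icount'])
    for t in spec:
        if len(selected) >= ports:
            break
        selected.append(t)
    return [t['id'] for t in selected]
```

With thread 0 non-speculative and speculative threads 1 and 2 holding 3 and 7 in-flight instructions, two ports select threads 0 and 1.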
Figure 5: Data cache accesses per clock cycle (data cache ports used per clock cycle for 2, 4, and 8 contexts over the 24 Livermore kernels).
Figure 6: Instructions committed in DSMT mode for Livermore loops (percentage for 2, 4, and 8 contexts).
variables used to control the execution of the loop. In threads will be squashed when the conditional branch is
contrast, low performance obtained in Kernel-10 (differ- frequently not taken.
ence predictor) is due to a combination of two factors: Figure 5 shows the average number of data cache
LSST’s low thread prediction rate and high thread syn- ports accessed per clock cycle with Livermore loops. As
chronization rate. In DSMT, a speculative thread is expected, with more contexts in the processor more pres-
blocked when it finishes its execution before the non- sure is exercised on the data cache ports every cycle.
speculative context does. Since the non-speculative con- However, Kernel-8 and Kernel-17 reveal a different pat-
text is the only one capable of enabling other speculative tern in accessing the data cache: With 8 contexts less data
contexts to become non-speculative, the gap produced cache ports are used. The reason is LSST misspecula-
when the non-speculative thread is delayed causes degra- tions occur very often (especially in Kernel-17) due to
dation in performance. The factor that caused perfor- dynamic behavior in the loop. Therefore, with more con-
mance loss in Kernel 22 (Plankian distribution) is due texts, more threads are squashed before the memory oper-
high branch misprediction rate.

Figure 3 also indicates that DSMT does not always obtain the best performance using the maximum number of threads. The reason for this is two-fold. First, each context executes nearly the same code, so the resource requirements of the threads are very similar during each iteration. Therefore, the likelihood of resource conflicts increases as more contexts are spawned. This is especially critical in loops with a very large number of numerical calculations and/or a large number of memory accesses, such as Kernel-8 (integration) and Kernel-10 (difference predictor).

Dynamic behavior within loops is the other main cause of the poor performance exhibited by some loops with eight contexts. Loops with dynamic behavior tend to generate more inter-thread misspeculation. Misspeculations occur mainly when register values are read too early by the speculative threads, as is the case in Kernel-17 (conditional computation), which contains five goto statements in the loop body. In this loop, misspeculated threads that are squashed cause a decrease in performance with eight contexts. An analysis of the code shows that some internal backward jumps are mistaken for loops by the Loop Detection Unit. However, whether these backward branches are taken or not depends on internal variables in the loop. Therefore, using the maximum number of contexts increases the probability that many threads misspeculate, so that fewer operations reach the load/store ports, which reduces the pressure on the data cache but degrades performance. These results suggest that the number of data cache ports required by DSMT is around three. However, to take into account the more demanding memory requirements of more complex benchmarks, DSMT uses four ports. It is interesting to note that reducing the fetch bandwidth (using ICount2.8-modified) also reduces the pressure on the data cache ports. This indicates a tradeoff between increasing the fetch bandwidth to improve performance and limiting it when misspeculations occur very often.

Figure 6 shows the percentage of instructions committed in DSMT mode for each of the Livermore loops. On average, nearly 80% of the instructions were committed in DSMT mode, which indicates the Loop Detection Unit is very effective in identifying and exploiting these loops. As Figure 6 shows, DSMT is able to find a significant amount of TLP in Kernel-8 and -10. However, as Figure 3 shows, the speedup obtained in these loops is negligible due to thread misspeculation. In contrast, DSMT is unable to find enough TLP in Kernel-22 due to a very high branch misprediction rate. Statistics from DSMTsim also show that Kernel-14, -16 and -18, which have several internal loops and if-then-else statements in the loop body, produce a relatively high number of branch mispredictions.
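The failure mode described here — a data-dependent backward goto that looks identical to a loop-closing branch — can be illustrated with a minimal sketch of a backward-branch loop detector. This is a hypothetical simplification for illustration only (the class name, threshold, and interface are not from the paper, which does not specify the Loop Detection Unit at this level of detail):

```python
# Hypothetical sketch of a backward-branch loop detector: a taken branch
# whose target address is lower than its own PC is flagged as a loop
# candidate once it has been taken a few times. A data-dependent internal
# backward goto (as in Kernel-17) satisfies exactly the same test, so this
# heuristic cannot tell it apart from a real loop-closing branch.

class LoopDetector:
    def __init__(self, threshold=2):
        self.taken_counts = {}  # (branch_pc, target_pc) -> times taken
        self.threshold = threshold

    def observe(self, branch_pc, target_pc, taken):
        """Record a resolved branch; return True once it looks like a loop."""
        if not taken or target_pc >= branch_pc:
            return False  # only taken backward branches are loop candidates
        key = (branch_pc, target_pc)
        self.taken_counts[key] = self.taken_counts.get(key, 0) + 1
        return self.taken_counts[key] >= self.threshold

det = LoopDetector()
det.observe(0x40C, 0x400, taken=True)          # first taken backward branch
assert det.observe(0x40C, 0x400, taken=True)   # now flagged as a loop
```

Because the heuristic sees only PCs and outcomes, a backward jump whose direction depends on loop-internal variables gets flagged just as readily, which is why the speculative threads spawned from it misspeculate.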
[Figure 6: percentage of instructions committed in DSMT mode for each Livermore kernel and the average (x-axis: Livermore Kernel 1-24; y-axis: % instructions committed in DSMT mode; series: 2, 4, and 8 contexts)]

[Chart: speedup over 8-wide issue superscalar for swim, mgrid, tomcatv, turb3d, ijpeg, m88ksim, compress, and li, with FP, INT, and Overall Gmean; series: 2, 4, and 8 contexts]
Figure 7: DSMT performance using SPEC95 benchmarks

The Livermore loops were used to analyze and explore part of the design space of DSMT and to characterize its dynamic behavior. However, the real advantage of DSMT is observed when more complex applications are executed. Figure 7 illustrates the performance of DSMT running several SPEC95 benchmarks. In the simulations, 500 million instructions were executed, but the first 200 million instructions, corresponding to code initialization, were skipped using the fast simulation mode provided by DSMTSim. The reference inputs of the SPEC95 benchmarks were used during all simulations. Figure 7 shows the performance results obtained by DSMT for both the SPEC95-FP and SPEC95-Int benchmarks. Results show the average speedup obtained by DSMT across all benchmarks is 7%, 16.5%, and 26% for 2, 4, and 8 contexts, respectively.

As other previous studies have found [9, 10, 23], in the DSMT architecture the SPEC95-FP benchmarks provide the better speedup (32.5% on average with 8 contexts). The reason is essentially that they contain many more loops than SPEC95-Int. SPEC95-Int obtained an average speedup of 19% with 8 contexts. As these results show, numerical applications benefit more from the DSMT model than non-scientific applications.

6. Conclusions and Future Work

This paper presented the DSMT architecture and its performance results. DSMT employs aggressive forms of speculation to dynamically extract TLP and ILP from sequential programs. Unlike other similar architectures, DSMT uses simple mechanisms to synchronize threads and keep track of inter-thread dependencies, both in registers and memory. The novel features in DSMT include (1) using information obtained during the sequential execution of code segments as a hint to speculate the subsequent behavior of multiple threads, and (2) utilizing a greedy approach that chooses the sections of code that are more likely to provide the highest performance based on their past dynamic behavior.

The performance results of DSMT were obtained using a cycle-accurate, execution-based simulator, DSMTsim, which is capable of executing mispredicted paths of execution together with run-time generation, control, and synchronization of multiple threads. Our simulation results show that speculative dynamic multithreading based on extracting threads from loops has very good potential for improving SMT performance, even when only a single program is available for execution. DSMT obtained on average nearly 100% speedup executing the Livermore loops and a 26% improvement for the SPEC95 benchmarks. However, the performance improvement obtained by DSMT is limited for non-numerical applications when only loops are exploited. The reasons for this are two-fold. First, some applications simply lack TLP as well as ILP within a thread. This limits the performance achievable by DSMT and other similar architectures. Second, as our simulation results have shown, the dynamic behavior of speculative multithreading causes frequent mispredictions in some loops, which may have a detrimental effect on DSMT's performance. Therefore, the challenge is to maximize the amount of TLP that DSMT is capable of exploiting while at the same time reducing the frequency of misspeculations.

There are a number of ways the DSMT architecture can be improved. First, as [13] suggests, exploiting both procedures and loops will likely be required to improve performance. Second, a more sophisticated value prediction mechanism will be needed. Another important bottleneck observed during our study was the memory dataflow mechanism. Long-running threads that access a large number of memory locations may cause the MDRT to fill up quickly when there is an insufficient number of data cache ports, and thus cause the entire pipeline to back up. This was the reason why the Loop Detection Unit did not choose the outer loop in some of the benchmarks, such as Kernel-21 (matrix multiplication), to clone threads. However, even when the middle loop is chosen, its speculative threads may generate a large number of loads and stores, especially stores that cannot commit. This causes the MDRT to back up, and the effect of this bottleneck percolates all the way back up to the IQs. Therefore, a method that "throttles" thread execution is needed to avoid filling up the MDRT too quickly.

In addition, choosing the optimum number of contexts for execution may be critical for some applications. However, since selecting the optimal number of contexts at run-time is difficult, one possible solution is to convey this information from the compiler to the architecture using a technique similar to the one described in [14]. Finally, improving the branch prediction mechanism employed by DSMT, which is currently based on a simple 2-bit saturating counter, will also be crucial.
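The 2-bit saturating counter mentioned as DSMT's current branch predictor is the classic bimodal scheme; a minimal sketch of a single counter is shown below. The class name and the assumption of one counter per branch are illustrative only — the paper does not detail how counters are indexed:

```python
# Classic 2-bit saturating counter branch predictor (one counter shown).
# States 0/1 predict not-taken, states 2/3 predict taken; each resolved
# outcome moves the counter one step toward the observed direction,
# saturating at 0 and 3.

class TwoBitPredictor:
    def __init__(self, init=1):
        self.counter = init            # start weakly not-taken

    def predict(self):
        return self.counter >= 2       # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]   # a mostly-taken (loop-like) branch
correct = sum(1 for t in outcomes if (p.predict() == t) or p.update(t))
# Recompute cleanly: predict before updating on each outcome.
p = TwoBitPredictor()
correct = 0
for t in outcomes:
    correct += (p.predict() == t)
    p.update(t)
# The counter's hysteresis means a single not-taken outcome does not flip
# a saturated taken prediction, which helps on loop-closing branches.
```

With the sequence above the predictor is correct on 2 of the 4 outcomes and ends in the strongly-taken state, illustrating why a lone exit branch at the end of each loop invocation costs only one misprediction.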
7. References

[1] Akkary H. and Driscoll M., "A Dynamic Multithreading Processor." 31st Int'l Symp. on Microarchitecture, Dec. 1998.
[2] Black B., et al., "Can Trace-Driven Simulators Accurately Predict Superscalar Performance?" Proc. of the International Conference on Computer Design, pp. 478-485, Oct. 1996.
[3] Hily S. and Seznec A., "Out-of-order Execution May Not Be Cost Effective on Processors Featuring Simultaneous Multithreading." International Symposium on High-Performance Computer Architecture, January 1999.
[4] Hily S., et al., "Contention on 2nd Level Cache May Limit the Effectiveness of Simultaneous Multithreading." INRIA internal report 3115, 1997.
[5] Intel, Introduction to Hyper-Threading Technology. Document number 250008-002, 2001.
[6] Krishnan V. and Torrellas J., "Sequential Binaries on a Clustered Multithreaded Architecture with Speculation Support." International Conference on Supercomputing, 1998.
[7] Lipasti M. H. and Shen J. P., "Exceeding the Dataflow Limit via Value Prediction." The 29th Annual International Symposium on Microarchitecture, pp. 226-237, December 1996.
[8] Lo J., et al., "Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading." ACM Transactions on Computer Systems, pp. 322-354, Aug. 1997.
[9] Marcuello P., González A., and Tubella J., "Speculative Multithreaded Processors." International Conference on Supercomputing, pp. 77-84, 1998.
[10] Marcuello P. and González A., "Control and Data Dependence Speculation in Multithreaded Processors." Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (MTEAC'98), January 1998.
[11] Marcuello P. and González A., "Clustered Speculative Multithreaded Processors." Proceedings of ICS'99.
[12] Moshovos A., Vijaykumar T. N., and Sohi G. S., "Dynamic Speculation and Synchronization of Data Dependences." Proc. of the Int. Symp. on Computer Architecture, pp. 181-193, 1997.
[13] Oplinger J., Heine D., and Lam M. S., "In Search of Speculative Thread-Level Parallelism." Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT'99), Newport Beach, CA, October 1999.
[14] Puppin D. and Tullsen D., "Maximizing TLP with Loop-Parallelization on SMT." Proceedings of the 5th Workshop on Multithreaded Execution, Architecture, and Compilation, Austin, Texas, December 2001.
[15] Redstone J. A., Eggers S. J., and Levy H. M., "An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture." Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000.
[16] Rotenberg E., Bennett S., and Smith J., "Trace Processors: Exploiting Hierarchy and Speculation." IEEE Transactions on Computers.
[17] Sazeides Y. and Smith J. E., "The Predictability of Data Values." The 30th Annual International Symposium on Microarchitecture, pp. 248-258, December 1997.
[18] Smith J. E. and Pleszkun A. R., "Implementing Precise Interrupts in Pipelined Processors." IEEE Transactions on Computers, Vol. 37, pp. 562-573, May 1988.
[19] Smith J. E. and Sohi G. S., "The Microarchitecture of Superscalar Processors." Proceedings of the IEEE, December 1995.
[20] Sohi G. S. and Roth A., "Speculative Multithreaded Processors." Internal report, University of Wisconsin, 2001.
[21] Srivastava A. and Eustace A., "ATOM: A System for Building Customized Program Analysis Tools." Proc. of the 1994 Conf. on Programming Language Design and Implementation, 1994.
[22] Steffan J. G. and Mowry T. C., "The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization." Proceedings of the Fourth Int'l Conf. on High-Performance Computer Architecture (HPCA-4), Feb. 1998.
[23] Tubella J. and González A., "Control Speculation in Multithreaded Processors through Dynamic Loop Detection." Proc. of the 4th Int. Symp. on High-Performance Computer Architecture (HPCA-4), 1998.
[24] Tullsen D. M., Eggers S. J., Emer J., Levy H. M., Lo J. L., and Stamm R. L., "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor." International Symposium on Computer Architecture, May 1996.
[25] Tullsen D. M., et al., "Simultaneous Multithreading: Maximizing On-Chip Parallelism." Proceedings of the 22nd International Symposium on Computer Architecture, pp. 392-403, 1995.
[26] Wallace S., Calder B., and Tullsen D. M., "Threaded Multiple Path Execution." The 25th Annual International Symposium on Computer Architecture, June 1998.
[27] Wilson K. M., Olukotun K., and Rosenblum M., "Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors." Proc. of the 23rd International Symposium on Computer Architecture, June 1996.
[28] Zilles C. B., Emer J. S., and Sohi G. S., "The Use of Multithreading for Exception Handling." Proceedings of Micro-32, 1999.
[29] Yankelevsky N. M. and Polychronopoulos C. D., "Alpha-Coral: A Multigrain, Multithreaded Processor Architecture." Proceedings of the 15th International Conference on Supercomputing, 2001.