Simultaneous Multithreading Processor

Simultaneous multithreading (SMT) is a processor architecture that aims to achieve higher performance by enabling instructions from different threads to execute on the functional units in the same cycle, exploiting both instruction-level parallelism (ILP) and thread-level parallelism (TLP). SMT builds on advanced superscalar processors by making minor changes to allow multithreaded execution. Experimental results show promising performance from compilation techniques such as interleaving and loop fusion on SMT processors. A performance model is also presented that can accurately predict the performance of simple loops on SMT to within 5%.


In the race for performance, many new architectures have been proposed
as possible successors of present-day commodity processors.
In order to achieve higher performance, the idea behind the simultaneous
multithreading (SMT) processor is to introduce a few changes
into an advanced superscalar processor, yet enable it to execute
instructions from different threads on different functional units
in the same cycle: thread-level parallelism thus becomes a suitable
source of instruction-level parallelism.
This work presents experimental results on the performance of the Livermore
loops on the SMT: advanced compilation techniques that could be
implemented in an advanced compiler for the SMT are discussed
and evaluated. The general effectiveness of interleaving,
loop fusion, and some other techniques yields encouraging results
in the direction of an advanced multithreading compiler.
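The paper does not reproduce its transformed kernels here, but the effect of loop fusion on available ILP can be sketched generically: two loops over the same index range are merged so that each iteration of the fused loop contains two independent statements, giving the issue logic more independent instructions per cycle. A minimal illustration (hypothetical kernels, not the Livermore loops themselves):

```python
def separate(a, b):
    """Two independent loops over the same range, as a compiler
    might see them before fusion."""
    c = [0.0] * len(a)
    d = [0.0] * len(b)
    for i in range(len(a)):
        c[i] = a[i] * 2.0
    for i in range(len(b)):
        d[i] = b[i] + 1.0
    return c, d

def fused(a, b):
    """After loop fusion: one loop whose body holds two independent
    statements, exposing more ILP per iteration to the scheduler."""
    c = [0.0] * len(a)
    d = [0.0] * len(b)
    for i in range(len(a)):
        c[i] = a[i] * 2.0   # independent of the next statement,
        d[i] = b[i] + 1.0   # so both can issue in the same cycle
    return c, d
```

The transformation is legal here because the two loop bodies touch disjoint arrays; a real compiler must prove the absence of cross-loop dependencies before fusing.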
A performance model for the SMT is also presented: the methodology
described here uses compile-time information to determine
upper and lower bounds for the parallelized performance of simple
loops. The model is flexible enough to handle many different
types of kernels, and is tested with the Livermore loops. Good results
have been achieved with short kernels compiled with the GNU C
compiler: for small loops (fewer than 40 instructions), the actual
performance (useful processor utilization) is within 5% of the
maximum expected value.
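The paper's actual model is built from compile-time dependence information; as a rough, hypothetical sketch of what such bounds look like, one can cap each thread's contribution of independent instructions per cycle by the machine's issue width:

```python
def utilization_bounds(issue_width, n_threads, ilp_min, ilp_max):
    """Toy bounds on useful processor utilization (fraction of issue
    slots doing work). Each thread contributes between ilp_min and
    ilp_max independent instructions per cycle; total issue is capped
    by the machine's issue width. This formula is an illustrative
    assumption, not the paper's actual compile-time model."""
    lower = min(1.0, n_threads * ilp_min / issue_width)
    upper = min(1.0, n_threads * ilp_max / issue_width)
    return lower, upper

# e.g. an 8-wide SMT with 4 threads, each offering 1 to 2.5 IPC:
lo, hi = utilization_bounds(8, 4, 1.0, 2.5)   # -> (0.5, 1.0)
```

Measured utilization of a kernel would then be expected to fall between the two bounds, with the 5% figure quoted above describing the gap from the upper bound for small loops.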
As processors grow larger and larger (the one-hundred-million-transistor
chip is not far off), computer designers are planning
directions for next-generation processors. Simply scaling present-day
architectures to larger configurations does not seem to guarantee the
expected performance, so new ideas are being proposed and explored in
order to take full advantage of future engineering opportunities.
Processing-in-memory architectures are just one of many proposals.
These architectures face different problems: better memory interfaces,
advanced usage of logic arrays, better usage of intra-chip connections.
They offer quite new and original architectural models in the race for
performance.
The main idea behind simultaneous multithreading (SMT) computing,
instead, is to create a more efficient and versatile processor,
able to exploit more parallelism in all its available forms: such a processor
would be able to take advantage of both instruction-level
parallelism (ILP) and thread-level parallelism (TLP) with the same
ease. An SMT processor will be implemented as a superscalar processor
(with multiple instruction issue) offering multithreaded execution
capabilities.
Simultaneous multithreading can be thought of as a technique
whose main goal is to achieve higher utilization of the computational
capabilities of wide superscalar processors. On an SMT processor,
TLP can come from multithreaded parallel programs or from
individual programs in a multiprogrammed workload.
SMT will use the same instruction-set architecture (ISA) as superscalar
processors, and most of their design. This can be a strong
point in enabling a smooth introduction of SMT features into commodity
processors. With the rapid growth of a multithreaded programming
style, which is gaining popularity among developers through
Java, SMT sets itself up as a natural candidate to replace present-day
superscalar processors, on the strength of its advanced multithreading
capabilities. As a matter of fact, Compaq is planning to introduce SMT
features into its commodity processors.
Several ways to improve performance have been proposed.
One of these is to use the chip real estate to build larger and
larger on-chip memories, as featured by some recent processors.
Even though very popular, many studies show that
this solution is not enough to gain proportional performance, and
beyond a certain point larger caches seem not to be so useful (see
for instance [PS96]). Some paradoxical design choices further highlight
this solution's limits: in order to obtain low interprocessor communication
latency in the Cray T3D/T3E, designers removed the second-level cache
from the featured Alpha processor.
Another is to increase peak bandwidth, by increasing
clock frequency (with deeper pipelines) and the number of functional
units of superscalar processors.
Leaving aside for the moment all the engineering problems related to
this design (a huge monolithic core), it should be remembered
that compilers' ability to extract ILP is still limited, as are
the opportunities that run-time structures (reordering, renaming, ...)
have to remove dependencies.
Even with advanced features such as out-of-order execution and register
renaming, performance is affected by instruction dependencies
that limit instruction issue. A processor can suffer from so-called
vertical waste, when all the functional units are idle for one or more
cycles due to data dependencies.

Fine-grain multithreaded processors, able to change context every
cycle without degradation, seem to be a good approach to this
problem, as they are able to interleave the execution of different
threads in order to hide dependencies and latencies. Some studies
nonetheless hold that they are not able to utilize more than 40% of
a wide superscalar's execution bandwidth [TEL95]. It should be clear,
however, that even if vertical waste is avoided by multithreading,
a single thread may not be able to fill the whole execution bandwidth,
due to the limited ILP it can offer. This problem is known as
horizontal waste, and only more efficient exploitation of ILP can
address it.
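The two waste categories can be made concrete by accounting for issue slots in a per-cycle trace. A small sketch, using the usual definitions (a completely idle cycle is vertical waste; unused slots in a partially filled cycle are horizontal waste):

```python
def classify_waste(issue_trace, width):
    """Count wasted issue slots in a trace of per-cycle issued
    instruction counts on a `width`-wide machine.

    issue_trace: list of instructions issued each cycle (0..width).
    Returns (vertical, horizontal) wasted slots.
    """
    vertical = 0
    horizontal = 0
    for issued in issue_trace:
        if issued == 0:
            vertical += width            # whole cycle idle
        else:
            horizontal += width - issued  # unused slots this cycle
    return vertical, horizontal

# A 4-wide machine issuing 4, 0, 2, 4 instructions over four cycles
# wastes one full cycle vertically and two slots horizontally.
```

Multithreading attacks the first count by finding another thread to issue from on otherwise idle cycles; only more ILP per cycle, from one thread or several, reduces the second.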
A strong debate is going on between ILP pessimists and optimists.
ILP advocates hold that ILP is abundant, and can be exploited with a
few tens of millions more transistors and a little compiler magic (see
for instance the Itanium [MPR99a] and SPARC64 V [MPR99b] projects).
Yale Patt at the University of Texas and John Shen at Carnegie Mellon
University believe that advanced superscalar techniques, such as
static scheduling, prediction, trace processing, and superspeculation,
will allow ILP to scale suitably: in their opinion a 16- or 32-wide
superscalar processor will sustain an ILP of more than 10 instructions
per cycle.
These processors will feature a very large monolithic core, along
with the related complexity of design and testing. Explicit thread-level
parallelism (TLP) can be seen as a way to keep processors simpler.
Chip multiprocessors (CMPs), such as the IBM Power4 [MPR99c] and Sun
MAJC [MPR99d], trust in the parallelism that can be found between
different threads. These projects will implement, on a single chip,
a few replicated superscalar processors: there will be the opportunity
to run multiple threads, each one on an independent, complete,
advanced processor (with smaller execution bandwidth).
Nonetheless, both these kinds of architecture (ILP- or TLP-oriented)
suffer from poor utilization if the workload does not match
the design parameters: they show no flexibility when the parallelism
moves from ILP to TLP or vice versa.

1.2 SMT: TLP as useful ILP
This distinction between instruction-level and thread-level parallelism
no longer holds with SMT: in this new architectural model, both
represent a way to find independent instructions that can be
executed in parallel. On an SMT, instructions coming from different
threads compete for the shared processor resources every cycle.
SMT is able to transform the parallelism present among instructions
from different threads into instruction-level parallelism, and to exploit
the whole execution bandwidth by computing for different threads
in the same cycle. Simultaneous multithreading is a new way to
take full advantage of parallelism in all its forms, being able to reach
full utilization with both TLP and ILP, and to adapt to their dynamic
changes without any degradation: when only one thread is available,
it can exploit all the functional units, as in a traditional superscalar
processor, but when ILP is low, more threads can run and fill
the execution bandwidth with their instructions.
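The per-cycle mixing described above can be sketched as a toy issue stage. The round-robin fill policy below is an illustrative assumption (real SMT designs use more sophisticated fetch and issue priority schemes); the point is only that one cycle's bandwidth is shared across whichever threads have ready instructions:

```python
def smt_issue(ready, width):
    """One SMT issue cycle, sketched: `ready[t]` is the number of
    independent, ready instructions thread t offers this cycle.
    Fill the `width` issue slots round-robin across threads, so a
    single cycle mixes instructions from several threads.
    Returns the number of instructions issued per thread."""
    issued = [0] * len(ready)
    slots = width
    progress = True
    while slots > 0 and progress:
        progress = False
        for t, r in enumerate(ready):
            if slots == 0:
                break
            if issued[t] < r:   # thread t still has a ready instruction
                issued[t] += 1
                slots -= 1
                progress = True
    return issued

# One thread with high ILP fills the machine alone; when its ILP is
# low, the remaining slots are filled from the other threads.
```

With four threads each offering one instruction on an 8-wide machine, all four issue together in the same cycle; with a lone thread offering five instructions on a 4-wide machine, that thread takes the whole bandwidth, which is the adaptivity the text claims for SMT.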
