Simultaneous Multithreading Processor
In the race for performance, many new architectures have been proposed
as possible successors to present-day commodity processors.
To achieve higher performance, the idea behind the simultaneous
multithreading (SMT) processor is to introduce a few changes
into an advanced superscalar processor that enable it to execute
instructions from different threads on different functional units
in the same cycle: thread-level parallelism thus becomes a usable
source of instruction-level parallelism.
This work presents experimental results on the performance of the
Livermore loops on the SMT: advanced compilation techniques that
could be implemented in an advanced compiler for the SMT are
discussed and evaluated. The general effectiveness of interleaving,
loop fusion and other techniques yields encouraging results
in the direction of an advanced multithreading compiler.
A performance model for SMT is also presented: the methodology
described here uses compile-time information to determine
upper and lower bounds for the parallelized performance of simple
loops. The model is flexible enough to handle many different
types of kernels, and is tested with the Livermore loops. Good results
have been achieved with short kernels compiled with the GNU C
compiler: for small loops (fewer than 40 instructions), the actual
performance (useful processor utilization) is within 5% of the
maximum expected value.
As processors grow larger and larger (the one-hundred-million-transistor
chip is not far off), computer designers are planning
directions for next-generation processors. Merely scaling present-day
architectures to larger configurations does not seem to guarantee the
expected performance, so new ideas are being proposed and explored in
order to take full advantage of future engineering opportunities.
Processing-in-memory architectures are just one of many proposals.
These architectures face different problems: a better memory interface,
advanced usage of logic arrays, better usage of intra-chip connections.
They offer quite new and original architectural models in the race for
performance.
The main idea behind simultaneous multithreading (SMT) computing,
instead, is to create a more efficient and versatile processor,
able to exploit more parallelism in all its available forms: such a processor
would be able to take advantage of both instruction-level
parallelism (ILP) and thread-level parallelism (TLP) with the same
ease. An SMT processor would be implemented as a superscalar
processor (with multiple instruction issue) offering multithreaded
execution capabilities.
Simultaneous multithreading can be thought of as a technique
whose main goal is to achieve higher utilization of the computational
capabilities of wide superscalar processors. On an SMT processor,
TLP can come from multithreaded parallel programs or from
individual programs in a multiprogrammed workload.
SMT uses the same instruction-set architecture (ISA) as superscalar
processors, and most of their design. This can be a strong
point in ensuring a smooth introduction of SMT features into commodity
processors. With the rapid growth of the multithreaded programming
style, which is gaining popularity among developers through
Java, SMT presents itself as a natural candidate to replace present-day
superscalar processors, on the strength of its advanced multithreading
capabilities. As a matter of fact, Compaq is planning to introduce SMT
features into its commodity processors.
1.1 Some ways to improve performance
One of these is to use the chip real estate to build larger and
larger on-chip memories, as featured by some recent processors.
Though very popular, many studies show that this solution is not
enough to gain proportional performance; beyond a certain point,
larger caches seem to be of little use (see for instance [PS96]).
Some paradoxical choices further highlight this solution's limits:
in order to achieve low interprocessor communication latency in the
Cray T3D/T3E, designers removed the second-level cache from the
featured Alpha processor.
Another approach is to increase peak bandwidth, by raising the
clock frequency (with deeper pipelines) and increasing the number
of functional units of superscalar processors.
Setting aside for the moment the engineering problems related to
this design (a huge monolithic core), it should be remembered
that compilers' ability to extract ILP is still limited, as are
the opportunities for run-time structures (reordering, renaming...)
to remove dependencies.
Even with advanced features such as out-of-order execution and register
renaming, performance is limited by instruction dependencies,
which restrict instruction issue. The processor can suffer from so-called
vertical waste, when all the functional units are idle for one or more
cycles due to data dependencies.
1.2 SMT: TLP as useful ILP
This distinction between instruction-level and thread-level parallelism
no longer holds with SMT: in this new architectural model, both
represent a way to find independent instructions that can be
executed in parallel. On an SMT processor, instructions coming from
different threads compete for the shared processor resources every cycle.
SMT is able to transform the parallelism present among instructions
from different threads into instruction-level parallelism, and to
exploit the whole execution bandwidth by computing for different threads
in the same cycle. Simultaneous multithreading is a new way to
take full advantage of parallelism in all its forms, able to achieve
full utilization with both TLP and ILP, and to adapt to their dynamic
changes without degradation: when only one thread is available,
it can exploit all the functional units, as in a traditional superscalar
processor, but when ILP is low, more threads can run and fill
the execution bandwidth with their instructions.