Lecture 5: Superscalar Processors
Definition and motivation
Superpipeline
Dependency issues
Parallel instruction execution
Superscalar Architecture
A superscalar architecture is designed to improve the performance of scalar instruction execution.
A scalar is a variable that can hold only one atomic value at a time, e.g., an integer or a real.
A scalar architecture processes one data item at a time (the computers we have discussed up to now).
Examples of non-scalar variables:
Arrays
Matrices
Records
Superscalar Architecture (Contd)
In a superscalar architecture (SSA), several
scalar instructions can be initiated simultaneously
and executed independently.
Pipelining also allows several instructions to be executed at the same time, but they must be in different pipeline stages at any given moment.
SSA includes all features of pipelining but,
in addition, there can be several instructions
executing simultaneously in the same pipeline
stage.
SSA therefore introduces a new level of parallelism, called instruction-level parallelism.
Motivation
Most operations are on scalar quantities (about 80%).
Speeding up these operations leads to a large overall performance improvement.
How to implement the idea?
An SSA processor fetches multiple instructions at a time and attempts to find nearby instructions that are independent of each other and can therefore be executed in parallel.
Based on the dependency analysis, the processor may issue
and execute instructions in an order that differs from that
of the original machine code.
The processor may eliminate some unnecessary
dependencies by the use of additional registers and
renaming of register references.
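As a concrete illustration, here is a minimal sketch of such a dependency check over a fetched group of instructions. The Instr records, register names, and greedy grouping are illustrative assumptions, not the issue logic of any particular processor:

from dataclasses import dataclass

@dataclass
class Instr:
    op: str        # mnemonic, e.g. "MUL"
    dest: str      # destination register
    srcs: tuple    # source registers

def depends_on(later: Instr, earlier: Instr) -> str | None:
    """Classify the dependency of `later` on `earlier`, if any."""
    if earlier.dest in later.srcs:
        return "RAW"   # true data dependency: must wait for the result
    if earlier.dest == later.dest:
        return "WAW"   # output dependency: writes to the same register
    if later.dest in earlier.srcs:
        return "WAR"   # anti-dependency: must not overwrite an operand
    return None

def independent_group(window: list[Instr]) -> list[Instr]:
    """Greedily pick instructions with no mutual dependencies, preserving
    program order: these could be issued in the same cycle."""
    group: list[Instr] = []
    for instr in window:
        if all(depends_on(instr, g) is None for g in group):
            group.append(instr)
    return group

window = [
    Instr("MUL", "R4", ("R3", "R1")),
    Instr("ADD", "R2", ("R4", "R5")),   # RAW on R4: cannot join the group
    Instr("SUB", "R6", ("R7", "R8")),   # independent: can issue in parallel
]
print([i.op for i in independent_group(window)])  # ['MUL', 'SUB']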
Superpipelining
Superpipelining is based on dividing the stages of a pipeline
into several sub-stages, and thus increasing the number of
instructions which are handled by the pipeline at the same
time.
For example, by dividing each stage into two sub-stages,
a pipeline can perform at twice the speed in the ideal
situation.
Many pipeline stages may perform tasks that require less than
half a clock cycle.
No duplication of hardware is needed for these stages.
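The ideal-case factor can be made precise with a simple throughput argument; the following sketch assumes perfectly balanced sub-stages and ignores latch overhead:

\[
  \text{sub-stage cycle time} = \frac{T}{n},
  \qquad
  \text{ideal speedup} = \frac{T}{T/n} = n
\]
\[
  n = 2 \ \Rightarrow\ \text{twice the instruction throughput, as claimed above.}
\]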
Superpipelining (Contd)
For a given architecture and the corresponding instruction set, there is an optimal number of pipeline stages/sub-stages.
Increasing the number of stages/sub-stages over this limit
reduces the overall performance.
Overhead of data buffering between the stages.
Not all stages can be divided into (equal-length) sub-stages.
The hazards will be more difficult to resolve.
The clock skew problem.
More complex hardware.
Superscalar vs. Superpipeline
Base machine: 4-stage pipeline
Instruction fetch
Operation decode
Operation execution
Result write back
Superpipeline of degree 2
A sub-stage often takes half a clock cycle to finish.
Superscalar of degree 2
Two instructions are executed concurrently in each pipeline stage.
Duplication of hardware is required by definition.
Superpipelined Superscalar Design
Basic Superscalar Concepts
SSA allows several instructions to be issued and
completed per clock cycle.
It consists of a number of pipelines that are
working in parallel.
Depending on the number and kind of parallel
units available, a certain number of instructions
can be executed in parallel.
In the following example two floating point
and two integer operations can be issued and
executed simultaneously.
Each unit is also pipelined and can execute several
operations in different pipeline stages.
An SSA Example
Lecture 5: Superscalar Processors
Definition and motivation
Superpipeline
Dependency issues
Parallel instruction execution
Parallel Execution Limitation
The situations that prevent instructions from being executed in parallel by an SSA are very similar to those that prevent efficient execution on a pipelined architecture (pipeline hazards):
Resource conflicts.
Control (procedural) dependency.
Data dependencies.
Their consequences for SSA are more severe than those for simple pipelines, because the potential parallelism in SSA is greater and, thus, more performance is lost.
Instruction-level parallelism = the degree to which, on average, the instructions of a program can be executed in parallel.
Resource Conflicts
Several instructions compete for the same
hardware resource at the same time.
e.g., two arithmetic instructions need the same floating-point unit for execution.
similar to structural hazards in a pipeline.
They can be partly solved by introducing several hardware units for the same function.
e.g., have two floating-point units.
the hardware units can also be pipelined to support
several operations at the same time.
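A minimal sketch of this idea, with invented unit names and a one-cycle issue occupancy, might look as follows:

BUSY_UNTIL = {}  # unit name -> cycle at which it becomes free

UNITS = {
    "fadd": ["FP0", "FP1"],      # two floating-point units
    "fmul": ["FP0", "FP1"],
    "add":  ["INT0", "INT1"],    # two integer units
}

def try_issue(op: str, cycle: int) -> str | None:
    """Return a free unit for `op` at `cycle`, or None (structural stall)."""
    for unit in UNITS[op]:
        if BUSY_UNTIL.get(unit, 0) <= cycle:
            # A pipelined unit accepts a new operation every cycle, so it
            # occupies only one issue slot even if the operation itself
            # takes several cycles to complete.
            BUSY_UNTIL[unit] = cycle + 1
            return unit
    return None

# Two FP instructions in the same cycle: each grabs one of the two units.
print(try_issue("fmul", cycle=0))  # FP0
print(try_issue("fadd", cycle=0))  # FP1
print(try_issue("fadd", cycle=0))  # None: must stall one cycle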
Procedural Dependency
The presence of branches creates major problems in achieving optimal parallelism.
cannot execute instructions after a branch in parallel with instructions before the branch.
similar to control hazards in a pipeline.
If instructions are of variable length, they cannot
be fetched and issued in parallel, since an
instruction has to be decoded in order to identify
the following one.
therefore, superscalar techniques are more efficiently
applicable to RISCs, with fixed instruction length and
format.
Data Conflicts
Caused by data dependencies between
instructions in the program.
similar to data hazards in a pipeline.
To address the problem and increase the degree of parallel execution, SSA gives great freedom in the order in which instructions are issued and executed.
Therefore, data dependencies have to be
considered and dealt with much more carefully.
Window of Execution
Due to data dependencies, only some of the instructions are candidates for parallel execution.
In order to find instructions to be issued in parallel, the
processor has to select from a sufficiently large instruction
sequence.
There are usually a lot of data dependencies in a short
instruction sequence.
The window of execution is defined as the set of instructions considered for execution at a certain moment.
The number of instructions in the window should be as
large as possible. However, this is limited by:
Capacity to fetch instructions at a high rate.
The problem of branches.
The cost of hardware needed to analyze data dependencies.
Window of Execution Example
Window of Execution (Contd)
The window of execution can be extended beyond basic-block boundaries by branch prediction.
Speculative execution.
With speculative execution, instructions of the
predicted path are entered into the window of
execution.
Instructions from the predicted path are executed
tentatively.
If the prediction turns out to be correct, the state changes produced by these instructions become permanent and visible (the instructions commit);
Otherwise, all their effects are removed.
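A toy sketch of this commit/squash behaviour, using an invented result buffer rather than any real processor mechanism:

speculative_results = []   # (register, value) pairs, not yet visible
registers = {"R1": 10, "R2": 0}

def execute_speculatively(dest: str, value: int) -> None:
    """Execute an instruction from the predicted path; buffer its result."""
    speculative_results.append((dest, value))

def resolve_branch(prediction_correct: bool) -> None:
    """Commit buffered results if the prediction was right, else squash."""
    global speculative_results
    if prediction_correct:
        for dest, value in speculative_results:
            registers[dest] = value      # state change becomes visible
    speculative_results = []             # squash: effects simply disappear

execute_speculatively("R2", registers["R1"] + 5)
resolve_branch(prediction_correct=True)
print(registers["R2"])  # 15: committed; after a misprediction it stays 0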
Data Dependencies
All instructions in the window of execution may
begin execution, subject to data dependence and
resource constraints.
True Data Dependency
True data dependencies exist when the output of one
instruction is required as an input to a subsequent
instruction:
MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R2,R4,R5 (R2 := R4 + R5)
can fetch and decode the second instruction in parallel with the first.
can NOT execute the second instruction until the first has finished.
They are intrinsic features of the user's program and cannot be eliminated by compiler or hardware techniques.
They have to be detected and handled by hardware.
The addition above cannot be executed before the result of the
multiplication is available.
The simplest solution is to stall the adder until the multiplier
has finished.
To avoid leaving the adder idle, the hardware can find other instructions that the adder can execute in the meantime.
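A back-of-the-envelope sketch of the MUL/ADD pair above; the three-cycle multiplier latency is an assumption for illustration:

MUL_ISSUE, MUL_LATENCY = 0, 3    # MUL R4,R3,R1 issues at cycle 0, takes 3 cycles
r4_ready = MUL_ISSUE + MUL_LATENCY

# ADD R2,R4,R5 can be fetched and decoded early, but its execution must
# wait until R4 is available; no hardware trick can remove this wait.
add_earliest_execute = r4_ready
print(add_earliest_execute)  # 3: cycles 1-2 can be filled with other work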
True Data Dependency Example
Output Dependency
An output dependency exists if two instructions are
writing into the same location.
If the second instruction writes before the first one, an error
occurs:
MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R4,R2,R5 (R4 := R2 + R5)
Anti-dependency
An anti-dependency exists if an instruction uses a
location as an operand while a following one is writing into
that location.
If the first one is still using the location when the second one
writes into it, an error occurs:
MUL R4,R3,R1 (R4 := R3 * R1)
. . .
ADD R3,R2,R5 (R3 := R2 + R5)
Output and Anti-Dependencies
Output dependencies and anti-dependencies are not
intrinsic features of the executed program.
They are not real data dependencies but storage conflicts.
They are due to the competition of several instructions for the
same register.
They are only a consequence of the manner in which the programmer or the compiler uses registers (or memory locations).
In the previous examples the conflicts are produced only
because:
The output dependency: R4 is used by both instructions to
store the result (due to, for example, optimization of register
usage);
The anti-dependency: R3 is used by the second instruction to
store the result.
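Register renaming (see the Summary) removes exactly these storage conflicts; here is a minimal sketch with an invented physical-register free list:

rename_map = {f"R{i}": f"P{i}" for i in range(8)}   # architectural -> physical
free_list = [f"P{i}" for i in range(8, 16)]         # spare physical registers

def rename(dest: str, srcs: list[str]) -> tuple[str, list[str]]:
    """Rename one instruction: sources read the current mapping, the
    destination gets a fresh physical register."""
    phys_srcs = [rename_map[s] for s in srcs]       # read before remapping
    rename_map[dest] = free_list.pop(0)
    return rename_map[dest], phys_srcs

# The anti-dependency example: ADD writes R3 while MUL still reads it.
print(rename("R4", ["R3", "R1"]))   # MUL R4,R3,R1 -> ('P8', ['P3', 'P1'])
print(rename("R3", ["R2", "R5"]))   # ADD R3,R2,R5 -> ('P9', ['P2', 'P5'])
# MUL still reads P3 while ADD writes P9: the WAR conflict has disappeared.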
Output and Anti-Dependencies (Contd)
Effect of Dependencies
Lecture 5: Superscalar Processors
Definition and motivation
Superpipeline
Dependency issues
Parallel instruction execution
Instruction vs Machine Parallelism
Instruction-level parallelism (ILP) - the average
number of instructions in a program that a processor might
be able to execute at the same time.
Mostly determined by the number of true (data) dependencies
and procedural (control) dependencies in relation to the
number of other instructions.
Machine parallelism of a processor - the ability of the
processor to take advantage of the ILP of the program.
Determined by the number of instructions that can be fetched
and executed at the same time, i.e., the capacity of the
hardware.
To achieve high performance, we need both ILP and
machine parallelism.
The ideal situation is when the ILP of the program matches the machine parallelism.
Division and Decoupling
To increase ILP, we should divide instruction execution into smaller tasks and decouple them. In particular, there are three important activities:
SSA Instruction Execution Policies
Instructions can be executed in an order
different from the strictly sequential one, with
the requirement that the results must be the
same.
In-Order Issue with In-Order Completion
IOI with IOC Example
I1 needs two execute cycles (floating-point)
I2 and I3 have no special constraints
I4 needs the same function unit as I3
I5 needs data value produced by I4
I6 needs the same function unit as I5
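The schedule that these constraints force can be replayed with a small simulation. The unit names, latencies, and two-per-cycle issue width below are illustrative assumptions, not a description of a specific processor:

# (name, functional unit, execute cycles, index of producer it waits for)
PROGRAM = [
    ("I1", "FP", 2, None),   # floating point: two execute cycles
    ("I2", "U1", 1, None),
    ("I3", "U2", 1, None),
    ("I4", "U2", 1, None),   # same functional unit as I3
    ("I5", "U3", 1, 3),      # needs the data value produced by I4
    ("I6", "U3", 1, None),   # same functional unit as I5
]

finish = [None] * len(PROGRAM)   # cycle in which execution finishes
unit_free = {}                   # unit -> first cycle it is free again
cycle, i = 0, 0
while i < len(PROGRAM):
    issued = 0
    while i < len(PROGRAM) and issued < 2:   # issue width 2, strictly in order
        name, unit, latency, dep = PROGRAM[i]
        if dep is not None and (finish[dep] is None or finish[dep] > cycle):
            break                            # true dependency: stall
        if unit_free.get(unit, 0) > cycle:
            break                            # resource conflict: stall
        finish[i] = cycle + latency
        unit_free[unit] = cycle + latency
        issued += 1
        i += 1
    cycle += 1

# In-order completion additionally forces write-backs into program order
# (write-port limits are ignored in this sketch):
writeback, prev = [], 0
for f in finish:
    prev = max(prev, f + 1)
    writeback.append(prev)

for (name, *_), f, w in zip(PROGRAM, finish, writeback):
    print(f"{name}: executes until cycle {f}, writes back in cycle {w}")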
IOI with IOC Discussion
The processor detects and handles (by stalling)
true data dependencies and resource conflicts.
The basic idea of SSA is not to rely on compiler-based techniques (a compatibility consideration).
SSA allows the hardware alone to detect instructions that can be executed in parallel and to execute them accordingly.
IOI with IOC is not very efficient, but it simplifies
the hardware.
In-Order Issue w. Out-of-Order Completion
Out-of-Order Issue w. Out-of-Order Completion
OOI with OOC Example
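For contrast with the in-order example, a minimal sketch of out-of-order issue: instructions still enter the window in program order, but any instruction whose operands are ready and whose unit is free may issue, and results complete out of order. The same illustrative assumptions as before apply:

PROGRAM = [
    ("I1", "FP", 2, None),
    ("I2", "U1", 1, None),
    ("I3", "U2", 1, None),
    ("I4", "U2", 1, None),   # conflicts with I3 for U2
    ("I5", "U3", 1, 3),      # waits for I4's result
    ("I6", "U3", 1, None),   # conflicts with I5 for U3, but can issue first
]

finish = [None] * len(PROGRAM)
unit_free = {}
cycle = 0
while any(f is None for f in finish):
    issued = 0
    for i, (name, unit, latency, dep) in enumerate(PROGRAM):
        if finish[i] is not None or issued == 2:
            continue                         # already issued / width limit
        if dep is not None and (finish[dep] is None or finish[dep] > cycle):
            continue                         # operands not ready: skip, don't stall
        if unit_free.get(unit, 0) > cycle:
            continue                         # unit busy: skip, don't stall
        finish[i] = cycle + latency
        unit_free[unit] = cycle + latency
        issued += 1
    cycle += 1

for (name, *_), f in zip(PROGRAM, finish):
    print(name, "finishes in cycle", f)      # I6 finishes before I4 and I5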
Speedup w/o Procedural Dependencies
Summary
The following techniques are the main features of superscalar processors:
Several pipelined units which are working in parallel;
Out-of-order issue and out-of-order completion;
Register renaming.
All of the above techniques aim to enhance performance.
Experiments have shown:
Adding extra functional units alone is not very efficient;
Out-of-order issue is extremely important, since it allows the processor to look ahead for independent instructions;
Register renaming can improve performance by more than 30%; in this case performance is limited only by true dependencies.
It is important to provide enough fetching/decoding capacity so that the window of execution is sufficiently large.