Lect27-parallel-processing
Parallel Processing
Simultaneous data-processing tasks for the purpose of increasing computational speed
Perform concurrent data processing to achieve faster execution time
Multiple Functional Units :
Separate the execution unit into eight functional units operating in parallel
[Figure : processor registers feed eight functional units in parallel, with a path to memory]
» Adder-subtractor
» Integer multiply
» Logic unit
» Shift unit
» Incrementer
» Floating-point add-subtract
» Floating-point multiply
» Floating-point divide
Pipelining : the process of decomposing a sequential process into suboperations,
with each suboperation executed in a dedicated segment that operates concurrently with all
other segments.
A pipeline is a collection of processing segments through which binary information flows.
Each segment performs partial processing dictated by the way the task is partitioned.
Pipelining example : Fig. 9-2
Multiply and add operation : Ai * Bi + Ci ( for i = 1, 2, …, 7 )
Divided into 3 suboperation segments
» 1) R1 ← Ai, R2 ← Bi : Input Ai and Bi
» 2) R3 ← R1 * R2, R4 ← Ci : Multiply and input Ci
» 3) R5 ← R3 + R4 : Add Ci to the product
[Fig. 9-2 : Ai, Bi → R1, R2 (Segment 1) → Multiplier → R3, with Ci → R4 (Segment 2) → Adder → R5 (Segment 3)]
Content of registers in pipeline example : Tab. 9-1
Clock pulse    Segment 1     Segment 2          Segment 3
number         R1    R2      R3       R4        R5
1              A1    B1      -        -         -
2              A2    B2      A1*B1    C1        -
3              A3    B3      A2*B2    C2        A1*B1+C1
4              A4    B4      A3*B3    C3        A2*B2+C2
5              A5    B5      A4*B4    C4        A3*B3+C3
6              A6    B6      A5*B5    C5        A4*B4+C4
7              A7    B7      A6*B6    C6        A5*B5+C5
8              -     -       A7*B7    C7        A6*B6+C6
9              -     -       -        -         A7*B7+C7
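The register contents in the table can be reproduced with a short simulation. This is only a sketch: the function name `simulate_multiply_add` and the sample values for A, B and C are made up for illustration, and `None` stands for a register that holds no valid data.

```python
def simulate_multiply_add(a, b, c):
    """Clock a three-segment multiply-add pipeline.

    Returns one (R1, R2, R3, R4, R5) tuple per clock pulse;
    None marks a register that holds no valid data.
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    history = []
    for pulse in range(n + 2):  # k + n - 1 pulses, with k = 3 segments
        # Every register loads on the same clock edge, so update the
        # last segment first to make each stage read the old values.
        r5 = r3 + r4 if r3 is not None else None      # segment 3: add
        r3 = r1 * r2 if r1 is not None else None      # segment 2: multiply
        r4 = c[pulse - 1] if 1 <= pulse <= n else None  # segment 2: input Ci
        r1 = a[pulse] if pulse < n else None          # segment 1: input Ai
        r2 = b[pulse] if pulse < n else None          # segment 1: input Bi
        history.append((r1, r2, r3, r4, r5))
    return history


# Hypothetical sample data for i = 1..7
A = [1, 2, 3, 4, 5, 6, 7]
B = [2, 2, 2, 2, 2, 2, 2]
C = [10, 20, 30, 40, 50, 60, 70]
for pulse, row in enumerate(simulate_multiply_add(A, B, C), start=1):
    print(pulse, row)
```

The first result appears in R5 after clock pulse 3, and one more result follows on every pulse after that, so 7 tasks finish in 3 + 7 - 1 = 9 pulses.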
General considerations
4-segment pipeline : the operands pass through all four segments in a
fixed sequence. Each segment consists of a combinational circuit Si that
performs a suboperation on the data stream. The segments are
separated by registers Ri that hold the intermediate results.
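[Figure : Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with all registers driven by a common clock]
Space-time diagram ( Ti = task i ) :
Clock cycles :  1   2   3   4   5   6   7   8   9
Segment 1 :     T1  T2  T3  T4  T5  T6
Segment 2 :         T1  T2  T3  T4  T5  T6
Segment 3 :             T1  T2  T3  T4  T5  T6
Segment 4 :                 T1  T2  T3  T4  T5  T6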
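The diagram above can be generated for any number of segments and tasks. The sketch below is illustrative only; `space_time` is a made-up helper name, and it assumes every task spends exactly one clock cycle in each segment.

```python
def space_time(k, n):
    """Return a k x (k + n - 1) grid.

    Cell [s][c] holds the number of the task occupying segment s + 1
    during clock cycle c + 1, or None if the segment is idle.
    """
    cycles = k + n - 1
    grid = [[None] * cycles for _ in range(k)]
    for task in range(n):
        for seg in range(k):
            # Task t enters segment s exactly s cycles after it enters segment 1.
            grid[seg][task + seg] = task + 1
    return grid


for seg, row in enumerate(space_time(4, 6), start=1):
    print(seg, " ".join(f"T{t}" if t else "--" for t in row))
```

With k = 4 and n = 6 the grid has 4 + 6 - 1 = 9 columns, matching the nine clock cycles in the diagram.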
Speedup S : nonpipeline time / pipeline time
S = n • tn / ( ( k + n - 1 ) • tp ) = 6 • 6 tp / ( ( 4 + 6 - 1 ) • tp ) = 36 tp / 9 tp = 4
» n : number of tasks ( 6 )
» tn : time to complete each task in the nonpipeline circuit ( 6 cycle times = 6 tp )
» tp : clock cycle time ( 1 clock cycle )
» k : number of segments ( 4 )
If n → ∞, then k + n - 1 → n, so S = tn / tp
If we assume that the time it takes to process a task is the same in the pipeline and
nonpipeline circuits then we have
nonpipeline ( tn ) = pipeline ( k • tp )
S = tn / tp = k • tp / tp = k
Where k is the number of segments.
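The speedup formula is easy to check numerically. A minimal sketch follows; the function name `speedup` is made up, and times are expressed in units of one clock cycle tp.

```python
def speedup(n, k, tn, tp):
    """Speedup of a k-segment pipeline over a nonpipeline unit.

    n  : number of tasks
    k  : number of segments
    tn : time per task without pipelining
    tp : clock cycle time of the pipeline
    """
    return (n * tn) / ((k + n - 1) * tp)


# Values from the example: n = 6, k = 4, tn = 6 tp
print(speedup(n=6, k=4, tn=6, tp=1))

# With tn = k * tp, the speedup approaches k as n grows
for n in (10, 100, 10_000, 1_000_000):
    print(n, speedup(n=n, k=4, tn=4, tp=1))
```

The second loop shows why the theoretical maximum is k: the k - 1 cycles needed to fill the pipeline become negligible once n is large.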
Arithmetic Pipeline
Floating-point Adder Pipeline Example :
Add / subtract two normalized floating-point binary numbers, X = A x 2^a and Y = B x 2^b
Numerical example ( decimal numbers are used for simplicity ) :
» X = 0.9504 x 10^3
» Y = 0.8200 x 10^2
4 segment suboperations
» 1) Compare exponents by subtraction : 3 - 2 = 1
X = 0.9504 x 10^3
Y = 0.8200 x 10^2
» 2) Align mantissas : shift the mantissa of Y right by 1 digit
X = 0.9504 x 10^3
Y = 0.0820 x 10^3
» 3) Add mantissas
Z = 1.0324 x 10^3
» 4) Normalize result
Z = 0.10324 x 10^4
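[Figure : pipeline for floating-point addition and subtraction. Inputs are the exponents a, b and the mantissas A, B; registers R separate the segments.]
Segment 1 : compare exponents by subtraction ( difference )
Segment 2 : align mantissas
Segment 3 : add or subtract mantissas
Segment 4 : adjust exponent, normalize result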
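The four suboperations can be traced in code. The sketch below works on decimal mantissas, as in the numerical example, and uses Python's `decimal` module to avoid binary rounding noise. The names `fp_add` and `shift` are made up for illustration, and the function runs the four steps one after another rather than modelling the segment registers.

```python
from decimal import Decimal


def shift(mantissa, digits):
    """Multiply a mantissa by 10**digits (negative digits shift right)."""
    return mantissa * Decimal(10) ** digits


def fp_add(mant_x, exp_x, mant_y, exp_y):
    """Add X = mant_x * 10**exp_x and Y = mant_y * 10**exp_y."""
    # Segment 1: compare exponents by subtraction
    diff = exp_x - exp_y

    # Segment 2: align the mantissa that has the smaller exponent
    if diff >= 0:
        mant_y, exp = shift(mant_y, -diff), exp_x
    else:
        mant_x, exp = shift(mant_x, diff), exp_y

    # Segment 3: add mantissas
    mant = mant_x + mant_y

    # Segment 4: normalize so that 0.1 <= |mantissa| < 1
    while abs(mant) >= 1:
        mant, exp = shift(mant, -1), exp + 1
    while mant != 0 and abs(mant) < Decimal("0.1"):
        mant, exp = shift(mant, 1), exp - 1
    return mant, exp


mant, exp = fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)
print(mant, exp)
```

For the example values the sum before normalization is 1.0324 x 10^3, so step 4 shifts the mantissa right once and increments the exponent.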
Instruction Pipeline
Instruction Cycle
1) Fetch the instruction from memory
2) Decode the instruction
3) Calculate the effective address
4) Fetch the operands from memory
5) Execute the instruction
6) Store the result in the proper place
[Figure : flowchart of the instruction pipeline]
Segment 1 : Fetch instruction from memory
Segment 2 : Decode instruction and calculate the effective address ; test : Branch ?
Segment 3 : Fetch operand from memory
Remaining flowchart boxes : Interrupt ? ; Interrupt handling ; Update PC ; Empty pipe
Example : Four-segment Instruction Pipeline
Four-segment CPU pipeline :
» 1) FI : Instruction Fetch
» 2) DA : Decode Instruction & calculate EA
» 3) FO : Operand Fetch
» 4) EX : Execution
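Timing of Instruction Pipeline :
Step :           1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction : 1  FI  DA  FO  EX
              2      FI  DA  FO  EX
     (Branch) 3          FI  DA  FO  EX
              4              FI  -   -   FI  DA  FO  EX
              5                              FI  DA  FO  EX
              6                                  FI  DA  FO  EX
              7                                      FI  DA  FO  EX
» Instruction 3 is a branch : once it is decoded in step 4, fetching of later instructions halts until it completes EX in step 6
» Branch taken : a new instruction is fetched in step 7
» No branch : the instruction already fetched in step 4 is used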
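The timing diagram can be reproduced with a small scheduling sketch. The function `schedule` and its `branch_at` parameter are made up for illustration. It assumes a taken branch, that the instruction after the branch is fetched once and then fetched again one step after the branch leaves EX, and that no other conflicts occur.

```python
STAGES = ["FI", "DA", "FO", "EX"]


def schedule(num_instr, branch_at=None):
    """Return {instruction: {step: stage}} for a four-segment pipeline.

    branch_at is the 1-based number of a taken branch instruction
    (hypothetical parameter); None means no branch in the stream.
    """
    rows = {}
    fetch_step = 1
    for i in range(1, num_instr + 1):
        row = {}
        if branch_at is not None and i == branch_at + 1:
            # Fetched while the branch is in DA, then discarded.
            row[fetch_step] = "FI"
            # Fetch again one step after the branch completes EX.
            fetch_step += len(STAGES) - 1
        for offset, stage in enumerate(STAGES):
            row[fetch_step + offset] = stage
        rows[i] = row
        fetch_step += 1
    return rows


for instr, row in schedule(7, branch_at=3).items():
    print(instr, " ".join(row.get(step, "--") for step in range(1, 14)))
```

Without the branch, 7 instructions would finish in 4 + 7 - 1 = 10 steps; the branch costs three extra steps, giving the 13 steps in the diagram.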
Pipeline Conflicts : 3 major difficulties
1) Resource conflicts
» two segments access memory at the same time.
» Can be avoided by using separate instruction and data memories.
2) Data dependency
» an instruction depends on the result of a previous instruction, but this result is not
yet available
3) Branch difficulties
» branches and other instructions (interrupt, return, ...) that change the value of PC
Solutions to Data Dependency
Hardware methods
» Hardware Interlock
the hardware inserts a forced delay until the result of the previous instruction is available
» Operand Forwarding
the result of the previous instruction is passed directly to the ALU (normally it would go through a register)
Software methods
» Delayed Load
no-operation instructions are inserted until the result of the previous instruction is available
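As an illustration of the delayed-load idea, the sketch below scans a program and inserts no-operation instructions after a load whose result is needed too soon. Everything about the representation is assumed for the example: instructions are `(op, dest, sources)` tuples, the opcode names `LOAD` and `NOP` are placeholders, and `delay_slots` (the number of instructions that must separate a load from its first use) defaults to 1.

```python
NOP = ("NOP", None, ())


def insert_delayed_load_nops(program, delay_slots=1):
    """Return a copy of program with NOPs inserted after loads.

    program is a list of (op, dest, sources) tuples (assumed format).
    """
    if delay_slots < 1:
        return list(program)
    out = []
    for op, dest, srcs in program:
        # Look back over the most recently emitted instructions.
        recent = out[-delay_slots:]
        for distance, (p_op, p_dest, _) in enumerate(reversed(recent)):
            if p_op == "LOAD" and p_dest in srcs:
                # Pad until the loaded value is available.
                out.extend([NOP] * (delay_slots - distance))
                break
        out.append((op, dest, srcs))
    return out


program = [
    ("LOAD", "R1", ("A",)),
    ("LOAD", "R2", ("B",)),
    ("ADD", "R3", ("R1", "R2")),
    ("STORE", "C", ("R3",)),
]
for instr in insert_delayed_load_nops(program):
    print(instr)
```

Here the ADD needs R2 immediately after the second LOAD, so one NOP is placed between them. A hardware interlock would achieve the same delay without changing the program.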
Assignment