Data science
Data science
ORGANIZATION
AND
ARCHITECTURE
Debdeep Mukhopadhyay,
CSE, IIT Kharagpur
Datapath Elements and
Their Designs
Why Datapaths?
The speed of these elements often dominates the
overall system performance so optimization
techniques are important.
However, as we will see, the task is non-trivial since
there are multiple equivalent logic and circuit
topologies to choose from, each with adv./disadv. in
terms of speed, power and area.
Datapath elements include shifters, adders,
multipliers, etc.
Bit-slicing method of constructing
ALU
Bit slicing is a technique for constructing a
processor from modules of smaller bit width.
Each of these components processes one
bit field or "slice" of an operand.
The grouped processing components would
then have the capability to process the chosen
full word-length of a particular software
design.
Bit slicing
0 0 Y<-A No shift
0 1 Y<-shlA Shift left
1 0 Y<-shrA Shift right
1 1 Y<-0 Zero
outputs
0
A[1]
A[0] Y[1]
MUX
A[2]
A[0]
A[1] Y[0]
MUX
0
Verilog Code
module shifter(Con,A,Y);
input [1:0] Con;
input[2:0] A;
output[2:0] Y;
reg [2:0] Y;
always @(A or Con)
begin
case(Con)
0: Y=A;
1: Y=A<<1;
2: Y=A>>1;
default: Y=3’b0;
endcase
end
endmodule
Combinational logic shifters with
shiftin and shiftout
Sel Operation Function
0 Y<=A No shift
1 Y<-A rol 1 Rotate once
2 Y<-A rol 2 Rotate twice
3 Y<- A rol 3 Rotate Thrice
4 Y<-A rol 4 Rotate four times
5 Y<-A rol 5 Rotate five times
Verilog Coding
function [2:0] rotate_left;
input [5:0] A;
input [2:0] NumberShifts;
reg [5:0] Shifting;
integer N;
begin
Shifting = A;
for(N=1;N<=NumberShifts;N=N+1)
begin
Shifting={Shifting[4:0],Shifting[5]};
end
rotate_left=Shifting;
end
endfunction
Verilog
always @(Rotate or A)
begin
case(Rotate)
0: Y=A;
1: Y=rotate_left(A,1);
2: Y=rotate_left(A,2);
3: Y=rotate_left(A,3);
4: Y=rotate_left(A,4);
5: Y=rotate_left(A,5);
default: Y=6’bx;
endcase
end
Another Way
.
data 1
n bits
output
data 2 n bits
n bits
S = A⊕ B Cout S = A⊕ B ⊕C
Cout C
Cout = AgB Cout = MAJ ( A, B, C )
S S
A B Co S A B C Co S
0 0 0 0 0 0 0 0 0
0 1 0 1 0 0 1 0 1
1 0 0 1 0 1 0 0 1
1 1 1 0 0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1
Carry-Ripple Adder
Simplest design: cascade full adders
Critical path goes from Cin to Cout
Design full adder to have fast carry delay
A4 B4 A3 B 3 A2 B2 A 1 B1
Cout Cin
C3 C2 C1
S4 S3 S2 S1
Full adder
Computes one-bit sum, carry:
si = ai XOR bi XOR ci
ci+1 = aibi + aici + bici
Half adder computes two-bit sum.
Ripple-carry adder: n-bit adder built from full
adders.
Delay of ripple-carry adder goes through all
carry bits.
Verilog for full adder
module fulladd(a,b,carryin,sum,carryout);
input a, b, carryin; /* add these bits*/
output sum, carryout; /* results */
fulladd a0(a[0],b[0],carryin,sum[0],carry[1]);
fulladd a1(a[1],b[1],carry[1],sum[1],carry[2]);
…
fulladd a7(a[7],b[7],carry[7],sum[7],carryout]);
endmodule
Generate and Propagate
G[i ] = A[i ].B[i ] G[i ] = A[i ].B[i ]
P[i ] = A[i ] ⊕ B[i ] P[i ] = A[i ] + B[i ]
C[i ] = G[i ] + P[i ].C[i −1] C[i ] = G[i ] + P[i ].C[i −1]
S [i ] = P[i ] ⊕ C[i −1] S [i ] = A[i ] ⊕ B[i ] ⊕ C[i − 1]
endmodule
Verilog for carry-lookahead sum unit
module sum(a,b,carryin,result);
input a, b, carryin; /* add these bits*/
output result; /* sum */
∑ ( j − i)2
j =i +1
− ( j −i )
+ (k − i)2− ( k −1−i )
= 2 − 2− ( k −1−i )
p
[Using, ∑l2
l =1
−l
= 2 − ( p + 2)2− p ]
Carry completion sensing adder
A=011101101101101 A=011101101101101
B=100111000010101 B=100111000010101
------------------------------ ------------------------------
C=000000000000000 C=000101000000101
N=000000000000000 N=000000010000010
------------------------------ ------------------------------
C=000101000000101 C=001111000001101
N=000000010000010 N=000000110000010
Carry completion sensing adder
A=011101101101101 A=011101101101101
B=100111000010101 B=100111000010101
------------------------------ ------------------------------
C=001111000001101 C=011111000011101
N=000000110000010 N=000000110000010
------------------------------ ------------------------------
C=011111000011101 C=111111000111101
N=000000110000010 N=000000110000010
Carry completion sensing adder
A=011101101101101
B=100111000010101
------------------------------
C=111111000111101
N=000000110000010
-----------------------------
-
C=111111001111101
N=000000110000010
Carry completion sensing adder
(A[i],B[i])=(0,0)=>(Ci,Ni)=(0,1)
(A[i],B[i])=(1,1)=>(Ci,Ni)=(1,0)
(A[i],B[i])=(0,1)=>(Ci,Ni)=(Ci-1,Ni-1)
(A[i],B[i])=(1,0)=>(Ci,Ni)=(Ci-1,Ni-1)
Stop, when for all i, Ci V Ni = 1
Justification
Ci and Ni together is a coding for the carry.
When Ci=1, carry can be computed. Make
Ni=0
When Ci=0 is the final carry, then indicate by
Ni=1
The carry can be surely stated when both Ai
and Bi are 1’s or 0’s.
Carry-skip adder
Looks for cases in which carry out of a set of
bits is identical to carry in.
Typically organized into b-bit stages.
Can bypass carry through all stages in a group
when all propagates are true: Pi Pi+1 … Pi+b-1.
Carry out of group when carry out of last bit in
group or carry is bypassed.
Carry-skip structure
ci
Pi
Pi+1 AND
…
Pi+b-1
OR
Ci+b-1
Carry-skip structure
fulladd_p a0(a[0],b[0],carryin,sum[0],carry[1],p[0]);
fulladd_p a1(a[1],b[1],carry[1],sum[1],carry[2],p[1]);
fulladd_p a2(a[2],b[2],carry[2],sum[2],carry[3],p[2]);
fulladd_p a3(a[3],b[3],carry[3],sum[3],carry[4],p[3]);
assign cs4 = carry[4] | (p[0] & p[1] & p[2] & p[3] & carryin);
fulladd_p a4(a[4],b[4],cs4, sum[4],carry[5],p[4]);
…
assign carryout = carry[8] | (p[4] & p[5] & p[6] & p[7] & cs4);
endmodule
Delay analysis
Assume that skip delay = 1 bit carry delay.
Delay of k-bit adder with block size b:
T = (b-1) + 0.5 + (k/b –2) + (b-1)
block 0 OR gate skips last block
For equal sized blocks, optimal block size is
sqrt(k/2).
Delay of Carry-Skip Adder
tp
ripple adder
bypass adder
N
t d = 2( k − 1) t RCA + − 2 t SKIP
2k
4..8
N
Carry-select adder
Computes two results in parallel, each for
different carry input assumptions.
Uses actual carry in to select correct result.
Reduces delay to multiplexer.
Carry-select structure
Carry-save adder
Useful in multiplication.
Input: 3 n-bit operands.
Output: n-bit partial sum, n-bit carry.
Use carry propagate adder for final sum.
Operations:
s = (x + y + z) mod 2.
c = [(x + y + z) –2] / 2.
Carry Network is the Essence of a Fast Adder
gi pi Carry is: xi yi
0 0 annihilated or killed gi = xi yi
0 1 propagated
1 0 generated pi = xi ⊕ yi
1 1 (impossible)
Carry network
Ripple; Skip;
ck
c k−1
... ci ... c0 Lookahead;
c k−2 c1
c i+1 Parallel-prefix
si
...
ck ck−1 ck−2 c2 c1 c0
Carry network
ck
c k−1
... ci ... c0
c k−2 c1
c i+1
si
The carry recurrence can be unrolled to obtain each carry signal directly
from inputs, rather than through propagation Note:
Addition symbol
ci = gi–1 ∨ ci–1 pi–1 vs logical OR
= gi–1 ∨ (gi–2 ∨ ci–2 pi–2) pi–1
= gi–1 ∨ gi–2 pi–1 ∨ ci–2 pi–2 pi–1
= gi–1 ∨ gi–2 pi–1 ∨ gi–3 pi–2 pi–1 ∨ ci–3 pi–3 pi–2 pi–1
= gi–1 ∨ gi–2 pi–1 ∨ gi–3 pi–2 pi–1 ∨ gi–4 pi–3 pi–2 pi–1 ∨ ci–4 pi–4 pi–3 pi–2 pi–1
=...
Full Carry Lookahead
x3 y3 x2 y2 x1 y1 x0 y0
cin
...
s3 s2 s1 s0
indirectly g2
Full carry lookahead is quite practical
for a 4-bit adder c2 p1
c1 = g0 ∨ c0 p0 g1
c2 = g1 ∨ g0 p1 ∨ c0 p0 p1 p0
c3 = g2 ∨ g1 p2 ∨ g0 p1 p2 ∨ c0 p0 p1 p2 c1
g0
c4 = g3 ∨ g2 p3 ∨ g1 p2 p3 ∨ g0 p1 p2 p3 c0
∨ c0 p0 p1 p2 p3 Four-bit carry network with
full lookahead.
Carry Lookahead Beyond 4 Bits
Consider a 32-bit adder
No circuit sharing:
c1 = g0 ∨ c0 p0
Repeated computations
c2 = g1 ∨ g0 p1 ∨ c0 p0 p1
c3 = g2 ∨ g1 p2 ∨ g0 p1 p2 ∨ c0 p0 p1 p2
.
.
. 32-input AND
c31 = g30 ∨ g29 p30 ∨ g28 p29 p30 ∨ g27 p28 p29 p30 ∨ . . . ∨ c0 p0 p1 p2 p3 ... p29 p30
...
High fan-ins necessitate
32-input OR tree-structured circuits
Solution to the Fan-in Problem
High-radix addition (i.e., radix 2h)
Increases the latency for generating g and p signals and sum digits,
but simplifies the carry network (optimal radix?)
Multilevel lookahead
Either way, the carries c4, c8, and c12 are determined first
g [i,i+3] = gi+3 ∨ gi+2 pi+3 ∨ gi+1 pi+2 pi+3 ∨ gi pi+1 pi+2 pi+3
p [i,i+3] = pi pi+1 pi+2 pi+3
ci+3 ci+2 ci+1
g[i,i+3] p[i,i+3]
pi+3
c4
A 4-bit
lookahead gi+3
carryp3generator Block Signal Generation
g3 Intermediate Carries
c3 ci+3
p2 pi+2
A 4-bit
carry g2 gi+2
network
p1 ci+2 pi+1
c2
g1 gi+1
p0 pi
c1 ci+1
g0 gi
ci
c0
Combining Block g and p Signals
j0 i0
j1 i1
j2 i2
j3
Block generate and
i3
propagate signals
can be combined in
c j 2 +1 c j 1 +1 cj the same way as bit
0 +1
g and p signals to
form g and p signals
g p g p g p g p
for wider blocks
ci 0
4-bit lookahead carry generator
g p
Fig. 6.3 Combining of g and p signals of four
(contiguous or overlapping) blocks of arbitrary widths
into the g and p signals for the overall block [i0, j3].
Apr. 2012 Computer Arithmetic, Addition/Subtraction Slide 62
A Two-Level Carry-Lookahead Adder
c12 c8 c4 c0
g [0,63]
p [0,63] Fig. 6.4 Building a 64-bit carry-lookahead adder from 16
4-bit adders and 5 lookahead carry generators.
p′
j1 i1
¢
g = g" + g'p"
g p p = p'p"
(g, p)
Block B g p
Combining of g and p signals of two (contiguous or overlapping) blocks B'
and B" of arbitrary widths into the g and p signals for block B.
Formulating the Prefix Computation Problem
The problem of carry determination can be formulated as:
Given (g0, p0)(g1, p1) . . . (gk–2, pk–2) (gk–1, pk–1)
c1 c2 . . . ck–1 ck
Carry-in can be viewed as an extra (−1) position: (g–1, p–1) = (cin, 0)
¢ ¢ (b) A 4-bit
Carry
lookahead
¢ ¢ network
¢ ¢ ¢
g[0,1] p[0,1]
Reason for 2
latency being
2 log2k – 2 3
Brent-Kung
5
parallel prefix
graph for
16 inputs. 6