Pipelining & Verilog

The document discusses pipelining and Verilog. It describes how pipelining can be used to increase throughput by dividing sequential operations into stages separated by registers. It provides an example of a sequential divider circuit that could be pipelined. It also discusses Verilog math functions and shows an example of a pipelined Verilog divider module. Finally, it covers performance metrics like latency and throughput, and how retiming circuits by moving registers can modify the critical path and reduce the number of registers.

Uploaded by

Dlisha

Pipelining & Verilog

• Division
• Latency & Throughput
• Pipelining to increase throughput
• Retiming
• Verilog Math Functions

6.111 Fall 2016 Lecture 9 1


Sequential Divider
Assume the Dividend (A) and the divisor (B) have N bits. If we
only want to invest in a single N-bit adder, we can build a
sequential circuit that processes a single subtraction at a time
and then cycle the circuit N times. This circuit works on unsigned
operands; for signed operands one can remember the signs, make
operands positive, then correct sign of result.

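[Figure: sequential divider datapath with registers P, A (N bits) and B, an (N+1)-bit subtractor computing P-B, a 0/1 mux controlled by S, and a ">=0?" test that produces S and the new LSB of A]

Init: P←0, load A and B
Repeat N times {
    shift P/A left one bit
    temp = P-B
    if (temp >= 0)
        {P←temp, A_LSB←1}
    else A_LSB←0
}
Done: Q in A, R in P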
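
The shift-and-subtract loop above can be sketched as a small behavioral model. This is an illustrative Python sketch, not the lecture's hardware: the function name restoring_divide and its arguments are assumptions made for the example, and Python integers stand in for the P and A registers.

```python
def restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Model the sequential divider: N shift/subtract steps on unsigned N-bit operands."""
    if b == 0:
        raise ZeroDivisionError("divisor B must be non-zero")
    mask = (1 << n) - 1
    p = 0                      # partial remainder register P
    a &= mask                  # dividend register A (ends up holding the quotient)
    for _ in range(n):
        # shift the combined P/A register left one bit
        p = (p << 1) | (a >> (n - 1))
        a = (a << 1) & mask
        temp = p - b           # trial subtraction
        if temp >= 0:          # subtraction fits: keep it, quotient bit = 1
            p = temp
            a |= 1
        # otherwise P keeps its old value and the quotient bit stays 0
    return a, p                # Q in A, R in P


print(restoring_divide(13, 3, 4))
```

Each loop iteration corresponds to one clock cycle of the sequential circuit, so an N-bit divide takes N cycles with a single subtractor.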



Verilog divider.v
// The divider module divides one number by another. It
// produces a signal named "ready" when the quotient output
// is ready, and takes a signal named "start" to indicate
// the input dividend and divider are ready.
// sign -- 0 for unsigned, 1 for twos complement

// It uses a simple restoring divide algorithm.
// http://en.wikipedia.org/wiki/Division_(digital)#Restoring_division

module divider #(parameter WIDTH = 8)
  (input clk, sign, start,
   input [WIDTH-1:0] dividend,
   input [WIDTH-1:0] divider,
   output reg [WIDTH-1:0] quotient,
   output [WIDTH-1:0] remainder,
   output ready);

  reg [WIDTH-1:0] quotient_temp;
  reg [WIDTH*2-1:0] dividend_copy, divider_copy, diff;
  reg negative_output;

  // remainder takes the sign of the result when signed division is requested
  assign remainder = (!negative_output) ?
         dividend_copy[WIDTH-1:0] : ~dividend_copy[WIDTH-1:0] + 1'b1;

  reg [5:0] bit;          // remaining iterations
  reg del_ready = 1;
  assign ready = (!bit) & ~del_ready;   // one-cycle pulse when bit reaches 0

  wire [WIDTH-2:0] zeros = 0;
  initial bit = 0;
  initial negative_output = 0;

  always @( posedge clk ) begin
    del_ready <= !bit;
    if( start ) begin
      bit = WIDTH;
      quotient = 0;
      quotient_temp = 0;
      // work on magnitudes; negate negative operands when sign is set
      dividend_copy = (!sign || !dividend[WIDTH-1]) ?
                      {1'b0,zeros,dividend} :
                      {1'b0,zeros,~dividend + 1'b1};
      divider_copy = (!sign || !divider[WIDTH-1]) ?
                     {1'b0,divider,zeros} :
                     {1'b0,~divider + 1'b1,zeros};

      negative_output = sign &&
                        ((divider[WIDTH-1] && !dividend[WIDTH-1])
                        ||(!divider[WIDTH-1] && dividend[WIDTH-1]));
    end
    else if ( bit > 0 ) begin
      diff = dividend_copy - divider_copy;
      quotient_temp = quotient_temp << 1;
      if( !diff[WIDTH*2-1] ) begin      // difference is non-negative: keep it
        dividend_copy = diff;
        quotient_temp[0] = 1'd1;
      end
      quotient = (!negative_output) ?
                 quotient_temp :
                 ~quotient_temp + 1'b1;
      divider_copy = divider_copy >> 1;
      bit = bit - 1'b1;
    end
  end
endmodule

L. Williams MIT ‘13


Math Functions in Coregen

Wide selection of math functions available


Coregen Divider

Not necessary for many applications.

Details in data sheet.


Coregen Divider

Choose the minimum number for the application.

Ready For Data: needed if clocks/divide > 1


Performance Metrics for Circuits

Circuit Latency (L): time between arrival of new input and generation
of corresponding output.

For combinational circuits this is just tPD.

Circuit Throughput (T): Rate at which new outputs appear.

For combinational circuits this is just 1/tPD or 1/L.



Coregen Divider Latency

Latency dependent on dividend width + fractional remainder width


Performance of Combinational Circuits
For combinational logic:
L = tPD,
T = 1/tPD.

[Figure: X feeds F and G in parallel; H combines their outputs to produce P(X). Timing diagram shows X, F(X), G(X), P(X).]

We can’t get the answer faster, but are we making effective use of our hardware at all times?

F & G are “idle”, just holding their outputs stable while H performs its computation.


Retiming: A very useful transform
Retiming is the action of moving registers around in the system
• Registers have to be moved from ALL inputs to ALL outputs or vice versa

Cutset retiming: A cutset is a set of edges whose removal splits the circuit graph into two
disjoint partitions. To retime, delays are moved from the cutset's ingoing edges to its
outgoing edges or vice versa.

Benefits of retiming:
• Modify critical path delay
• Reduce total number of registers


Retiming Combinational Circuits
aka “Pipelining”

[Figure: left, combinational circuit X → F (15) and G (20) in parallel → H (25) → P(X); right, the same circuit with registers after F and G and after H, input Xi, output P(Xi-2)]

Assuming ideal registers, i.e., tPD = 0, tSETUP = 0:

Unpipelined: L = 45, T = 1/45
Pipelined: tCLK = 25, L = 2*tCLK = 50, T = 1/tCLK = 1/25


Pipeline diagrams

[Figure: X → F (15) and G (20) in parallel → H (25) → P(X), with pipeline registers after F, G and H]

Clock cycle      i      i+1      i+2       i+3       …
Input            Xi     Xi+1     Xi+2      Xi+3      …
F Reg                   F(Xi)    F(Xi+1)   F(Xi+2)
G Reg                   G(Xi)    G(Xi+1)   G(Xi+2)
H Reg                            H(Xi)     H(Xi+1)   H(Xi+2)

(Rows are pipeline stages; columns are clock cycles.)

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.
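
The diagonal movement in the pipeline diagram can be reproduced with a few lines of simulation. This is a hedged Python sketch: the function simulate and the convention of storing only the input index in each register are assumptions chosen for illustration, and the clock edge is modelled as one simultaneous update of all three registers.

```python
def simulate(num_cycles):
    """Clock a 2-stage pipeline: stage 1 = F and G in parallel, stage 2 = H.

    Each register stores the index k of the input Xk whose result it holds,
    or None while the pipeline is still filling.
    """
    f_reg = g_reg = h_reg = None
    rows = []
    for i in range(num_cycles):
        # Record what is visible during cycle i (input Xi is being applied).
        rows.append({"cycle": i, "input": i, "F": f_reg, "G": g_reg, "H": h_reg})
        # Clock edge: every register loads at once from the old values.
        # H is computed from the F and G registers, so it inherits their index.
        f_reg, g_reg, h_reg = i, i, f_reg
    return rows


for row in simulate(5):
    print(row)
```

Reading the printed rows top to bottom, a given index appears first as the input, one cycle later in the F and G registers, and one cycle after that in the H register.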


Pipeline Conventions
DEFINITION:
a K-Stage Pipeline (“K-pipeline”) is an acyclic circuit having exactly K
registers on every path from an input to an output.

a COMBINATIONAL CIRCUIT is thus a 0-stage pipeline.

CONVENTION:
Every pipeline stage, hence every K-Stage pipeline, has a register on its
OUTPUT (not on its input).

ALWAYS:
The CLOCK common to all registers must have a period sufficient to
cover propagation over combinational paths PLUS (input) register tPD
PLUS (output) register tSETUP.

The LATENCY of a K-pipeline is K times the period of the clock common to all registers.

The THROUGHPUT of a K-pipeline is the frequency of the clock.
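
The clock-period rule and the latency and throughput definitions above can be turned into a short calculation. The following Python sketch is illustrative only: the function names are assumptions, and the register tPD = 2 and tSETUP = 1 in the second call are made-up numbers, not values from the lecture. The stage delays 20 and 25 correspond to a first stage limited by a 20-unit block and a second stage of 25 units.

```python
def min_clock_period(stage_logic_delays, reg_tpd=0.0, reg_tsetup=0.0):
    """Smallest legal clock period: slowest stage logic + register tPD + register tSETUP."""
    return max(stage_logic_delays) + reg_tpd + reg_tsetup


def k_pipeline_metrics(stage_logic_delays, reg_tpd=0.0, reg_tsetup=0.0):
    """Latency = K * tCLK and throughput = 1 / tCLK for a K-stage pipeline."""
    k = len(stage_logic_delays)
    t_clk = min_clock_period(stage_logic_delays, reg_tpd, reg_tsetup)
    return {"K": k, "t_clk": t_clk, "latency": k * t_clk, "throughput": 1 / t_clk}


# Ideal registers, two stages of 20 and 25 time units: tCLK = 25, latency = 50.
print(k_pipeline_metrics([20, 25]))
# Hypothetical non-ideal registers (tPD = 2, tSETUP = 1): tCLK = 28, latency = 56.
print(k_pipeline_metrics([20, 25], reg_tpd=2, reg_tsetup=1))
```

Note that register overhead is paid once per stage, which is one reason adding stages beyond the point where the slowest block dominates only increases latency.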

Ill-formed pipelines
Consider a BAD job of pipelining:

[Figure: badly pipelined circuit with inputs X and Y and blocks A, B and C]

For what value of K is this circuit a K-Pipeline? Answer: none

Problem:
Successive inputs get mixed: e.g., B(A(Xi+1), Yi). This
happened because some paths from inputs to outputs
have 2 registers, and some have only 1!
This CAN’T HAPPEN on a well-formed K pipeline!
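
The well-formedness condition (exactly K registers on every input-to-output path) can be checked mechanically. The sketch below is a minimal Python illustration: the function pipeline_depth, the edge-list representation, and both example circuits are hypothetical and are not the circuit drawn on the slide.

```python
def pipeline_depth(edges, inputs, outputs):
    """Return K if every input-to-output path crosses exactly K registers, else None.

    edges maps a node to a list of (successor, registers_on_that_edge) pairs.
    The circuit is assumed to be acyclic and outputs are assumed to be sinks.
    """
    counts = set()

    def walk(node, regs):
        if node in outputs:
            counts.add(regs)
            return
        for succ, r in edges.get(node, []):
            walk(succ, regs + r)

    for src in inputs:
        walk(src, 0)
    return counts.pop() if len(counts) == 1 else None


# Hypothetical ill-formed circuit: the Y path sees 2 registers, the X paths see 1.
bad = {
    "X": [("A", 0)],
    "Y": [("B", 1)],
    "A": [("B", 0), ("C", 1)],
    "B": [("C", 1)],
    "C": [("OUT", 0)],
}
# Removing the extra register on Y -> B makes every path cross exactly 1 register.
good = dict(bad, Y=[("B", 0)])

print(pipeline_depth(bad, ["X", "Y"], {"OUT"}))
print(pipeline_depth(good, ["X", "Y"], {"OUT"}))
```

A result of None means data from different input cycles can meet at some block, which is exactly the mixing problem described above.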


A pipelining methodology
Step 1:
Add a register on each output.

Step 2:
Add another register on each output. Draw a cut-set contour that includes all the new registers and some part of the circuit. Retime by moving regs from all outputs to all inputs of cut-set.

Repeat until satisfied with T.

STRATEGY:
Focus your attention on placing pipelining registers around the slowest circuit elements (BOTTLENECKS).

[Figure: example circuit with blocks A (4 ns), B (3 ns), C (8 ns), D (4 ns), E (2 ns), F (5 ns)]

T = 1/8ns
L = 24ns


Pipeline Example

[Figure: example circuit with inputs X and Y and blocks A, B and C, annotated with block delays and candidate register positions]

          LATENCY   THROUGHPUT
0-pipe:   4         1/4
1-pipe:   4         1/4
2-pipe:   4         1/2
3-pipe:   6         1/2

OBSERVATIONS:
• 1-pipeline improves neither L nor T.
• T improved by breaking long combinational paths, allowing faster clock.
• Too many stages cost L, don’t improve T.
• Back-to-back registers are often required to keep pipeline well-formed.


Pipeline Example - Verilog
Lab 3 Pong
• G = game logic 8ns tpd
• C = draw round puck, use multiply with 9ns tpd
• System clock 65mhz = 15ns period – oops (8ns + 9ns = 17ns does not fit in one period)

[Figure: X (hcount, vcount, etc) → G (8) → Y (intermediate wires) → C (9) → pixel]

No pipeline

assign y = G(x);      // logic for y
assign pixel = C(y);  // logic for pixel

or, equivalently:

reg [N:0] x,y;
reg [23:0] pixel;
always @ * begin
  y = G(x);
  pixel = C(y);
end

[Figure: X → G (8) → register Y2 → C (9) → register → pixel, both registers driven by clock]

Pipeline

always @(posedge clock) begin
  ...
  y2 <= G(x);      // pipeline y
  pixel <= C(y2);  // pipeline pixel
end

Latency = 2 clock cycles! Implications?


Increasing Throughput: Pipelining
Idea: split processing across several clock cycles by dividing circuit into pipeline stages separated by registers that hold values passing from one stage to the next.

[Figure: circuit built from full adders (FA), divided into pipeline stages; marked boxes = register]

(Throughput = 1/(4 tPD,FA) instead of 1/(8 tPD,FA))


How about tPD = 1/2tPD,FA?

[Figure: the same circuit with more pipeline stages; marked boxes = register]


Timing Reports

65mhz = 27mhz*2.4

Synthesis report
Multiple: 7.251ns
Total propagation delay: 34.8ns


History of Computational Fabrics
• Discrete devices: relays, transistors (1940s-50s)
• Discrete logic gates (1950s-60s)
• Integrated circuits (1960s-70s)
– e.g. TTL packages: Data Book for 100’s of different parts
• Gate Arrays (IBM 1970s)
– Transistors are pre-placed on the chip & Place and Route software
puts the chip together automatically – only program the interconnect
(mask programming)
• Software Based Schemes (1970’s- present)
– Run instructions on a general purpose core
• Programmable Logic (1980’s to present)
– A chip that can be reprogrammed after it has been fabricated
– Examples: PALs, EPROM, EEPROM, PLDs, FPGAs
– Excellent support for mapping from Verilog
• ASIC Design (1980’s to present)
– Turn Verilog directly into layout using a library of standard cells
– Effective for high-volume and efficient use of silicon area


Reconfigurable Logic
• Logic blocks
– To implement combinational
and sequential logic
• Interconnect
– Wires to connect inputs and
outputs to logic blocks
• I/O blocks
– Special logic blocks at
periphery of device for
external connections

• Key questions:
– How to make logic blocks programmable?
(after chip has been fabbed!)
– What should the logic granularity be?
– How to make the wires programmable?
(after chip has been fabbed!)
– Specialized wiring structures for local
vs. long distance routes?
– How many wires per logic block?

[Figure: logic block with n inputs and m outputs, containing logic plus a D flip-flop (SET, CLR, Q), controlled by configuration bits]


Programmable Array Logic (PAL)
• Based on the fact that any combinational logic can be
realized as a sum-of-products
• PALs feature an array of AND-OR gates with programmable
interconnect

[Figure: input signals → AND array (programming of product terms) → OR array (programming of sum terms) → output signals]
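
The AND array / OR array structure can be mimicked in software to show how a sum-of-products is "programmed". This Python sketch is a simplified model under stated assumptions: the function pal_eval and its data layout are invented for illustration, and it treats both arrays as programmable, following the figure labels rather than any particular device.

```python
def pal_eval(product_terms, sum_terms, inputs):
    """Evaluate a sum-of-products array.

    product_terms: list of dicts {input_name: required_value}; each dict is one
                   AND-array row (inputs not listed are "don't care").
    sum_terms:     list of lists of product-term indices; each list is one
                   OR-array output.
    inputs:        dict mapping input names to 0 or 1.
    """
    products = [all(inputs[name] == value for name, value in term.items())
                for term in product_terms]
    return [any(products[i] for i in indices) for indices in sum_terms]


# Hypothetical programming: output 0 = a XOR b, output 1 = a AND b.
product_terms = [{"a": 1, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}]
sum_terms = [[0, 1], [2]]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, pal_eval(product_terms, sum_terms, {"a": a, "b": b}))
```

Changing the two lists corresponds to reprogramming the interconnect; the evaluation logic itself never changes.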


RAM Based Field Programmable Logic - Xilinx

[Figure: array of CLBs joined by switch matrices (programmable interconnect) and surrounded by I/O Blocks (IOBs). Each IOB has an output buffer with slew rate control, passive pull-up/pull-down, an input buffer with optional delay, D flip-flops and a pad.]

[Figure: Configurable Logic Block (CLB) with function generators F (inputs F1-F4), G (inputs G1-G4) and H (input H1), control inputs C1-C4 (H1, DIN, S/R, EC), two D flip-flops with S/R control, set/reset (SD, RD) and enable (EC), outputs X and Y, and clock K]


LUT Mapping
• N-LUT direct implementation of a truth table: any function
of N inputs.
• N-LUT requires 2^N storage elements (latches)
• N inputs select one latch location (like a memory)

[Figure: 4-LUT example. The inputs select one latch, which drives the output. Latches set by configuration bitstream.]
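
A lookup table behaves like a tiny memory addressed by its inputs, which a few lines of code make concrete. This is an illustrative Python sketch: the class LUT, the bit ordering of the address (first input is the least significant bit) and the parity configuration are all assumptions for the example, not details of a specific FPGA.

```python
class LUT:
    """N-input lookup table: 2**N stored bits, one of which is selected by the inputs."""

    def __init__(self, n, config_bits):
        if len(config_bits) != 2 ** n:
            raise ValueError(f"a {n}-LUT needs {2 ** n} configuration bits")
        self.n = n
        self.bits = list(config_bits)   # stands in for the configuration latches

    def read(self, *inputs):
        if len(inputs) != self.n:
            raise ValueError(f"expected {self.n} inputs")
        # The inputs form an address; assumed ordering: first input = bit 0.
        address = sum(bit << i for i, bit in enumerate(inputs))
        return self.bits[address]


# Hypothetical configuration: a 4-LUT programmed as 4-input odd parity.
parity_bits = [bin(addr).count("1") & 1 for addr in range(16)]
lut4 = LUT(4, parity_bits)
print(lut4.read(1, 0, 1, 1))
```

Any other 4-input function is obtained by loading a different list of 16 bits; the read path is unchanged.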

Configuring the CLB as a RAM
Memory is built using latches, not FFs

16x2

Read is the same as a LUT function!


Xilinx 4000 Interconnect


Xilinx 4000 Interconnect Details

Wires are not ideal!


Add Bells & Whistles

[Figure: FPGA with added hard blocks: hard processor, gigabit serial I/O, multiplier (two 18 bit inputs, 36 bit result), programmable termination with impedance control (VCCIO, Z), clock management, BRAM]

Courtesy of David B. Parlour, ISSCC 2004 Tutorial,
“The Reality and Promise of Reconfigurable Computing in Digital Signal Processing”


The Virtex II CLB (Half Slice Shown)


Adder Implementation
[Figure: half-slice configured as a 1-bit adder with inputs A, B and Cin. LUT: A⊕B. Y = A ⊕ B ⊕ Cin. Dedicated carry logic drives Cout.]

1 half-Slice = 1-bit adder
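
The split between LUT and dedicated carry logic can be modelled bit by bit. In this Python sketch the LUT computes A xor B as in the figure; the carry is modelled as a multiplexer that passes Cin when the LUT output is 1 and A otherwise, which is an assumption about how the carry logic is wired rather than something read from the slide. The names adder_slice and ripple_add are made up for the example.

```python
def adder_slice(a, b, cin):
    """One half-slice used as a 1-bit adder."""
    p = a ^ b                 # LUT output: A xor B
    y = p ^ cin               # sum bit: A xor B xor Cin
    cout = cin if p else a    # assumed carry mux: propagate Cin, else generate from A
    return y, cout


def ripple_add(a, b, width, cin=0):
    """Chain `width` slices, carry-out of one feeding carry-in of the next."""
    total = 0
    carry = cin
    for i in range(width):
        y, carry = adder_slice((a >> i) & 1, (b >> i) & 1, carry)
        total |= y << i
    return total, carry


print(ripple_add(11, 6, 4))
```

When A and B differ the carry-in is propagated, and when they are equal either both are 1 (carry generated) or both are 0 (carry killed), so selecting A gives the right carry-out in both cases.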


FPGA’s

Callouts: DSP with 25x18 multiplier; Gigabit ethernet support

               CLB       Dist RAM     Block RAM     Multipliers
Virtex 2       8,448     1,056 kbit   2,592 kbit    144 (18x18)
Virtex 6       667,000   6,200 kbit   22,752 kbit   1,344 (25x18)
Spartan 3E     240       15 kbit      72 kbit       4 (18x18)
Artix-7 A100   7,925     1,188 kbit   4,860 kbit    240 (25x18)


Design Flow - Mapping
• Technology Mapping: Schematic/HDL to Physical Logic units
• Compile functions into basic LUT-based groups (function of
target architecture)

[Figure: gate-level logic feeding a D flip-flop, and the same logic mapped to a LUT feeding a D flip-flop]

always @(posedge clock or negedge reset)
begin
  if (! reset)
    q <= 0;
  else
    q <= (a&b&c)||(b&d);
end
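
Because the next-state expression for q has four inputs, it fits in a single 4-input LUT, and the LUT contents are simply its truth table. The Python sketch below derives that table as a 16-bit word. The function names and the address convention (bit index = a + 2*b + 4*c + 8*d) are assumptions for illustration; a real tool chooses its own input ordering.

```python
def lut4_init(fn):
    """Pack fn(a, b, c, d) into a 16-bit word; bit index = a + 2*b + 4*c + 8*d."""
    init = 0
    for addr in range(16):
        a, b, c, d = ((addr >> i) & 1 for i in range(4))
        if fn(a, b, c, d):
            init |= 1 << addr
    return init


def q_next(a, b, c, d):
    """Combinational part of the always block: (a&b&c)||(b&d)."""
    return (a & b & c) | (b & d)


init = lut4_init(q_next)
print(f"{init:#06x}")
```

The flip-flop and its asynchronous reset are not part of this word; in the mapped design they are provided by the register that sits next to the LUT.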


Design Flow – Placement & Route
• Placement – assign logic location on a particular device

[Figure: LUTs assigned to locations on the device]

• Routing – iterative process to connect CLB inputs/outputs and IOBs. Optimizes critical path
delay – can take hours or days for large, dense designs

Iterate placement if timing not met

Satisfy timing? → Generate Bitstream to config device

Challenge! Cannot use full chip for reasonable speeds (wires are not ideal).
Typically no more than 50% utilization.


Example: Verilog to FPGA

module adder64 (
  input [63:0] a, b,
  output [63:0] sum);

  assign sum = a + b;
endmodule

• Synthesis
• Tech Map
• Place&Route

[Figure: 64-bit Adder Example, Virtex II – XC2V2000]


How are FPGAs Used?

Logic Emulation
• Prototyping
– Ensemble of gate arrays used to emulate a
circuit to be manufactured
– Get more/better/faster debugging done than
with simulation
• Reconfigurable hardware
– One hardware block used to implement more
than one function
• Special-purpose computation engines
– Hardware dedicated to solving one problem
(or class of problems)
– Accelerators attached to general-purpose
computers (e.g., in a cell phone!)

[Figure: FPGA-based Emulator (courtesy of IKOS)]


Summary

• FPGAs provide a flexible platform for implementing digital
computing
• A rich set of macros and I/Os supported (multipliers, block
RAMs, ROMs, high-speed I/O)
• A wide range of applications from prototyping (to validate a
design before ASIC mapping) to high-performance spatial
computing
• Interconnects are a major bottleneck (physical design and
locality are important considerations)


Test Bench

module sample(
  input bit_in,
  input [3:0] bus_in,
  output out_bit,
  output [7:0] out_bus
);
  . . . Verilog . . .
endmodule

module sample_tf;
  // Inputs
  reg bit_in;
  reg [3:0] bus_in;

  // Outputs
  wire out_bit;
  wire [7:0] out_bus;

  // Instantiate the Unit Under Test (UUT)
  sample uut (
    .bit_in(bit_in),
    .bus_in(bus_in),
    .out_bit(out_bit),
    .out_bus(out_bus)
  );

  initial begin
    // Initialize Inputs
    bit_in = 0;
    bus_in = 0;

    // Wait 100 ns for global reset to finish
    #100;

    // Add stimulus here

  end

endmodule
