REC FPGA Seminar IAP 1998: Session 3

This document outlines techniques for optimizing FPGA designs, including black box optimizations, counter designs, and distributed arithmetic. It discusses breaking designs into smaller combinational logic blocks, different types of counter architectures, and how distributed arithmetic can implement multipliers efficiently in FPGAs by serializing operations. Specific optimization examples and implementations are provided for counters, arithmetic, and other common FPGA functions.
Copyright
© Attribution Non-Commercial (BY-NC)

REC FPGA Seminar IAP 1998

Session 3:
Advanced Design Techniques, Optimizations, and Tricks

Robotics and Electronics Cooperative FPGA Seminar IAP 1998

Outline
• Focus on Xilinx 4000E-style FPGA (one of the
most common FPGAs)
• Thinking FPGA
• Black box optimizations
• Counter design
• Distributed arithmetic
• One-hot state machines
• Miscellaneous tricks



Thinking FPGA
• When starting a design, consider the implementation technology
• Architect your design to fit into an FPGA
– memory granularity (16x1, 16x2, 32x1)
– 4 or 5 input logic functions / 4 + 4 and 2-1 mux
• fewer inputs per logic function is wasteful
• more inputs is slower
– routing limitations
• limited number of tristate buffers and longlines
• limited number of clock buffers
– I/O cell features
• flip flops in I/O cells
• special delays and slew rate control


“Black Box” Optimization


• Most basic of FPGA design optimizations
– Essentially performing manual hardware mapping
• Procedure:
– break down design into combinational logic black
boxes
• inputs and outputs with stuff in between
• arbitrarily complex logic inside the box, but the CLB doesn’t care
since it is a LUT anyway
– adjust the “level” of black-boxing until you have
mostly 4 or 5 input functions or 4+4 input and 2-1 mux
functions



“Black Box” Example
• ALU
– implements a 32-bit wide 2-input AND, OR, XOR,
pass-through
• Example worked through on chalkboard
– obvious implementation
• three 32-bit-wide 2-input devices feeding into a mux or a tri-state
bus
– optimized implementation
• 32 4-input devices (two data bits plus two select bits per LUT):
66% or more savings in area; roughly 30-50% speed increase
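The chalkboard example is not reproduced in the slides, so the following is only a minimal sketch of the idea. It assumes a hypothetical 2-bit opcode encoding (00 = AND, 01 = OR, 10 = XOR, 11 = pass A); the names `OPS`, `build_alu_lut`, and `alu32` are illustrative. Each output bit depends on only four signals (two select bits, one bit of A, one bit of B), so one 16-entry table per bit is enough.

```python
# Hypothetical opcode encoding (an assumption, not taken from the seminar).
OPS = {
    0b00: lambda a, b: a & b,  # AND
    0b01: lambda a, b: a | b,  # OR
    0b10: lambda a, b: a ^ b,  # XOR
    0b11: lambda a, b: a,      # pass-through of A
}


def build_alu_lut():
    """Return the 16-entry truth table of one ALU bit slice.

    The address is (s1, s0, a, b), so the slice is a single 4-input function.
    """
    lut = []
    for addr in range(16):
        sel = (addr >> 2) & 0b11
        a = (addr >> 1) & 1
        b = addr & 1
        lut.append(OPS[sel](a, b))
    return lut


def alu32(sel, x, y):
    """Evaluate a 32-bit ALU as 32 independent copies of the same table."""
    lut = build_alu_lut()
    result = 0
    for i in range(32):
        a = (x >> i) & 1
        b = (y >> i) & 1
        result |= lut[(sel << 2) | (a << 1) | b] << i
    return result
```

For example, `alu32(0b10, x, y)` returns the bitwise XOR of two 32-bit values using nothing but table lookups, which is what a column of 32 LUTs would do in hardware.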


Counter Design
• Counters have many design options depending
upon the application
– basic ripple counter
– ripple-carry
– lookahead-carry
– Johnson (mobius)
– linear feedback shift register (LFSR)



Ripple Counter
[Figure: four D flip-flops with reset inputs (R); signals CLK, RESET, and Count out]

• Ripple counters are not recommended in FPGA designs due to their
asynchronous nature
• However, ripple counters are very efficient in terms of area
• k*O(n) delay growth with the number of bits n, where k is large (poor
performance)
• Max counting states is 2^N

Ripple-Carry Counter
[Figure: three counter stages, each an XOR gate feeding a D flip-flop with reset (R), with a chain of AND gates; signals CE, CLK, RESET, TC, and Count out]

• Synchronous design
• k*O(n) delay growth with n bits, k small
• this is the basic counter provided in Xilinx libraries
• good area efficiency
• Max counting states is 2^N
• Loads or sync clears come for free in terms of area and speed
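As a rough behavioral sketch of the structure named on this slide (an XOR in front of each flip-flop, with an AND chain passing the carry along), the model below steps an n-bit counter one clock at a time. The function name and the LSB-first list representation are assumptions made for illustration.

```python
def ripple_carry_step(state, ce=1):
    """One clock edge of a synchronous ripple-carry counter.

    state is a list of flip-flop outputs, LSB first. Returns the next state
    and the carry out of the last stage, which acts as terminal count (TC).
    """
    carry = ce
    next_state = []
    for q in state:
        next_state.append(q ^ carry)  # XOR gate drives the D input
        carry = carry & q             # AND chain: the carry ripples upward
    return next_state, carry
```

The loop makes the k*O(n) delay visible: the carry for bit i is only known after passing through i AND stages.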

Carry-Lookahead Counter
• Like ripple-carry, but the carry input to the nth counter element is
computed using a full sum-of-products of the counter state of the
previous (n-1) bits
• Can have near O(1) delay growth up to a few bits
• Good performance
• Requires a lot of gates
• Combinations of carry-lookahead and ripple-carry can be used to get
the best of both worlds
• Max counting states is 2^N
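A minimal behavioral sketch of the lookahead idea, under the assumption that each bit's toggle enable is one wide AND of CE and all lower bits, computed directly from the current state rather than passed through a chain. The function name and LSB-first representation are illustrative.

```python
def lookahead_step(state, ce=1):
    """One clock edge of a carry-lookahead counter (state is LSB first)."""
    # Every toggle enable is computed independently from the current state,
    # so no enable waits for another stage to settle.
    toggles = [ce & int(all(state[:i])) for i in range(len(state))]
    return [q ^ t for q, t in zip(state, toggles)]
```

The count sequence is the same as for the ripple-carry version; what changes is that the AND for bit i has i + 1 inputs, which is where the extra gates come from.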


Johnson or Mobius Counter


[Figure: four D flip-flops with reset inputs (R); signals CLK, RESET, and Count out]

• O(1) delay growth for most applications


• Well-suited for clock division or count-limit only applications
• Non-binary counter
• Counts to 2 * n, where n is the number of flip flops
• Excellent area and speed characteristics
• Near toggle-rate speeds
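To illustrate the 2 * n count length, here is a small sketch of a Johnson counter modeled as a shift register whose last output is inverted and fed back to the first stage. The function names are hypothetical.

```python
def johnson_step(state):
    """Shift by one stage, feeding the inverted last bit into the first."""
    return [1 - state[-1]] + state[:-1]


def johnson_sequence(n):
    """Return every state visited from reset (all zeros) until it repeats."""
    state = [0] * n
    seq = []
    while True:
        seq.append(tuple(state))
        state = johnson_step(state)
        if tuple(state) == seq[0]:
            return seq
```

With four flip-flops this visits eight states, and consecutive states differ in exactly one bit, so there is no carry logic between stages.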

LFSR Counters
[Figure: four D flip-flops with reset inputs (R) and an XNOR feedback gate; signals CLK, RESET, and Count out]

• O(1) delay growth for most applications


• non-binary counter
• 2^N - 1 states in a pseudorandom sequence
• excellent area and speed characteristics
• near toggle-rate speeds
• ideal for applications where count sequence is irrelevant (FIFO, timers)
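A small sketch of a 4-bit LFSR counter with XNOR feedback. The tap positions (the two most significant bits) are an assumption, since the slide's figure does not survive; they are a commonly tabulated maximal-length choice for 4 bits. Function names are illustrative.

```python
def lfsr4_next(state):
    """Advance a 4-bit XNOR-feedback LFSR by one clock."""
    b3 = (state >> 3) & 1
    b2 = (state >> 2) & 1
    feedback = 1 ^ b3 ^ b2  # XNOR of the two tapped bits
    return ((state << 1) | feedback) & 0xF


def lfsr4_sequence(start=0):
    """List the states visited from start until the sequence repeats."""
    seq = [start]
    state = lfsr4_next(start)
    while state != start:
        seq.append(state)
        state = lfsr4_next(state)
    return seq
```

Starting from the reset state 0000, the counter visits 15 distinct states. With XNOR feedback the one unused state is 1111, which maps to itself, so an all-zeros reset is safe.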

LFSR application
• FIFO application
– Count sequence doesn’t matter
• just need to address unique memory locations
• last count value and half-full count values can be
predetermined and logic created to detect these conditions
– Saves area, increases performance
• no carry look-ahead structures, O(1) delay growth with
increasing FIFO depth
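The sketch below shows one way the FIFO idea could look in software, assuming the same 4-bit XNOR LFSR as an address generator (taps are an assumption) and a hypothetical `LfsrFifo` class. The read and write pointers step through addresses in LFSR order, and special count values are found offline by stepping the LFSR.

```python
def lfsr4_next(state):
    """4-bit XNOR-feedback LFSR step (tap choice is an assumption)."""
    feedback = 1 ^ ((state >> 3) & 1) ^ ((state >> 2) & 1)
    return ((state << 1) | feedback) & 0xF


def state_after(steps, start=0):
    """Precompute the LFSR value reached after a given number of clocks."""
    state = start
    for _ in range(steps):
        state = lfsr4_next(state)
    return state


class LfsrFifo:
    """FIFO whose pointers advance in LFSR order instead of binary order."""

    def __init__(self):
        self.mem = [None] * 16  # address 0b1111 is never generated
        self.wr = 0
        self.rd = 0

    def push(self, value):
        nxt = lfsr4_next(self.wr)
        if nxt == self.rd:  # one slot stays empty to tell full from empty
            raise OverflowError("FIFO full")
        self.mem[self.wr] = value
        self.wr = nxt

    def pop(self):
        if self.rd == self.wr:
            raise IndexError("FIFO empty")
        value = self.mem[self.rd]
        self.rd = lfsr4_next(self.rd)
        return value
```

Here `state_after(14)` gives the last address before the sequence wraps and `state_after(7)` gives the address seven steps in; in hardware such constants would be compared against the pointer with a small decoder.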



Distributed Arithmetic
• Parallel multipliers are expensive to implement in
FPGAs
– requires very wide logic functions or the use of carry-
chains
– hardware and delay growth O(n^2)
• Distributed arithmetic serializes multiplies using
partial products
– partial products can be computed in parallel
– serialized multiplies fit well into FPGA architectures
– can achieve same throughput as parallel multiplier
silicon macros but with longer latency

Distributed Arithmetic
• DA takes advantage of associative and commutative properties of
addition

Digit nomenclature: A = a_n a_(n-1) ... a_2 a_1

In base 10:
A * B = P_n + P_(n-1) + ... + P_2 + P_1, where P_n = A * b_n * 10^(n-1)
So 42 * 121 = 42 * 1 * 100 + 42 * 2 * 10 + 42 * 1 * 1

In base 2:
A * B = P_n + P_(n-1) + ... + P_2 + P_1, where P_n = A * b_n * 2^(n-1)
So 101 * 1101 = ((101 * 1) << 3) + ((101 * 1) << 2) + ((101 * 0) << 1) + ((101 * 1) << 0)

The multiply operator breaks down to an AND operation in one-digit binary; be
careful of sign extensions for signed numbers!
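The base-2 decomposition can be checked with a few lines of code. This sketch handles unsigned operands only (sign extension is deliberately left out), and the function names are illustrative.

```python
def partial_products(a, b, n_bits):
    """Return [P_1, ..., P_n] with P_i = (A AND b_i) << (i - 1), unsigned."""
    products = []
    for i in range(n_bits):
        bit = (b >> i) & 1
        gated = a if bit else 0  # one-digit binary multiply is an AND
        products.append(gated << i)
    return products


def multiply(a, b, n_bits):
    """Multiply by summing the partial products."""
    return sum(partial_products(a, b, n_bits))
```

For the slide's example, `partial_products(0b101, 0b1101, 4)` yields 5, 0, 20, and 40, which sum to 65.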



Distributed Arithmetic
• Looking at the relation
101 * 1101 = (101 * 1) << 3 + (101 * 1) << 2 + (101 * 0) << 1 + (101 * 1) << 0

• One sees a basic functional unit: the scaling multiply. This, combined with
an accumulator and a bit-serial input stream (via a “time skew buffer”), is the
essence of the DA multiplier
• Note that the DA implementation discussed here works best for constant *
variable expressions, which makes it ideally suited for applications such as
convolutions and DSP filters
• Replace the (A * b_n) multiply kernel with a lookup table instead of several
AND gates
• LUTs in some architectures are more efficient than AND gates
• Time to compute = number of bits in input * time to do scaling multiply
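A behavioral sketch of the one-tap case (constant C0 times a variable), assuming a two's complement input fed LSB first. The two-word table, the halving accumulator, and the subtraction on the sign bit follow the description above; the extra `n_bits` of headroom is a modeling choice that keeps the integer arithmetic exact, and all names are hypothetical.

```python
def serial_da_multiply(x, c0, n_bits):
    """Compute x * c0 bit-serially; x is an n_bits-wide signed integer."""
    lut = [0, c0]  # two-word table addressed by the current input bit
    acc = 0
    for i in range(n_bits):
        bit = (x >> i) & 1
        term = lut[bit] << n_bits  # headroom so that halving loses nothing
        if i == n_bits - 1:
            acc -= term  # the sign bit carries weight -2^(n_bits - 1)
        else:
            acc += term
        acc >>= 1  # scaling accumulator: halve once per clock
    return acc
```

The loop runs once per input bit, matching "time to compute = number of bits in input * time to do scaling multiply".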


Distributed Arithmetic
• Implementation for variable * C0; computes result in N clock cycles
– diagram courtesy Xilinx
[Figure: an N-bit-wide shift register (PSC, LSB first) holds sample data X0 and drives address input A[0] of a 2-word by X-bit look-up table containing 0 (...000000) and C0. The table output feeds a scaling accumulator (labels: SE, +/-, REGISTER, 2^-1) that produces the filtered data out.]



Distributed Arithmetic
• so what?
– the real power of DA comes in when you try to do
multiple-tap FIR filters
y[n] = Σ x[k] * h[n - k]
y[1] = x[0] * h[1] + x[1] * h[0]
Example: 101 * 011 + 110 * 100
= ((101 * 0) << 2) + ((101 * 1) << 1) + ((101 * 1) << 0) +
  ((110 * 1) << 2) + ((110 * 0) << 1) + ((110 * 0) << 0)
= (((101 * 0) + (110 * 1)) << 2) +
  (((101 * 1) + (110 * 0)) << 1) +
  (((101 * 1) + (110 * 0)) << 0)
These boxes (the bracketed sums at each bit weight) are about as complex
as the boxes used in the one-tap case!
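The regrouping by bit weight can be verified numerically. This sketch assumes unsigned values and treats the first operand of each product as the constant; `boxes_by_weight` and `grouped_sum_of_products` are illustrative names.

```python
def boxes_by_weight(consts, xs, n_bits):
    """For each bit weight, add the constants selected by that bit of each x."""
    return [
        sum(c * ((x >> i) & 1) for c, x in zip(consts, xs))
        for i in range(n_bits)
    ]


def grouped_sum_of_products(consts, xs, n_bits):
    """Sum of products computed one bit weight at a time."""
    boxes = boxes_by_weight(consts, xs, n_bits)
    return sum(box << i for i, box in enumerate(boxes))
```

For 101 * 011 + 110 * 100, the per-weight boxes are 5, 5, and 6 (weights 0, 1, 2), and the shifted sum is 39, the same as 5 * 3 + 6 * 4.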


Distributed Arithmetic for a 3-Tap Filter


Bit weights of each 4-bit operand: -2^3 2^2 2^1 2^0

Tap 0: 1001 (-7) x 0111 (7) = 11001111 (-49)
Tap 1: 0110 ( 6) x 0101 (5) = 00011110 ( 30)
Tap 2: 0010 ( 2) x 0111 (7) = 00001110 ( 14)

Sums of equal-weight partial products (multiplier bits taken LSB first):
bit 0: (1001 + 0110 + 0010) = 0001
bit 1: (1001 + 0000 + 0010) = 1011
bit 2: (1001 + 0110 + 0010) = 0001
bit 3: (0000 + 0000 + 0000) = 0000

Total: (-49) + 30 + 14 = 11111011 (-5)

• Partial products of equal weight are added together
before being summed to the next higher partial product
weight.
• The original slide marks the sign-extension bits.

(slide courtesy Xilinx)


Distributed Arithmetic
[Figure: three N-bit-wide sample-data shift registers (PSC, LSB first) hold X0, X1, and X2 and drive address bits A0, A1, and A2 of an 8-word by X-bit look-up table. The table output feeds a scaling accumulator (labels: SE, +/-, REGISTER, 2^-1) that produces the filtered data out.]

Look-up table contents, addressed by A[2:0]:
A[2:0]   Contents
000      0 (...000000)
001      C0
010      C1
011      C1 + C0
100      C2
101      C2 + C0
110      C2 + C1
111      C2 + C1 + C0

• LUT contains the sums of all the partial products.

(slide courtesy Xilinx)
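Putting the table of coefficient sums together with a scaling accumulator gives a serial DA sum of products. The following is a behavioral sketch only: it assumes two's complement samples fed LSB first and uses extra accumulator headroom to stay exact, and `build_da_lut` and `da_sum_of_products` are hypothetical names.

```python
def build_da_lut(coeffs):
    """Table entry at address A is the sum of the coefficients whose bit is set in A."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_sum_of_products(coeffs, samples, n_bits):
    """Compute sum(coeffs[k] * samples[k]) in n_bits clocks, one lookup per clock."""
    lut = build_da_lut(coeffs)
    acc = 0
    for i in range(n_bits):
        # Bit i of every sample forms the table address (tap k drives bit k).
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> i) & 1) << k
        term = lut[addr] << n_bits  # headroom so that halving loses nothing
        if i == n_bits - 1:
            acc -= term  # sign bits carry negative weight
        else:
            acc += term
        acc >>= 1  # scaling accumulator
    return acc
```

With constants (-7, 6, 2) and variables (7, 5, 7) at 4 bits, the numbers from the 3-tap example, this returns -5. The number of clocks depends on the sample width, not on the number of taps; adding a tap doubles the table instead.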



Distributed Arithmetic
• k*O(2^n) + j*O(1), k is relatively small (for area)
• very close to O(1) performance scaling
• DA can be parallelized and pipelined to gain even
more performance
– Each bit can have its own LUT and adder
– All bits computed in parallel
– One result per clock cycle max throughput



8-Tap Symmetric Slice
(8-Bit Example)

[Figure (courtesy Xilinx): labels include inputs A[7:0], B[7:0], C[7:0], and D[7:0]; per-bit groups (A7, B7, C7, D7) through (A0, B0, C0, D0), each with coefficients C0 to C3; adders with x1/2, x1/4, and x1/16 scaling; and a legend distinguishing rounding adders from sign-extended adders.]



Distributed Arithmetic
• Performance
– Serial Distributed Arithmetic (SDA), 10-tap FIR
• 7.8 Msamp/s for 8 bit samples @ 42 CLBs
• 4.1 Msamp/s for 16 bit samples @ 50 CLBs
• old numbers; probably 50% faster now
– Parallel Distributed Arithmetic (PDA), 8-tap FIR
• 50-70 Msamp/s for 8 bit samples @ 122 CLBs
• pipelined, hand-optimized
– For reference, the XC4008E has 324 CLBs (18 x 18 array)



One-Hot State Machines
• Conventional state machines use log2(states) bits to implement
function
– output is decoded from state number
– next state is a combinational function of states
– state transition rate limited by state number decoding and next
state logic delays
• One-hot state machines use as many bits as there are states to
implement function
– only one flip flop storing “1” at any time
– output is decoded as an OR of appropriate state FFs
– state transition rate limited only by next state logic delays, which
in many cases is zero
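A small sketch of the one-hot style, using a hypothetical three-state controller (IDLE, RUN, DONE) with made-up inputs `start` and `finished`. Each state has its own bit; each next-state bit is a short OR of AND terms, and the output is an OR of state bits.

```python
# Hypothetical three-state controller; one bit (flip-flop) per state.
IDLE, RUN, DONE = 0b001, 0b010, 0b100


def next_state(state, start, finished):
    """Compute the next one-hot state; each bit has its own small equation."""
    idle, run, done = (bool(state & mask) for mask in (IDLE, RUN, DONE))
    n_idle = (idle and not start) or done
    n_run = (idle and start) or (run and not finished)
    n_done = run and finished
    return (IDLE if n_idle else 0) | (RUN if n_run else 0) | (DONE if n_done else 0)


def busy(state):
    """Output decoded as an OR of the relevant state flip-flops."""
    return bool(state & (RUN | DONE))
```

No state-number decoding is needed: the DONE to IDLE transition, for instance, is just one flip-flop feeding a term of another's input equation.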


Miscellaneous Tricks
• Tri-state mux
– saves on area, especially for wide muxes
– may have better or worse performance depending on architecture
and device characteristics
– not shown in illustration is decoder for tri-state buffers



Miscellaneous Tricks
• Use IOBs to register inputs
– gives faster setup/hold times (eliminates routing delays from setup
time)
– introduces additional latency
– can save on logic array flip flop usage
• Inverters come for free in most architectures
• Use longlines for timing-critical signals
– use sparingly since this is a precious resource in Xilinx 4K
architectures
– all wires in Altera “Fast Track” architecture are longlines so routes
are always “fast”
• Use pipeline stages to improve pin-locked routing in Altera 8K designs
• When you can afford it, pipeline your design
– latency versus clock speed tradeoff
• Double-wide half-rate logic (area versus speed)