REC FPGA Seminar IAP 1998: Session 3
Session 3:
Advanced Design Techniques, Optimizations, and Tricks
Outline
• Focus on Xilinx 4000E-style FPGA (one of the
most common FPGAs)
• Thinking FPGA
• Black box optimizations
• Counter design
• Distributed arithmetic
• One-hot state machines
• Miscellaneous tricks
Counter Design
• Counters have many design options depending
upon the application
– basic ripple counter
– ripple-carry
– lookahead-carry
– Johnson (Möbius)
– linear feedback shift register (LFSR)
[Figure: four D flip-flops with a CLK input and a shared RESET]
Ripple-Carry Counter
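[Figure: ripple-carry counter; flip-flops linked by a chain of AND gates, with common CLK and RESET, a count output, and a terminal count (TC) output]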
• Synchronous design
• k*O(n) delay growth with n bits, k small
• this is the basic counter provided in Xilinx libraries
• good area efficiency
• Max counting states is 2^N
• Loads or sync clears come for free in terms of area and speed
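The ripple-carry behavior described above can be sketched as a small Python simulation. This is only an illustrative model, not the Xilinx library implementation: the function names and the always-on count enable are assumptions made for the example. Each stage toggles when its carry-in is 1, and the carry passes through one AND gate per stage, which is where the O(n) delay comes from.

```python
def ripple_carry_step(bits):
    """Advance a synchronous ripple-carry counter by one clock edge.

    bits[0] is the LSB. Returns (next_bits, terminal_count).
    """
    carry = 1  # count enable into the first stage (assumed always on)
    nxt = []
    for q in bits:
        nxt.append(q ^ carry)  # this flip-flop toggles if its carry-in is 1
        carry = carry & q      # one AND gate per stage: the carry "ripples"
    terminal_count = carry     # 1 only when every bit was 1 before the edge
    return nxt, terminal_count


def run_counter(n_bits, clocks):
    """Return the count values seen over the given number of clocks."""
    bits = [0] * n_bits        # state after RESET
    values = []
    for _ in range(clocks):
        values.append(sum(b << i for i, b in enumerate(bits)))
        bits, _tc = ripple_carry_step(bits)
    return values


if __name__ == "__main__":
    print(run_counter(4, 16))
```

A 4-bit instance steps through all 2^4 states in binary order before wrapping, and the terminal count goes high in the all-ones state.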
Robotics and Electronics Cooperative FPGA Seminar IAP 1998
Carry-Lookahead Counter
• Like ripple-carry, but the carry input to the nth counter element is computed
using a full sum-of-products of the previous (n-1) bits of counter state
• Can have near-O(1) delay growth up to a few bits
• Good performance
• Requires a lot of gates
• Combinations of carry-lookahead and ripple-carry can be used to get
the best of both worlds
• Max counting states is 2^N
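As a rough illustration of the trade-off above, the sketch below models a lookahead counter in Python. The function names are hypothetical, and the gate-input count is a simplified cost measure chosen for this example, not a CLB estimate. Every stage computes its toggle enable with one wide AND over all lower bits, so there is no chain to ripple through, but the total AND fan-in grows quadratically with the number of bits.

```python
def lookahead_step(bits):
    """Advance a carry-lookahead counter by one clock (bits[0] is the LSB)."""
    # Stage i toggles when ALL lower bits are 1. Each enable is computed
    # directly from the counter state rather than from the previous carry.
    enables = [int(all(bits[:i])) for i in range(len(bits))]
    return [q ^ t for q, t in zip(bits, enables)]


def and_inputs(n_bits):
    """Total AND-gate inputs if stage i uses a single i-input AND."""
    return sum(range(n_bits))  # 0 + 1 + ... + (n_bits - 1)


def run_counter(n_bits, clocks):
    """Return the count values seen over the given number of clocks."""
    bits = [0] * n_bits
    values = []
    for _ in range(clocks):
        values.append(sum(b << i for i, b in enumerate(bits)))
        bits = lookahead_step(bits)
    return values


if __name__ == "__main__":
    print(run_counter(4, 16))
    print([and_inputs(n) for n in (4, 8, 16)])
```

The count sequence is identical to the ripple-carry counter; only the way the carries are formed differs, which is why the two styles can be mixed within one counter.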
[Figure: four D flip-flops with common CLK and RESET]
[Figure: four D flip-flops with common CLK and RESET and an XNOR gate in the feedback path]
LFSR application
• FIFO application
– Count sequence doesn’t matter
• just need to address unique memory locations
• last count value and half-full count values can be
predetermined and logic created to detect these conditions
– Saves area, increases performance
• no carry look-ahead structures, O(1) delay growth with
increasing FIFO depth
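A minimal Python sketch of the idea, assuming a 4-bit shift register with XNOR feedback taken from the two most significant bits (the tap choice and function names are assumptions for this example, not taken from the slides). The feedback is a single gate regardless of how the sequence is used, and the "last" and "half-full" addresses are simply looked up from the precomputed sequence.

```python
def lfsr_step(state, n_bits=4, taps=(3, 2)):
    """Shift left by one, inserting the XNOR of the tapped bits at the LSB."""
    fb = 1                          # starting from 1 turns the XOR into XNOR
    for t in taps:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << n_bits) - 1)


def lfsr_sequence(n_bits=4, taps=(3, 2)):
    """List every state visited, starting from the all-zeros reset state."""
    state, seq = 0, []
    while True:
        seq.append(state)
        state = lfsr_step(state, n_bits, taps)
        if state == 0:              # back at the reset state: cycle complete
            return seq


if __name__ == "__main__":
    seq = lfsr_sequence()
    print(seq)
    print("last address:", seq[-1])
    print("half-full address:", seq[len(seq) // 2])
```

With these taps the counter visits 2^4 - 1 = 15 distinct addresses; the all-ones state is never reached from reset and would lock up an XNOR-feedback register, so it must be avoided.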
Distributed Arithmetic
• DA takes advantage of associative and commutative properties of
addition
In base 10:
In base 2:
A * B = P_n + P_{n-1} + ... + P_2 + P_1, where P_n = A * b_n * 2^{n-1}
(b_n is bit n of B, counting from 1 at the LSB)
So 101 * 1101 = ((101 * 1) << 3) + ((101 * 1) << 2) + ((101 * 0) << 1) + ((101 * 1) << 0)
The multiply operator reduces to an AND operation for one-digit binary operands; be
careful with sign extension for signed numbers!
• This reveals a basic functional unit: the scaling multiply. Combined with
an accumulator and a bit-serial input stream (via a “time skew buffer”), it is the
essence of the DA multiplier
• Note that the DA implementation discussed here works best for constant *
variable expressions, which makes it well suited to applications such as
convolutions and DSP filters
• Replace the (A * b_n) multiply kernel with a lookup table instead of several
AND gates
• LUTs in some architectures are more efficient than AND gates
• Time to compute = number of bits in input * time to do scaling multiply
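The decomposition above can be checked with a short Python sketch. It handles unsigned operands only (sign extension is deliberately left out), and the function name is made up for the example. Each loop iteration is one "scaling multiply": A is ANDed with a single bit of B, shifted to that bit's weight, and accumulated.

```python
def shift_add_multiply(a, b, n_bits):
    """Multiply unsigned a by the low n_bits of unsigned b, one bit per step."""
    acc = 0
    for i in range(n_bits):         # one step per bit of b, LSB first
        b_i = (b >> i) & 1
        partial = a if b_i else 0   # A * b_i reduces to ANDing A with the bit
        acc += partial << i         # scale by the bit's weight and accumulate
    return acc


if __name__ == "__main__":
    # The example from the text: 101 * 1101 in binary (5 * 13).
    print(bin(shift_add_multiply(0b101, 0b1101, 4)))
```

The loop runs once per bit of B, which matches the statement that compute time is the number of input bits times the time for one scaling multiply.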
Distributed Arithmetic
• Implementation for variable * C0; computes result in N clock cycles
– diagram courtesy Xilinx
[Figure: serial DA datapath; N-bit-wide sample data enters a shift register (PSC, LSB first) whose output addresses a look-up table, and a scaling accumulator with add/subtract and a register produces the filtered data out]
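The serial structure can be approximated in Python as follows. This is a behavioral sketch under stated assumptions, not the Xilinx design: it assumes two's complement samples, a small set of constant coefficients, and hypothetical function names. The look-up table holds the sum of coefficients for every combination of input bits, the samples are consumed one bit per clock (LSB first), and the accumulator subtracts on the final (sign) bit.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every set bit k of addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def serial_da(samples, coeffs, n_bits):
    """Compute sum(coeffs[k] * samples[k]) in n_bits steps, LSB first.

    samples are signed integers that fit in n_bits two's complement.
    """
    lut = build_lut(coeffs)
    acc = 0
    for bit in range(n_bits):       # one clock cycle per input bit
        # Bit `bit` of every sample forms the LUT address.
        addr = sum(((x >> bit) & 1) << k for k, x in enumerate(samples))
        term = lut[addr] << bit     # scale by the weight of this bit
        if bit == n_bits - 1:
            acc -= term             # the sign bit has negative weight
        else:
            acc += term
    return acc


if __name__ == "__main__":
    samples = [3, -2, 7, -8]        # 4-bit two's complement values
    coeffs = [5, -3, 2, 1]
    print(serial_da(samples, coeffs, 4))
    print(sum(c * x for c, x in zip(coeffs, samples)))
```

No multiplier appears anywhere in the loop: each cycle is one table read and one add or subtract. The table has 2^K entries for K coefficients, which is the area cost of the technique.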
Distributed Arithmetic
• Area grows as k*O(2^n) + j*O(1), where k is relatively small
• Very close to O(1) performance scaling
• DA can be parallelized and pipelined to gain even
more performance
– Each bit can have its own LUT and adder
– All bits computed in parallel
– One result per clock cycle max throughput
[Figure: parallel DA datapath; inputs A[7:0], B[7:0], C[7:0], D[7:0]; each bit position (A7 B7 C7 D7 down to A0 B0 C0 D0) addresses its own look-up table of coefficients C0 to C3; the table outputs are scaled (x1/2, x1/4, x1/16) and summed in an adder tree; + = rounding adder]
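A parallel version can be sketched the same way in Python. The assumptions here are unsigned 8-bit samples, four constant coefficients, hypothetical function names, and left shifts in place of the fractional scaling and rounding adders shown in the figure. Every bit position gets its own table lookup, and the scaled results are combined in an adder tree rather than accumulated over several clocks.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every set bit k of addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def adder_tree(values):
    """Sum the values pairwise, level by level, like a hardware adder tree."""
    while len(values) > 1:
        paired = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:         # an odd element passes to the next level
            paired.append(values[-1])
        values = paired
    return values[0]


def parallel_da(samples, coeffs, n_bits=8):
    """Compute sum(coeffs[k] * samples[k]) for unsigned n_bits-wide samples."""
    lut = build_lut(coeffs)
    partials = []
    for bit in range(n_bits):       # in hardware these lookups run in parallel
        addr = sum(((x >> bit) & 1) << k for k, x in enumerate(samples))
        partials.append(lut[addr] << bit)
    return adder_tree(partials)


if __name__ == "__main__":
    samples = [200, 17, 99, 255]
    coeffs = [3, -1, 4, 2]
    print(parallel_da(samples, coeffs))
    print(sum(c * x for c, x in zip(coeffs, samples)))
```

The cost is one table per bit position plus the adder tree, in exchange for a new result on every clock once the tree is pipelined.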
Distributed Arithmetic
• Performance
– Serial Distributed Arithmetic (SDA), 10-tap FIR
• 7.8 Msamp/s for 8 bit samples @ 42 CLBs
• 4.1 Msamp/s for 16 bit samples @ 50 CLBs
• old numbers; probably 50% faster now
– Parallel Distributed Arithmetic (PDA), 8-tap FIR
• 50-70 Msamp/s for 8 bit samples @ 122 CLBs
• pipelined, hand-optimized
– For reference, the XC4008E has 324 CLBs (18 x 18 array)
Miscellaneous Tricks
• Tri-state mux
– saves on area, especially for wide muxes
– may have better or worse performance depending on architecture
and device characteristics
– not shown in illustration is decoder for tri-state buffers
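A behavioral sketch of the tri-state mux in Python, including the decoder that the illustration omits. The names and the use of the string "Z" for a high-impedance output are conventions chosen for this example. The decoder turns the select value into one-hot enables, each buffer either drives its input onto the shared line or floats, and the line takes the value of the single active driver.

```python
HI_Z = "Z"  # marker for a buffer that is not driving the shared line


def decoder(select, n):
    """One-hot decode: exactly one enable is True for a valid select."""
    return [i == select for i in range(n)]


def tristate_buffer(value, enable):
    """Drive `value` when enabled, otherwise float."""
    return value if enable else HI_Z


def resolve_bus(drivers):
    """Return the value on the shared line, rejecting contention or a floating bus."""
    active = [d for d in drivers if d != HI_Z]
    if len(active) != 1:
        raise ValueError("bus must have exactly one active driver")
    return active[0]


def tristate_mux(inputs, select):
    """Select one of `inputs` using tri-state buffers on a shared line."""
    enables = decoder(select, len(inputs))
    drivers = [tristate_buffer(v, en) for v, en in zip(inputs, enables)]
    return resolve_bus(drivers)


if __name__ == "__main__":
    print(tristate_mux([10, 20, 30, 40], 2))
```

Widening the mux adds one buffer and one decoder output per input rather than another level of gating, which is the source of the area saving for wide muxes.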