Fast Fourier Transform: VLSI Architectures: Vladimir Stojanović
VLSI Architectures
Lecture 10
Vladimir Stojanović
Cite as: Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
Examples
Radix-2
(1) R2MDC (N = 16), multi-path delay commutator: commutators C2 and butterflies BF2, with delay elements of 8, 4, 2, and 1
(2) R2SDF (N = 16), single-path delay feedback: four BF2 stages with feedback delays of 8, 4, 2, and 1
Radix-4
(3) R4SDF (N = 256), single-path delay feedback: four BF4 stages with feedback delays of 3x64, 3x16, 3x4, and 3x1
(4) R4MDC (N = 256), multi-path delay commutator: commutators C4 and butterflies BF4, with delay elements of 192/128/64, 48/32/16, 12/8/4, and 3/2/1
(5) R4SDC (N = 256), single-path delay commutator: delay commutators DC6x64, DC6x16, DC6x4, and DC6x1, each followed by a BF4
R2MDC (radix-2 multi-path delay commutator)
The most classical approach to pipeline implementation of the radix-2 FFT
The input sequence is broken into two parallel data streams flowing forward, with the correct distance between the data elements entering the butterfly scheduled by proper delays
Both butterflies and multipliers are at 50% utilization
[Figure: R2SDF pipeline for N = 16: four BF2 stages with feedback delays of 8, 4, 2, and 1.]
R2SDF (radix-2 single-path delay feedback)
A single data stream goes through the multiplier at every stage
Multiplier utilization is also 50%
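The delay-feedback idea can be sketched for a single stage as follows. This is a minimal illustration, not the circuit from the slides: the function name `sdf_stage`, the integer test data, and the omission of the twiddle multiplier are all assumptions made for brevity. It shows why the butterfly is busy only half of the time: for `delay` cycles the stage only fills its feedback register, and for the next `delay` cycles it adds and subtracts.

```python
from collections import deque

def sdf_stage(samples, delay):
    """Simulate one radix-2 single-path delay-feedback stage (no twiddle multiplier)."""
    fifo = deque([0] * delay)          # feedback shift register, initially zeroed
    out = []
    for t, x in enumerate(samples):
        stored = fifo.popleft()
        if (t // delay) % 2 == 0:
            # Butterfly idle: stored values drain out while new input is shifted in.
            out.append(stored)
            fifo.append(x)
        else:
            # Butterfly active: the sum goes forward, the difference is fed back.
            out.append(stored + x)
            fifo.append(stored - x)
    return out

N = 8
frame = [1, 2, 3, 4, 5, 6, 7, 8]
# N/2 trailing zeros stand in for the next frame and flush the stored differences.
stream = sdf_stage(frame + [0] * (N // 2), delay=N // 2)
result = stream[N // 2:]               # drop the initial fill latency
print(result)                          # [6, 8, 10, 12, -4, -4, -4, -4]
```

The first four outputs are the sums x(n) + x(n+N/2) and the last four are the differences x(n) - x(n+N/2), which leave the register while the next frame would be loading.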
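R4SDF [Despain74]
[Figure: R4SDF pipeline for N = 16, input samples x0 to x15: two BF4 stages (each a 4-point DFT) with feedback delays of 3x4 and 3x1.]
[Figure: radix-4 butterfly with inputs x(n), x(n+N/4), x(n+N/2), x(n+3N/4), internal coefficients -j and -1, output twiddle factors W_N^n, W_N^{2n}, W_N^{3n}, and outputs y(n), y(n+N/4), y(n+N/2), y(n+3N/4).]
R4MDC [Swartzlander84]
[Figure: R4MDC pipeline for N = 16, input samples x0 to x15: commutators C4 and butterflies BF4, with delay elements of 12, 8, 4 and 3, 2, 1.]
Figure by MIT OpenCourseWare.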
Butterflies? Multipliers?
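As a reference point for the BF4 blocks, here is a hedged sketch of a radix-4 decimation-in-frequency butterfly in Python. The function name and argument order are illustrative assumptions, and the outputs are indexed by DFT bin k = 0..3 rather than by the output labels of the figure, whose ordering cannot be recovered from the extracted text. The 4-point DFT itself needs only additions and multiplications by -j; the three twiddle factors W_N^n, W_N^{2n}, W_N^{3n} are the only general complex multiplications.

```python
import cmath

def radix4_butterfly(x0, x1, x2, x3, n, N):
    """4-point DFT of x(n), x(n+N/4), x(n+N/2), x(n+3N/4), then twiddle by W_N^(k*n)."""
    a = x0 + x2
    b = x0 - x2
    c = x1 + x3
    d = -1j * (x1 - x3)                  # multiplying by -j only swaps and negates parts

    def w(e):
        return cmath.exp(-2j * cmath.pi * e / N)   # W_N^e

    return (a + c,                       # bin k = 0, no twiddle
            (b + d) * w(n),              # bin k = 1, times W_N^n
            (a - c) * w(2 * n),          # bin k = 2, times W_N^(2n)
            (b - d) * w(3 * n))          # bin k = 3, times W_N^(3n)

# With N = 4 and n = 0 all twiddles are 1, so this is a plain 4-point DFT.
print(radix4_butterfly(1, 2, 3, 4, 0, 4))
```

Applying this butterfly for n = 0..N/4-1 and then running an N/4-point DFT over each bin k yields X(4r + k), which is the first stage of a radix-4 pipeline.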
R4SDC (radix-4 single-path delay commutator)
[Figure: R4SDC stage structure: input, commutator, butterfly element, commutator, butterfly element, with control signals c1, c2, c3 and c4, c5, c6 and a coefficient input. Shown with the radix-4 butterfly: inputs x(n), x(n+N/4), x(n+N/2), x(n+3N/4), internal coefficients j, -1, -j, output twiddle factors W_N^n, W_N^{2n}, W_N^{3n}, and outputs y(n), y(n+N/4), y(n+N/2), y(n+3N/4).]
Figure by MIT OpenCourseWare.
Modified radix-4 algorithm
Programmable radix-4 butterfly (BF)
75% utilization
Used to build one of the largest single-chip FFTs (8192 points) [Bidet95]
[Figure: R4SDC commutator timing for N = 16: the input x(n) and its delayed copies, drawn as sample indices 0 to 15 against time from t' to t'+16T (sample period T), selected by 2:1 multiplexers.]
Commutator control signals:
mt  0 1 2 3
c1  1 0 0 0
c2 c3 1 1 1 1 0 1 0 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 stage 1
0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
0 0 0 0 0 1 2 3 0 2 4 6 0 3 6 9 stage 2
0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
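Two of the index rows listed above appear to follow simple rules for a radix-4, N = 16 transform, and a short sketch can regenerate them. This is an interpretation of the extracted numbers, not something stated on the slide: the row 0 0 0 0 0 1 2 3 0 2 4 6 0 3 6 9 matches the twiddle exponents k*n of W_16, and the row 0 4 8 12 1 5 9 13 ... matches base-4 digit reversal of 0..15. The helper name `digit_reverse` is illustrative.

```python
N, R = 16, 4

# Exponent e of the twiddle W_N^e = W_N^(k*n) for butterfly output k and position n.
twiddle_exponents = [k * n for k in range(R) for n in range(N // R)]

def digit_reverse(i, radix, digits):
    """Reverse the base-`radix` digits of i, using `digits` digits."""
    r = 0
    for _ in range(digits):
        r = r * radix + i % radix
        i //= radix
    return r

output_order = [digit_reverse(i, R, 2) for i in range(N)]

print(twiddle_exponents)  # [0, 0, 0, 0, 0, 1, 2, 3, 0, 2, 4, 6, 0, 3, 6, 9]
print(output_order)       # [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```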
Butterfly control signals (0 = addition, 1 = subtraction):
mt  0 1 2 3
c4  0 1 0 1
c5  0 0 1 1
c6  0 1 1 0
Some conclusions
Delay-feedback approaches are always more efficient than the corresponding delay-commutator approaches in terms of memory utilization
Where spatial regularity is preserved in the signal-flow graph (SFG), arithmetic operations can be tightly scheduled for efficient hardware utilization
Decomposition: a review
Cascading the twiddle factor decomposition leads to new forms of FFT with high spatial regularity
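The slide equations did not survive extraction, so the following is a hedged sketch of the cascaded decomposition rather than a transcription. It assumes the usual radix-2^2 index map n = (N/2)n1 + (N/4)n2 + n3 and k = k1 + 2k2 + 4k3, and the function name `fft_radix22` is illustrative. Two radix-2 butterfly steps are taken together: the twiddle between them reduces to a trivial (-j)^(k1+2k2), and the only full complex multiplication per group of four is W_N^(n3(k1+2k2)).

```python
import cmath

def fft_radix22(x):
    """Radix-2^2 DIF FFT for len(x) a power of 4; returns X(k) in natural order."""
    N = len(x)
    if N == 1:
        return list(x)
    q = N // 4
    X = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            sub = []
            for n3 in range(q):
                # First butterfly step: pairs that are N/2 apart.
                s0 = x[n3] + (-1) ** k1 * x[n3 + 2 * q]
                s1 = x[n3 + q] + (-1) ** k1 * x[n3 + 3 * q]
                # Trivial twiddle (-j)^(k1 + 2*k2), then the second butterfly step (N/4 apart).
                h = s0 + (-1j) ** (k1 + 2 * k2) * s1
                # The only full complex multiplication: W_N^(n3 * (k1 + 2*k2)).
                sub.append(h * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / N))
            sub_dft = fft_radix22(sub)   # N/4-point DFT over n3
            for k3 in range(q):
                X[k1 + 2 * k2 + 4 * k3] = sub_dft[k3]
    return X

x = [complex(i * i % 7, i % 3) for i in range(16)]
print(fft_radix22(x)[0])   # equals the sum of all samples
```

The recursion has the butterfly structure of radix-2 but places a non-trivial multiplier only after every second butterfly step, which is what gives radix-4 multiplicative complexity.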
Radix-2^2 approach
A 16-point example
[Figure: radix-2^2 DIF signal-flow graph for N = 16, first decomposition step. Inputs x(0) to x(15) in natural order pass through BF I and trivial -j multiplications into N/4-point DFT blocks labeled (k1=0, k2=1), (k1=1, k2=0), and (k1=1, k2=1). Outputs appear in bit-reversed order: X(0), X(8), X(4), X(12), X(2), X(10), X(6), X(14), X(1), X(9), X(5), X(13), X(3), X(11), X(7), X(15).]
[Figure: complete radix-2^2 DIF signal-flow graph for N = 16. Stages BF I, BF II, BF III, BF IV, with trivial -j multiplications inside the butterfly pairs and twiddle factors W^2, W^4, W^6; W^1, W^2, W^3; and W^3, W^6, W^9 between BF II and BF III. Same input and output ordering as above.]
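R2^2SDF architecture (N = 256)
[Figure: R2^2SDF pipeline for N = 256: input x(n); alternating BF2I and BF2II butterfly stages with feedback delays of 128, 64, 32, 16, 8, 4, 2, and 1; twiddle multipliers fed by W1(n), W2(n), W3(n); a counter clocked by clk whose bits 7 to 0 control the stages; output X(k). Insets: (i) BF2I, (ii) BF2II.]
Similar to R2SDF
Two types of butterflies:
  One identical to that in R2SDF
  The other also contains the logic for trivial twiddle factor multiplication (by -j)
Control: just a log2(N)-bit binary counter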
The binary counter serves as:
  Synchronization controller
  Address counter for twiddle factor reading in each stage
First butterfly (BF2I), first N/2 cycles: the butterfly is idle (input data is directed to the shift registers)
Next N/2 cycles: the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers
  Output Z1(n) is sent to the twiddle multiplier
  Output Z1(n+N/2) is sent back to the shift registers, to be multiplied in the next N/2 cycles, when the first half of the next frame is loaded in
[Figure: the R2^2SDF pipeline for N = 256 again: x(n) and clk inputs, alternating BF2I and BF2II stages with feedback delays of 128 down to 1, twiddle inputs W1(n), W2(n), W3(n), output X(k).]
Operation of the second butterfly (BF2II) is similar, except that the distance between butterfly inputs is just N/4 and it includes the trivial multiply logic
Utilization of the multiplier is 75%
The next frame can be computed without pausing, thanks to the pipelined processing in each stage
Pipeline registers can be inserted between each multiplier and butterfly stage to improve performance
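One plausible accounting for the 75% figure, sketched below under the assumption that the twiddle after a BF2I/BF2II pair is W_N^(n3(k1+2k2)): the (k1, k2) = (0, 0) quarter of every frame only ever sees a factor of 1, so the multiplier has real work in the other three quarters. The function name is illustrative, and the count treats a whole quarter frame as busy even though its first sample (n3 = 0) is also multiplied by 1.

```python
from fractions import Fraction

def twiddle_multiplier_utilization(N):
    """Fraction of cycles in which the twiddle W_N^(n3*(k1+2*k2)) is not forced to 1."""
    q = N // 4
    busy = sum(1
               for k1 in (0, 1)
               for k2 in (0, 1)
               for n3 in range(q)
               if k1 + 2 * k2 != 0)      # the k1 = k2 = 0 quarter is all ones
    return Fraction(busy, N)

print(twiddle_multiplier_utilization(256))   # 3/4
```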
Arithmetic complexity

            multiplier #      adder #      memory size   control
R2MDC       2(log4 N - 1)     4 log4 N     3N/2 - 2      simple
R2SDF       2(log4 N - 1)     4 log4 N     N - 1         simple
R4SDF       log4 N - 1        8 log4 N     N - 1         medium
R4MDC       3(log4 N - 1)     8 log4 N     5N/2 - 4      simple
R4SDC       log4 N - 1        3 log4 N     2N - 2        complex
R2^2SDF     log4 N - 1        4 log4 N     N - 1         simple
Figure by MIT OpenCourseWare.
R2^2SDF has reached the minimum requirement for both multipliers and storage
Only R4SDC is better in terms of adder usage
R2^2SDF is well suited for VLSI implementations of pipeline FFT processors
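To make the comparison concrete, the expressions in the table can be evaluated for a specific size. The sketch below does this for N = 256; the function and dictionary layout are illustrative assumptions, and the numbers follow directly from the table entries.

```python
def log4(N):
    """Return log4(N), checking that N is an exact power of 4."""
    s, size = 0, 1
    while size < N:
        size *= 4
        s += 1
    if size != N:
        raise ValueError("N must be a power of 4")
    return s

def pipeline_costs(N):
    """Map architecture name to (complex multipliers, complex adders, memory words)."""
    s = log4(N)
    return {
        "R2MDC":   (2 * (s - 1), 4 * s, 3 * N // 2 - 2),
        "R2SDF":   (2 * (s - 1), 4 * s, N - 1),
        "R4SDF":   (s - 1,       8 * s, N - 1),
        "R4MDC":   (3 * (s - 1), 8 * s, 5 * N // 2 - 4),
        "R4SDC":   (s - 1,       3 * s, 2 * N - 2),
        "R2^2SDF": (s - 1,       4 * s, N - 1),
    }

for name, (mults, adders, memory) in pipeline_costs(256).items():
    print(f"{name:8s} multipliers={mults:2d} adders={adders:2d} memory={memory}")
```

For N = 256, R2^2SDF needs 3 multipliers, 16 adders, and 255 words of memory, while R4SDC needs 12 adders but 510 words.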
Memory issues
FIFO register files at each stage
Complex multipliers at each (or every other) stage
To avoid unnecessary data movement in the FIFO, the storage needs to be restructured as a RAM with read and write addresses displaced by a constant
2-port RAM cells take 33% more area than 1-port RAM cells
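[Figure: two FIFO implementations, labeled a and b, each with input D(n) and output D(n-N): a shift-register chain with enable signals E, and a RAM with separate write (W-addr.) and read (R-addr.) addresses.]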
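The RAM-based FIFO can be sketched as a circular buffer in which only the addresses advance and stored data never moves. The class below is a hypothetical illustration, not the circuit from the figure: the name `RamFifo`, the default RAM size of delay + 1 words, and the write-before-read ordering within a cycle are all assumptions.

```python
class RamFifo:
    """Delay line built from a RAM: read address = write address - constant (mod size)."""

    def __init__(self, delay, size=None):
        self.delay = delay
        self.size = size if size is not None else delay + 1
        if self.size <= delay:
            raise ValueError("RAM must hold more words than the delay")
        self.ram = [0] * self.size
        self.w_addr = 0

    def step(self, d_in):
        """Write D(n) and return D(n - delay); zeros come out until the line fills."""
        self.ram[self.w_addr] = d_in                        # write port
        r_addr = (self.w_addr - self.delay) % self.size     # constant displacement
        d_out = self.ram[r_addr]                            # read port
        self.w_addr = (self.w_addr + 1) % self.size
        return d_out

fifo = RamFifo(delay=3)
print([fifo.step(v) for v in [1, 2, 3, 4, 5, 6]])   # [0, 0, 0, 1, 2, 3]
```

Each cycle touches exactly one write location and one read location, whereas a shift register toggles every stored word, which is the unnecessary data movement the slide refers to.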
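[Figure: column FFT architecture for N = 8: serial-to-parallel conversion and bit reversal (S/P & Bit reverse) feeding a column of N/r butterflies with twiddle factors W and -1 branches, producing X[0], X[4], X[2], X[6], X[1], X[5], X[3], X[7]; supported by a coefficient ROM, a counter, and control circuits.]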
$T_{FFT} = \frac{N}{r} \cdot \log_r N \cdot T_{r,PE}$
where
N/r = number of butterflies per stage
log_r N = number of stages
T_{r,PE} = time to calculate one butterfly
Figure by MIT OpenCourseWare.
[Sadat2001]
Needs a constant-geometry signal-flow graph
Big price in area for parallelism (within each stage)
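The timing expression above can be evaluated with a few lines of Python. The base case follows the formula as written. The `parallel_butterflies` parameter is an illustrative extension that is not on the slide: it assumes P butterfly units working side by side cut the per-stage time to ceil((N/r)/P) butterfly times, which shows the area-for-speed trade that the slide mentions.

```python
def num_stages(N, r):
    """Return log_r(N), checking that N is an exact power of r."""
    stages, size = 0, 1
    while size < N:
        size *= r
        stages += 1
    if size != N:
        raise ValueError("N must be a power of r")
    return stages

def fft_time(N, r, t_pe, parallel_butterflies=1):
    """T_FFT = (N/r) * log_r(N) * T_r,PE for one butterfly unit.

    parallel_butterflies (assumed extension): number of butterfly units per stage.
    """
    per_stage = -(-(N // r) // parallel_butterflies)   # ceiling division
    return per_stage * num_stages(N, r) * t_pe

print(fft_time(8, 2, 1))                           # 12 butterfly times with one unit
print(fft_time(8, 2, 1, parallel_butterflies=4))   # 3 with a full column of N/r units
```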
For a 64-point FFT, the number of nontrivial complex multiplications for radix-2 is 66
Radix-4 (or radix-2^2) FFTs need only 52 multiplies
Important to note that an 8-point FFT (DIT) needs no multiplies
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright 2004 IEEE. Used with permission.
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright 2004 IEEE. Used with permission.
Fully parallel in each stage (radix-2 8-point FFT, single clock cycle)
Large number of global wires resulting from the multiplexing of complex data to the 8-point FFTs
Construction of the multiplier unit to attain the required speed with minimal silicon area is not trivial
Input unit
To the 8-point FFT
Reduce de-muxing
Reduce global wires
Multiplier cannot finish
Extend latency
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright 2004 IEEE. Used with permission.
Multiplier unit
49 multiplies
Hard-wired constants for the coefficients
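The count of 49 is consistent with splitting the 64-point transform into two layers of 8-point DFTs joined by an 8 x 8 grid of inter-dimensional twiddles, 49 of which differ from 1. The sketch below is a textbook Cooley-Tukey 8 x 8 split written for illustration; the index mapping, function names, and use of a direct 8-point DFT are assumptions and need not match the chip's actual data ordering.

```python
import cmath

def W(e, n):
    """Twiddle factor W_n^e = exp(-2*pi*j*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def dft(x):
    """Direct DFT, used here as the 8-point building block and as a reference."""
    n = len(x)
    return [sum(x[i] * W(i * k, n) for i in range(n)) for k in range(n)]

def fft64_8x8(x):
    """64-point DFT with n = 8*n1 + n2 and k = k1 + 8*k2."""
    X = [0j] * 64
    # First layer: one 8-point DFT over n1 for every n2, giving inner[n2][k1].
    inner = [dft([x[8 * n1 + n2] for n1 in range(8)]) for n2 in range(8)]
    for k1 in range(8):
        # Inter-dimensional twiddles W_64^(n2*k1): the multiplier unit's job.
        col = [inner[n2][k1] * W(n2 * k1, 64) for n2 in range(8)]
        outer = dft(col)                  # second layer: 8-point DFT over n2
        for k2 in range(8):
            X[k1 + 8 * k2] = outer[k2]
    return X

# Twiddles W_64^(n2*k1) that are not equal to 1.
non_unity = sum(1 for n2 in range(8) for k1 in range(8) if (n2 * k1) % 64 != 0)
print(non_unity)   # 49
```

Because the 64 twiddle exponents are fixed, each multiplication is by a known constant, which is what makes hard-wired coefficient multipliers possible.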
Figures from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright 2004 IEEE. Used with permission.
~50% less power and area than 8 standard complex multipliers
Buffer unit is similar to the input unit, just without temporary registers
Output unit
Control/sync is simple
Starts counting when the input unit is full
Local counters control the input, intermediate, and output units
Figure from Maharatna, K., E. Grass, and U. Jagdhold. "A 64-point Fourier Transform Chip for High-speed Wireless LAN Application Using OFDM." Solid-State Circuits 39 (2004): 484-493. Copyright 2004 IEEE. Used with permission.
Readings
[1] S. He and M. Torkelson, "A new approach to pipeline FFT processor," Proceedings of IPPS '96, The 10th International Parallel Processing Symposium, pp. 766-770, 1996.
[2] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," 1998 URSI International Symposium on Signals, Systems, and Electronics (ISSSE 98), pp. 257-262, 1998.
[3] E. Wold and A. M. Despain, "Pipeline and parallel-pipeline FFT processors for VLSI implementations," IEEE Trans. Computers, vol. 33, no. 5, pp. 414-426, 1984.
[4] G. Bi and E. V. Jones, "A pipelined FFT processor for word-sequential data," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 12, pp. 1982-1985, 1989.
[5] K. Maharatna, E. Grass, and U. Jagdhold, "A 64-point Fourier transform chip for high-speed wireless LAN application using OFDM," IEEE Journal of Solid-State Circuits, vol. 39, no. 3, pp. 484-493, 2004.
Interesting DIT&F algorithm
[6] C. Chiu, W. Hui, T.J. Ding, and J.V. McCanny, "A 64-point Fourier transform chip for video motion compensation using phase correlation," IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1751-1761, 1996.
Power-performance estimation
[7] S. Hong, S. Kim, M.C. Papaefthymiou, and W.E. Stark, "Power-complexity analysis of pipelined VLSI FFT architectures for low energy wireless communication applications," 42nd Midwest Symposium on Circuits and Systems, vol. 1, pp. 313-316, 1999.
[8] K. Pagiamtzis and P.G. Gulak, "Empirical performance prediction for IFFT/FFT cores for OFDM systems-on-a-chip," 45th Midwest Symposium on Circuits and Systems (MWSCAS-2002), vol. 1, pp. I-583-6, 2002.