Hardware Implementation Low Power High Speed FFT Core
1, January 2009
1. Introduction
The Fast Fourier Transform (FFT) is a fast implementation of the Discrete Fourier Transform (DFT) that relies on mathematical simplification and classification of the input sequence to achieve its performance gain. The FFT typically requires O(N log2 N) operations, compared with O(N^2) operations for the direct DFT [2]. The FFT processor is used in a wide range of DSP and communication applications, such as radar signal processing and wireless LAN. Recent research has demonstrated the pipelined FFT as a leading architecture for real-time applications. In this paper, a low-power and efficient multiplier-less approach is employed to replace the complex multipliers in pipelined FFTs, in which a commutator is needed to reorder the input data. It is well known that switching power is mainly responsible for power consumption in CMOS circuits. This power, Psw, is given by

$$P_{sw} = \frac{1}{2}\, k\, C_{load}\, V_{dd}^{2}\, f \qquad (1)$$

where k is the average number of times the gate makes an active transition during one clock cycle, f is the clock frequency, Vdd is the supply voltage and Cload is the load capacitance of the gate. Hence, to achieve low power, one or more of the parameters Cload, Vdd and k need to be minimized. However, since Cload and Vdd are determined by the target technology, k becomes the main point of improvement.
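As a small numerical illustration of equation (1), the sketch below evaluates the switching power for two values of k. The operating point (1 pF load, 1 V supply, 100 MHz clock) is hypothetical and chosen only to show that, with Cload, Vdd and f fixed, the power scales linearly with k.

```python
def switching_power(k, c_load, v_dd, f):
    """Return P_sw = 0.5 * k * C_load * V_dd**2 * f in watts (equation (1))."""
    return 0.5 * k * c_load * v_dd ** 2 * f


# Hypothetical operating point, for illustration only (not from the paper).
baseline = switching_power(k=0.5, c_load=1e-12, v_dd=1.0, f=100e6)
reduced = switching_power(k=0.25, c_load=1e-12, v_dd=1.0, f=100e6)

# Halving the switching activity k halves the switching power.
print(f"baseline: {baseline:.3e} W, reduced activity: {reduced:.3e} W")
```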
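The N-point DFT of a sequence x(n) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (2)$$

where $W_N = e^{-j 2\pi / N}$ is the twiddle factor [10]. Since 16 and 64 are powers of four, the radix-4 decimation-in-frequency algorithm is used to break the DFT formula into four smaller DFTs. The FFT is the speed-up algorithm of the DFT [7]. The final sets of transforms look like

$$X(4k) = \sum_{n=0}^{N/4-1} \left[ x(n) + x\!\left(n+\tfrac{N}{4}\right) + x\!\left(n+\tfrac{N}{2}\right) + x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{0}\, W_{N/4}^{kn} \qquad (3)$$

$$X(4k+1) = \sum_{n=0}^{N/4-1} \left[ x(n) - j\,x\!\left(n+\tfrac{N}{4}\right) - x\!\left(n+\tfrac{N}{2}\right) + j\,x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{n}\, W_{N/4}^{kn} \qquad (4)$$

$$X(4k+2) = \sum_{n=0}^{N/4-1} \left[ x(n) - x\!\left(n+\tfrac{N}{4}\right) + x\!\left(n+\tfrac{N}{2}\right) - x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{2n}\, W_{N/4}^{kn} \qquad (5)$$

$$X(4k+3) = \sum_{n=0}^{N/4-1} \left[ x(n) + j\,x\!\left(n+\tfrac{N}{4}\right) - x\!\left(n+\tfrac{N}{2}\right) - j\,x\!\left(n+\tfrac{3N}{4}\right) \right] W_N^{3n}\, W_{N/4}^{kn} \qquad (6)$$

For the 16-point FFT, k = 0, 1, 2, 3 yields 16 equations; for the 64-point FFT, k varies from 0 to 15. The flow graph of the 16-point FFT is shown in Figure 1. In this figure, the numbers inside the open circles represent the equations used for computing the outputs in the butterfly stage.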
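The radix-4 decimation-in-frequency split can be checked numerically. The following is a minimal floating-point sketch, not the fixed-point hardware: it applies the four butterfly sums with twiddle factors W_N^0, W_N^n, W_N^2n and W_N^3n, recurses on the four N/4-point sequences, and compares the result with a direct DFT. The function names are illustrative only.

```python
import cmath


def twiddle(n_points, exponent):
    """W_N^exponent with W_N = exp(-j*2*pi/N)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_points)


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[i] * twiddle(n, i * k) for i in range(n)) for k in range(n)]


def radix4_dif_step(x):
    """Split x into four N/4-point sequences whose DFTs are X(4k+m), m = 0..3."""
    n = len(x)
    q = n // 4
    out = [[], [], [], []]
    for i in range(q):
        a, b, c, d = x[i], x[i + q], x[i + 2 * q], x[i + 3 * q]
        out[0].append(a + b + c + d)
        out[1].append((a - 1j * b - c + 1j * d) * twiddle(n, i))
        out[2].append((a - b + c - d) * twiddle(n, 2 * i))
        out[3].append((a + 1j * b - c - 1j * d) * twiddle(n, 3 * i))
    return out


def fft_radix4(x):
    """Recursive radix-4 DIF FFT for lengths that are powers of four."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4:
        raise ValueError("length must be a power of four")
    result = [0j] * n
    for m, sub in enumerate(radix4_dif_step(x)):
        for k, value in enumerate(fft_radix4(sub)):
            result[4 * k + m] = value
    return result


samples = [complex(i % 5, (3 * i) % 7) for i in range(16)]
error = max(abs(a - b) for a, b in zip(fft_radix4(samples), dft(samples)))
print(f"max deviation from direct DFT: {error:.2e}")
```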
In the flow graph of Figure 1, the number outside each open circle is the twiddle factor used.

2. Implementation
2.1. Commutator
In real-time applications, the input data arrives as a sequential stream. This does not match the FFT algorithm, which requires temporal reordering of the data. For this reason, a commutator is needed to reorder the input data. Among the several pipelined FFT architectures, the Radix-4 Single-path Delay Commutator (R4SDC) [4] is widely used, owing to its high utilization of multipliers, butterfly elements and memory blocks. As the FFT size increases, the commutators account for a growing proportion of the overall power consumption and become its leading contributor. Therefore, reducing the power consumption of the commutator units is crucial for the low-power implementation of a pipelined FFT processor. The requisite commutator is shown in Figure 2 (it is required for both the real and imaginary parts). It consists of six shift registers, each providing Nt word delays. Control signals (denoted c1, c2 and c3) select the appropriate data via 2:1 multiplexers. In accordance with the value of mt, the four complex outputs from the commutator are connected to its associated butterfly. The commutator supplies the same set of data for Nt word cycles. Each FIFO is implemented through a set of shift registers. The FIFO size Nt equals 4^(5-t), where t is the stage number.

Figure 2. Commutator: input I/P, six Nt-word shift registers, outputs O1 to O4, with control signals c1 to c3 set by m1 as follows.

m1    c1   c2   c3
0     1    1    1
1     0    1    1
2     0    0    1
3     0    0    0

The four outputs from two commutators are fed into each simplified butterfly unit. The butterfly unit computes the four equations in a clock cycle. The coefficients are fed into the complex multipliers. A pipelined N-point radix-4 FFT processor based on this architecture [6], shown in Figure 3, has log4 N stages. Each stage produces one output within each word cycle. Each stage contains a commutator, a butterfly element and a complex multiplier. The sequential outputs at each stage must be ordered in accordance with the value of mt. For instance, from Figure 2, at stage 1 the outputs associated with mt = 0 are produced in the first four word cycles, then those associated with mt = 1 in the next four cycles, and so on. It is clear from the FFT equations that the input data for each summation at stage t are separated in time by Nt words.

Figure 3. Pipelined radix-4 FFT processor: stages 1 to v between in and out, each with a commutator, a butterfly and a coefficient input (control signals C1 to C6).
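The pipeline bookkeeping described for the commutator can be sketched in a few lines. This is a behavioural illustration under stated assumptions, not the authors' RTL: the stage count is taken as log4 N, the control values follow one reading of the flattened Figure 2 table (c_i is 1 while m < i), and the output schedule assumes m advances every Nt word cycles, as in the stage-1 example of the 16-point FFT. All function names are hypothetical.

```python
def num_stages(n_points):
    """Number of radix-4 pipeline stages, log4(N), for N a power of four."""
    stages = 0
    while n_points > 1:
        if n_points % 4:
            raise ValueError("N must be a power of four")
        n_points //= 4
        stages += 1
    return stages


def commutator_controls(m):
    """(c1, c2, c3) for a given m, assuming the Figure 2 table reads c_i = 1 iff m < i."""
    return tuple(1 if m < i else 0 for i in (1, 2, 3))


def m_schedule(n_t, cycles):
    """Value of m in each word cycle, assuming m advances every n_t cycles."""
    return [(cycle // n_t) % 4 for cycle in range(cycles)]


for size in (16, 64, 256):
    print(size, "points ->", num_stages(size), "stages")

# Stage 1 of the 16-point FFT: four word cycles per value of m.
print(m_schedule(n_t=4, cycles=16))
```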
Shift and addition operations with common sub-expression sharing are used to pre-compute the twiddle coefficients, which reduces area as well as power [5]. The coefficients for the 16-point FFT are shown in Table 1. The multiplier-less unit, shown in Figure 7, consists of shift and addition operations with common sub-expression sharing that replace the complex multiplications [3]. A close observation reveals that the seven coefficients (7fff, 0000) and the coefficient (0000, 8000) are trivial: they are the quantized representations of (1, 0) and (0, -1), respectively, in 16-bit two's complement format. In each pair, the first entry corresponds to the cosine function (the real part, Wr) and the second to the sine function (the imaginary part, Wi). For the trivial coefficients (7fff, 0000) and (0000, 8000), no complex multiplication is necessary. Data multiplied by (7fff, 0000) can pass directly through the multiplier unit without any multiplication. For data multiplied by (0000, 8000), only an additional unit is needed, which swaps the real and imaginary parts of the input data and inverts the imaginary part. The rest of the coefficients can be represented by three constants (7641, 5a82 and 30fb). For example, a multiplication by the constant a57d can be realized by first multiplying the data by 5a82 and then taking the two's complement of the result. The other two constants (89be and cf04) can be realized in a similar manner, using the constants 7641 and 30fb, respectively.
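The handling of the two trivial coefficients can be illustrated as follows. This is a sketch on plain Python integers that ignores fixed-point scaling; the function name is hypothetical. Multiplying by (7fff, 0000) is treated as a pass-through, and multiplying by (0000, 8000), that is by -j, as a swap of the real and imaginary parts followed by inversion of the new imaginary part.

```python
def multiply_by_trivial(re, im, coefficient):
    """Apply a trivial twiddle coefficient to the sample re + j*im."""
    if coefficient == (0x7FFF, 0x0000):
        # Quantized (1, 0): the data passes through unchanged.
        return re, im
    if coefficient == (0x0000, 0x8000):
        # Quantized (0, -1), i.e. -j: swap the parts, then invert the imaginary part.
        return im, -re
    raise ValueError("not a trivial coefficient")


sample = (3, 4)
print(multiply_by_trivial(*sample, (0x7FFF, 0x0000)))
print(multiply_by_trivial(*sample, (0x0000, 0x8000)))

# Cross-check against complex arithmetic: (3 + 4j) * (-j) = 4 - 3j.
product = complex(*sample) * (-1j)
```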
The three constants are 5a82 (0101101010000010), 7641 (1000-10-001000001) and 30fb (010-1000100000-10-1), where -1 denotes a negative digit. Shifters and adders based on these three constants carry out the nontrivial complex multiplications as shown below:

5a82X = (5X << 12) + (5X << 9) + (65X << 1)
7641X = (X << 15) + 65X - (5X << 9)
30fbX = (65X << 8) - (X << 12) - 5X

The common sub-expressions are 101 (5) and 1000001 (65). Figure 5 shows the shift-and-addition module for the three constants in the multiplier-less unit. In the multiplier-less approach, the ROM unit storing the coefficients is replaced by an FSM unit generating the control signals (s1 to s8). The same multiplier architecture, applied to the 64-point FFT, is shown in Figure 6. All coefficients for the 64-point FFT are represented in terms of 7f62, 7d8a, 7a7d, 7641, 70e2, 6a6d, 62f2, 5a82, 5133, 471c, 3c56, 30fb, 2528, 18f8 and 0c8b [8]. These coefficients are pre-computed using common sub-expression based shift and addition:

7f62X = (X << 15) - (5X << 5) + (X << 1)
7d8aX = (X << 15) - (5X << 7) + (5X << 1)
7a7dX = (65X << 9) - (X << 11) + (X << 7) - (X << 2) + X
7641X = (X << 15) + 65X - (5X << 9)
70e2X = (X << 1) - (X << 5) + (X << 8) - (X << 12) + (X << 15)
6a6dX = X - (5X << 2) + (X << 7) + (X << 11) + (65X << 9) - (X << 13)
62f2X = (X << 1) + (X << 10) + (X << 15) - (X << 4) - (X << 8) - (X << 13)
5a82X = (5X << 12) + (5X << 9) + (65X << 1)
5133X = (5X << 12) + (65X << 2) - (X << 4) + (X << 6) - X
471cX = (X << 5) - (65X << 2) + (X << 11) + (X << 14)
3c56X = (X << 7) - (5X << 1) - (X << 5) - (X << 10) + (X << 14)
30fbX = (65X << 8) - (X << 12) - 5X
2528X = (5X << 3) + (5X << 8) + (X << 13)
18f8X = (X << 8) - (X << 3) - (X << 11) + (X << 13)
0c8bX = (X << 11) - 5X + (65X << 4) + (X << 7)

Similarly, the coefficients for the 256-point FFT are represented in terms of 7ff6, 7fd8, 7fa7, 7f62, 7f09, 7e9d, 7e1d, 7d8a, 7ce2, 7c29, 7b5d, 7a7d, 798a, 7884, 776c, 7641, 7504, 73b5, 7255, 70e2, 6f5f, 6dca, 6c24, 6a6d, 68a6, 66cf, 64e8, 62f2, 60ec, 5ed7, 5cb4, 5a82, 5842, 55f5, 539b, 5133, 4ebf, 4c3f, 49b4, 471c, 447a, 41ce, 3f17, 3c56, 398c, 36ba, 33de, 30fb, 2e11, 2b1f, 282b, 2528, 2223, 1f19, 1c0b, 18f8, 15e2, 12c8, 0fab, 0c8b, 096a, 0647 and 0324. These coefficients are likewise pre-computed using common sub-expression based shift and addition.
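The shift-and-add decompositions of the three 16-point constants can be verified with integer arithmetic. The sketch below is only a software check, not a model of the hardware datapath: it builds the shared sub-expressions 5X and 65X once and confirms that each decomposition equals a plain multiplication by the constant. The function names are hypothetical.

```python
def shared_subexpressions(x):
    """Return the common sub-expressions 5X (binary 101) and 65X (binary 1000001)."""
    x5 = (x << 2) + x
    x65 = (x << 6) + x
    return x5, x65


def times_5a82(x):
    x5, x65 = shared_subexpressions(x)
    return (x5 << 12) + (x5 << 9) + (x65 << 1)


def times_7641(x):
    x5, x65 = shared_subexpressions(x)
    return (x << 15) + x65 - (x5 << 9)


def times_30fb(x):
    x5, x65 = shared_subexpressions(x)
    return (x65 << 8) - (x << 12) - x5


# Compare against ordinary multiplication over a range of signed inputs.
all_match = all(
    times_5a82(x) == 0x5A82 * x
    and times_7641(x) == 0x7641 * x
    and times_30fb(x) == 0x30FB * x
    for x in range(-1024, 1025)
)
print("decompositions match:", all_match)
```

Sharing 5X and 65X means each of the three products needs only two further additions or subtractions once the sub-expressions are available.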
Figure 4. Butterfly element for stage t. (Inputs ar, ai, br, bi, cr, ci, dr, di; add/sub units controlled by c1 to c3, where 0 selects addition and 1 selects subtraction; outputs Re and Im.)

m1    c1   c2   c3
0     0    0    0
1     1    0    1
2     0    1    1
3     1    1    0

Table 1. The coefficients for 16-point.

Coefficient sequence   Original quantized   Coefficient sequence   Original quantized
(m1 = 0, 1)            coefficient          (m1 = 2, 3)            coefficient
W0                     7fff, 0000           W0                     7fff, 0000
W0                     7fff, 0000           W2                     5a82, a57d
W0                     7fff, 0000           W4                     0000, 8000
W0                     7fff, 0000           W6                     a57d, a57d
W0                     7fff, 0000           W0                     7fff, 0000
W1                     7641, cf04           W3                     30fb, 89be
W2                     5a82, a57d           W6                     a57d, a57d
W3                     30fb, 89be           W9                     89be, 30fb
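The entries of Table 1 can be reproduced from the twiddle factor definition. The sketch below assumes a particular quantization rule that the paper does not state explicitly: scale cos and -sin by 2^15, round toward negative infinity, and saturate to the 16-bit two's complement range. Under that assumption it yields the tabulated hexadecimal pairs.

```python
import math


def quantize_q15(value):
    """Quantize to 16-bit two's complement (assumed rule: floor, then saturate)."""
    scaled = math.floor(value * 32768)
    scaled = max(-32768, min(32767, scaled))
    return scaled & 0xFFFF


def twiddle_q15(k, n_points=16):
    """Quantized (Wr, Wi) = (cos, -sin) of the angle 2*pi*k/N."""
    angle = 2 * math.pi * k / n_points
    return quantize_q15(math.cos(angle)), quantize_q15(-math.sin(angle))


for k in (0, 1, 2, 3, 4, 6, 9):
    wr, wi = twiddle_q15(k)
    print(f"W{k}: {wr:04x}, {wi:04x}")
```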
The constant 5a82 is represented in two's complement format, whereas 7641 and 30fb are represented in Canonical Signed-Digit (CSD) format.
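A CSD representation can be derived mechanically. The sketch below uses the standard non-adjacent-form recoding (an assumption about how the digits were obtained, since the paper does not describe the procedure) and prints the signed digits of 7641 and 30fb, together with the number of nonzero digits compared with plain binary.

```python
def to_csd(value):
    """Return the CSD digits (+1, 0, -1), most significant first, of a non-negative integer."""
    digits = []
    while value:
        if value & 1:
            # Choose +1 when value % 4 == 1 and -1 when value % 4 == 3,
            # so that the next digit is guaranteed to be zero.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits[::-1] or [0]


def from_csd(digits):
    """Rebuild the integer from its signed digits."""
    result = 0
    for digit in digits:
        result = 2 * result + digit
    return result


for constant in (0x7641, 0x30FB):
    digits = to_csd(constant)
    nonzero = sum(1 for d in digits if d)
    print(f"{constant:04x}: {digits} ({nonzero} nonzero digits, "
          f"{bin(constant).count('1')} ones in binary)")
```

Fewer nonzero digits translate into fewer shift-and-add terms, which is the motivation for using CSD for these two constants.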
Figure 6. Multiplier architecture for the 64-point FFT (input switch unit, shift-add modules for the constants, swap and inverter blocks with control signals S1 to S5, output switch, multiplexer).

3. Results
3.1. Simulation Results Using Modelsim Tool
The FFT blocks are described in Verilog HDL and simulated with the Modelsim tool; the results are shown below. The resulting Verilog HDL simulation models can then be used as building blocks in larger circuits (using schematics, block diagrams or system-level Verilog HDL descriptions) for the purpose of simulation. The top module is simulated for 32-bit (complex data) fixed point using the 16-point, 64-point and 256-point radix-4 DIF FFT algorithms. The given inputs and corresponding outputs are as follows:

Input: 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0

Output: 64, 0, 0, 0, -16.109-24.109i, 0, 0, 0, 0, 0, 0, 0, 9.9864-1.9864i, 0, 0, 0, 0, 0, 0, 0, 1.3273-6.6727i, 0, 0, 0, 0, 0, 0, 0, 4.7956+3.2044i, 0, 0, 0, 0, 0, 0, 0, 4.7956-3.2044i, 0, 0, 0, 0, 0, 0, 0, 1.3273+6.6727i, 0, 0, 0, 0, 0, 0, 0, 9.9864+1.9864i, 0, 0, 0, 0, 0, 0, 0, -16.109+24.109i, 0, 0, 0.

3.3. Reports
RTL Compiler was used to generate the power, area and timing reports for the FFTs. The timing and power reports for the 16-point, 64-point and 256-point FFT cores are shown in Tables 2 and 3. The power and area reports of the different modules in the top FFT core for the 16-point and 64-point FFTs are shown in Tables 4 and 5. For the 256-point FFT, these reports are given in Tables 6 and 7.

Table 2. Timing report for FFT core (different points).

FFT size      Frequency (MHz)
16-point      200
64-point      166.66
256-point     125
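As a cross-check of the 64-point test vector listed in Section 3.1, the sketch below computes a direct floating-point DFT of the input and prints the nonzero bins. It is a reference calculation only, not the fixed-point Verilog model. Because the input repeats with period 16, only bins that are multiples of 4 can be nonzero, and because the input is real, bin 64-k is the complex conjugate of bin k.

```python
import cmath

# One period of the test input from Section 3.1, repeated four times.
PATTERN = [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0]
samples = PATTERN * 4


def dft(x):
    """Direct DFT: X(k) = sum_n x(n) * exp(-j*2*pi*n*k/N)."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
        for k in range(n)
    ]


spectrum = dft(samples)
nonzero_bins = [k for k, value in enumerate(spectrum) if abs(value) > 1e-6]

for k in nonzero_bins:
    value = spectrum[k]
    print(f"X({k}) = {value.real:.4f}{value.imag:+.4f}i")
```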
Column headings of the power and area reports: Leakage (W), Internal (mW), Net (mW), Switching (mW), Total (mW), Cell Area (mm^2).
4. Conclusion
In this paper, a pipelined architecture for 16-point, 64-point and 256-point radix-4 DIF FFTs in fixed-point representation is implemented. A low-power FFT processor is obtained by using a multiplier-less (shift-add) approach for multiplying by the twiddle coefficients. The paper presents a multiplier-less pipelined FFT processor architecture suitable for shorter FFTs; this design approach can also be applied to longer FFTs. The multiplier-less architecture employs the minimum number of shift and addition operations to realize the complex multiplications. By combining a commutator architecture and a low-power butterfly architecture with this approach, the resulting power savings are around 43% and 59% for the 64-point and 16-point radix-4 FFTs, respectively, compared with a conventional FFT architecture based on a non-Booth-coded Wallace tree multiplier. The parameterization

References
[1] Bi G. and Jones E., A Pipelined FFT Processor for Word Sequential Data, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 12, pp. 1982-1985, 1989.
[2] Cooley J. and Tukey J., An Algorithm for the Machine Computation of the Complex Fourier Series, Mathematics of Computation, vol. 19, pp. 297-301, 1965.
[3] Han W., Arslan T., Erdogan A., and Hasan M., A Novel Low Power Pipelined FFT Based on Sub Expression Sharing for Wireless LAN Applications, IEEE Workshop on Signal Processing Systems, pp. 83-88, 2004.
[4] Han W., Arslan T., Erdogan A., and Hasan M., Low Power Commutator for Pipelined FFT Processors, IEEE International Symposium on Circuits and Systems, vol. 5, pp. 5274-5277, 2005.
[5] Han W., Arslan T., Erdogan A., and Hasan M., Multiplier-Less Based Parallel-Pipelined FFT Architecture for Wireless Communication Applications, IEEE Proceedings on Acoustics, Speech and Signal Processing, vol. 5, pp. 45-48, 2005.
[6] Han W., Arslan T., Erdogan A., and Hasan M., The Development of High Performance FFT IP Cores Through Hybrid Low Power Algorithmic Methodology, in Proceedings of the Asia South Pacific Design Automation, pp. 549-552, China, 2005.
[7] John G. and Manolakis D., Digital Signal Processing, Macmillan, London, 1988.
[8] Maharatna K., Grass E., and Jagdhold U., A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM, IEEE Journal of Solid-State Circuits, vol. 39, no. 3, pp. 484-493, 2004.
Muniandi Kannan received his BE in electronics and communication engineering from MK University, Madurai, and his ME from Anna University, Chennai. Since 1993 he has been working at Anna University, Chennai, India. His areas of interest include computer architecture, VLSI design, and VLSI for signal processing.

Srinivasa Srivatsa received his BE in electronics and telecommunication engineering from Jadavpur University, his ME in electrical communication engineering, and his PhD from the Indian Institute of Science, Bangalore, India. He served as a professor of electronics engineering at Anna University, Chennai, India for nearly 20 years. He is the author of 191 publications in reputed journals and conference proceedings. His areas of interest include computer networks, digital logic design, design of algorithms, and robotics.