FPGA Implementation of High Speed FIR Filters Using Add and Shift Method
FPGA Implementation of High Speed FIR Filters Using Add and Shift Method
Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner University Of California, Santa Barbara, CA 93106 E-mail: [email protected], [email protected], [email protected]
Abstract-We present a method for implementing high speed Finite Impulse Response (FIR) filters using just registered adders and hardwired shifts. We extensively use a modified common subexpression elimination algorithm to reduce the number of adders. We target our optimizations to Xilinx Virtex II devices where we compare our implementations with those produced by Xilinx CoregenTM using Distributed Arithmetic. We observe up to 50% reduction in the number of slices and up to 75% reduction in the number of LUTs for fully parallel implementations. We also observed up to 50% reduction in the total dynamic power consumption of the filters. Our designs perform significantly faster than the MAC filters, which use embedded multipliers.
I. INTRODUCTION
most of the current generation FPGAs such as Virtex IITM have embedded multipliers to handle these multiplications, the number of these multipliers is typically limited. Furthermore, the size of these multipliers is limited to only 18 bits, which limits the precision of the computations for high speed requirements. The ideal implementation would involve a sharing of the Combinational Logic Blocks (CLBs) and these multipliers. In this paper, we present a technique that is better than conventional techniques for implementation on the CLBs.
X [n]
x
hL-1
hL-2
hL-3
h1
h0 y [n]
z-1
z-1
z-1
...
z-1
z-1
FPGAs are being increasingly used for a variety of computationally intensive applications, mainly in the realm of Digital Signal Processing (DSP) and communications [1-7]. Due to rapid increases in the technology, current generation of FPGAs contain a very high number of Configurable Logic Blocks (CLBs), and are becoming more feasible for implementing a wide range of applications. The high nonrecurring engineering (NRE) costs and long development time for ASICs are making FPGAs more attractive for application specific DSP solutions. DSP functions such as FIR filters and transforms are used in a number of applications such as communication and multimedia. These functions are major determinants of the performance and power consumption of the whole system. Therefore it is important to have good tools for optimizing these functions. Equation (I) represents the output of an L tap FIR filter, which is the convolution of the latest L input samples. L is the number of coefficients h(k) of the filter, and x(n) represents the input time series. y[n] = h[k] x[n-k] k= 0, 1, ..., L-1 (I)
An alternative to the above approach is Distributed Arithmetic (DA) which is a well known method to save resources. Using DA method, the filter can be implemented either in bit serial or fully parallel mode to trade bandwidth for area utilization. Assuming coefficients c[n] are known constants, equation (I) can be rewritten as follows: y[n] = c[n] x[n] n = 0, 1, , N-1 (II)
Variable x[n] can be represented by: x [n] = xb [n] 2b b=0, 1, , B-1 xb [n] [0, 1] (III)
where xb [n] is the bth bit of x[n] and B is the input width. Finally, the inner product can be rewritten as follows: y = c[n] xb [k] 2b = c[0] (xB-1 [0]2B-1 + xB-2 [0]2B-2 + + x0 [0]20 ) + c[1] (xB-1 [1]2B-1 + xB-2 [1]2B-2 + + x0 [1]20 ) + + c[N-1] (xB-1 [N-1]2B-1 + xB-2 [0]2B-2 + + x0 [N1]20 ) = (c[0] xB-1 [0] + c[1] xB-1 [1] + + c[N-1] xB-1 [N1])2B-1 +(c[0] xB-2 [0] + c[1] xB-2 [1] + + c[N-1] xB-2 [N1])2B-2 + + (c[0] x0 [0] + c[1] x0 [1] + + c[N-1] x0 [N-1])20 = 2b c[n] xb [k] (IV) where n=0, 1, , N-1 and b=0, 1, , B-1
The conventional tapped delay line realization of this inner product is shown in Figure 1. This implementation translates to L multiplications and L-1 additions per sample to compute the result. This can be implemented using a single Multiply Accumulate (MAC) engine, but it would require L MAC cycles, before the next input sample can be processed. Using a parallel implementation with L MACs can speed up the performance L times. A general purpose multiplier occupies a large area on FPGAs. Since all the multiplications are with constants, the full flexibility of a general purpose multiplier is not required, and the area can be vastly reduced using techniques developed for constant multiplication. Though
The coefficients in most of DSP applications for the multiply accumulate operation are constants. The partial products are obtained by multiplying the coefficients ci by multiplying one bit of data xi at a time in AND operation. These partial products should be added and the result depend only on the outputs of the input shift registers. The AND functions and adders can be replaced by Look Up Tables (LUTs) that gives the partial product. This is shown in Figure 2. Input sequence is fed into the shift register at the input sample rate. The serial output is presented to the RAM based shift registers (registers are not shown in Figure for simplicity) at the bit clock rate which is n+1 times (n is number of bits in a data input sample) the sample rate. The RAM based shift register stores the data in a particular address. The outputs of registered LUTs are added and loaded to the scaling accumulator from LSB to MSB and the result which is the filter output will be accumulated over the time. For an n bit input, n+1 clock cycles are needed for a symmetrical filter to generate the output. In conventional MAC method with a limited number of MAC engines, as the filter length is increased, the system sample rate is decreased. This is not the case with serial DA architectures since the filter sample rate is decoupled from the filter length. As the filter length is increased, the throughput is maintained but more logic resources are consumed. Though the serial DA architecture is efficient by construction, its performance is limited by the fact that the next input sample can be processed only after every bit of the current input samples are processed. Each bit of the current input samples takes one clock cycle to process.
scaling accumulator
x0[i] x1[i] x2[i] x3[i] x4[i] x5[i] x6[i] x7[i] x0[i+1] x1[i+1] x2[i+1] x3[i+1] x4[i+1] x5[i+1] x6[i+1] x7[i+1]
LUT
scaling accumulator
LUT
<<
SET
Q Q
CLR
LUT
LUT
A popular technique for implementing the transposed form of FIR filters is the use of a multiplier block, instead of using multipliers for each constant as shown in Figure 4. The multiplications with the set of constants {hk} are replaced by an optimized set of additions and shift operations, involving computation sharing. Further optimization can be done by factorizing the expression and finding common subexpressions. The performance of this filter architecture is limited by the latency of the biggest adder and is the same as that of the PDA.
LUT
<<
SET
CLR
LUT
Therefore, if the input bitwidth is 12, then a new input can be sampled every 12 clock cycles. The performance of the circuit can be improved by modifying the architecture to a parallel architecture which processes the data bits in groups. Figure 3 shows the block diagram of a 2 bit parallel DA FIR filter. The tradeoff here is performance for area since increasing the number of bits sampled has a significant effect on resource utilization on FPGA. For instance, doubling the number of bits sampled, doubles the throughput and results in the half the number of clock cycles. This change doubles the number of LUTs as well as the size of the scaling accumulator. The number of bits being processed can be increased to its maximum size which is the input length n. This gives the maximum throughput to the filter. For a fully parallel implementation of the DA filter (PDA), the number of LUTs required would be enormous. In this work we show an alternative to the PDA method for implementing high speed FIR filters that consumes significantly lesser area and power.
The main contribution in this paper is the development of a novel algorithm for optimizing the multiplier block for FIR filters, using a modified algorithm for common subexpression elimination. The goal of the algorithm is to produce a filter that can provide the maximum sample rate with the least amount of hardware. Our algorithm takes into account the specific features of FPGA slices to reduce the total number of occupied slices. The reduced number of slices also leads to a reduction in the total power on the FPGA. We compare our results with the industry standard Xilinx CoregenTM, where we compare the total area and power consumption. The rest of the paper is organized as follows: Section 2 presents some related work. In Section3, we describe our filter architecture. In Section 4, we present our optimization algorithm for reducing the total area of the design. In Section 5, we describe our experimental setup and present our results. Finally we conclude the paper in Section 6.
II. RELATED WORK
Multiplications with constants have to be performed in many signal processing and communication applications such as FIR filters, audio, video and image processing. Since implementing a general purpose multiplier is expensive on an FPGA and since we do not really need such a multiplier, when one of the operands is a constant, there has been a lot of work
on deriving efficient structures for constant multiplications [813]. All these techniques are based on computing constant multiplications using table lookups and additions. The method of Distributed Arithmetic [12, 14] which is the most popular method for implementing Multiplierless FIR filters, is also based on table lookup. The XilinxTM CORE Generator has a highly parameterizable, optimized filter core for implementing digital FIR filters [12]. based on both Distributed Arithmetic as well as MAC (Multiply Accumulate) based architectures. It generates synthesized core that targeting a wide range of Xilinx devices. The MAC based implementations make use of the embedded DSP slices on the FPGA devices. In this work, we primarily compare our technique with the Coregen implementation of the Distributed Arithmetic, since that also is a Multiplierless technique. We show that our designs are much more area efficient than the DA based approach for fully parallel filters. We also compare our method with MAC based implementations, where we achieve significantly higher performance Though there has been a lot of work on optimizing constant multiplications using adders and employing redundancy elimination [15-19] , they have not been effectively used for FIR filter design. The closest work to implementing filters with adders is in [20], FIR filters are implemented using the Add and Shift method. Canonical Signed Digit (CSD) encoding is used for the coefficients to minimize the number of additions. The paper discusses how high speed implementations can be achieved by registering each adder, due to which the critical path becomes equal to the delay of the adder. Registering an adder output comes at no extra cost on an FPGA because of the presence of a D flip flop at the output of each LUT. In comparison with [20], we extensively use common subexpression elimination for reducing the number of adders and therefore area. Furthermore, our designs can run with sample rates as high as 252 Msps (Million samples per second), whereas the designs in [20] can run only at 78.6 Msps. In comparison with the other algorithms for common subexpression elimination [15, 16, 18, 19, 21], our method takes into account the structure of the FPGA slices (Figure 5) and takes into account both the cost of adders and registers when performing the optimization. Furthermore, we provide comprehensive evidence of the benefits of our technique through experimental results, where we compare our results with those produced by industry standard tools.
III. FILTER ARCHITECTURE
X y
X y
z-1
s'
X1 y1
LUT
SET
Q Q
s1
X1 y1
LUT
SET
Q Q
s'1
CLR
CLR
X0 y0
LUT
SET
s0
X0 y0
LUT
SET
s'0
CLR
CLR
Logic Block 1
Logic Block 1
(a)
(b)
Performing subexpression elimination can sometimes increase the number of registers substantially, and the overall area could possibly increase. Consider the two expressions F1 and F2 which could be part of the multiplier block. F1 = A + B + C + D F2 = A + B + C + E Figure 6 shows the original unoptimized expression trees. Both the expressions have a minimum critical path of two addition cycles. These expressions require a total of six registered adders for the fastest implementation, and no extra registers are required. From the expressions we can see that the computation A + B + C is common to both the expressions. If we extract this subexpression, we get the structure shown in Figure 7. Since both D and E need to wait for two addition cycles to be added to (A + B + C), we need to use two registers each for D and E, such that new values for A,B,C,D and E can be read in at each clock cycle. Assuming that the cost of an adder and a register with the same bitwidth are the same, the structure shown in Figure 7 occupies more area than the one shown in Figure 6. A more careful subexpression elimination algorithm would only extract the common subexpression A + B (or A+C or B + C). The number of adders is decreased by one from the original, and no additional registers are added. This is illustrated in Figure 8. The algorithm for performing this kind of optimization is described in the next section.
We base our filter architecture on the transposed form of the FIR filter as shown in Figure 1. The filter can be divided into two main parts, the multiplier block and the delay block, and is illustrated in Figure 4. In the multiplier block, the current input variable x[n] is multiplied by all the coefficients of the filter to produce the yi outputs. These yi outputs are then delayed and added in the delay block to produce the filter output y[n]. We perform all our optimizations in the multiplier block. The constant multiplications are decomposed into registered additions and hardwire shifts. The additions are performed using two input adders, which are arranged in the fastest tree structure. We use registered adders, so that the performance of the filter is only limited by the slowest adder. We use common subexpression elimination extensively, to reduce the number of adders, which leads to a reduction in the area. To synchronize all the intermediate values in the computation, we insert registers in the dataflow, wherever necessary.
The goal of our optimization is to reduce the area of the multiplier block by reducing the number of adders and any additional registers required for the fastest implementation of the FIR filter. We first give a brief overview of the common subexpression elimination methods. A detailed description can be found in [22]. We then present the modified optimization algorithm to be used for our work. A. Overview of common subexpression elimination We use a polynomial transformation of constant multiplications. Given a representation for the constant C, and the variable X, the multiplication C*X can be represented as a summation of terms denoting the decomposition of the multiplication into shifts and additions as C*X = XLi (V) The terms can be either positive or negative when the constants are represented using signed digit representations such as the Canonical Signed Digit (CSD) representation. The exponent of L represents the magnitude of the left shift and the is represent the digit positions of the non-zero digits of the constants. For example the multiplication 7*X = (100-1)CSD*X = X<<3 X = XL3 X, using the polynomial transformation. We use the divisors to represent all possible common subexpressions. Divisors are obtained from an expression by looking at every pair of terms in the expression and dividing the terms by the minimum exponent of L. For example in the expression F = XL2 + XL3 + XL5, consider the pair of terms (+XL2 + XL3). The minimum exponent of L in the two terms is L2. Dividing by L2, we get the divisor (X + XL). From the other two pairs of terms (XL2 + XL5) and (XL3 + XL5), we get the divisors (X + XL3) and (X + XL2) respectively. These divisors are significant, because every common subexpression in the set of expressions can be detected by performing intersections among the set of divisors. B. Optimization algorithm We first calculate the minimum number of registers required for our design. We calculate this by arranging the original expressions in the fastest possible tree structure, and then inserting registers. For example, for the six term expression F = A + B + C + D + E + F, we have the fastest tree structure with three addition steps, and we require one register to synchronize the intermediate values, such that new values for A,B,C,D,E,F can be read in every clock cycle. This is illustrated in Figure 9. We first generate all the divisors for the set of expressions describing the multiplier block. We then use an iterative algorithm, where we extract the divisor that has the greatest
i
value. To calculate the value of the divisor, we assume that the cost of a registered adder and a register is the same. We calculate the value of a divisor as the number of additions saved by extracting it minus the number of registers that have to be added. After selecting the best divisor, we rewrite the expressions using it. We then generate new divisors from the new terms that have been generated due to rewriting, and add them to the dynamic list of divisors. The iteration stops when there is no valuable divisor remaining in the set of divisors. Consider the expressions shown in Figure 6. We need six registered adders and no additional registers for the fastest evaluation of F1 and F2. Now consider the selection of the divisor d1 = (A+B). This divisor saves one addition and does not increase the number of registers. Divisors (A + C) and (B + C) also have the same value, but (A+B) is selected randomly. The expressions are now rewritten as:
d1 = (A + B) F1 = d1 + C + D F2 = d1 + C + E
ReduceArea( {Pi} ) { {Pi} = Set of expressions in polynomial form; {D} = Set o f divisors = ; //Step 1: Creating divisors and calculating minimum number of registers required for each expression Pi in {Pi} { {Dnew} = FindDivisors(Pi); Update frequency statistics of divisors in {D}; {D} = {D} { Dnew}; Pi->MinRegisters = Calculate Minimum registers required for fastest evaluation of Pi ; } //Step 2: Iterative selection and elimination of best divisor while(1) { Find d = Divisor in {D} with greatest Value; // Value = Num Additions reduced Num Registers Added; if( d == NULL) break; Rewrite affected expressions in {Pi} using d; Remove divisors in {D} that have become invalid; Update frequency statistics of affected divisors; {Dnew} = Set of new divisors from new terms added by division; {D} = {D} {Dnew};
} }
%R eductio
After rewriting the expressions and forming new divisors, the divisor d2 = (d1 + C) is considered. This divisor saves one adder, but introduces five additional registers, as can be seen in Figure 7. Therefore this divisor has a value of - 4. No other valuable divisors can be found and the iteration stops. We end up with the expressions shown in Figure 8.
V. EXPERIMENTS
Reduction in Resources
80 70 60 50 40 30 20 10 0 6 10 13 20 28 # of Taps 41 61 119 152 SLIC Es LU T s FFs
The goal of our experiments was to compare the number of resources consumed by our add and shift method with that produced by the cores generated by the commercial CoregenTM tool, based on Distributed Arithmetic. Besides the resources, we also compared the power consumption of the two implementations, and also measured the performance. For our experiments, we considered 9 FIR filters of various sizes (6, 10, 13, 20, 28, 41, 61, 119 and 151 tap filters). We targeted the Xilinx Virtex II device for our experiments. The constants were normalized to 17 digit of precision and the input samples were assumed to be 12 bits wide. For the add and shift method, we decomposed all the constant multiplications into additions and shifts and optimized the expressions using the algorithm explained in Section 4.2. We used the Xilinx Integrated Software Environment (ISE) for performing synthesis and implementation of the designs. All the designs were synthesized for maximum performance. Table 1a shows the resources utilized for the various filters and the performance in terms of Million samples per second (Msps) for the filters implemented using the add and shift method. Table 1b, shows the same numbers for the filters implemented using Xilinx Coregen, using the Parallel Distributed Arithmetic (PDA) method.
Table 1a. Filter Synthesis using Add Shift method
Filter (# taps) 6 10 13 20 28 41 61 119 151 Slices 264 474 386 856 1294 2154 3264 6009 7579 LUTs 213 406 334 705 1145 1719 2591 4821 6098 FFs 509 916 749 1650 2508 4161 6303 11551 14611 Performance (Msps) 251 222 252 250 227 223 192 203 180
Figure 12 compares power consumption for our add/shift method versus CoregenTM. From the results we can observe up to 50% reduction in dynamic power consumption. We did not include the quiescent power into our calculation since that value is the same for both methods. The power consumption is the result of applying the same test stimulus to both designs and measuring the power using XPower tools provided by Xilinx ISE software.
Dynam ic Power Consum ption
1600 1400 1200 1000 800 600 400 200 0 6 10 13 20 28 41 61 119 Filter s ize (# of taps )
Power (mw
Add/Shift C oregen
Comparison with MAC filters using embedded multipliers CoregenTM can produce FIR filters based on the Multiply Accumulate (MAC) method, which makes use of the embedded multipliers and DSP blocks. We implemented the FIR filters using the MAC method to compare the resource usage and performance with our add and shift method. Due to tool limitations we had to do the experiments for Virtex IV device . We present the synthesis results in terms of number of slices on the Virtex IV device and the performance in Msps in Table 2.
Table 2. Comparing with MAC filter on Virtex IV Filter (# taps) 6 10 13 20 28 41 61 119 151 Add Shift Method Slices Msps 264 296 475 296 387 296 851 271 1303 305 2178 296 3284 247 6025 294 7623 294 MAC filter Slices Msps 219 262 418 253 462 253 790 251 886 251 1660 243 1947 242 3581 241 7631 215
Figure 11 plots the reduction in the number of resources, in terms of the number of Slices, Look Up Tables (LUTs) and the number of Flip Flops (FFs). From the results, we can observe an average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs. Though our algorithm does not optimize for performance, the synthesis produces better performance in most of the cases, and for the 13 and 20 tap filters, we observe about 26% improvement in performance.
Table 1b. Filter Synthesis using Coregen (PDA method) Filter (# taps)
6 10 13 20 28 41 61 119 151
Slices
524 781 929 1191 1774 2475 3528 6484 8274
LUTs
774 1103 1311 1631 2544 3642 5335 9754 12525
FFs
1012 1480 1775 2288 3381 4748 6812 12539 15988
Performance (Msps)
245 222 199 199 199 222 199 205 199
From the table, it can be seen that the MAC filter uses fewer number of slices compared to the add-shift method, but it also
uses the DSP blocks available on Virtex IV devices. The number of DSP blocks is equal to the number of taps of the filter. The results show that we achieve higher performance as the filter size increases. This is mainly because that critical path in our design consists of adders while in MAC method, critical path consists of multipliers and adders. Another limitation for MAC method is that Xilinx CoregenTM is limited to input width of 17 bits due to the embedded DSP block input limitation while our add and shift method can accept inputs of any width.
VI. CONCLUSION
[16]
In this paper we presented a multiplierless technique, based on the add and shift method and common subexpression elimination for low area, low power and high speed implementations of FIR filters. We validated our techniques on Virtex IITM devices where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques. In future, we would like to modify our algorithm to make use of the limited number of embedded multipliers available on the FPGA devices.
[17]
[18]
[19]
VII. REFERENCES [1] K.D.Underwood and K.S.Hemmert, "Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance," presented at International Symposium on Field-Programmable Custom Computing Machines, California, USA, 2004. L.Zhuo and V.K.Prasanna, "Sparse Matrix-Vector Multiplication on FPGAs," presented at International Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, 2005. Y.Meng, A.P.Brown, R.A.Iltis, T.Sherwood, H.Lee, and R.Kastner, "MP Core: Algorithm and Design Techniques for Efficient Channel Estimation in Wireless Applications," presented at Design Automation Conference (DAC), Anaheim, CA, 2005. B. L. Hutchings and B. E. Nelson, "Gigaop DSP on FPGA," presented at Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on, 2001. A.Alsolaim, J.Becker, M.Glesner, and J.Starzyk, "Architecture and Application of a Dynamically Reconfigurable Hardware Array for Future Mobile Communication Systems," presented at International Symposium on Field Programmable Custom Computing Machines (FCCM), 2000. S.J.Melnikoff, S.F.Quigley, and M.J.Russell, "Implementing a Simple Continuous Speech Recognition System on an FPGA," presented at International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002. T.Yokota, M.Nagafuchi, Y.Mekada, T.Yoshinaga, K.Ootsu, and T.Baba, "A Scalable FPGA-based Custom Computing Machine for Medical Image Processing," presented at International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002. K.Chapman, "Constant Coefficient Multipliers for the XC4000E," Xilinx Technical Report 1996. K. Wiatr and E. Jamro, "Constant coefficient multiplication in FPGA structures," presented at Euromicro Conference, 2000. Proceedings of the 26th, 2000. M. J. Wirthlin and B. McMurtrey, "Efficient Constant Coefficient Multiplication Using Advanced FPGA Architectures," presented at International Conference on Field Programmable Logic and Applications (FPL), 2001. M.J.Wirthlin, "Constant Coefficient Multiplication Using Look-Up Tables," Journal of VLSI Signal Processing, vol. 36, pp. 7-15, 2004.
[20]
[21]
[2] [3]
[22]
[4]
"Distributed Arithmetic FIR Filter v9.0," Xilinx Product Specification 2004. T. Sasao, Y. Iguchi, and T. Suzuki, "On LUT Cascade Realizations of FIR Filters," presented at Euromicro Conference on Digital System Design (DSD), 2005. G.R.Goslin, "A Guide to Using Field Programmable Gate Arrays (FPGAs) for Application-Specific Digital Signal Processing Performance," Xilinx Application Note, San Jose 1995. M.Potkonjak, M.B.Srivastava, and A.P.Chandrakasan, "Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 1996. R.I.Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on [see also Circuits and Systems II: Express Briefs, IEEE Transactions on] , vol. 43, pp. 677-688, 1996. H.T.Nguyen and A.Chatterjee, "Number-splitting with shift-andadd decomposition for power and hardware optimization in linear DSP synthesis," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 8, pp. 419-424, 2000. H.-J. Kang, H. Kim, and I.-C. Park, "FIR filter synthesis algorithms for minimizing the delay and the number of adders," presented at Computer Aided Design, 2000. ICCAD-2000. IEEE/ACM International Conference on, 2000. A.Hosangadi, F.Fallah, and R.Kastner, "Reducing Hardware Compleity of Linear DSP Systems by Iteratively Eliminating Two Term Common Subexpressions," presented at Asia South Pacific Design Automation Conference, Shanghai, 2005. M. Yamada and A. Nishihara, "High-speed FIR digital filter with CSD coefficients implemented on FPGA," presented at Design Automation Conference, 2001. Proceedings of the ASP-DAC 2001. Asia and South Pacific, 2001. H.Safiri, M.Ahmadi, G.A.Jullien, and W.C.Miller, "A new algorithm for the elimination of common subexpressions in hardware implementation of digital filters by using genetic programming," presented at Application-Specific Systems, Architectures, and Processors, 2000. Proceedings. IEEE International Conference on, 2000. A.Hosangadi, F.Fallah, and R.Kastner, "Reducing Hardware complexity by iteratively eliminating two term common subexpressions," presented at Asia South Pacific Design Automation Conference (ASP-DAC), 2005.
[5]
[6]
[7]
[11]