FPGA Implementation of High Speed FIR Filters Using Add and Shift Method

Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
University of California, Santa Barbara, CA 93106
Abstract-We present a method for implementing high speed Finite Impulse Response (FIR) filters using just registered adders and hardwired shifts. We extensively use a modified common subexpression elimination algorithm to reduce the number of adders. We target our optimizations to Xilinx Virtex II devices, where we compare our implementations with those produced by Xilinx Coregen™ using Distributed Arithmetic. We observe up to 50% reduction in the number of slices and up to 75% reduction in the number of LUTs for fully parallel implementations. We also observe up to 50% reduction in the total dynamic power consumption of the filters. Our designs perform significantly faster than the MAC filters, which use embedded multipliers.
I. INTRODUCTION

FPGAs are increasingly used for a variety of computationally intensive applications, mainly in the realm of Digital Signal Processing (DSP) and communications [1-7]. Due to rapid advances in technology, the current generation of FPGAs contains a very large number of Configurable Logic Blocks (CLBs), and FPGAs are becoming feasible for implementing a wide range of applications. The high non-recurring engineering (NRE) costs and long development time of ASICs are making FPGAs more attractive for application-specific DSP solutions. DSP functions such as FIR filters and transforms are used in a number of applications such as communication and multimedia. These functions are major determinants of the performance and power consumption of the whole system. Therefore it is important to have good tools for optimizing these functions. Equation (I) represents the output of an L-tap FIR filter, which is the convolution of the latest L input samples with the filter coefficients. L is the number of coefficients h[k] of the filter, and x[n] represents the input time series.

$$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k] \qquad \text{(I)}$$

The conventional tapped delay line realization of this inner product is shown in Figure 1. This implementation translates to L multiplications and L-1 additions per sample to compute the result. It can be implemented using a single Multiply Accumulate (MAC) engine, but that would require L MAC cycles before the next input sample can be processed. A parallel implementation with L MACs can speed up the performance L times. A general purpose multiplier occupies a large area on FPGAs. Since all the multiplications are with constants, the full flexibility of a general purpose multiplier is not required, and the area can be vastly reduced using techniques developed for constant multiplication. Though most of the current generation FPGAs such as Virtex II™ have embedded multipliers to handle these multiplications, the number of these multipliers is typically limited. Furthermore, the size of these multipliers is limited to only 18 bits, which limits the precision of the computations for high speed requirements. The ideal implementation would share the work between the Configurable Logic Blocks (CLBs) and these multipliers. In this paper, we present a technique that is better than conventional techniques for implementation on the CLBs.

Figure 1. A MAC FIR filter block diagram

An alternative to the above approach is Distributed Arithmetic (DA), which is a well known method to save resources. Using the DA method, the filter can be implemented either in bit serial or in fully parallel mode to trade bandwidth for area utilization. Assuming the coefficients c[n] are known constants, equation (I) can be rewritten as the inner product

$$y = \sum_{n=0}^{N-1} c[n]\, x[n] \qquad \text{(II)}$$

The variable x[n] can be represented by

$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\} \qquad \text{(III)}$$

where x_b[n] is the b-th bit of x[n] and B is the input width. Finally, the inner product can be rewritten as follows:

$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\left(x_{B-1}[0]\,2^{B-1} + x_{B-2}[0]\,2^{B-2} + \dots + x_0[0]\,2^0\right) \\
  &\quad + c[1]\left(x_{B-1}[1]\,2^{B-1} + x_{B-2}[1]\,2^{B-2} + \dots + x_0[1]\,2^0\right) + \dots \\
  &\quad + c[N-1]\left(x_{B-1}[N-1]\,2^{B-1} + x_{B-2}[N-1]\,2^{B-2} + \dots + x_0[N-1]\,2^0\right) \\
  &= \left(c[0]\,x_{B-1}[0] + c[1]\,x_{B-1}[1] + \dots + c[N-1]\,x_{B-1}[N-1]\right) 2^{B-1} \\
  &\quad + \left(c[0]\,x_{B-2}[0] + c[1]\,x_{B-2}[1] + \dots + c[N-1]\,x_{B-2}[N-1]\right) 2^{B-2} + \dots \\
  &\quad + \left(c[0]\,x_0[0] + c[1]\,x_0[1] + \dots + c[N-1]\,x_0[N-1]\right) 2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
\qquad \text{(IV)}
$$

where n = 0, 1, ..., N-1 and b = 0, 1, ..., B-1.

1-4244-9707-X/06/$20.00 2006 IEEE

The coefficients in most DSP applications of the multiply-accumulate operation are constants. The partial products are obtained by multiplying the coefficients c_i by one bit of the data x_i at a time, which is an AND operation. These partial products are then added, and the result depends only on the outputs of the input shift registers. The AND functions and adders can therefore be replaced by Look Up Tables (LUTs) that give the sum of the partial products directly. This is shown in Figure 2. The input sequence is fed into the shift register at the input sample rate. The serial output is presented to the RAM-based shift registers (the registers are not shown in the figure for simplicity) at the bit clock rate, which is n+1 times the sample rate (n is the number of bits in a data input sample). The RAM-based shift register stores the data at a particular address. The outputs of the registered LUTs are added and loaded into the scaling accumulator from LSB to MSB, and the result, which is the filter output, is accumulated over time. For an n-bit input, n+1 clock cycles are needed for a symmetrical filter to generate the output. In the conventional MAC method with a limited number of MAC engines, the system sample rate decreases as the filter length increases. This is not the case with serial DA architectures, since the filter sample rate is decoupled from the filter length. As the filter length is increased, the throughput is maintained but more logic resources are consumed. Though the serial DA architecture is efficient by construction, its performance is limited by the fact that the next input sample can be processed only after every bit of the current input sample has been processed. Each bit of the current input sample takes one clock cycle to process.
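The LUT and scaling accumulator mechanism can be illustrated with a small software model. The following Python sketch is not the hardware description used in this work; it assumes unsigned input samples (two's complement inputs would need the sign bit plane subtracted), it scales the LUT output instead of shifting the accumulator, and the function names and test values are hypothetical.

```python
def build_da_lut(coeffs):
    """Entry at address a holds the sum of the coefficients whose bit is set in a."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def serial_da_inner_product(coeffs, samples, width):
    """Compute sum(c[k] * x[k]) for unsigned width-bit samples, one bit plane per step."""
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(width):  # LSB first; each iteration models one bit clock
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k  # bit b of sample k selects coefficient k
        acc += lut[addr] * (1 << b)  # accumulate the LUT output weighted by 2^b
    return acc


coeffs = [3, -1, 4, 2]    # hypothetical constant coefficients
samples = [5, 9, 12, 7]   # hypothetical unsigned 4-bit samples
da_result = serial_da_inner_product(coeffs, samples, width=4)
direct_result = sum(c * x for c, x in zip(coeffs, samples))
print(da_result, direct_result)
```

The loop over `b` makes the throughput limit visible: one pass per input bit is needed before the next sample can be accepted, while the LUT size grows as 2^N with the number of taps N it serves.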
Figure 2. A serial DA FIR filter block diagram (the figure lists LUT addresses 0000, 0001, 0010, ..., 1111 with data 0, C0, C0+C1, ..., C0+C1+C2+C3)

If the input bitwidth is 12, for example, a new input can therefore be sampled only every 12 clock cycles. The performance of the circuit can be improved by changing to a parallel architecture which processes the data bits in groups. Figure 3 shows the block diagram of a 2 bit parallel DA FIR filter. The tradeoff here is performance for area, since increasing the number of bits processed per clock cycle has a significant effect on resource utilization on the FPGA. For instance, doubling the number of bits processed per clock cycle doubles the throughput and halves the number of clock cycles per sample. This change doubles the number of LUTs as well as the size of the scaling accumulator. The number of bits processed per clock cycle can be increased up to its maximum, which is the input width n. This gives the filter its maximum throughput. For a fully parallel implementation of the DA filter (PDA), the number of LUTs required would be enormous. In this work we show an alternative to the PDA method for implementing high speed FIR filters that consumes significantly less area and power.

Figure 3. A 2 bit parallel DA FIR filter block diagram

A popular technique for implementing the transposed form of FIR filters is the use of a multiplier block, instead of using a multiplier for each constant, as shown in Figure 4. The multiplications with the set of constants {h_k} are replaced by an optimized set of additions and shift operations, involving computation sharing. Further optimization can be done by factorizing the expressions and finding common subexpressions. The performance of this filter architecture is limited by the latency of the biggest adder and is the same as that of the PDA.

Figure 4. Replacing constant multiplication by multiplier block

The main contribution of this paper is the development of a novel algorithm for optimizing the multiplier block of FIR filters, using a modified algorithm for common subexpression elimination. The goal of the algorithm is to produce a filter that provides the maximum sample rate with the least amount of hardware. Our algorithm takes into account the specific features of FPGA slices to reduce the total number of occupied slices. The reduced number of slices also leads to a reduction in the total power consumed on the FPGA. We compare the total area and power consumption of our results with those of the industry standard Xilinx Coregen™. The rest of the paper is organized as follows: Section II presents some related work. In Section III, we describe our filter architecture. In Section IV, we present our optimization algorithm for reducing the total area of the design. In Section V, we describe our experimental setup and present our results. Finally, we conclude the paper in Section VI.
II. RELATED WORK

Multiplications with constants have to be performed in many signal processing and communication applications such as FIR filters and audio, video and image processing. Since implementing a general purpose multiplier on an FPGA is expensive, and since such a multiplier is not really needed when one of the operands is a constant, there has been a lot of work on deriving efficient structures for constant multiplications [8-13]. All these techniques are based on computing constant multiplications using table lookups and additions. The method of Distributed Arithmetic [12, 14], which is the most popular method for implementing multiplierless FIR filters, is also based on table lookup. The Xilinx™ CORE Generator has a highly parameterizable, optimized filter core for implementing digital FIR filters [12], based on both Distributed Arithmetic and MAC (Multiply Accumulate) architectures. It generates synthesized cores that target a wide range of Xilinx devices. The MAC based implementations make use of the embedded DSP slices on the FPGA devices. In this work, we primarily compare our technique with the Coregen implementation of Distributed Arithmetic, since that is also a multiplierless technique. We show that our designs are much more area efficient than the DA based approach for fully parallel filters. We also compare our method with MAC based implementations, where we achieve significantly higher performance.

Though there has been a lot of work on optimizing constant multiplications using adders and employing redundancy elimination [15-19], these methods have not been effectively used for FIR filter design. The closest work to implementing filters with adders is [20], where FIR filters are implemented using the add and shift method. Canonical Signed Digit (CSD) encoding is used for the coefficients to minimize the number of additions. That paper discusses how high speed implementations can be achieved by registering each adder, so that the critical path becomes equal to the delay of one adder. Registering an adder output comes at no extra cost on an FPGA because of the presence of a D flip flop at the output of each LUT. In comparison with [20], we extensively use common subexpression elimination to reduce the number of adders and therefore the area. Furthermore, our designs can run with sample rates as high as 252 Msps (Million samples per second), whereas the designs in [20] can run only at 78.6 Msps. In comparison with the other algorithms for common subexpression elimination [15, 16, 18, 19, 21], our method takes into account the structure of the FPGA slices (Figure 5) and the cost of both adders and registers when performing the optimization. Furthermore, we provide comprehensive evidence of the benefits of our technique through experimental results, where we compare our results with those produced by industry standard tools.
III. FILTER ARCHITECTURE

We base our filter architecture on the transposed form of the FIR filter as shown in Figure 1. The filter can be divided into two main parts, the multiplier block and the delay block, as illustrated in Figure 4. In the multiplier block, the current input variable x[n] is multiplied by all the coefficients of the filter to produce the y_i outputs. These y_i outputs are then delayed and added in the delay block to produce the filter output y[n]. We perform all our optimizations in the multiplier block. The constant multiplications are decomposed into registered additions and hardwired shifts. The additions are performed using two-input adders, which are arranged in the fastest tree structure. We use registered adders, so that the performance of the filter is limited only by the slowest adder. We use common subexpression elimination extensively to reduce the number of adders, which leads to a reduction in the area. To synchronize all the intermediate values in the computation, we insert registers in the dataflow wherever necessary.

Figure 5. Registered adder at no additional cost

Performing subexpression elimination can sometimes increase the number of registers substantially, so that the overall area could increase. Consider the two expressions F1 and F2, which could be part of the multiplier block:

F1 = A + B + C + D
F2 = A + B + C + E

Figure 6 shows the original unoptimized expression trees. Both expressions have a minimum critical path of two addition cycles. These expressions require a total of six registered adders for the fastest implementation, and no extra registers are required. From the expressions we can see that the computation A + B + C is common to both. If we extract this subexpression, we get the structure shown in Figure 7. Since both D and E need to wait for two addition cycles before they are added to (A + B + C), we need to use two registers each for D and E, so that new values for A, B, C, D and E can be read in at each clock cycle. Assuming that an adder and a register of the same bitwidth have the same cost, the structure shown in Figure 7 occupies more area than the one shown in Figure 6. A more careful subexpression elimination algorithm would extract only the common subexpression A + B (or A + C or B + C). The number of adders is then decreased by one from the original, and no additional registers are added. This is illustrated in Figure 8. The algorithm for performing this kind of optimization is described in the next section.
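The split of the transposed form into a multiplier block and a delay block, as described in this section, can be sketched functionally as follows. This is a behavioral Python model under stated assumptions, not the synthesized design: it ignores the pipeline latency that registered adders introduce inside the multiplier block, it performs no subexpression sharing, and the coefficients 7, 3 and 5 with their (sign, shift) decompositions are hypothetical.

```python
def shift_add_multiply(x, terms):
    """Multiply x by a constant given as (sign, shift) terms, using shifts and adds only."""
    return sum(sign * (x << shift) for sign, shift in terms)


def transposed_fir(samples, coeff_terms):
    """Transposed-form FIR: multiplier block followed by a chain of registered sums."""
    regs = [0] * len(coeff_terms)  # regs[k] is the registered partial sum feeding tap k
    outputs = []
    for x in samples:
        # Multiplier block: the current input times every coefficient.
        products = [shift_add_multiply(x, terms) for terms in coeff_terms]
        # Delay block: add each product to the delayed partial sum of the next tap.
        sums = [p + r for p, r in zip(products, regs)]
        outputs.append(sums[0])
        regs = sums[1:] + [0]
    return outputs


def direct_fir(samples, coeffs):
    """Reference model of equation (I): y[n] = sum over k of h[k] * x[n-k]."""
    return [
        sum(h * samples[n - k] for k, h in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


coeffs = [7, 3, 5]                      # hypothetical integer coefficients
coeff_terms = [
    [(1, 3), (-1, 0)],                  # 7 = 8 - 1
    [(1, 2), (-1, 0)],                  # 3 = 4 - 1
    [(1, 2), (1, 0)],                   # 5 = 4 + 1
]
samples = [1, 2, -3, 4, 0, 6]
transposed_out = transposed_fir(samples, coeff_terms)
direct_out = direct_fir(samples, coeffs)
print(transposed_out == direct_out)
```

Because every product depends only on the current input x[n], all optimization effort can be concentrated on the multiplier block, which is the design choice made in this work.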

Figure 6. Unoptimized expression trees

Figure 7. Extracting common expression (A + B + C)

Figure 8. Extracting common subexpression (A+B)

Figure 9. Calculating registers required for fastest evaluation

IV. OPTIMIZATION ALGORITHM

The goal of our optimization is to reduce the area of the multiplier block by reducing the number of adders and any additional registers required for the fastest implementation of the FIR filter. We first give a brief overview of the common subexpression elimination methods. A detailed description can be found in [22]. We then present the modified optimization algorithm used in our work.

A. Overview of common subexpression elimination

We use a polynomial transformation of constant multiplications. Given a representation for the constant C and the variable X, the multiplication C*X can be represented as a summation of terms denoting the decomposition of the multiplication into shifts and additions as

$$C \cdot X = \sum_i \pm X L^i \qquad \text{(V)}$$

The terms can be either positive or negative when the constants are represented using signed digit representations such as the Canonical Signed Digit (CSD) representation. The exponent of L represents the magnitude of the left shift, and the i's represent the digit positions of the non-zero digits of the constant. For example, using the polynomial transformation, the multiplication 7*X = (100-1)_CSD * X = (X << 3) - X = XL^3 - X.

We use divisors to represent all possible common subexpressions. Divisors are obtained from an expression by looking at every pair of terms in the expression and dividing the terms by the minimum power of L. For example, in the expression F = XL^2 + XL^3 + XL^5, consider the pair of terms (XL^2 + XL^3). The minimum exponent of L in the two terms is 2. Dividing by L^2, we get the divisor (X + XL). From the other two pairs of terms, (XL^2 + XL^5) and (XL^3 + XL^5), we get the divisors (X + XL^3) and (X + XL^2) respectively. These divisors are significant because every common subexpression in the set of expressions can be detected by performing intersections among the set of divisors.

B. Optimization algorithm

We first calculate the minimum number of registers required for our design. We calculate this by arranging the original expressions in the fastest possible tree structure and then inserting registers. For example, for the six term expression F = A + B + C + D + E + F, the fastest tree structure has three addition steps, and we require one register to synchronize the intermediate values, so that new values for A, B, C, D, E, F can be read in every clock cycle. This is illustrated in Figure 9.

We first generate all the divisors for the set of expressions describing the multiplier block. We then use an iterative algorithm, where in each step we extract the divisor that has the greatest value. To calculate the value of a divisor, we assume that the cost of a registered adder and of a register is the same. We calculate the value of a divisor as the number of additions saved by extracting it minus the number of registers that have to be added. After selecting the best divisor, we rewrite the expressions using it. We then generate new divisors from the new terms that have been generated by the rewriting, and add them to the dynamic list of divisors. The iteration stops when there is no valuable divisor remaining in the set of divisors.

Consider the expressions shown in Figure 6. We need six registered adders and no additional registers for the fastest evaluation of F1 and F2. Now consider the selection of the divisor d1 = (A + B). This divisor saves one addition and does not increase the number of registers. The divisors (A + C) and (B + C) have the same value, and (A + B) is selected randomly among them. The expressions are now rewritten as:

d1 = (A + B)
F1 = d1 + C + D
F2 = d1 + C + E
ReduceArea({Pi})
{
    {Pi} = Set of expressions in polynomial form;
    {D} = Set of divisors = ∅;

    // Step 1: Creating divisors and calculating minimum number of
    // registers required
    for each expression Pi in {Pi}
    {
        {Dnew} = FindDivisors(Pi);
        Update frequency statistics of divisors in {D};
        {D} = {D} ∪ {Dnew};
        Pi->MinRegisters = Calculate minimum registers required for
                           fastest evaluation of Pi;
    }

    // Step 2: Iterative selection and elimination of best divisor
    while (1)
    {
        Find d = Divisor in {D} with greatest Value;
        // Value = Num Additions reduced - Num Registers Added;
        if (d == NULL) break;
        Rewrite affected expressions in {Pi} using d;
        Remove divisors in {D} that have become invalid;
        Update frequency statistics of affected divisors;
        {Dnew} = Set of new divisors from new terms added by division;
        {D} = {D} ∪ {Dnew};
    }
}

Figure 10. Optimization algorithm to reduce area
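The polynomial representation and the divisor generation of Section IV.A can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation behind Figure 10: the CSD recoding below is the standard non-adjacent-form algorithm, a term is represented as a hypothetical (sign, shift) pair standing for plus or minus XL^shift, and the function names are assumptions.

```python
from itertools import combinations


def csd_terms(c):
    """Return the CSD digits of a positive integer c as (sign, shift) pairs, LSB first."""
    terms = []
    shift = 0
    while c != 0:
        if c & 1:
            digit = 2 - (c & 3)  # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            terms.append((digit, shift))
            c -= digit           # clears this digit and leaves a zero next to it
        c >>= 1
        shift += 1
    return terms


def divisors(terms):
    """For every pair of terms, divide both by the smaller power of L."""
    ordered = sorted(terms, key=lambda t: t[1])
    result = []
    for (s1, e1), (s2, e2) in combinations(ordered, 2):
        low = min(e1, e2)
        result.append(((s1, e1 - low), (s2, e2 - low)))
    return result


seven = csd_terms(7)                      # 7*X = XL^3 - X
example = [(1, 2), (1, 3), (1, 5)]        # F = XL^2 + XL^3 + XL^5
example_divisors = divisors(example)      # (X + XL), (X + XL^3), (X + XL^2)
print(seven, example_divisors)
```

Counting how often the same divisor appears across all expressions of the multiplier block gives the frequency statistics that the algorithm in Figure 10 maintains.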


After rewriting the expressions and forming new divisors, the divisor d2 = (d1 + C) is considered. This divisor saves one adder but introduces five additional registers, as can be seen in Figure 7. Therefore this divisor has a value of -4. No other valuable divisors can be found and the iteration stops. We end up with the expressions shown in Figure 8.
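The adder and register counts quoted for Figures 6 to 8, and the resulting divisor values, can be checked with a small cost model. The sketch below is a simplification under stated assumptions: every adder is registered, all inputs arrive in the same clock cycle, each adder input that arrives early is delayed by its own chain of registers (delay chains are not shared), and the node names are hypothetical.

```python
def adders_and_registers(nodes, outputs):
    """Count registered two-input adders and synchronizing registers in an adder DAG.

    nodes maps an adder name to its (left, right) operands; any operand that is
    not a key of nodes is a primary input available at time 0.
    """
    depth = {}

    def depth_of(name):
        if name not in nodes:
            return 0
        if name not in depth:
            left, right = nodes[name]
            depth[name] = 1 + max(depth_of(left), depth_of(right))
        return depth[name]

    for out in outputs:
        depth_of(out)
    # The earlier operand of each adder waits one register per cycle of mismatch.
    registers = sum(
        abs(depth_of(left) - depth_of(right))
        for name, (left, right) in nodes.items()
        if name in depth
    )
    return len(depth), registers


# Figure 6: unoptimized trees, (A+B)+(C+D) and (A+B)+(C+E)
fig6 = {"ab_1": ("A", "B"), "cd": ("C", "D"), "F1": ("ab_1", "cd"),
        "ab_2": ("A", "B"), "ce": ("C", "E"), "F2": ("ab_2", "ce")}
# Figure 7: common subexpression (A+B+C) extracted
fig7 = {"ab": ("A", "B"), "abc": ("ab", "C"), "F1": ("abc", "D"), "F2": ("abc", "E")}
# Figure 8: only d1 = (A+B) extracted
fig8 = {"d1": ("A", "B"), "cd": ("C", "D"), "ce": ("C", "E"),
        "F1": ("d1", "cd"), "F2": ("d1", "ce")}

cost6 = adders_and_registers(fig6, ["F1", "F2"])
cost7 = adders_and_registers(fig7, ["F1", "F2"])
cost8 = adders_and_registers(fig8, ["F1", "F2"])

# Value = additions saved minus registers added, relative to the previous step
value_d1 = (cost6[0] - cost8[0]) - (cost8[1] - cost6[1])
value_d2 = (cost8[0] - cost7[0]) - (cost7[1] - cost8[1])
print(cost6, cost7, cost8, value_d1, value_d2)
```

Under these assumptions the model reproduces the counts in the text: six adders and no registers for Figure 6, five adders and no registers for Figure 8, and four adders with five registers for Figure 7.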
V. EXPERIMENTS

The goal of our experiments was to compare the resources consumed by our add and shift method with those consumed by the cores generated by the commercial Coregen™ tool, based on Distributed Arithmetic. Besides the resources, we also compared the power consumption of the two implementations and measured the performance. For our experiments, we considered 9 FIR filters of various sizes (6, 10, 13, 20, 28, 41, 61, 119 and 151 tap filters). We targeted the Xilinx Virtex II device for our experiments. The constants were normalized to 17 digits of precision and the input samples were assumed to be 12 bits wide. For the add and shift method, we decomposed all the constant multiplications into additions and shifts and optimized the expressions using the algorithm explained in Section IV.B. We used the Xilinx Integrated Software Environment (ISE) for performing synthesis and implementation of the designs. All the designs were synthesized for maximum performance. Table 1a shows the resources utilized for the various filters and the performance in terms of Million samples per second (Msps) for the filters implemented using the add and shift method. Table 1b shows the same numbers for the filters implemented using Xilinx Coregen with the Parallel Distributed Arithmetic (PDA) method.

Table 1a. Filter synthesis using the add and shift method
Filter (# taps)   Slices   LUTs    FFs     Performance (Msps)
6                 264      213     509     251
10                474      406     916     222
13                386      334     749     252
20                856      705     1650    250
28                1294     1145    2508    227
41                2154     1719    4161    223
61                3264     2591    6303    192
119               6009     4821    11551   203
151               7579     6098    14611   180

Table 1b. Filter synthesis using Coregen (PDA method)
Filter (# taps)   Slices   LUTs    FFs     Performance (Msps)
6                 524      774     1012    245
10                781      1103    1480    222
13                929      1311    1775    199
20                1191     1631    2288    199
28                1774     2544    3381    199
41                2475     3642    4748    222
61                3528     5335    6812    199
119               6484     9754    12539   205
151               8274     12525   15988   199

Figure 11 plots the reduction in the number of resources, in terms of the number of slices, Look Up Tables (LUTs) and Flip Flops (FFs). From the results, we can observe an average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs. Though our algorithm does not optimize for performance, the synthesis produces better performance in most of the cases, and for the 13 and 20 tap filters we observe about 26% improvement in performance.

Figure 11. Reduction in resources (percentage reduction in slices, LUTs and FFs versus number of taps)

Figure 12 compares the power consumption of our add and shift method with that of Coregen™. From the results we can observe up to 50% reduction in dynamic power consumption. We did not include the quiescent power in our calculation, since that value is the same for both methods. The power consumption was obtained by applying the same test stimulus to both designs and measuring the power using the XPower tool provided with the Xilinx ISE software.

Figure 12. Power consumption (dynamic power in mW versus filter size in number of taps, add/shift versus Coregen)

Comparison with MAC filters using embedded multipliers

Coregen™ can produce FIR filters based on the Multiply Accumulate (MAC) method, which makes use of the embedded multipliers and DSP blocks. We implemented the FIR filters using the MAC method to compare the resource usage and performance with our add and shift method. Due to tool limitations we had to do these experiments on a Virtex IV device. We present the synthesis results in terms of the number of slices on the Virtex IV device and the performance in Msps in Table 2.

Table 2. Comparing with MAC filter on Virtex IV
                  Add Shift Method      MAC filter
Filter (# taps)   Slices   Msps         Slices   Msps
6                 264      296          219      262
10                475      296          418      253
13                387      296          462      253
20                851      271          790      251
28                1303     305          886      251
41                2178     296          1660     243
61                3284     247          1947     242
119               6025     294          3581     241
151               7623     294          7631     215

From the table, it can be seen that the MAC filter uses fewer slices than the add and shift method for most filter sizes, but it also uses the DSP blocks available on Virtex IV devices. The number of DSP blocks is equal to the number of taps of the filter. The results show that we achieve higher performance as the filter size increases. This is mainly because the critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders. Another limitation of the MAC method is that Xilinx Coregen™ is limited to an input width of 17 bits due to the input limitation of the embedded DSP block, while our add and shift method can accept inputs of any width.
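The average reductions quoted for Figure 11 can be recomputed directly from Tables 1a and 1b. The short Python sketch below does this; it assumes that the average is the unweighted mean of the per-filter percentage reductions, which the paper does not state explicitly, and the variable names are ours.

```python
taps = [6, 10, 13, 20, 28, 41, 61, 119, 151]

# Table 1a: add and shift method
add_shift = {
    "slices": [264, 474, 386, 856, 1294, 2154, 3264, 6009, 7579],
    "luts": [213, 406, 334, 705, 1145, 1719, 2591, 4821, 6098],
    "ffs": [509, 916, 749, 1650, 2508, 4161, 6303, 11551, 14611],
}
# Table 1b: Coregen, PDA method
pda = {
    "slices": [524, 781, 929, 1191, 1774, 2475, 3528, 6484, 8274],
    "luts": [774, 1103, 1311, 1631, 2544, 3642, 5335, 9754, 12525],
    "ffs": [1012, 1480, 1775, 2288, 3381, 4748, 6812, 12539, 15988],
}


def average_reduction(ours, baseline):
    """Unweighted mean of the per-filter percentage reductions."""
    reductions = [100.0 * (b - o) / b for o, b in zip(ours, baseline)]
    return sum(reductions) / len(reductions)


averages = {
    resource: average_reduction(add_shift[resource], pda[resource])
    for resource in ("slices", "luts", "ffs")
}
for resource, value in averages.items():
    print(f"{resource}: {value:.1f}% average reduction")
```

With this definition the LUT average comes out at roughly 58.7%, and the slice and FF averages land in the mid twenties, consistent with the figures quoted in the text.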
VI. CONCLUSION

In this paper we presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low area, low power and high speed implementations of FIR filters. We validated our techniques on Virtex II™ devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques. In the future, we would like to modify our algorithm to make use of the limited number of embedded multipliers available on the FPGA devices.

VII. REFERENCES

[1] K. D. Underwood and K. S. Hemmert, "Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance," presented at International Symposium on Field-Programmable Custom Computing Machines, California, USA, 2004.
[2] L. Zhuo and V. K. Prasanna, "Sparse Matrix-Vector Multiplication on FPGAs," presented at International Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, 2005.
[3] Y. Meng, A. P. Brown, R. A. Iltis, T. Sherwood, H. Lee, and R. Kastner, "MP Core: Algorithm and Design Techniques for Efficient Channel Estimation in Wireless Applications," presented at Design Automation Conference (DAC), Anaheim, CA, 2005.
[4] B. L. Hutchings and B. E. Nelson, "Gigaop DSP on FPGA," presented at IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), 2001.
[5] A. Alsolaim, J. Becker, M. Glesner, and J. Starzyk, "Architecture and Application of a Dynamically Reconfigurable Hardware Array for Future Mobile Communication Systems," presented at International Symposium on Field Programmable Custom Computing Machines (FCCM), 2000.
[6] S. J. Melnikoff, S. F. Quigley, and M. J. Russell, "Implementing a Simple Continuous Speech Recognition System on an FPGA," presented at International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002.
[7] T. Yokota, M. Nagafuchi, Y. Mekada, T. Yoshinaga, K. Ootsu, and T. Baba, "A Scalable FPGA-based Custom Computing Machine for Medical Image Processing," presented at International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2002.
[8] K. Chapman, "Constant Coefficient Multipliers for the XC4000E," Xilinx Technical Report, 1996.
[9] K. Wiatr and E. Jamro, "Constant coefficient multiplication in FPGA structures," presented at 26th Euromicro Conference, 2000.
[10] M. J. Wirthlin and B. McMurtrey, "Efficient Constant Coefficient Multiplication Using Advanced FPGA Architectures," presented at International Conference on Field Programmable Logic and Applications (FPL), 2001.
[11] M. J. Wirthlin, "Constant Coefficient Multiplication Using Look-Up Tables," Journal of VLSI Signal Processing, vol. 36, pp. 7-15, 2004.
[12] "Distributed Arithmetic FIR Filter v9.0," Xilinx Product Specification, 2004.
[13] T. Sasao, Y. Iguchi, and T. Suzuki, "On LUT Cascade Realizations of FIR Filters," presented at Euromicro Conference on Digital System Design (DSD), 2005.
[14] G. R. Goslin, "A Guide to Using Field Programmable Gate Arrays (FPGAs) for Application-Specific Digital Signal Processing Performance," Xilinx Application Note, San Jose, 1995.
[15] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination," IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 1996.
[16] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 43, pp. 677-688, 1996.
[17] H. T. Nguyen and A. Chatterjee, "Number-splitting with shift-and-add decomposition for power and hardware optimization in linear DSP synthesis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, pp. 419-424, 2000.
[18] H.-J. Kang, H. Kim, and I.-C. Park, "FIR filter synthesis algorithms for minimizing the delay and the number of adders," presented at IEEE/ACM International Conference on Computer Aided Design (ICCAD-2000), 2000.
[19] A. Hosangadi, F. Fallah, and R. Kastner, "Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two Term Common Subexpressions," presented at Asia South Pacific Design Automation Conference, Shanghai, 2005.
[20] M. Yamada and A. Nishihara, "High-speed FIR digital filter with CSD coefficients implemented on FPGA," presented at Asia and South Pacific Design Automation Conference (ASP-DAC 2001), 2001.
[21] H. Safiri, M. Ahmadi, G. A. Jullien, and W. C. Miller, "A new algorithm for the elimination of common subexpressions in hardware implementation of digital filters by using genetic programming," presented at IEEE International Conference on Application-Specific Systems, Architectures, and Processors, 2000.
[22] A. Hosangadi, F. Fallah, and R. Kastner, "Reducing Hardware complexity by iteratively eliminating two term common subexpressions," presented at Asia South Pacific Design Automation Conference (ASP-DAC), 2005.



Figure 1. A MAC FIR filter block diagram

1-4244-9707-X/06/$20.00 2006 IEEE

Figure 2. A serial DA FIR filter block diagram

Address: 0000 0001 0010 1111

Data: 0 C0 C0+C1 C0+C1+C2+C3

Figure 3. A 2 bit parallel DA FIR filter block diagram

x0[i] x1[i] x2[i] x3[i] x4[i] x5[i] x6[i] x7[i]

Figure 4. Replacing constant multiplication by multiplier block

Figure 5. Registered adder at no additional cost

Figure 6. Unoptimized expression trees

Figure 7. Extracting common expression (A + B + C)

Figure 8. Extracting common subexpression (A+B)

Figure 9. Calculating registers required for fastest evaluation

IV. OPTIMIZATION ALGORITHM

Figure 10. Optimization algorithm to reduce area

Figure 11. Reduction in resources

Figure 12. Power consumption
