VLSI Synthesis of MAC Structures Using Distributed Arithmetic - IITCEE 27-28-01 - 2023
M. Bharathi
Research Scholar, VTU & Assistant Professor,
Department of ECE, School of Engineering and Technology,
Mohan Babu University, Tirupati, Chittoor District, Andhra Pradesh, India

Dr. Yasha Jyothi M Shirur
Professor, Department of ECE,
BNM Institute of Technology, Bangalore, VTU

Email Id: [email protected], [email protected], [email protected], [email protected]
Abstract: New consumer devices rely heavily on readily available digital signal processors (DSPs). Every DSP core now includes a multiply-and-accumulate (MAC) unit, which serves as a key building block and offers a benchmark for evaluating a core for use in a product or application. Various MAC cores that depend on signal control in the datapath are presented in this study. Because they use a Harvard architecture, the majority of DSP processors can fetch samples and coefficients quickly enough for real-time applications. Distributed arithmetic (DA) is a technique that uses shifters and adders to compute the inner (dot) product of two signals, one fixed and one variable. This paper shows how the input parameters in the datapath affect different MAC cores. The proposed design is implemented for a 16-bit MAC structure using the DA approach in four configurations: 1BAAT single LUT, 2BAAT single LUT, 1BAAT two LUT, and 2BAAT two LUT. The results of the proposed MAC structures are then compared with a traditional DA structure of equal length. The performance evaluation shows that 2BAAT with two LUTs is better than 1BAAT with a single LUT, with a reduction in dynamic power of 39.93%. Xilinx Vivado 2019.1 is used to simulate and synthesize these designs.

Keywords: Distributed Arithmetic (DA), 1BAAT (One Bit at a Time), 2BAAT (Two Bit at a Time), LUT (Look-Up Table), General Purpose Processor (GPP)

• INTRODUCTION

Off-line processing typically records the input signal, saves the data in memory, and processes the signal later. In contrast, real-time processing requires that the output signal be generated while the input signal is being recorded. To avoid missing or losing any unprocessed data, each signal should be processed as soon as fresh samples are received. While this is difficult to achieve in GPPs, it can be accomplished with a DSP core, which allows extremely high processing speeds. The ability to access multiple memories in a single clock cycle is a key distinction between the architectures of general-purpose processors (GPPs) and DSP processors. Any DSP processor [3] has two independent memories, a program memory and a data memory. These are built on the Harvard architecture, which has two separate buses, one for program memory and another for data memory. Because both buses can be accessed at the same time, execution time is shorter than on a von Neumann processor. As a result, the processor is able to fetch an instruction, fetch operands, and execute a previous instruction all at once. With multiport memories or multiple independent data memories, it is possible to fetch multiple data and program words in one clock cycle. With advances in IC technology and DSP algorithms, the performance metrics can be improved further.
978-1-6654-9260-7/23/$31.00 ©2023 IEEE
Hardware dedicated to multiply-accumulate (MAC) operations is the most important component of a DSP processor. When MAC operations are used to compute a sum of products, two operands are multiplied and the result is added to (or subtracted from) the cumulative sum. Most digital signal processing today relies on MAC units. The MAC unit performs addition and multiplication [7] and works in two stages. First, the multiplier computes the product of the given numbers, and the result is sent to the second stage, the addition/accumulation operation. Multiplier speed is important in the MAC unit because it determines the critical path, and range also has a large impact on MAC design. MAC operations are widely used in real-time DSP applications such as vector products, digital filters, correlations, and Fourier transforms.

• EXISTING MAC

Dedicated hardware for multiply-accumulate (MAC) operations is the feature that sets DSP processors apart. The main goal of the MAC is speed; a fast MAC reduces latency and consumes less power. The major block here is the multiplier. Actual MAC implementations [5] use different sorts of multipliers, such as Dadda, Braun, array, Vedic, and Wallace, and are compared on parameters such as delay, speed, and area. For n-bit inputs, the size of the MAC should be 2n+m bits, where 2n is the width of the multiplication result and m is the number of guard bits. This value can be saturated to a 2n-bit value and then rounded to the native data width n, so that it can be stored in memory or used in other operations.

The significance of this work is as follows. In DSPs, the execution time for performing the same function can be reduced in two ways:
1) Enhancing the multiplier and adder is one way to increase the MAC's efficiency. This has been done by several researchers.
2) The second method is to use a lookup table (LUT).

Distributed arithmetic (DA) is a multiplier-less approach that uses a LUT block instead of a multiplier; however, it deals specifically with sums of products. The step-by-step method of the DA algorithm is as follows:

Figure 1: Algorithm of Distributed Arithmetic

Figure 1 explains the step-by-step procedure of the DA algorithm. In step 1, it takes the values of the inputs (a0, a1, a2, a3, a4, a5) and the address bits (b0, b1, b2, b3, b4, b5). In step 2, precomputed values are generated in the form of sums of products by multiplying the inputs by the address bits. Step 3 performs the addition, followed by accumulation in step 4.

The exponential increase in LUT size is the drawback of DA. It can be overcome using the Offset Binary Coding technique, which is not covered in this paper.

Dedicated Distributed Arithmetic Architecture:
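The four steps described for Figure 1 can be illustrated with a small software model. This is only a minimal sketch under stated assumptions, not the authors' hardware: the samples are assumed to be unsigned, the bits are processed LSB first, and the names `build_lut`, `da_inner_product`, `coeffs` and `samples` are hypothetical. Here `coeffs` plays the role of the fixed inputs a0, a1, ... and the address is formed from one bit of every variable sample.

```python
def build_lut(coeffs):
    """Step 2: precompute every partial sum of the fixed inputs.

    Entry `addr` holds the sum of coeffs[k] for every bit k that is set in addr,
    so the table has 2**len(coeffs) rows.
    """
    lut = []
    for addr in range(1 << len(coeffs)):
        lut.append(sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1))
    return lut


def da_inner_product(coeffs, samples, width):
    """1BAAT DA: consume one bit of every (unsigned) sample per cycle."""
    lut = build_lut(coeffs)
    acc = 0
    for j in range(width):
        # Step 1: gather bit j of each sample into one LUT address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> j) & 1) << k
        # Steps 3 and 4: weight the looked-up sum by 2**j and accumulate.
        acc += lut[addr] << j
    return acc


coeffs = [3, 5, 7, 2]    # fixed inputs (illustrative values)
samples = [9, 4, 15, 1]  # variable inputs, each 4 bits wide
result = da_inner_product(coeffs, samples, width=4)

# The multiplier-free result matches the direct sum of products.
assert result == sum(c * x for c, x in zip(coeffs, samples))
```

No multiplication is performed inside `da_inner_product`; the cost has moved into the table, whose row count doubles with every additional term.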
International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE) 149
Mathematical Calculations of 1BAAT Based on DA:
Consider
We can express as
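For reference, the standard unsigned DA decomposition can be sketched as follows. The notation is an assumption and may differ from the authors' own: $a_k$ are the fixed inputs, $x_k$ the variable samples, $b_{k,j}$ the $j$-th bit of $x_k$, $N$ the number of terms, and $W$ the word length.

```latex
\begin{align*}
y &= \sum_{k=0}^{N-1} a_k x_k,
\qquad x_k = \sum_{j=0}^{W-1} b_{k,j}\, 2^{j}, \quad b_{k,j} \in \{0, 1\} \\
y &= \sum_{k=0}^{N-1} a_k \sum_{j=0}^{W-1} b_{k,j}\, 2^{j}
   = \sum_{j=0}^{W-1} 2^{j} \left( \sum_{k=0}^{N-1} a_k\, b_{k,j} \right)
\end{align*}
```

The inner sum can take only $2^N$ distinct values, one for each address $(b_{0,j}, \dots, b_{N-1,j})$, so it can be read from a precomputed LUT, while the outer sum becomes a shift-and-accumulate over the bit positions. In 1BAAT operation, one bit position $j$ is processed per cycle.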
Figure 2: The baseline implementation of the LUT requires 2^N rows of LUT sections to compute N-term inner products. For example, when N = 4 terms, the baseline implementation requires 16 LUT rows. This is how the inner-product computation of a DA-based MAC operation is fully implemented.

Figure 3: Design of 1BAAT with Two LUT based DA

Figure 3 conveys that although splitting reduces the LUT size, each cycle must now include an extra addition, because the inner product is computed as the sum of two half-length inner products. The LUT area of the single-LUT DA implementation is also significantly larger than that of the two-LUT implementation. The DA-based implementation with 2-bank splitting requires two LUTs.

Figure 5: Design of 2BAAT with Two LUT based DA

Still, speed can be increased by passing the data in parallel [1]. This work mainly discusses 1BAAT and 2BAAT, which can be incorporated in a LUT-less DA-based MAC core for high-speed applications such as video, image, graphics, and medical image processing. Let's see the formulation of the 1BAAT and 2BAAT DA-based structures.
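The two-LUT split of Figure 3 and the 1BAAT/2BAAT schemes mentioned above can be compared in a small software model. This is a hedged sketch, not the synthesized design: samples are assumed unsigned, 2BAAT is modelled as a LUT addressed by two bits from each sample, and the names `build_lut`, `address` and `da_mac` are hypothetical.

```python
def build_lut(coeffs, bits_per_step):
    """LUT addressed by `bits_per_step` bits from each sample.

    Sample k contributes a digit in 0 .. 2**bits_per_step - 1, and the entry
    stores the sum of coeffs[k] * digit_k over all k.
    """
    mask = (1 << bits_per_step) - 1
    lut = []
    for addr in range(1 << (bits_per_step * len(coeffs))):
        total = 0
        for k, c in enumerate(coeffs):
            total += c * ((addr >> (k * bits_per_step)) & mask)
        lut.append(total)
    return lut


def address(samples, j, bits_per_step):
    """Pack the digit starting at bit j of every sample into one LUT address."""
    mask = (1 << bits_per_step) - 1
    addr = 0
    for k, x in enumerate(samples):
        addr |= ((x >> j) & mask) << (k * bits_per_step)
    return addr


def da_mac(coeffs, samples, width, bits_per_step=1, banks=1):
    """Unsigned DA inner product; returns (result, total LUT rows).

    banks=2 splits the terms into two half-length inner products, each with
    its own LUT, at the cost of one extra addition per cycle.
    """
    n = len(coeffs)
    bounds = [0, n] if banks == 1 else [0, n // 2, n]
    parts = [(coeffs[lo:hi], samples[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]
    luts = [build_lut(c, bits_per_step) for c, _ in parts]
    acc = 0
    for j in range(0, width, bits_per_step):
        # One lookup per bank, then the addition that combines the banks.
        partial = sum(lut[address(s, j, bits_per_step)]
                      for lut, (_, s) in zip(luts, parts))
        acc += partial << j
    return acc, sum(len(lut) for lut in luts)


coeffs = [3, 5, 7, 2]    # fixed inputs (illustrative values)
samples = [9, 4, 15, 1]  # variable inputs, each 4 bits wide
for bits_per_step in (1, 2):
    for banks in (1, 2):
        value, rows = da_mac(coeffs, samples, 4, bits_per_step, banks)
        print(f"{bits_per_step}BAAT, {banks} LUT(s): result={value}, rows={rows}")
```

All four configurations return the same inner product; they differ in the number of LUT rows and in the number of cycles (2BAAT needs half as many iterations of the loop). In this model, a two-input 2BAAT table holds 3(a1+a2) at address 1111, since both two-bit digits equal 3.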
Mathematical Calculations of 2BAAT Based on DA

For example:
Address bit is 31(1111)
1 1 1 1    3(a1+a2)
Substitute address bits in equation (1)

• RESULTS & DISCUSSION

The figures below show the area utilization of the 1BAAT- and 2BAAT-based MAC structures with single-LUT and two-LUT DA.
Figure 7: Power Utilization of 1BAAT & 2BAAT based MAC cores.

Distributed Arithmetic is essentially a LUT-based structure. As opposed to the traditional single-LUT DA structure, the LUT can be split to optimise the performance parameters with varied weights. Additionally, performance is improved by driving the input patterns simultaneously after a single LUT has been divided into several LUTs with various weights, depending on the assignment of the input signal.

CONCLUSION

The MAC core, a fundamental building block of a DSP processor [4], is the most efficient way to compute the product terms of a given sequence. The use of multipliers considerably extends the time required for conventional computing, resulting in slower speed and a longer output delay. This places restrictions on the performance of DSP processors, so we use a DA-based approach with a multiplier-free implementation. In this instance, precomputed lookup tables are employed in place of multiplication. The data are stored in bit-serial format. Figures 6 and 7 depict bar graphs for the proposed architectures, which show that 2BAAT DA with two LUTs has a 61.48% area savings over 1BAAT DA with a single LUT. Similarly, compared to the traditional structure, dynamic and static power are reduced by 4.662% and 0.39%, respectively. These static and dynamic power figures can be used to compute the power of the proposed MAC relative to the current MAC core.

REFERENCES

1. B. K. Mohanty and P. K. Meher, "An Efficient Parallel DA-Based Fixed-Width Design for Approximate Inner-Product Computation," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 5, pp. 1221-1229, May 2020, doi: 10.1109/TVLSI.2020.2972772.
2. D. Ray, N. V. George and P. K. Meher, "An Analytical Framework and Approximation Strategy for Efficient Implementation of Distributed Arithmetic-Based Inner-Product Architectures," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 1, pp. 212-224, Jan. 2020, doi: 10.1109/TCSI.2019.2948791.
3. D. Lingaiah, "VLSI Synthesis of DSP Kernels: Algorithmic and Architectural Transformations," in IEEE Circuits and Devices Magazine, vol. 19, no. 6, pp. 33-35, Nov. 2003, doi: 10.1109/MCD.2003.1263463.
4. M. Mehendale and S. D. Sherlekar, VLSI Synthesis of DSP Kernels: Algorithmic and Architectural Transformations. Kluwer Academic Publishers, USA, 2001.
5. B. Pisupati, M. Naresh, N. Koppala and J. Krishna, "Design of step-up inexact MAC (IMAC) unit for DSP applications," International Journal of Recent Technology and Engineering, vol. 7, pp. 360-364, 2019.
6. N. S. and J. E. P., "An Efficient Modified Distributed Arithmetic Architecture Suitable for FIR Filter," 2021 Sixth International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), 2021, pp. 89-93, doi: 10.1109/WiSPNET51692.2021.9419365.
7. M. Dharani, P. A. Kumar, T. Venkatakrishnamoorthy, N. Bharghavi and B. A. Kumar, "High level montgomery modular multiplier for CSA architecture," Materials Today: Proceedings, 2020.