VLSI Synthesis of MAC Structures Using Distributed Arithmetic - IITCEE 27-28-01 - 2023

This document discusses the design of multiply and accumulate (MAC) structures using distributed arithmetic. MAC units are important components in digital signal processors that perform multiplication and addition. Distributed arithmetic is a technique that can implement MAC operations using lookup tables (LUTs) instead of multipliers, improving speed. The document proposes designs for 1-bit-at-a-time (1BAAT) and 2-bits-at-a-time (2BAAT) MAC structures using single and double LUTs. It analyzes the performance of these designs and finds that 2BAAT with two LUTs has better performance than 1BAAT with a single LUT, reducing dynamic power by 39.93%.


VLSI Synthesis of Multiply and Accumulate Structures Using Distributed Arithmetic

M. Bharathi
Research Scholar, VTU & Assistant Professor,
Department of ECE,
School of Engineering and Technology,
Mohan Babu University, Tirupati, Chittoor District,
Andhra Pradesh, India

Dr. Yasha Jyothi M Shirur
Professor,
Department of ECE,
BNM Institute of Technology,
Bangalore, VTU

Email Id: [email protected], [email protected], [email protected], [email protected]

Abstract— New consumer devices rely heavily on readily available digital signal processors. Every DSP core now includes a Multiply and Accumulate (MAC) unit, which serves as a key building block and offers a basis for evaluating a core for use in a product or application. This study presents various MAC cores that depend on signal control in the datapath. Because they use the Harvard architecture, the majority of DSP processors can fetch samples and coefficients quickly enough for real-time applications. Distributed arithmetic (DA) is a technique that computes the inner (dot) product between two signals, one fixed and one variable, using shifters and adders. This paper shows how the input parameters in the datapath affect different MAC cores. The proposed design is implemented for a 16-bit MAC structure using the DA approach in four configurations: 1BAAT single LUT, 2BAAT single LUT, 1BAAT two LUT, and 2BAAT two LUT. The outcomes of the proposed MAC structures are then compared with a traditional DA structure of equal length. The performance evaluation shows that 2BAAT with two LUTs performs better than 1BAAT with a single LUT, reducing dynamic power by 39.93%. Xilinx Vivado 2019.1 is used for simulation and synthesis of these designs.

Keywords— Distributed Arithmetic (DA), 1BAAT (One Bit at a Time), 2BAAT (Two Bits at a Time), LUT (Look-Up Table), General Purpose Processor (GPP)

• INTRODUCTION
Off-line processing typically records the input signal, saves the data in memory, and processes it later. In contrast, real-time processing requires that the output signal be generated while the input signal is being recorded. To avoid missing or losing any unprocessed data, each signal should be processed completely as soon as fresh samples are received. While this is difficult to achieve in GPPs, it can be accomplished with a DSP core, which allows extremely high processing speeds. The ability to access multiple memories in a single clock cycle is a key distinction between the architectures of general-purpose processors (GPPs) and DSP processors. Program and data memories are two independent memories present in any DSP processor [3]. These are built on the Harvard architecture, which has separate buses for program memory and data memory. Because both buses can be accessed at the same time, execution time is lower than on a von Neumann processor. As a result, the processor is able to fetch an instruction, fetch operands, and execute a previous instruction all at once. With multiport memories or multiple independent data memories, it is possible to fetch multiple data items and program words in one clock cycle. With advances in IC technology and DSP algorithms, these performance metrics can be improved further.

978-1-6654-9260-7/23/$31.00 ©2023 IEEE
Hardware dedicated to multiply-accumulate (MAC) operations is the most important component of a DSP processor. When MAC operations are used to compute a sum of products, two operands are multiplied and the result is added to (or subtracted from) the cumulative sum. Most digital signal processing today relies on MAC units. The MAC unit performs both multiplication [7] and addition, and works in two stages. First, the multiplier computes the product of the given operands; the result is then sent to the second stage, the addition/accumulation operation. Multiplier speed is important because it determines the critical path of the MAC unit, and the operand range also has a large impact on MAC design. MAC operations are widely used in real-time DSP applications such as vector products, digital filters, correlations, and Fourier transforms.

• EXISTING MAC
Dedicated hardware for MAC operations is the feature that sets DSP processors apart. The main goal of a MAC unit is speed, which reduces latency and power consumption. The major block is the multiplier. An actual MAC [5] can be implemented with different kinds of multipliers, such as Dadda, Braun, Array, Vedic, and Wallace, and these can be compared on parameters such as delay, speed, and area. For n-bit inputs, the size of the MAC accumulator should be 2n+m bits, where 2n is the width of the product of two n-bit operands and m is the number of guard bits. This value can be saturated to a 2n-bit value and then rounded to the native data width n, so that it can be stored in memory or used in other operations.

The significance of this work is as follows. In DSPs, the execution time of the same function can be reduced in two ways:
1) Enhancing the multiplier and adder is one way to increase the MAC's efficiency. This has been done by several researchers.
2) The second method is to use the lookup-table method.

Algorithm of Distributed Arithmetic

In Distributed Arithmetic [2], precomputed values are stored in a LUT. This requires a significant amount of storage, but speed is improved, which makes real-time signal processing possible, so Distributed Arithmetic [6] can be used in real-world applications. It is a multiplier-less approach that uses a LUT block instead of a multiplier and is specifically suited to computing sums of products. The step-by-step method of the DA algorithm is as follows:

Figure 1: Algorithm of Distributed Arithmetic

Figure 1 explains the step-by-step procedure of the DA algorithm. In Step 1, it takes the input values (a0, a1, a2, a3, a4, a5) and the address bits (b0, b1, b2, b3, b4, b5). In Step 2, precomputed values are generated as sums of products of the inputs and the address bits. Step 3 performs the addition, followed by accumulation in Step 4.

The drawback of DA is the exponential increase in LUT size. This can be overcome using the Offset Binary Coding technique, which is not covered in this paper.

Dedicated Distributed Arithmetic Architecture:

The Distributed Arithmetic architecture has three parts: the input data section, the LUT section, and the accumulator section, as shown in Figure 2.
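As a baseline for the DA structures that follow, the conventional two-stage MAC datapath described under EXISTING MAC (2n-bit product, m guard bits, saturation to 2n bits, rounding to n bits) might be modeled as below. This is only a behavioral sketch: the function names `saturate` and `mac`, the default widths, and the rounding rule (add half an LSB and keep the top n bits) are assumptions for illustration and are not taken from the paper.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp value to the signed two's-complement range of `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mac(a, b, n=16, m=8):
    """Multiply-accumulate two sequences of n-bit signed integers.

    Returns (acc, out): the raw accumulator value and the n-bit result.
    """
    acc = 0
    for x, y in zip(a, b):
        product = x * y  # stage 1: product of two n-bit operands fits in 2n bits
        acc += product   # stage 2: accumulate into the (2n+m)-bit register
    # Assumption: m guard bits are enough for this number of terms.
    assert saturate(acc, 2 * n + m) == acc, "accumulator overflow"
    wide = saturate(acc, 2 * n)             # saturate to a 2n-bit value
    rounded = (wide + (1 << (n - 1))) >> n  # assumed rounding: add half an LSB, keep top n bits
    return acc, saturate(rounded, n)        # final clamp in case rounding overflowed
```

For example, `mac([100, 100, 100], [100, 100, 100], n=8, m=4)` accumulates 30000, which fits in 16 bits and rounds to the 8-bit value 117. The multiplier in stage 1 is the block that the DA approach replaces with a lookup table.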

International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE)
Figure 2: Block Diagram of Distributed Arithmetic based MAC

The data section has two inputs: one is the input values (constants) and the other is the address bits (known values). The LUT section is constructed according to the number of address bits: for K address bits, a ROM of $2^K$ words is needed. With K = 6, this gives $2^6 = 64$ words.

• PROPOSED DESIGN OF 1BAAT & 2BAAT MAC STRUCTURES USING DISTRIBUTED ARITHMETIC

Distributed arithmetic (DA) is one way to implement the MAC operation without using multipliers. DA [1] is a technique that computes the inner (dot) product between two signals, one fixed and one variable, using shifters and adders. DA processes values in a bit-serial format that is well suited to real-time processing [4].

Distributed Arithmetic (DA) with a numerical example

$y = a_0 b_{0n} + a_1 b_{1n} + a_2 b_{2n} + a_3 b_{3n} + a_4 b_{4n} + a_5 b_{5n}$

For example, if the fixed coefficients a0, a1, a2, a3, a4, a5 are 31, 30, 29, 28, 27, 26 and ADDR = 111111 (decimal 63), all six coefficients are selected, so y = (a0 + a1 + a2 + a3 + a4 + a5) = 31 + 30 + 29 + 28 + 27 + 26 = 171.

Mathematical Calculations of 1BAAT Based on DA:

Consider

$y = \sum_{k=0}^{K-1} a_k x_k$

Let $x_k$ be an N-bit scaled two's complement number. In other words,

$|x_k| < 1$

$x_k : \{b_{k0}, b_{k1}, \ldots, b_{k(N-1)}\}$

where $b_{k0}$ is the sign bit.

We can express $x_k$ as

$x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n}$

Substituting into $y$ and interchanging the order of summation gives

$y = -\sum_{k=0}^{K-1} a_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=0}^{K-1} a_k b_{kn} \right] 2^{-n}$

The bracketed term takes only $2^K$ distinct values, so it can be precomputed. With the sign bit as an input, we can store it in a ROM of size $2 \cdot 2^K$.

Figure 2: Design of 1BAAT with Single LUT based DA
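The numerical example and the 1BAAT formulation above can be checked with a short behavioral model. The sketch below is an illustration under stated assumptions, not the paper's RTL: it uses plain signed integers instead of scaled fractions (so bit weights are $2^n$ rather than $2^{-n}$), it maps address bit k to coefficient $a_k$, and the names `build_rom` and `da_1baat` are hypothetical.

```python
COEFFS = [31, 30, 29, 28, 27, 26]  # a0..a5 from the numerical example


def build_rom(coeffs):
    """ROM word at address `addr` = sum of the coefficients whose address bit is 1."""
    return [
        sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_1baat(coeffs, samples, nbits):
    """Bit-serial inner product of fixed coeffs with nbits-wide signed samples.

    Assumes every sample lies in [-2**(nbits-1), 2**(nbits-1)).
    """
    rom = build_rom(coeffs)
    acc = 0
    for n in range(nbits):                        # one bit position per cycle, LSB first
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> n) & 1) << k           # bit n of sample k drives address line k
        term = rom[addr] << n                     # weight the ROM word by 2^n
        acc += -term if n == nbits - 1 else term  # the sign-bit cycle subtracts
    return acc
```

With these coefficients, `build_rom(COEFFS)` has 64 words and the word at address 63 is 171, matching the example. For the samples `[1, -2, 3, -4, 5, -6]` with `nbits=4`, `da_1baat` returns the same value as the direct sum of products, without any multiplication of a coefficient by a sample.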

Figure 3: Design of 1BAAT with Two LUT based DA

As Figure 2 shows, the baseline implementation of the LUT requires $2^N$ rows of LUT sections for the computation of an N-term inner product. For example, when N = 4 terms, the baseline implementation requires 16 rows of LUT block. This is how the inner-product computation of a DA-based MAC operation is fully implemented. Figure 3 shows that the LUT size can be cut in half, although each cycle must then include an extra addition, because an inner product can be computed as the sum of two half-length inner products. The single-LUT DA implementation therefore occupies significantly more LUT area than the two-LUT implementation. The DA-based implementation with 2-bank splitting requires two LUTs.

Figure 5: Design of 2BAAT with Two LUT based DA

Speed can be increased further by passing the data in parallel [1]. This approach mainly discusses 1BAAT and 2BAAT, which can be incorporated in a LUT-less DA-based MAC core for high-speed applications such as video, image, graphics, and medical image processing. Let us look at the formulation of 1BAAT and 2BAAT DA-based structures.
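To make the four structures concrete (1BAAT or 2BAAT, with one or two LUTs), the sketch below models only their arithmetic. It is a minimal illustration under assumptions: samples are unsigned, the two-LUT case splits the coefficients into two equal banks, and `build_lut` and `da_mac` are hypothetical names. The word and cycle counts it reports belong to this model and say nothing about the synthesized area or power figures.

```python
def build_lut(coeffs, r):
    """LUT addressed by r bits from each input; each word is sum(coeff * r-bit digit)."""
    mask = (1 << r) - 1
    return [
        sum(a * ((addr >> (r * k)) & mask) for k, a in enumerate(coeffs))
        for addr in range(1 << (r * len(coeffs)))
    ]


def da_mac(coeffs, samples, nbits, r=1, banks=1):
    """Inner product with r bits per cycle (1BAAT or 2BAAT) and 1 or 2 LUT banks.

    Returns (result, total LUT words, cycles). Samples are unsigned nbits-wide.
    """
    assert nbits % r == 0 and len(coeffs) % banks == 0
    size = len(coeffs) // banks
    mask = (1 << r) - 1
    groups = [range(b * size, (b + 1) * size) for b in range(banks)]
    luts = [build_lut([coeffs[k] for k in g], r) for g in groups]
    acc = 0
    for n in range(0, nbits, r):          # r bit positions are consumed per cycle
        partial = 0
        for lut, g in zip(luts, groups):
            addr = 0
            for j, k in enumerate(g):
                addr |= ((samples[k] >> n) & mask) << (r * j)
            partial += lut[addr]          # with two banks this is the extra addition
        acc += partial << n               # shift-accumulate with weight 2^n
    return acc, sum(len(lut) for lut in luts), nbits // r
```

In this model, two address bits per input set to 11 select a digit of 3, so `build_lut([a1, a2], 2)[0b1111]` equals 3(a1 + a2). Calling `da_mac` with `r=2, banks=2` on six coefficients halves the cycle count relative to `r=1` and uses two 64-word tables instead of one 4096-word table, at the cost of one extra addition per cycle.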
Mathematical Calculations of 2BAAT Based on DA

For example, when the address bits are 1111 (decimal 15):

1 1 1 1 → 3(a1 + a2)

Substitute the address bits in equation (1).

Figure 4: Design of 2BAAT with Single LUT based DA

• RESULTS & DISCUSSION

The figures below show the area utilization of the 1BAAT and 2BAAT based MAC structures with single-LUT and two-LUT based DA.

Figure 6: Area Utilization of 1BAAT & 2BAAT based MAC cores.

Figure 7: Power Utilization of 1BAAT & 2BAAT based MAC cores.

Distributed Arithmetic is essentially a LUT-based structure. In contrast to the traditional single-LUT DA structure, the LUT can be split to optimise the performance parameters with varied weights. Performance is further improved by driving the input patterns simultaneously once a single LUT has been divided into several LUTs with different weights, depending on the assignment of the input signal.

CONCLUSION

The MAC core, a fundamental building block of a DSP processor [4], is the most efficient way to compute the product terms of a given sequence. The use of multipliers considerably extends the time required in conventional computation, resulting in lower speed and a longer output delay, which restricts the performance of DSP processors. A DA-based approach with a multiplier-free implementation is therefore used, in which pre-calculated lookup tables replace multiplication and the data is stored in bit-serial format. Figures 6 and 7 show bar graphs for the proposed architectures, which indicate that 2BAAT DA with two LUTs achieves a 61.48% area saving over 1BAAT DA with a single LUT. Similarly, compared to the traditional structure, dynamic and static power are reduced by 4.662% and 0.39%, respectively. These static and dynamic power figures can be used to relate the power of the proposed MAC to that of the existing MAC core.

REFERENCES

1. B. K. Mohanty and P. K. Meher, "An Efficient Parallel DA-Based Fixed-Width Design for Approximate Inner-Product Computation," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 5, pp. 1221-1229, May 2020, doi: 10.1109/TVLSI.2020.2972772.
2. D. Ray, N. V. George and P. K. Meher, "An Analytical Framework and Approximation Strategy for Efficient Implementation of Distributed Arithmetic-Based Inner-Product Architectures," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 1, pp. 212-224, Jan. 2020, doi: 10.1109/TCSI.2019.2948791.
3. D. Lingaiah, "VLSI Synthesis of DSP Kernels: Algorithmic and Architectural Transformations," in IEEE Circuits and Devices Magazine, vol. 19, no. 6, pp. 33-35, Nov. 2003, doi: 10.1109/MCD.2003.1263463.
4. Mahesh Mehendale and Sunil D. Sherlekar. 2001. VLSI Synthesis of DSP Kernels: Algorithmic and Architectural Transformations. Kluwer Academic Publishers, USA.
5. Pisupati, Bharadwaja, Naresh, M, Koppala, Neelima & Krishna, J. (2019). Design of step-up inexact MAC (IMAC) unit for DSP applications. International Journal of Recent Technology and Engineering. 7. 360-364.
6. N. S. and J. E. P., "An Efficient Modified Distributed Arithmetic Architecture Suitable for FIR Filter," 2021 Sixth International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), 2021, pp. 89-93, doi: 10.1109/WiSPNET51692.2021.9419365.
7. Dharani, M., Kumar, P. A., Venkatakrishnamoorthy, T., Bharghavi, N., & Kumar, B. A. (2020). High level montgomery modular multiplier for CSA architecture. Materials Today: Proceedings.
