Asynchronous Wrapper Based Low Power GALS Structural QDMA

The document discusses an asynchronous wrapper-based low-power globally asynchronous locally synchronous (GALS) structural queue direct memory access (QDMA). It proposes using asynchronous wrappers with locally synchronous modules that communicate through asynchronous finite state machines defined by signal transition graphs. This GALS architecture implemented for point-to-point interfaces can be modified for multipoint interfaces. The QDMA subsystem is implemented using a simulator and evaluated based on latency, power, and gate count, showing improvements over previous methods.


IETE Journal of Research

ISSN: (Print) (Online) Journal homepage: https://www.tandfonline.com/loi/tijr20

Asynchronous Wrapper-Based Low-Power GALS Structural QDMA

B.K. Vinay, S. Pushpa Mala & S. Deekshitha

To cite this article: B.K. Vinay, S. Pushpa Mala & S. Deekshitha (2022): Asynchronous Wrapper-Based Low-Power GALS Structural QDMA, IETE Journal of Research, DOI: 10.1080/03772063.2021.2021814

To link to this article: https://doi.org/10.1080/03772063.2021.2021814

Published online: 24 Jan 2022.


Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=tijr20

Asynchronous Wrapper-Based Low-Power GALS Structural QDMA

B.K. Vinay¹,², S. Pushpa Mala² and S. Deekshitha³

¹ Department of Electronics and Communication Engineering, Vidyavardhaka College of Engineering, Mysuru 570002, India; ² Department of Electronics and Communication, Dayananda Sagar University, Bangalore, Karnataka 560114, India; ³ CMR Institute of Technology, Bangalore, Karnataka 560037, India

ABSTRACT
The design of System-on-Chip (SoC) systems using synchronous circuits involves complex clock distribution strategies, which pose challenges for designers integrating large-scale systems. Globally Asynchronous Locally Synchronous (GALS) architectures containing asynchronous port controllers encapsulated in a self-timed wrapper have been adopted in this work. These port controllers communicate through Asynchronous Finite State Machines (AFSMs), which are defined by Signal Transition Graphs (STGs) and implemented using the C element. This GALS architecture, implemented for the point-to-point interface, can also be modified for the multipoint interface. The proposed methodology uses a two-phase handshake protocol to communicate between two Locally Synchronous (LS) modules, as it has fewer signal transitions, which, in turn, reduces latency. In this paper, the Queue Direct Memory Access (QDMA) subsystem is implemented using the Vivado simulator on an UltraScale+™ device at a maximum frequency of 257.4 MHz, and various parameters are reported. A comparison shows that the proposed wrapper improves latency by 53% and reduces power dissipation by 27%, at the cost of a 13% increase in gate count.

KEYWORDS
Asynchronous wrapper; Globally asynchronous locally synchronous (GALS); Handshake protocol; Muller C element; Port controller; Signal transition graphs (STG); Synthesis logic

© 2022 IETE

1. INTRODUCTION

A System-on-Chip (SoC) integrates individual IPs, each with a specific functionality, onto a single platform. Synchronous circuits require optimization of clock distribution networks to attain reduced latency, which is a complex process [1]. Consequently, a methodology for implementing asynchronous designs has to be developed. Globally Asynchronous Locally Synchronous (GALS) architectures encompass asynchronous wrappers with locally synchronous (LS) modules. These LS modules communicate with each other using handshake protocols, which are inherently asynchronous [2]. An asynchronous wrapper employs three handshake processes: the first at the input port for receiving data, the second between the input port and the output port for generating the clock, and the third at the output port for transferring the data. This wrapper has an average power consumption of 1 mW for a data stream at 50 MHz [3].

Stretchable Clock Asynchronous Flexible FPGA Interfaces (SCAFFI) interconnect the LS modules to a Field Programmable Gate Array (FPGA) for GALS architectures. These architectures use arbiters to pause the LS modules' clock before the data are transferred. The clock is restarted once the data reach a stable state, a method known as clock stretching. Clock stretching eliminates metastability. The performance gain is limited due to increased modular multiplications during the execution phase [4]. A novel architecture for the asynchronous GALS wrapper has been proposed in which port controllers communicate with the asynchronous wrapper using a direct mapping style [5]. As the number of input and output ports for data communication increases, the wrapper becomes more complex and its area increases. Data transfer in SoCs with GALS systems uses multi-point interfaces, reducing latency and area [6]. The asynchronous wrapper (AW) realizes fault tolerance for autonomous modules and delay-insensitive (DI) designs, which do not require isochronic fork conditions to be met [7].

Design methodologies for GALS architectures include pausible clocks, asynchronous interfaces and locally synchronous modules. Pausible clocks avoid metastability by delaying the sampling clock until the data arrive. In asynchronous interface design styles, the signal received from the outer clock domain is transferred to the local clock domain by synchronizers [8]. LS design styles analyze time bounds, overcoming the need for handshaking for data transfer [9]. Signal Transition Graphs (STGs) represent the flow of positive and negative edges of the signals. In the proposed wrapper, a modified STG is
adopted to reduce the communication time between two LS modules. A latch is added between two LS modules to store data for efficient communication [10]. Furthermore, a gated clock-based interface for GALS has been suggested wherein the external clock is gated to drive the local clock of the LS modules based on the request from port controllers [11]. The GALS interface uses First-In First-Out (FIFO) buffers operating in asynchronous mode for data transfer between mixed-clock LS modules [12]. The latency involved in synchronization between two LS modules is reduced using a high-bandwidth communication scheme called the STARI-based GALS interface, which deploys a single-stage FIFO at the receiver with the advantage of clock stability [13]. Oliveira et al. [7] proposed a single-port controller for managing data communication in multipoint and point-to-point GALS for reduced area consumption. Stretchable clocks are realized to control the clock generator [14]. Asynchronous elements such as join and fork can be used: "join" merges various data signals and sends them to a GALS module, while "fork" sends data to various sinks [3].

Applications involving SoCs with multiple IPs integrated on a single chip are challenging to design due to advancements in the scale of integration. The complexity of an SoC circuit design escalates due to the constant reduction in feature size brought about by scaling. The design of an SoC circuit plays a vital role in increasing the performance of the system. Synchronous strategies for designing an SoC adopt a master clock for the synchronization of various data signals across the chip. These synchronous design strategies give rise to various challenges, such as clock skew and high dynamic power consumption at high frequencies. They also necessitate complex timing analysis that considers the capacitive loads and interconnect resistances of clock signals. Due to the complexity involved in synchronous design strategies, SoC applications have adopted asynchronous design strategies. Hence, GALS techniques are introduced for asynchronous designs to achieve the maximum performance of the SoC system. Process-voltage-temperature (PVT) variations are better tolerated by asynchronous circuits than by synchronous circuits, meeting the requirements for robust applications. Performance parameters corresponding to low power, high speed and reduction of electromagnetic interference are improved by opting for asynchronous design strategies over synchronous design strategies.

GALS techniques simplify timing analysis and shorten the time to market for an SoC circuit by reusing functional IP blocks. These structures adopt a modular approach by developing individual LS modules, which are integrated via port controllers with asynchronous logic developed using CAD tools. Furthermore, two-phase and four-phase handshake protocols are implemented by port controllers to initiate asynchronous communication between the sender and receiver LS modules. In the proposed methodology, a two-phase handshake protocol is adopted since it has fewer transitions and reduced latency compared to a four-phase handshake protocol. The communication between two LS modules can be point-to-point or point-to-multipoint. The AW encapsulates port controllers besides the LS modules. The port controllers, modeled by Asynchronous Finite State Machines (AFSMs) to provide asynchronous communication between LS modules, are made hazard-free by implementing them using STGs. The logical equations obtained from the STG are mapped into standard library cells using a 3D tool, and finally the gate-level netlist is generated. Point-to-point communication between AWs involves a single incoming and a single outgoing signal. The AWs can be generalized to multi-point GALS with multiple incoming and outgoing signals, which cannot be activated concurrently as arbiters are not used. Although point-to-multipoint GALS wrappers consume more area on the chip than point-to-point GALS wrappers, they eliminate redundancy to a greater extent. They coordinate the sending and receiving of data by activating LS modules accordingly through stretchable clocks. These stretchable clocks are chosen over pausible clocks in the wrapper design to address reduced-performance issues.

2. GALS INTERFACE: AN OVERVIEW

GALS modules adopt LS modules with their own clock generator and asynchronous port controllers encapsulated in a self-timed wrapper. The operation of an asynchronous port controller is modeled by AFSMs implemented through STGs. Since implementation through STGs promises hazard-free ports, the ports can be designed in burst-mode or extended burst-mode formats. The local clock generator is made tunable so that it can be stopped and its frequency adjusted to synchronize data transfer (Figure 1).

A new clock pulse is generated only if the requests from all ports to stop the clock are low. Metastability is resolved by receiving the requests from the ports through separate mutual exclusion (MuTex) elements.

Data communication between the sender and the receiver in asynchronous systems follows handshake protocols indicating data arrival and availability. Figure 2 [15]
represents two communication protocols, i.e. two-phase and four-phase handshake protocols, which communicate through request and acknowledge signals. In a two-phase protocol, Figure 2(b), a request signal is sent from the transmitting circuit to the receiving circuit, indicating the presence of data, and as the receiver circuit receives the data, the acknowledge signal undergoes a transition. In a four-phase protocol, Figure 2(a), the start of data transmission is indicated by the transmitter circuit through a transition on the request signal, and the receiver's acknowledgement is denoted by a transition on the acknowledge signal. This, in turn, causes the request signal to return to its initial state at the transmitting side. Furthermore, data are accepted by the receiver. The acknowledge signal is restored after the request signal is restored [15].

Figure 1: Asynchronous wrapper

Figure 2: Communication protocol: (a) Four-phase handshake. (b) Two-phase handshake

Return to Zero (RTZ) signaling is also known as a four-phase bundled-data protocol. The sender transfers data, which is indicated by setting the Request signal high. The receiver accepts the data, which is indicated by setting the Ack signal high. The response from the sender is indicated by a high-to-low transition on the Request signal (after which data validity is no longer guaranteed). Finally, a high-to-low transition on the Ack signal indicates an acknowledgement by the receiver. The sender may then initiate the next communication cycle. Although simplicity is its advantage, the return-to-zero transitions of this protocol consume more energy and time. If the time to process valid data (when the Request signal is high) and the time to process null data (when the Request signal is low) are equal, then the resultant data rate or throughput is reduced by a factor of
2. To overcome these disadvantages, a two-phase bundled-data protocol, also known as Non-Return to Zero (NRZ) signaling or transition signaling, can be used. Here, information on the Request and Ack signals is conveyed as signal transitions, and there is no difference between a 1→0 and a 0→1 transition. Four-phase protocols therefore involve more signal transitions than two-phase protocols during data transfer. As a result, two-phase protocols are chosen over four-phase protocols, and a comparatively lower latency is obtained.

3. PROPOSED DESIGN FLOW

GALS architectures comprise asynchronous wrappers containing LS modules and port controllers to handle communication between various LS modules. The proposed design flow is depicted in Figure 3. The IP functionality is defined using a Hardware Description Language in the Vivado Tool Suite at the RTL development stage. The IP integrator in Vivado interconnects various IP cores by instantiating them to build the final Queue Direct Memory Access (QDMA) module. Design verification is done using the Vivado Simulator to verify specific functionalities of the QDMA module. The synthesized netlist is used to analyze the design hierarchy and to ensure design optimization by eliminating redundant logic modules. The syntax is verified, and the resulting netlist is saved as a Native Generic Circuit (NGC) file.

Figure 3: Proposed design flow

Next, floor planning and place and route (PAR) are performed as part of the design implementation. Translate produces the design file, a combination of the relevant netlist with constraints, wherein the constraints assign the ports to physical components in the FPGA. This information is saved as a User Constraints File (UCF).
The Map process fits the submodules of the entire circuit onto the FPGA. PAR carries out the placement and routing process. Functional simulation is performed after the translation process to validate the functionality of the module. Static timing analysis and power analysis reports are generated after the PAR process, comprising the quantitative summary of time delay and power consumed. The bitstream generated after the design implementation is used to configure the target Virtex UltraScale FPGA device.

Figure 4: Asynchronous wrapper with point-to-point communication

Figure 5: Stretchable clock

3.1 QDMA

Various integrated blocks in the UltraScale+™ encompass QDMA for large DMA. This provides improved performance and flexibility with its bridge infrastructure, data transfer with a large packet count and higher bandwidth. QDMA implements queues that can be configured to operate in different modes with the PCI Express interface for virtualized application spaces and a broad range of multifunction applications. It also provides enhanced traffic management. Descriptors, incorporating QDMA,
could be used to transfer data from Host to Card (H2C) or vice versa.

3.2 Fast GALS Interface

In asynchronous wrappers with point-to-point communication (Figure 4), the clock generator is only active during data transfer. When the Start signal transitions from 0 to 1, the XOR gate drives the stop signal from 0 to 1, activating the clock. After processing the data, the Done signal goes from 0 to 1, and hence data are transferred to the output port through the FIFO block, which is a storage element at the receiver in the data path. To receive new data, the Start and Done signals return to their initial state, i.e. from 1 to 0. AFSMs are implemented in port controllers, communication is enabled through the implemented STGs, and their structures are defined by the C element. A latch is added to the dataflow path to prevent metastability. Logic hazards are not introduced as each control signal occupies one Look-Up Table (LUT) under conventional mapping. Pausible clocking in GALS systems has a major drawback of metastability [16] when the rising edge of the arriving clock and the Request signal occur simultaneously.

Stretchable clock circuits are used to resolve the issue of synchronization failure and to reduce power consumption. Stretchable clock architectures built from basic gates are unreliable in a few input states [17]. Stretchable clock circuits built with standard cells have a large delay compared to basic gates. Thus, the stretchable clock circuits in the proposed wrapper employ the Muller C-element, in which the Stretch signal controls the clock generator, as shown in Figure 5. During data transfer between two LS modules, the clock phases of the two domains are stretched via the St1 and St2 signals.

Figure 6: Signal transition graph (STG) of a stretchable clock

Figure 6 shows the signal transitions involved in the W-port and R-port. If the LS module is ready to transfer data, wr+ is triggered, and when the LS module is ready to accept data, rd+ is triggered. Figure 7 shows the trace plot of the stretchable clock STG simulated using Petri nets.

Figure 7: Simulation of stretchable clock

Wrappers can be designed for the multipoint GALS architecture (Figure 8), with two incoming signals and a single
outgoing signal; the data are not transferred concurrently, as no arbiter is present. The control module processes one data signal at a given time and transfers the data to the FIFO block after processing. The design of a multipoint GALS architecture is much more complex than point-to-point GALS. The point-to-point GALS wrapper contains a single input port and a single output port. The wrapper architecture can be modified into a multipoint topology with several input and output ports depending on the application, but only one request can be processed at a time. Thus, wrapper reusability is obtained in a multipoint wrapper, saving overall area and power consumption. "Join" and "Fork" are used for multipoint GALS interfaces. Arbiters can be used for concurrent processing in multipoint GALS.

Figure 8: Asynchronous wrapper with point-to-multipoint communication

4. RESULTS AND DISCUSSION

The QDMA module is simulated using the Vivado Tool Suite on an UltraScale+™ xcvu9p-fsgd2104 device at a maximum frequency of 257.4 MHz. Table 1 shows the confidence levels obtained for the proposed GALS architecture. The proposed design uses stretchable clocks, thus improving the source clock and destination clock delays (Table 2).

There is an improvement in latency compared with [18], since the GALS wrapper of [18], which comprises a pausible clock with a four-phase handshake protocol, involves more signal transitions. The circuit functions correctly if the timing bounds are met. The average latency is reduced by
82% compared to [18], 81% compared to [2] and 83% compared to [4] by the proposed GALS architecture. Another reason for the improvement is the implementation of two-phase handshaking signals over four-phase handshaking signals. Two-phase handshaking signals have the edge over four-phase handshaking signals, with fewer signal transitions and lower dynamic power dissipation.

Table 1: Confidence level
User input data                 Confidence
Design implementation state     Low
Clock nodes activity            High
I/O nodes activity              Low
Internal nodes activity         Medium
Overall confidence level        Medium

Table 2: Environment set-up
Ambient temp (°C)       25.0
ThetaJA (°C/W)          0.5
Airflow (LFM)           250
Heat sink               Medium (medium profile)
ThetaSA (°C/W)          0.7
Board selection         Medium (10"×10")
# of board layers       12 to 15 (12–15 layers)

Table 3: Average time of latency for GALS architectures
AW [18]     AW [2]      AW [4]      AW [19]     Proposed AW
33.8 ns     38.3 ns     35.6 ns     25.52 ns    6.40 ns

Synchronization is achieved using a D-latch followed by a T flip-flop to avoid metastability, circumventing system failure. The signal reaching the D-latch is asynchronous, and this signal will not reach the T flip-flop while it is metastable. Once the signal resolves from the metastable state to a valid logic level, it passes through the T flip-flop, which gives the output with respect to the synchronized signal. These circuits, which combine a D-latch and a T flip-flop to convert an asynchronous signal to a synchronous signal, are called synchronizer circuits, and they eliminate the issue of metastability. These synchronizers are low-power, consume less area, are highly reliable with a high Mean Time Between Failures (MTBF), and have low latency. However, synchronization between wrappers is accomplished using handshake signals. In the proposed methodology, the synchronizer circuit of [19] is replaced with a FIFO-based synchronizer, which reduces bandwidth and ensures reliable communication. The FIFO-based synchronizer ensures matching of the frequency rates.

The stretchable clock architecture in the proposed GALS wrapper uses the Muller C-element; hence, it operates at higher frequencies than circuits employing standard cells. Thus, the stretchable clock gating scheme used in this work improves performance by resolving the metastability encountered when pausible clocking is applied in the wrapper. Table 4 shows that the dynamic power dissipation reduces by 29% when a stretchable clock is used in the GALS wrapper. Hence, stretchable clocking schemes can be used in GALS techniques for low-power applications.

Table 4: Dynamic power dissipation
With stretchable clock      Without stretchable clock
164.98 nW                   235.16 nW

Static power dissipation remains essentially unchanged (2.512 W versus 2.490 W). The average dynamic power dissipated is reduced by 48% due to the implementation of an asynchronous wrapper compared to circuits without GALS (Table 5). Comparatively, a two-phase bundled-data protocol is more efficient than a four-phase bundled-data protocol since the return-to-zero transition is avoided, giving higher performance and lower power dissipation. However, edge-sensitive devices are often more complicated than level-sensitive devices, and the response of control logic and storage elements to signal transitions is more complex. Nevertheless, a two-phase bundled-data protocol is the chosen approach for a high-speed system with unconditional data flow. Besides, there is a significant reduction in dynamic power dissipation owing to the reduced transitions in the two-phase handshake protocol and to clock gating techniques. The reduced power dissipation and better throughput are achieved at the cost of a marginal increase in average gate count.

Table 5: Results obtained by the proposed GALS architecture
Specification               Power without GALS      Power with GALS
Total on-chip power (W)     5.422                   3.990
Design power budget (W)     Unspecified∗            Unspecified∗
Power budget margin (W)     NA                      NA
Dynamic (W)                 2.910                   1.500
Device static (W)           2.512                   2.490
Effective TJA (°C/W)        0.5                     0.5
Max ambient (°C)            97.2                    97.9
Junction temperature (°C)   27.8                    27.1
Confidence level            Medium                  Medium

There is a trade-off in gate count of up to 13%. There is a marginal increase in LUT utilization due to logic implementation, as shown in Table 8. The improvement in performance and latency achieved is 7% and 5%, respectively (Tables 6–9).
Table 6: Power consumed by different modules without GALS
Entity              Power (W)
local_sync_core     1.915
LS_core             1.915
inst_core           1.915

Table 7: Power consumed by different modules with GALS
Entity              Power (W)
GALS_async_core     1.489
LS_core             1.489
inst_core           1.489

Table 8: Area utilization report
Site type                   Area without GALS   Utilization %   Area with GALS   Utilization %
CLB LUTs                    25751               2.18            29647            2.51
LUT as logic                22799               1.93            26667            2.26
LUT as memory               2952                0.50            2980             0.50
LUT as distributed RAM      2952                                2980
LUT as shift register       0                                   0
CLB registers               53706               2.27            62347            2.64
Register as flip flop       53642               2.27            62283            2.63
Register as latch           64                  < 0.01          64               < 0.01
CARRY8                      1201                0.81            1683             1.14
F7 Muxes                    1393                0.24            1198             0.20
F8 Muxes                    689                 0.23            351              0.12
F9 Muxes                    0                   0.00            0                0.00

Table 9: Power consumption report
On-chip                     Without GALS power (W)   With GALS power (W)
Clocks                      0.495                    0.329
CLB logic                   0.409                    0.341
LUT as logic                0.274                    0.238
LUT as distributed RAM      0.067                    0.06
Register                    0.059                    0.039
CARRY8                      0.009                    0.005
BUFG                        < 0.001                  < 0.001
Others                      0                        0
F7/F8 Muxes                 0                        0
Signals                     0.584                    0.418
Block RAM                   1.404                    0.395
URAM                        0.015                    0.015
I/O                         0.002                    0.002
Static power                2.512                    2.49
Total                       5.422                    3.99

5. CONCLUSION

Some of the challenges faced in implementing an SoC circuit can be overcome by adopting GALS architectures. The architecture proposed in this paper contains an asynchronous wrapper with a single port controller for the QDMA subsystem. The proposed work uses a two-phase bundled-data protocol over a four-phase bundled-data protocol since it provides increased performance, although the circuit implementation is quite complex. The two-phase communication protocol adopted in this paper reduces latency as it has fewer signal transitions. The single port controller controls the entire communication between LS modules, and it can be modified for multi-point and point-to-point interfaces. The motive behind this approach is to design the system in a modular way, wherein each module of the system uses more optimistic delay models, and the interconnection between independent modules is established based on delay-insensitive models. Therefore, the proposed architecture contributes to latency reduction and adopts more power-efficient techniques, traded off against a marginal increase in gate count.

DISCLOSURE STATEMENT

No potential conflict of interest was reported by the author(s).

ORCID

B.K. Vinay http://orcid.org/0000-0001-7778-1376

REFERENCES

1. E. G. Friedman, “Clock distribution networks in synchronous digital integrated circuits,” Proc. IEEE, Vol. 89, pp. 665–92, 2001. doi:10.1109/5.929649

2. J. Muttersbach, “Globally-asynchronous locally-synchronous architectures for VLSI systems,” Ph.D. thesis, ETH, Zurich, 2001.

3. M. Krstic and E. Grass, “System integration by request-driven GALS design,” IEE Proc. Comput. Digit. Tech., Vol. 153, no. 5, pp. 362–72, September 2006. doi:10.1049/ip-cdt:20050210

4. J. Pontes, R. Soares, E. Carvalho, F. Moraes, and N. Calazans, “SCAFFI: An intrachip FPGA asynchronous interface based on hard macros,” in 25th International Conference on Computer Design, 2007, pp. 541–546. doi:10.1109/ICCD.2007.4601950

5. D. L. Oliveira, L. A. Faria, and E. Lussari, “Design of an improved and robust asynchronous wrapper (AW) for FPGA applications,” J. Integr. Circuits Syst., Vol. 8, no. 1, pp. 54–63, 2013. doi:10.29292/jics.v8i1.372

6. D. L. Oliveira, T. Curtinhas, L. A. Faria, J. L. V. Oliveira, and L. Romano, “Design of gated-clock asynchronous wrappers for multi-point GALS systems,” IEEE ANDESCON, 1–4, 2016. doi:10.1109/ANDESCON.2016.7836214

7. D. L. Oliveira, E. Lussari, S. S. Sato, and L. A. Faria, “An asynchronous interface with robust control for globally asynchronous locally-synchronous systems,” J. Aerosp. Technol. Manage., Vol. 5, no. 1, pp. 91–102, 2013. doi:10.5028/jatm.v5i1.191

8. D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” PhD thesis, Stanford University, October 1984.

9. P. Teehan, M. Greenstreet, and G. Lemieux, “A survey and taxonomy of GALS design styles,” IEEE Des. Test Comput., Vol. 24, pp. 418–28, September–October 2007. doi:10.1109/MDT.2007.151

10. Y.-T. Chang, W.-C. Chen, H.-Y. Tsai, W.-M. Cheng, C.-J. Chen, and F.-C. Cheng, “A low-latency GALS interface implementation,” in 2010 IEEE Asia Pacific Conference on Circuits and Systems, 6–9 Dec. 2010, pp. 1183–1186. doi:10.1109/APCCAS.2010.5774997

11. E. Amini, M. Najibi, and H. Pedram, “Globally asynchronous locally synchronous wrapper circuit based on clock gating,” in Symposium on Emerging VLSI Technologies and Architectures, 2006. doi:10.1109/ISVLSI.2006.48

12. D. Kim, M. Kim, and G. E. Sobelman, “Asynchronous FIFO interfaces for GALS on-chip switched networks,” SoC, 2005.

13. A. Chakraborty and M. R. Greenstreet, “Efficient self-timed interfaces for crossing clock domains,” in Ninth International Symposium on Asynchronous Circuits and Systems (ASYNC), 2003, pp. 78–88. doi:10.1109/ASYNC.2003.1199168

14. D. L. Oliveira, T. Curtinhas, L. A. Faria, H. A. Delsoto, and L. Romano, “Design of low-latency asynchronous wrapper for GALS systems,” in XVIII Simpósio de Aplicações Operacionais em Áreas de Defesa (SIGE), 2016.

15. A. Reddy Ravi, “Globally-asynchronous, locally-synchronous wrapper configurations for point-to-point and multi-point data communication,” Master thesis, University of Central Florida, 2004.

16. K. Y. Yun and R. P. Donohue, “Pausible clocking: A first step toward heterogeneous systems,” in Proceedings of International Conference on Computer Design (ICCD), Texas, USA, Oct. 7–9, 1996, pp. 118–23.

17. D. S. Bormann and P. Y. K. Cheung, “Asynchronous wrapper for heterogeneous systems,” in Proceedings of International Conference on Computer Design (ICCD), Texas, USA, Oct. 12–15, 1997, pp. 307–14.

18. D. L. Oliveira, T. Curtinhas, L. A. Faria, and L. Romano, “A novel asynchronous interface with pausible clock for partitioned synchronous modules,” in IEEE 6th Latin American Symposium on Circuits & Systems (LASCAS), 2015, pp. 1–4. doi:10.1109/LASCAS.2015.7250441

19. T. Curtinhas, D. L. Oliveira, O. Saotome, and J. B. Brandolin, “FPGA implementation of low-latency robust asynchronous interfaces for GALS systems,” in Electronics Electrical Engineering and Computing (INTERCON) 2018 IEEE XXV International Conference, 2018, pp. 1–4. doi:10.1109/LASCAS.2015.7250441

AUTHORS

B.K. Vinay has a Bachelor's Degree in Electronics and Communication Engineering and a Master's in Signal Processing & VLSI Design. He has more than 7 years of teaching and industry experience. Currently he is pursuing his Ph.D. in VLSI design. Some of his projects have been funded by the Department of Science & Technology, Government of India, and the Karnataka State Council for Science and Technology. His research interests include Low power VLSI Design, Analog and Mixed signal VLSI Design, Circuit design and simulations, DSP and Embedded Systems Design.

Corresponding author. Email: [email protected]

S. Pushpa Mala completed her Ph.D. at Jain University, Bangalore. Her research interests include Image Processing, Signal Processing and Very Large-Scale Integrated Systems. Some of her projects have been funded by the Karnataka State Council for Science and Technology. She has published around 30 papers in various indexed international journals and conferences. She is a SMIEE.

Corresponding author. Email: [email protected]

S. Deekshitha is an undergraduate student in the Department of Electronics and Communication Engineering at CMRIT, Visvesvaraya Technological University. Her research interests include VLSI Design.

Email: [email protected]
