Asynchronous Wrapper Based Low Power GALS Structural QDMA
To cite this article: B.K. Vinay, S. Pushpa Mala & S. Deekshitha (2022): Asynchronous
Wrapper-Based Low-Power GALS Structural QDMA, IETE Journal of Research, DOI:
10.1080/03772063.2021.2021814
ABSTRACT
The design of System-on-Chip systems using synchronous circuits involves complex clock distribution strategies, which pose challenges for designers integrating large-scale systems. Globally Asynchronous Locally Synchronous architectures containing asynchronous port controllers encapsulated in a self-timed wrapper have been adopted in this work. These port controllers communicate through Asynchronous Finite State Machines defined by Signal Transition Graphs and are implemented using the C element. This GALS architecture, implemented for the point-to-point interface, can also be modified for the multipoint interface. The proposed methodology uses a two-phase handshake protocol to communicate between two Locally Synchronous modules as it has fewer signal transitions, which, in turn, reduces latency. In this paper, the Queue Direct Memory Access subsystem is implemented using the Vivado simulator on an UltraScale+™ device at a maximum frequency of 257.4 MHz, and various parameters are reported. A comparison shows that the proposed wrapper improves latency by 53%, with a reduction in power dissipated by 27% and an increase in gate count by 13%.

KEYWORDS
Asynchronous wrapper; Globally asynchronous locally synchronous (GALS); Handshake protocol; Muller C element; Port controller; Signal transition graphs (STG); Synthesis logic
1. INTRODUCTION

A System-on-Chip (SoC) integrates individual IPs, each with a specific functionality, onto a single platform. Synchronous circuits require optimization of clock distribution networks to attain reduced latency, which is a complex process [1]. Consequently, a methodology for implementing asynchronous designs has to be developed. Globally Asynchronous Locally Synchronous (GALS) architectures encompass asynchronous wrappers with locally synchronous (LS) modules. These LS modules communicate with each other using handshake protocols, which are technically asynchronous [2]. An asynchronous wrapper employs three handshake processes. The first handshake is at the input port for receiving data, the second is between the input port and the output port for generating the clock, and the third is at the output port for transferring the data. This wrapper has an average power consumption of 1 mW for a data stream at 50 MHz [3].

Stretchable Clock Asynchronous Flexible FPGA Interfaces (SCAFFI) interconnect the LS modules to a Field Programmable Gate Array (FPGA) for GALS architectures. These architectures use arbiters to pause the LS modules' clock before the data are transferred. At a later stage, the clock is restarted once the data achieve a stable state, a method known as clock stretching. Clock stretching eliminates metastability. The performance gain is less due to increased modular multiplications during the execution phase [4]. A novel architecture for the asynchronous GALS wrapper has been proposed for port controllers to communicate with the asynchronous wrapper using direct mapping style [5]. With an increase in the input and output ports for data communication, the wrapper becomes more complex and the area increases. The data transfer in SoC with GALS systems uses multi-point interfaces, reducing latency and area [6]. An asynchronous wrapper (AW) realizes fault tolerance for autonomous modules and delay insensitive (DI) designs, which do not require isochronic fork conditions to be met [7].

Various design methodologies for GALS architecture include plausible clocks, asynchronous and locally synchronous modules. Plausible clocks avoid metastability by delaying the sampling of the clock until the arrival of data. In asynchronous interface design styles, the signal received from the outer clock domain is transferred to the local clock domain by synchronizers [8]. LS design styles analyze time bounds, overcoming the need for handshaking for data transfer [9]. Signal Transition Graphs (STG) represent the flow of positive and negative edges of the signals. In the proposed wrapper, a modified STG is adopted to reduce the communication time between two LS modules. A latch is added between two LS modules to store data for efficient communication [10]. Furthermore, a gated clock-based interface for GALS has been suggested wherein the external clock is gated to drive the local clock of the LS modules based on the request from port controllers [11]. The GALS interface uses First in First out (FIFO) buffers operating in asynchronous mode for data transfer between mixed clock-based LS modules [12]. The latency involved in synchronization between two LS modules is reduced using high-bandwidth communication called the STARI-based GALS interface, which deploys a single-stage FIFO at the receiver with the advantage of clock stability [13]. Oliveira et al. [7] proposed a single-port controller for managing data communication in multipoint and point-to-point GALS for reduced area consumption. Stretchable clocks are realized to control the clock generator [14]. Asynchronous elements such as join and fork could be used: “join” merges various data signals and sends them to a GALS module, while “fork” sends data to various sinks [3].

Applications involving SoC with multiple IPs integrated on a single chip are quite challenging to design due to advancements in the scale of integration. The complexity of an SoC circuit design escalates due to a constant reduction in feature size instigated by scaling. The design of an SoC circuit plays a vital role in increasing the performance of the system. The synchronous strategies for designing an SoC adopt a master clock for the synchronization of various data signals across the chip. These synchronous design strategies contribute to various challenges, such as clock skew and high dynamic power consumption at high frequencies. This entails complex timing analysis that considers the capacitive loads and interconnect resistances of clock signals. Due to the complexity involved in synchronous design strategies, SoC applications have adopted asynchronous design strategies. Hence, GALS techniques are introduced for asynchronous designs to achieve the maximum performance of the SoC system. The process-voltage-temperature (PVT) variations are within the tolerance levels for asynchronous circuits compared to synchronous circuits, meeting the requirements for robust applications. The performance parameters corresponding to low power, high speed and reduction of electromagnetic interference are improved by opting for asynchronous design strategies over synchronous design strategies.

GALS techniques simplify timing analysis and shorten time to market for an SoC circuit by reusing functional IP blocks. These structures adopt a modular approach by developing individual LS modules, which are integrated via port controllers with asynchronous logic developed using CAD tools. Furthermore, two-phase and four-phase handshake protocols are implemented by port controllers to initiate asynchronous communication between the sender and receiver LS modules. In this proposed methodology, a two-phase handshake protocol is adopted since it has fewer transitions and reduced latency compared to a four-phase handshake protocol. The communication between two LS modules can be point-to-point or point-to-multipoint. The AW encapsulates port controllers besides the LS modules. The port controllers, modeled by Asynchronous Finite State Machines (AFSMs) for providing asynchronous communication between LS modules, are made hazard-free by implementing them using STG. The logical equations are mapped by the STG into standard library cells using a 3D tool, and finally the gate-level netlist is generated. Point-to-point communication between AWs involves a single incoming and a single outgoing signal. The AWs can be generalized to multi-point GALS with multiple incoming and outgoing signals, which cannot be activated concurrently as arbiters are not used. Although point-to-multipoint GALS wrappers consume more area on the chip than point-to-point GALS wrappers, they eliminate redundancy to a greater extent. They coordinate in sending and receiving data by activating LS modules accordingly through stretchable clocks. These stretchable clocks are chosen over plausible clocks to design the wrapper to handle reduced performance issues.

2. GALS INTERFACE: AN OVERVIEW

GALS modules adopt LS modules with their own clock generator and asynchronous port controllers encapsulated in a self-timed wrapper. The operation of an asynchronous port controller is modeled by AFSMs implemented through STG. Since implementation through STG promises hazard-free ports, the ports can be designed in burst mode or extended burst mode formats. The local clock generator is made tunable for stopping and adjusting the frequency to synchronize data transfer (Figure 1).

A new clock pulse is generated only if the requests from all ports to stop the clock are low. Metastability is resolved by receiving all the requests from ports with different mutual exclusion (MuTex) elements.

Data communication between the sender and the receiver in asynchronous systems follows handshake protocols indicating data arrival and availability. Figure 2 [15]
represents two communication protocols, i.e. two-phase and four-phase handshake protocols, communicating through request and acknowledge signals. In a two-phase protocol, Figure 2(a), a request signal is sent from the transmitting circuit to the receiving circuit, indicating the presence of data, and as the receiver circuit receives the data, the acknowledge signal undergoes a transition. In a four-phase protocol, Figure 2(b), the transmitter circuit indicates the start of data transmission with a transition on the request signal, and the receiver acknowledgement is denoted by a transition on the acknowledge signal. This, in turn, causes the request signal to return to its initial state at the transmission side. Furthermore, data are accepted by the receiver. The acknowledge signal is restored after the request signal is restored [15].

Return to Zero (RTZ) signaling is also known as a four-phase bundled data protocol. The sender transfers data, which is indicated by setting the Request signal high. The receiver accepts the data, which is indicated by setting the Ack signal high. The response from the sender is indicated by a high-to-low transition on the Request signal (this shows that data validity is no longer guaranteed). Finally, a high-to-low transition on the Ack signal indicates an acknowledgement by the receiver. The sender may then initiate the next communication cycle. Although simplicity is its advantage, the RTZ nature of this protocol consumes more energy and time. If the time to process valid data (when the Request signal is high) and the time to process null data (when the Request signal is low) are equal, then the resultant data rate or throughput is reduced by a factor of 2. To overcome these disadvantages, a two-phase bundled data protocol could be used, which is also known as Non-Return to Zero (NRZ) signaling or transition signaling. Here, information on the Request and Ack signals is transferred as signal transitions, and there is no difference between a 1→0 and a 0→1 transition. This shows that four-phase protocols require more signal transitions than two-phase protocols during data transfer. As a result, two-phase protocols are chosen over four-phase protocols, and a comparatively lower latency is obtained.

3. PROPOSED DESIGN FLOW

GALS architectures comprise asynchronous wrappers constituting LS modules and port controllers to handle communication between various LS modules. The proposed design flow is depicted in Figure 3. The IP functionality is defined using a Hardware Description Language on the Vivado Tool Suite at the RTL development stage. The IP integrator in Vivado interconnects various IP cores by instantiating them to build the final Queue Direct Memory Access (QDMA) module. Design verification is done using the Vivado Simulator to verify specific functionalities of the QDMA module. The synthesized netlist generated is used to analyze the hierarchy of the design and ensure design optimization by eliminating redundant logic modules. The syntax is verified, and the obtained netlist is saved as a Native Generic Circuit (NGC) file.

Further into the process, floor planning and place and route (PAR) are performed as a part of the design implementation. Translate constitutes the design file, a combination of the relevant netlist with constraints, wherein the constraints assign the ports to the physical components in the FPGA. This information is saved as UCF.
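As a rough illustration of the handshake comparison in Section 2, the sketch below counts Request/Ack transitions per transferred data item for the four-phase (RTZ) and two-phase (NRZ) bundled-data protocols. It is a behavioral model only, not the authors' port controller; the function names `four_phase_transfer`, `two_phase_transfer` and `count_transitions` are hypothetical and introduced purely for illustration.

```python
def four_phase_transfer(req, ack):
    """Return-to-zero handshake: req rises, ack rises, req falls, ack falls."""
    events = []
    req = 1
    events.append(("req", req))   # sender signals valid data
    ack = 1
    events.append(("ack", ack))   # receiver accepts the data
    req = 0
    events.append(("req", req))   # sender returns to zero
    ack = 0
    events.append(("ack", ack))   # receiver returns to zero
    return req, ack, events


def two_phase_transfer(req, ack):
    """Transition signaling: any edge on req or ack carries the event."""
    events = []
    req ^= 1
    events.append(("req", req))   # a single toggle signals valid data
    ack ^= 1
    events.append(("ack", ack))   # a single toggle acknowledges it
    return req, ack, events


def count_transitions(transfer, n_items):
    """Total number of req/ack signal transitions for n_items transfers."""
    req = ack = 0
    total = 0
    for _ in range(n_items):
        req, ack, events = transfer(req, ack)
        total += len(events)
    return total


if __name__ == "__main__":
    items = 8
    print("four-phase:", count_transitions(four_phase_transfer, items))
    print("two-phase: ", count_transitions(two_phase_transfer, items))
```

Under this simplified model the four-phase protocol needs four control transitions per item and the two-phase protocol two, which is consistent with the factor-of-2 argument in the text; real latency also depends on gate and wire delays that the sketch ignores.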
Figure 5: Stretchable clock

Map process fits the submodules of the entire circuit onto the FPGA. PAR carries out the placement and routing process. Functional simulation is performed after the translation process to validate the functionality of the module. Static timing analysis and power analysis reports are generated after the PAR process, comprising

3.1 QDMA

Various integrated blocks in the UltraScale+™ encompass QDMA for large DMA. This provides improved performance and flexibility with its bridge infrastructure, data transfer with a large packet count and higher bandwidth. QDMA implements queues that could be configured to be operated in different modes with PCI Express interface for virtualized application spaces and a broad range of malfunctioning. It also provides for enhanced traffic management. Descriptors, incorporating QDMA,
new data, the start and the done signals go to their initial state, i.e. from 1 to 0. AFSMs are implemented in port controllers, communication is enabled through the implemented STGs, and their structures are defined by the C element. A latch is added to the dataflow path to prevent metastability. Logic hazards are not introduced as each control signal occupies one Look-Up Table (LUT) for conventional mapping. Plausible clocking in GALS systems has a major drawback of metastability [16], which arises when the rising edge of the arriving clock and the Request signal occur simultaneously.
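The C element mentioned above can be summarized behaviorally: its output follows the inputs when they agree and holds its previous value otherwise. The following is a minimal sketch of that textbook behavior, assuming a two-input element with an initial output of 0; the class name `MullerC` and the helper `majority_form` are hypothetical and do not describe the LUT mapping used in the paper.

```python
class MullerC:
    """Behavioral model of a two-input Muller C element."""

    def __init__(self, state=0):
        self.state = state  # assumed initial output value

    def update(self, a, b):
        # Output changes only when both inputs agree; otherwise it holds.
        if a == b:
            self.state = a
        return self.state


def majority_form(a, b, c):
    """Equivalent next-state equation: c' = a*b + c*(a + b)."""
    return (a & b) | (c & (a | b))


if __name__ == "__main__":
    c = MullerC()
    stimulus = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
    print([c.update(a, b) for a, b in stimulus])
```

Because the output waits for both inputs before switching, the element acts as a rendezvous point between request and acknowledge events, which is why it is a common building block for handshake controllers.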
outgoing signal, while the data are not transferred concurrently as the arbiter is not present. The control module processes one data signal at a given time and transfers the data to the FIFO block after processing. The design of a multipoint GALS architecture is much more complex than point-to-point GALS. The point-to-point GALS wrapper contains a single input port and an output port. The wrapper architecture can be modified as a multipoint topology with several input and output ports depending on the application, but only one request can be processed at a time. Thus, wrapper reusability is availed in a multipoint wrapper, saving overall area and power consumption. “Join” and “Fork” are used for multipoint GALS interfaces. Arbiters can be used for concurrent processing in multipoint GALS.

4. RESULTS AND DISCUSSION

The QDMA module is simulated using the Vivado Tool Suite on an UltraScale+™ xcvu9pfsgd2104 device at a maximum frequency of 257.4 MHz. Table 1 shows the confidence levels obtained for the proposed GALS architecture. The proposed design uses stretchable clocks, thus improving the source clock and destination clock delay, respectively (Table 2).

Table 1: Confidence level
User input data                  Confidence
Design implementation state      Low
Clock nodes activity             High
I/O nodes activity               Low
Internal nodes activity          Medium
Overall confidence level         Medium

Table 2: Environment set-up
Ambient temp (C)                 25.0
ThetaJA (C/W)                    0.5
Airflow (LFM)                    250
Heat sink                        Medium (medium profile)
ThetaSA (C/W)                    0.7
Board selection                  medium (10"×10")
# of board layers                12 to 15 (12–15 Layers)

There is improvement in latency compared with [18] since the GALS wrapper in [18], comprising a plausible clock with a four-phase handshake protocol, has several transitions. The circuit functions if these bounds are met correctly. The average latency is reduced by 82% compared to [18], 81% compared to [2] and 83% compared to [4] by the proposed GALS architecture.

Table 3: Average time of latency for GALS architectures
AW [18]      AW [2]       AW [4]       AW [19]      Proposed AW
33.8 ns      38.3 ns      35.6 ns      25.52 ns     6.40 ns

Another reason for the improvement achieved is the implementation of two-phase handshaking signals over four-phase handshaking signals. Two-phase handshaking signals have the edge over four-phase handshaking signals, with fewer signal transitions and lower dynamic power dissipation.

Synchronization is achieved using a D-latch followed by a T flip-flop to avoid metastability, circumventing system failure. The signal reaching the D-latch is asynchronous, and this signal will not reach the T flip-flop if it is metastable. Once the signal resolves from the metastable state and settles to a valid logic level, it passes through the T flip-flop, which gives the output with respect to the synchronized signal. These circuits, called synchronizer circuits, combine a D-latch and a T flip-flop to convert an asynchronous signal to a synchronous signal, thus eliminating the issue of metastability. These synchronizers are low-power, consume less area, are highly reliable with high MTBF (Mean Time Between Failures), and have low latency. However, synchronization between wrappers is accomplished using handshake signals. In this proposed methodology, the synchronizer circuit of [19] is replaced with a FIFO-based synchronizer, which reduces bandwidth and ensures reliable communication. The FIFO-based synchronizer ensures the matching of frequency rate.

Stretchable clock architecture in the proposed GALS wrapper has the Muller-C element; hence, it operates at higher frequencies than circuits employing standard cells. Thus, the stretchable clock gating scheme used in this work improves the performance by resolving the issue of metastability encountered when plausible clocking is applied in the wrapper. Table 4 shows that the dynamic power dissipation reduces by 29% when a stretchable clock is used in the GALS wrapper. Hence, stretchable clocking schemes can be used in GALS techniques for low-power applications.

Table 4: Dynamic power dissipation
With stretchable clock       Without stretchable clock
164.98 nW                    235.16 nW

Static power dissipation remains nearly unchanged. The average power dissipated is reduced by 48% due to the implementation of an asynchronous wrapper compared to circuits without GALS (Table 3). Comparatively, a two-phase bundled-data protocol is more efficient than a four-phase bundled-data protocol since the return-to-zero transition, with its cost in performance and power dissipation, is avoided. Edge-sensitive devices are often more complicated than level-sensitive devices.

Table 5: Results obtained by the proposed GALS architecture
Specification                Power without GALS    Power with GALS
Total on-chip power (W)      5.422                 3.990
Design power budget (W)      Unspecified*          Unspecified*
Power budget margin (W)      NA                    NA
Dynamic (W)                  2.910                 1.500
Device static (W)            2.512                 2.490
Effective TJA (C/W)          0.5                   0.5
Max ambient (C)              97.2                  97.9
Junction temperature (C)     27.8                  27.1
Confidence level             Medium                Medium

The response of control logic and storage elements to signal transitions is more complex. Thus, a two-phase bundled data protocol is the chosen approach in a high-speed system with unconditional data flow. Besides, there is a significant reduction in dynamic power dissipation as there are fewer transitions with the two-phase handshake protocol and clock gating techniques. The reduced power dissipation and better throughput are achieved at the cost of a marginal increase in average gate count.

There is a trade-off in gate count of up to 13%. There is a marginal increase in the LUT utilization due to logic implementation, as shown in Tables 4 and 5. The improvement in performance and latency achieved is 7% and 5%, respectively (Tables 6–9).
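The percentage figures quoted in this section follow from a simple relative-reduction calculation on the tabulated values. The sketch below recomputes them from Tables 3 to 5 as they appear here; the helper `percent_reduction` and the dictionary names are hypothetical, and the results are rounded to one decimal place, so they may differ slightly from the rounded figures quoted in the text.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Table 3: average latency in ns (values copied from the table as printed)
latency_ns = {"AW [18]": 33.8, "AW [2]": 38.3, "AW [4]": 35.6, "AW [19]": 25.52}
proposed_latency_ns = 6.40

latency_gain = {
    name: round(percent_reduction(value, proposed_latency_ns), 1)
    for name, value in latency_ns.items()
}

# Table 4: dynamic power in nW without and with the stretchable clock
clock_gain = round(percent_reduction(235.16, 164.98), 1)

# Table 5: power in W without and with the GALS wrapper
dynamic_gain = round(percent_reduction(2.910, 1.500), 1)
total_gain = round(percent_reduction(5.422, 3.990), 1)

if __name__ == "__main__":
    for name, gain in latency_gain.items():
        print(f"latency reduction vs {name}: {gain}%")
    print(f"dynamic power reduction with stretchable clock: {clock_gain}%")
    print(f"dynamic power reduction with GALS: {dynamic_gain}%")
    print(f"total on-chip power reduction with GALS: {total_gain}%")
```

Keeping the arithmetic next to the tables makes it easy to see which baseline each quoted percentage refers to.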
Table 6: Power consumed by different modules with GALS
Entity               Power (W)
local_sync_core      1.915
LS_core              1.915
inst_core            1.915

Table 7: Power consumed by different modules without GALS
Entity               Power (W)
GALS_async_core      1.489
LS_core              1.489
inst_core            1.489

Table 8: Power consumption report
Site type                  Area without GALS   Utilization %   Area with GALS   Utilization %
CLB LUTs                   25751               2.18            29647            2.51
LUT as logic               22799               1.93            26667            2.26
LUT as memory              2952                0.50            2980             0.50
LUT as distributed RAM     2952                                2980
LUT as shift register      0                                   0
CLB registers              53706               2.27            62347            2.64
Register as flip flop      53642               2.27            62283            2.63
Register as latch          64                  < 0.01          64               < 0.01
CARRY8                     1201                0.81            1683             1.14
F7 Muxes                   1393                0.24            1198             0.20
F8 Muxes                   689                 0.23            351              0.12
F9 Muxes                   0                   0.00            0                0.00

the QDMA subsystem. The proposed work uses a two-phase bundled-data protocol over a four-phase bundled-data protocol since it provides increased performance, although the circuit implementation is quite complex. The two-phase communication protocol adopted in this paper reduces the latency as it has fewer signal transitions. The single port controller controls the entire communication between LS modules, and it is modified for multi-point and point-to-point interfaces. The motive behind this approach is to design the system in a modular way, wherein each module of the system provides more optimistic delay models, and the interconnection between independent modules is established based on the Delay Insensitive models. Therefore, the proposed architecture contributes to latency reduction and adopts more power-efficient techniques, traded off for a marginal increase in the gate count.

DISCLOSURE STATEMENT

No potential conflict of interest was reported by the author(s).

ORCID

B.K. Vinay http://orcid.org/0000-0001-7778-1376

REFERENCES

1. E. G. Friedman, “Clock distribution networks in synchronous digital integrated circuits,” Proc. IEEE, Vol. 89, pp. 665–92, 2001. doi:10.1109/5.929649