A Pausible Bisynchronous FIFO For GALS Systems: Ben Keller, Matthew Fojtik, and Brucek Khailany

This document describes a novel pausible bisynchronous FIFO design for globally asynchronous, locally synchronous (GALS) systems that achieves low latency asynchronous communication. The design uses standard two-ported synchronous FIFOs with a new pointer flow control scheme that allows data transfer across asynchronous interfaces using two-phase increment and acknowledge signals. Analysis shows the design achieves an average latency of 1.34 clock cycles across asynchronous interfaces while using less energy and area than traditional synchronization techniques.

A Pausible Bisynchronous FIFO for GALS Systems

Ben Keller∗†, Matthew Fojtik∗, and Brucek Khailany∗

∗NVIDIA Corporation  †University of California, Berkeley
[email protected], [email protected], [email protected]

Abstract: Many of the challenges of modern SoC design can be mitigated or eliminated with globally asynchronous, locally synchronous (GALS) design techniques. Partitioning a design into many synchronous islands introduces myriad asynchronous boundary crossings which typically incur high latency. We have designed a pausible bisynchronous FIFO that achieves low interface latency with a pausible clocking scheme. While traditional synchronizers have a non-zero probability of metastability and error, pausible clocking enables error-free operation by permitting infrequent slowdowns in the clock rate. Unlike prior pausible synchronizers, our circuit employs standard two-ported synchronous FIFOs, common circuit elements that integrate well with standard toolflows. The pausible bisynchronous FIFO achieves an average latency of 1.34 cycles across an asynchronous interface while using less energy and area than traditional synchronizers.

This research was developed, in part, with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Distribution Statement A (approved for public release, distribution unlimited).

I. INTRODUCTION

Modern SoCs built in deeply scaled process nodes present extraordinary design challenges. Slow wires and process, voltage, and temperature (PVT) variation make the synchronous abstraction increasingly untenable over large chip areas, requiring immense effort to achieve timing closure. The globally asynchronous, locally synchronous (GALS) design methodology is one means of mitigating the difficulty of global timing closure. GALS design flows delimit "synchronous islands" of logic that operate on local clocks and communicate with each other asynchronously.

GALS has a decades-long history in academia [1], and the use of multiple clock domains is common in industry today [2] [3]. However, individual clock domains in large commercial designs still span many square millimeters, and so many of the design challenges posed by a fully synchronous design persist in these systems. The full advantages of GALS design can only be realized if large SoCs are partitioned into myriad small synchronous blocks, not a handful of large areas, an approach we refer to as fine-grained GALS. Industry has been reluctant to adopt this approach due to three main issues: the difficulty of generating many local clocks, the latency incurred by asynchronous boundary crossings, and the challenge of integrating GALS methodology into standard toolflows. Overcoming these difficulties will permit widespread adoption of GALS and ease the timing closure challenge in large SoCs.

We propose a novel design for low-latency asynchronous boundary crossings using pausible clocks. Our interface uses the standard two-port first-in first-out (FIFO) queues common in digital designs, but synchronizes read and write pointer updates using two-phase signals that allow data to traverse the interface with very low latency. The contributions of this paper include:

1) A novel flow-control scheme for FIFO pointers that uses two-phase increment and acknowledge signals to transmit data across an asynchronous interface.
2) A low-latency bisynchronous FIFO design that uses pausible clocking techniques with standard two-ported synchronous FIFOs that integrate easily into standard toolflows.
3) A thorough analysis of the timing constraints imposed by pausible clocking systems, including consideration of the delay required for signals to traverse the distance between the interface and the clock generator circuit.

We believe that this work overcomes many of the barriers to the adoption of fine-grained GALS in modern SoCs.

Fig. 1. A brute-force bisynchronous FIFO.

II. BACKGROUND

While industry has not yet embraced the GALS approach, progress has been made in overcoming the barriers to GALS. Previous work addresses some of the challenges of local clock generation and synchronization latency.

A. Local Clock Generation

Historically, on-chip clocks have typically been generated by phase-locked loop (PLL) circuits. These circuits can reliably generate a fixed target frequency, but are large, power-hungry, and difficult to design, making them poor candidates for inclusion in each synchronous island of a GALS system. Recently, some systems have abandoned the goal of a fixed target frequency in favor of adaptive clocking schemes that can temporarily vary the clock period in response to noise events [4]. Going further, some adaptive clocks do not target
a particular frequency at all, instead using replica critical path circuits to continuously adjust the generated clock as local conditions change [5]. Since they do not attempt to lock to particular frequency targets, adaptive clock generators avoid much of the complexity of PLL circuits. These circuits impose minimal overhead on the overall system, and are ideal candidates for local clock generation in GALS designs [6].

Fig. 2. Adaptive clock generators (a) can be extended with a mutex to synchronize requests, forming a pausible clock circuit (b).

B. Synchronization Latency

Signals crossing the boundary between fully asynchronous clock domains, such as those crossing between synchronous islands in a GALS design, must be synchronized to minimize the risk of metastability and operational failure. This synchronization is typically achieved by sending such signals through several flip-flops in series in the receiver clock domain. The flip-flops delay the signal for one or more cycles, providing extra time for any metastability to resolve. While these brute force (BF) synchronizers do not eliminate the possibility of metastability, they can reduce the probability until it is negligible. BF synchronizers can be used with a FIFO memory to construct a BF bisynchronous FIFO as shown in Figure 1. This FIFO safely transmits data between two clock domains, synchronizing the read and write pointers with BF synchronizers. The pointers must be Gray coded so that any synchronization error does not disrupt the pointer location by more than one increment; the logic to encode and decode the pointers is an overhead of this scheme.

Several circuits have been designed with the explicit purpose of reducing the synchronization latency penalty. Chakraborty and Greenstreet demonstrate circuits that can synchronize data with low latency if some information about the relative phase of the two clocks is known at design time [7]. However, their scheme is not practical in the case of fully asynchronous clock domains. Dally and Tell devised an even/odd synchronizer that achieves low-latency communication across an asynchronous interface, but their circuit requires a complicated phase predictor and functions only with stable clock frequencies [8]. These circuits are useful in certain applications, but do not provide a satisfactory solution to the barriers to GALS adoption.

C. Pausible Clocking

A different method to reduce synchronization latency is pausible clocking. As described in [9], pausible clocks take advantage of the adaptive clock circuits already present in many GALS implementations. A simple adaptive clock circuit consists of one or more inverting delay lines fed into the input of a Muller C-element (see Figure 2a). These delay lines replicate the various critical paths found in the synchronous logic island; the C-element ensures that the next clock edge will not be generated until the slowest replica path resolves. The pausible clock circuit adds another input to the C-element that can be triggered asynchronously by signals entering the clock domain (see Figure 2b). A mutual exclusion (mutex) circuit ensures that the asynchronous input cannot toggle simultaneously with the rising clock edge. When the mutex input r2 is high, the mutex is opaque and signals at the data input r1 are delayed until the clock edge has passed. When r2 is low, the mutex is transparent and signals at r1 are passed through. Because the mutex can become metastable if its inputs toggle simultaneously, the clock can pause for an arbitrarily long duration (with vanishingly small probability) if r1 and r2 go high at the same time. However, there is no longer any danger of metastability at the asynchronous input, and typical circuit operation synchronizes input signals with roughly one cycle of latency. Prior work describes the design and operation of pausible clocking circuits in detail [6] [9] [10].

D. Related Work

Pausible clocking enables low-latency synchronization of signals with arbitrary relative phase, and as such represents an attractive option for boundary crossings in GALS design. Several prior proposals for GALS boundary crossings integrate pausible clocks with FIFO queues to synchronize data across an interface [6] [9] [10] [11] [12]. These designs typically require a fully asynchronous FIFO that services two-phase request and acknowledge signals to store data words in transit. There are many asynchronous FIFO designs in the literature, from Sutherland's classic micropipelines [13] to GasP and Mousetrap FIFOs [14] [15]. However, these asynchronous FIFOs have several disadvantages over their synchronous counterparts. Rather than keeping data in place and updating pointers to the data, these FIFOs propagate data from the back to the front of the queue. This data movement incurs a penalty in both energy and latency, a penalty that increases with the queue depth. Furthermore, many asynchronous FIFOs require careful delay matching to satisfy two-sided timing constraints. Some of these issues can be mitigated by the use of circular FIFOs, such as the one proposed in [16]. However, asynchronous FIFOs necessarily require careful asynchronous circuit design and verification that is poorly supported by standard VLSI toolflows.

III. THE PAUSIBLE BISYNCHRONOUS FIFO

We propose a pausible clocking scheme that achieves flow control via a standard two-ported synchronous memory element that is synchronously written in one clock domain and asynchronously read in another. We refer to this circuit as a pausible bisynchronous FIFO. Like the BF bisynchronous FIFO, data is stored in the FIFO while the read and write pointers are synchronized between clock domains. In the pausible FIFO, however, this synchronization is completed with a pausible clock network, not with slow BF synchronizers. This design combines the low-latency synchronization of pausible clocking with the favorable characteristics of standard two-ported FIFOs.

Figure 3 shows the pausible bisynchronous FIFO circuit. The pausible synchronizers that provide synchronization in both the transmit (TX) and receive (RX) clock domains are
shown in gray. These circuits are similar to those in [6], except that the feedback FF has been replaced with a latch to reduce overhead. Note that each pausible clock circuit requires its input pointer increment or acknowledge signal to use a two-phase request-acknowledge protocol. This ensures that the unsynchronized signal can only toggle once, and then must wait for an acknowledgement before toggling again, preventing additional switching at an unsafe clock phase. By design, this protocol prevents multiple toggles within a single clock period; however, this is problematic for the synchronization of pointer updates, because it implies that each pointer can only be updated once per cycle, restricting throughput to the slower of the two clock periods. Accordingly, the pausible FIFO does not synchronize the multi-bit pointers directly. Instead, several single-bit, two-phase pointer increment lines signal an update to the read or write pointers, and corresponding pointer acknowledge signals are sent back once the increments are synchronized. This allows multiple pointer increments to occur in succession within a single clock period, and allows full throughput even at mismatched clock periods. Our experimentation found that three increment-acknowledge pairs in either direction guaranteed full throughput for TX:RX clock period ratios as high as 2 or as low as 1/2. Additional increment and acknowledge lines could be added to ensure full throughput in the case of more extreme mismatches between TX and RX clock periods, but this is likely unnecessary, as sending data across an interface with such mismatched periods would quickly fill or empty the FIFO.

Fig. 3. The pausible bisynchronous FIFO. Only one of the increment-acknowledge paths is shown for clarity; in the complete system, each increment and acknowledge line requires its own mutex and synchronization circuitry. The labeled letters show the sequence necessary to synchronize data through the FIFO.

Each of the increment and acknowledge signals must be synchronized through their own mutex circuit in the pausible clock network. The g2 outputs of all mutexes are ANDed together, and this result is used as the synchronizing input to the C-element, ensuring that the clock edge is not generated until every mutex guarantees a safe phase. Additional interfaces (e.g., to multiple different synchronous islands) can also be accommodated in this way: the g2 outputs from every interface can be ANDed together to ensure that all interfaces synchronize correctly. This does have the side effect that a clock pause from any one interface will stall the entire synchronous domain, but we found clock pauses to be so rare that we do not believe this will pose a significant problem in practice (see Section VI).

The write pointer logic stores the value of the write pointer, as well as its best knowledge of the read pointer (possibly delayed from the actual read pointer position as updates are synchronized from the RX domain). It uses these values to calculate whether the FIFO is full, and to signal backpressure accordingly. The write pointer logic also sends write pointer increment signals by toggling one of the two-phase write pointer increment lines in the event of a write to the FIFO.
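The write-side bookkeeping described in this section can be illustrated with a small behavioral model. The following Python sketch is not the authors' RTL; it is a minimal illustration under stated assumptions. The names `TwoPhaseLine`, `WritePointerLogic`, `on_ack`, and `on_read_increment` are hypothetical, pointers are modeled as unbounded counters where hardware would use wrapping pointers, and the choice to hold off a write when no increment line is free is an assumption made here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TwoPhaseLine:
    """One increment/acknowledge pair; in two-phase signaling each toggle is an event."""
    inc: bool = False  # level driven by the sender
    ack: bool = False  # level returned by the receiver

    @property
    def in_flight(self) -> bool:
        # The line is busy until the acknowledge level matches the increment level.
        return self.inc != self.ack


@dataclass
class WritePointerLogic:
    """Behavioral model of the TX-side pointer logic (illustrative only)."""
    depth: int
    num_lines: int = 3       # the text reports three pairs per direction
    write_ptr: int = 0       # total words written (unbounded counter: an assumption)
    known_read_ptr: int = 0  # possibly stale view of the RX read pointer
    lines: List[TwoPhaseLine] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        self.lines = [TwoPhaseLine() for _ in range(self.num_lines)]

    def full(self) -> bool:
        # A stale read pointer can only overestimate occupancy, so this check is safe.
        return self.write_ptr - self.known_read_ptr >= self.depth

    def free_line(self) -> Optional[int]:
        # Index of the first increment line that is not awaiting an acknowledge.
        for i, line in enumerate(self.lines):
            if not line.in_flight:
                return i
        return None

    def ready(self) -> bool:
        # Assumption: a write is accepted only if a line is free to announce it.
        return not self.full() and self.free_line() is not None

    def write(self) -> int:
        """Accept one word: bump the write pointer and toggle a free increment line."""
        if not self.ready():
            raise RuntimeError("backpressure: FIFO full or no increment line free")
        i = self.free_line()
        self.write_ptr += 1
        self.lines[i].inc = not self.lines[i].inc
        return i

    def on_ack(self, i: int) -> None:
        # The synchronized acknowledge arrives: the line can be reused.
        self.lines[i].ack = self.lines[i].inc

    def on_read_increment(self) -> None:
        # A synchronized read pointer increment arrives from the RX domain.
        self.known_read_ptr += 1
```

In this model, three back-to-back writes occupy all three lines, so a fourth write must wait for an acknowledge even though a depth-4 FIFO still has space. Whether the real circuit stalls in that situation is not stated in the text.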
A state machine tracks which increment signals are in flight and which have been acknowledged and can be used again. The read pointer logic performs similar calculations in the RX clock domain to determine whether the FIFO is empty. With this logic and the pausible clocks synchronizing the pointer updates, the pausible bisynchronous FIFO can synchronize new input data in roughly one cycle on average.

The dual-port FIFO is clocked by the TX clock, and can be implemented as FFs, a latch array, or an SRAM macro as appropriate for its size. Such FIFOs are standard circuit elements in modern designs, and the numerous area and energy optimizations for these memory elements can be leveraged with no additional design effort. No custom design is needed to implement the FIFO, and standard scan and test structures can be easily implemented.

The pausible bisynchronous FIFO could be easily modified to interface between a clock domain with pausible clocking and one with a traditional fixed reference, such as a PLL. By replacing the pausible synchronizer on the fixed-reference side of the interface with brute-force synchronizing FFs to synchronize the increment and acknowledge pointers, low latency in one direction would still be maintained. This would allow a system to be partially converted to a GALS style while maintaining legacy IP with traditional clocking where necessary. These advantages make the pausible bisynchronous FIFO a good candidate to overcome the barriers to widespread GALS adoption.

IV. CIRCUIT OPERATION

The sequence labeled on Figure 3 shows the series of steps involved in writing a data word to the FIFO. In this example, the FIFO is initially empty, and all two-phase increment and acknowledge lines are available for use. On the rising edge of the TX clock, data is written to the FIFO address pointed to by the write pointer and the input valid signal is asserted (A). At this point, data is available to be read out of the FIFO. The write pointer logic then increments the write pointer internally, and toggles one of the two-phase write pointer increment lines (B). This write pointer increment line is toggled in the TX domain, and so it is asynchronous to the RX domain and must be synchronized through the RX pausible clock network. Depending upon the phase at which the write pointer increment toggle arrives at the RX domain, it may pass through immediately, be delayed until after the next RX clock edge, or (in rare cases) cause metastability in the mutex and be delayed for a longer time. However, the update will eventually be synchronized into the RX domain (C), at which point the read pointer logic can increment its internal tracking of the write pointer and assert the valid signal at the output of the system (D).

At this point, the data can be synchronously read from the FIFO in the RX domain. (Once this read occurs, the RX pointer logic will need to toggle one of the read pointer increment signals to inform the TX domain; this series of toggles is not shown.) However, from the perspective of the TX domain, the write pointer update is still in flight, as a corresponding acknowledge signal has not yet been received. Accordingly, after the synchronization, the RX clock edge toggles the corresponding acknowledge line (E). As this toggle occurs in the RX clock domain, it must be synchronized through the TX pausible clock network (F). This acknowledge signal then updates the TX logic state machine, freeing the write pointer increment line for future use (G).

V. TIMING ANALYSIS

Pausible clocking integrates the logic for asynchronous boundary crossings into the clock generation mechanism for the entire synchronous island. This integration imposes constraints on the operating conditions of each of these systems. Previous work in pausible clocks does not fully address these constraints, but this paper contributes a thorough accounting of the capabilities and limitations of pausible clock timing, which is critical to designing a realistic system. In this section, we derive expressions for the average latency of the pausible interface, as well as the constraints imposed upon the clock period, insertion delay, and wire delay across the synchronous island. We neglect the effects of variation in this analysis, treating circuit delays as fixed quantities. In reality, stochastic or worst-case corner analysis would be needed to ensure timing robustness, although post-silicon tuning could alleviate the effects of process variation. Table I describes each of the variables used in the analysis in this section.

TABLE I. TIMING VARIABLES

Variable    Description
$T$         The nominal clock period of the synchronous block.
$t_L$       The average latency of a data word through the interface.
$t_{ins}$   The insertion delay of the clock for the synchronous block.
$t_{r2}$    The delay from the output of the C-element to the mutex r2 input.
$t_{fb}$    The delay from the mutex r2 input through the mutex and around the feedback path to the mutex r1 input.
$t_{g2}$    The delay from the mutex r1 input through the mutex to the output of the C-element.
$t_{CL}$    The minimum time available to perform combinational work on the synchronized request signal before the next clock edge.
$t_m$       Time allotted to resolve mutex metastability, used to reduce the frequency of clock pauses.
$t_w$       The wire delay from the boundary of the synchronous island to the local clock generator.

Fig. 4. The key timing paths in the pausible synchronizer.

A. Timing Fundamentals

The important delays through the pausible clock network are shown in Figure 4. $t_{r2}$ is the delay from the output of the C-element to the mutex r2 input. $t_{fb}$ is the delay from the r2
input through the mutex and around the feedback loop to the r1 input. $t_{g2}$ is the delay from the mutex r1 input to the output of the C-element, including delay through the AND tree when multiple mutexes contribute timing information. The sum of these three delays cannot exceed the delay through the clock generator, or else the clock will frequently pause, increasing the clock period beyond the target for the synchronous island (see Figure 5). Since the clock generator delay is set to $T/2$ for a desired clock period $T$, these delays collectively enforce a minimum clock period for the synchronous block:

$$T/2 \geq t_{r2} + t_{fb} + t_{g2} \tag{1}$$

Fig. 5. The clock period timing constraint of pausible clocking. In the worst case, a request arrives just before r2 goes high. In (a), the sum of the delays through the pausible circuit is longer than half the clock period, and so the next clock edge is delayed. In (b), the delays are shorter and the clock edge occurs on time.

If the clock period exceeds this minimum, then the timing slack in the system translates into a margin $t_m$ that guards against the effect of clock pauses:

$$t_m = T/2 - (t_{r2} + t_{fb} + t_{g2}) \tag{2}$$

Mutex metastability can be seen as a temporary increase in $t_{fb}$ caused by simultaneous toggling of the mutex inputs. If (1) is just satisfied (that is, if $T/2 = t_{r2} + t_{fb} + t_{g2}$), then $t_m = 0$, and any mutex metastability that delays its output will cause the clock to pause. If $t_m > 0$, then some metastability can be tolerated before a clock pause occurs. In practice, we found mutex metastability to be an infrequent event, with long clock pauses rare (see Section VI). Accordingly, in this analysis we will tend to trade off $t_m$ in favor of other more critical timing parameters.

As detailed in Section IV, the low latency of the pausible bisynchronous FIFO depends on the ability of the RX pointer logic to immediately respond to a write pointer update by asserting data valid before the next RX clock edge arrives. The worst-case setup time for this logic is shown in Figure 6. We refer to this available time to complete combinational work within the same cycle as a received request as $t_{CL}$. In the worst case for this timing path, metastability in the mutex causes a clock pause before resolving in favor of r1. When g1 toggles, a clock edge will be generated as soon as this signal propagates around the feedback loop to the clock generator. Thus, the time $t_{CL}$ available for logic before this clock edge is only

$$t_{CL} = t_{fb} + t_{g2} \tag{3}$$

This parameter is constrained by the complexity of the pointer logic; if a long enough time is not apportioned for $t_{CL}$, then an extra register must be inserted before the logic to "pipeline" the computation, increasing the latency of the interface by one cycle. If $t_m > 0$, then increasing $t_{fb}$ by adding delay to the feedback path trades off excess $t_m$ to increase the time available for same-cycle combinational work.

Fig. 6. The setup time requirement for the synchronized request signal. In the worst case, metastability consumes any available timing margin, so that when g1 finally goes high, the next clock edge is guaranteed to occur as soon as the signal can propagate through $t_{fb}$ and $t_{g2}$. This is therefore the maximum allowable delay for logic that depends on the synchronized output.

In order to derive the average latency of the interface, the phase at which the request signal arrives must be considered. As shown in Figure 7, if a request signal arrives while the mutex is transparent, the request can be serviced within the same cycle. Assuming that the fully asynchronous request signal is equally likely to arrive at any phase, the average latency of such requests is $0.75T - t_{r2}$. If the request arrives while the mutex is opaque, then the request cannot be serviced until the next cycle. The average latency of such requests is $1.25T - t_{r2}$. If the duty cycle of the clock is 50%, then taking the mean of these two expressions gives the average latency $t_L$ of the interface as a whole:

$$t_L = T - t_{r2} \tag{4}$$

Increasing $t_{r2}$ decreases the average latency of the interface because it shifts the transparent phase of the mutex closer to the next clock edge. If $t_m > 0$, then increasing $t_{r2}$ by adding delay to the mutex r2 input trades off excess $t_m$ to decrease the average latency through the interface. Since $t_m$ can also be traded for additional $t_{CL}$, this means that there is a trade-off between reducing latency and increasing the time available for combinational work in the read pointer logic.

B. Insertion Delay

In real systems, the clock distribution network within the synchronous island will have some insertion delay $t_{ins}$ between the generation of the clock edges and their propagation through the clock network to the register endpoints. As first noted in [17], this insertion delay misaligns the mutex transparent phase, which could lead to circuit failure (see Figure 8). Small insertion delays can be compensated by intentionally increasing $t_{r2}$ to match $t_{ins}$, realigning the phases and protecting against metastability. However, the clock period
constraint from (1) limits the increase in $t_{r2}$. Setting $t_{r2} = t_{ins}$ yields the constraint on the insertion delay permitted for a given clock period:

$$t_{ins} \leq T/2 - t_{fb} - t_{g2} \tag{5}$$

Fig. 7. The latency of the interface. Possible arrival times for each case are shaded. Data is generally available to be read out of the FIFO at the positive clock edge after the request passes through the synchronizer. Data that arrives during the opaque phase of the mutex (a) will average more than one cycle of latency. Data that arrives during the transparent phase of the mutex (b) will average less than one cycle of latency.

Fig. 8. The effect of insertion delay on pausible clock timing. If $t_{r2}$ is small relative to the insertion delay, then the mutex is transparent when the clock edges arrive at the FFs. If requests arrive during the shaded period, they will pass through the mutex and may induce metastability in the FF following the pausible synchronizer.

Handling large insertion delays is a fundamental challenge of pausible clocking schemes. One technique to allow larger insertion delay places all FFs adjacent to the interface on a separate clock with a much smaller clock tree [6]. However, this approach could pose challenges with standard toolflows. [18] proposed adding lockup latches to the circuit as in Figure 9. We propose a similar scheme, except that our transparent-high latches are enabled by the r2 input, so they are transparent only when the mutex is not. The latches allow requests to propagate through the transparent mutex before the clock signal arrives at the leaf nodes, but then delay the request at the transparent mutex until after the clock edge has safely arrived at the FF clock input. The latches do not increase the latency of the interface because signals that would not race the clock would still have to wait for the next clock edge to be synchronized. Adding latches marginally increases the area and energy of the circuit, but allows an additional $T/2$ of insertion delay:

$$t_{ins} \leq T - t_{fb} - t_{g2} \tag{6}$$

However, $t_{CL}$ is decreased by the delay through the transparent latch, as the asynchronous request must propagate through the latch before it reaches the combinational network.

Fig. 9. The pausible synchronizer with an added lockup latch to guard against races caused by large insertion delays. The latch guards the clocked FF until after the clock can propagate from the root through the clock tree.

Expressions for average latency and $t_{CL}$ must be adjusted when insertion delay is considered:

$$t_L = T + t_{ins} - t_{r2} \tag{7}$$

$$t_{CL} = T/2 + t_{ins} - t_{r2} \tag{8}$$

Increasing insertion delay increases the average latency of the interface because the next clock edge is delayed relative to the transparent phase of the mutex. This delay also increases the time available for combinational work.

C. Wire Delay

In addition to the non-idealities of clock insertion delay, a nonzero wire delay is required to traverse the physical distance between the block interface and the clock generation circuit. Prior analysis of pausible clock timing neglects this delay, but collocating all of the blocks involved is impractical for a real GALS system. For instance, a tiled partitioning with a nearest-neighbor communication scheme would require four interfaces for each block, each communicating to a different neighbor (see Figure 11). It would not be physically possible to collocate the clock generator with these different interfaces.

Fig. 10. The delays imposed on the pausible clocking system by the wire delay incurred to traverse each synchronous island. The synchronizer circuits can either be placed near the boundary (a) or near to the local clock generator (b).

Different design decisions impose this wire delay on different paths in the pausible clock circuit. A traditional approach places each mutex at the boundary of the synchronous
island, with the local clock generator centrally located as in Figure 10a. This adds a wire delay $t_w$ to $t_{r2}$ and $t_{g2}$, increasing the minimum achievable cycle time (from (1)) and decreasing the maximum allowable insertion delay (from (5)). Alternately, all mutexes can be placed near the clock generator as in Figure 10b. This adds $t_w$ to the latency of the system, but does not impact the cycle time or insertion delay constraints. In either approach, $t_w$ could be reduced by using higher metal layers and dedicated routing channels to transmit these critical signals. Even with these considerations, $t_w$ will likely be a substantial fraction of the clock period for most systems, and will therefore have a noticeable impact on system performance.

VI. IMPACT OF MUTEX METASTABILITY

The above analysis assumed that mutex metastability is so rare that its impact on average latency and cycle time will be negligible. To confirm this assumption, we simulated the mutex circuit in Figure 11a in SPICE in a 28nm CMOS process. Inputs were toggled with random relative arrival times and the resulting delay through the circuit was measured. Figure 11b shows the results of the simulation. Assuming that the relative arrival times are uniformly distributed, integrating under the curve reveals that a 1ns clock period would be increased by an average of just 0.23ps from clock pauses, an impact of less than 0.1%. Furthermore, long clock pauses are exceedingly rare: pauses longer than 100ps make up less than one event in $10^6$. Adding a small $t_m$ eliminates most clock pauses and reduces the average impact on cycle time even further.

[Figure 11: (a) mutex circuit; (b) plot of Mutex Delay (ps) versus Difference In Arrival Times (ps), with the nominal mutex delay marked.]
Fig. 11. A mutual exclusion (mutex) circuit. The circuit was simulated in SPICE to find the magnitude of mutex metastability as a function of the arrival time difference between signals r1 and r2.

VII. EXPERIMENTAL SETUP

To evaluate the pausible bisynchronous FIFO, we implemented the circuit as an interface module in Verilog RTL. Simple transmit and receive blocks were constructed to send and receive data; the interface manages communication between these two blocks (see Figure 12). The communication between the interface and its neighbors is fully synchronous, and uses standard ready-valid interfaces; all asynchronous communication takes place within the interface itself, similar to the “GALS wrapper” approach described in [18]. This allows other interfaces to be swapped in for the pausible bisynchronous FIFO for straightforward comparison. The mutex circuit and the local clock generators are described with behavioral Verilog, but the rest of the circuit is fully synthesizable.

[Figure 12: block diagram showing the TX Block and TX Clock Generator (TX Clock Domain), the Interface, and the RX Block and RX Clock Generator (RX Clock Domain).]
Fig. 12. The setup used to simulate the interfaces. The shaded area was synthesized to obtain area and energy comparisons.

To ensure that our circuit integrates well with standardized toolflows, we synthesized the circuit using Synopsys Design Compiler in the same 28nm process as described in Section VI. The local clock generators were implemented as behavioral cells because the design of an adaptive clocking macro is outside the scope of this work. The mutex circuits were treated as black boxes with custom library definitions based roughly on SPICE simulations. They were defined as state elements for the synthesis tool so as to break the combinational loops inherent to pausible clock design. Unlike synthesis of a standard digital block, the pausible bisynchronous FIFO circuit has several timing paths that cross between clock domains. These paths should not be unconstrained; as noted previously, the timing of many of these paths is critical for circuit operation. Accordingly, we used the timing analysis from Section V to define custom timing constraints on these paths so the synthesis tool would appropriately optimize timing. We constrained the $t_{fb}$ and $t_{g2}$ paths to 200ps each, with $t_{r2}$ negligible. According to (1), these constraints should permit operation with a clock period of 800ps or larger. An insertion delay of 250ps was explicitly added to the clock generation circuits, well below the maximum insertion delay permitted by (6). We did not insert additional wire delay for physical distance as described in Section V-C.

The post-synthesis netlist was simulated using VCS with delays annotated from synthesis. The probability of clock pausing was modeled with a simple exponential distribution estimated from the SPICE simulations described in Section VI, although the average impact of these clock pauses on latency was negligible. Power was measured by PrimeTime PX, with activity factors back-annotated from the gate-level simulation.

VIII. PERFORMANCE RESULTS

We compared the performance of three different interfaces:
• Synchronous, a fully synchronous FIFO queue that functions only within a single clock domain. This is a standard element in all digital designs. Since the read and write pointers do not have to be synchronized across clock domains, the control logic is simpler than either of the bisynchronous interfaces.
• BFSync, a brute force bisynchronous FIFO with three series FFs used to synchronize the pointers, as shown in Figure 1.
• Pausible, the pausible bisynchronous FIFO described in Section III and shown in Figure 3.

Each interface was used in the experimental setup shown in Figure 12. 128-bit data words were sent across the interface. Each interface included an 8-element FIFO built from FFs.

Figure 13 shows the average latency through the interface as the ratio of clock periods is varied. (The RX clock period is
held fixed at 1.25ns, while the TX clock period is swept from 0.625ns to 5ns.) The pausible bisynchronous FIFO achieves an average latency of just 1.34 cycles throughout this range, much lower than the 4 cycles of BFSync and comparable to the 1-cycle synchronous latency. In accordance with (7), much of the increase in latency beyond one cycle is caused by the 250ps insertion delay. The result is slightly higher than predicted by (7) because of the non-zero delay between the TX and RX interface estimated by the synthesis tool. If we include a wire delay $t_w$ of 150ps in the simulation as described in Section V-C and assume the floorplan of Figure 10b, this delay is directly added to the latency of the interface. For a 1ns clock period, this increases the average latency from 1.34 cycles to 1.49 cycles.

[Figure 13: plot of Latency (RX cycles) versus TX Clock Period / RX Clock Period (0.5 to 4.0) for the Pausible and BFSync interfaces.]
Fig. 13. Simulation results showing the latency of each interface as the ratio of TX clock period to RX clock period is varied. The latency of the pausible bisynchronous FIFO averages 1.34 cycles, and is much less than the latency through the brute-force synchronizer.

Table II shows the results of synthesis of each design. The area of each design is dominated by that of the FIFO, so the total area of each design is similar. The energy cost of the pausible interface per bit of data sent is somewhat less than that of the BF synchronizer because the Gray-coding logic is removed. The synthesized design is able to operate with a clock period as low as 800ps (as predicted by (1)) before clock pauses start to become much more frequent. The interface still operates correctly at a clock period of 600ps, although shorter periods cause the setup time constraint in (8) to be violated, leading to incorrect functionality.

TABLE II. SYNTHESIS RESULTS

Interface     Average Latency (cycles)   Area (µm²)   Power (mW)   Energy (fJ/bit)
Synchronous   1                          4968         4.08         39.8
BFSync        4                          5005         6.03         58.9
Pausible      1.34                       4808         5.41         52.8

IX. CONCLUSION

We have designed a low-latency asynchronous interface that works well with standard design tools. The pausible bisynchronous FIFO achieves an average of 1.34 cycles of latency, while incurring minimal energy and area overhead over a synchronous interface. Careful analysis of the timing constraints imposed by the system allows full integration with standard toolflows. We believe that this circuit represents a key enabling technology for fine-grained GALS systems, which can mitigate many of the challenges of modern SoC design.

REFERENCES

[1] D. M. Chapiro, “Globally-asynchronous locally-synchronous systems,” Ph.D. Thesis, vol. 1, p. 50, 1984.
[2] E. Fluhr et al., “POWER8: A 12-core server-class processor in 22nm SOI with 7.6Tb/s off-chip bandwidth,” in Proc. IEEE International Solid-State Circuits Conference, 2014, pp. 96–97.
[3] S. Rusu et al., “Ivytown: A 22nm 15-core enterprise Xeon processor family,” in Proc. IEEE International Solid-State Circuits Conference, 2014, pp. 102–103.
[4] A. Grenat et al., “Adaptive clocking system for improved power efficiency in a 28nm x86-64 microprocessor,” in Proc. IEEE International Solid-State Circuits Conference, 2014, pp. 106–107.
[5] R. Jevtic et al., “Per-Core DVFS With Switched-Capacitor Converters for Energy Efficiency in Manycore Processors,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 1–8, 2014.
[6] R. Mullins and S. Moore, “Demystifying Data-Driven and Pausible Clocking Schemes,” in Proc. IEEE Symposium on Asynchronous Circuits and Systems, 2007, pp. 175–185.
[7] A. Chakraborty and M. Greenstreet, “Efficient Self-Timed Interfaces for Crossing Clock Domains,” in Proc. IEEE Symposium on Asynchronous Circuits and Systems, 2003, pp. 78–88.
[8] W. J. Dally and S. G. Tell, “The Even/Odd Synchronizer: A Fast, All-Digital, Periodic Synchronizer,” in Proc. IEEE Symposium on Asynchronous Circuits and Systems, 2010, pp. 75–84.
[9] K. Yun and R. Donohue, “Pausible clocking: a first step toward heterogeneous systems,” in Proc. IEEE International Conference on Computer Design, 1996, pp. 118–123.
[10] P. Teehan et al., “A Survey and Taxonomy of GALS Design Styles,” in Proc. IEEE Design & Test of Computers, vol. 24, no. 5, 2007, pp. 418–428.
[11] S. Moore et al., “Point to point GALS interconnect,” in Proc. IEEE Symposium on Asynchronous Circuits and Systems, 2002, pp. 69–75.
[12] E. Tuncer et al., “Enabling adaptability through elastic clocks,” in Proc. ACM/IEEE Design Automation Conference, 2009, pp. 8–10.
[13] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, no. 6, pp. 720–738, 1989.
[14] I. Sutherland and S. Fairbanks, “GasP: a minimal FIFO control,” in Proc. IEEE Symposium on Asynchronous Circuits and Systems, 2001, pp. 46–53.
[15] M. Singh and S. Nowick, “MOUSETRAP: High-Speed Transition-Signaling Asynchronous Pipelines,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 6, pp. 684–698, Jun. 2007.
[16] A. Ghiribaldi et al., “A Transition-Signaling Bundled Data NoC Switch Architecture for Cost-effective GALS Multicore Systems,” Proc. IEEE Design, Automation & Test in Europe, pp. 332–337, 2013.
[17] A. E. Sjogren and C. J. Myers, “Interfacing synchronous and asynchronous modules within a high-speed pipeline,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, pp. 573–583, 2000.
[18] X. Fan et al., “Analysis and optimization of pausible clocking based GALS design,” IEEE International Conference Computer Design, 2009.
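
The figures quoted in Sections VII and VIII can be cross-checked with a few lines of arithmetic. The sketch below is not part of the original work and all function names are hypothetical. It evaluates the insertion-delay bound (6) for the 200ps path constraints, adds the 150ps wire delay to the measured latency at a 1ns clock period, and recomputes the energy per bit in Table II from the reported power. The energy calculation assumes that one 128-bit word crosses the interface every 1.25ns RX clock cycle; this throughput is an inference that happens to reproduce the table values, not something the text states.

```python
# Cross-check of numbers quoted in the text. Function names are illustrative
# and do not come from the paper.


def max_insertion_delay_ps(period_ps, t_fb_ps, t_g2_ps):
    """Upper bound on clock insertion delay from (6): t_ins <= T - t_fb - t_g2."""
    return period_ps - t_fb_ps - t_g2_ps


def latency_with_wire_delay(latency_cycles, wire_delay_ps, period_ps):
    """Latency in cycles when the wire delay adds directly to the interface
    latency, as the text describes for the Figure 10b floorplan."""
    return latency_cycles + wire_delay_ps / period_ps


def energy_per_bit_fj(power_mw, period_ns, bits_per_word):
    """Energy per bit in fJ, ASSUMING one word is transferred per clock cycle."""
    energy_per_cycle_pj = power_mw * period_ns  # mW * ns = pJ
    return energy_per_cycle_pj * 1000.0 / bits_per_word  # pJ -> fJ


if __name__ == "__main__":
    # Synthesis constraints from Section VII: t_fb = t_g2 = 200 ps, T = 800 ps.
    bound = max_insertion_delay_ps(800, 200, 200)
    print("max insertion delay at T = 800 ps:", bound, "ps")
    print("250 ps insertion delay within bound:", 250 <= bound)

    # Section VIII: 150 ps of wire delay at a 1 ns clock period.
    print("latency with wire delay:",
          round(latency_with_wire_delay(1.34, 150, 1000), 2), "cycles")

    # Table II power values, 1.25 ns RX clock period, 128-bit words.
    for name, power_mw in [("Synchronous", 4.08), ("BFSync", 6.03), ("Pausible", 5.41)]:
        print(name, round(energy_per_bit_fj(power_mw, 1.25, 128), 1), "fJ/bit")
```

Under the one-word-per-cycle assumption the recomputed energies round to the values in the Energy column of Table II, which suggests the column was derived from the Power column in this way.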
