A Pausible Bisynchronous FIFO For GALS Systems: Ben Keller, Matthew Fojtik, and Brucek Khailany
local conditions change [5]. Since they do not attempt to lock to particular frequency targets, adaptive clock generators avoid much of the complexity of PLL circuits. These circuits impose minimal overhead on the overall system, and are ideal candidates for local clock generation in GALS designs [6].

B. Synchronization Latency

Signals crossing the boundary between fully asynchronous clock domains, such as those crossing between synchronous islands in a GALS design, must be synchronized to minimize the risk of metastability and operational failure. This synchronization is typically achieved by sending such signals through several flip-flops in series in the receiver clock domain. The flip-flops delay the signal for one or more cycles, providing extra time for any metastability to resolve. While these brute force (BF) synchronizers do not eliminate the possibility of metastability, they can reduce the probability until it is negligible. BF synchronizers can be used with a FIFO memory to construct a BF bisynchronous FIFO as shown in Figure 1. This FIFO safely transmits data between two clock domains, synchronizing the read and write pointers with BF synchronizers. The pointers must be Gray coded so that any synchronization error does not disrupt the pointer location by more than one increment; the logic to encode and decode the pointers is an overhead of this scheme.

Several circuits have been designed with the explicit purpose of reducing the synchronization latency penalty. Chakraborty and Greenstreet demonstrate circuits that can synchronize data with low latency if some information about the relative phase of the two clocks is known at design time [7]. However, their scheme is not practical in the case of fully asynchronous clock domains. Dally and Tell devised an even/odd synchronizer that achieves low-latency communication across an asynchronous interface, but their circuit requires a complicated phase predictor and functions only with stable clock frequencies [8]. These circuits are useful in certain applications, but do not provide a satisfactory solution to the barriers to GALS adoption.

C. Pausible Clocking

A different method to reduce synchronization latency is pausible clocking. As described in [9], pausible clocks take advantage of the adaptive clock circuits already present in many GALS implementations. A simple adaptive clock circuit consists of one or more inverting delay lines fed into the input of a Muller C-element (see Figure 2a). These delay lines replicate the various critical paths found in the synchronous logic island; the C-element ensures that the next clock edge will not be generated until the slowest replica path resolves. The pausible clock circuit adds another input to the C-element that can be triggered asynchronously by signals entering the clock domain (see Figure 2b). A mutual exclusion (mutex) circuit ensures that the asynchronous input cannot toggle simultaneously with the rising clock edge. When the mutex input r2 is high, the mutex is opaque and signals at the data input r1 are delayed until the clock edge has passed. When r2 is low, the mutex is transparent and signals at r1 are passed through. Because the mutex can become metastable if its inputs toggle simultaneously, the clock can pause for an arbitrarily long duration (with vanishingly small probability) if r1 and r2 go high at the same time. However, there is no longer any danger of metastability at the asynchronous input, and typical circuit operation synchronizes input signals with roughly one cycle of latency. Prior work describes the design and operation of pausible clocking circuits in detail [6] [9] [10].

Fig. 2. Adaptive clock generators (a) can be extended with a mutex to synchronize requests, forming a pausible clock circuit (b).

D. Related Work

Pausible clocking enables low-latency synchronization of signals with arbitrary relative phase, and as such represents an attractive option for boundary crossings in GALS design. Several prior proposals for GALS boundary crossings integrate pausible clocks with FIFO queues to synchronize data across an interface [6] [9] [10] [11] [12]. These designs typically require a fully asynchronous FIFO that services two-phase request and acknowledge signals to store data words in transit. There are many asynchronous FIFO designs in the literature, from Sutherland’s classic micropipelines [13] to GasP and Mousetrap FIFOs [14] [15]. However, these asynchronous FIFOs have several disadvantages over their synchronous counterparts. Rather than keeping data in place and updating pointers to the data, these FIFOs propagate data from the back to the front of the queue. This data movement incurs a penalty in both energy and latency, a penalty that increases with the queue depth. Furthermore, many asynchronous FIFOs require careful delay matching to satisfy two-sided timing constraints. Some of these issues can be mitigated by the use of circular FIFOs, such as the one proposed in [16]. However, asynchronous FIFOs necessarily require careful asynchronous circuit design and verification that is poorly supported by standard VLSI toolflows.

III. THE PAUSIBLE BISYNCHRONOUS FIFO

We propose a pausible clocking scheme that achieves flow control via a standard two-ported synchronous memory element that is synchronously written in one clock domain and asynchronously read in another. We refer to this circuit as a pausible bisynchronous FIFO. Like the BF bisynchronous FIFO, data is stored in the FIFO while the read and write pointers are synchronized between clock domains. In the pausible FIFO, however, this synchronization is completed with a pausible clock network, not with slow BF synchronizers. This design combines the low-latency synchronization of pausible clocking with the favorable characteristics of standard two-ported FIFOs.

Figure 3 shows the pausible bisynchronous FIFO circuit. The pausible synchronizers that provide synchronization in both the transmit (TX) and receive (RX) clock domains are
shown in gray. These circuits are similar to those in [6], except that the feedback FF has been replaced with a latch to reduce overhead. Note that each pausible clock circuit requires its input pointer increment or acknowledge signal to use a two-phase request-acknowledge protocol. This ensures that the unsynchronized signal can only toggle once, and then must wait for an acknowledgement before toggling again, preventing additional switching at an unsafe clock phase. By design, this protocol prevents multiple toggles within a single clock period; however, this is problematic for the synchronization of pointer updates, because it implies that each pointer can only be updated once per cycle, restricting throughput to the slower of the two clock periods. Accordingly, the pausible FIFO does not synchronize the multi-bit pointers directly. Instead, several single-bit, two-phase pointer increment lines signal an update to the read or write pointers, and corresponding pointer acknowledge signals are sent back once the increments are synchronized. This allows multiple pointer increments to occur in succession within a single clock period, and allows full throughput even at mismatched clock periods. Our experimentation found that three increment-acknowledge pairs in either direction guaranteed full throughput for TX:RX clock period ratios as high as 2 or as low as 1/2. Additional increment and acknowledge lines could be added to ensure full throughput in the case of more extreme mismatches between TX and RX clock periods, but this is likely unnecessary, as sending data across an interface with such mismatched periods would quickly fill or empty the FIFO.

Fig. 3. The pausible bisynchronous FIFO. Only one of the increment-acknowledge paths is shown for clarity; in the complete system, each increment and acknowledge line requires its own mutex and synchronization circuitry. The labeled letters show the sequence necessary to synchronize data through the FIFO.

Each of the increment and acknowledge signals must be synchronized through their own mutex circuit in the pausible clock network. The g2 outputs of all mutexes are ANDed together, and this result is used as the synchronizing input to the C-element, ensuring that the clock edge is not generated until every mutex guarantees a safe phase. Additional interfaces (e.g., to multiple different synchronous islands) can also be accommodated in this way: the g2 outputs from every interface can be ANDed together to ensure that all interfaces synchronize correctly. This does have the side effect that a clock pause from any one interface will stall the entire synchronous domain, but we found clock pauses to be so rare that we do not believe this will pose a significant problem in practice (see Section VI).

The write pointer logic stores the value of the write pointer, as well as its best knowledge of the read pointer (possibly delayed from the actual read pointer position as updates are synchronized from the RX domain). It uses these values to calculate whether the FIFO is full, and to signal backpressure accordingly. The write pointer logic also sends write pointer increment signals by toggling one of the two-phase write pointer increment lines in the event of a write to the FIFO.
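
As a rough illustration of the write pointer logic described above, the following Python sketch models the full calculation and the toggling of two-phase increment lines. It is a behavioral model under stated assumptions, not the authors' circuit: the class and method names are hypothetical, the pointers are modeled as unbounded counters, a line is treated as free when its increment and acknowledge levels match, and a write is assumed to stall when no increment line is free as well as when the FIFO is full.

```python
class WritePointerLogic:
    """Behavioral sketch of the TX-side pointer logic (hypothetical names)."""

    def __init__(self, depth, num_lines=3):
        self.depth = depth
        self.write_ptr = 0       # words written so far (unbounded counter)
        self.read_ptr_view = 0   # TX-side view of words read; may lag the RX side
        # Two-phase signaling: every toggle is an event, so only levels are stored.
        self.inc_lines = [0] * num_lines  # levels driven on the increment lines
        self.ack_lines = [0] * num_lines  # levels last seen on the acknowledge lines

    def full(self):
        """The FIFO looks full when the known occupancy reaches its depth."""
        return self.write_ptr - self.read_ptr_view >= self.depth

    def free_line(self):
        """Index of an increment line with no toggle in flight, or None."""
        for i, (inc, ack) in enumerate(zip(self.inc_lines, self.ack_lines)):
            if inc == ack:
                return i
        return None

    def ready(self):
        # Assumption: backpressure is also applied when no line is free.
        return not self.full() and self.free_line() is not None

    def write(self):
        """Model one FIFO write; returns (address, line) or None if stalled."""
        if not self.ready():
            return None
        line = self.free_line()
        addr = self.write_ptr % self.depth
        self.write_ptr += 1
        self.inc_lines[line] ^= 1  # the toggle announces the increment to the RX side
        return addr, line

    def on_ack(self, line):
        """A synchronized acknowledge toggle arrived for the given line."""
        self.ack_lines[line] ^= 1

    def on_read_increment(self):
        """A synchronized read pointer increment arrived from the RX side."""
        self.read_ptr_view += 1


tx = WritePointerLogic(depth=4)
print([tx.write() for _ in range(4)])  # the fourth write stalls: all lines in flight
tx.on_ack(0)
print(tx.write())                      # line 0 can be reused after its acknowledge
```

With the three increment lines reported in the text, the model stalls on a fourth back-to-back write until an acknowledge returns, even though a four-entry FIFO still has room. Whether the real design stalls in that case is an assumption of this sketch.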
A state machine tracks which increment signals are in flight and which have been acknowledged and can be used again. The read pointer logic performs similar calculations in the RX clock domain to determine whether the FIFO is empty. With this logic and the pausible clocks synchronizing the pointer updates, the pausible bisynchronous FIFO can synchronize new input data in roughly one cycle on average.

The dual-port FIFO is clocked by the TX clock, and can be implemented as FFs, a latch array, or an SRAM macro as appropriate for its size. Such FIFOs are standard circuit elements in modern designs, and the numerous area and energy optimizations for these memory elements can be leveraged with no additional design effort. No custom design is needed to implement the FIFO, and standard scan and test structures can be easily implemented.

The pausible bisynchronous FIFO could be easily modified to interface between a clock domain with pausible clocking and one with a traditional fixed reference, such as a PLL. By replacing the pausible synchronizer on the fixed-reference side of the interface with brute-force synchronizing FFs to synchronize the pointer increment and acknowledge signals, low latency in one direction would still be maintained. This would allow a system to be partially converted to a GALS style while maintaining legacy IP with traditional clocking where necessary. These advantages make the pausible bisynchronous FIFO a good candidate to overcome the barriers to widespread GALS adoption.

IV. CIRCUIT OPERATION

The sequence labeled on Figure 3 shows the series of steps involved in writing a data word to the FIFO. In this example, the FIFO is initially empty, and all two-phase increment and acknowledge lines are available for use. On the rising edge of the TX clock, data is written to the FIFO address pointed to by the write pointer and the input valid signal is asserted (A). At this point, data is available to be read out of the FIFO. The write pointer logic then increments the write pointer internally, and toggles one of the two-phase write pointer increment lines (B). This write pointer increment line is toggled in the TX domain, and so it is asynchronous to the RX domain and must be synchronized through the RX pausible clock network. Depending upon the phase at which the write pointer increment toggle arrives at the RX domain, it may pass through immediately, be delayed until after the next RX clock edge, or (in rare cases) cause metastability in the mutex and be delayed for a longer time. However, the update will eventually be synchronized into the RX domain (C), at which point the read pointer logic can increment its internal tracking of the write pointer and assert the valid signal at the output of the system (D).

At this point, the data can be synchronously read from the FIFO in the RX domain. (Once this read occurs, the RX pointer logic will need to toggle one of the read pointer increment signals to inform the TX domain; this series of toggles is not shown.) However, from the perspective of the TX domain, the write pointer update is still in flight, as a corresponding acknowledge signal has not yet been received. Accordingly, after the synchronization, the RX clock edge toggles the corresponding acknowledge line (E). As this toggle occurs in the RX clock domain, it must be synchronized through the TX pausible clock network (F). This acknowledge signal then updates the TX logic state machine, freeing the write pointer increment line for future use (G).

V. TIMING ANALYSIS

Pausible clocking integrates the logic for asynchronous boundary crossings into the clock generation mechanism for the entire synchronous island. This integration imposes constraints on the operating conditions of each of these systems. Previous work on pausible clocks does not fully address these constraints, but this paper contributes a thorough accounting of the capabilities and limitations of pausible clock timing, which is critical to designing a realistic system. In this section, we derive expressions for the average latency of the pausible interface, as well as the constraints imposed upon the clock period, insertion delay, and wire delay across the synchronous island. We neglect the effects of variation in this analysis, treating circuit delays as fixed quantities. In reality, stochastic or worst-case corner analysis would be needed to ensure timing robustness, although post-silicon tuning could alleviate the effects of process variation. Table I describes each of the variables used in the analysis in this section.

TABLE I. TIMING VARIABLES

Variable    Description
$T$         The nominal clock period of the synchronous block.
$t_L$       The average latency of a data word through the interface.
$t_{ins}$   The insertion delay of the clock for the synchronous block.
$t_{r2}$    The delay from the output of the C-element to the mutex r2 input.
$t_{fb}$    The delay from the mutex r2 input through the mutex and around the feedback path to the mutex r1 input.
$t_{g2}$    The delay from the mutex r1 input through the mutex to the output of the C-element.
$t_{CL}$    The minimum time available to perform combinational work on the synchronized request signal before the next clock edge.
$t_m$       Time allotted to resolve mutex metastability, used to reduce the frequency of clock pauses.
$t_w$       The wire delay from the boundary of the synchronous island to the local clock generator.

Fig. 4. The key timing paths in the pausible synchronizer.

A. Timing Fundamentals

The important delays through the pausible clock network are shown in Figure 4. $t_{r2}$ is the delay from the output of the C-element to the mutex r2 input. $t_{fb}$ is the delay from the r2
input through the mutex and around the feedback loop to the r1 input. $t_{g2}$ is the delay from the mutex r1 input to the output of the C-element, including delay through the AND tree when multiple mutexes contribute timing information. The sum of these three delays cannot exceed the delay through the clock generator, or else the clock will frequently pause, increasing the clock period beyond the target for the synchronous island (see Figure 5). Since the clock generator delay is set to $T/2$ for a desired clock period $T$, these delays collectively enforce a minimum clock period for the synchronous block:

$$T/2 \geq t_{r2} + t_{fb} + t_{g2} \qquad (1)$$

If the clock period exceeds this minimum, then the timing slack in the system translates into a margin $t_m$ that guards against the effect of clock pauses:

$$t_m = T/2 - (t_{r2} + t_{fb} + t_{g2}) \qquad (2)$$

Fig. 5. The clock period timing constraint of pausible clocking. In the worst case, a request arrives just before r2 goes high. In (a), the sum of the delays through the pausible circuit is longer than half the clock period, and so the next clock edge is delayed. In (b), the delays are shorter and the clock edge occurs on time.

Mutex metastability can be seen as a temporary increase in $t_{fb}$ caused by simultaneous toggling of the mutex inputs. If (1) is just satisfied (that is, if $T/2 = t_{r2} + t_{fb} + t_{g2}$), then $t_m = 0$, and any mutex metastability that delays its output will cause the clock to pause. If $t_m > 0$, then some metastability can be tolerated before a clock pause occurs. In practice, we found mutex metastability to be an infrequent event, with long clock pauses rare (see Section VI). Accordingly, in this analysis we will tend to trade off $t_m$ in favor of other more critical timing parameters.

As detailed in Section IV, the low latency of the pausible bisynchronous FIFO depends on the ability of the RX pointer logic to immediately respond to a write pointer update by asserting data valid before the next RX clock edge arrives. The worst-case setup time for this logic is shown in Figure 6. We refer to this available time to complete combinational work within the same cycle as a received request as $t_{CL}$. In the worst case for this timing path, metastability in the mutex causes a clock pause before resolving in favor of r1. When g1 toggles, a clock edge will be generated as soon as this signal propagates around the feedback loop to the clock generator. Thus, the time $t_{CL}$ available for logic before this clock edge is only

$$t_{CL} = t_{fb} + t_{g2} \qquad (3)$$

This parameter is constrained by the complexity of the pointer logic; if a long enough time is not apportioned for $t_{CL}$, then an extra register must be inserted before the logic to “pipeline” the computation, increasing the latency of the interface by one cycle. If $t_m > 0$, then increasing $t_{fb}$ by adding delay to the feedback path trades off excess $t_m$ to increase the time available for same-cycle combinational work.

Fig. 6. The setup time requirement for the synchronized request signal. In the worst case, metastability consumes any available timing margin, so that when g1 finally goes high, the next clock edge is guaranteed to occur as soon as the signal can propagate through $t_{fb}$ and $t_{g2}$. This is therefore the maximum allowable delay for logic that depends on the synchronized output.

In order to derive the average latency of the interface, the phase at which the request signal arrives must be considered. As shown in Figure 7, if a request signal arrives while the mutex is transparent, the request can be serviced within the same cycle. Assuming that the fully asynchronous request signal is equally likely to arrive at any phase, the average latency of such requests is $0.75T - t_{r2}$. If the request arrives while the mutex is opaque, then the request cannot be serviced until the next cycle. The average latency of such requests is $1.25T - t_{r2}$. If the duty cycle of the clock is 50%, then taking the mean of these two expressions gives the average latency $t_L$ of the interface as a whole:

$$t_L = T - t_{r2} \qquad (4)$$

Increasing $t_{r2}$ decreases the average latency of the interface because it shifts the transparent phase of the mutex closer to the next clock edge. If $t_m > 0$, then increasing $t_{r2}$ by adding delay to the mutex r2 input trades off excess $t_m$ to decrease the average latency through the interface. Since $t_m$ can also be traded for additional $t_{CL}$, this means that there is a trade-off between reducing latency and increasing the time available for combinational work in the read pointer logic.

B. Insertion Delay

In real systems, the clock distribution network within the synchronous island will have some insertion delay $t_{ins}$ between the generation of the clock edges and their propagation through the clock network to the register endpoints. As first noted in [17], this insertion delay misaligns the mutex transparent phase, which could lead to circuit failure (see Figure 8). Small insertion delays can be compensated by intentionally increasing $t_{r2}$ to match $t_{ins}$, realigning the phases and protecting against metastability. However, the clock period
constraint from (1) limits the increase in $t_{r2}$. Setting $t_{r2} = t_{ins}$ yields the constraint on the insertion delay permitted for a given clock period:

$$t_{ins} \leq T/2 - t_{fb} - t_{g2} \qquad (5)$$

Fig. 7. The latency of the interface. Possible arrival times for each case are shaded. Data is generally available to be read out of the FIFO at the positive clock edge after the request passes through the synchronizer. Data that arrives during the opaque phase of the mutex (a) will average more than one cycle of latency. Data that arrives during the transparent phase of the mutex (b) will average less than one cycle of latency.

Fig. 8. The effect of insertion delay on pausible clock timing. If $t_{r2}$ is small relative to the insertion delay, then the mutex is transparent when the clock edges arrive at the FFs. If requests arrive during the shaded period, they will pass through the mutex and may induce metastability in the FF following the pausible synchronizer.

Handling large insertion delays is a fundamental challenge of pausible clocking schemes. One technique to allow larger insertion delay places all FFs adjacent to the interface on a separate clock with a much smaller clock tree [6]. However, this approach could pose challenges with standard toolflows. [18] proposed adding lockup latches to the circuit as in Figure 9. We propose a similar scheme, except that our transparent-high latches are enabled by the r2 input, so they are transparent only when the mutex is not. The latches allow requests to propagate through the transparent mutex before the clock signal arrives at the leaf nodes, but then delay the request at the opaque latch until after the clock edge has safely arrived at the FF clock input. The latches do not increase the latency of the interface because signals that would not race the clock would still have to wait for the next clock edge to be synchronized. Adding latches marginally increases the area and energy of the circuit, but allows an additional $T/2$ of insertion delay:

$$t_{ins} \leq T - t_{fb} - t_{g2} \qquad (6)$$

However, $t_{CL}$ is decreased by the delay through the transparent latch, as the asynchronous request must propagate through the latch before it reaches the combinational network.

Fig. 9. The pausible synchronizer with an added lockup latch to guard against races caused by large insertion delays. The latch guards the clocked FF until after the clock can propagate from the root through the clock tree.

Expressions for average latency and $t_{CL}$ must be adjusted when insertion delay is considered:

$$t_L = T + t_{ins} - t_{r2} \qquad (7)$$

$$t_{CL} = T/2 + t_{ins} - t_{r2} \qquad (8)$$

Increasing insertion delay increases the average latency of the interface because the next clock edge is delayed relative to the transparent phase of the mutex. This delay also increases the time available for combinational work.

C. Wire Delay

In addition to the non-idealities of clock insertion delay, a nonzero wire delay is required to traverse the physical distance between the block interface and the clock generation circuit. Prior analysis of pausible clock timing neglects this delay, but collocating all of the blocks involved is impractical for a real GALS system. For instance, a tiled partitioning with a nearest-neighbor communication scheme would require four interfaces for each block, each communicating to a different neighbor (see Figure 11). It would not be physically possible to collocate the clock generator with these different interfaces.

Fig. 10. The delays imposed on the pausible clocking system by the wire delay incurred to traverse each synchronous island. The synchronizer circuits can either be placed near the boundary (a) or near the local clock generator (b).

Different design decisions impose this wire delay on different paths in the pausible clock circuit. A traditional approach places each mutex at the boundary of the synchronous