Design and FPGA-implementation of Asynchronous Circuits Using Two-Phase Handshaking

This paper discusses the design and FPGA implementation of asynchronous circuits using a two-phase handshaking protocol. It provides a tutorial and scientific contributions, including design guidelines for implementing handshake components and examples of circuits such as Fibonacci number generation and GCD computation. The work aims to facilitate hands-on experience for students in asynchronous design, with all code available as open source.
2019 25th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC)

Design and FPGA-implementation of Asynchronous Circuits Using Two-phase Handshaking
Adrian Mardari, Zuzana Jelčicová and Jens Sparsø
Department of Applied Mathematics and Computer Science
Technical University of Denmark
Email: [email protected], [email protected], [email protected]

978-1-5386-5883-3/19/$31.00 ©2019 IEEE
DOI 10.1109/ASYNC.2019.00010

Abstract: This paper addresses the design and FPGA-prototyping of asynchronous circuits using static data-flow handshake components implemented using the two-phase bundled-data protocol. The contributions are partly tutorial and partly scientific. The paper introduces the design process, including initialization and design of coupled rings with any number of tokens. Following this, the paper presents gate-level implementations of the full set of handshake components as well as some peephole optimizations that merge the implementation of several components. The components are implemented using the click-template. The handshake register implementation is extended with circuitry that decouples the phase of the handshake signals on the input and output ports. Such decoupling is needed to facilitate implementation of rings with one token (or in the general case, rings with any number of tokens). Finally, the paper illustrates the design process using two circuits: one that outputs the sequence of Fibonacci numbers, and one that computes the greatest common divisor of two positive integers. All components are described in VHDL, and all code is available as open source. All components and the two circuits mentioned have been tested on a Xilinx Nexys4DDR FPGA board.

I. INTRODUCTION
When engineering students learn digital electronics, they typically do lab exercises where they design (small) synchronous sequential circuits and implement these in FPGA technology. A similar situation does not exist for asynchronous design. Despite decades of research, there are no widely used tools, and the situation has not improved during the last decade. CAD tools are typically developed by and used within individual university groups and companies. Many of these groups have used variants of CSP [10] to describe asynchronous circuits and systems. Some examples are [2], [16], [25]. The last of these was later commercialized by the start-up company Handshake Solutions, and at one point, their Haste language and synthesis tools were available to universities through Europractice [5]. Our experience at that time was that students ended up writing concurrent programs with very limited understanding of what hardware their programs would generate – a paradox in light of the full transparency of syntax-directed compilation.
For a newcomer, and in a teaching context, we believe that less-is-more, and for that reason we aim for a simple and straightforward component-based approach. Our aim is to provide students with FPGA-implementations (i.e., synthesizable VHDL descriptions) of the handshake components presented in [23, Ch. 3]. From this, they can then build static data flow structures by simply wiring together the relevant components, they can simulate the circuits, and they can implement and operate their circuits using a conventional FPGA board. For this purpose, the click-element template [18], using only D-flip-flops and combinational gates, seems to be a good fit.
The contributions of the paper are partly tutorial and partly scientific. The material and insights presented emerged from a course on asynchronous design, where students were asked to design and build small asynchronous circuits. This turned out to be surprisingly difficult. The reason is that when going beyond simple pipelines, many important details such as initialization, numbers of tokens in rings, implementation of components etc. are not well covered in the literature. The aim of our paper is to fill this void, and to enable newcomers to experiment with, and get hands-on experience with, the design and implementation of small asynchronous circuits, in order to support the learning process.
The paper makes three contributions: (1) We discuss and decide on a set of design guidelines, including how to implement two-phase rings with any number of tokens (often just a single token). (2) We present the design and FPGA-implementation of the set of handshake components from [23, Ch. 3]. The handshake registers are what we call “phase-decoupled” (based on ideas first proposed in [19]). The rest of the components are transparent to the handshaking, in contrast to what is used in most other works based on the click-template. (3) We illustrate the use of the design guidelines and the component library using two small circuits: one that emits the sequence of Fibonacci numbers and one that computes the greatest common divisor of two unsigned numbers. All code and all examples are available as open source.
The paper is structured as follows: Section II presents background and related work. Section III discusses design challenges and presents a set of design guidelines or policies. Section IV presents the design and FPGA-implementation of the set of handshake components. Section V shows component optimizations that fuse several components. Section VI presents the two example data-flow structure circuits, and finally Section VII concludes the paper.

II. BACKGROUND AND RELATED WORK
A. Data-flow components
Asynchronous circuits are often designed using data-flow components. The data-flow abstraction decouples high level thinking from low-level implementation details including what
handshake protocol to use. The set of handshake components introduced in [23, Ch. 3] is shown in Fig. 1. Similar components are used in [3], [12], [15].

Fig. 1. The set of data-flow components from [23, Ch. 3] (Register, Source, Sink, Function block, Join, Fork, Merge, MUX and DEMUX). The "box-arrows" represent handshake channels (a bundle of request, acknowledge and data signals). Except for the handshake register (including Source and Sink), all components are transparent to handshaking, i.e., they do not buffer data.

B. Two-phase bundled-data handshake components
Over the years, researchers have discussed what handshake protocol to use. Four-phase dual-rail quasi delay-insensitive handshaking has been in use since the early years [17] and is still widely used [1], [20]. However, its key feature – insensitivity to delays in gates and wires – comes at a high cost in terms of area and power.
The bundled-data design styles avoid this overhead, and Ivan Sutherland's paper “Micropipelines” [24] created a big interest in two-phase bundled-data circuits. The backbone control circuit is (still) the well-known Muller pipeline based on C-elements. Later, the Mousetrap template [21] was introduced as a faster alternative. It uses only “conventional gates” (an XOR-gate and a latch) to implement a controller that directly controls a level-sensitive latch for data. However, when going beyond simple pipelines, Mousetrap still needs C-elements, for example in join and fork components.
In 2010, click elements [18] were introduced as a new template for implementing two-phase bundled-data designs. The click-template allows all handshake components to be implemented using only combinational gates and D-flip-flops, which are available in all standard-cell libraries. In addition, all signal paths start and end in edge-triggered D-flip-flops. This is more in line with the view of conventional (synchronous) CAD-tools. The basic idea of the click template has since been used in several other works [4], [19].

C. Control circuits for bundled data pipeline stages
The control circuit used in all two-phase and four-phase bundled data pipeline stages can be seen as variations of a C-element based Muller pipeline, possibly implemented using other components than the C-element, as is the case with Mousetrap and Click. In all cases, the behaviour of a pipeline is that data (a token) from a predecessor stage is latched by or clocked into a stage, if the data which the stage was previously holding has been taken over by the successor stage. In two-phase pipelines, a token is captured when the C-element in the stage makes a transition. In four-phase pipelines, a valid token is captured when the output of the C-element goes to “1”, and likewise, an empty token is captured when the C-element goes to “0”.
It is well known that a ring composed of Muller pipeline stages needs at least 3 stages (C-elements) to oscillate [8], [23]. The three C-elements repeatedly cycle through the following sequence of states: (010; 011; 001; 101; 100; 110)*. This oscillation can be seen as a standing wave that propagates by copying the crest and trough forward, with the restriction that the circuit cannot enter states 111 and 000.

D. FPGA implementation
Several papers have explored how to use conventional FPGA-technology for building prototypes of asynchronous circuits. There are several challenges involved in this: (1) implementation of C-elements and generalizations thereof, (2) handling of isochronic forks, and (3) implementation of matched delay elements that are used in bundled data circuits. The first is addressed in [9], which shows how a LUT, whose output signal is fed back to one of its inputs, can be used to realize hazard-free (generalized) C-elements. The same paper argues that isochronic forks can be handled by setting proper timing constraints for the synthesis. The implementation of delay elements is explored in quite some detail in [14], which also gives full details of the design and FPGA implementation of an asynchronous network on chip. We have adopted these techniques for implementing our two-phase bundled-data click-style circuits. For completeness, we provide pointers to additional literature on this topic [6].

III. DESIGN METHODOLOGY
A. Introduction
There is a rich literature on the implementation of asynchronous pipeline stages (handshake latches) and their use in pipelined circuits, such as arithmetic circuits and routers for networks on chips. A few representative examples are [7], [20]. When such pipelined circuits are characterized, the typical test-setup is to use an initially empty pipeline connected to independent sources and sinks. This context does not expose a number of issues related to the design and initialization of circuits containing rings. Static data-flow structures typically involve many coupled rings and short pipeline segments, possibly shared by several nested rings. Here initialization plays a key role and when pipelines are connected to form rings, the handshaking on the two handshake channels that are connected must agree on the polarity of the signal transitions (rising or falling). Below we address these issues for rings using two-phase handshaking.


B. Rings using two-phase handshaking
The static data flow structures view of a three-stage ring using four-phase handshaking is that the three latches contain a valid token, an empty token and a bubble. Such a three-stage ring containing one valid token can be used to implement iterative computations where the result from the current step depends on the result from the previous step.
For two-phase designs, the situation is different. The static data-flow structure abstraction involves only tokens and bubbles, and the forward propagation of a token is associated with a transition of the C-element in the control circuit. Consequently, a three-stage ring using C-element based control circuits will contain two tokens and one bubble. The two tokens relate to the rising and falling transition of the wave that rotates in the ring. A consequence of this is that rings in general can only contain an even number of tokens.
Following the wave analogy, a fix would be that the ring alternately propagates a rising transition and a falling transition. This could be done by inverting the request and acknowledge signals when the input and output ports of a pipeline are connected to form a ring. We experimented with this line of thinking, but as static data-flow structures typically involve many coupled rings and short pipeline segments, possibly shared by several nested rings, there are severe constraints on where inverters can be inserted. This leads to a set of six design policies as opposed to only the two we present in the following.
The solution we present is inspired by [19] where Roncken et al. write: “Did you know that a ring of original Mousetrap modules cannot possibly hold an odd number of tokens? The same is true for rings of original Micropipeline and Click modules . . . This little recognized truth appears clearly in Fig. 8 . . . ”.
The figure referred to shows a so-called canopy graph where the throughput of a 24-stage ring is plotted as a function of the number of tokens in the ring. All graphs have the same shape, but except for the new pipeline template introduced in the paper, all other graphs only have data-points corresponding to an even number of tokens. In our view, this 24-stage ring context blurs the importance of the observation. The fact that rings with a single token are not possible is a major issue because it precludes implementation of iterative/recursive computations.
As observed in [19], the problem is that all previously published pipeline stages (C-element-based, Mousetrap [21] and Click [18]) produce transitions on Out_Req and In_Ack with the same polarity – typically the same polarity as the incoming In_Req. This can be observed in Fig. 2(a), which shows the basic Click template. Following this observation, Roncken et al. then develop a new paradigm for structural design of asynchronous circuits using “link” and “joint” components that unifies all known pipeline templates. This represents a radical change of viewpoint. Our aim is to stick to the static data-flow structures view, and still use the set of handshake components introduced in [23, Ch. 3].

Fig. 2. (a) The Click template from [18]. The phase flip-flops are marked with P and the data registers are marked with D. (b) Our phase-decoupled Click template using separate phase flip-flops to generate In_Ack and Out_Req.

C. The phase-decoupled click template
A less radical solution to the “two-phase problem” involves some form of decoupling of the phase (rising or falling) of the handshake signals in the input and output channels on some handshake registers in the circuit. In a click-based handshake register, this can be done by introducing a second state flip-flop as shown in Fig. 2(b). This figure also shows an alternative implementation of the circuit generating the click signal. The shaded rectangles indicate combinational logic that is implemented in a single LUT in an FPGA.
Conceptually, in two-phase design, there is no difference between rising and falling signal transitions. However, when it comes to implementation, the designer has to decide on the initial signal levels. We note that all handshake channels are push channels and we adopt the following policy:
P1: For all channels in the circuit, a transition on the request wire (signaling that the driving circuit has a token) is followed by a transition on the acknowledge wire with the same polarity (signaling that the token has been received).
Following this policy, we see that the input channel has a new token when In_Req ≠ In_Ack and that a token on the output channel has been received by the downstream neighbour when Out_Req = Out_Ack. The use of XOR and XNOR gates in the click control circuit in Fig. 2(b) shows this in a more explicit way. However, the functions of the circuits generating the click signals in Figs. 2(a) and 2(b) are identical.
Fig. 3 shows static data-flow structure schematics of two-phase rings with two and three stages, both initialized to hold one token in the first stage (R1). The 0’s and 1’s annotated to the ports of a click stage are the initial values of the state flip-flops (driving the In_Ack and Out_Req signals on the input and output ports respectively). We start by labeling the channel connecting R1 and R2 with 1-0, meaning that a token is about to propagate across this channel. The remaining channels are all marked 0-0 or 1-1 because R2 and R3 driving these channels hold bubbles. If the markings on the input and output channels of a handshake register are identical, the register can be an ordinary click-stage. Figs. 3(b) and 3(c) show alternative
initializations of the three-stage ring. In Fig. 3(b) R1 is a phase-decoupled click-stage and R2 and R3 are ordinary click-stages. In Fig. 3(c) R2 is a phase-decoupled click-stage and R1 and R3 are ordinary click-stages.

Fig. 3. Static data-flow structure schematics of a two-stage and a three-stage ring with a single token built from phase-decoupled click-stages. Initial values of all phase flip-flops are shown at the input and output ports where they directly drive the request and acknowledge signals.

This hints that a component library may need multiple versions of the handshake register component. To simplify, we limit ourselves to the generic case, which is the phase-decoupled version, where every input and output channel has a state flip-flop for the acknowledge and request signals respectively. A possible optimization, which we have not implemented, is to allow the synthesis tool to perform what is called “register sharing” of the state flip-flops in a decoupled click stage.
We prefer the initial state shown in Fig. 3(b) to the initial states shown in Fig. 3(a) and 3(c) because it can be expressed in a single policy for initialization:
P2: All channels conveying tokens are initialized with Req = 1 and Ack = 0. All channels not conveying tokens are initialized with Req = 0 and Ack = 0. A handshake register where In_Ack ≠ Out_Req must be phase-decoupled. Handshake registers where In_Ack = Out_Req may be ordinary click-stages.
Fig. 4(a) shows the initial state of a three-stage ring with two tokens using conventional click element handshake registers. Fig. 4(b) shows the initialization that results from adhering to policy P2. Handshake registers R1 and R2 must now be phase-decoupled registers.

Fig. 4. Static data-flow structure schematics of a three-stage ring with two tokens. (a) Using the traditional initialization. (b) Initialized following policy P2. Handshake registers R1 and R2 must now be phase-decoupled.

D. Initialization
The entire circuit is reset to a state where the state flip-flops in the click-stages are set according to the levels annotated to the channels in the static data-flow structures diagram, and where handshake registers in the data path that initially hold tokens are set to the desired initial values.
In order to safely bring the circuits out of reset and into normal operation, we introduce a barrier on channels that initially propagate tokens, wherever one is needed to keep the circuit “frozen” in its initial state. These barriers are controlled by a global start-signal and they block the request signals on the corresponding channels. In this way, reset can be de-asserted, possibly with some skew, before the start-signal is asserted and the circuit starts operating. According to policy P2, a channel propagating a token has the request signal set to high. This means that the barrier must output a request signal that is low. An AND-gate is used to implement this forcing to zero.

IV. IMPLEMENTATION
We now present our implementations of the handshake components shown in Fig. 1.

A. Overview
Following [23, Ch. 3], our handshake registers (including their degenerate versions, Source and Sink) are the only components which actively implement the handshaking that makes the data-flow circuits operate. All other components (Func, Join, Fork, Merge, MUX and DEMUX) are passive/transparent from a handshaking point of view. This is in contrast to the components presented in [18], where for example the Join and DEMUX components buffer data. We prefer to see such components, which fuse for example a Join and a handshake register, or a DEMUX and a handshake register, as peephole optimizations that, if desired, may be performed later in the design process. In our view, a designer must first develop a design with the desired number of handshake registers, tokens and bubbles, and this should not be constrained by the number of joins and forks that happen to be in the circuit. Because all components other than the handshake registers are passive/transparent to the handshaking, they do not (need to) implement any phase decoupling. All of the components have been implemented and tested on a Nexys4 DDR board. The source files can be found in a Git-repository [11].
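
The even-token restriction discussed in Section III-B follows from a simple counting argument, which the sketch below checks exhaustively for small rings. It assumes an ordinary click stage with a single phase flip-flop that drives both In_Ack and Out_Req, and it uses policy P1 to decide that a channel carries a token when Req ≠ Ack. The function name `tokens_in_ring` is hypothetical.

```python
from itertools import product


def tokens_in_ring(phases):
    """Count channels with Req != Ack in a ring of ordinary click stages.

    phases[i] is the single phase flip-flop of stage i. It drives the request
    of the channel towards stage i+1 and the acknowledge of the channel coming
    from stage i-1, so a channel holds a token when neighbouring phases differ.
    """
    n = len(phases)
    return sum(phases[i] != phases[(i + 1) % n] for i in range(n))


if __name__ == "__main__":
    for n in range(2, 7):
        counts = {tokens_in_ring(p) for p in product((0, 1), repeat=n)}
        print(n, sorted(counts))
```

Walking once around the ring, the phase must change an even number of times to return to its starting value, which is why no initialization of ordinary stages yields a single token.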

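
A single token circulating in a three-stage ring, initialized according to policy P2 and held by a barrier until the start signal is asserted, can be sketched as follows. The model is a deliberate simplification: each register is reduced to its two phase flip-flops, the click condition is taken from the XOR/XNOR form of Fig. 2(b), data and delays are ignored, and the names `Reg`, `init_ring_p2`, `needs_decoupling`, `fire_once` and `token_holders` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    phase_in: int   # drives In_Ack
    phase_out: int  # drives Out_Req


def init_ring_p2(holds_token):
    """Policy P2: Req = 1 on channels conveying a token, all other levels 0."""
    return [Reg(0, 1 if t else 0) for t in holds_token]


def needs_decoupling(reg):
    """P2: a register with In_Ack != Out_Req must be phase-decoupled."""
    return reg.phase_in != reg.phase_out


def fire_once(ring, barriers, start):
    """Fire the first enabled register; return its index, or None if frozen.

    `barriers` holds the indices of registers whose output request passes
    through a barrier, modelled as an AND with the global start signal.
    """
    n = len(ring)
    for i, reg in enumerate(ring):
        pred = (i - 1) % n
        in_req = ring[pred].phase_out
        if pred in barriers:
            in_req &= start
        out_ack = ring[(i + 1) % n].phase_in
        # click = (In_Req xor phase_in) and (Out_Ack xnor phase_out)
        if in_req != reg.phase_in and out_ack == reg.phase_out:
            reg.phase_in ^= 1
            reg.phase_out ^= 1
            return i
    return None


def token_holders(ring):
    """Registers whose output channel currently has Req != Ack."""
    n = len(ring)
    return [i for i in range(n)
            if ring[i].phase_out != ring[(i + 1) % n].phase_in]


if __name__ == "__main__":
    ring = init_ring_p2([True, False, False])      # token in R1 (index 0)
    print([needs_decoupling(r) for r in ring])     # which stages need decoupling
    print(fire_once(ring, {0}, start=0))           # frozen while start is low
    print([fire_once(ring, {0}, start=1) for _ in range(6)])
```

With the token placed in the first register, only that register has In_Ack ≠ Out_Req, in line with Fig. 3(b); initializing two tokens with `init_ring_p2([True, True, False])` marks the first two registers, in line with Fig. 4(b).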

B. Handshake Register
The handshake register and the phase-decoupled handshake register are described in the previous section and their implementations can be seen in Fig. 2. The VHDL code for the phase-decoupled handshake register is shown in Listing 1.
When deciding the initial state of the phase flip-flops in a circuit, it is important to note that if a stage holds a token, the request and acknowledge signals in its output channel must have opposite phases. If a stage represents a bubble, the request signal in its output channel must have the same phase as the acknowledge signal driven by its downstream neighbor (cf. policies P1 and P2).
The click pulse has a very short duration. This does not cause problems for the edge-triggered flip-flops (FF) on the FPGA we used for testing. If desired, the pulse-width can be increased by delaying the self-resetting of the control circuit. This can be done by adding a delay to the click signal (delaying the clocking of phase and data flip-flops) or by adding a delay after one of the phase flip-flops (delaying In_Ack or Out_Req).

Listing 1. Phase-decoupled handshake register
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;  -- needed for to_unsigned

    entity decoupled_hs_reg is
      generic (
        -- The default refers to a DATA_WIDTH constant that is assumed to be
        -- declared in a package visible here (not part of this listing).
        DATA_WIDTH     : natural   := DATA_WIDTH;
        VALUE          : natural   := 0;     -- reset value of the data register
        PHASE_INIT_IN  : std_logic := '0';   -- reset value of In_Ack
        PHASE_INIT_OUT : std_logic := '0');  -- reset value of Out_Req
      port (
        rst      : in  std_logic;
        -- Input channel
        in_ack   : out std_logic;
        in_req   : in  std_logic;
        in_data  : in  std_logic_vector(DATA_WIDTH-1 downto 0);
        -- Output channel
        out_req  : out std_logic;
        out_data : out std_logic_vector(DATA_WIDTH-1 downto 0);
        out_ack  : in  std_logic);
    end decoupled_hs_reg;

    architecture behavioral of decoupled_hs_reg is
      signal phase_in, phase_out, click : std_logic;
      signal data_sig : std_logic_vector(DATA_WIDTH-1 downto 0);
    begin
      out_req  <= phase_out;
      in_ack   <= phase_in;
      out_data <= data_sig;

      -- New token on the input and previous token taken by the successor
      click <= (in_req xor phase_in) and (out_ack xnor phase_out);

      clock_regs : process (click, rst)
      begin
        if rst = '1' then
          phase_in  <= PHASE_INIT_IN;
          phase_out <= PHASE_INIT_OUT;
          data_sig  <= std_logic_vector(to_unsigned(VALUE, DATA_WIDTH));
        elsif rising_edge(click) then
          phase_in  <= not phase_in;
          phase_out <= not phase_out;
          data_sig  <= in_data;
        end if;
      end process;
    end behavioral;

C. Function blocks and delay elements
A function block is an ordinary combinational circuit extended with a request and an acknowledge signal, see Fig. 5. The request signal must be delayed by more than the propagation delay of the combinational circuit. For this, we use delay elements that are initially set with a very large safety margin. Later, based on post place and route simulation, the designer may manually trim down the delays to better match the propagation delay in the logic. Automation of this process is future work. This simple and straightforward implementation of a function block does not offer any joining of inputs or forking of outputs.

Fig. 5. Function Block (the request passes through a delay element and the data through the combinational logic, CL).

The delay elements are implemented following the guidelines outlined in [14]: a chain of LUTs whose relative physical placement on the FPGA is constrained/controlled. Listing 2 shows the VHDL code for the delay element. The LUT component used in this implementation has a single input and implements a buffer (a so-called LUT1 initialized with truth table "10"). In order to obtain reproducible delay values, the placement of the LUTs that implement delay elements is crucial. The rloc attribute allows the designer to specify the relative location of the slices in which the LUTs are placed. The relative placement is specified using the Cartesian coordinates (X#Y#) of the slices, as illustrated in Fig. 6. As seen in the code, the chain of LUTs is placed in a single column in slices X0Y0, X0Y1, X0Y2, . . . in the following order, as specified by the Y-index of the slice: (0, 1, 0, 1, 0, 1, 0, 1, 2, 3, . . . ). By always placing the next LUT in a different slice, we get a higher delay due to the delay of the wires.

Fig. 6. A Xilinx FPGA is composed of slices (each containing a number of LUTs and DFFs). Slices are identified by Cartesian coordinates. (The figure shows four CLBs, each with a Slice 0 and a Slice 1, covering slices X0Y0 to X3Y1.)
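
The placement order of the LUT chain can be generated rather than typed in by hand. The sketch below reproduces the Y-index sequence given in the text, (0, 1, 0, 1, 0, 1, 0, 1, 2, 3, ...), and builds the corresponding rloc strings. The closed-form rule (groups of eight LUTs alternating between two vertically adjacent slices) is inferred from the listed values rather than stated by the authors, and the helper names `y_index` and `rloc` are hypothetical.

```python
def y_index(i):
    """Y coordinate of the slice that holds LUT number i of the delay chain."""
    return 2 * (i // 8) + (i % 2)


def rloc(i):
    """RLOC string for LUT i; all LUTs sit in slice column X0."""
    return f"X0Y{y_index(i)}"


if __name__ == "__main__":
    print([y_index(i) for i in range(10)])
    print([rloc(i) for i in range(4)])
```

For the 30 positions allowed by the delay element this rule yields the same table of Y coordinates as the constant in Listing 2.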

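
Choosing the length of a matched delay amounts to simple arithmetic once a per-stage delay is known. The helper below is only a sketch of that arithmetic: the combinational delay, the per-LUT stage delay and the safety margin are placeholder inputs that would in practice come from post place-and-route timing, the default margin of 2.0 is an arbitrary assumption, and `chain_length` is a hypothetical name. The upper bound of 30 mirrors the range of the `size` generic of the delay element.

```python
import math


def chain_length(comb_delay_ns, stage_delay_ns, margin=2.0, max_size=30):
    """Smallest number of LUT stages whose total delay exceeds margin * logic delay."""
    if comb_delay_ns < 0 or stage_delay_ns <= 0 or margin < 1.0:
        raise ValueError("delays must be positive and the margin at least 1")
    # floor(...) + 1 makes the request delay strictly larger than the target
    stages = math.floor(margin * comb_delay_ns / stage_delay_ns) + 1
    if stages > max_size:
        raise ValueError("target exceeds the size range of one delay element")
    return stages


if __name__ == "__main__":
    # Placeholder numbers: 3.0 ns of logic, 0.5 ns per LUT stage, margin 2.
    print(chain_length(3.0, 0.5))
```

Trimming the delay after place and route then corresponds to calling the helper again with a margin closer to 1 and the measured stage delay.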

Listing 2. Delay element
    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    library unisim;
    use unisim.vcomponents.lut1;

    entity delay_element is
      generic (
        size : natural range 1 to 30 := 10);  -- number of LUTs in the chain
      port (
        d : in  std_logic;  -- Data in
        z : out std_logic);
    end delay_element;

    architecture lut of delay_element is
      component lut1
        generic (
          init : bit_vector := "10");
        port (
          I0 : in  std_ulogic;
          O  : out std_ulogic
        );
      end component;

      -- Internal signals.
      signal s_connect : std_logic_vector(size downto 0);
      -- signal constraints
      attribute DONT_TOUCH : string;
      attribute DONT_TOUCH of s_connect : signal is "true";
      attribute rloc : string;

    begin
      s_connect(0) <= d;
      -- Create a ripple-chain of luts
      lut_chain : for index in 0 to (size-1) generate
        signal o : std_logic;

        type y_placement is array (integer range 0 to 29) of integer;

        -- y coordinates for relative location
        constant y_val : y_placement := (0,1,0,1,0,1,0,1,
          2,3,2,3,2,3,2,3,4,5,4,5,4,5,4,5,6,7,6,7,6,7);

        attribute rloc of delay_lut : label is
          "X0Y" & integer'image(y_val(index));

      begin
        delay_lut : lut1
          generic map(
            init => "10")  -- truth table of a buffer
          port map(
            I0 => s_connect(index),
            O  => o
          );

        -- The 1 ns delay only affects simulation; synthesis ignores it.
        s_connect(index+1) <= o after 1 ns;
      end generate lut_chain;

      -- Connect the output of delay element
      z <= s_connect(size);
    end lut;

D. Join and Fork
Simple and straightforward implementations of the join and fork components are shown in Fig. 7. They are textbook implementations [23, Sect. 5.2] using a click-circuit to implement the functionality of a C-element.
Following design policies P1 and P2, the simple join in Fig. 7 can always be used. The phase flip-flop is initialized according to the state of the input and output channels. Because the component is transparent to handshaking, these are guaranteed to be in phase. Again, the shaded rectangles indicate combinational logic that is implemented in a single LUT in an FPGA.

Fig. 7. (a) Fork. (b) Join.

E. Merge
The implementation of the Merge is shown in Fig. 8. It assumes mutually exclusive inputs and therefore uses separate phase flip-flops (denoted Pa and Pb) in the input ports. As the input and output phase flip-flops are clocked by separate signals, it also needs a separate phase flip-flop (denoted Pc) in the output port.
The circuit functions as follows: A transition on either InA_Req or InB_Req asserts either Sel_A or Sel_B and the multiplexor propagates the proper input data to the output (Out_Data). This also creates a rising edge on the signal click_out, which causes a transition on Out_Req. Finally, this creates a (silent) falling transition on signal click_in. When the right hand environment later acknowledges by transitioning signal Out_Ack, this causes a rising edge on signal click_in. This clocks both Pa and Pb and causes a transition on InA_Ack if the operation of the merge was started by a transition on InA_Req, or a transition on InB_Ack if the operation of the merge was started by a transition on InB_Req.

Fig. 8. Merge
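
The role of the click circuit in the Join and Fork is that of a two-phase C-element. The sketch below models only that function, a phase flip-flop that toggles when both inputs differ from it, and checks it against a reference C-element. It is not a transcription of the gate-level schematics in Fig. 7, and the names `ClickC` and `equivalent` are hypothetical. Following the textbook structure, the two inputs would presumably be InA_Req and InB_Req in a Join (with the flip-flop driving OutC_Req) and OutB_Ack and OutC_Ack in a Fork (with the flip-flop driving InA_Ack).

```python
from itertools import product


def c_element(a, b, prev):
    """Reference C-element: the output follows the inputs when they agree."""
    return a if a == b else prev


class ClickC:
    """Phase flip-flop clocked by a click pulse when both inputs differ from it."""

    def __init__(self, init=0):
        self.phase = init

    def update(self, a, b):
        click = (a != self.phase) and (b != self.phase)
        if click:
            self.phase ^= 1  # the rising edge of click toggles the flip-flop
        return self.phase


def equivalent(trace, init=0):
    """Compare the click model with the reference over a sequence of inputs."""
    cell, ref = ClickC(init), init
    for a, b in trace:
        ref = c_element(a, b, ref)
        if cell.update(a, b) != ref:
            return False
    return True


if __name__ == "__main__":
    traces = product(product((0, 1), repeat=2), repeat=4)
    print(all(equivalent(t, init) for t in traces for init in (0, 1)))
```

The two models agree because the click condition is true exactly when both inputs are equal to each other and different from the stored phase, which is the only situation in which a C-element changes its output.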

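
The sequence of events described for the Merge can be traced with a small event-level model. This is a sketch under simplifying assumptions: the click_out and click_in pulses are represented by method calls rather than by gates, Pa and Pb are assumed to capture the level of their own request when click_in rises (so only the port that started the operation sees its acknowledge change), the inputs are mutually exclusive as required in the text, and the environment acknowledges each token before the next request arrives. `MergeModel` and its methods are hypothetical names.

```python
class MergeModel:
    def __init__(self):
        self.pa = 0  # phase flip-flop driving InA_Ack
        self.pb = 0  # phase flip-flop driving InB_Ack
        self.pc = 0  # phase flip-flop driving Out_Req
        self.in_req = {"a": 0, "b": 0}
        self.out_data = None

    def request(self, port, data):
        """Two-phase request: a transition on InA_Req or InB_Req."""
        self.in_req[port] ^= 1
        self.out_data = data  # the asserted select steers the multiplexer
        self.pc ^= 1          # rising edge on click_out toggles Out_Req

    def acknowledge(self, out_ack):
        """Apply the Out_Ack level; click_in rises when it matches Out_Req."""
        if out_ack == self.pc:
            self.pa = self.in_req["a"]
            self.pb = self.in_req["b"]


if __name__ == "__main__":
    m = MergeModel()
    m.request("a", "x")
    print(m.pc, m.out_data, m.pa, m.pb)  # request forwarded, no acknowledge yet
    m.acknowledge(m.pc)
    print(m.pa, m.pb)                    # only InA_Ack has changed
```

A token arriving on port B afterwards toggles Out_Req back and, once acknowledged, changes only InB_Ack.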

This way of using a phase flip-flop to produce an acknowledge based on the corresponding request is a small variation that we prefer instead of the clock-gating used in the buffered Merge in [18] and the plain Merge described in [13]. A gated clock produced by an AND-gate requires the gating signal to be stable in a time window overlapping the period where the clock signal is high. Our solution avoids this timing requirement.

F. MUX and DEMUX

The MUX and DEMUX components are used to implement conditional flow control. The MUX has two input channels (InA, InB), a selection (input) channel for choosing between InA and InB and an output channel (OutC). Fig. 9 shows the implementation of the MUX. The phase flip-flops Pa, Pb and Ps are all clocked on every transition of the incoming acknowledge by the same signal derived from the function OutC_Req = OutC_Ack. The phase flip-flop Pc drives the request signal of the output channel (signal OutC_Req) and is toggled whenever there is a token on the selector channel and the selected input. Similar to the Merge, the MUX has phase-decoupled channels due to the nature of its function.

Fig. 9. MUX

Fig. 10 shows the implementation of the DEMUX (inspired by [13]). It has two input channels (InA and InSel) and two output channels (OutB and OutC). The component joins the two inputs and produces an output on the selected channel. Similar to the MUX, the DEMUX has multiple internal phase flip-flops. The phase flip-flops Pb and Pc are clocked when both request signals on the input channels transition. Phase flip-flop Pa (participating in the input channel handshakes) is clocked whenever an acknowledgement is received, as indicated by the following expression: (OutB_Ack = OutB_Req) ∧ (OutC_Ack = OutC_Req). Again, we prefer this style of clocking to the gated clocking used in the components described in [13], [18].

Fig. 10. DEMUX

V. PEEPHOLE OPTIMIZATIONS

It is possible to reduce the hardware cost of a circuit by performing peephole optimizations, where certain combinations of handshake components are replaced by a single fused circuit. All of these optimizations involve merging handshake registers and one or more of the passive components. The original click-paper [18] showed how easy it is to extend the click-template with join functionality on the input and fork functionality on the output. The same is the case for our phase-decoupled handshake register. Below we describe a range of such fused components.

Fig. 11. Schematic symbols for a handshake register fused with a Join and/or a Fork.

Fig. 12. Implementation of a phase-decoupled fused Join+Register+Fork component with two input channels (A and B) and two output channels (C and D).

A. Join+Register+Fork, Join+Register and Register+Fork

The schematic symbols for a handshake register fused with a Join and/or a Fork are shown in Fig. 11, and Fig. 12 shows

the implementation of a fused Join+Register+Fork circuit with two input channels and two output channels. For simplicity the figure shows a design with separate phase flip-flops for each input and output channel. As all phase flip-flops are clocked by the same signal, at most two phase flip-flops are needed – phase flip-flops initialized to the same value can share a single flip-flop. It is easy to see how input and output channels can be dropped from or added to the circuit by dropping or adding XOR or XNOR gates.

B. Register+Merge, Register+MUX and Register+DEMUX

We have developed fused versions of a handshake register fused with a Merge or a MUX or a DEMUX. All implementations are in the Git-repository, but the implementations of these fused components are only marginally smaller and faster than compositions of the basic components. Because of this, and due to space limitations, we do not include the descriptions here. This is also in line with our overall goal of simplifying the design process.

C. Join+Func+Register+Fork

We considered fusing components that implement a complete pipeline stage, i.e., a fused Join+CL+Register+Fork circuit. This could be done by fusing the Join and the Fork into the handshake register. In the general case, where nothing is known about the surrounding circuitry, this would require a matched delay element for each input channel. As the matched delay elements are expensive to implement in FPGA technology, we decided not to pursue this idea.

VI. DESIGN EXAMPLES

In this section, we illustrate the use of the components and the design methodology by showing and explaining the design and implementation of two small circuits that contain multiple coupled rings and pipeline segments. Both circuits have been implemented and tested on the Nexys4DDR FPGA-board and the code is available in the Git-repository.

A. Fibonacci

The Fibonacci circuit has no inputs; it simply computes and outputs the sequence of Fibonacci numbers (0, 1, 1, 2, 3, 5, ...). Our implementation using phase-decoupled two-phase handshake components is shown in Fig. 13. The figure also shows the initial state of the circuit and the use of fused components. A matched delay is only required in the function block (CL0), since the LUTs generating the click signals in the other components normally provide sufficient delay margins.

Fig. 13. Schematic of the Fibonacci circuit. The handshake register shown using dashed lines could be added for a more straightforward implementation, but without it, the circuit offers better illustration of the spread-token operation.

The circuit is initialized with a token in each of the two Register+Fork components. The design consists of two nested rings: an inner ring containing RF0 → J0 → CL0 → R0 and an outer ring containing the same components and handshake register RF1. The inner ring has two handshake registers and one token. The outer ring has three handshake registers and two tokens. By following policy P2, we can ensure correct initialization of both rings. Notice that the schematic shows no annotations on the input channels of join J0; our passive join simply forwards the acknowledge signal from the output channel to the two input channels (without any buffering). This acknowledge signal is produced by handshake register R0.

When the go signal is asserted and the barrier opens, the circuit starts: J0 joins the tokens from RF0 and RF1 and the resulting (single) token spreads across the J0, CL0 (the adder) and into R0. At the same time, the environment consumes (a forked copy of) the token in RF1. This spread-token operation is mentioned/assumed in [23], and studied in detail in [22]. We have chosen this design, instead of the more straightforward implementation (with an additional handshake register shown using dashed lines in Fig. 13), to better illustrate the spread-token semantics. The Git repository contains an illustration of the spread-token operation of the Fibonacci circuit (as well as for the GCD circuit presented in the next section).

B. Greatest common divisor

The greatest common divisor (GCD) circuit shown in Fig. 14 was designed after [23, Sec. 3.7] with small modifications. As we use two-phase handshaking, we need fewer handshake registers. In addition, we use a Merge, ME0, instead of a MUX at the end of the if-then-else construct. The circuit is initialized according to policies P1 and P2 with a token (with value '1') in handshake register R0 and with bubbles in the remaining handshake registers. The circuit has no barrier since, after reset, it waits for a token on the input channel.

Fig. 14. Schematic of the GCD circuit.

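As a reference for checking results from the GCD circuit in simulation, the function it computes can be sketched in plain sequential VHDL. This is only a behavioural model under stated assumptions, not the handshake circuit itself: the names `gcd_reference_tb` and `gcd_ref` are hypothetical, the subtractive algorithm is assumed from the function-block labels visible in Fig. 14 (A != B, A > B, A-B, B-A), both operands are assumed to be non-zero, and both are assumed to have the same width.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical self-checking reference model (not part of the paper's library).
entity gcd_reference_tb is
end entity gcd_reference_tb;

architecture sim of gcd_reference_tb is

  -- Subtractive GCD; assumes equal-width, non-zero operands.
  function gcd_ref (a_in, b_in : unsigned) return unsigned is
    variable a : unsigned(a_in'length - 1 downto 0) := a_in;
    variable b : unsigned(b_in'length - 1 downto 0) := b_in;
  begin
    assert a /= 0 and b /= 0
      report "gcd_ref: operands must be non-zero" severity failure;
    while a /= b loop          -- A != B: keep iterating
      if a > b then            -- A > B: choose the branch
        a := a - b;            -- A-B
      else
        b := b - a;            -- B-A
      end if;
    end loop;
    return a;
  end function gcd_ref;

begin

  check : process
  begin
    assert gcd_ref(to_unsigned(12, 8), to_unsigned(18, 8)) = 6
      report "gcd_ref(12, 18) should be 6" severity failure;
    assert gcd_ref(to_unsigned(17, 8), to_unsigned(5, 8)) = 1
      report "gcd_ref(17, 5) should be 1" severity failure;
    assert gcd_ref(to_unsigned(7, 8), to_unsigned(7, 8)) = 7
      report "gcd_ref(7, 7) should be 7" severity failure;
    report "gcd_ref checks passed" severity note;
    wait;                      -- stop the process
  end process check;

end architecture sim;
```

In the handshake circuit each loop iteration corresponds to one trip of a token around a ring, whereas the model above simply loops; it is meant for comparing final values, not timing.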
Notice that R0 has different phases on the input and output channels. This is because the ring MX0 → RF0 → F0 → R0 has a single token. The other rings in this circuit are: MX0 → RF0 → DX0 → RF1 → DX1 → ME0 and MX0 → RF0 → F0 → DX0 → RF1 → DX1 → ME0. In all of the rings the tokens eventually get spread across several components as seen in the step-wise illustration provided in the Git repository.

C. FPGA Implementation

Both the Fibonacci circuit and the GCD circuit have been implemented on a Digilent Nexys4DDR FPGA-board (with a Xilinx Artix 7 chip) and the circuits have been operated manually. Input channels are implemented using a debounced pushbutton for the request signal, a set of switches for the data, and an LED for the acknowledge signal. Output channels are implemented using LEDs for the request signal and the data signals, and a debounced pushbutton for the acknowledge signal. The corresponding XDC-files (constraint files specifying the pinout) are included in the design sources in the GitHub repository. In the component source files, the "DONT_TOUCH" attribute is set for combinational signals and registers, to force the place and route tool to keep the signals. Therefore, minimal project setup is necessary for using the designs.

A post-synthesis simulation of the Fibonacci circuit is shown in Fig. 15. The first five signals show the environment signals. Below these, some select internal signals are also plotted and grouped by component. A file showing a similar simulation of the GCD circuit is included in the GitHub repository.

Fig. 15. Post-synthesis timing simulation of the Fibonacci circuit.

Both circuits work correctly in simulation and on the actual FPGA board. As this paper focuses on the design process and on FPGA-prototyping, we use delay elements with very conservative (high) values. For this reason, it does not make sense to report performance measures. A more detailed discussion of performance and performance optimization is beyond the scope of this paper.

VII. CONCLUSION

This paper presented a simple, structural approach to the design and FPGA implementation of asynchronous circuits using data-flow handshake components. The aim of the paper is to enable students, and others who are in the process of learning asynchronous design, to design and implement small asynchronous circuits using FPGA technology.

The components use two-phase bundled-data handshaking and are implemented using a novel phase-decoupled extension of the click-element template. This phase-decoupling allows implementation of nested rings with any number of tokens including the most typical situation – rings with a single token. In this way, two-phase bundled-data implementations of iterative/recursive functions are now possible.

The paper presents the implementation (described in VHDL) of all components in the library and it illustrates the design method using two example circuits: Fibonacci and greatest common divisor. All code, including the design examples, is available as open source.

SOURCE CODE

The paper is accompanied by an on-line repository [11] containing: (a) Schematics and VHDL source code for all the handshake components. (b) Schematics and source code for the two design examples including VHDL test-benches for simulation. (c) A sequence of snapshots of the schematics illustrating the token-flow operation of the circuits.

REFERENCES

[1] Filipp Akopyan, Jun Sawada, Andrew Cassidy, et al. True North: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537–1557, 2015.
[2] Erik Brunvand and Robert F. Sproull. Translating concurrent programs into delay-insensitive circuits. In Proc. Int'l. Conf. Computer-Aided Design, pages 262–265, November 1989.
[3] S. Chatterjee, M. Kishinevsky, and U. Y. Ogras. xMAS: Quick formal modeling of communication fabrics to enable verification. IEEE Design & Test of Computers, 29(3):80–88, 2012.
[4] M. Davies, N. Srinivasa, T. Lin, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.
[5] Europractice. URL: http://www.europractice.com.
[6] P. D. Ferguson, A. Efthymiou, T. Arslan, and D. Hume. Optimising self-timed FPGA circuits. In Proc. Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pages 563–570, 2010.
[7] Alberto Ghiribaldi, Davide Bertozzi, and Steven M. Nowick. A transition-signaling bundled data NoC switch architecture for cost-effective GALS multicore systems. Proceedings - Design, Automation, and Test in Europe Conference and Exhibition, pages 332–337, 2013.
[8] Mark R. Greenstreet, Jørgen Staunstrup, and Ted E. Williams. Self-timed iteration. In Carlo H. Séquin, editor, Proceedings of VLSI '87, pages 269–282. IFIP, August 1987.

[9] Quoc Thai Ho, Jean-Baptiste Rigaud, Laurent Fesquet, Marc Renaudin, and Robin Rolland. Implementing asynchronous circuits on LUT based FPGAs. In Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, pages 36–46. Springer, 2002.
[10] C. A. R. Hoare. Communicating sequential processes. Communications of the ACM, 21(8):666–677, August 1978.
[11] https://github.com/zuzkajelcicova/Async-Click-Library.
[12] Lana Josipović, Radhika Ghosal, and Paolo Ienne. Dynamically scheduled high-level synthesis. In Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pages 127–136, 2018.
[13] I. Kotleas, D. R. Humphreys, R. B. Sørensen, E. Kasapaki, F. Brandner, and J. Sparsø. A Loosely Synchronizing Asynchronous Router for TDM-Scheduled NOCs. In Proc. IEEE/ACM International Symposium on Networks-on-Chip (NOCS), pages 151–158, 2014.
[14] Jon Neerup Lassen. FPGA prototyping of asynchronous networks-on-chip. Master's thesis, Dept. of Information Technology, Technical University of Denmark, 2008. (Report IMM-M.Sc.-2008-26) available at http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=7126.
[15] Rajit Manohar. Reconfigurable asynchronous logic. In Proc. Custom Integrated Circuits Conference (CICC), pages 13–20. IEEE, 2006.
[16] Alain J. Martin. Compiling communicating processes into delay-insensitive VLSI circuits. Distributed Computing, 1(4):226–234, 1986.
[17] David E. Muller. Asynchronous logics and application to information processing. In H. Aiken and W. F. Main, editors, Proc. Symp. on Application of Switching Theory in Space Technology, pages 289–297. Stanford University Press, 1963.
[18] Ad Peeters, Frank te Beest, Mark de Wit, and Willem Mallon. Click elements: An implementation style for data-driven compilation. In Proc. IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 3–14, 2010.
[19] M. Roncken, S. M. Gilla, H. Park, N. Jamadagni, C. Cowan, and I. Sutherland. Naturalized communication and testing. In Proc. IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 77–84, 2015.
[20] Basit Riaz Sheikh and Rajit Manohar. An asynchronous floating-point multiplier. In Proc. IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 89–96, 2012.
[21] M. Singh and S. M. Nowick. MOUSETRAP: High-speed transition-signaling asynchronous pipelines. IEEE Transactions on VLSI Systems, 15(6):684–698, 2007.
[22] Danil Sokolov, Ivan Poliakov, and Alex Yakovlev. Analysis of static data flow structures. Fundamenta Informaticae, 88(4):581–610, 2008.
[23] J. Sparsø and S. Furber, editors. Principles of asynchronous circuit design – A systems perspective. Kluwer Academic Publishers, 2001.
[24] Ivan E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, June 1989.
[25] C. H. van Berkel, C. Niessen, M. Rem, and R. J. J. Saeijs. VLSI programming and silicon compilation: A novel approach from Philips research. In Proceedings of the 1988 IEEE International Conference on Computer Design, pages 150–166. IEEE, 1988.
