Figure 31: Two different configurations are available to install a packet processing unit,
respectively on the read side and on the write side of each FIFO
This is probably the most critical section of this thesis, since it contains the core of
the hardware application, which was built from scratch. The reader will follow the same
path of development the application went through: feature after feature, it will become
clearer why it is referred to as a toolbox. As said at the end of section 5, the
idea is to build in the FPGA fabric a well-pipelined data path made of several registers.
Such a data path must be able to carry the data from one side of the packet processing unit
to the other and possibly apply some modifications. This can be pictured as a supply
chain, where many workers stand in fixed positions, operating on the object carried
by the conveyor belt. Such an object is, in this case, one byte (8 bits)
of a network packet, recalling that the GMII data interface is an 8-bit wide bus. If all the
workers can do their jobs before the belt moves, the whole system runs smoothly. Moreover,
the stream of data preserves the same spacing in time between one packet
and the following; the only measurable difference is the time of travel through the PPU.
In this specific application, it is roughly 200 ns, which is negligible compared to the average
latency in a wired gigabit communication (a hundred times greater). Almost all the
implemented features of the Packet Processing Unit must be configured via software. In
this section, only a hardware description will be provided; later, the reader
will go through each software configuration, in the same order of appearance. The starting
point for the hardware design is the implementation of the previously mentioned
6.4. Packet Processing Unit
“supply chain”. This is nothing but a chain of twenty-four 10-bit serial registers, with synchronous
resets, whose first input is wired to the read output of the FIFO, and whose
last output is connected to the TX interface of the GMII to RGMII. The number 24 was
chosen as a consequence of the ICMP killer function (explained in section 6.4.5). Indeed,
one can work on the Packet Processing Unit to reduce the number of registers; however,
this topic will be thoroughly examined in the Upgrades section.
The clock setup is probably one of the most sensitive parts of the design. Since the
Packet Processing Unit is placed right after the FIFO, the only way to guarantee the integrity
of the data stream is to use the same clock as the read side of the FIFO. Such clock will also be
fed to the TX side of the GMII to RGMII. However, due to the implementation of filtering features
in the Packet Processing Unit, this choice has to be slightly changed. In the following
sections, the working principles of such filters will be explained in detail. For the moment,
the reader is asked to accept that there is the need for a clock running twice as fast as
the main clock. Moreover, the two must be in phase: once every two events, they must
transition from low to high together. There are a couple of ways to address this issue,
but the bottom line is the same: the two clocks must have the same source. In order to
generate a clock signal inside an FPGA, there are two different resources: the Phase-Locked
Loop (PLL) and the Mixed-Mode Clock Manager (MMCM). While the first is quite common
in the world of electronics as a frequency synthesizer, the second is mostly used in
the world of embedded systems. An MMCM can be seen as an upgraded version of a PLL,
with additional features, among which phase control [54]. As reported in [55], this component
is used to generate multiple clocks with defined phase and frequency relationships
to a given input clock. In this specific application, the GMII to RGMII generates
the clock that controls its TX side and feeds the read side of the FIFO. The starting point
is the 200 MHz source, provided by the Zynq7 Processing System IP. Unfortunately, it is
not possible to instantiate a component inside a pre-packaged IP; therefore, since the TX
data path of the GMII to RGMII is synchronous to its internally generated TX clock, there
6. Case Study - Proof of Concept
is no way to enforce a relationship with another clock. Luckily, there is a workaround that
allows feeding an external clock to the GMII to RGMII, through the gmii_clk input. The
new configuration implies that the TX data path will be synchronous to the external
clock, but there will no longer be an option to choose the speed of the link, since such input
clock is fixed. Enabling the external clock disables the clock generator that was providing
the three different options for the TX data path. Although this little disadvantage does not
allow much flexibility, the phase-aligned external clock is considered a weightier advantage. Therefore,
starting from the 200 MHz clock, required anyway by the GMII to RGMII to configure the
IDELAYCTRL, two additional clocks are generated by an MMCM: a 125 MHz one, driving
the Packet Processing Unit and the TX data path of the GMII interface (including the read
side of the FIFO), and a 250 MHz one, for other purposes explained later. These clocks
are guaranteed to be in phase since they come from the same MMCM.
The most straightforward task that the Packet Processing Unit can perform consists
in detecting whether a network packet is about to be streamed through its registers. Recalling
section 3.4, an Ethernet packet always begins with a preamble, i.e., a sequence
of 8 bytes, all equal to 0x55 except the last one, which is equal to 0xD5. A simple Finite State
Machine is sufficient to build a Preamble Detector, just by checking the value of the very first
register and keeping track of the number of consecutive positive events. If the right sequence
of seven 0x55 bytes followed by a 0xD5 is detected, the PPU asserts an output signal,
called PACKET_DETECTED, for roughly 100 ms. Such output drives a green LED on
the Ethernet module that accepted the packet. Although it may seem a simple and
useless feature, it is instead fundamental, because it determines the starting point for all
the following data processing functions. Indeed, one can manipulate a network packet
only if there is a clear sign of its inception.
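The detector described above can be sketched in software as a tiny finite state machine. The following Python model is a behavioral illustration only (not the actual HDL): it counts consecutive 0x55 bytes and fires on the closing 0xD5.

```python
# Behavioral sketch of the Preamble Detector FSM: count consecutive 0x55 bytes
# seen in the first register and report detection when 0xD5 follows at least
# seven of them. Any other byte resets the counter.
class PreambleDetector:
    def __init__(self):
        self.count = 0  # number of consecutive 0x55 bytes seen so far

    def feed(self, byte):
        """Process one byte from the first register; return True on detection."""
        if byte == 0x55:
            self.count += 1
            return False
        detected = (byte == 0xD5 and self.count >= 7)
        self.count = 0  # any non-0x55 byte resets the counter
        return detected
```

Feeding the model seven 0x55 bytes followed by 0xD5 returns True exactly on the last byte; a 0xD5 arriving without the preceding run is ignored.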
An Ethernet frame contains the MAC address of the destination, followed by the MAC
address of the sender (source), right after the preamble. It would be useful to create a
blacklist of MAC addresses, both for source and destination. This feature allows the board
to discriminate the network packets whose address matches the corresponding list (destination
or source). Hence, a packet might be dropped on purpose, i.e., lost in the wires, and
thus prevented from flowing through the board.
The filtering against a given list is performed by first storing such list
in a memory, in this case, a BRAM cell. Then, each time a new candidate MAC address is
under inspection, the blacklist is browsed. If a match is found, a flag is raised, and
the registers of the Packet Processing Unit are cleared: the packet is immediately dropped,
and the GMII interface is even prevented from receiving valid data. However, since the
list might contain several entries, it must be browsed multiple times to check them all. The
operation of finding a match is a sensitive point: practically speaking, the candidate MAC
is compared with one of the stored elements, meaning that at least an equality check is
performed. Therefore, one can deduce that a good filtering algorithm is equivalent to the
implementation of a search algorithm. The time required by such an algorithm to run is
proportional to the size of the list. As the “supply chain” of registers is not infinitely long,
such an algorithm should be optimized to run quickly; therefore, the binary search algorithm
seemed a good source of inspiration [56].
Whenever a piece of code is written for a hardware platform, the developer should always
be aware of the platform's primitive cells and functions. This is the reason why the proposed
solution is quite elegant. Thinking about the search algorithm, as said above, a comparison
operation is supposed to be carried out multiple times. In hardware description languages,
such an operation is not straightforward: it might be translated into a logical expression,
or it might require instantiating several multiplexers. The bottom line is that
the synthesizer has to evaluate the required comparison operation and guess
how to implement it correctly in the design. This is not an elegant way of writing the code,
since the outcome is not known a priori. Moreover, depending on the complexity of the
implemented logic, there might be problems in signal propagation causing the timing
constraints not to be met, due to the high number of consecutive logic levels. A much better
approach, way more elegant indeed, is to instantiate a DSP48E1 slice manually. Such a
slice is a 7-series FPGA primitive, designed explicitly for Digital Signal Processing (DSP).

Figure 32: Basic functionality of DSP48E1 slice, as reported on the manufacturer datasheet [12]

The basic functionality of the DSP48E1 slice is shown in figure 32, taken directly from the
datasheet [12]; as far as the filtering task is concerned, the DSP is mainly used to perform
arithmetic operations. Trivially, a comparison operation (like greater than or smaller than)
can be translated into a subtraction, followed by a sign check: a > b if and only if a − b > 0.
A sign check is an effortless operation on binary data, since it just requires the Most Significant
Bit (also called the sign bit). If this bit is equal to 1, the number is surely negative; if
it is equal to 0, the number is surely positive (or null). Of course, the whole analysis works only if the
variables are of the signed type. That being said, in order to configure the DSP to evaluate
algebraic comparisons, it is first necessary to understand how to fetch the terms of
comparison from memory. The reader must figure out the following steps: the DSP48E1
configuration, the BRAM configuration, and the search algorithm implementation.
Before continuing with the detailed description of these steps, a small spoiler is
given. To increase the maximum number of elements contained in the blacklist, it was decided
to store in memory only a portion of each MAC address to be filtered. Recalling from
section 3.4, the MAC address is a 48-bit number, whose first 24 bits are referred to as the
Organizationally Unique Identifier (OUI). The MAC filter feature described here
should better be referred to as an OUI filter, since the information stored in memory
and compared with the incoming data is just the first 24 bits of the MAC address.
Although it may seem a restriction, preventing a specific set of devices, belonging
to the same manufacturer, from interacting with the network is a good security
application, used to cut off untrusted devices. In the future, one can surely improve the
design, allowing the user to input a regular MAC blacklist.
1. DSP48E1 Configuration
The DSP48E1 primitive allows a maximum of four different inputs (30-bit, 18-bit, 48-bit
and 25-bit). However, for arithmetic operations that involve neither the multiplier
nor the pre-adder, just two 48-bit inputs may be used (one of them resulting from the combination
of the 30-bit and 18-bit inputs). After having carefully read the datasheet [12], the
reader should figure out one of the most remarkable features of this component, i.e., the
Single-Instruction-Multiple-Data (SIMD) arithmetic unit. This feature allows the DSP to
work in parallel on multiple arithmetic operations, whose operands are concatenated
in the 48-bit input. In other words, one can perform four different 12-bit operations
or two different 24-bit operations, using the same component and feeding the
operands through the 48-bit inputs. This feature is exploited to run two comparisons in
parallel over two 24-bit numbers, which is also the size of half a MAC address (more specifically,
the OUI). This justifies the design choice of saving only the OUI in the
blacklist. The DSP executes its operations combinationally; however, in order to
be correctly integrated into a sequential design, registers are available to pipeline
both inputs and outputs. In general, these registers should be enabled, also to
achieve better timing in the design. In this case, only one pipeline register was enabled
for both inputs and output (up to two pipeline registers are available for the inputs). Since
the pipeline registers instantiated in the design are sequential elements, the developer has
to provide a reasonable clock.

Figure 33: To acquire a signal from the “supply chain”, it is sufficient to wire an intermediate
signal to an external register. Such register has to be enabled at a specific time with respect to a
reference, that, in this case, is coincident with the arrival time of the first byte
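The dual 24-bit comparison enabled by the SIMD mode can be illustrated with a small software model. This is a sketch of the arithmetic only, not of the DSP48E1 primitive itself: two 24-bit lanes share one 48-bit operand, and each lane's carry-out reports whether its first operand is greater than or equal to the second.

```python
# Behavioral sketch of the two-lane 24-bit SIMD subtraction used for the OUI
# comparisons: both lanes travel in one 48-bit word, and each lane's carry-out
# is 1 exactly when its first operand >= its second operand (unsigned).
MASK24 = (1 << 24) - 1

def pack48(hi24, lo24):
    """Concatenate two 24-bit lanes into one 48-bit operand."""
    return ((hi24 & MASK24) << 24) | (lo24 & MASK24)

def simd_sub24(a48, b48):
    """Per-lane a - b; returns (carryouts, differences) as (hi, lo) tuples."""
    carry, diff = [], []
    for lane in (1, 0):  # hi lane first, then lo lane
        a = (a48 >> (24 * lane)) & MASK24
        b = (b48 >> (24 * lane)) & MASK24
        carry.append(1 if a >= b else 0)  # carry-out of the 24-bit subtraction
        diff.append((a - b) & MASK24)     # truncated 24-bit difference
    return tuple(carry), tuple(diff)
```

Packing the candidate OUI twice into one operand and two blacklist entries into the other reproduces the two parallel comparisons performed per memory read.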
Due to the reasons introduced before, and in general to speed up the design, the clock is
chosen to run at 250 MHz, thus exploiting the previously generated clock, in phase
with gmii_tx_clk. This choice complies with the datasheet; therefore, the output is
guaranteed to be stably generated in one clock cycle [57].
Summarizing, the DSP will receive two 24-bit values from memory, concatenated in
one 48-bit input. The second 48-bit input will receive the 24-bit OUI, concatenated twice.
Such 24-bit OUI is extracted as soon as the Packet Processing Unit receives the packet
bytes. After the frame preamble has been detected, since the structure of an Ethernet frame
is fixed, the PPU can count the number of received bytes and knows exactly the position
of the MAC address inside the “supply chain”. As shown in figure 33, as soon as the desired
bytes transit through three specific consecutive registers¹, their values are conveyed towards
the DSP, which can successfully load the corresponding operand register. After the
subtraction is carried out, the sign of the output is checked by inspecting the carry-out register.
Such register is supposed to be 1 if the first operand is greater than (or equal to) the
second, 0 otherwise. The proof can easily be figured out through 2's complement algebraic
analysis.
(¹ Recalling that each register stores an 8-bit input, three consecutive registers host 24 consecutive bits.)

The following example on a 4-bit pair of operands is left to the reader, recalling
that:
Remark: the 2’s complement of an N-bit number (i.e., its negated form) is obtained by
negating all its N bits and then adding 1.

a = 2 = 0b0010, b = 3 = 0b0011
a − b = a + (−b) = 0b0010 + 0b1101 = 0b1111 = −1 (MSB = 1, no carry-out: a < b)
b − a = b + (−a) = 0b0011 + 0b1110 = 0b0001 = 1 (MSB = 0, carry-out: b ≥ a)
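The same check can be replayed in software. This sketch mimics the 4-bit arithmetic above: a − b is computed as a + (~b + 1), the result is truncated to 4 bits, and both the sign bit and the carry-out are exposed.

```python
# 4-bit two's-complement subtraction: the MSB of the truncated result gives the
# sign of the difference, and the carry out of bit 3 is 1 exactly when a >= b
# (treating a and b as unsigned 4-bit operands).
def sub4(a, b):
    """Return (result, msb, carry_out) of the 4-bit subtraction a - b."""
    raw = a + ((~b + 1) & 0xF)   # a plus the two's complement of b
    result = raw & 0xF           # truncate to 4 bits
    msb = (result >> 3) & 1      # sign bit of the difference
    carry_out = (raw >> 4) & 1   # 1 iff a >= b
    return result, msb, carry_out
```

Running it on the example above, sub4(2, 3) yields 0b1111 with MSB 1 and carry-out 0, whereas sub4(3, 2) yields 0b0001 with MSB 0 and carry-out 1.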
2. BRAM Configuration
Now that the DSP block is ready to compute all the comparisons, the BRAM must be prepared
accordingly, to feed the right inputs. In this section, the hardware configuration of
the BRAM will be examined, while the remaining software programming will complete
the analysis later. Therefore, the reader has to temporarily assume that the PS
can write the memory.
A BRAM cell can be configured as a True Dual Port RAM [58]. This means that there
are two separate interfaces to interact with the stored contents. Assuming that the memory
is correctly configured, the DSP block needs to fetch two entries at a time, as mentioned
in the previous section. Since the configuration step is carried out by the PS, the PL is
only allowed to read the memory. The signals required to perform a correct read
operation over a BRAM interface are listed in table 6.2.

clk      The clock signal
address  Desired address to be read; its size has to be configured
din      Input data bus: not useful for reading memory; grounded
dout     Output data bus: its size has to be configured
en       Master enable: when high, it enables access to the BRAM cell
wen      Write enable: not useful for reading memory; grounded

Table 6.2: BRAM interface signal description
There are several parameters to be configured before instantiating one BRAM cell: among
these, the address and dout bus widths, and the write depth. As regards the first, the
most common configuration is to set the address bus width to 32 bits. This also makes it
compatible with a BRAM controller on the PS side, which will have to access such memory
later to configure it correctly. As regards the data output bus, it is chosen to be 32
bits wide, which is the minimum available size when configured to be compatible with a
BRAM controller. Finally, the memory depth is set to 1024 entries, so that one full
36 Kb unit can be filled. Table 6.3 summarizes the final BRAM configuration. There is a
useful remark about the dout bus size. Since the OUI is a 24-bit value, there are eight
additional bits per memory entry that may encode extra information. A reasonable
design choice consists in specifying, for each entry, the filtering list it belongs to:
the source blacklist, the destination blacklist, or both. In this case, the OUI
is stored in the least significant 24 bits of the 32-bit word, whereas the 25th and 26th bits encode its
belonging respectively to the destination or source list.
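Under the layout just described, packing and unpacking a blacklist entry could look as follows. This is a sketch of the assumed bit layout, not code taken from the design.

```python
# Assumed 32-bit entry layout: OUI in bits 0-23, bit 24 set when the entry
# belongs to the destination blacklist, bit 25 when it belongs to the source
# blacklist (the "25th and 26th bit", counting from 1).
DEST_BIT = 1 << 24
SRC_BIT = 1 << 25

def encode_entry(oui, dest=False, src=False):
    """Pack a 24-bit OUI and its list-membership flags into one 32-bit word."""
    word = oui & 0xFFFFFF
    if dest:
        word |= DEST_BIT
    if src:
        word |= SRC_BIT
    return word

def decode_entry(word):
    """Unpack a 32-bit memory word into (oui, in_dest_list, in_src_list)."""
    return word & 0xFFFFFF, bool(word & DEST_BIT), bool(word & SRC_BIT)
```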
The procedure to read the BRAM is quite simple, and it takes only two clock cycles.
First of all, the en pin must be driven high for the whole read procedure; then, the desired
address is written on the address bus. In the following clock cycle, the BRAM unit will put
the corresponding 32-bit word on the dout bus.
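The two-cycle read can be mimicked with a toy register-transfer model (illustrative only): the address presented while en is high appears on dout after the next rising clock edge, and the output holds its value while the cell is disabled.

```python
# Toy model of the synchronous BRAM read port: dout is a registered output,
# so data appears one clock cycle after the address is presented with en high.
class Bram:
    def __init__(self, contents):
        self.mem = list(contents)
        self.dout = 0          # registered output, updated on the clock edge

    def clock(self, en, address):
        """One rising clock edge: latch mem[address] into dout when enabled."""
        if en:
            self.dout = self.mem[address]
```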
One last remark concerning this section regards the interface between the BRAM
cell and the PS, which has to configure the memory after powering up the system. This is
probably not the best strategy; however, a different one, probably better in terms of wiring
and resource usage, will be proposed in the implementation of the EtherType filter.
Unfortunately, there was no time left to upgrade the old design to include
the new strategy, because the finite state machine and the hardware block design would
have had to be re-drawn.

Figure 34: Multiplexing a BRAM interface: detailed schematic of the interconnections inside the
PPU
That being said, the reader should understand that the DSP uses both the available
ports of the BRAM. Therefore, a multiplexer is required to also connect a BRAM controller.
As regards the MAC filter function, the Packet Processing Unit has three BRAM interfaces
available: one Slave and two Masters. The Slave interface is directly connected to the BRAM
controller, whereas the two Master interfaces are connected respectively to the two BRAM
ports. Inside the Packet Processing Unit, a multiplexer chooses whether
to connect the Slave interface, and thus the BRAM controller (and therefore the PS), or
the DSP State Machine logic to the BRAM cell. The PS directly controls the select signal
of such multiplexer through a General Purpose Input/Output (GPIO) channel. Figure 34
summarizes these interconnections. There is only one critical signal that should never be
multiplexed, to avoid problems with timing constraints, i.e., the clock. This is the reason
why the same 250 MHz clock drives both BRAM and DSP. Such a clock is generated by the
MMCM described in section 6.4.1.
Figure 35: Decision tree of the ternary search over the 81-entry blacklist, repeatedly split into
three chunks down to triplets of entries (0-2, 3-5, …, 78-80)
As mentioned in section 6.4.3, the main source of inspiration for the hardware implementation
of the search algorithm was the binary search algorithm [56]. Since the DSP is able to evaluate
two comparisons at a time, a ternary search algorithm was designed from scratch. A
fixed maximum number of entries is defined during the design phase, preferably a power
of 3 to make it optimal: in this case, such number is equal to 81. Then, upon its initialization
from the PS, the list of entries is required to be sorted, by the PS CPU, in
ascending order. Descending order is also possible, but it would lead to a different hardware
implementation. All the unused entries are set equal to the highest 24-bit value, that
is, 0xFFFFFF. Last but not least, no duplicates are allowed (except, indeed, the unused entries).
As soon as the half MAC address is ready to be searched in memory, the algorithm starts:
the list of entries is divided into three chunks of 27 elements each, as shown in figure 35. The
DSP is fed with the first element of the second and of the third chunk. Depending on the results
of the comparisons, one can understand which of the three blocks might contain a
matching entry. The matching condition takes into account not only the half MAC address
value but also the blacklist it belongs to, source or destination. Then, the algorithm
starts over with the same approach, dividing the new block into three chunks, until three
elements are left. The MAC State Machine keeps track of the number of cycles across the
states. A summary of the implemented state machine is given below, together with a picture
(figure 36):
Figure 36: State machine diagram of the MAC Filtering. Destination and Source MAC Filtering
are performed in the same way but changing the matching condition to the corresponding list.
The variable cycle is counting the number of loops in the red colored mesh (Dest) or blue (Src)
- DSP_Idle_Dest: Idle state. The addresses to browse the BRAM are ready on their bus
- DSP_Read_Memory: The BRAM outputs the queried values, and they are given (concatenated) to the input register of the DSP
- DSP_Eval: The inputs are sampled by the DSP, which then performs the combinational operations
    - If one of the two queried entries matches the input MAC and the corresponding list: a flag is raised to tell the main Finite State Machine to drop the packet
    - If it is the fifth time this state is run: it means that the input MAC does not belong to the blacklist. Go to DSP_Idle_Src
    - If the input MAC is greater than both entries: the new entries to be fetched will belong to the rightmost chunk. Go to DSP_Idle_Dest
    - If the input MAC is smaller than both entries: the new entries to be fetched will belong to the leftmost chunk. Go to DSP_Idle_Dest
    - If the input MAC is between the two entries: the new entries to be fetched will belong to the central chunk. Go to DSP_Idle_Dest
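The loop of the state machine above can be condensed into a software sketch of the ternary search. Values only; the source/destination list flags are omitted for brevity, and all names are illustrative.

```python
# Sketch of the ternary search carried out by the MAC State Machine: a sorted
# 81-entry list is narrowed to one third on every cycle by comparing the
# candidate OUI against the first elements of the second and third chunks.
def ternary_search(entries, target):
    """Return True if target is present in the sorted 81-element list."""
    base, size = 0, 81
    for _ in range(5):                       # five cycles browse the memory
        if size >= 3:
            chunk = size // 3
            e1 = entries[base + chunk]       # first element of 2nd chunk
            e2 = entries[base + 2 * chunk]   # first element of 3rd chunk
            if target == e1 or target == e2:
                return True
            if target > e2:                  # rightmost chunk
                base += 2 * chunk
            elif target > e1:                # central chunk
                base += chunk
            size = chunk
        else:
            # fifth cycle: a single element is left to check
            if entries[base] == target:
                return True
    return False
```

Replaying the worked example of table 6.4 (entries 0x123456 and 0x789ABC, padding 0xFFFFFF, candidate 0x175317) visits exactly the entry numbers 27/54, 9/18, 3/6, 1/2 and finally 0, and reports no match.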
There are two remarkable points to be discussed: one is good, the other is bad. The bad
news is that the algorithm is not properly optimized, as the fifth cycle across memory could
be avoided by reducing the maximum number of entries to 80. Unfortunately, this
issue and its solution came out only during the development of the following filter
(EtherType); moreover, the fix is not quick: the process of updating the addresses
for browsing the BRAM is the most sensitive part. Therefore, it was chosen to continue
with the additional features and to add a new entry to the list of upgrades. For the sake of curiosity,
one example is proposed to the reader. Suppose there are just two entries in memory,
whose values are 0x123456 and 0x789ABC. All the remaining 79 entries are filled with
0xFFFFFF, as described before. Suppose that the received OUI is equal to 0x175317. Table
6.4, shown below, reports all the steps performed by the algorithm.

State Cycle | Entry number (1) | Value (1) | Entry number (2) | Value (2) | Next entry number (1) | Next entry number (2)
     1      |        27        | 0xFFFFFF  |        54        | 0xFFFFFF  |           9           |          18
     2      |         9        | 0xFFFFFF  |        18        | 0xFFFFFF  |           3           |           6
     3      |         3        | 0xFFFFFF  |         6        | 0xFFFFFF  |           1           |           2
     4      |         1        | 0x789ABC  |         2        | 0xFFFFFF  |           0           |           0
     5      |         0        | 0x123456  |         0        | 0x123456  |                       |

Table 6.4: This table shows an application of the decision tree depicted in figure 35 to the
proposed example

Clearly, the last State Cycle is a waste of resources, since the DSP has two available
inputs, but they are used to make a single search. Anyway, the following section will
describe how the improved algorithm solves this problem.
On the other hand, the good news is that the algorithm always takes the same
number of clock periods to run, whether or not a match is found. Even when the
drop condition is asserted, the MAC State Machine waits in the Idle state until the last
calculated State Cycle. It is interesting to evaluate the number of clock periods
required: five cycles across four states are needed by the State Machine to browse the
whole memory. Since the running clock is 250 MHz, it will take 4 ns × 4 × 5 = 80 ns.
Such an amount of time is equivalent to 10 clock periods of the main Finite State Machine
that drives the network packets through the Packet Processing Unit. Since the MAC address
is made of 6 bytes, as soon as the Destination Filter is done, the Source one is ready to start,
because both Source and Destination OUIs will already be inside the Packet Processing
Unit. Considering the 10 bytes received during the Destination MAC Filtering, one can
find:
This is yet another piece of good news, as the EtherType Filter can start upon reception of
the very next byte, in parallel with the Source MAC Filtering. Once again, this highlights
the beauty of “parallelizing” tasks when working with FPGAs.
Recalling section 3.4, the EtherType is a 2-byte identifier that follows the MAC addresses
in an Ethernet frame. Such identifier contains the information necessary to climb the
OSI stack, i.e., the protocol with which data is encapsulated in the payload of the frame.
The reader should consider this scenario: suppose that one built an internal network in
which the IPv4 protocol mediates all the exchanges of data. For security purposes, one
might want to filter out, say, all the IPv6 packets, because they have nothing to do with this
stream. Possibly, they might have been generated by third-party software, maybe with
the purpose of stealing data. This is a good reason to implement an EtherType Filter. The
working principle of such filter is very similar to the previous MAC Filter, i.e., searching
in memory for a possible match. However, this is an upgraded version, correcting the
small imperfections shown before. Also in this case, there will be three steps, namely, the
DSP48E1 setup, the BRAM configuration and the search algorithm implementation, to be examined.
1. DSP48E1 configuration
Another DSP unit is required to make this feature work, since it will run in parallel with
the MAC filter. The configuration is almost the same, with two 24-bit simultaneous operations;
this time, however, the EtherType is only a 16-bit entry. Moreover, the BRAM
connections are wired differently with respect to the previous filter. The most
significant change, as regards the DSP, is that there is only one BRAM port available
to be accessed. Therefore, the two inputs must be provided by a single read from memory.
As regards the pipeline registers and the clock speed, the same configuration as before is
used.
2. BRAM Configuration
As regards the EtherType Filter, a different BRAM structure is chosen. First of all, the
choice of using a separate BRAM cell with respect to the MAC Filter is obvious for two rea-
sons: the first is that they run in parallel, the second is that the two ports are already busy;
therefore another fancy multiplexer would be required to connect another BRAM inter-
face to the same block. It also to avoid such multiplexers that the previous architecture is
slightly changed. First of all, the BRAM must be programmed by the PS; therefore its Con-
troller has to be instantiated and wired. One port is therefore reserved for this purpose,
whereas the second port is connected to the DSP. This time, since the EtherTypes are 16-
bit wide, it is sufficient to place them in memory, in contiguous positions, still sorted in
ascending order. The EtherType State Machine will take care of one single address for
querying the BRAM, and its 32-bit output will give two EtherTypes back. So, once again,
the configuration of the BRAM blocks is shown in table 6.5.
Figure 37: Interconnections between a single BRAM Controller and two twin BRAM cells
There is one last remark for the reader. The two Packet Processing Units (the reader
should remember that there are two paths for the data to flow across the FPGA) are identical;
as a result, they need the same BRAM entries. Whenever the PS accesses an external
memory-mapped peripheral, it should correctly refer to it through its address offset.
As regards the BRAM, such mapping is given only to the Controller, because it mediates
the communication. From the Controller's point of view, a BRAM cell is indistinguishable
from another. That is why, in this design, only one Controller is used to write on
two BRAM cells at the same time. The detailed schematic is shown in figure 37. There is no
need for additional components, but there is a small price to pay. Since the din bus
is an input for the BRAM Controller, it cannot be driven by the two BRAM cells at once.
Therefore, one of the two BRAM pins is left merely disconnected. This does not cause any
malfunctioning, because the Controller will always be able to write the memory (both cells will be
written) and to read it (only one will be read, but it holds the same information as its twin).
The Search Algorithm, as said in the previous section, is improved in terms of optimization.
Unfortunately, it is no longer a ternary search, because there is only one BRAM
port to be accessed. However, it is not even a proper binary search, because the BRAM
gives two entries at a time. The maximum number of entries is set to 30, and only four
cycles are necessary to browse the whole memory. The principle is the same as the binary
search: the whole list of entries is split into two chunks. Then, each of them is again split
into two other chunks, and so on. However, if this approach were followed, it would lead to the
same conclusions as the previous implementation, which was not optimal. The difference
is quite subtle: in the previous example, the queried entries corresponded to
the first elements of the equal-size chunks. However, this did not take into account that
the DSP can provide both a comparison operation and an equality check. If such check
gives a negative result on a specific entry, it is useless to query it once again. Therefore,
following this principle, the decision tree is re-drawn, and looks like figure 38.

Figure 38: Decision tree of the improved search algorithm: the median pair of entries (14-15) is
checked first, then either pair 6-7 or pair 22-23, and so on

Summarizing, the median couple of the list is checked first. Then, if the input EtherType
is:

- greater than both entries: the rightmost chunk will be next
- smaller than both entries: the leftmost chunk will be next
- between the two entries: the input EtherType does not belong to the list
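These rules amount to a binary search over the 15 stored pairs, with the DSP equality check pruning repeated queries. The following software sketch (illustrative names, values only) reproduces that behavior.

```python
# Sketch of the improved search: 30 sorted EtherTypes stored as 15 pairs (two
# 16-bit values per 32-bit word). A binary search over the pair index needs at
# most four read cycles, and since consecutive entries are adjacent in the
# sorted list, a value strictly between a fetched pair cannot be in the list.
def pair_search(entries, target):
    """entries: sorted 30-element list; return True if target is present."""
    lo, hi = 0, 14                      # 15 pairs, indexed 0..14
    for _ in range(4):                  # four cycles browse the whole memory
        if lo > hi:
            break
        mid = (lo + hi) // 2            # e.g. pair 7 holds entries 14-15
        e1, e2 = entries[2 * mid], entries[2 * mid + 1]
        if target == e1 or target == e2:
            return True                 # equality check provided by the DSP
        if target < e1:
            hi = mid - 1                # leftmost chunk
        elif target > e2:
            lo = mid + 1                # rightmost chunk
        else:
            return False                # strictly between two adjacent entries
    return False
```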
There is still one point to be discussed, related to the encoding of the EtherTypes in
memory. In the previous example of the MAC Filter, there were eight empty bits per entry
in which to store additional information; in that case, they recorded whether the entry
belonged to the Source or the Destination list. It was implicit, but, if this information
was missing, the entry in memory was considered not to be valid.

Figure 39: State machine diagram of the EtherType Filter. The variable cycle counts the number
of loops in the red colored mesh

Therefore, if the user decides not to enable the MAC filter, the memory is filled with
0x00FFFFFF, leaving untouched both the 25th and 26th bits, which encode the above-mentioned
information. If a packet with an OUI equal to 0xFFFFFF is received, the MAC Filter will
detect a match in memory, but not in the corresponding list (either Source or Destination):
this is, of course, the correct behavior. Unfortunately, there are no spare bits for the
EtherTypes, because the 32-bit memory words are entirely filled by the two 16-bit EtherTypes.
This time, if the user decides not to enable the EtherType filter, the memory is filled with
0xFFFFFFFF values. If a packet with an EtherType equal to 0xFFFF is received, the EtherType
Filter will detect it in memory and drop the packet. This is clearly the wrong behavior.
Therefore, even though
it is not very elegant, a workaround is provided. From the PS it is possible to wire some
signals directly to the PL through a GPIO channel, as described at the end of section 6.4.3.
For the EtherType Filter, two signals are chosen to be wired through the GPIO towards the
Packet Processing Units. The first implements an additional feature that gives the user
more flexibility: the EtherType list can be set to be either a whitelist or a blacklist.
The former means that only the packets matching the contents of the list are allowed to
flow through the FPGA; the latter means that the packets matching the list are not allowed
to flow through the FPGA. The second input provided by the GPIO channel tells the state
machine whether the particular case 0xFFFF is actually included in the list provided by
the user. In this way, the hardware knows for certain whether to drop or keep the packets
with the 0xFFFF EtherType.
Last but not least, the EtherType State Machine looks almost identical to the MAC State
Machine and is depicted in figure 39.
This feature is fascinating from the security point of view: roughly speaking, it puts
an invisibility cloak over the workstation connected to one side of the board. Before
checking it out, the reader must know what the Internet Control Message Protocol (ICMP)
is. It is a supporting protocol, used mainly to report network malfunctions or control
information [59]. Among its applications are the common network tools ping (packet internet
groper) and traceroute: the former is used to measure the time taken by a packet to reach
a device on the same network and come back [60]. To do that, a device sends an ICMP packet
of type echo request over the network and then listens for a response. Such a response is
another ICMP packet, of type echo reply, and it allows drawing conclusions about the time
of travel. This application is widely used to check whether a device is “alive”, i.e.,
reachable, on the network. According to the standards [60], the ICMP protocol is
encapsulated in IPv4; as a result, it uses IP datagrams for transport. Its assigned IP
protocol number is 1, written in the 24th byte of the Ethernet frame (the 10th byte of
the IPv4 header).
That being said, this function was a source of inspiration for a specific filter
application. It works straightforwardly: it checks the EtherType and the IP protocol
number and, if they match the ICMP fingerprint, respectively 0x0800 and 0x01, the packet
may be dropped. The final word depends on yet another signal through the GPIO channel,
set by the user to enable or disable this function. When the feature is enabled, every
ICMP packet is prevented from flowing through the board. Suppose that the FPGA connects
a workstation to a local network: if another device on the same network tries to ping
the workstation, the packet will be dropped, and that device will be tricked into
believing that the workstation is not connected to the network.
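The fingerprint check itself is trivial. As a software illustration (the function name is hypothetical, and an untagged Ethernet frame without the preamble is assumed), it boils down to:

```c
#include <stdint.h>

/* Returns 1 when the frame carries ICMP: EtherType 0x0800 at frame bytes
 * 12-13 and IP protocol number 0x01 in the 24th byte of the frame
 * (offset 23, i.e., the 10th byte of the IPv4 header).
 * An untagged Ethernet frame, preamble stripped, is assumed. */
int is_icmp(const uint8_t *frame)
{
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    return ethertype == 0x0800 && frame[23] == 0x01;
}
```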
The last feature implemented for the Packet Processing Units is here referred to as an
encryption environment. This choice of words reflects the adopted approach rather than
the complexity: the implemented encryption scheme is not very strong and must certainly
be upgraded in the future. However, the way it is integrated in the design is quite
remarkable, since it allows a transparent, real-time modification of the data while they
travel through the Packet Processing Unit. The bottom line is always the same: the most
powerful advantage of working with FPGAs is the possibility of running several tasks in
parallel, possibly even operating on the same set of data.
Before starting with the analysis of the encryption functions and algorithms, one must
consider the immediate consequences of modifying the data belonging to an Ethernet packet.
The reader should remember that the last four bytes of a valid packet, referred to as the
Frame Check Sequence (FCS), contain the so-called checksum. Therefore, whenever one wants
to modify the network data, the FCS must be recomputed. Usually this task is performed by
the MAC controller; however, since there is no embedded MAC controller in this application,
a checksum generator must be implemented first.
1. Checksum Generator
The algorithm defined by the Ethernet Standard [17] to compute the FCS is the Cyclic
Redundancy Check (CRC32). Such an algorithm is easily implemented in hardware because it
can be represented as a Linear Feedback Shift Register with a 32-bit characteristic
polynomial [61]. As regards the combinational part, some portions of the source code were
taken from an online resource [62]; on the other hand, the synchronization with the main
State Machine and the architectural integration in the design were built from scratch.
The CRC32 algorithm has to start working from the first byte after the preamble. It can
be seen as a black box, taking an 8-bit input (each incoming Ethernet byte) and giving a
32-bit output, starting from the fourth consecutive input received. As shown in figure 40,
it is easy to branch one of the “supply chain” registers to feed the CRC generator in
parallel to the regular flow of data.

Figure 40: Architectural implementation of the CRC generator, extracting data from the main
flow and multiplexing the outputs one stage later

However, it is a bit more challenging to provide an “affluence” path for the computed FCS.
This is necessary in order to insert the recomputed CRC in case the data are modified on
their way. To implement this insertion in the supply chain, a simple multiplexer is placed
between two consecutive registers: its inputs are the previous register in the chain and
the last register of the CRC generator block, and its output drives the following register
in the chain. The selector of this multiplexer must be asserted as soon as the last byte
of the frame payload flows through it. In this way, the fresh FCS is appended to the
payload seamlessly.
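For reference, the value the hardware LFSR must produce can be modeled with a bitwise software implementation of the Ethernet CRC32 (reflected polynomial 0xEDB88320, all-ones initial value, final inversion). This is a behavioral model only, not the synthesized logic:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise model of the Ethernet CRC32: reflected polynomial 0xEDB88320,
 * initial register value 0xFFFFFFFF, final bit inversion. */
uint32_t crc32_eth(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                 /* feed in the next byte */
        for (int b = 0; b < 8; b++)     /* eight LFSR steps per byte */
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;                        /* inverted register = FCS value */
}
```

The standard CRC-32 check value of the ASCII string "123456789" is 0xCBF43926, which is a convenient sanity check for any reimplementation.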
2. Encryption Algorithm
Finally, the most awaited feature is about to be implemented. First of all, for simplicity,
it was chosen to operate on 8 bits at a time. The working principle is simple: the user
inputs a password during the configuration step. Then, the Fowler-Noll-Vo (FNV) hashing
function is applied to that input, and the resulting 32-bit hash is used to encrypt the
8-bit data.
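The text does not specify which FNV variant is used; assuming the common 32-bit FNV-1a, a minimal sketch of the hash computed by the PS could look as follows (the function name is illustrative):

```c
#include <stdint.h>

/* 32-bit FNV-1a hash of a NUL-terminated password string.
 * (FNV-1a is an assumption; the exact variant is not stated in the text.) */
uint32_t fnv1a_32(const char *s)
{
    uint32_t h = 2166136261u;       /* FNV offset basis */
    while (*s) {
        h ^= (uint8_t)*s++;         /* xor in the next octet */
        h *= 16777619u;             /* multiply by the FNV prime */
    }
    return h;
}
```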
The FNV hashing function [63] is chosen because it seems a good candidate for future
development. At the moment, the function is executed by the PS, which computes the hash
and transmits it to the PL through a dedicated GPIO channel. In the future, however, the
algorithm can be mapped onto logic cells, so that the board can work on its own. The goal
of this encryption implementation is to provide a Proof-of-Concept solution, even though
it has no true cryptographic strength. The FPGA should deal smoothly with both encryption
and decryption, since the packets are supposed to be used sooner or later by someone else
with the same system, configured with the same password. The goal is therefore to find a
quick function, able to modify the input data, but also able to retrieve the original
information. There is a class of functions, referred to as symmetric functions, whose
defining property is shown in (6.2):

f (f (a, b), b) = a    (6.2)
From the cryptographic point of view this is a weakness; however, as said previously, the
goal is to design a Proof-of-Concept device first. The simplest logic operation that is
also a symmetric function is the exclusive or (XOR, symbol ⊕). Indeed:

a ⊕ b ⊕ b = a
On the basis of the previously defined principles, the procedure for encrypting the stream
of data in the Packet Processing Unit can be broken down into the following steps:

• The 32-bit FNV hash of the user password is taken as the primary key
• An 8-bit key is sliced out of the primary key
• Such 8-bit key is XORed with the stream of data, starting from the first byte after
the EtherType (payload)
The slicing of the primary key allows a maximum of 32 different keys before they become
redundant. The operation is performed by merely taking a subset of the whole key and
shifting its boundaries by one bit. Therefore, if the primary key bit representation is
[31:0], the subsets will be [7:0], [8:1], [9:2] and so on, until the end of the key; the
bits then simply wrap around, and the sequence starts over. In order to make the whole
structure orderly and deterministic, the first byte of the payload is always encrypted
with the first subset of the key. To ensure this, a signal is wired to the corresponding
State Machine.

Figure 41: Architectural implementation of the encryption box, extracting data from the main
flow and multiplexing the outputs one stage later

In order to replace the plain payload with its encrypted form, the same strategy as for
the CRC is implemented: a fork is wired from a specific register in the chain to the
encryption box, and the output of the encryption box is multiplexed with the output of
the following register, as shown in figure 41. Of course, these components must be placed
before the CRC generator to ensure correct operation; otherwise, the CRC32 output would
not be consistent with the data.
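The slicing and XOR described above can be sketched in C as follows; `key_slice` and `xor_payload` are hypothetical names introduced for illustration, and the exact bit ordering is assumed from the [7:0], [8:1], [9:2] description:

```c
#include <stddef.h>
#include <stdint.h>

/* Byte i of the payload is XORed with bits [(i mod 32)+7 : (i mod 32)]
 * of the 32-bit primary key, wrapping around the key boundary
 * (assumed interpretation of the slicing described in the text). */
uint8_t key_slice(uint32_t key, size_t i)
{
    unsigned s = (unsigned)(i % 32);
    /* rotate the key right by s positions, then keep the lowest 8 bits */
    return (uint8_t)((key >> s) | ((uint64_t)key << (32 - s)));
}

/* Applying the same function twice restores the original payload,
 * since XOR is a symmetric operation. */
void xor_payload(uint8_t *payload, size_t len, uint32_t key)
{
    for (size_t i = 0; i < len; i++)
        payload[i] ^= key_slice(key, i);
}
```

Because the same keystream is regenerated from the first payload byte, calling `xor_payload` a second time with the same key decrypts the data.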
Before explaining how to decrypt the data, one should take into account that, from an
external point of view, there is no evidence of encryption. Said differently, the packet
still has a regular preamble, MAC address declaration, and EtherType; moreover, the FCS
also confirms that the packet is valid. So how can the Packet Processing Unit detect an
encrypted packet and start decrypting it? That is why a clear signature must be provided,
and the EtherType field was considered the right place in which to store it. First of all,
if the user decides to use the encryption box, as a design choice only IPv4 packets will
be encrypted. This choice prevents the FPGA from interfering with the protocols that
ensure the correct mapping of a device on the network (e.g., the Address Resolution
Protocol, ARP). The EtherType of regular IPv4 packets is 0x0800, so a new EtherType is
needed to label the encrypted packets. Such a custom EtherType has to be unique, not taken
by other protocols; a suitable candidate is 0x1753. In conclusion, two additional registers
are included in the “supply chain” to allow the insertion of the new EtherType whenever
the encryption feature is enabled and an IPv4 packet is received. The reader should know
very well how to do this job, after the CRC generator and encryption box examples.
As soon as the leading State Machine detects the custom EtherType, it raises a flag to
decrypt the payload in the Packet Processing Unit. First, the original EtherType has to
be restored; then the payload is processed by the encryption box, which follows exactly
the same steps performed during encryption. If the password used to generate the key is
the same, the original data are guaranteed to be restored, since XOR is a symmetric
operation and the procedure is carried out in the same order. The FCS is then recomputed
accordingly, and the PHY will gladly accept the restored packet.
6.5. Final Hardware Design

The final hardware design looks like figure 42. Although the picture is hard to read at
a glance, the reader can find in it all the previously described components. Only a few
more words remain to be written about the reset management and some useful signals that
guarantee the correct interaction between the logic fabric, the PS and the user.
As soon as the system is powered up, all the peripherals instantiated in the design must
be reset, in order to achieve a defined initialization before running the main application.
Such a reset signal is usually asserted as soon as all the clocks are generated and stable.
Luckily, the Clock Generator instantiated in the design, mentioned in section 6.4.1, has
a useful output, namely locked. As written in [64], when the locked output is asserted,
it indicates that the output clocks are stable and usable by downstream circuitry. Hence,
this signal can be used to trigger the reset of all the peripherals. In particular, a
small state machine is defined, taking the locked signal as input and producing the reset
signal as output. As soon as the input is asserted, the state machine waits roughly 50 ns
before raising the reset signal, which lasts one clock cycle (8 ns).
Table 6.6: Control signals delivered by the GPIO interface

Signal Description
MASTER_CONTROLLER_ENABLE Enables the PPU as soon as the PS is done with configuration
IS_PL_WRITING_BRAM Switches the control of the BRAM between the PS (0) and the DSP (1)
IS_ICMP_KILLER Enables the corresponding feature in the PPU (see section 6.4.5)
IS_ETHERTYPE_BLACKLIST When high (1) the EtherType list is a Blacklist, when low (0) a Whitelist
IS_ETHERTYPE_SPECIAL_CASE Indicates that the EtherType 0xFFFF is a genuine entry of the list
IS_PORT_0_CYPHERING Enables payload encryption out of Ethernet Port 0
IS_PORT_1_CYPHERING Enables payload encryption out of Ethernet Port 1
Although it is not easy to find one’s bearings, the first signal the reader should pick
out of figure 42 is the so-called MASTER_CONTROLLER_ENABLE. This signal is added to the
control logic of the FIFOs and, in general, of the whole Packet Processing Unit. If the
software is still in the configuration phase, the PPU should not be active; that is why,
at the end of the configuration phase, this signal is raised and propagated through the
GPIO to the PL. The link_status signal described in section 5.2.1 plays a similar role
in controlling the flow through the PPU: whenever one of the two interfaces is down, the
PPU must stop immediately. That is why this couple of signals is conveyed to a synchronizer
block, referred to as Sync_Master_Enable: since the PPU works synchronously, the reset
signal must also be synchronous with the main clock to avoid unexpected behaviors. This
concludes the set of control signals delivered by the GPIO interface; they are summarized
in table 6.6. As regards debug signals, there is nothing special to mention: they are all
routed to different LEDs on the board, and they are summarized in table 6.7.
6.6. Software Design

The source code used to configure the two external PHYs is not complicated. The base is
similar to the first implementation of section 5.4. A few tasks have to be performed by
the processor on the PS side, namely:

8. Encryption configuration

XGpio_Initialize(&GpioPtr, GPIO_DEVICE_ID);

where GPIO_DEVICE_ID is the device identifier of the GPIO peripheral. Then, for points
2 to 4 the reader can refer to section 5.4.
The user configures the MAC filter through an interactive User Interface (UI) over the
Universal Asynchronous Receiver-Transmitter (UART). As mentioned in the Physical Setup
section, the test workstation is connected to the board through UART via a USB connection;
such a workstation can therefore open a serial terminal to interact with the board via
the keyboard. The UI is designed so as to let the user insert the half MAC addresses
(OUIs) to be filtered, specifying for each of them whether it has to be included in the
Source list, the Destination list, or both. The UI checks for the existence of duplicates
and refuses to store them. In the end, the list is sorted by the embedded C quicksort
algorithm [65] and finally encoded into the corresponding BRAMs, one per Packet
Processing Unit.
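As an illustration of the encoding step, assuming the word layout described in section 6.4 (24-bit OUI in bits [23:0], with the 25th and 26th bits flagging membership in the Source and Destination lists), a BRAM entry could be packed as follows; the function name and the exact flag positions are assumptions:

```c
#include <stdint.h>

/* Packs one MAC-filter BRAM word: 24-bit OUI in bits [23:0],
 * Source-list flag in bit 24, Destination-list flag in bit 25
 * (bit positions are an assumption based on the description). */
uint32_t encode_mac_entry(uint32_t oui, int in_source, int in_dest)
{
    return (oui & 0x00FFFFFFu)
         | ((uint32_t)(in_source ? 1 : 0) << 24)
         | ((uint32_t)(in_dest   ? 1 : 0) << 25);
}
```

With this layout, filling the memory with 0x00FFFFFF (the value used when the filter is disabled) corresponds to the OUI 0xFFFFFF with both list flags cleared, which matches neither list.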
The procedure for the EtherType Filter is carried out in almost the same way: the
difference is that a list of the standard entries is displayed on the screen, to help
the user identify the right protocol. Such a list is also reported in table 6.8.
Eventually, the user is asked for the list configuration, either blacklist or whitelist.
During the Encryption setup, the user is asked to insert a password, which is
double-checked to avoid unwanted inputs. Next, the user must indicate which Ethernet
port connects the board to the external world. Indeed, the encryption feature must not
be enabled on the port that connects the user’s workstation to the FPGA, because
otherwise the workstation would receive a stream of incomprehensible data. Both Packet
Processing Units, instead, are always able to decrypt any packet with a 0x1753 EtherType.
Lastly, the ICMP Killer setup is a simple yes/no question.