Project
University of Pennsylvania
Department of Electrical and Systems Engineering
System-on-a-Chip Architecture
1 Goal
Develop a compressor that can receive data in real time at modern Ethernet speeds and
compress it into memory using deduplication and compression. Specifically, we’ll look at
Content-Defined Chunking to break the input into chunks, SHA-256 (or SHA3-384) hashes
to screen for duplicate chunks, and LZW compression to compress non-duplicate chunks.
For full points, your implementation goal is guaranteed real-time support of
400 Mb/s [1], but you may need to consider intermediate goals (e.g., 100 Mb/s, 200 Mb/s) along
the way. Slower designs will receive partial credit for the performance portion of the project
grade.
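To get a feel for what the throughput target implies, the arithmetic can be sketched as below. The helper names are illustrative and are not part of the provided code. The clock frequency is a hypothetical parameter; this handout does not fix one.

```cpp
#include <cassert>
#include <cstdint>

// Convert a line rate in megabits per second to bytes per second.
// Assumes decimal units: 1 Mb = 1,000,000 bits.
double bytes_per_second(double mbps) {
    assert(mbps > 0.0);
    return mbps * 1e6 / 8.0;
}

// Average time available to handle one input byte, in nanoseconds.
double ns_per_byte(double mbps) {
    return 1e9 / bytes_per_second(mbps);
}

// Average number of clock cycles available per input byte for a stage
// running at clock_mhz (hypothetical value, chosen by the designer).
double cycles_per_byte(double mbps, double clock_mhz) {
    assert(clock_mhz > 0.0);
    return clock_mhz * 1e6 / bytes_per_second(mbps);
}
```

At 400 Mb/s the input arrives at 50,000,000 bytes per second, so each stage has 20 ns per byte on average. With an assumed 150 MHz clock that is 3 cycles per byte. The 800 Mb/s stretch goal halves both budgets.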
[Figure: encoder pipeline. Ethernet input -> Chunk -> SHA -> Find Chunk? -> if found: Send -> Ethernet output; if not found: LZW -> Send -> Ethernet output.]
This is a 5-week project assignment; the intent is to allow you to plan and execute a significant,
open-ended design exploration and mapping. You will not achieve the implementation
goal or the course learning goals by trying to do this in one week. We give you milestones to
help provide some structure, but the milestones are minimal and doing the minimum to hit
the milestone each week will be insufficient to get you where you need to be at the end. We
are giving you flexibility in planning and ordering rather than lock-step specifying exactly
what you need to do each week.
• Project work is done in teams of 3. You select partners during the first week.
• Collaboration between teams is limited as specified on the course web page.
• Milestone writeups and the final report are team writeups.
• The spirit of this exercise is to optimize the SoC mapping of the algorithm. As such,
explorations of alternate solutions that change the algorithms and generally optimize
the solution for hardware and software are out of scope. Explorations that tweak or
tune the algorithms slightly to better exploit the SoC hardware are potentially in scope.
[1] 800 Mb/s is a stretch goal for bonus points.
ESE5320 Fall 2023
2 Final Report
The final report is a team writeup. There will be one turn-in per team.
1. Provide an xclbin, OpenCL host code executable, and decoder executable for your
encoder.
2. Your compression program (OpenCL host code) should take one argument:
• the file name where the program should store the compressed data.
Your program should assume that encoder.xclbin is in the same directory as the host
executable.
4 Milestones
We will provide precise requirements for milestones each week. These may include a few
exercises to help prepare you for questions that may be on the final, in addition to the
project-specific components. Milestones and feedback feed into the final report. In most cases, the
milestones can serve as a first draft of a component of your report, and the feedback we give
you will help provide guidance on how to refine it for the report.
5 Components
The components we will use are standard enough that the Wikipedia pages are useful, and
there are several other nice tutorial blog posts out there. Here’s a roundup of starting points.
You may want to experiment with some of the parameters when tuning your implementation.
7 Examples of Use
1. Yan Zhang, Nirwan Ansari, Mingquan Wu, and Heather Yu. “On Wide Area Network
Optimization.” In IEEE Communications Surveys & Tutorials, vol. 4, issue 4, pp.
1090–113, Oct. 2013. http://ieeexplore.ieee.org/document/6042388/ Sections
III A and B survey the role of compression and decompression in optimizing WAN
data traffic.
2. Ashok Anand, Chitra Muthukrishnan, Aditya Akella, and Ramachandran Ramjee.
“Redundancy in Network Traffic: Findings and Implications.” In Proceedings of ACM
SIGMETRICS/Performance, 2009.
https://www.microsoft.com/en-us/research/publication/redundancy-in-network-traffic-findings-and-implications/
Characterizes redundancy in network traffic.
3. Athicha Muthitacharoen, Benjie Chen, and David Mazieres. “A Low-Bandwidth
Network File System.” In Proceedings of the Eighteenth ACM Symposium on Operating
Systems Principles (SOSP ’01), pp. 174–187, 2001.
https://dl.acm.org/citation.cfm?id=502052 Use of deduplication for optimizing a
file system operating across a low-bandwidth link.
8 Other Resources
• Xilinx HLS Tiny Tutorials https://github.com/Xilinx/HLS-Tiny-Tutorials. These
show examples of how to write code for specific things in Vitis HLS.
• Vitis Tutorials https://xilinx.github.io/Vitis-Tutorials/master/docs/README.html.
These are written for data center platforms (AWS F1/Alveo cards), but the
decomposition and core routines should still be useful.
• For examples of other applications that have been converted to run with PL accelerators
in HLS, you can look at:
– the Rosetta Stone Benchmarks https://github.com/cornell-zhang/rosetta
from Cornell. These were written for SDSoC, but the decomposition and core
routines should still be useful.
– Parallel Programming for FPGAs by Kastner et al.
http://kastner.ucsd.edu/wp-content/uploads/2018/03/admin/pp4fpgas.pdf
9 Encoded Data Storage
For your timing runs, store the compressed data in DRAM. The largest test case we provide
is under 200MB. Copy your encoded data from DRAM to the SD-Card outside of your test
timing. You will need to plan how you divide the DRAM among buffers, chunk hash storage,
and compressed output. During early testing, before adding external input, you may also
want to start with the uncompressed data in DRAM.
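As a starting point for the DRAM plan, a rough estimate of chunk hash storage can be sketched as follows. The average chunk size and the per-entry size used in the example are assumptions for illustration only; use the values from your own design.

```cpp
#include <cassert>
#include <cstdint>

// Estimated number of chunks for an input of input_bytes, given an
// average chunk size. Rounds up so a partial last chunk is counted.
uint64_t estimated_chunks(uint64_t input_bytes, uint64_t avg_chunk_bytes) {
    assert(avg_chunk_bytes > 0);
    return (input_bytes + avg_chunk_bytes - 1) / avg_chunk_bytes;
}

// Estimated DRAM needed for chunk hash storage if every chunk is unique
// and each table entry takes bytes_per_entry bytes.
uint64_t hash_storage_bytes(uint64_t input_bytes,
                            uint64_t avg_chunk_bytes,
                            uint64_t bytes_per_entry) {
    return estimated_chunks(input_bytes, avg_chunk_bytes) * bytes_per_entry;
}
```

For example, a 200,000,000-byte input with an assumed 4096-byte average chunk and an assumed 36-byte entry (32-byte SHA-256 digest plus a 4-byte chunk index) gives 48,829 chunks and about 1.76 MB of hash storage, which is small next to the input and output buffers.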
11 Chunk Validation
Using an SHA-256 signature, the probability of having a collision where two chunks share
the same signature is extremely low. For the project, we will consider equality of SHA-256
signatures adequate to determine that a chunk is a duplicate. This means you do not need
to read back the chunk and validate that it is, in fact, identical. If you had terabytes of
data, or if the consequences of error were high, you would want to perform the check. This
only applies to the full 256b signature. If you use smaller hashes for indexing, you will still
need to validate that there is a match on the 256b signature.
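The rule above can be sketched as a small lookup table that indexes on a short hash but declares a duplicate only when the full 256b signature matches. The class name, the bucket layout, and the choice of the first three digest bytes as the short index are all assumptions made for this illustration. A hardware-friendly design would likely use a fixed-size structure instead of growable buckets.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Digest = std::array<uint8_t, 32>;  // full 256b SHA-256 signature

// Hypothetical chunk table: a short index selects a bucket, and the
// full digest is compared before a chunk is reported as a duplicate.
class ChunkTable {
public:
    explicit ChunkTable(unsigned index_bits)
        : mask_((1u << index_bits) - 1u), buckets_(1u << index_bits) {
        assert(index_bits >= 1 && index_bits <= 24);
    }

    // Returns the stored chunk index if the full digest matches, else -1.
    int64_t lookup(const Digest& d) const {
        for (const Entry& e : buckets_[short_index(d)]) {
            if (e.digest == d) return static_cast<int64_t>(e.chunk_id);
        }
        return -1;
    }

    // Records a new (non-duplicate) chunk under its digest.
    void insert(const Digest& d, uint32_t chunk_id) {
        buckets_[short_index(d)].push_back(Entry{d, chunk_id});
    }

private:
    struct Entry {
        Digest digest;
        uint32_t chunk_id;
    };

    // Short index taken from the first three digest bytes (an assumption).
    uint32_t short_index(const Digest& d) const {
        uint32_t v = static_cast<uint32_t>(d[0]) |
                     (static_cast<uint32_t>(d[1]) << 8) |
                     (static_cast<uint32_t>(d[2]) << 16);
        return v & mask_;
    }

    uint32_t mask_;
    std::vector<std::vector<Entry>> buckets_;
};
```

Two chunks that land in the same bucket but differ anywhere in the 256b signature are treated as different chunks, so a short-index collision never produces a false duplicate.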
12 Compressed Format
• Compressed stream is a sequential concatenation of chunks.
• Each chunk has a 32b header that identifies it as Duplicate Chunk or LZW Chunk.
Because we are using a uniform length of $\lceil \log_2(\mathit{MaxChunkSize}) \rceil > 8$ bits for encoding codewords,
it is possible that an encoded chunk could be longer than the unencoded chunk. To deal
with this, real implementations will often compare the length of the LZW encoded chunk to
the raw chunk length and send the unencoded data if it is smaller. For simplicity, we are not
asking you to perform that optimization (and our encoded chunk format does not support
it).
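The expansion effect can be checked with a little arithmetic, sketched below. The function names are illustrative, and the maximum chunk size of 8192 bytes in the example is an assumed value, not one fixed by this section.

```cpp
#include <cassert>
#include <cstddef>

// Bits per codeword: ceil(log2(max_chunk_size)).
unsigned codeword_bits(size_t max_chunk_size) {
    assert(max_chunk_size > 1);
    unsigned bits = 0;
    size_t v = max_chunk_size - 1;
    while (v > 0) {  // count the bits needed to represent max_chunk_size - 1
        ++bits;
        v >>= 1;
    }
    return bits;
}

// Bytes occupied by n_codes fixed-width codewords packed back to back,
// counting only whole bytes (integer division drops a partial last byte).
size_t packed_bytes(size_t n_codes, unsigned bits) {
    return (n_codes * bits) / 8;
}
```

With an assumed maximum chunk size of 8192, each codeword takes 13 bits. A 256-byte chunk in which no byte pair repeats makes LZW emit one codeword per input byte, so the encoded payload is 256 × 13 = 3328 bits, or 416 bytes, which is larger than the 256-byte raw chunk.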
13 Compression Goal
Changes in parameters (such as average chunk size) will change deduplication and compression
results. Furthermore, you may make tradeoffs in implementation that impact compression
ratios. You may choose to sacrifice some level of compression for throughput. You
should try to maximize deduplication and trade off only a modest (e.g., 10%) level of chunk
compression. Show the tradeoffs you explored in your detailed design-space exploration section,
with supporting graphs as appropriate.
14 Supplied Resources
• Laptop code and/or scripts to send data to your Ultra96 at a fixed (tunable) frequency
(Section 15).
– Your encoder needs to work with the decoder. Getting the encoder to produce
encodings according to the compressed format (Sec. 12) may take some
experimentation. You may want to add a verbose debugging option to the decoder to
have it print out what values it is getting for the various fields to expedite the
debugging of your encoder.
• We provide several datasets that you can use for testing. We encourage you to create
your own simple datasets for unit testing. Note that the tar-files are not meant to be
unpacked. Following are the datasets that we provide.
– The Little Prince unencoded. As an encoding example, we also provide The Little
Prince compressed.
– Simple example. This archive contains three files, two of which are identical.
– Benjamin Franklin’s autobiography. This is a simple text file that you can modify
for your own purposes. The current file probably has few duplicate areas. (390
KB)
– GTK+ source code. This file contains several subsequent versions of the GTK+
source, which provides ample opportunity for deduplication. (177 MB)
– Linux source code. This file contains several subsequent versions of the source.
(191 MB)
– Several Linux kernels. As opposed to the other data sets, this set contains
predominantly binary data. (66 MB)
Note: you should take these as examples, not a definitive list of test cases. In particular,
you should create many other focused test examples to facilitate your debugging and
validation.
(a) Communication over ethernet. If you are using Windows, follow this document.
If you are on a Mac, use the following instructions:
i. Download and install the AX88179 driver on the Mac (this is the driver for the
USB-ethernet adapter):
ii. Then, on the Mac, run screen /dev/tty.usbserial-1234_oj11 115200
to open the serial console for the Ultra96. Assign the IP address to the Ultra96
as you normally would (see Homework 6 instructions). You can exit the
serial console by pressing CTRL-A CTRL-\ and then y.
iii. Once the ethernet driver is installed on your Mac, you can assign it an IP address
using sudo ifconfig en4 10.10.7.2 netmask 255.0.0.0 where en4
is the interface name, which you can find from ifconfig.
If you are using a Linux machine in Detkin/Ketterer as your host machine, use the
command: ifconfig_eth1. Note that if that command didn’t work, you might
have to use ifconfig_eth2 or ifconfig_eth3. These commands are equivalent
to the command:
sudo ifconfig ethX 10.10.7.2 netmask 255.0.0.0
This is the only way to assign an IP to the USB-ethernet device in the Linux
machines in Detkin and Ketterer.
(b) Install iperf3 on your computer: https://iperf.fr/iperf-download.php
(c) Find out what kind of USB ports you have in your computer: USB-2.0 or USB-3.0.
3. Open a terminal in your computer and issue the command: /usr/bin/iperf3 -c 10.10.7.1
(assuming 10.10.7.1 is the IP address you assigned to the Ultra96).
4. You should see outputs similar to the following if you connected to a USB-3.0 port in
your computer:
This tells you that the upper bound on the throughput achieved by a placeholder
receiver is around 895 Mb/s limited by the Ugreen ethernet-to-USB interfaces.
cd ese532_code/
git pull origin master
The code you will use for this section is in the project directory. The directory
structure looks like this:
project/
Client/
client.cpp
Decoder/
Decoder.cpp
Server/
encoder.cpp
encoder.h
event_timer.cpp
event_timer.h
server.cpp
server.h
sourceMe.sh
LittlePrince.txt
Makefile
source /opt/Xilinx/Vitis/2020.2/settings64.sh
And make sure PLATFORM_REPO_PATHS is set to the platform you downloaded.
3. You can either use make or the Vitis GUI to compile your code. Use make all to
compile all the targets client, encoder, and decoder.
5. Our basic model will be communication between two systems—your computer and the
Ultra96—over ethernet. Your computer will send packets at a fixed rate. The Ultra96
will receive the data and compress it. Figure 5.5 from homework 5 shows you the setup
and cabling. Since the first system is sending data at a fixed rate, it is necessary for
the receiver to compress the data at that rate or data will be lost. We provide the
code for the sender (Client/client.cpp). Your project is connected to the receiver
(Server/encoder.cpp). And then you can use the decoder (Decoder/Decoder.cpp)
to verify that you can recover the original, unencoded file from the compressed file.
6. Let’s run the given code with the system we have set up. After compiling the code, copy
the encoder binary and LittlePrince.txt to the Ultra96 as follows (adjust the commands if
you are not using Linux):
And then open the Ultra96 terminal and run the encoder with ./encoder. The pro-
gram waits for a packet to arrive. Open a terminal in your computer and issue the
following command:
root@ultra96v2-2021-1:~# ./encoder
setting up sever...
server setup complete!
write file with 14247
--------------- Key execution times ---------------
Reading packets and processing : 0.228 ms
ip is set to 10.10.7.1
filename is LittlePrince.txt
bytes_read 14247
You can verify the output by doing the following in the Ultra96:
Note that our example is just writing the packets to a file. Your project will process
these packets with the encoder pipeline and write a compressed output.
7. We’ll now describe what’s happening in the client and the server:
(a) Packet Layout: We will be sending data via UDP datagrams. Linux supports
the UDP protocol and receiving packets from the client can be done easily using
Linux IP. The code provided will direct you on how to setup your compression
pipeline to listen as well as handle incoming packets. The maximum size of a
packet will be 16K Bytes. The header of the packet will be 2 bytes consisting of a
done bit denoting that all of the data has been transmitted as well as the length
of the data contained inside the packet.
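A minimal sketch of decoding such a 2-byte header is shown below. The exact bit layout is not spelled out here, so the sketch assumes a little-endian 16-bit header whose top bit is the done flag and whose low 15 bits hold the length. Check the provided client and server code for the actual layout before relying on this. The names PacketHeader, kDoneMask, and parse_header are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct PacketHeader {
    bool done;        // true when this is the last packet of the transfer
    uint16_t length;  // number of data bytes carried in the packet
};

constexpr size_t kHeaderBytes = 2;
constexpr uint16_t kDoneMask = 0x8000;  // ASSUMED position of the done bit

// Decode the header at the start of a received datagram.
// ASSUMPTION: byte 0 is the low byte, byte 1 is the high byte.
PacketHeader parse_header(const uint8_t* packet, size_t packet_bytes) {
    assert(packet != nullptr && packet_bytes >= kHeaderBytes);
    uint16_t raw = static_cast<uint16_t>(packet[0] | (packet[1] << 8));
    PacketHeader h;
    h.done = (raw & kDoneMask) != 0;
    h.length = static_cast<uint16_t>(raw & ~kDoneMask);
    return h;
}
```

Under these assumptions the payload starts at packet + kHeaderBytes, and a receiver would keep reading datagrams until it sees one with done set.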
You are of course free to write your own application. For more information on
how to receive the data you can refer to the man page:
https://linux.die.net/man/2/recvfrom
Alternatively, you can create your own client and server using the DPDK library.
Examples of how to write client and server code using DPDK can be found in the
following links:
• https://doc.dpdk.org/guides/index.html
• https://zenhox.github.io/2018/01/25/dpdk-pktSR/
-s option specifies the sleep time or delay between packets (in microseconds)
-i option specifies ip address to send to
-f option specifies what file to send
-b option specifies the block size
For the project report and for project milestones where you characterize your throughput,
you should adjust the -s argument until your design fails. Report your maximum
throughput as the throughput associated with the smallest value of -s on which your
design successfully receives and correctly compresses the input. Measure the actual
throughput by measuring the time it takes for the client to send the file. You can use
/usr/bin/time to measure the time.
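The throughput calculation can be sketched as follows. The function names are illustrative. The nominal rate assumes that the -b block size is given in bytes, which this section does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Measured throughput in Mb/s from the file size and the elapsed
// wall-clock time of the client (for example, from /usr/bin/time).
double throughput_mbps(uint64_t bytes_sent, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return static_cast<double>(bytes_sent) * 8.0 / elapsed_seconds / 1e6;
}

// Nominal send rate in Mb/s implied by the client options alone:
// one block of block_bytes every sleep_us microseconds.
// ASSUMPTION: the -b block size is in bytes.
double nominal_send_mbps(uint32_t block_bytes, uint32_t sleep_us) {
    assert(sleep_us > 0);
    // bits per microsecond is numerically equal to Mb/s
    return static_cast<double>(block_bytes) * 8.0 / static_cast<double>(sleep_us);
}
```

The nominal rate is only an upper bound, because sending each packet also takes time. That is why the reported number should come from the measured elapsed time: a 50,000,000-byte file sent in 1.0 s corresponds to 400 Mb/s.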