Manuscript
Electrical and Computer Engineering, Florida Polytechnic University, Lakeland, FL 33805, USA;
[email protected]
† Presented at the 10th International Electronic Conference on Sensors and Applications (ECSA-10), 15–30
Abstract: In this paper, we present a RISC-V RV32I based System on Chip (SoC) design approach
using the Vivado High Level Synthesis (HLS) tool. The proposed approach consists of three separate
levels: The first one is an HLS design and simulation purely in C++. The second one is a Verilog
simulation of the HLS generated Verilog implementation of the CPU core, a RAM unit initialized
with a short assembly code, and a simple output port which simply forwards the output data to the
simulation console. Finally, the third level is an implementation and testing of this SoC on a low-cost
FPGA board (Basys3) running at a clock speed of 100 MHz. A sample C code is compiled using the
GNU RISC-V compiler tool chain and tested on the HLS generated RISC-V RV32I core as well. The
HLS design consists of a single C++ file with less than 300 lines, a single header file, and a testbench
in C++. Our design objectives are (1) The C++ code should be easy to read for an average engineer,
and (2) The coding style should dictate minimal area, i.e., minimal resource utilization, without
significantly degrading the code readability. The proposed system is implemented for two different
I/O bus alternatives: (1) A traditional single clock cycle delay memory interface, and (2) The industry
standard AXI bus. We present timing closure, resource utilization, and power consumption estimates.
Furthermore, by using the open-source synthesis tool yosys, we generate a CMOS gate-level design
and provide gate count details. All design, simulation, and constraint files are publicly available in a
GitHub repo. We also present a simple dual-core SoC design, but detailed multi-core designs and
other advanced features are planned for future research.
Keywords: High Level Synthesis; RISC-V; System on Chip; FPGA; multi-core architectures
for the HLS based approach adopted in this work. The HLS based approach can be quite
useful for rapid prototyping of complex ideas, especially for systems with complex state
machines. To the best of the author’s knowledge, there is limited published work in which an
HLS based approach is used for a RISC-V core design. In [7], an HLS design is presented,
but the source code is split into multiple files, making it difficult to read. What is presented
in this work is a single file design which is relatively short, easily readable, and yet suitable
for an FPGA implementation with clock speeds of 100 MHz. Open source RISC-V cores can
be quite useful for computer architecture education too, see [8]. The proposed HLS RISC-V
RV32I core source code is available in the public GitHub repo [? ]. Finally, the author
would like to cite [? ] as a source of inspiration for this work.
This paper is organized as follows: In Section 2, we summarize the RISC-V RV32I
instruction set architecture. In Section 3, a high level synthesis approach for design and
simulation is presented. Verilog simulations of our RISC-V SoC are presented in Section 4,
CMOS gate-level design using the open-source synthesis tool yosys and gate count details
are given in Section 5, and the FPGA implementation and testing are presented in Section 6.
In Section 7, a sample C program is used for testing the HLS generated core. A multi-core
RISC-V SoC approach is outlined in Section 8, and finally some concluding remarks are
made in Section 9.
Table 1. Cont.
// Write strobe
#define wstrb (*pstrb)
// Register file
arch_t reg_file[REGFILE_SIZE];
arch_t pc = 0;
// Decode
opcode_t opcode = insn(6,0);
...
// Execute
switch (opcode) {
case OPCODE_R:
case OPCODE_IA:
switch(...) {
...
}
break;
...
}
// Branch handling
}
}
As seen in Outline I, immediately after reset the program counter and all of the
registers are initialized to zero. There is an infinite while loop which will be exited if
an ECALL or EBREAK instruction is executed or an unaligned memory access is requested,
basically causing the CPU core to halt.
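The halt conditions described above can be sketched in plain C++ as follows. This is a minimal illustration, not the author's code: the helper names `is_unaligned` and `should_halt` are hypothetical, and the sketch assumes natural alignment (an access of 1, 2, or 4 bytes must start at a multiple of its size). The ECALL and EBREAK encodings are the ones defined by the RV32I base ISA.

```cpp
#include <cstdint>

// SYSTEM-opcode encodings from the RV32I base ISA.
constexpr uint32_t INSN_ECALL  = 0x00000073u;
constexpr uint32_t INSN_EBREAK = 0x00100073u;

// Hypothetical helper: true when an access of size_bytes (1, 2 or 4)
// at byte address addr is not naturally aligned.
constexpr bool is_unaligned(uint32_t addr, uint32_t size_bytes) {
    return (addr & (size_bytes - 1u)) != 0u;
}

// Hypothetical helper: true when the main loop should be exited.
// For instructions that do not access memory, pass size_bytes = 1.
constexpr bool should_halt(uint32_t insn, uint32_t addr, uint32_t size_bytes) {
    return insn == INSN_ECALL || insn == INSN_EBREAK ||
           is_unaligned(addr, size_bytes);
}
```

For example, `should_halt(0x0000a103u, 1022u, 4u)` is true because a 4-byte load at byte address 1022 is not word aligned.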
The HLS tool converts this while loop to a state machine with 11 states using
one-hot encoding. Inside the loop, we have the usual instruction fetch, decode, execute,
write-back and branch handling. For example, insn = mem[pc >> 2] will be synthesized as
a memory read operation, and opcode = insn(6,0) will be synthesized as selecting the
least significant 7 bits of the 32-bit value read from the memory. Note that, by using the
operator overloading features of C++, we are able to express slicing and concatenation in
C++, see [? ? ] for full details. For example, in the instruction decode stage, we have the
lines
which correspond to generating the 32-bit immediate value for various types of instructions.
Note that ap_uint<p> is used for p-bit unsigned integers, insn(p,q) corresponds to
slicing, and ( ... , ... , ... ) corresponds to concatenation. These are possible
because of the standard operator overloading features of C++. Note that the C simulation
semantics and the hardware synthesis semantics are different.
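To make the slicing and concatenation idea concrete without the HLS headers, the following sketch rebuilds the RV32I immediates in portable C++. It is an illustration under stated assumptions, not the author's decode code: `bits(x, hi, lo)` is a hypothetical stand-in for the HLS slice insn(hi, lo), shifts and ORs stand in for concatenation, and the bit positions follow the RV32I instruction formats.

```cpp
#include <cstdint>

// Hypothetical stand-in for the HLS slice insn(hi, lo): bits hi..lo of x.
inline uint32_t bits(uint32_t x, unsigned hi, unsigned lo) {
    const unsigned width = hi - lo + 1u;
    const uint32_t mask = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (x >> lo) & mask;
}

// Sign-extend the low `width` bits of x to a 32-bit signed value.
inline int32_t sext(uint32_t x, unsigned width) {
    const uint32_t sign = 1u << (width - 1u);
    return static_cast<int32_t>((x ^ sign) - sign);
}

// I-type: imm[11:0] = insn[31:20]
inline int32_t imm_i(uint32_t insn) { return sext(bits(insn, 31, 20), 12); }

// S-type: imm[11:5] = insn[31:25], imm[4:0] = insn[11:7]
inline int32_t imm_s(uint32_t insn) {
    return sext((bits(insn, 31, 25) << 5) | bits(insn, 11, 7), 12);
}

// B-type: imm[12|10:5] = insn[31|30:25], imm[4:1|11] = insn[11:8|7]
inline int32_t imm_b(uint32_t insn) {
    return sext((bits(insn, 31, 31) << 12) | (bits(insn, 7, 7) << 11) |
                (bits(insn, 30, 25) << 5) | (bits(insn, 11, 8) << 1), 13);
}

// J-type: imm[20|10:1|11|19:12] = insn[31|30:21|20|19:12]
inline int32_t imm_j(uint32_t insn) {
    return sext((bits(insn, 31, 31) << 20) | (bits(insn, 19, 12) << 12) |
                (bits(insn, 20, 20) << 11) | (bits(insn, 30, 21) << 1), 21);
}

// U-type: imm[31:12] = insn[31:12], low 12 bits are zero
inline uint32_t imm_u(uint32_t insn) { return insn & 0xFFFFF000u; }
```

For the instruction word 0x3fc00093 (li x1,1020), `bits(insn, 6, 0)` gives the opcode 0x13 and `imm_i(insn)` gives 1020.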
There are various switch statements, which are synthesized as wide multiplexers.
Nested switch statements correspond to cascaded multiplexers. To make sure that a minimal
number of adders, comparators, barrel shifters, etc. are synthesized, and that no hardware
resources are wasted or underutilized, we first define the program variables src1, src2, res
and then write a series of switch statements. This coding style may look a bit unusual, but
it is still highly readable, and is adopted purely for optimal hardware synthesis. In other words,
Eng. Proc. 2023, 56, 0 5 of 12
the C++ coding style used in HLS greatly affects the final generated hardware, and we tried
to keep a reasonable balance between C++ code readability and hardware optimality.
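The operand-selection style described above can be illustrated with a small sketch. It assumes the names src1, src2, and res from the text; the enum `AddUse` and the function `shared_add` are hypothetical and not taken from the author's source. The point is that the switch only selects operands, so the source contains a single addition; whether a given HLS tool maps this to exactly one adder depends on the tool and its settings.

```cpp
#include <cstdint>

// Hypothetical classification of the places where an addition is needed.
enum class AddUse { RegReg, RegImm, LoadStoreAddr, PcRelative };

// Select the operands first (multiplexers), then add once.
inline uint32_t shared_add(AddUse use, uint32_t rs1, uint32_t rs2,
                           uint32_t imm, uint32_t pc) {
    uint32_t src1 = 0;
    uint32_t src2 = 0;
    switch (use) {
    case AddUse::RegReg:         // e.g. ADD
        src1 = rs1; src2 = rs2; break;
    case AddUse::RegImm:         // e.g. ADDI
    case AddUse::LoadStoreAddr:  // effective address of a load or store
        src1 = rs1; src2 = imm; break;
    case AddUse::PcRelative:     // e.g. AUIPC, branch or jump target
        src1 = pc;  src2 = imm; break;
    }
    const uint32_t res = src1 + src2;  // the only '+' in this function
    return res;
}
```

Writing a separate `rs1 + imm`, `rs1 + rs2`, and `pc + imm` in each case branch would express the same behavior, but it leaves the sharing decision entirely to the tool.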
The HLS tool automatically generates Verilog files in human readable format, but
also allows C-simulation based testing using the file riscv32i_tb.cc. This C-simulation
testbench reads a text file of hexadecimal values in human readable format, initializes the
memory by using these values and passes the control to the cpu() function. Immediately
after return, all register values and the memory are dumped to separate text files. In
Figure 1, Vivado HLS C-simulation for the following short assembly program is given:
li x1,1020
sw x0,0(x1)
loop: lw x2,0(x1)
addi x2,x2,1
sw x2,0(x1)
j loop
Values stored in registers and memory as well as internal signals are displayed in the
debug window. Hexadecimal values for each instruction are written to the file mem.txt, and
the conversion is done by using an online assembler tool. See [? ] for full details.
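As a cross-check of the program above, the following sketch is a tiny instruction-level model that supports only the four instruction kinds the program uses (ADDI, LW, SW, JAL). It is not the author's cpu() function or testbench: the struct `MiniCpu`, the function `run_counter`, and the hand-assembled instruction words are assumptions made for this illustration (only the first word, 0x3fc00093, also appears in the text).

```cpp
#include <array>
#include <cstdint>

// Sample program, hand-assembled for this sketch (not copied from mem.txt).
const std::array<uint32_t, 6> kProgram = {{
    0x3fc00093u,  // li   x1,1020        (addi x1,x0,1020)
    0x0000a023u,  // sw   x0,0(x1)
    0x0000a103u,  // loop: lw x2,0(x1)
    0x00110113u,  // addi x2,x2,1
    0x0020a023u,  // sw   x2,0(x1)
    0xff5ff06fu   // j    loop           (jal x0,-12)
}};

// Sign-extend the low `width` bits of x.
inline int32_t sext(uint32_t x, unsigned width) {
    const uint32_t sign = 1u << (width - 1u);
    return static_cast<int32_t>((x ^ sign) - sign);
}

struct MiniCpu {
    std::array<uint32_t, 1024> mem{};  // 4 KB RAM as 1024 words
    std::array<uint32_t, 32> reg{};    // x0..x31, all zero after reset
    uint32_t pc = 0;

    // Execute one instruction. Returns false for anything outside the
    // ADDI / LW / SW / JAL subset needed by the sample program.
    bool step() {
        const uint32_t insn = mem[(pc >> 2) & 1023u];
        const uint32_t opcode = insn & 0x7fu;
        const uint32_t rd = (insn >> 7) & 31u;
        const uint32_t funct3 = (insn >> 12) & 7u;
        const uint32_t rs1 = (insn >> 15) & 31u;
        const uint32_t rs2 = (insn >> 20) & 31u;
        uint32_t next_pc = pc + 4u;
        uint32_t res = 0;
        bool write_rd = false;

        switch (opcode) {
        case 0x13: {  // OP-IMM: ADDI only
            if (funct3 != 0u) return false;
            res = reg[rs1] + static_cast<uint32_t>(sext(insn >> 20, 12));
            write_rd = true;
            break;
        }
        case 0x03: {  // LOAD: LW only
            const uint32_t addr = reg[rs1] + static_cast<uint32_t>(sext(insn >> 20, 12));
            if (funct3 != 2u || (addr & 3u) != 0u) return false;
            res = mem[(addr >> 2) & 1023u];
            write_rd = true;
            break;
        }
        case 0x23: {  // STORE: SW only
            const uint32_t imm = ((insn >> 25) << 5) | ((insn >> 7) & 31u);
            const uint32_t addr = reg[rs1] + static_cast<uint32_t>(sext(imm, 12));
            if (funct3 != 2u || (addr & 3u) != 0u) return false;
            mem[(addr >> 2) & 1023u] = reg[rs2];
            break;
        }
        case 0x6f: {  // JAL
            const uint32_t imm = ((insn >> 31) << 20) | (((insn >> 12) & 0xffu) << 12) |
                                 (((insn >> 20) & 1u) << 11) | (((insn >> 21) & 0x3ffu) << 1);
            res = pc + 4u;
            write_rd = true;
            next_pc = pc + static_cast<uint32_t>(sext(imm, 21));
            break;
        }
        default:
            return false;
        }
        if (write_rd && rd != 0u) reg[rd] = res;  // x0 stays zero
        pc = next_pc;
        return true;
    }
};

// Load the program at address 0, run `steps` instructions, and return
// the word stored at byte address 1020.
inline uint32_t run_counter(unsigned steps) {
    MiniCpu cpu;
    for (unsigned i = 0; i < kProgram.size(); ++i) cpu.mem[i] = kProgram[i];
    for (unsigned i = 0; i < steps; ++i) {
        if (!cpu.step()) break;
    }
    return cpu.mem[1020u >> 2];
}
```

The first two instructions run once and each loop iteration takes four instructions, so `run_counter(2 + 4 * k)` returns k; for example, `run_counter(22)` returns 5.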
localparam T=10;
logic clk, reset, start, done, idle, ready, we, ce, vld;
logic [3:0] wstrb;
logic[9:0] addr;
logic [31:0] val_i, val_o;
cpu U1(
.ap_clk(clk),
.ap_rst(reset),
.ap_start(start),
.ap_idle(idle),
.ap_ready(ready),
.mem_V_address0(addr),
.mem_V_ce0(ce),
.mem_V_we0(we),
.mem_V_d0(val_i),
.mem_V_q0(val_o),
.pstrb_V(wstrb)
);
initial clk = 0;
always #(T/2) clk = ~clk;
initial
begin
...
wait(idle==1);
$stop;
end
endmodule
The simulation testbench outline is given in Outline II, and the RAM with the I/O
devices is presented in Outline III. Basically, we have a simple system on chip consisting
of a single RISC-V RV32I core, a 4 KB RAM with single clock cycle read/write delay, and a
32-bit output port at memory address 0x0ff.
initial
$readmemh("C:/Users/onur/Desktop/MyWork/vivado/RISCV32I_HLS/mem.txt", ram);
//initial begin
// ram[0] = 32'h3fc00093; // li x1,1020
// ram[1] = ...
//end
endmodule
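The behavior of the RAM with its output port can be modeled on a host machine with a few lines of C++. This is a sketch under assumptions, since Outline III is only partly visible here: it assumes that 0x0ff is a word address on the 10-bit address bus (byte address 1020), that the 4-bit write strobe enables one byte lane per bit, and that a write to the port address both updates the RAM word and prints the low byte. The names `RamWithPort` and `kPortWordAddr` are hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Assumption: the output port sits at word address 0x0ff.
constexpr uint32_t kPortWordAddr = 0x0ffu;

// Host-side model of a 4 KB RAM (1024 words) with byte write strobes and
// an output port that forwards the low byte to a console string.
struct RamWithPort {
    std::array<uint32_t, 1024> ram{};
    std::string console;

    // One write cycle: bit b of wstrb enables byte lane b of din.
    void write(uint32_t word_addr, uint32_t din, uint32_t wstrb) {
        word_addr &= 1023u;
        uint32_t mask = 0;
        for (unsigned b = 0; b < 4u; ++b) {
            if (wstrb & (1u << b)) mask |= 0xffu << (8u * b);
        }
        ram[word_addr] = (ram[word_addr] & ~mask) | (din & mask);
        if (word_addr == kPortWordAddr && wstrb != 0u) {
            // Mirrors the idea of printing din[7:0] as a character.
            console.push_back(static_cast<char>(din & 0xffu));
        }
    }

    uint32_t read(uint32_t word_addr) const { return ram[word_addr & 1023u]; }
};
```

With this model, writing 0xAABBCCDD with strobe 0xF and then 0x11223344 with strobe 0x1 to the same word leaves 0xAABBCC44 in memory, because only byte lane 0 is enabled in the second write.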
In Figure 2, Verilog simulation results are shown. We are using the assembly program
given in the previous section, which basically writes the values 0, 1, 2, ... to the
address 0x0ff. The program counter PC is shown in the timing diagram, and the values
written to the output port at address 0x0ff are shown both in the simulation console and
the timing diagram. There is a specific reason why $write("%c", din[7:0]) is used for
the memory mapped I/O at address 0x0ff. If we use a C-compiler, and implement putc()
as a write to the I/O address 0x0ff, then all printf(...) and cout << ... will write to
the Verilog simulation console. This allows testing of more complex C/C++ programs with
the HLS generated RISC-V core.
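A character output routine of this kind can be sketched as below. The function names `port_putc` and `port_puts` are hypothetical, and the port is passed as a pointer so that the same code can be exercised on a host machine; on the target, the pointer would be the memory-mapped output port. How putc() is hooked into the C library depends on the tool chain and is not covered by this sketch.

```cpp
#include <cstdint>

// Write one character to a memory-mapped 32-bit output port.
// 'volatile' keeps the compiler from removing or merging the stores.
inline void port_putc(volatile uint32_t* port, char c) {
    *port = static_cast<uint8_t>(c);
}

// Write a NUL-terminated string one character at a time.
inline void port_puts(volatile uint32_t* port, const char* s) {
    while (*s != '\0') {
        port_putc(port, *s++);
    }
}
```

On a host, the port can be replaced by an ordinary variable, for example `volatile uint32_t fake_port = 0; port_puts(&fake_port, "RISC\n");`, after which `fake_port` holds the code of the last character written.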
In our simulation testbench, we also have a block RAM option, shown as SRAM. This
allows testing the HLS generated RISC-V core using block RAMs available on most Xilinx
FPGAs, see Figure 3.
Figure 3. Block RAM should have single clock cycle read/write delay.
and the following gate-count results are reported by the synthesis tool:
In summary, a total of 1377 D-type flip-flops are used, including the register file of
depth 32 and width 32. We have forced the synthesis tool to use only two-input
NAND and NOR gates, and with that constraint the total number of two-input NAND
gates is 7415, two-input NOR gates is 5219, and NOT gates is 1311.
[Figure 5 diagram. Blocks: RISCV_I32 (riscv_H1), SRAM_4K (Block RAM), CONCAT_0 (Concat), VAND_0 (Utility Vector Logic), SLICE_0, REG_0 (output port). Core signals: ap_clk, ap_rst, ap_start, mem_V_address0[9:0], mem_V_ce0, mem_V_we0, mem_V_d0[31:0], mem_V_q0[31:0], pstrb_V[3:0]. Block RAM port BRAM_PORTA: addra[9:0], clka, dina[31:0], douta[31:0], wea[3:0].]
Figure 5. Elaborated design of the RISC-V SoC. The blue box on the right corresponds to the
register file.
Resource utilization of the implemented design is 1078 LUT (5.18%), 326 FF (0.78%)
and 3% of the BRAM. The final system has a worst negative slack (WNS) of 1.41 ns for the setup
time with a 100 MHz clock. The power consumption is estimated as 81 mW at a 100 MHz clock.
Figure 6 shows the FPGA implementation of the SoC for the Basys3 board. Note that the
whole SoC design fits into a portion of the clock region X0Y0. The large rectangular block at
the center of Figure 6 is the 4 KB RAM used for the system on chip.
Figure 6. RISC-V SoC FPGA implementation for the Basys3 board fits into a portion of the clock
region X0Y0.
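The reported setup slack can be turned into a rough upper bound on the clock frequency. The sketch below is a first-order estimate only: it assumes the reported 1.41 ns is positive slack (timing is met), and that the critical path delay would stay the same if the design were re-implemented with a tighter clock constraint, which is not guaranteed. The function name is hypothetical.

```cpp
// First-order estimate: the critical path takes (period - slack), so the
// shortest usable period is about that long.
constexpr double estimated_fmax_mhz(double clock_period_ns, double setup_slack_ns) {
    return 1000.0 / (clock_period_ns - setup_slack_ns);
}

// 100 MHz clock (10 ns period) with 1.41 ns of setup slack:
// critical path of about 8.59 ns, i.e. roughly 116 MHz.
constexpr double kEstimatedFmaxMhz = estimated_fmax_mhz(10.0, 1.41);
```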
We use the same assembly program given in Section 3, and make sure that the hex
values corresponding to assembly instructions are loaded to the SoC RAM. After the
system is reset using the button btnC, the CPU core can be started using the button btnU.
Figure 7 shows a Basys3 board implementation of our RISC-V SoC with the output port
connected to the on-board LEDs. Note that bits 20 down to 13 of the 32-bit value written to
memory are routed to the I/O port using the slice block shown in Figure 4. The assembly
program given in Section 3 has a loop execution time of 170 ns, i.e., 17 clock cycles
at 100 MHz. The slicing block effectively slows down the visible counting speed so that
counting can be observed by the naked eye.
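The effect of the slice can be checked with a short calculation. The sketch assumes, as stated in the text, that the counter value increases by one every 170 ns; bit k of such a counter changes once every 2^k increments. The names below are hypothetical.

```cpp
// Loop execution time from the text: 17 cycles at 100 MHz.
constexpr double kLoopSeconds = 170e-9;

// Time between changes of bit `bit` of a counter that is incremented
// once per loop iteration.
constexpr double toggle_period_s(unsigned bit) {
    return kLoopSeconds * static_cast<double>(1ull << bit);
}

// Lowest LED (counter bit 13): 8192 * 170 ns = 1.39264 ms per change.
constexpr double kLowestLedPeriod = toggle_period_s(13);
// Highest LED (counter bit 20): 1048576 * 170 ns = 0.17825792 s per change.
constexpr double kHighestLedPeriod = toggle_period_s(20);
```

Without the slice, the lowest output bit would change every 170 ns, which is far too fast to see; with it, the upper LEDs change a few times per second.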
void main(void);
void main(void) {
*((volatile uint32_t*)OUTPORT) = 'R';
*((volatile uint32_t*)OUTPORT) = 'I';
*((volatile uint32_t*)OUTPORT) = 'S';
*((volatile uint32_t*)OUTPORT) = 'C';
*((volatile uint32_t*)OUTPORT) = '\n';
}
It is compiled with the GNU RISC-V compiler to generate the RAM image. As shown in
Outline III, we have a $readmemh to initialize the RAM for the Verilog simulation. Again as
shown in Outline III, all writes to address 0x0ff are forwarded to the simulation console
using the $write command. In summary, when the SoC is simulated with the RAM image
generated by the GNU RISC-V compiler, we see the string ‘RISC’ written to the console followed
by a newline, which serves as another verification of the H1 core. In a future version of the
paper, we will be using longer C programs for more comprehensive testing.
to implement byte, word, and double-word sized memory write operations respectively.
Note that, the bit-slicing operator (.,.) can be used both on the left and right-hand side
of expressions.
In Figure 8 we have a dual-core RISC-V RV32I system with 8K on-chip RAM, two
8-bit output ports, a 16-bit input port, and a single UART port. The H2 core does not have
a tightly coupled memory (TCM) inside the unit, but this will be addressed in a future
version of the paper. Basically, in the current implementation both cores are using the
on-chip static RAM over the AXI bus. All GPIOs and the UART unit are also on the AXI bus.
We have added a JTAG to AXI unit which can be used for debugging and initialization of
the on-chip static RAM. For this dual-core SoC to function properly, both cores should have
different reset vectors so that they can execute different programs independently.
[Figure 8 diagram. Blocks: cpu_0 and cpu_1 (riscv32i H2, Pre-Production), axi_smc, axi_bram_ctrl_0 (AXI BRAM Controller), axi_bram_ctrl_0_bram (Block Memory Generator), axi_gpio_0 and axi_gpio_1 (AXI GPIO). External signals: LED0[7:0], LED1[7:0], sw[15:0].]
Based on our preliminary results, we see that the dual-core RISC-V system shown in
Figure 8 does fit into a Basys3 board.
9. Conclusions
In this paper, we have presented a high level synthesis approach for RISC-V RV32I
system design. The CPU core is designed and simulated at the C level, then the HLS
generated Verilog code is tested with RAM and I/O devices at the Verilog simulation
level. Finally, the complete system on chip design with memory and I/O devices is
implemented and tested on a low-cost FPGA board. Timing closure, resource utilization,
and power consumption estimates are also presented. CMOS gate-level design and gate
counts are generated by using an open-source synthesis tool. We have also outlined a
dual-core system design. The HLS generated CPU core has 14 states for a traditional
single clock cycle delay memory interface, and 42 states if AXI bus support is needed.
For such complex systems, design in Verilog will be more demanding and error prone
compared to an HLS based approach. Detailed analysis of multi-core designs is planned
for future research.
References
1. Depablo, S.; Cebrián, J.A.; Herrero-de Lucas, L.C.; Rey-Boué, A.B. A very simple 8-bit RISC processor for FPGA. In Proceedings
of the FPGAworld Conference 2006, Stockholm, Sweden, 2006; pp. 9–15.
2. Archana, H.R.; Sanjana, T.; Bhavana, H.T.; Sunil, S.V. System Verification and Analysis of ALU for RISC Processor. In Proceedings
of the 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India,
19–20 March 2021; Volume 1, pp. 1785–1789. https://doi.org/10.1109/ICACCS51430.2021.9442045.
3. Wang, L.; Yu, Z.; Zhang, D.; Qin, G. Research on Multi-Cycle CPU Design Method of Computer Organization Principle
Experiment. In Proceedings of the 2018 13th International Conference on Computer Science Education (ICCSE), Colombo, Sri
Lanka, 8–11 August 2018; pp. 1–6. https://doi.org/10.1109/ICCSE.2018.8468694.
4. Eljhani, M.M.; Kepuska, V.Z. Reduced Instruction Set Computer Design on FPGA. In Proceedings of the 2021 IEEE 1st
International Maghreb Meeting of the Conference on Sciences and Techniques of Automatic Control and Computer Engineering
MI-STA, Tripoli, Libya, 25–27 May 2021; pp. 316–321. https://doi.org/10.1109/MI-STA52233.2021.9464409.
5. Waterman, A.; Asanović, K. The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA version 20191213; RISC-V International:
2021.
6. Höller, R.; Haselberger, D.; Ballek, D.; Rössler, P.; Krapfenbauer, M.; Linauer, M. Open-Source RISC-V Processor IP Cores for
FPGAs — Overview and Evaluation. In Proceedings of the 2019 8th Mediterranean Conference on Embedded Computing
(MECO), Budva, Montenegro, 10–14 June 2019; pp. 1–6. https://doi.org/10.1109/MECO.2019.8760205.
7. Rokicki, S.; Pala, D.; Paturel, J.; Sentieys, O. What You Simulate Is What You Synthesize: Design of a RISC-V Core from C++
Specifications. In Proceedings of the RISC-V Workshop 2019, Zurich, Switzerland, 2019; pp. 1–2.
8. Harris, S.L.; Chaver, D.; Piñuel, L.; Gomez-Perez, J.; Liaqat, M.H.; Kakakhel, Z.L.; Kindgren, O.; Owen, R. RVfpga: Using
a RISC-V Core Targeted to an FPGA in Computer Architecture Education. In Proceedings of the 2021 31st International
Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 145–150.
https://doi.org/10.1109/FPL53798.2021.00032.