The Design of Optimized RISC Processor For Edge Artificial Intelligence Based On Custom Instruction Set Extension
ABSTRACT Edge computing is becoming increasingly popular in artificial intelligence (AI) application development due to the benefits of local execution. One widely used approach to overcoming hardware limitations in edge computing is heterogeneous computing, which combines a general-purpose processor (GPP) with a domain-specific AI processor. However, this approach can be inefficient due to the communication overhead resulting from the complex communication protocol. To avoid this communication overhead, the concept of an application-specific instruction set processor based on a customizable instruction set architecture (ISA) has emerged. By integrating the AI processor into the processor core, on-chip communication replaces the complex communication protocol. Further, a custom instruction set extension (ISE) reduces the number of instructions needed to execute AI applications. In this paper, we propose a uniprocessor system architecture for lightweight AI systems. First, we define the custom ISE to integrate the AI processor and the GPP into a single processor, minimizing communication overhead. Next, we design the processor based on the integrated core architecture, including the base core and the AI core, and implement the processor on an FPGA. Finally, we evaluate the proposed architecture through simulation and implementation of the processor. The results show that the designed processor consumes 6.62% more lookup tables and 74% fewer flip-flops while achieving up to 193.88 times higher throughput and 52.75 times higher energy efficiency compared to the previous system.
INDEX TERMS AI processor, application specific instruction processor, custom instruction set extension,
edge computing, embedded systems, processor core, reduced instruction set computer.
I. INTRODUCTION
Nowadays, ongoing advances in semiconductor process technology encourage a variety of digital systems to embed denser circuits by reducing the resources, e.g., power consumption and area usage, while improving the time-related performance of the basic elements for digital chip implementation. This trend emboldens recent attempts to adopt complex algorithms in a variety of applications. Artificial intelligence (AI) algorithms, e.g., machine learning (ML) and deep neural network (DNN) algorithms, are complex algorithms that are being actively applied to these applications.

Applications based on lightweight embedded systems are one of the fields where AI algorithms are actively applied. Edge computing has become one of the major topics in the development of AI applications due to the benefits of replacing cloud execution with local execution, such as reduced network bandwidth usage, enhanced privacy protection, and minimized storage waste [1], [2]. Many ongoing studies introduce methods for distributing the workloads of AI algorithms to lightweight systems [3], [4], [5]. One of the main challenges is overcoming the hardware constraints of lightweight systems, such as low performance and area limitations.

Heterogeneous computing is one of the methods to overcome these limitations. The general-purpose processor (GPP) is essential to most digital systems, but executing most parts of an AI algorithm with only a single GPP is difficult because most GPPs are optimized to perform various types of simple operations sequentially, while AI algorithms are composed of numerous instances of a limited set of operation types [6]. Therefore, not only lightweight systems but also cloud-computing-based high-performance systems that target AI applications adopt heterogeneous computing, combining a GPP with a hardware unit that accelerates arithmetic operations through parallelization [7], [8].

The main difference in hardware acceleration between these systems is the diversity of executable AI algorithms. High-performance systems generally include domain-general hardware units such as general-purpose graphics processing units (GPGPUs) and tensor processing units (TPUs) [7]. In contrast, most embedded systems include hardware units for specific AI algorithms because of the resource constraints involved in lightweight systems [9], [10]. As edge AI systems generally target domain-specific applications, trading off versatility for resource utilization, e.g., power consumption and area usage, is an effective strategy for these systems [11]. In fact, many studies have been conducted to design modular AI processors for heterogeneous computing-based edge AI applications [7], [8], [9].

Nevertheless, AI systems composed of modularized heterogeneous processors still involve drawbacks stemming from certain characteristics. Communication protocols used for handshaking between devices, such as universal asynchronous receiver/transmitter (UART), serial peripheral interface (SPI), and peripheral component interconnect express (PCIe), are one of the major characteristics that generate inefficiencies in the system. As the data rate of these protocols is much slower than the parallel communication inside the processors running at their own operating frequencies, significant overheads are required for handshaking between the two processors.

The increased workload allocation for communication results in degraded system performance. Moreover, synchronizing the processors requires additional controller units on both sides for communication in compliance with the protocol. These controllers increase the area of each processor, resulting in higher energy consumption for the entire system. As these drawbacks, e.g., throughput degradation and electric energy dissipation, are critical in lightweight systems [12], minimizing the inefficiency generated by the communication protocol is necessary.

The application-specific instruction set processor (ASIP) based on a customizable instruction set architecture (ISA) for the GPP is one of the concepts suggested to avoid the inefficiencies caused by communication overheads in the modularized architecture. Fig. 1 shows the system architectures of both heterogeneous computing systems and uniprocessor systems based on an ASIP. By integrating the AI core into the core of the GPP, the roles of the AI processor and the GPP, which are executed separately in previous heterogeneous computing systems, can be executed by only a single processor. This characteristic reduces the communication overheads and eliminates the necessity for communication controllers, as the protocol is replaced with on-chip communication running synchronously at the operating frequency of the core.

The ASIP concept shares with the bus-topology-based concept [13], [14] the view that overheads can be minimized by simplifying and boosting communication performance through chip-level integration. Despite this similarity, the ASIP concept has advantages in resource utilization derived from the ISE. Through the custom ISE, the number of instructions required to execute AI applications becomes lower because a number of memory load/store instructions, which deliver the orders to an AI accelerator linked to the system bus, are converted to a small number of custom instructions. This characteristic reduces the memory usage for storing instruction codes and increases the throughput of the system [15], [16]. This advantage makes the ASIP concept a reasonable choice despite the difficulty of the design process originating from the core modification.

RISC-V and MIPS are suitable ISAs for realizing the ASIP concept. Both ISAs provide productivity in the custom ISE definition process through the guidelines in the ISA documentation [17], [18]. Indeed, several studies applied the ASIP concept with one of these ISAs for systems targeting AI applications [19], [20], [21], [22]. However, most of those studies concentrate on applying this concept to complex AI algorithms such as deep learning and convolutional neural networks (CNNs) [23], [24], [25], [26], [27], [28]. These heavy algorithms consume huge resources, making them inappropriate for lightweight systems due to hardware limitations [29]. For this reason, choosing an appropriate algorithm that consumes affordable resources is necessary to apply the concept to lightweight systems.

Another thing to consider is the additional workload for rewriting the source codes of the application. Custom instructions are not created by compiling source codes written in the general syntax of compilers. Hence, software developers need to mutate the legacy source codes into new inline assembly codes to operate the AI core. Defining the custom ISE to be similar to the ISA of the previous AI processor is necessary because conserving the previous mechanism lowers the additional workload for rewriting the source codes.

In this paper, we propose a uniprocessor system architecture for lightweight AI systems. To minimize the communication overhead between the AI processor and the general-purpose processor, we selected an AI processor suitable for lightweight embedded systems and defined the custom ISE to design an architecture that integrates the GPP and the AI processor into a single processor.
FIGURE 1. (a) The heterogeneous system architecture. (b) The ASIP-based system architecture.
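As a concrete illustration of the instruction-count argument above, the following C sketch contrasts the two control styles. It is illustrative only and not taken from the paper's listings: the memory-mapped register addresses, the register layout of the bus-attached accelerator, and the use of the MIPS coprocessor-2 move instruction (mtc2) with coprocessor register $2 as the custom-ISE entry point are assumptions made for the example.

Sketch: bus-based control versus custom-instruction control (illustrative, not from the paper's listings).

#include <stdint.h>

/* (a) Modular accelerator on a system bus: every order is a sequence of
 *     load/store instructions to memory-mapped registers (addresses are
 *     hypothetical placeholders). */
#define ACC_BASE   0x1F000000u
#define ACC_CMD    (*(volatile uint32_t *)(ACC_BASE + 0x0))
#define ACC_DATA   (*(volatile uint32_t *)(ACC_BASE + 0x4))
#define ACC_STATUS (*(volatile uint32_t *)(ACC_BASE + 0x8))

static void acc_write_bus(uint32_t cmd, uint32_t data)
{
    ACC_CMD  = cmd;                 /* store #1: select the command        */
    ACC_DATA = data;                /* store #2: provide the operand       */
    while ((ACC_STATUS & 1u) == 0)  /* load(s): poll until the device acks */
        ;
}

/* (b) ASIP with custom ISE: the same order reaches the AI core in a single
 *     coprocessor-2 move, executed synchronously inside the pipeline.
 *     The coprocessor register number ($2) is an assumed example. */
static inline void acc_write_ise(uint32_t data)
{
    __asm__ volatile("mtc2 %0, $2" :: "r"(data));
}

In the bus-based version each order costs several load/store instructions plus polling, whereas the custom instruction collapses the hand-off into one pipeline operation, which is the source of the instruction-count and throughput gains discussed above.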
The ISE is compatible with the MIPS ISA, which includes reserved coprocessor definitions for design-specific ISEs. Further, we designed the integrated core architecture of the processor that executes the ISE. The core architecture includes the processor core and the AI core, which are operated by a synchronous clock.

According to this architecture, complex communication is replaced with simple data transfers through communication based on multiple internal wires and buffers in only a few clock periods. To verify the system and the core architecture, we designed the processor at the register-transfer level (RTL) in Verilog HDL and built up the verification environment with a field-programmable gate array (FPGA) implementation.

Next, we developed the software library containing inline assembly codes that perform the same operations as the legacy library functions, which were developed for the previous modular AI processor. Finally, we evaluated the proposed architecture and the designed processor by executing sample applications that were already verified for the heterogeneous computing system with the previous AI processor.

The main contributions of this work are listed as follows:
• Customized ISE to minimize the overhead caused by the complex communication protocol in the modularized AI system.
• The AI coprocessor architecture and AI core architecture optimized for a 32-bit MIPS core.
• The core architecture for the GPP with an integrated AI coprocessor compatible with the customized ISE.
• The unified processor architecture designed at RTL.
• The software library to control the AI coprocessor.
• Evaluation of the processor architecture by realizing the designed processor as a field-programmable gate array (FPGA) implementation.

In order to describe our work clearly, the rest of the paper is organized as follows. Section II reviews the related works that motivated our research. Section III explains the base architecture of the target modular AI processor, which is used to extract and optimize the AI core for the AI coprocessor. Section IV presents the custom ISE definitions and operations, which are based on the operation of the AI processor. Section V presents the integrated core architecture with the detailed operation of the custom ISE, as well as the unified processor architecture with the integrated core architecture. Section VI describes the verification environment and evaluates the proposed architecture. Section VII concludes our research by summarizing our work and outlining the future endeavors to extend this work.

II. RELATED WORKS
A. FPGA-BASED LIGHTWEIGHT CNN ACCELERATORS
Many studies have been conducted to apply edge computing for AI using FPGA-based heterogeneous computing methods. Kim et al. [30] proposed hardware acceleration for lightweight systems based on SqueezeNet [31], which is a lightweight CNN model for embedded systems. The authors transformed 32-bit floating-point arithmetic to 8-bit integer arithmetic to reduce the intensity of SqueezeNet and designed a parallelized architecture. Xia et al. [32] proposed SparkNet, which utilizes depthwise separable convolution to reduce the intensity of the CNN. To optimize the CNN for embedded systems, the authors of this work quantized the parameters to 16-bit integers and designed the accelerator with a pipelined architecture, applying optimal multilevel parallelism. Despite the astonishing performance of the CNN-based heterogeneous accelerators proposed in these studies, the baseline intensity and numerous parameters of multi-layered neural networks still make it challenging to
adopt these works on ultra-lightweight embedded systems due to the power consumption, energy dissipation, and area usage originating from the calculation logic and external memory.

B. LIGHTWEIGHT ML ACCELERATORS
As CNNs are not suitable for resource-limited ultra-lightweight systems, lightweight ML accelerators and processors have been designed and introduced. General Vision introduces a neuron-inspired pattern recognition chip based on the k-nearest neighbor (k-NN) algorithm, the CM1K, which is an application-specific integrated circuit (ASIC) [33]. Abeysekara et al. and Suri et al. [34], [35] developed heterogeneous systems for AI applications with a lightweight GPP and the CM1K. These works overcome the computing performance limitations of the lightweight GPP but still have drawbacks related to energy dissipation and throughput degradation originating from the complex communication protocol.

To address these issues, Intel proposes a processor module named Curie, including a pattern matching engine (PME) that accelerates the k-NN algorithm and is manipulated by bus transactions [36]. Although manipulating the accelerator through a bus topology increases the throughput performance and reduces the energy dissipation, this method still has memory access overheads originating from bus transactions.

On the other hand, several studies proposed utilizing burst access in a bus topology to reduce memory access bottlenecks. Borelli et al. [37] proposed a k-NN accelerator for heterogeneous computing that uses scratchpad memory through direct memory access (DMA) connected to an advanced extensible interface (AXI) stream port inside the processing system. Hussain et al. [38] designed an accelerator using a parallel first-in-first-out (FIFO) module to reduce memory access bottlenecks through burst transactions. In both architectures, data transactions between the GPP and the accelerator are processed through burst read and write operations, making the data transactions faster. Li et al. [39] present a k-NN accelerator accessed by the main processor through an AXI slave interface and compose the scratchpad memory from dynamic random access memory (DRAM). In this study, the accelerator directly controls the DMA with burst operations to manipulate the DRAM controller through the AXI stream master port. These studies significantly reduced the memory access bottlenecks but are not optimized for ultra-lightweight embedded systems, as these designs exploit not only the on-chip memory but also the external memories, resulting in a heavier system in terms of area usage, power consumption, and energy dissipation, and leading to throughput degradation due to long memory access latency.

Different from these works, we concentrate on the following points:
• Designing an optimized architecture to reduce bottlenecks of the memory access, which is required for AI computation, by eliminating structural overheads caused by bus transactions while avoiding the use of external memories, which make the system heavier and inappropriate for ultra-lightweight systems.
• Optimizing the internal architecture of the k-NN calculation logic through parallelism directly reflecting the data rate and timing constraints of the general-purpose core inside the processor.
• Providing the reconfigurability that reflects the k-NN parameters to adapt the AI core to the varying constraints of lightweight systems.

III. BACKGROUND: INTELLINO AI PROCESSOR
Before designing the processor based on the proposed system architecture for edge AI applications, we chose an appropriate AI processor, Intellino, which is designed for heterogeneous computing systems. Intellino is a reconfigurable AI processor based on the k-NN algorithm, which is a lightweight ML algorithm based on distance calculation, and operates over SPI, which is a common interface for high-speed communication in lightweight embedded systems [40], [41].

Fig. 2 shows the top architecture of the Intellino, which consists of the SPI slave controller, packet decoder and encoder, neuron controller, and classifier. As the Intellino is a slave device, which cannot operate independently, the SPI slave controller performs communication dependent on the clock generated by the host, i.e., the serial clock (SCK). After receiving a one-byte SPI frame, the controller sets the signal called rx_cplt and delivers the received data to the packet decoder. The packet decoder converts the received data to an instruction for the neuron controller by buffering the data and observing the specific byte sequences defined by the protocol of the Intellino.

The instruction set for the neuron controller is defined as operations based on a register map definition. Thus, a single instruction for the neuron controller is represented as the control signals for the register file, such as the register address (reg_addr), register write enable (reg_we), register write data (reg_wd), and register read enable (reg_re). The neuron controller generates the control signals and data signals for the neuron cells and sends the calculation results of the k-NN algorithm generated by the classifier. The neuron cells and the classifier make up the k-NN calculator.

Fig. 3 shows the architecture of the k-NN calculation process. The k-NN algorithm is calculated by searching the pre-trained dataset for the category-distance pair that has the shortest distance from the inference data vector. The k-NN calculator of the Intellino has an architecture that performs this calculation efficiently. At first, the distance between each stored dataset vector and the current inference data vector is calculated by each neuron cell. Next, the classifier submodule calculates the category with the lowest distance by comparing the distance data using the lower distance selectors, which are constructed as a multi-level inverted tree structure. Through this structure, the data pair with the lowest distance is derived by the selector in the last level.
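A minimal software model of this search is sketched below. It is illustrative only: the L1 (Manhattan) distance metric, the neuron cell count, and the data types are assumptions, and in the actual logic the per-cell distances are computed in parallel by the neuron cells while the reduction is realized by the tree of lower distance selectors described above.

#include <stdint.h>
#include <stddef.h>

#define VEC_LEN   128   /* vector length used in the paper's examples    */
#define NUM_CELLS 32    /* neuron cell count: an arbitrary example value */

typedef struct {
    uint8_t  vector[VEC_LEN];  /* trained pattern stored in the cell */
    uint16_t category;         /* category written at training time  */
} neuron_cell_t;

/* Distance of one stored pattern to the inference vector; L1 (Manhattan)
 * distance is used here purely as an example metric. */
static uint32_t cell_distance(const neuron_cell_t *cell, const uint8_t *x)
{
    uint32_t d = 0;
    for (size_t i = 0; i < VEC_LEN; i++)
        d += (cell->vector[i] > x[i]) ? (uint32_t)(cell->vector[i] - x[i])
                                      : (uint32_t)(x[i] - cell->vector[i]);
    return d;
}

/* Software model of the classifier: every cell produces a distance, then
 * pairwise "lower distance selectors" reduce the category-distance pairs
 * level by level until a single pair with the lowest distance remains. */
static uint16_t knn_classify(const neuron_cell_t cells[NUM_CELLS],
                             const uint8_t *x, uint32_t *out_dist)
{
    uint32_t dist[NUM_CELLS];
    uint16_t cat[NUM_CELLS];
    size_t n = NUM_CELLS;

    for (size_t i = 0; i < n; i++) {
        dist[i] = cell_distance(&cells[i], x);
        cat[i]  = cells[i].category;
    }
    while (n > 1) {                 /* one loop iteration per selector level */
        size_t m = 0;
        for (size_t i = 0; i < n; i += 2) {
            size_t best = (i + 1 < n && dist[i + 1] < dist[i]) ? i + 1 : i;
            dist[m] = dist[best];
            cat[m]  = cat[best];
            m++;
        }
        n = m;
    }
    *out_dist = dist[0];
    return cat[0];
}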
FIGURE 4. (a) The internal architecture of the neuron cell. (b) The internal architecture of the lower distance selector.
only the neuron cell to train but also the current state of the whole coprocessor. Through this attribute, both storing the vector to the neuron cell in the training process and calculating the distance in the inference process are executed by the same iterative move operations. The difference between the training process and the inference process is that the training process ends with a write of the category for the training data to the neuron cell, while the inference process ends with a read of the derived category and distance. Another important attribute is that the operations overwriting certain values to the COMPID register and the NID register are highly infrequent, because certain operations automatically change the value in a register not addressed by the instruction, namely the move operations writing to the COMP, LCOMP, CAT, FORGET, and CLEAR registers.

Listings 1 and 2 show examples of source code for the AI coprocessor executed through the custom ISE and for the original Intellino operated over SPI communication, respectively. Both source codes execute the training of a single data item from the dataset to the neuron cell. The length of the vector is configured as 128, and the COMPID value and the NID are regarded as zero and a value lower than 2^n − 1, respectively. Both codes share the steps of storing the training vector in the SPM through sequential writing and sending the category value. Despite the source codes having similar steps, the code based on the custom ISE provides much more simplicity than that based on the original AI processor due to the exclusion of the numerous function calls for the complex communication protocol.

Listing 2. The source codes for training based on the original architecture.

V. SYSTEM ARCHITECTURE
Fig. 6 shows the proposed system architecture. The system consists of the AI ASIP for calculation, sensors for receiving inference data, and a display interface to interact with the users. The ASIP adopts the Harvard architecture, which has two paths, the data path and the instruction path, to avoid bubble insertion generated by sharing the same path between instruction fetching and memory access instructions. Compared to high-performance systems, which generally adopt the modified Harvard architecture, the exclusion of caches reduces area usage and energy consumption by eliminating the overheads caused by cache misses and avoiding the area occupied by the cache. As the memory requirements of the lightweight AI are generous, the reduction of the available memory resources provided by the cache is acceptable. The integrated core on the ASIP processes the input data for inference by executing the program, including the custom AI instructions, through the internal AI coprocessor. The output controller displays the inference result calculated by the AI coprocessor, allowing the users to perceive the result directly. In accordance with this architecture, the throughput of the AI algorithm is improved compared to the systems based on the previous heterogeneous architecture. Further, the data flow of the
FIGURE 6. The proposed system architecture for lightweight AI systems.
FIGURE 7. Data flow of the MTC2, MFC2 instructions on the core pipeline.
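Listing 1 itself is not reproduced in this text. As a rough sketch of what a training routine of that shape could look like with the custom ISE (and with the MTC2 move illustrated in Fig. 7), consider the following C fragment with inline assembly. The coprocessor-2 register numbers chosen for the COMP and CAT registers and the exact move encoding are assumptions for illustration, not the paper's actual ISE definition.

#include <stdint.h>

/* Hypothetical coprocessor-2 register numbers for the AI coprocessor's
 * COMP (component) and CAT (category) registers; the real assignments are
 * given by the paper's custom ISE definition, which is not reproduced here. */
#define CP2_COMP "$1"
#define CP2_CAT  "$2"

/* Train one 128-element vector: iteratively move each component into the
 * coprocessor (storing it into the neuron cell under training), then write
 * the category, which ends the training operation as described in the text. */
static void ai_train(const uint8_t vector[128], uint16_t category)
{
    for (int i = 0; i < 128; i++) {
        uint32_t comp = vector[i];
        __asm__ volatile("mtc2 %0, " CP2_COMP :: "r"(comp));
    }
    uint32_t cat = category;
    __asm__ volatile("mtc2 %0, " CP2_CAT :: "r"(cat));
}

The Listing 2 counterpart performs the same two steps, but each move becomes a library call that frames and transmits bytes over the SPI link, which is where the additional function calls and handshaking overhead come from.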
FIGURE 11. The required time for the training process in each case.
After the simulation, we realized the designed processor through FPGA implementation. Fig. 12 shows the architecture of the verification system, which is based on the Zynq-7000 system-on-chip (SoC) mounted on the Digilent ZedBoard. The SoC consists of the general-purpose processing system (PS) and the programmable logic (PL). The PS includes the complex blocks required by a modern operating system (OS), such as the memory management unit (MMU), and a variety of peripherals. Certain peripherals, e.g., GPIO, in the PS are connectable to the programmable logic. This architecture facilitates the input/output process and the verification process of the designed digital circuit by making use of the convenient features the OS environment provides. Thus, we adopted the method of running a modern OS, Linux, on the PS and taking control of the design implemented on the PL block through the OS. The Linux distribution we utilized for the PS is PetaLinux 2021.1, and a detailed guide for running PetaLinux on the Zynq-7000 SoC is provided in [45].

To realize this method, we first synthesized and implemented the MIPS-Intellino processor on the PL block of the Zynq-7000 SoC and linked the GPIO and the UART of the PS to the JTAG controller and the UART in the designed processor. Next, we developed the software that writes the program to the RAM in the designed processor by imitating the JTAG protocol through GPIO, and programmed the test software, compiled by a cross compiler built with the GNU C compiler (GCC), to the designed processor. At last, we checked the results of the test software running on the MIPS-Intellino through the serial communication software running on the PS.

To evaluate the area usage of the designed processor architecture, we performed synthesis and implementation using Xilinx Vivado 2022.2 for the designed coprocessor and the original AI processor with various AI configurations. The jobs were configured to target the PL block of the Xilinx xc7z020clg SoC with additional settings to avoid the utilization of pre-fabricated hard blocks, such as DSP blocks and block RAMs. Hard blocks in FPGAs provide performance enhancements and area reduction, operating more similarly to an ASIC than an FPGA [46]. Thus, utilizing these blocks is regarded as contracting the gap, e.g., in area, power, and propagation delay, between ASICs and FPGAs [46]. However, for the purpose of using the FPGA implementation as a prototype of the digital logic circuit before ASIC implementation, utilizing these blocks makes the performance analysis fuzzier, as the FPGA prototype operates as an ASIC-FPGA hybrid architecture [47]. Therefore, to provide an approximately clear and comparative analysis of the different digital circuits for chip fabrication, synthesizing with only general-purpose blocks, which are regarded as corresponding to a standard cell library, is required.

Fig. 13 shows the resource usage of the implementation results for the previous AI processor and the designed AI coprocessor. The results show that the FPGA implementations of the coprocessor require more look-up tables (LUTs) than those of the original AI processor. On average, the coprocessor uses approximately 6.62% more LUTs compared to the original. This is because the ancillary components of the coprocessor, the pipeline stages, use more LUTs than the ancillary components of the original processor, the SPI slave controller and the packet decoder. The gap between the LUT usage of the coprocessor and the original decreases as the vector length parameter increases. As these components are not affected by the AI configurations such as the vector length and the neuron cell count, the increase in the configuration parameters makes the area share of the AI logic and the memory higher, diminishing the impact of the gap between ancillary components. The usage of other physical resources, flip-flops (FFs), F7 multiplexers, and F8 multiplexers, of the coprocessor shows 74%, 40.72%, and 19.53% lower values, respectively, than the original AI processor. According to this result, the area usage performance of the coprocessor is suitable for replacing the modularized architecture. When reflecting the physical area reduction through the SoC architecture, which removes the external wires for communication, the advances in area usage performance become higher.

In the case of power and energy, the implementation results show that the designed AI coprocessor has far better performance than the original Intellino. Fig. 14 presents the energy-related performances of both designs. As shown in the first graph, the AI coprocessor consumes 3.58 times the dynamic power on average compared to the original. This
FIGURE 13. Resource usage comparison between the coprocessor and the original Intellino.
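As a rough picture of the program-loading flow described in the verification setup above, where the JTAG protocol is imitated by driving PS GPIO lines wired to the JTAG controller of the designed processor, the following sketch bit-bangs a JTAG shift in C. The gpio_write()/gpio_read() helpers, the pin assignments, and the shift routine are illustrative assumptions, not the actual verification software.

#include <stdint.h>

/* Hypothetical GPIO helpers provided by the Linux environment on the PS
 * (e.g., thin wrappers around sysfs or libgpiod); pin indices are placeholders. */
void gpio_write(int pin, int value);
int  gpio_read(int pin);

enum { PIN_TCK = 0, PIN_TMS = 1, PIN_TDI = 2, PIN_TDO = 3 };

/* Shift one bit through the JTAG chain by bit-banging TCK/TMS/TDI from the
 * PS; TDO is sampled before the clock pulse, a common bit-bang convention. */
static int jtag_shift_bit(int tdi, int tms)
{
    gpio_write(PIN_TDI, tdi);
    gpio_write(PIN_TMS, tms);
    int tdo = gpio_read(PIN_TDO);
    gpio_write(PIN_TCK, 1);
    gpio_write(PIN_TCK, 0);
    return tdo;
}

/* Shift a data word LSB-first; asserting TMS on the last bit advances the
 * TAP state machine, following the usual JTAG convention.  Repeating this
 * for each instruction word is how the test program would be written into
 * the RAM of the designed processor in this illustrative model. */
static uint32_t jtag_shift_word(uint32_t data, int bits)
{
    uint32_t out = 0;
    for (int i = 0; i < bits; i++) {
        int last = (i == bits - 1);
        int tdo = jtag_shift_bit((int)((data >> i) & 1u), last);
        out |= (uint32_t)tdo << i;
    }
    return out;
}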
MIPS ISA, which presents the guidelines for defining the custom ISE. The ISE we defined is optimized to mutate the target AI processor for heterogeneous computing, Intellino, a reconfigurable lightweight AI processor based on the k-NN algorithm. The ASIP is designed with the integrated core architecture that contains the custom coprocessor operating synchronously with the base core. By placing the core logic that executes the AI algorithm in the write-back stage, the necessity of additional forwarding logic is removed. Further, specific optimizations for the communication based on multiple parallel wires, e.g., parallelization of the distance accumulation in the k-NN algorithm, boost the throughput remarkably with a small demerit in area usage performance.

According to the simulation results, the designed processor achieved up to 193.88 times enhanced throughput performance compared to that of the original modularized AI processor. In addition, the resource shares of each component were reduced, except for LUTs, which were 6.62% higher on average according to the implementation results. Considering that the elimination of the external wires for communication due to the uniprocessor architecture is not reflected, the advantages in area usage are assumed to be higher. Therefore, adopting the proposed system architecture is reasonable for lightweight systems that have a fixed application and are strongly affected by throughput.

In future work, we plan to advance our work through two separate stages. First, we will eliminate the necessity of the bubble insertion intrinsic to the current coprocessor architecture, which reduces the throughput performance. This objective can be achieved by optimizing the microarchitecture, such as reconstructing the data processing and propagation structure in the pipeline architecture and adding data forwarding logic to avoid stalls. Second, we will adopt the proposed coprocessor architecture for other AI algorithms that are more complicated than k-NN, extending the proposed architecture to high-performance systems. Ultimately, we aim to fabricate the most advanced ASIP designed by these future works onto a chip to prove the feasibility and performance enhancements of the proposed system architecture.
REFERENCES
[1] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, ‘‘When edge meets learning: Adaptive control for resource-constrained distributed machine learning,’’ in Proc. IEEE Conf. Comput. Commun., Apr. 2018, pp. 63–71.
[2] S. Jang, H. W. Oh, Y. H. Yoon, D. H. Hwang, W. S. Jeong, and S. E. Lee, ‘‘A multi-core controller for an embedded AI system supporting parallel recognition,’’ Micromachines, vol. 12, no. 8, p. 852, Jul. 2021. [Online]. Available: https://www.mdpi.com/2072-666X/12/8/852
[3] G. Cornetta and A. Touhafi, ‘‘Design and evaluation of a new machine learning framework for IoT and embedded devices,’’ Electronics, vol. 10, no. 5, p. 600, Mar. 2021. [Online]. Available: https://www.mdpi.com/2079-9292/10/5/600
[4] J. Chen, M. Jiang, X. Zhang, D. S. da Silva, V. H. C. de Albuquerque, and W. Wu, ‘‘Implementing ultra-lightweight co-inference model in ubiquitous edge device for atrial fibrillation detection,’’ Expert Syst. Appl., vol. 216, Apr. 2023, Art. no. 119407. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417422024265
[5] P. Tsung, T. Chen, C. Lin, C. Chang, and J. Hsu, ‘‘Heterogeneous computing for edge AI,’’ in Proc. Int. Symp. VLSI Design, Autom. Test (VLSI-DAT), Apr. 2019, pp. 1–2.
[6] S. Mukhopadhyay, Y. Long, B. Mudassar, C. S. Nair, B. H. DeProspo, H. M. Torun, M. Kathaperumal, V. Smet, D. Kim, S. Yalamanchili, and M. Swaminathan, ‘‘Heterogeneous integration for artificial intelligence: Challenges and opportunities,’’ IBM J. Res. Develop., vol. 63, no. 6, pp. 4:1–4:1, Nov. 2019.
[7] C. Choi et al., ‘‘Reconfigurable heterogeneous integration using stackable chips with embedded artificial intelligence,’’ Nature Electron., vol. 5, no. 6, pp. 386–393, Jun. 2022, doi: 10.1038/s41928-022-00778-y.
[8] J. Han, M. Choi, and Y. Kwon, ‘‘40-TFLOPS artificial intelligence processor with function-safe programmable many-cores for ISO26262 ASIL-D,’’ ETRI J., vol. 42, no. 4, pp. 468–479, Aug. 2020, doi: 10.4218/etrij.2020-0128.
[9] D. Jamma, O. Ahmed, S. Areibi, G. Grewal, and N. Molloy, ‘‘Design exploration of ASIP architectures for the K-nearest neighbor machine-learning algorithm,’’ in Proc. 28th Int. Conf. Microelectron. (ICM), Dec. 2016, pp. 57–60.
[10] Y. Chen, H. Lan, Z. Du, S. Liu, J. Tao, D. Han, T. Luo, Q. Guo, L. Li, Y. Xie, and T. Chen, ‘‘An instruction set architecture for machine learning,’’ ACM Trans. Comput. Syst., vol. 36, no. 3, pp. 1–35, Aug. 2019, doi: 10.1145/3331469.
[11] H. W. Oh, J. K. Kim, G. B. Hwang, and S. E. Lee, ‘‘The design of a 2D graphics accelerator for embedded systems,’’ Electronics, vol. 10, no. 4, p. 469, Feb. 2021. [Online]. Available: https://www.mdpi.com/2079-9292/10/4/469
[12] H. Huang, Z. Liu, T. Chen, X. Hu, Q. Zhang, and X. Xiong, ‘‘Design space exploration for YOLO neural network accelerator,’’ Electronics, vol. 9, no. 11, p. 1921, Nov. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/11/1921
[13] J. Wang and S. Gu, ‘‘FPGA implementation of object detection accelerator based on Vitis-AI,’’ in Proc. 11th Int. Conf. Inf. Sci. Technol. (ICIST), May 2021, pp. 571–577.
[14] E. Rapuano, G. Meoni, T. Pacini, G. Dinelli, G. Furano, G. Giuffrida, and L. Fanucci, ‘‘An FPGA-based hardware accelerator for CNNs inference on board satellites: Benchmarking with Myriad 2-based solution for the CloudScout case study,’’ Remote Sens., vol. 13, no. 8, p. 1518, Apr. 2021. [Online]. Available: https://www.mdpi.com/2072-4292/13/8/1518
[15] M. Perotti, P. D. Schiavone, G. Tagliavini, D. Rossi, T. Kurd, M. Hill, L. Yingying, and L. Benini, ‘‘HW/SW approaches for RISC-V code size reduction,’’ in Proc. Workshop Comput. Archit. Res. RISC-V, 2020, pp. 1–12.
[16] A. K. Verma, P. Brisk, and P. Ienne, ‘‘Rethinking custom ISE identification: A new processor-agnostic method,’’ in Proc. Int. Conf. Compil., Archit., Synth. Embedded Syst., New York, NY, USA, Sep. 2007, p. 125, doi: 10.1145/1289881.1289905.
[17] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanović, ‘‘The RISC-V instruction set manual. Volume 1: User-level ISA, version 2.0,’’ Dept. Elect. Eng. Comput. Sci., Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2014-54, 2014.
[18] C. Price, ‘‘MIPS IV instruction set,’’ Tech. Rep., Revision 3.2, 1995.
[19] A. S. Eissa, M. A. Elmohr, M. A. Saleh, K. E. Ahmed, and M. M. Farag, ‘‘SHA-3 instruction set extension for a 32-bit RISC processor architecture,’’ in Proc. IEEE 27th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2016, pp. 233–234.
[20] T. Ahmed, N. Sakamoto, J. Anderson, and Y. Hara-Azumi, ‘‘Synthesizable-from-C embedded processor based on MIPS-ISA and OISC,’’ in Proc. IEEE 13th Int. Conf. Embedded Ubiquitous Comput., Oct. 2015, pp. 114–123.
[21] Y. Zhou, X. Jin, and T. Xiang, ‘‘RISC-V graphics rendering instruction set extensions for embedded AI chips implementation,’’ in Proc. 2nd Int. Conf. Big Data Eng. Technol., New York, NY, USA, Jan. 2020, p. 85, doi: 10.1145/3378904.3378926.
[22] M. Cococcioni, F. Rossi, E. Ruffaldi, and S. Saponara, ‘‘A lightweight posit processing unit for RISC-V processors in deep neural network applications,’’ IEEE Trans. Emerg. Topics Comput., vol. 10, no. 4, pp. 1898–1908, Oct. 2022.
[23] S. Lee, Y. Hung, Y. Chang, C. Lin, and G. Shieh, ‘‘RISC-V CNN coprocessor for real-time epilepsy detection in wearable application,’’ IEEE Trans. Biomed. Circuits Syst., vol. 15, no. 4, pp. 679–691, Aug. 2021.
[24] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, ‘‘A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set,’’ Electronics, vol. 9, no. 6, p. 1005, Jun. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/6/1005
[25] W. Lou, C. Wang, L. Gong, and X. Zhou, ‘‘RV-CNN: Flexible and efficient instruction set for CNNs based on RISC-V processors,’’ in Advanced Parallel Processing Technologies, P.-C. Yew, P. Stenström, J. Wu, X. Gong, and T. Li, Eds. Cham, Switzerland: Springer, 2019, pp. 3–14.
[26] Z. Li, W. Hu, and S. Chen, ‘‘Design and implementation of CNN custom processor based on RISC-V architecture,’’ in Proc. IEEE 21st Int. Conf. High Perform. Comput. Commun., IEEE 17th Int. Conf. Smart City, IEEE 5th Int. Conf. Data Sci. Syst., Aug. 2019, pp. 1945–1950.
[27] R. Porter, S. Morgan, and M. Biglari-Abhari, ‘‘Extending a soft-core RISC-V processor to accelerate CNN inference,’’ in Proc. Int. Conf. Comput. Sci. Comput. Intell. (CSCI), Dec. 2019, pp. 694–697.
[28] E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, and L. Benini, ‘‘GAP-8: A RISC-V SoC for AI at the edge of the IoT,’’ in Proc. IEEE 29th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2018, pp. 1–4.
[29] F. Taheri, S. Bayat-Sarmadi, and S. Hadayeghparast, ‘‘RISC-HD: Lightweight RISC-V processor for efficient hyperdimensional computing inference,’’ IEEE Internet Things J., vol. 9, no. 23, pp. 24030–24037, Dec. 2022.
[30] J. Kim, J.-K. Kang, and Y. Kim, ‘‘A resource efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition,’’ IEEE Access, vol. 9, pp. 104367–104381, 2021.
[31] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, ‘‘SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,’’ 2016, arXiv:1602.07360.
[32] M. Xia, Z. Huang, L. Tian, H. Wang, V. Chang, Y. Zhu, and S. Feng, ‘‘SparkNoC: An energy-efficiency FPGA-based accelerator using optimized lightweight CNN for edge computing,’’ J. Syst. Archit., vol. 115, May 2021, Art. no. 101991. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1383762121000138
[33] General Vision. (Aug. 2017). CM1K Hardware User’s Manual. Accessed: Apr. 8, 2023. [Online]. Available: https://www.generalvision.com/documentation/TM_CM1K_Hardware_Manual.pdf
[34] L. L. Abeysekara and H. Abdi, ‘‘Short paper: Neuromorphic chip embedded electronic systems to expand artificial intelligence,’’ in Proc. 2nd Int. Conf. Artif. Intell. Industries (AI4I), Sep. 2019, pp. 119–121.
[35] M. Suri, V. Parmar, A. Singla, R. Malviya, and S. Nair, ‘‘Neuromorphic hardware accelerated adaptive authentication system,’’ in Proc. IEEE Symp. Ser. Comput. Intell., Dec. 2015, pp. 1206–1213.
[36] R. Dower. (2022). Intel Pattern Matching Technology. Accessed: Apr. 8, 2023. [Online]. Available: https://github.com/intel/Intel-Pattern-Matching-Technology
[37] A. Borelli, F. Spagnolo, R. Gravina, and F. Frustaci, ‘‘An FPGA-based hardware accelerator for the k-nearest neighbor algorithm implementation in wearable embedded systems,’’ in Applied Intelligence and Informatics, M. Mahmud, C. Ieracitano, M. S. Kaiser, N. Mammone, and F. C. Morabito, Eds. Cham, Switzerland: Springer, 2022, pp. 44–56.
[38] H. Hussain, K. Benkrid, C. Hong, and H. Seker, ‘‘An adaptive FPGA implementation of multi-core K-nearest neighbour ensemble classifier using dynamic partial reconfiguration,’’ in Proc. 22nd Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2012, pp. 627–630.
[39] Z.-H. Li, J.-F. Jin, X.-G. Zhou, and Z.-H. Feng, ‘‘K-nearest neighbor algorithm implementation on FPGA using high level synthesis,’’ in Proc. 13th IEEE Int. Conf. Solid-State Integr. Circuit Technol. (ICSICT), Oct. 2016, pp. 600–602.
[40] Y. H. Yoon, D. H. Hwang, J. H. Yang, and S. E. Lee, ‘‘Intellino: Processor for embedded artificial intelligence,’’ Electronics, vol. 9, no. 7, p. 1169, Jul. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/7/1169
[41] D. H. Hwang, C. Y. Han, H. W. Oh, and S. E. Lee, ‘‘ASimOV: A framework for simulation and optimization of an embedded AI accelerator,’’ Micromachines, vol. 12, no. 7, p. 838, Jul. 2021. [Online]. Available: https://www.mdpi.com/2072-666X/12/7/838
[42] H. W. Oh, K. N. Cho, and S. E. Lee, ‘‘Design of 32-bit processor for embedded systems,’’ in Proc. Int. SoC Design Conf. (ISOCC), Oct. 2020, pp. 306–307.
[43] L. Deng, ‘‘The MNIST database of handwritten digit images for machine learning research [best of the web],’’ IEEE Signal Process. Mag., vol. 29, no. 6, pp. 141–142, Nov. 2012.
[44] H. W. Oh. (2023). Simulation Env. for the MIPS-Intellino and the Original Intellino. Accessed: Apr. 9, 2023. [Online]. Available: https://github.com/hopesandbeers/MIPS-Intellino-sim
[45] Xilinx. (2022). Zynq-7000 Embedded Design Tutorial. Accessed: Nov. 8, 2022. [Online]. Available: https://xilinx.github.io/Embedded-Design-Tutorials/docs/2021.1/build/html/docs/Introduction/Zynq7000-EDT/Zynq7000-EDT.html
[46] I. Kuon and J. Rose, ‘‘Measuring the gap between FPGAs and ASICs,’’ IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, Feb. 2007.
[47] A. Ehliar and D. Liu, ‘‘An ASIC perspective on FPGA optimizations,’’ in Proc. Int. Conf. Field Program. Log. Appl., Aug. 2009, pp. 218–223.
[48] General Vision. (Apr. 2019). NM500 User’s Manual. Accessed: Apr. 8, 2023. [Online]. Available: https://generalvision.com/documentation/TM_NM500_Hardware_Manual.pdf

HYUN WOO OH received the B.S. degree in electronic and IT media engineering and the M.S. degree in electronic engineering from the Seoul National University of Science and Technology, Seoul, South Korea, in 2021 and 2023, respectively. He has published papers and articles related to processor architecture and posit arithmetic. His current research interests include computer architecture, system-on-chip, and edge computing.

SEUNG EUN LEE (Senior Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1998 and 2000, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California, Irvine (UC Irvine), in 2008. After graduating, he was with Intel Labs, Hillsboro, OR, USA, where he worked as a Platform Architect. In 2010, he joined the Seoul National University of Science and Technology, Seoul, as a Faculty Member. His current research interests include computer architecture, multi-processor system-on-chip, low-power and resilient VLSI, and hardware acceleration for emerging applications.