
Received 11 April 2023, accepted 10 May 2023, date of publication 15 May 2023, date of current version 24 May 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3276411

The Design of Optimized RISC Processor for Edge Artificial Intelligence Based on Custom Instruction Set Extension
HYUN WOO OH AND SEUNG EUN LEE, (Senior Member, IEEE)
Department of Electronic Engineering, Seoul National University of Science and Technology, Seoul 01811, South Korea
Corresponding author: Seung Eun Lee ([email protected])
This work was supported by the Ministry of Science and ICT (MSIT), South Korea, through the Information Technology Research Center
(ITRC) Support Program, supervised by the Institute for Information and Communications Technology Planning and Evaluation (IITP),
under Grant IITP-2023-RS-2022-00156295.

ABSTRACT Edge computing is becoming increasingly popular in artificial intelligence (AI) application
development due to the benefits of local execution. One widely used approach to overcoming hardware
limitations in edge computing is heterogeneous computing, which combines a general-purpose processor
(GPP) with a domain-specific AI processor. However, this approach can be inefficient due to the
communication overhead resulting from the complex communication protocol. To avoid this communication
overhead, the concept of an application-specific instruction set processor (ASIP) based on a customizable
instruction set architecture (ISA) has emerged. By integrating the AI processor into the processor core,
on-chip communication replaces the complex communication protocol. Further, a custom instruction set
extension (ISE) reduces the number of instructions needed to execute AI applications. In this paper,
we propose a uniprocessor system architecture for lightweight AI systems. First, we define the custom ISE
to integrate the AI processor and the GPP into a single processor, minimizing communication overhead.
Next, we design the processor based on the integrated core architecture, including the base core and the
AI core, and implement the processor on an FPGA. Finally, we evaluate the proposed architecture through
simulation and implementation of the processor. The results show that the designed processor consumes
6.62% more lookup tables and 74% fewer flip-flops while achieving up to 193.88 times the throughput and
52.75 times the energy efficiency of the previous system.

INDEX TERMS AI processor, application-specific instruction set processor, custom instruction set extension,
edge computing, embedded systems, processor core, reduced instruction set computer.

The associate editor coordinating the review of this manuscript and approving it for publication was Stavros Souravlas.

I. INTRODUCTION
Nowadays, ongoing advances in semiconductor process technology encourage a variety of digital systems to embed denser circuits by reducing the resources, e.g., power consumption and area usage, while improving the time-related performance of the basic elements of digital chip implementation. This trend emboldens recent attempts to adopt complex algorithms in a variety of applications. Artificial intelligence (AI) algorithms, e.g., machine learning (ML) and deep neural network (DNN) algorithms, are complex algorithms that are being actively applied to these applications.

Applications based on lightweight embedded systems are one of the fields where AI algorithms are actively applied. Edge computing has become one of the major topics in the development of AI applications due to the benefits of replacing cloud execution with local execution, such as reduced network bandwidth usage, enhanced privacy protection, and minimized storage waste [1], [2]. Many ongoing studies introduce methods for distributing the workloads of AI algorithms to lightweight systems [3], [4], [5]. One of the main challenges is overcoming the hardware


constraints of lightweight systems, such as low performance and area limitations.

Heterogeneous computing is one of the methods to overcome these limitations. The general-purpose processor (GPP) is essential to most digital systems, but executing most parts of an AI algorithm with only a single GPP is difficult because most GPPs are optimized to perform various types of simple operations sequentially, while AI algorithms are composed of numerous operations of limited types [6]. Therefore, not only lightweight systems but also cloud computing-based high-performance systems targeting AI applications adopt heterogeneous computing with a GPP and hardware units that accelerate arithmetic operations through parallelization [7], [8].

The main difference in hardware acceleration between these systems is the diversity of executable AI algorithms. High-performance systems generally include domain-general hardware units such as general-purpose graphics processing units (GPGPU) and tensor processing units (TPU) [7]. In contrast, most embedded systems include hardware units for specific AI algorithms because of the resource constraints of lightweight systems [9], [10]. As edge AI systems generally target domain-specific applications, trading off versatility for resource utilization, e.g., power consumption and area usage, is an effective strategy for these systems [11]. In fact, many studies have been conducted to design modular AI processors for heterogeneous computing-based edge AI applications [7], [8], [9].

Nevertheless, AI systems composed of modularized heterogeneous processors still involve drawbacks stemming from certain characteristics. Communication protocols used for handshaking between devices, such as universal asynchronous receiver/transmitter (UART), serial peripheral interface (SPI), and peripheral component interconnect express (PCIe), are one of the major characteristics that generate inefficiencies in the system. As the data rate of these protocols is much slower than the parallel communication inside the processors running at their own operating frequencies, significant overheads are required for handshaking between the two processors.

The increased workload allocation for communication results in degraded system performance. Moreover, synchronizing the processors requires additional controller units on both sides for communication in compliance with the protocol. These controllers increase the area of each processor, resulting in higher energy consumption for the entire system. As these drawbacks, e.g., throughput degradation and electric energy dissipation, are critical in lightweight systems [12], minimizing the inefficiency generated by the communication protocol is necessary.

An application-specific instruction set processor (ASIP) based on a customizable instruction set architecture (ISA) for the GPP is one of the concepts suggested to avoid the inefficiencies caused by the communication overheads of the modularized architecture. Fig. 1 shows the system architectures of both heterogeneous computing systems and uniprocessor systems based on an ASIP. By integrating the AI core into the core of the GPP, the roles of both the AI processor and the GPP, which are executed separately in previous heterogeneous computing systems, can be executed by a single processor. This characteristic reduces the communication overheads and eliminates the necessity for communication controllers, as the protocol is replaced with on-chip communication running synchronously at the operating frequency of the core.

The ASIP concept shares with the bus-topology-based concept [13], [14] the view that overheads are minimized by simplifying and boosting communication performance through chip-level integration. Despite this similarity, the ASIP concept has advantages in resource utilization derived from the ISE. Through a custom ISE, the number of instructions needed to execute AI applications becomes lower because the numerous memory load/store instructions, which deliver the orders to an AI accelerator linked to the system bus, are converted into a small number of custom instructions. This characteristic reduces the memory usage for storing instruction codes and increases the throughput of the system [15], [16]. This advantage makes the ASIP concept a reasonable choice despite the difficulty of the design process originating from the core modification.

RISC-V and MIPS are suitable ISAs for realizing the ASIP concept. Both ISAs provide productivity in the custom ISE definition process through the guidelines in their ISA documentation [17], [18]. Indeed, several studies applied the ASIP concept with one of these ISAs for systems targeting AI applications [19], [20], [21], [22]. However, most of those studies concentrate on applying this concept to complex AI algorithms such as deep learning and convolutional neural networks (CNN) [23], [24], [25], [26], [27], [28]. These heavy algorithms consume huge resources, making them inappropriate for lightweight systems due to hardware limitations [29]. For this reason, choosing an appropriate algorithm that consumes affordable resources is necessary to apply the concept to lightweight systems.

Another thing to consider is the additional workload of rewriting the source codes of the application. Custom instructions are not created by compiling source codes written in the general syntax of compilers. Hence, software developers need to convert the legacy source codes into new inline assembly codes to operate the AI core. Defining the custom ISE similarly to the ISA of the previous AI processor is necessary because conserving the previous mechanism lowers the additional workload of rewriting the source codes.

In this paper, we propose a uniprocessor system architecture for lightweight AI systems. To minimize the communication overhead between the AI processor and the general-purpose processor, we selected an AI processor suitable for lightweight embedded systems and defined the custom ISE to design the architecture that integrates the GPP and the AI processor into a single processor.


FIGURE 1. (a) The heterogeneous system architecture. (b) The ASIP-based system architecture.

The ISE is compatible with the MIPS ISA, which includes reserved coprocessor definitions for design-specific ISEs. Further, we designed the integrated core architecture of the processor that executes the ISE. The core architecture includes the processor core and the AI core, which operate on a synchronous clock. According to this architecture, the complex communication is replaced with simple data transfers over multiple internal wires and buffers in only a few clock periods. To verify the system and the core architecture, we designed the processor at the register-transfer level (RTL) in Verilog HDL and built up a verification environment with a field-programmable gate array (FPGA) implementation. Next, we developed the software library containing inline assembly codes that perform the same operations as the legacy library functions developed for the previous modular AI processor. Finally, we evaluated the proposed architecture and the designed processor by executing sample applications that were already verified on the heterogeneous computing system with the previous AI processor.

The main contributions of this work are listed as follows:
• Customized ISE to minimize the overhead caused by the complex communication protocol in the modularized AI system.
• The AI coprocessor architecture and AI core architecture optimized for a 32-bit MIPS core.
• The core architecture for a GPP with an integrated AI coprocessor compatible with the customized ISE.
• The unified processor architecture designed at RTL.
• The software library to control the AI coprocessor.
• Evaluation of the processor architecture by realizing the designed processor in a field-programmable gate array (FPGA) implementation.

In order to describe our work clearly, the rest of the paper is organized as follows. Section II reviews the related works that motivated our research. Section III explains the base architecture of the target modular AI processor, which is used to extract and optimize the AI core for the AI coprocessor. Section IV presents the custom ISE definitions and operations, which are based on the operation of the AI processor. Section V presents the integrated core architecture with the detailed operation of the custom ISE, as well as the unified processor architecture built on the integrated core. Section VI describes the verification environment and evaluates the proposed architecture. Section VII concludes our research by summarizing our work and outlining future endeavors to extend this work.

II. RELATED WORKS
A. FPGA-BASED LIGHTWEIGHT CNN ACCELERATORS
Many studies have been conducted to apply edge computing for AI using FPGA-based heterogeneous computing methods. Kim et al. [30] proposed hardware acceleration for lightweight systems based on SqueezeNet [31], which is a lightweight CNN model for embedded systems. The authors transformed 32-bit floating-point arithmetic into 8-bit integer arithmetic to reduce the intensity of SqueezeNet and designed a parallelized architecture. Xia et al. [32] propose SparkNet, which utilizes depthwise separable convolution to reduce the intensity of the CNN. To optimize the CNN for embedded systems, the authors quantized the parameters to 16-bit integers and designed the accelerator with a pipelined architecture, applying optimal multilevel parallelism. Despite the astonishing performance of the CNN-based heterogeneous accelerators proposed in these studies, the baseline intensity and numerous parameters of multi-layered neural networks still make it challenging to adopt these works on ultra-lightweight embedded systems due to the power consumption, energy dissipation, and area usage originating from the calculation logic and external memory.


B. LIGHTWEIGHT ML ACCELERATORS
As CNNs are not suitable for resource-limited ultra-lightweight systems, lightweight ML accelerators and processors have been designed and introduced. General Vision introduces a neuron-inspired pattern recognition chip based on the k-nearest neighbor (k-NN) algorithm, the CM1K, which is an application-specific integrated circuit (ASIC) [33]. Abeysekara et al. and Suri et al. [34], [35] developed heterogeneous systems for AI applications with a lightweight GPP and the CM1K. These works overcome the computing performance limitations of the lightweight GPP but still have drawbacks related to energy dissipation and throughput degradation originating from the complex communication protocol.

To address these issues, Intel proposes a processor module named Curie, including a pattern matching engine (PME) that accelerates the k-NN algorithm and is manipulated by bus transactions [36]. Although manipulating the accelerator through a bus topology increases the throughput performance and reduces the energy dissipation, this method still has memory access overheads originating from the bus transactions. On the other hand, several studies proposed utilizing burst access in a bus topology to reduce memory access bottlenecks. Borelli et al. [37] proposed a k-NN accelerator for heterogeneous computing that uses scratchpad memory through direct memory access (DMA) connected to an advanced extensible interface (AXI) stream port inside the processing system. Hussain et al. [38] designed an accelerator using a parallel first-in-first-out (FIFO) module to reduce memory access bottlenecks through burst transactions. In both architectures, data transactions between the GPP and the accelerator are processed through burst read and write operations, making the data transactions faster. Li et al. [39] present a k-NN accelerator accessed by the main processor through an AXI slave interface and compose the scratchpad memory as dynamic random access memory (DRAM). In this study, the accelerator directly controls the DMA with burst operations to manipulate the DRAM controller through the AXI stream master port. These studies significantly reduced the memory access bottlenecks but are not optimized for ultra-lightweight embedded systems, as the designs exploit not only on-chip memory but also external memories, resulting in a heavier system in terms of area usage, power consumption, and energy dissipation, and leading to throughput degradation due to long memory access latency.

Different from these works, we concentrate on the following points:
• Designing an optimized architecture that reduces the bottlenecks of the memory access required for AI computation by eliminating the structural overheads caused by bus transactions while avoiding the use of external memories, which make the system heavier and therefore inappropriate for ultra-lightweight systems.
• Optimizing the internal architecture of the k-NN calculation logic through parallelism that directly reflects the data rate and timing constraints of the general-purpose core inside the processor.
• Providing reconfigurability that reflects the k-NN parameters so that the AI core can be adopted under the varying constraints of lightweight systems.

III. BACKGROUND: INTELLINO AI PROCESSOR
Before designing the processor based on the proposed system architecture for edge AI applications, we chose an appropriate AI processor, Intellino, which is designed for heterogeneous computing systems. Intellino is a reconfigurable AI processor based on the k-NN algorithm, a lightweight ML algorithm based on distance calculation, and operates over SPI, which is a common interface for high-speed communication in lightweight embedded systems [40], [41].

Fig. 2 shows the top architecture of the Intellino, which consists of the SPI slave controller, packet decoder and encoder, neuron controller, and classifier. As the Intellino is a slave device, which cannot operate independently, the SPI slave controller performs communication dependent on the clock generated by the host, i.e., the serial clock (SCK). After receiving a one-byte SPI frame, the controller sets the signal called rx_cplt and delivers the received data to the packet decoder. The packet decoder converts the received data into an instruction for the neuron controller by buffering the data and observing specific byte sequences based on the protocol of the Intellino.

The instruction set for the neuron controller is defined as operations based on a register map definition. Thus, a single instruction for the neuron controller is represented as control signals for the register file, such as the register address (reg_addr), register write enable (reg_we), register write data (reg_wd), and register read enable (reg_re). The neuron controller generates the control signals and data signals for the neuron cells and sends back the k-NN calculation results generated by the classifier. The neuron cells and the classifier make up the k-NN calculator.

Fig. 3 shows the architecture of the k-NN calculation process. The k-NN algorithm is calculated by searching the pre-trained dataset for the category-distance pair with the shortest distance from the inference data vector. The k-NN calculator of the Intellino has an architecture that performs this calculation efficiently. At first, the distance between each stored dataset vector and the current inference data vector is calculated by each neuron cell. Next, the classifier submodule determines the category with the lowest distance by comparing the distance data using the lower distance selectors, which are constructed as a multi-level inverted tree structure. Through this structure, the data pair with the lowest distance is derived by the selector at the last level.
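For illustration, the behavior of the calculator can be sketched as a functional C model: per-cell distance accumulation followed by a minimum reduction over valid result batches. This is not the RTL design; the dimensions N_CELLS and VEC_LEN and all function names are illustrative stand-ins for the 2^n and 2^m parameters discussed below.

#include <stdint.h>
#include <stdlib.h>

#define N_CELLS  64   /* 2^n neuron cells (illustrative)           */
#define VEC_LEN  128  /* 2^m components per vector (illustrative)  */

typedef struct {
    uint8_t  vector[VEC_LEN]; /* stored (trained) pattern            */
    uint16_t category;        /* category assigned at training       */
    int      valid;           /* set once the cell holds a pattern   */
} neuron_cell;

/* Each neuron cell accumulates the absolute distance between its
 * stored vector and the inference vector, component by component. */
static uint32_t cell_distance(const neuron_cell *c, const uint8_t *in)
{
    uint32_t acc = 0;
    for (int i = 0; i < VEC_LEN; i++)
        acc += (uint32_t)abs((int)c->vector[i] - (int)in[i]);
    return acc;
}

/* The classifier reduces the (valid, category, distance) batches to
 * the single shortest pair, as the inverted selector tree does in
 * hardware; here the tree is flattened into one loop. */
static int classify(const neuron_cell cells[N_CELLS], const uint8_t *in,
                    uint16_t *cat_out, uint32_t *dist_out)
{
    uint32_t best = UINT32_MAX;
    int found = 0;
    for (int i = 0; i < N_CELLS; i++) {
        if (!cells[i].valid) continue;   /* skip invalid batches */
        uint32_t d = cell_distance(&cells[i], in);
        if (d < best) { best = d; *cat_out = cells[i].category; found = 1; }
    }
    *dist_out = best;
    return found;
}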


FIGURE 2. Top architecture of the Intellino.

FIGURE 3. The architecture of the k-NN execution logic of the Intellino.

Fig. 4 shows the internal architecture of the k-NN submodules. The main role of the neuron cells is storing the pre-trained dataset during the training process and calculating the distance between the stored vector and the input data stream during the inference process. To store the dataset, each neuron cell has a scratchpad memory (SPM) for the data vector, a register for the category data, and a register for checking the validity of the data stored in the SPM and the category register. These storages are written sequentially, one byte at a time, when the value in the neuron cell identifier (NID) register (ncell_idx) of the neuron controller matches the identifier of the current neuron cell and the appropriate write enable signal is set by the neuron controller. Removing the stored data is executed by clearing the flag set by the neuron controller. Distance calculation runs similarly to dataset storing. The distance accumulator in each neuron cell calculates the distance between the components of the stored vector and the inference vector at the current component index and accumulates the calculated distance into the previous distance value. As the inference process proceeds through byte-sequence writing, the total distance calculation completes simultaneously with the reception of the last component of the inference vector. At that time, the classifier takes the data batches, each composed of a valid flag, a category, and a calculated distance, from the neuron cells and passes pairs of batches to the selectors at the first level. Each selector chooses one data batch from two: basically, the selector chooses the batch containing the lower distance when both batches are valid, and when one of the batches is not valid, the selector chooses the valid data batch. Through several steps, the data pair with the shortest distance is derived from the multiple batches.

According to this architecture, the accuracy of the k-NN algorithm depends on the count of the dataset and the length of the data vector. Intellino adopts a reconfigurable architecture with two customizable parameters that reflect these dependencies: 2^n as the number of neuron cells and 2^m as the length of the data vector. This characteristic provides the flexibility to compose systems with the Intellino for a variety of lightweight embedded systems. Nevertheless, communication overheads are still generated by the SPI-based protocol, as the speed of the k-NN calculator, which operates over parallel wires synchronous to the internal clock, is much higher than the speed of the protocol, which operates over serial communication synchronous to the slow external clock. Therefore, eliminating the overheads to improve the throughput and minimize the area by embedding the k-NN calculator of the Intellino into the processor core is an appropriate strategy. To achieve this in our work, we designed an integrated core architecture that utilizes the Intellino as a coprocessor operated by the custom ISE that we defined.

IV. ISE FOR THE AI PROCESSOR
In the MIPS ISA, a custom ISE is defined as custom coprocessor instructions. A MIPS processor can contain up to four coprocessors according to the MIPS ISA definition. Except for coprocessor 0, which is the interrupt and exception controller, and coprocessor 1, which is defined for the floating-point extension, the operation of a coprocessor varies by the designer. Fig. 5 shows the schematized concept of the custom coprocessor. The coprocessor in the MIPS ISA is represented as an additional core that executes instruction codes of the custom ISE while sharing the instruction fetch (IF) process and the memory read/write (MEM) process. In common, a custom coprocessor embeds register files (CPnR), as shown in Fig. 5.

FIGURE 5. The concept of the custom coprocessor in MIPS ISA.


FIGURE 4. (a) The internal architecture of the neuron cell. (b) The internal architecture of the lower distance selector.

Some coprocessors embed additional register files to control the coprocessor (CPnCR). These characteristics provide functional safety for the compiled binary originating from source codes written in a high-level language, which specifies the role of each register in the general-purpose registers (GPR), by ensuring the functional independence of both cores. To form data communication between the MIPS core and the coprocessor, the MIPS ISA offers guidelines for defining sample instructions that perform data transfers between the cores. Additionally, the MIPS ISA defines sample instructions to provide direct communication between the CPnCR and the GPR.

A. CUSTOM ISE
Before defining the custom ISE for the core integration of the Intellino, we first set the identification number of the custom coprocessor to two. Next, we defined the custom ISE based on the register map of the original Intellino. Table 1 shows the instructions of the ISE. The instructions descend from the sample move instructions and memory access instructions guided by the MIPS ISA documentation.

TABLE 1. List of instructions on the Intellino coprocessor.

The main difference between the sample instructions and the ISE is the additional functions on write operations for the register file. The register map of the coprocessor is the same as the register map of the original Intellino. Hence, the instructions that write data to the CP2R, MTC2 and LWC2, perform operations similar to those of the neuron controller in the original. Through this functionality, the mechanisms of the Intellino are conserved while the packet decoding and encoding process of the communication protocol is eliminated.
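The instruction formats themselves are fixed by the base MIPS32 ISA: coprocessor-2 moves use the COP2 major opcode with a sub-opcode in the rs field, while LWC2 and SWC2 have their own major opcodes. As a sketch of how such instructions can be emitted before assembler support exists, the following C macros build the raw 32-bit words; the field layout follows the standard MIPS32 encoding, and only the macro names are our own.

#include <stdint.h>

/* MIPS32 coprocessor-2 encodings. COP2 major opcode = 0x12;
 * MFC2/MTC2 are selected by the rs field. LWC2 (0x32) and
 * SWC2 (0x3A) reuse the I-type load/store format. */
#define OP(x) ((uint32_t)(x) << 26)
#define RS(x) ((uint32_t)(x) << 21)
#define RT(x) ((uint32_t)(x) << 16)
#define RD(x) ((uint32_t)(x) << 11)

#define MFC2(rt, rd) (OP(0x12) | RS(0x00) | RT(rt) | RD(rd)) /* GPR[rt] <- CP2R[rd] */
#define MTC2(rt, rd) (OP(0x12) | RS(0x04) | RT(rt) | RD(rd)) /* CP2R[rd] <- GPR[rt] */
#define LWC2(rt, base, off) (OP(0x32) | RS(base) | RT(rt) | ((off) & 0xFFFF))
#define SWC2(rt, base, off) (OP(0x3A) | RS(base) | RT(rt) | ((off) & 0xFFFF))

With such macros, a custom CP2 instruction can be injected as a raw .word directive even when the toolchain lacks the CP2 mnemonics.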
Table 2 shows the information on the CP2Rs in the CP2 register file. Though almost every operation of the registers is the same as in the original Intellino, some features are changed for optimization. The main difference between the original operations and the newly defined operations is the default unit of the data transfers. As the processor core relies on 32-bit operation, changing the 8-bit operations to 32-bit operations increases the throughput to four times that of the original. Basically, the training process and the inference process of the Intellino are based on the iterative move operations, MTC2 and LWC2, that write the vector data to the addresses of the COMP register and the LCOMP register. Writing data to the COMP or LCOMP register in this state redirects to saving four data components, i.e., 32 bits, to the SPM in the neuron cell whose identifier matches the NID value, at the location addressed by the COMPID value. Further, every neuron cell performs its own distance calculation based on the accumulation of the current distance value between the input component and the stored component. In case the NID value exceeds the maximum identifier value (2^n − 1), nothing is stored in any neuron cell, but the distance calculation is still executed. Consequently, the NID value indicates not only the neuron cell to train but also the current state of the whole coprocessor.


TABLE 2. Register map of the Intellino.

Through this attribute, both the storing of the vector into the neuron cell in the training process and the calculation of the distance in the inference process are executed by the same iterative move operations. The difference between the training process and the inference process is that the training process ends with a write operation of the category of the training data to the neuron cell, while the inference process ends with a read operation of the derived category and distance. Another important attribute is that operations overwriting values in the COMPID register and the NID register are highly infrequent, because certain operations automatically change the value of a register not addressed by the instruction, namely the move operations that write to the COMP, LCOMP, CAT, FORGET, and CLEAR registers.

Listings 1 and 2 show examples of source code for the AI coprocessor executed through the custom ISE and for the original Intellino executed through the SPI communication, respectively. Both source codes execute the training of a single data vector from the dataset into a neuron cell. The length of the vector is configured as 128, and the COMPID value and the NID value are regarded as zero and a value lower than 2^n − 1, respectively. Both codes share the steps of storing the training vector in the SPM through sequential writing and then sending the category value. Although the source codes have similar steps, the code based on the custom ISE provides much more simplicity than that based on the original AI processor due to the exclusion of the numerous function calls for the complex communication protocol.

Listing 1. Assembly codes for training based on the custom ISE.

Listing 2. The source codes for training based on the original architecture.
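Since the listings are reproduced as images in the original, the following C sketch illustrates the same training flow under stated assumptions: the CP2 register operands and the little-endian byte packing are illustrative, and only the COMP/LCOMP/CAT mechanics described above are taken from the text.

#include <stdint.h>

/* Illustrative CP2 register operands; the actual indices follow the
 * register map of Table 2, which is not reproduced here. */
#define COMP_REG  "$2"   /* component write, COMPID auto-increments */
#define LCOMP_REG "$3"   /* last-component write                    */
#define CAT_REG   "$4"   /* category write, ends the training       */

/* Train one 128-component vector into the neuron cell currently
 * selected by NID. Four 8-bit components travel per 32-bit MTC2
 * write, so the 128-byte vector needs only 32 coprocessor writes. */
static void train_vector(const uint8_t vec[128], uint32_t category)
{
    for (int i = 0; i < 128; i += 4) {
        uint32_t word = (uint32_t)vec[i]              /* assumed      */
                      | ((uint32_t)vec[i + 1] << 8)   /* little-      */
                      | ((uint32_t)vec[i + 2] << 16)  /* endian       */
                      | ((uint32_t)vec[i + 3] << 24); /* packing      */
        if (i < 124)
            __asm__ volatile ("mtc2 %0, " COMP_REG : : "r"(word));
        else  /* the final word is written to LCOMP instead */
            __asm__ volatile ("mtc2 %0, " LCOMP_REG : : "r"(word));
    }
    /* Writing the category commits the training of this cell. */
    __asm__ volatile ("mtc2 %0, " CAT_REG : : "r"(category));
}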


V. SYSTEM ARCHITECTURE
Fig. 6 shows the proposed system architecture. The system consists of the AI ASIP for calculation, sensors for receiving inference data, and a display interface to interact with the users. The ASIP adopts the Harvard architecture, which has two paths, the data path and the instruction path, to avoid the bubble insertion generated by sharing the same path between instruction fetching and memory access instructions. Compared to high-performance systems, which generally adopt the modified Harvard architecture, the exclusion of caches reduces area usage and energy consumption by eliminating the overheads caused by cache misses and avoiding the area occupied by the cache. As the memory requirements of lightweight AI are modest, the reduction of the available memory resources otherwise provided by a cache is acceptable. The integrated core on the ASIP processes the input data for inference by executing the program, including the custom AI instructions, through the internal AI coprocessor. The output controller displays the inference result calculated by the AI coprocessor, allowing the users to perceive the result directly. In accordance with this architecture, the throughput of the AI algorithm is improved compared to systems based on the previous heterogeneous architecture. Further, the data flow of the architecture provides privacy protection and network bandwidth reduction due to the edge computing architecture.

FIGURE 6. The proposed system architecture for lightweight AI systems.

A. INTEGRATED CORE DESIGN
To provide the ASIP based on the MIPS ISA with the custom AI ISE, we designed the AI coprocessor for a 5-stage pipelined MIPS core that is compatible with the 32-bit MIPS processor [42] and integrated the coprocessor into the MIPS core. The key aspects we considered in the design of the coprocessor are as follows:
• Conserving the logic related to the main operating mechanisms to ensure the functionality of the legacy features enabled on the original AI processor.
• Optimizing the main calculation logic and memory architecture of the Intellino for the 32-bit parallel data transfer in the MIPS processor.

Fig. 7 and 8 show the data flow when executing the custom ISE instructions. The AI coprocessor, i.e., CP2, is designed to compose a coupled architecture with the pipeline of the basic MIPS core. The fetch stage of the pipeline is shared by both the basic core and CP2, as there is only one instruction path in the core. The other stages and the pipeline registers for each stage exist separately for both the base core and the coprocessor to ensure synchronized operation.

FIGURE 7. Data flow of the MTC2, MFC2 instructions on the core pipeline.

FIGURE 8. Data flow of the LWC2, SWC2 instructions on the core pipeline.

As shown in the figures, the Intellino is held in the write-back stage. The stage that commits 32-bit data to an Intellino register is the same as the one that commits to the GPRs. Through this architecture, the core integration of the Intellino is done without additional data forwarding logic.

The latency for an AI instruction to reach the AI processor from the instruction fetch stage is five periods of the operating clock because of the pipelined architecture. In the training process, this characteristic does not generate throughput degradation as the pipeline operates without any interruption. In the inference process, in contrast, the throughput is reduced. This is because the data transfer instructions from the Intellino to the other units, such as move from coprocessor 2 (MFC2) and store word from coprocessor 2 (SWC2), complete the communication in the decode stage. To receive correct inference results, the insertion of five delay slots, which are slots next to the target instruction that must be filled with independent instructions, is required.

Typically, methods embedding additional forwarding logic are applied to remove the delay slots. Nevertheless, we adopt this architecture because adding data forwarding logic would directly affect the overall performance of the processor, since in the base core architecture the data propagation of each stage passes through the data forwarding logic. In the worst case, five no-operation (NOP) instructions are inserted into every inference process. The throughput decrease caused by these bubbles is acceptable, as the communication performance based on the operating frequency, which remains unchanged due to the coprocessor architecture, is far ahead of the communication performance of the original heterogeneous architecture.
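As a concrete picture of this timing rule, an inference-result read might be arranged as in the sketch below. The mnemonic and result-register operand are illustrative, while the five-NOP spacing follows the pipeline behavior described above.

#include <stdint.h>

/* Read back the inference result after the final LCOMP write.
 * Five NOPs separate the write-back-stage commit of the last
 * vector word from the MFC2 that samples the classifier output
 * in the decode stage. */
static inline uint32_t read_inference_result(void)
{
    uint32_t result;
    __asm__ volatile (
        "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t" "nop\n\t"
        "mfc2 %0, $5"        /* $5: illustrative result register */
        : "=r"(result));
    return result;
}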


B. AI CORE OPTIMIZATION
Fig. 9 shows the optimization of the distance calculation logic for 32-bit parallel communication. The distance calculation works by sequentially accumulating the absolute distance between the current vector components addressed by the COMPID register. In the case of the original AI processor, the distance calculation requires only one subtractor, absolute-value logic, and adder, as the communication protocol unitizes the data packet to 8 bits. In the case of the AI coprocessor, in contrast, the communication is held on 32-bit parallel data wires. This means that four vector components are transferred simultaneously to the calculation logic. To perform the distance calculation without additional bubble generation, we parallelized the sequential accumulation of the original Intellino four ways by duplicating the subtractor and absolute-value logic and adding the adders in a tree structure. With this architecture, the throughput of the optimized Intellino is improved, as the distance calculation is perfectly synchronized to the basic core pipeline.

FIGURE 9. Optimization of the distance calculation for the 32-bit parallel communication.
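In software terms, the transformation of Fig. 9 replaces a byte-serial loop with four parallel subtract/absolute lanes whose outputs merge through a small adder tree. A behavioral C model of one 32-bit beat, written only for illustration, might read:

#include <stdint.h>

/* One 32-bit beat delivers four vector components. The four
 * subtract/absolute units work in parallel; their outputs merge
 * through a two-level adder tree before the accumulator, mirroring
 * the duplicated datapath of Fig. 9. */
static uint32_t accumulate_beat(uint32_t acc, uint32_t stored, uint32_t in)
{
    uint8_t d[4];
    for (int lane = 0; lane < 4; lane++) {   /* unrolled in hardware */
        uint8_t s = (stored >> (8 * lane)) & 0xFF;
        uint8_t x = (in     >> (8 * lane)) & 0xFF;
        d[lane] = (s > x) ? (s - x) : (x - s);
    }
    uint32_t sum01 = (uint32_t)d[0] + d[1];  /* adder tree, level 1 */
    uint32_t sum23 = (uint32_t)d[2] + d[3];
    return acc + sum01 + sum23;              /* level 2 + accumulate */
}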
only one case, the SPI-based communication protocol with
C. ASIP DESIGN
We designed the processor including the optimized AI coprocessor to realize and verify the concept of a lightweight ASIP for AI applications. Fig. 10 shows the architecture of the processor. The processor, including the integrated core architecture with the interrupt controller (CP0), the floating-point unit (CP1), and the designed AI core (CP2), enables the execution of the custom AI ISE and other features such as hardware interrupts and floating-point acceleration. The memory composition of the Harvard architecture, which separates the instruction memory and the data memory, is replaced by embedding random access memories (RAM) with one read/write data path and one read-only data path, which imitates the mechanism in which the instruction fetch stage and the memory stage access the memory simultaneously. To provide reconfigurability of the memories, the connection between the master device of the system bus, the integrated core, and the system bus passes through the joint test action group (JTAG) controller. By sending JTAG signals, the host device takes control of the system bus from the integrated core. This enables memory access from the host device to replace the current program with another. Moreover, a serial interface is provided to receive the training/inference data from other input devices. Finally, a VGA controller and a video RAM access peripheral are included to visualize the result.

FIGURE 10. The architecture of the designed ASIP.

VI. IMPLEMENTATION RESULT
To verify the proposed system architecture, we first evaluated the processor designed for the system by RTL simulation running on Verilator version 5.009, which is an open-source RTL simulation tool. The simulation of the MIPS-Intellino processor has two cases: optimized assembly codes, and source codes with library functions written in the C programming language. Both cases are built into operable machine codes using the GNU compiler collection (GCC) version 5.3.0, and the operating frequency of the processor is supposed to be 50 MHz, which is the recommended frequency of the target MIPS processor and has not been modified. The simulation of the original Intellino has only one case, the SPI-based communication protocol with an 8 Mbps data rate, which is the recommended data rate of the Intellino.

We used the MNIST dataset [43], which is a test set of handwritten digits, for the k-NN training and inference processes. The simulation environment and source codes for each case are provided on the website [44].

Fig. 11 shows the time measurement results of the training process. According to the simulation, the C library functions and the optimized assembly codes showed approximately 165.59 times and 193.88 times enhanced performance on average compared to the original Intellino, which is based on SPI. Both throughput ratios relative to the original Intellino tend to converge to 200 as the vector length increases. This is because the ideal throughput ratio between the 8 Mbps SPI communication and the 32-bit parallel communication running at 50 MHz is 200, and increasing the vector length does not inflate the ancillary instructions, such as setting the starting pointer of the data and setting the category value. As a result, the throughput performance of the training process on the MIPS-Intellino remarkably exceeded that of the original Intellino.
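The convergence toward 200 follows directly from the link widths and clocks stated above:

ideal ratio = (32 bit × 50 MHz) / (8 Mbit/s) = 1600 Mbit/s / 8 Mbit/s = 200

Any remaining gap to 200 is the fixed cost of the ancillary instructions, whose share shrinks as the vector length grows.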


FIGURE 11. The required time for the training process in each case.

After the simulation, we realized the designed processor through FPGA implementation. Fig. 12 shows the architecture of the verification system, which is based on the Zynq-7000 system-on-chip (SoC) mounted on the Digilent ZedBoard. The SoC consists of the general-purpose processing system (PS) and the programmable logic (PL). The PS includes the complex blocks required for a modern operating system (OS), such as the memory management unit (MMU), and a variety of peripherals. Certain peripherals, e.g., GPIO, in the PS are connectable to the programmable logic. This architecture facilitates the input/output process and the verification process of the designed digital circuit by making use of the convenient features the OS environment provides. Thus, we adopted the method of running a modern OS, Linux, on the PS and taking control of the design implemented on the PL block through the OS. The Linux distribution we utilized for the PS is PetaLinux 2021.1, and a detailed guide to running PetaLinux on the Zynq-7000 SoC is provided in [45].

FIGURE 12. The architecture of the verification system.

To realize this method, we first synthesized and implemented the MIPS-Intellino processor on the PL block of the Zynq-7000 SoC and linked the GPIO and the UART of the PS to the JTAG controller and the UART of the designed processor. Next, we developed software that writes the program to the RAM in the designed processor by imitating the JTAG protocol through GPIO, and we programmed the test software, compiled by a cross compiler built with the GNU C compiler (GCC), into the designed processor. At last, we checked the results of the test software running on the MIPS-Intellino through the serial communication software running on the PS.
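The program loader mentioned above reduces, at its core, to bit-banging the JTAG signals from the PS. The following sketch is illustrative only: the GPIO accessor, the pin mapping, and the shift routine stand in for platform-specific code; the TCK/TMS/TDI behavior itself follows the standard JTAG state machine.

#include <stdint.h>

/* Platform-specific GPIO accessor (sysfs, /dev/mem, libgpiod, ...);
 * declared extern to keep the sketch self-contained. */
extern void gpio_set(int pin, int value);

enum { TCK = 0, TMS = 1, TDI = 2 };  /* illustrative pin mapping */

/* Shift one bit into the JTAG chain: the target samples TMS and
 * TDI on the rising edge of TCK. */
static void jtag_clock(int tms, int tdi)
{
    gpio_set(TMS, tms);
    gpio_set(TDI, tdi);
    gpio_set(TCK, 1);
    gpio_set(TCK, 0);
}

/* Shift an n-bit word LSB-first in the Shift-DR state (TMS = 0),
 * raising TMS on the final bit to move to Exit1-DR. */
static void jtag_shift_dr(uint32_t word, int nbits)
{
    for (int i = 0; i < nbits; i++)
        jtag_clock(i == nbits - 1, (word >> i) & 1);
}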
To evaluate the area usage of the designed processor architecture, we performed synthesis and implementation using Xilinx Vivado 2022.2 for the designed coprocessor and the original AI processor with various AI configurations. The jobs were configured to target the PL block of the Xilinx xc7z020clg SoC with additional settings to avoid the utilization of pre-fabricated hard blocks, such as DSP blocks and block RAMs. Hard blocks in FPGAs provide performance enhancements and area reduction, operating more like an ASIC than an FPGA [46]. Thus, utilizing these blocks is regarded as contracting the gap, e.g., in area, power, and propagation delay, between ASICs and FPGAs [46]. However, for the purpose of using the FPGA implementation as a prototype of the digital logic circuit before ASIC implementation, utilizing these blocks makes the performance analysis fuzzier, as the FPGA prototype then operates as an ASIC-FPGA hybrid architecture [47]. Therefore, to provide an approximately clear and comparative analysis of the different digital circuits intended for chip fabrication, synthesizing with only general-purpose blocks, which correspond to a standard cell library, is required.

Fig. 13 shows the resource usage of the implementation results for the previous AI processor and the designed AI coprocessor. The results show that the FPGA implementations of the coprocessor require more look-up tables (LUTs) than those of the original AI processor. On average, the coprocessor uses approximately 6.62% more LUTs compared to the original. This is because the ancillary components of the coprocessor, the pipeline stages, use more LUTs than the ancillary components of the original processor, the SPI slave controller and the packet decoder. The gap between the LUT usage of the coprocessor and that of the original decreases as the vector length parameter increases. As these components are not affected by the AI configurations, such as the vector length and the neuron cell count, increasing the configuration parameters makes the area share of the AI logic and the memory higher, diminishing the impact of the gap between the ancillary components. The usage of the other physical resources, flip-flops (FFs), F7 multiplexers, and F8 multiplexers, of the coprocessor shows values reduced by 74%, 40.72%, and 19.53%, respectively, compared to the original AI processor. According to these results, the area usage of the coprocessor is suitable for replacing the modularized architecture. When the physical area reduction of the SoC architecture, which removes the external wires for communication, is reflected, the advances in area usage performance become even higher.

In the case of power and energy, the implementation results show that the designed AI coprocessor has far better performance than the original Intellino. Fig. 14 presents the energy-related performance of both designs. As shown in the first graph, the AI coprocessor consumes 3.58 times the dynamic power on average compared to the original.

FIGURE 13. Resource usage comparison between the coprocessor and the original Intellino.

This attribute results from the parallelized architecture, consisting of the increase in memory bandwidth and the optimization of the distance calculation logic. The device static power statistics are the same for all configurations and designs, resulting in 0.103 W, as this attribute depends only on the target FPGA. Although the coprocessor requires more power than the standalone Intellino, the energy efficiency of the coprocessor surpasses the original, as shown in the second and third graphs. This is because the energy dissipation required for one operation is proportional not only to the power consumption but also to the elapsed time of one operation, such as the training process or the inference process. As a result, the coprocessor shows 52.75 times the energy efficiency on average compared to the original, demonstrating the advances in energy efficiency.

FIGURE 14. Power and energy comparison between the coprocessor and the original Intellino. (a) Maximum dynamic power consumption. (b) Energy dissipation per one operation. (c) Relative energy efficiency of the coprocessor compared to the original Intellino.
Finally, we analyzed the timing reports of the coprocessor implementation. In spite of the increase in the propagation delay of the distance calculation logic caused by the parallelization, the recommended operating frequency of the target MIPS processor, 50 MHz, is preserved. This is because the increased delay is still less than the critical path of the basic core, the 32-bit ALU operating with a forwarded input. Thus, the performance of basic instruction execution is preserved even though the instruction set is extended by embedding the additional coprocessor for the custom AI ISE. As a result, replacing the modularized AI processor with the ASIP with the integrated core architecture has benefits in both throughput and area usage, at the cost of some flexibility in system composition.

TABLE 3. Comparison with other k-NN accelerators.

Table 3 presents a comparison with other k-NN accelerators designed for lightweight systems. The results show that the AI coprocessor we designed has a significantly higher data rate than the others, while also offering a wide range of configurability in the k-NN parameters and acceptable power consumption.

VII. CONCLUSION
In this paper, we propose a uniprocessor system architecture based on an ASIP for lightweight edge AI systems in order to reduce the communication overhead caused by the complex protocol and asynchronous operation of the heterogeneous architecture. To realize the proposed system architecture, we designed the ASIP supporting a custom ISE based on the


MIPS ISA, which presents guidelines for defining custom ISEs. The ISE we defined is optimized to adapt the target AI processor for heterogeneous computing, Intellino, a reconfigurable lightweight AI processor based on the k-NN algorithm. The ASIP is designed with the integrated core architecture that contains the custom coprocessor operating synchronously with the base core. By placing the core logic that executes the AI algorithm in the write-back stage, the necessity of additional forwarding logic is removed. Further, specific optimizations for the communication based on multiple parallel wires, e.g., the parallelization of the distance accumulation in the k-NN algorithm, boost the throughput remarkably with only a small penalty in area usage.

According to the simulation results, the designed processor achieved up to 193.88 times enhanced throughput performance compared to that of the original modularized AI processor. In addition, the resource shares of each component were reduced, except for LUTs, which were 6.62% higher on average according to the implementation results. Considering that the elimination of the external wires for communication due to the uniprocessor architecture is not reflected, the advantages in area usage are assumed to be even higher. Therefore, adopting the proposed system architecture is reasonable for lightweight systems that have a fixed application and are strongly affected by throughput.

In future work, we plan to advance our work through two separate stages. First, we will eliminate the necessity of the bubble insertion intrinsic to the current coprocessor architecture, which reduces the throughput performance. This objective can be achieved by optimizing the microarchitecture, such as by reconstructing the data processing and propagation structure of the pipeline architecture and adding data forwarding logic to avoid stalls. Second, we will adapt the proposed coprocessor architecture to AI algorithms that are more complicated than k-NN, extending the proposed architecture toward high-performance systems. Ultimately, we aim to fabricate the most advanced ASIP designed in future work onto a chip to prove the feasibility and performance enhancements of the proposed system architecture.

REFERENCES
[1] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "When edge meets learning: Adaptive control for resource-constrained distributed machine learning," in Proc. IEEE Conf. Comput. Commun., Apr. 2018, pp. 63–71.
[2] S. Jang, H. W. Oh, Y. H. Yoon, D. H. Hwang, W. S. Jeong, and S. E. Lee, "A multi-core controller for an embedded AI system supporting parallel recognition," Micromachines, vol. 12, no. 8, p. 852, Jul. 2021. [Online]. Available: https://www.mdpi.com/2072-666X/12/8/852
[3] G. Cornetta and A. Touhafi, "Design and evaluation of a new machine learning framework for IoT and embedded devices," Electronics, vol. 10, no. 5, p. 600, Mar. 2021. [Online]. Available: https://www.mdpi.com/2079-9292/10/5/600
[4] J. Chen, M. Jiang, X. Zhang, D. S. da Silva, V. H. C. de Albuquerque, and W. Wu, "Implementing ultra-lightweight co-inference model in ubiquitous edge device for atrial fibrillation detection," Expert Syst. Appl., vol. 216, Apr. 2023, Art. no. 119407. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417422024265
[5] P. Tsung, T. Chen, C. Lin, C. Chang, and J. Hsu, "Heterogeneous computing for edge AI," in Proc. Int. Symp. VLSI Design, Autom. Test (VLSI-DAT), Apr. 2019, pp. 1–2.
[6] S. Mukhopadhyay, Y. Long, B. Mudassar, C. S. Nair, B. H. DeProspo, H. M. Torun, M. Kathaperumal, V. Smet, D. Kim, S. Yalamanchili, and M. Swaminathan, "Heterogeneous integration for artificial intelligence: Challenges and opportunities," IBM J. Res. Develop., vol. 63, no. 6, pp. 4:1–4:1, Nov. 2019.
[7] C. Choi et al., "Reconfigurable heterogeneous integration using stackable chips with embedded artificial intelligence," Nature Electron., vol. 5, no. 6, pp. 386–393, Jun. 2022, doi: 10.1038/s41928-022-00778-y.
[8] J. Han, M. Choi, and Y. Kwon, "40-TFLOPS artificial intelligence processor with function-safe programmable many-cores for ISO26262 ASIL-D," ETRI J., vol. 42, no. 4, pp. 468–479, Aug. 2020, doi: 10.4218/etrij.2020-0128.
[9] D. Jamma, O. Ahmed, S. Areibi, G. Grewal, and N. Molloy, "Design exploration of ASIP architectures for the K-nearest neighbor machine-learning algorithm," in Proc. 28th Int. Conf. Microelectron. (ICM), Dec. 2016, pp. 57–60.
[10] Y. Chen, H. Lan, Z. Du, S. Liu, J. Tao, D. Han, T. Luo, Q. Guo, L. Li, Y. Xie, and T. Chen, "An instruction set architecture for machine learning," ACM Trans. Comput. Syst., vol. 36, no. 3, pp. 1–35, Aug. 2019, doi: 10.1145/3331469.
[11] H. W. Oh, J. K. Kim, G. B. Hwang, and S. E. Lee, "The design of a 2D graphics accelerator for embedded systems," Electronics, vol. 10, no. 4, p. 469, Feb. 2021. [Online]. Available: https://www.mdpi.com/2079-9292/10/4/469
[12] H. Huang, Z. Liu, T. Chen, X. Hu, Q. Zhang, and X. Xiong, "Design space exploration for YOLO neural network accelerator," Electronics, vol. 9, no. 11, p. 1921, Nov. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/11/1921
[13] J. Wang and S. Gu, "FPGA implementation of object detection accelerator based on Vitis-AI," in Proc. 11th Int. Conf. Inf. Sci. Technol. (ICIST), May 2021, pp. 571–577.
[14] E. Rapuano, G. Meoni, T. Pacini, G. Dinelli, G. Furano, G. Giuffrida, and L. Fanucci, "An FPGA-based hardware accelerator for CNNs inference on board satellites: Benchmarking with Myriad 2-based solution for the CloudScout case study," Remote Sens., vol. 13, no. 8, p. 1518, Apr. 2021. [Online]. Available: https://www.mdpi.com/2072-4292/13/8/1518
[15] M. Perotti, P. D. Schiavone, G. Tagliavini, D. Rossi, T. Kurd, M. Hill, L. Yingying, and L. Benini, "HW/SW approaches for RISC-V code size reduction," in Proc. Workshop Comput. Archit. Res. RISC-V, 2020, pp. 1–12.
[16] A. K. Verma, P. Brisk, and P. Ienne, "Rethinking custom ISE identification: A new processor-agnostic method," in Proc. Int. Conf. Compil., Archit., Synth. Embedded Syst., New York, NY, USA, Sep. 2007, p. 125, doi: 10.1145/1289881.1289905.
[17] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanović, "The RISC-V instruction set manual. Volume 1: User-level ISA, version 2.0," Univ. California, Berkeley, Dept. Elect. Eng. Comput. Sci., Berkeley, CA, USA, Tech. Rep. UCB/EECS-2014-54, 2014.
[18] C. Price, "MIPS IV instruction set," Tech. Rep., Revision 3.2, 1995.
[19] A. S. Eissa, M. A. Elmohr, M. A. Saleh, K. E. Ahmed, and M. M. Farag, "SHA-3 instruction set extension for a 32-bit RISC processor architecture," in Proc. IEEE 27th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2016, pp. 233–234.
[20] T. Ahmed, N. Sakamoto, J. Anderson, and Y. Hara-Azumi, "Synthesizable-from-C embedded processor based on MIPS-ISA and OISC," in Proc. IEEE 13th Int. Conf. Embedded Ubiquitous Comput., Oct. 2015, pp. 114–123.
[21] Y. Zhou, X. Jin, and T. Xiang, "RISC-V graphics rendering instruction set extensions for embedded AI chips implementation," in Proc. 2nd Int. Conf. Big Data Eng. Technol., New York, NY, USA, Jan. 2020, p. 85, doi: 10.1145/3378904.3378926.
[22] M. Cococcioni, F. Rossi, E. Ruffaldi, and S. Saponara, "A lightweight posit processing unit for RISC-V processors in deep neural network applications," IEEE Trans. Emerg. Topics Comput., vol. 10, no. 4, pp. 1898–1908, Oct. 2022.
[23] S. Lee, Y. Hung, Y. Chang, C. Lin, and G. Shieh, "RISC-V CNN coprocessor for real-time epilepsy detection in wearable application," IEEE Trans. Biomed. Circuits Syst., vol. 15, no. 4, pp. 679–691, Aug. 2021.


[24] N. Wu, T. Jiang, L. Zhang, F. Zhou, and F. Ge, "A reconfigurable convolutional neural network-accelerated coprocessor based on RISC-V instruction set," Electronics, vol. 9, no. 6, p. 1005, Jun. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/6/1005
[25] W. Lou, C. Wang, L. Gong, and X. Zhou, "RV-CNN: Flexible and efficient instruction set for CNNs based on RISC-V processors," in Advanced Parallel Processing Technologies, P.-C. Yew, P. Stenström, J. Wu, X. Gong, and T. Li, Eds. Cham, Switzerland: Springer, 2019, pp. 3–14.
[26] Z. Li, W. Hu, and S. Chen, "Design and implementation of CNN custom processor based on RISC-V architecture," in Proc. IEEE 21st Int. Conf. High Perform. Comput. Commun., IEEE 17th Int. Conf. Smart City, IEEE 5th Int. Conf. Data Sci. Syst., Aug. 2019, pp. 1945–1950.
[27] R. Porter, S. Morgan, and M. Biglari-Abhari, "Extending a soft-core RISC-V processor to accelerate CNN inference," in Proc. Int. Conf. Comput. Sci. Comput. Intell. (CSCI), Dec. 2019, pp. 694–697.
[28] E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, and L. Benini, "GAP-8: A RISC-V SoC for AI at the edge of the IoT," in Proc. IEEE 29th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2018, pp. 1–4.
[29] F. Taheri, S. Bayat-Sarmadi, and S. Hadayeghparast, "RISC-HD: Lightweight RISC-V processor for efficient hyperdimensional computing inference," IEEE Internet Things J., vol. 9, no. 23, pp. 24030–24037, Dec. 2022.
[30] J. Kim, J.-K. Kang, and Y. Kim, "A resource efficient integer-arithmetic-only FPGA-based CNN accelerator for real-time facial emotion recognition," IEEE Access, vol. 9, pp. 104367–104381, 2021.
[31] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," 2016, arXiv:1602.07360.
[32] M. Xia, Z. Huang, L. Tian, H. Wang, V. Chang, Y. Zhu, and S. Feng, "SparkNoC: An energy-efficiency FPGA-based accelerator using optimized lightweight CNN for edge computing," J. Syst. Archit., vol. 115, May 2021, Art. no. 101991. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1383762121000138
[33] General Vision. (Aug. 2017). CM1K Hardware User's Manual. Accessed: Apr. 8, 2023. [Online]. Available: https://www.generalvision.com/documentation/TM_CM1K_Hardware_Manual.pdf
[34] L. L. Abeysekara and H. Abdi, "Short paper: Neuromorphic chip embedded electronic systems to expand artificial intelligence," in Proc. 2nd Int. Conf. Artif. Intell. Industries (AI4I), Sep. 2019, pp. 119–121.
[35] M. Suri, V. Parmar, A. Singla, R. Malviya, and S. Nair, "Neuromorphic hardware accelerated adaptive authentication system," in Proc. IEEE Symp. Ser. Comput. Intell., Dec. 2015, pp. 1206–1213.
[36] R. Dower. (2022). Intel Pattern Matching Technology. Accessed: Apr. 8, 2023. [Online]. Available: https://github.com/intel/Intel-Pattern-Matching-Technology
[37] A. Borelli, F. Spagnolo, R. Gravina, and F. Frustaci, "An FPGA-based hardware accelerator for the k-nearest neighbor algorithm implementation in wearable embedded systems," in Applied Intelligence and Informatics, M. Mahmud, C. Ieracitano, M. S. Kaiser, N. Mammone, and F. C. Morabito, Eds. Cham, Switzerland: Springer, 2022, pp. 44–56.
[38] H. Hussain, K. Benkrid, C. Hong, and H. Seker, "An adaptive FPGA implementation of multi-core K-nearest neighbour ensemble classifier using dynamic partial reconfiguration," in Proc. 22nd Int. Conf. Field Program. Log. Appl. (FPL), Aug. 2012, pp. 627–630.
[39] Z.-H. Li, J.-F. Jin, X.-G. Zhou, and Z.-H. Feng, "K-nearest neighbor algorithm implementation on FPGA using high level synthesis," in Proc. 13th IEEE Int. Conf. Solid-State Integr. Circuit Technol. (ICSICT), Oct. 2016, pp. 600–602.
[40] Y. H. Yoon, D. H. Hwang, J. H. Yang, and S. E. Lee, "Intellino: Processor for embedded artificial intelligence," Electronics, vol. 9, no. 7, p. 1169, Jul. 2020. [Online]. Available: https://www.mdpi.com/2079-9292/9/7/1169
[41] D. H. Hwang, C. Y. Han, H. W. Oh, and S. E. Lee, "ASimOV: A framework for simulation and optimization of an embedded AI accelerator," Micromachines, vol. 12, no. 7, p. 838, Jul. 2021. [Online]. Available: https://www.mdpi.com/2072-666X/12/7/838
[42] H. W. Oh, K. N. Cho, and S. E. Lee, "Design of 32-bit processor for embedded systems," in Proc. Int. SoC Design Conf. (ISOCC), Oct. 2020, pp. 306–307.
[43] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 141–142, Nov. 2012.
[44] H. W. Oh. (2023). Simulation Env. for the MIPS-Intellino and the Original Intellino. Accessed: Apr. 9, 2023. [Online]. Available: https://github.com/hopesandbeers/MIPS-Intellino-sim
[45] Xilinx. (2022). Zynq-7000 Embedded Design Tutorial. Accessed: Nov. 8, 2022. [Online]. Available: https://xilinx.github.io/Embedded-Design-Tutorials/docs/2021.1/build/html/docs/Introduction/Zynq7000-EDT/Zynq7000-EDT.html
[46] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, Feb. 2007.
[47] A. Ehliar and D. Liu, "An ASIC perspective on FPGA optimizations," in Proc. Int. Conf. Field Program. Log. Appl., Aug. 2009, pp. 218–223.
[48] General Vision. (Apr. 2019). NM500 User's Manual. Accessed: Apr. 8, 2023. [Online]. Available: https://generalvision.com/documentation/TM_NM500_Hardware_Manual.pdf

HYUN WOO OH received the B.S. degree in electronic and IT media engineering and the M.S. degree in electronic engineering from the Seoul National University of Science and Technology, Seoul, South Korea, in 2021 and 2023, respectively. He has published papers and articles related to processor architecture and posit arithmetic. His current research interests include computer architecture, system-on-chip, and edge computing.

SEUNG EUN LEE (Senior Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1998 and 2000, respectively, and the Ph.D. degree in electrical and computer engineering from the University of California, Irvine (UC Irvine), in 2008. After graduating, he was with Intel Labs, Hillsboro, OR, USA, where he worked as a Platform Architect. In 2010, he joined the Seoul National University of Science and Technology, Seoul, as a faculty member. His current research interests include computer architecture, multi-processor system-on-chip, low-power and resilient VLSI, and hardware acceleration for emerging applications.
