



An Ultra-low Power TinyML System for Real-time Visual Processing at Edge

Kunran Xu†, Huawei Zhang†, Yishi Li, Yuhao Zhang, Rui Lai, Member, IEEE, and Yi Liu

IEEE Transactions on Circuits and Systems II: Express Briefs, DOI 10.1109/TCSII.2023.3239044

Abstract—Tiny machine learning (TinyML), executing AI workloads on systems with strictly restricted resources and power, is an important and challenging topic. This brief first presents an extremely tiny backbone for constructing high-efficiency CNN models for various visual tasks. Then, a specially designed neural co-processor (NCP) is interconnected with an MCU to build an ultra-low power TinyML system, which stores all features and weights on chip and completely removes both the latency and the power consumption of off-chip memory access. Moreover, an application-specific instruction set is presented to enable agile development and rapid deployment. Extensive experiments demonstrate that the proposed TinyML system, based on our tiny model, NCP and instruction set, yields considerable accuracy and achieves a record ultra-low power of 160mW while implementing object detection and recognition at 30FPS. The demo video is available on https://www.youtube.com/watch?v=mIZPxtJ-9EY.

Index Terms—Convolutional neural network, tiny machine learning, internet of things, application specific instruction-set

Fig. 1. The overview of the proposed TinyML system for visual processing.

I. INTRODUCTION

Running machine learning inference in resource- and power-limited environments, also known as Tiny Machine Learning (TinyML), has grown rapidly in recent years. It promises to drastically expand the application domains of healthcare, surveillance, IoT, etc. [1], [2]. However, TinyML presents severe challenges due to the large computational load and memory demand of AI models, especially in vision applications. Popular solutions using a CPU+GPU architecture have shown high flexibility in MobileML applications [3], but they are no longer feasible in TinyML because of the much stricter constraints on hardware resources and power consumption. A typical TinyML system based on a microcontroller unit (MCU) usually has only <512KB on-chip SRAM, <2MB Flash, <1GOP/s computing capability, and a <1W power limitation [2], [4]. Meanwhile, it is difficult to use off-chip memory (e.g., DRAM) in a TinyML system under such a limited energy budget, leaving a huge gap between the desired and available storage capacity for running visual AI models.

Recently, continuously emerging studies on TinyML have deployed CNNs on MCUs by introducing memory-efficient inference engines [1], [4] and more compact CNN models [5], [6]. However, existing TinyML systems still struggle to deliver high-accuracy, real-time inference with ultra-low power consumption. For example, the state-of-the-art MCUNet [1] obtains 5FPS on an STM32F746 but only achieves 49.9% top-1 accuracy on ImageNet; when the frame rate is increased to 10FPS, its accuracy further drops to 40.5%. What's more, running CNNs on MCUs is still not an extremely power-efficient solution, due to the low efficiency of a general-purpose CPU in intensive convolution computing and massive weight data transfers. Considering this, we propose to greatly promote TinyML systems by jointly designing more efficient CNN models and a specific CNN co-processor. Specifically, we first design an extremely tiny CNN backbone, EtinyNet, aimed at TinyML applications, which has only 477KB of model weights and a maximum feature map size of 128KB, yet still yields a remarkable 66.5% ImageNet top-1 accuracy. Then, an ASIC-based neural co-processor (NCP) is specially designed to accelerate inference. Since it implements CNN inference in a fully on-chip memory access manner, the proposed NCP achieves up to 180FPS throughput with 73.6mW ultra-low power consumption. On this basis, we propose a state-of-the-art TinyML system, shown in Fig.2, for visual processing, which yields a record low power of 160mW when detecting and recognizing objects at 30FPS.

In summary, we make the following contributions:

1) An extremely tiny CNN backbone named EtinyNet is specially designed for TinyML. It is far more efficient than existing lightweight CNN models.

2) An efficient neural co-processor (NCP) with specific designs for tiny CNNs is proposed. While running EtinyNet, NCP provides remarkable processing efficiency and a convenient interface to a wide range of MCUs via SDIO/SPI.

3) Building upon the proposed EtinyNet and NCP, we promote the visual processing TinyML system to achieve record ultra-low power and real-time processing efficiency, greatly advancing the TinyML community.

†Authors contributed equally to this work. This work was supported in part by the National Key R&D Program of China under Grant 2018YF70202800 and the Natural Science Foundation of China (NSFC) under Grant 61674120. (Corresponding author: Rui Lai.)
Kunran Xu, Huawei Zhang, Yishi Li, Yuhao Zhang and Rui Lai are with the School of Microelectronics, Xidian University, Xi'an 710071, and also with the Chongqing Innovation Research Institute of Integrated Circuits, Xidian University, Chongqing 400031, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).
Yi Liu is with the School of Microelectronics, Xidian University, Xi'an 710071, and also with the Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China ([email protected]).


Fig. 2. The TinyML system for verification.

Fig. 3. The proposed building blocks that make up the EtinyNet. (a) is the Linear Depthwise Block (LB), (b) is the Dense Linear Depthwise Block (DLB), and (c) is the configuration of the backbone.

II. SOLUTION OF OUR TINYML SYSTEM

Fig.1 shows the overview of the proposed TinyML system. It integrates an MCU with the specially designed, energy-efficient NCP on a compact board to achieve superior efficiency in a collaborative working manner. To the best of our knowledge, we are the first to propose such a collaborative architecture in the TinyML field; it successfully balances efficiency and flexibility during inference.

Initially, the MCU sends the model weights and instructions to the NCP, which has sufficient on-chip SRAM to cache all of this data. During inference, the NCP computes the intensive CNN backbone efficiently, while the MCU performs only the light-load pre-processing (color normalization) and post-processing (fully-connected layer, non-maximum suppression, etc.), which improves the overall energy efficiency to the greatest extent. Besides, the inference process of the NCP involves only two kinds of data transfer: the input image and the output results. This working mode greatly reduces the off-chip data transfer power consumption and the overall processing latency, and helps the system achieve high energy efficiency in computing, as will be demonstrated in Section VI.

Considering real-time applications, we interconnect the NCP and the MCU with an SDIO/SPI interface. SDIO provides up to 500Mbps of bandwidth, which can carry about 300FPS of 256×256 RGB images and 1200FPS of 128×128 ones. As for SPI, it still reaches 100Mbps, an equivalent throughput of 60FPS for 256×256 RGB images (the arithmetic is reproduced in the sketch at the end of this section). These two buses are widely supported by MCUs on the market, so the NCP can be applied in a wide range of TinyML systems.

Fig.2 shows the prototype verification system, consisting of only an STM32L4R9 MCU and our proposed NCP. Thanks to the innovative model (EtinyNet), co-processor (NCP) and application-specific instruction set, the entire system delivers both efficiency and flexibility.
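The bus arithmetic above is easy to reproduce. A minimal sketch, assuming 8-bit pixels and counting only the input-image traffic, as the text does:

```python
# Upper bound on frames/s when the bus only carries raw RGB input images.
# Nominal rates from the text: SDIO ~500 Mbps, SPI ~100 Mbps.

def max_fps(bandwidth_mbps: float, width: int, height: int, channels: int = 3) -> float:
    bits_per_frame = width * height * channels * 8  # 8-bit pixels
    return bandwidth_mbps * 1e6 / bits_per_frame

print(round(max_fps(500, 256, 256)))  # ~318, i.e. "about 300FPS"
print(round(max_fps(500, 128, 128)))  # ~1272, i.e. "1200FPS"
print(round(max_fps(100, 256, 256)))  # ~64, i.e. "60FPS"
```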
III. PARAMETER-EFFICIENT ETINYNET MODEL

Since the NCP handles CNN workloads entirely on-chip in pursuit of extreme efficiency, we focus on reducing the model size to satisfy the memory constraints of IoT devices in TinyML, which is totally different from MobileML's target of reducing MAdds. By presenting the Linear Depthwise Block (LB) and the Dense Linear Depthwise Block (DLB), we derive an extremely tiny CNN backbone, EtinyNet, shown in Fig.3.

A. Design of Proposed Blocks

We present the linear depthwise convolution by removing the ReLU behind the DWConv ϕd1, based on the observation that this non-linearity harms accuracy in extremely parameter-efficient architectures, forming a specific case of sparse coding. Then, we introduce an additional DWConv ϕd2 behind the PWConv ϕp to build a novel linear depthwise block (LB) that exploits DWConv's parameter efficiency [7]. The LB is defined as

O = σ(ϕd2(σ(ϕp(ϕd1(I)))))    (1)

As shown in Fig.3(a), the structure of the proposed LB can be represented as DWConv-PWConv-DWConv, which is clearly different from the PWConv-DWConv-PWConv bottleneck block commonly used in mobile models; the reason is that increasing the proportion of DWConv is beneficial to the accuracy of tiny models.

Additionally, we introduce a dense connection into the LB to increase its equivalent width, which is important and necessary for higher accuracy [8] given the very limited size of features and weights. We refer to the resulting block as the Dense Linear Depthwise Block (DLB), depicted in Fig.3(b). Note that we take ϕd1 and ϕp as a whole due to the removal of the ReLU, and add the shortcut connection at the ends of these two layers.
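For concreteness, the following PyTorch sketch renders Eq. (1) and the DLB shortcut. It reflects our reading of the text (each convolution carries its own BN, as in Table I, and the DLB shortcut is additive and applied only when shapes agree), not the authors' released code:

```python
import torch
import torch.nn as nn

class LinearDepthwiseBlock(nn.Module):
    """LB of Eq. (1): DWConv (linear, no ReLU) -> PWConv+ReLU -> DWConv+ReLU."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # phi_d1: depthwise conv with BN but *no* ReLU (the "linear" part)
        self.dw1 = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch))
        # phi_p: pointwise conv, followed by sigma (ReLU)
        self.pw = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))
        # phi_d2: second depthwise conv, followed by sigma (ReLU)
        self.dw2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, groups=out_ch, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dw2(self.pw(self.dw1(x)))

class DenseLinearDepthwiseBlock(LinearDepthwiseBlock):
    """DLB: a shortcut wraps (phi_d1, phi_p) taken as a whole."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pw(self.dw1(x))
        if y.shape == x.shape:  # identity shortcut only when shapes permit
            y = y + x
        return self.dw2(y)
```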
B. Architecture of EtinyNet Backbone

By stacking LBs and DLBs, we configure the EtinyNet backbone as indicated in Fig.3(c), where n, c and s denote the number of block repetitions, the number of output channels, and the stride of the first layer in each block (the other layers' strides equal one), respectively. Since dense connections consume more memory space, we only utilize the DLB at high-level stages with much smaller feature maps. It is encouraging that the EtinyNet backbone has only 477KB of parameters and still achieves 66.5% ImageNet top-1 accuracy. The extreme compactness of EtinyNet makes it possible to design a small-footprint NCP that runs without off-chip DRAM.
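Continuing the previous sketch, stage stacking could look as follows. Note the (n, c, s) entries are placeholders of our own: the actual configuration table is in Fig. 3(c), which is not reproduced in this text:

```python
import torch.nn as nn

# Hypothetical stage table (n = repeats, c = out channels, s = first stride);
# the real values live in Fig. 3(c). Reuses the block classes defined above.
STAGES = [
    # (block type,              n,  c,  s)
    (LinearDepthwiseBlock,      4,  32, 2),
    (LinearDepthwiseBlock,      4, 128, 2),
    (DenseLinearDepthwiseBlock, 3, 192, 2),  # DLBs only at late stages,
    (DenseLinearDepthwiseBlock, 2, 256, 2),  # where feature maps are small
]

def build_backbone(stem_ch: int = 32) -> nn.Sequential:
    layers = [nn.Sequential(
        nn.Conv2d(3, stem_ch, 3, 2, 1, bias=False),
        nn.BatchNorm2d(stem_ch), nn.ReLU(inplace=True))]
    in_ch = stem_ch
    for block, n, c, s in STAGES:
        for i in range(n):
            layers.append(block(in_ch, c, stride=s if i == 0 else 1))
            in_ch = c
    return nn.Sequential(*layers)
```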


IV. APPLICATION SPECIFIC INSTRUCTION-SET FOR NCP

To easily deploy tiny CNN models on the NCP, we define an application-specific instruction set. As shown in Table I, the set contains 13 instructions, belonging to a neural operation type and a control type, respectively. It covers the basic operations of tiny CNN models, and each instruction consists of 128 bits: 5 bits for the operation code, and the rest for the attributes of operations and operands (a hypothetical encoding is sketched after Table I). With each neural-type instruction encoding an entire layer, the proposed instruction set has a relatively coarse granularity, which simplifies the control complexity of the hardware. Moreover, the basic operations included in the instruction set provide sufficient ability to execute commonly-used CNN architectures (e.g., MobileNetV2 [9], MobileNeXt [10], etc.).

TABLE I
INSTRUCTION SET FOR PROPOSED NCP

Instruction   Description                             Type
bn            batch normalization                     N
relu          non-linear activation operation         N
conv          1x1 and 3x3 convolution & bn, relu      N
dwconv        3x3 depthwise conv & bn, relu           N
add           elementwise addition                    N
move          move tensor to target address           N
dsam          down-sampling by factor of 2            N
usam          up-sampling by factor of 2              N
maxp          max pooling by factor of 2              N
gap           global average pooling                  N
jump          set program counter (PC) to target      C
sup           suspend processor                       C
end           suspend processor and reset PC          C
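To make the 128-bit format concrete, here is a hypothetical encoder. The brief fixes only the 5-bit opcode and the 128-bit width; the operand fields below (TM addresses, a shape descriptor) are our own guesses for illustration:

```python
# Sketch of packing one 128-bit NCP instruction. Field widths other than the
# 5-bit opcode are hypothetical; the brief does not publish the exact layout.
OPCODES = {"bn": 0, "relu": 1, "conv": 2, "dwconv": 3, "add": 4,
           "move": 5, "dsam": 6, "usam": 7, "maxp": 8, "gap": 9,
           "jump": 10, "sup": 11, "end": 12}

def encode(op: str, src: int = 0, dst: int = 0, shape: int = 0) -> bytes:
    word = OPCODES[op]                  # bits [4:0]   opcode
    word |= (src   & 0xFFFFF) << 5      # bits [24:5]  source address in TM
    word |= (dst   & 0xFFFFF) << 25     # bits [44:25] destination address
    word |= (shape & 0xFFFFFFFF) << 45  # bits [76:45] tensor shape descriptor
    return word.to_bytes(16, "little")  # 128 bits total

program = [encode("conv", 0x00000, 0x20000), encode("end")]
```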
V. DESIGN OF NEURAL CO-PROCESSOR

As shown in Fig.4, the proposed NCP consists of five main components: the Neural Operation Unit (NOU), Tensor Memory (TM), Instruction Memory (IM), I/O, and the System Controller (SC). When the NCP works, the SC decodes one instruction fetched from the IM and informs the NOU to start computing with the decoded signals. The computing process takes multiple cycles, during which the NOU reads operands from TM and writes results back automatically. Once the write-back completes, the SC proceeds to the next instruction until an end or suspend instruction is encountered (this loop is sketched in code below). When the NOU is idle, TM is accessed through I/O. We describe each component in the following parts.

Fig. 4. The overall block diagram of the proposed NCP.
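The control flow just described fits in a few lines. In this behavioural sketch, decode and ncp.execute are hypothetical stand-ins for the instruction decoder and the NOU/TM datapath:

```python
# Behavioural model of SC's fetch-decode-execute loop (Section V).
def run(ncp, program, decode):
    pc = 0
    while True:
        op, args = decode(program[pc])   # fetch from IM, decode to signals
        if op == "sup":                  # suspend, PC keeps its value
            return pc
        if op == "end":                  # suspend and reset PC
            return 0
        if op == "jump":                 # control type: redirect PC
            pc = args["target"]
            continue
        ncp.execute(op, **args)          # NOU reads TM, computes, writes back
        pc += 1                          # next instruction (one per layer)
```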
A. Neural Operation Unit

CNN workloads mainly come from the int8 conv, dwconv and float32 bn operations. To achieve high energy efficiency, we design a dedicated hardware unit for each of them, termed NOU-conv, NOU-dw and NOU-post, focusing on optimizing the implementation of each operation. We further address the design details in the following three aspects.

1) Different from other designs [11], [12] with fine-grained instructions, we implement NOU-conv as a hardwired matrix multiply-accumulate (MAC) [13] array, which helps to improve efficiency with simpler control logic. The MAC array is designed to perform matrix outer products with parallelism in the spatial and output-channel dimensions, handling the most computationally costly 3×3 Conv and PWConv via the im2col operation. In this way, the number of effective multiplications in each cycle is fixed at Toc × Thw. Note that the number of channels varies across convolution layers, which may lead to inefficient computation in other implementations (e.g., dot product); our scheme avoids this problem and improves the overall efficiency of running the PWConv of the entire network. Moreover, the addition is realized by a simple accumulation process instead of the commonly-used adder tree with its extra hardware overhead (this dataflow is sketched in code after the list).

2) As for the implementation of DWConv, the MAC array designed above is efficient only in its diagonal units. Given this, we turn to the classical convolution processing pipeline [14], in which nine multipliers and eight adders compute the DWConv of each channel. The independence between channels allows us to extend the pipelines easily, implementing a parallelism of Toc to build NOU-dw. Since the feature length in the spatial dimension is usually much larger than the pipeline depth, the DWConv can be performed in a fully pipelined manner, which gives NOU-dw an ultra-high efficiency of nearly 100%.

3) In the NOU-post unit, int2float, float32 multiply-add, float2int and ReLU modules are designed and interconnected to perform the post-operations of float32 BN, ReLU and element-wise addition. To reduce memory access as much as possible, multiplexers are further utilized to select data from the output of NOU-conv, NOU-dw or TM and to connect the modules as needed, allowing flexible fusion of the post-operations with the preceding convolution layer. By implementing Toc pipelines to match the throughput of convolution, we effectively maximize the efficiency of the fused operations.
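A numpy model of the NOU-conv dataflow mentioned in 1): the convolution is lowered to a GEMM by im2col and then computed as a sum of rank-1 (outer-product) updates over (Toc, Thw) tiles, so each modelled cycle performs exactly Toc × Thw effective multiplications regardless of the layer's channel count. This is a sketch of the dataflow, with loops in software where the hardware unrolls them:

```python
import numpy as np

T_oc, T_hw = 16, 32  # output-channel and spatial parallelism (Section V-C)

def nou_conv(patches: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """patches: im2col matrix, shape (K, HW), with K = C_in * k * k.
    weights: shape (C_out, K). Each k-step on one tile is one 'cycle'."""
    w = weights.astype(np.int32)
    p = patches.astype(np.int32)
    C_out, K = w.shape
    HW = p.shape[1]
    acc = np.zeros((C_out, HW), dtype=np.int32)  # accumulation, no adder tree
    for oc in range(0, C_out, T_oc):
        for hw in range(0, HW, T_hw):
            for k in range(K):
                # one rank-1 update: T_oc weights x T_hw pixels
                acc[oc:oc+T_oc, hw:hw+T_hw] += np.outer(
                    w[oc:oc+T_oc, k], p[k, hw:hw+T_hw])
    return acc  # equals weights @ patches
```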

B. Tensor Memory and Tensor Layout

1) TM is a single-port SRAM consisting of 6 banks whose width is Ttm × 8 bits, as shown in Fig.4. Thanks to the compactness of EtinyNet, the NCP requires only 992KB of on-chip SRAM in total. BankI (192KB) is responsible for caching 256×256 input RGB images. The 128KB Bank0 and Bank1 are arranged for caching feature maps, while Bank2 and Bank3, with a larger size of 256KB, are used for storing weights. BankO (32KB) stores the final results, such as feature vectors and bounding boxes. TM's small capacity and simple structure give our NCP a small footprint.


2) The highly efficient NOU brings two types of tensor layouts, named the pixel-major layout and the interleaved layout, shown in Fig.5. For the former, all pixels of the first channel are sequentially mapped to TM in row-major order; then the next channel's counterpart is arranged in the same pattern, until the last channel's pixels of the tensor are stored. For the latter, the whole tensor is divided into Nc//Ttm tiles that are placed in TM sequentially, with each tile arranged in channel-major order. Different layouts are required for the NOUs to achieve maximum efficiency (see the numpy sketch below). For example, the input of NOU-conv prefers the pixel-major layout, because Thw spatially continuous pixels of one channel are multiplied and accumulated at a time by the MAC array, while the reverse holds for NOU-dw.

Fig. 5. Illustration of different tensor layouts. (a) Pixel-major layout. (b) Interleaved layout.
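In numpy terms, the two layouts of Fig. 5 amount to the reshapes below. A minimal sketch, assuming the channel count is a multiple of Ttm:

```python
import numpy as np

T_tm = 32  # TM word width in bytes (see Section V-C)

def pixel_major(t: np.ndarray) -> np.ndarray:
    """(C, H, W) -> (C, H*W): each channel's pixels laid out row-major,
    one channel after another."""
    C, H, W = t.shape
    return t.reshape(C, H * W)

def interleaved(t: np.ndarray) -> np.ndarray:
    """(C, H, W) -> (C//T_tm, H*W, T_tm): the tensor is split into C//T_tm
    tiles; within a tile, the T_tm channel values of one pixel sit side by
    side (channel-major inside a TM word)."""
    C, H, W = t.shape
    assert C % T_tm == 0, "sketch assumes C is a multiple of T_tm"
    return t.reshape(C // T_tm, T_tm, H * W).transpose(0, 2, 1).copy()
```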
3) When running the proposed LB, DLB and other blocks on the NOU, the layout between adjacent DWConv and PWConv layers constantly alternates, which seriously decreases computing efficiency because of the resulting discontinuous memory access: it takes NOU-conv Toc reads of an NOU-dw output stored in the interleaved layout to perform a single matrix outer-product operation. Hence, an efficient layout conversion circuit is designed to tackle this problem. As shown in Fig.6, the circuit is composed of two Toc × Thw register arrays, A and B, working in a ping-pong mechanism (modelled in the sketch below). At the beginning, array A receives Toc inputs at a time; after Thw cycles, A is filled and starts to output Thw results at a time in the transposed dimension. Since reading A empty requires Toc cycles, newly arriving data to be converted is sent to array B in order to maintain the pipeline. When B is full and A completes its readout, their roles are exchanged. This strategy markedly boosts the efficiency of valid memory access for computing.

Fig. 6. The proposed efficient layout conversion circuit.
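A behavioural model of this ping-pong conversion (register-level timing is omitted; only the fill/drain role swap is captured):

```python
import numpy as np

T_oc, T_hw = 16, 32  # register array dimensions (Section V-C)

class PingPongConverter:
    """Fig. 6 model: while one array fills with T_oc values per cycle, the
    other drains T_hw values per cycle in the transposed dimension, so the
    conversion never stalls the convolution pipeline."""
    def __init__(self):
        self.arrays = [np.zeros((T_oc, T_hw), dtype=np.int32) for _ in range(2)]
        self.filling = 0  # index of the array currently being written

    def convert(self, columns):
        """columns: T_hw write cycles, each delivering T_oc values.
        Returns T_oc read cycles, each yielding T_hw values."""
        a = self.arrays[self.filling]
        for c, col in enumerate(columns):             # fill column by column
            a[:, c] = col
        self.filling ^= 1                             # swap roles with the peer
        return [a[r, :].copy() for r in range(T_oc)]  # drain row by row
```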
C. Characteristics

We implement our NCP in a TSMC 65nm low-power technology. With Ttm = 32, Toc = 16 and Thw = 32, the NCP contains 512 8-bit MACs in NOU-conv, 144 8-bit multipliers and 16 adder trees in NOU-dw, and 16 float32 MACs in NOU-post. When running at the maximum frequency of 250MHz, NOU-conv and NOU-post are active every cycle, achieving a peak performance of 264 GOP/s.

VI. EXPERIMENTAL RESULTS

A. EtinyNet Evaluation

Table II lists the ImageNet-1000 classification results of well-known lightweight CNNs, including MobileNetV2 [9], MobileNeXt [10], ShuffleNetV2 [15], and the MCUNet series [16]. We pay more attention to the backbone, because the fully-connected layer is generally not involved in most visual models. Among these competitive models, MCUNet gets the highest accuracy, at the cost of a model size of up to 2048K. Compared with tiny models of similar size, our EtinyNet reaches 66.5% top-1 and 86.8% top-5 accuracy, outperforming the most competitive MCUNetV2-M4 by a significant 1.6% top-1 accuracy. Moreover, EtinyNet-0.75, in which the width of each layer is shrunk to 0.75×, outperforms MCUNet-320kB by a significant 2.6% top-1 accuracy with 60K fewer parameters. Obviously, EtinyNet yields much higher accuracy at the same level of storage consumption, and is more suitable for TinyML systems.

TABLE II
COMPARISON OF STATE-OF-THE-ART TINY MODELS OVER ACCURACY ON IMAGENET. "B" DENOTES BACKBONE. "-" DENOTES NOT REPORTED.

Model              #Params. (K)    Top-1 Acc.   Top-5 Acc.
MobileNeXt-0.35    812(B) / 1836   64.7         85.7
MobileNetV2-0.35   740(B) / 1764   60.3         82.9
ShuffleNetV2-0.5   566(B) / 1590   61.1         82.6
MCUNet             -(B) / 2048     70.7         -
MCUNet-320kB       -(B) / 740      61.8         84.2
MCUNetV2-M4        -(B) / 1034     64.9         86.2
EtinyNet           477(B) / 989    66.5         86.8
EtinyNet-0.75      296(B) / 680    64.4         85.2
EtinyNet-0.5       126(B) / 446    59.3         81.2

B. NCP Evaluation

As shown in Table III, running general CNN models usually needs DRAM to store their enormous weights and features [11], [18], resulting in considerable power consumption and processing latency. Among the methods with no DRAM access, YodaNN [17] yields the highest peak performance and energy efficiency, but it is a dedicated accelerator only for binarized networks with very limited accuracy. Apart from that, Vega [12] has the lowest power but the maximum latency, which leads to the lowest peak performance. To comprehensively assess the throughput, energy consumption and speed of various neural processors in TinyML applications, we prefer the metric of processing efficiency, i.e., the number of frames processed per unit time and per unit power consumption. Our proposed NCP reaches an extremely high processing efficiency of up to 449.1 Frames/s/mJ, at least 29× higher than the other solutions, suggesting the unique superiority of NCP in this particular field. The reason is that the specially designed NOU, tensor layout and coarse-grained instruction set jointly decrease the delay and the power of inference.


TABLE III
COMPARISON WITH STATE-OF-THE-ART NEURAL PROCESSORS. "-" DENOTES NOT REPORTED.

Component                       NullHop   ConvAix   YodaNN   Vega     NCP
Technology                      28nm      28nm      65nm     22nm     65nm
Area (mm2)                      6.3       3.53      1.9      12       10.88
DRAM Used                       yes       yes       none     none     none
FC Support                      none      yes       none     yes      none
CNN model                       VGG16     MbV1      VGG19    RVGGA0   EtinyNet
ImageNet Acc.                   68.3%     70.6%     -        72.4%    66.5%
Latency                         72.9ms    14.2ms    75.2ms   118ms    5.5ms
Typ. Power (mW)                 155.0     313.1     153      37.3     73.6
Peak Perf. (GOP/s)              128       262.6     1500     32.2     264
Energy Eff. (GOP/s/W)           2714.8    256.3     8500     631.4    751.0
Processing Eff. (Frames/s/mJ)   1.21      15.18     2.95     1.93     449.1

C. TinyML System Verification

We compare our proposed system with existing prominent MCU-based TinyML systems. As shown in Table IV, CMSIS-NN obtains 59.5% ImageNet accuracy at 2FPS, which MCUNet raises to 5FPS at the expense of accuracy dropping to 49.9%. In comparison, our solution reaches up to 66.5% accuracy at 30FPS, achieving the goal of real-time visual processing in TinyML. Furthermore, since existing methods burden MCUs with entire CNN models, high-performance MCUs (STM32H743/STM32F746) running at their upper-limit frequencies (480MHz/216MHz) are necessary. Although flexible, a general-purpose MCU has low energy efficiency when computing massive tensors, which results in considerable power consumption of up to about 600mW. In contrast, the proposed solution performs the same flexible task with only a low-end MCU (STM32L4R9, 120MHz) and the proposed NCP, which boosts the energy efficiency of the entire system and achieves an ultra-low power of 160mW.

TABLE IV
COMPARISON WITH MCU-BASED DESIGNS ON IMAGE CLASSIFICATION (CLS) AND OBJECT DETECTION (DET). * DENOTES REPRODUCED RESULTS.

Task   Method     Hardware   Acc/mAP   FPS   Power
Cls    CMSIS-NN   H743       59.5%     2     *675 mW
Cls    MCUNet     F746       49.9%     5     *525 mW
Cls    Ours       L4R9+NCP   66.5%     30    160 mW
Det    CMSIS-NN   H743       31.6%     10    *640 mW
Det    MCUNet     H743       51.4%     3     *650 mW
Det    Ours       L4R9+NCP   56.4%     30    160 mW

In addition, we benchmark the object detection performance on the Pascal VOC dataset. The results indicate that our system also greatly improves detection performance, which makes AIoT more promising for extensive applications.

VII. CONCLUSION

In this brief, we propose an ultra-low power TinyML system for real-time visual processing by designing 1) an extremely tiny CNN backbone, EtinyNet, 2) an ASIC-based neural co-processor and 3) an application-specific instruction set. Our study greatly advances the TinyML community and promises to drastically expand the application scope of AIoT.

REFERENCES

[1] J. Lin, W.-M. Chen, Y. Lin, C. Gan, S. Han et al., "MCUNet: Tiny deep learning on IoT devices," Advances in Neural Information Processing Systems, vol. 33, pp. 11711–11722, 2020.
[2] M. Shafique, T. Theocharides, V. J. Reddy, and B. Murmann, "TinyML: Current progress, research challenges, and future roadmap," in 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021, pp. 1303–1306.
[3] S. Mittal, "A survey on optimized implementation of deep learning models on the NVIDIA Jetson platform," Journal of Systems Architecture, vol. 97, pp. 428–442, 2019.
[4] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," arXiv preprint arXiv:1801.06601, 2018.
[5] C. Banbury, C. Zhou, I. Fedorov, R. Matas, U. Thakker, D. Gope, V. Janapa Reddi, M. Mattina, and P. Whatmough, "MicroNets: Neural network architectures for deploying TinyML applications on commodity microcontrollers," Proceedings of Machine Learning and Systems, vol. 3, pp. 517–532, 2021.
[6] R. T. N. Chappa and M. El-Sharkawy, "Deployment of SE-SqueezeNext on NXP BlueBox 2.0 and NXP i.MX RT1060 MCU," in 2020 IEEE Midwest Industry Conference (MIC), vol. 1. IEEE, 2020, pp. 1–4.
[7] K. Xu, Y. Li, H. Zhang, R. Lai, and L. Gu, "EtinyNet: Extremely tiny network for TinyML," in AAAI Conference on Artificial Intelligence. AAAI, 2022, pp. 4628–4636.
[8] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.
[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[10] D. Zhou, Q. Hou, Y. Chen, J. Feng, and S. Yan, "Rethinking bottleneck structure for efficient mobile network design," in European Conference on Computer Vision. Springer, 2020, pp. 680–697.
[11] A. Bytyn, R. Leupers, and G. Ascheid, "ConvAix: An application-specific instruction-set processor for the efficient acceleration of CNNs," IEEE Open Journal of Circuits and Systems, vol. 2, pp. 3–15, 2020.
[12] D. Rossi, F. Conti, M. Eggiman, A. Di Mauro, G. Tagliavini, S. Mach, M. Guermandi, A. Pullini, I. Loi, J. Chen et al., "Vega: A ten-core SoC for IoT endnodes with DNN acceleration and cognitive wake-up from MRAM-based state-retentive sleep mode," IEEE Journal of Solid-State Circuits, vol. 57, no. 1, pp. 127–139, 2021.
[13] M. E. Nojehdeh, S. Parvin, and M. Altun, "Efficient hardware implementation of convolution layers using multiply-accumulate blocks," in 2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2021, pp. 402–405.
[14] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," ACM SIGARCH Computer Architecture News, vol. 42, no. 1, pp. 269–284, 2014.
[15] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
[16] J. Lin, W. Chen, H. Cai, C. Gan, and S. Han, "MCUNetV2: Memory-efficient patch-based inference for tiny deep learning," arXiv preprint arXiv:2110.15352, 2021.
[17] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An architecture for ultralow power binary-weight CNN acceleration," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48–60, 2017.
[18] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu et al., "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 3, pp. 644–656, 2018.
