An Ultra-Low Power TinyML System For Real-Time Visual Processing at Edge

Article in IEEE Transactions on Circuits and Systems II: Express Briefs · July 2023
DOI: 10.1109/TCSII.2023.3239044
Fig. 2. The TinyML system for verification.

Fig. 3. The proposed building blocks that make up the EtinyNet. (a) is the Linear Depthwise Block (LB), (b) is the Dense Linear Depthwise Block (DLB), and (c) is the configuration of the backbone.
type and control type, respectively. It includes the basic operations for tiny CNN models, and each instruction consists of 128 bits: 5 bits for the operation code and the rest for the attributes of operations and operands. With each neural-type instruction encoding an entire layer, the proposed instruction set has a relatively coarse granularity, which simplifies the control complexity of the hardware. Moreover, the basic operations included in the instruction set provide sufficient ability to execute commonly used CNN architectures (e.g., MobileNetV2 [9], MobileNeXt [10], etc.).

TABLE I
INSTRUCTION SET FOR PROPOSED NCP

Instruction format   Description                          Type
bn                   batch normalization                  N
relu                 non-linear activation operation      N
conv                 1x1 and 3x3 convolution & bn, relu   N
dwconv               3x3 depthwise conv & bn, relu        N
add                  elementwise addition                 N
move                 move tensor to target address        N
dsam                 down-sampling by a factor of 2       N
usam                 up-sampling by a factor of 2         N
maxp                 max pooling by a factor of 2         N
gap                  global average pooling               N
jump                 set program counter (PC) to target   C
sup                  suspend processor                    C
end                  suspend processor and reset PC       C
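To make the coarse-grained format concrete, the sketch below packs one neural-type instruction into a 128-bit word. Only the 5-bit opcode width comes from the brief; the opcode values and the attribute fields (operand addresses, layer shape) are hypothetical placeholders.

```c
#include <stdint.h>

/* 5-bit operation codes; the numeric assignments are illustrative,
   since the brief fixes only the field width, not the values. */
enum ncp_opcode {
    OP_BN, OP_RELU, OP_CONV, OP_DWCONV, OP_ADD, OP_MOVE,
    OP_DSAM, OP_USAM, OP_MAXP, OP_GAP,   /* neural type (N)  */
    OP_JUMP, OP_SUP, OP_END              /* control type (C) */
};

/* One 128-bit instruction: 5 bits of opcode, 123 bits for the
   attributes of operations and operands. The field widths below
   are an assumed layout for illustration only. */
typedef struct { uint64_t lo, hi; } ncp_insn_t;

static ncp_insn_t encode_conv(uint32_t src, uint32_t dst,
                              uint16_t channels, uint16_t hw) {
    ncp_insn_t w = {0, 0};
    w.lo  = (uint64_t)(OP_CONV & 0x1F);            /* bits [4:0]: opcode */
    w.lo |= ((uint64_t)src & 0xFFFFFF) << 5;       /* 24-bit src address */
    w.lo |= ((uint64_t)dst & 0xFFFFFF) << 29;      /* 24-bit dst address */
    w.hi  = (uint64_t)channels | ((uint64_t)hw << 16); /* layer shape    */
    return w;  /* one neural-type instruction encodes an entire layer    */
}
```

Under such an encoding, a Linear Depthwise Block would lower to only a dwconv word followed by a conv word, with an end instruction terminating the program.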
V. DESIGN OF NEURAL CO-PROCESSOR

As shown in Fig. 4, the proposed NCP consists of five main components: the Neural Operation Unit (NOU), Tensor Memory (TM), Instruction Memory (IM), I/O, and the System Controller (SC). When NCP works, SC decodes one instruction fetched from IM and informs the NOU to start computing with the decoded signals. The computing process takes multiple cycles, during which the NOU reads operands from TM and writes results back automatically. Once the write-back process completes, SC continues to process the next instruction until an end or suspend instruction is encountered. When the NOU is idle, TM is accessed through I/O. We fully describe each component in the following parts.

[...] implementation of each operation. Furthermore, we deal with the design details in the following three aspects.

1) Different from other designs [11], [12] with fine-grained instructions, we implement NOU-conv with a hardwired matrix multiply-accumulate (MAC) [13] array, which helps to improve efficiency with simpler control logic. The MAC array is designed to perform a matrix outer product with parallelism in the spatial and output-channel dimensions, handling the most computationally costly 3×3 Conv and PWConv by means of the im2col operation. In this way, the number of effective multiplications in each cycle is fixed at Toc × Thw. Note that the number of channels varies across convolution layers, which may lead to inefficient computation in other implementation styles (e.g., dot product). Conversely, our implementation avoids this problem and improves the overall efficiency of running the PWConv layers of the entire network. Moreover, the additions are realized by a simple accumulation process instead of the commonly used adder tree, which would incur extra hardware overhead.
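As a functional sketch of what one NOU-conv cycle computes: after im2col, a 3×3 Conv or PWConv becomes a matrix multiplication, and each cycle forms the outer product of a Toc-element weight vector with a Thw-element pixel vector, accumulating in place. The loop model below is illustrative rather than the hardwired RTL; Toc = 16 and Thw = 32 follow the configuration reported later in the Characteristics subsection.

```c
#define TOC 16   /* output-channel parallelism */
#define THW 32   /* spatial parallelism        */

/* One MAC-array cycle: acc[oc][hw] += w[oc] * x[hw].
   Toc*Thw = 512 effective multiplications per cycle, independent of
   the layer's channel count, so PWConv stays fully utilized. */
static void mac_array_cycle(int32_t acc[TOC][THW],
                            const int8_t w[TOC], const int8_t x[THW]) {
    for (int oc = 0; oc < TOC; oc++)
        for (int hw = 0; hw < THW; hw++)
            acc[oc][hw] += (int32_t)w[oc] * (int32_t)x[hw];
}

/* A K-deep inner dimension (e.g., K = 9*Cin after im2col for a 3x3
   conv) is consumed by iterating cycles and accumulating in place,
   rather than by reducing through an adder tree. */
static void mac_tile(int32_t acc[TOC][THW], int K,
                     const int8_t w[][TOC], const int8_t x[][THW]) {
    for (int k = 0; k < K; k++)
        mac_array_cycle(acc, w[k], x[k]);
}
```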
2) As for the implementation of DWConv, the MAC array designed above is efficient only in its diagonal units. Given this, we turn to the classical convolution processing pipeline [14], in which nine multipliers and eight adders are arranged to compute the DWConv of each channel. The independence between channels allows us to extend the pipelines easily, implementing a parallelism of Toc to build NOU-dw. Since the feature length in the spatial dimension is usually much larger than the pipeline depth, the DWConv can be performed in a fully pipelined manner, which gives NOU-dw an ultra-high efficiency of nearly 100%.
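A behavioral model of a single NOU-dw pipeline stage, assuming the nine-multiplier/eight-adder arrangement named above; Toc such pipelines would process independent channels in parallel.

```c
#include <stdint.h>

/* One output pixel of a 3x3 depthwise convolution for one channel:
   nine products (nine multipliers) folded by a chain of eight
   additions (eight adders), mirroring one pipeline stage. */
static int32_t dw3x3_pixel(const int8_t win[9], const int8_t k[9]) {
    int32_t s = 0;
    for (int i = 0; i < 9; i++)
        s += (int32_t)win[i] * (int32_t)k[i];
    return s;
}
```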
3) In the NOU-post unit, int2float, float32 multiply-add, float2int, and ReLU modules are designed and interconnected to perform the post-operations of float32 BN, ReLU, and elementwise addition. To reduce memory accesses as much as possible, multiplexers are further utilized to select data from the output of NOU-conv, NOU-dw, or TM and to connect the modules as needed, allowing flexible fusion of post-operations with the preceding convolution layer. By implementing Toc pipelines to match the throughput of convolution, we effectively maximize the efficiency of the fused operations.
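The per-element dataflow through NOU-post can be read as the chain below. The module sequence (int2float, float32 multiply-add, ReLU, float2int) and the fusion multiplexers come from the text; the folded BN scale/shift parameterization and the int8 saturation bounds are assumptions.

```c
#include <stdint.h>

/* Fused post-operations for one element: convert the int32
   convolution result to float32, apply BN as one multiply-add, fuse
   an optional elementwise addition and ReLU, then quantize back to
   int8. gamma/beta are the folded BN scale/shift (assumed form). */
static int8_t nou_post(int32_t conv_out, float gamma, float beta,
                       float residual, int fuse_add, int fuse_relu) {
    float v = (float)conv_out;            /* int2float            */
    v = v * gamma + beta;                 /* float32 multiply-add */
    if (fuse_add)  v += residual;         /* elementwise addition */
    if (fuse_relu && v < 0.0f) v = 0.0f;  /* ReLU                 */
    if (v >  127.0f) v =  127.0f;         /* float2int with       */
    if (v < -128.0f) v = -128.0f;         /* assumed saturation   */
    return (int8_t)v;
}
```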
[...] until the last channel's pixels of a tensor are stored. For the latter, the whole tensor is divided into Nc//Ttm tiles that are placed in TM sequentially, while each tile is arranged in channel-major order. Different layouts are required by the NOUs to achieve maximum efficiency. For example, the input of NOU-conv prefers the pixel-major layout, because Thw spatially continuous pixels of one channel need to be multiplied and accumulated at a time by the MAC array, while the reverse is the case for NOU-dw.

Fig. 5. Illustration of different tensor layouts. (a) Pixel-major layout. (b) Interleaved layout.
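The address arithmetic implied by the two layouts can be sketched as follows; the indexing helpers are illustrative, with the tile width Ttm = 32 taken from the Characteristics subsection, and the interleaved formula is an inferred reading of "channel-major order within a tile".

```c
#define TTM 32  /* channels interleaved per tile (Ttm) */

/* Pixel-major layout: all hw_len pixels of channel 0, then channel 1,
   and so on. NOU-conv reads Thw consecutive pixels of one channel. */
static int addr_pixel_major(int c, int p, int hw_len) {
    return c * hw_len + p;
}

/* Interleaved layout: the tensor is split into Nc/Ttm tiles; inside a
   tile, the Ttm channel values of one pixel lie back to back, which
   suits NOU-dw's independent per-channel pipelines. */
static int addr_interleaved(int c, int p, int hw_len) {
    int tile = c / TTM, lane = c % TTM;
    return tile * hw_len * TTM + p * TTM + lane;
}
```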
3) When running the proposed LB, DLB, and other blocks on the NOU, the layout between adjacent DWConv and PWConv layers varies constantly, which seriously degrades computing efficiency because of the discontinuous memory accesses: it takes NOU-conv Toc reads of the NOU-dw output stored in the interleaved layout to perform a single matrix outer product. Hence, an efficient layout-conversion circuit is designed to tackle this problem. As shown in Fig. 6, the circuit is composed of two Toc × Thw register arrays, A and B, working in a ping-pong mechanism. At the beginning, array A receives Toc inputs at a time; after Thw cycles, A is filled and starts to output Thw results at a time in the transposed dimension. Since reading A empty requires Toc cycles, newly arriving data to be converted are sent to array B in order to keep the pipeline busy. When B is full and A completes its readout, their roles are exchanged. This strategy markedly boosts the efficiency of valid memory accesses for computing.

Fig. 6. The proposed efficient layout conversion circuit.
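A cycle-level software model of the ping-pong conversion: writes fill one Toc × Thw register array column by column, while reads drain the other row by row, i.e., in the transposed dimension. The array names A/B correspond to buf[0]/buf[1]; everything beyond the behavior described in the text is a modeling choice.

```c
#include <stdint.h>

#define TOC 16
#define THW 32

/* Ping-pong layout converter: while one register array is filled with
   Toc values per cycle (interleaved order from NOU-dw), the other is
   drained Thw values per cycle in the transposed, pixel-major order
   required by NOU-conv. */
typedef struct {
    int8_t buf[2][TOC][THW];
    int fill;                 /* index of the array being written */
} pingpong_t;

/* Write cycle t (0..THW-1): one Toc-wide column into the fill array. */
static void pp_write(pingpong_t *pp, int t, const int8_t in[TOC]) {
    for (int oc = 0; oc < TOC; oc++)
        pp->buf[pp->fill][oc][t] = in[oc];
}

/* Read cycle t (0..TOC-1): one Thw-wide row from the other array. */
static void pp_read(const pingpong_t *pp, int t, int8_t out[THW]) {
    for (int hw = 0; hw < THW; hw++)
        out[hw] = pp->buf[1 - pp->fill][t][hw];
}

/* After THW write cycles and TOC read cycles, exchange the roles. */
static void pp_swap(pingpong_t *pp) { pp->fill = 1 - pp->fill; }
```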
C. Characteristics

We implement our NCP using a TSMC 65 nm low-power technology. With Ttm = 32, Toc = 16, and Thw = 32, NCP contains 512 8-bit MACs in NOU-conv, 144 8-bit multipliers and 16 adder trees in NOU-dw, and 16 float32 MACs in NOU-post. When running at the maximum frequency of 250 MHz, NOU-conv and NOU-post are active every cycle, achieving a peak performance of 264 GOP/s (512 MACs × 2 ops × 250 MHz = 256 GOP/s from NOU-conv, plus 16 MACs × 2 ops × 250 MHz = 8 GOP/s from NOU-post).

VI. EXPERIMENTAL RESULTS

A. EtinyNet Evaluation

Table II lists the ImageNet-1000 classification results of well-known lightweight CNNs, including MobileNetV2 [9], MobileNeXt [10], ShuffleNetV2 [15], and the MCUNet series [16]. We pay more attention to the backbone because the fully-connected layer is generally not involved in most visual models. Among these competitive models, MCUNet achieves the highest accuracy, at the cost of a model size of up to 2048K. Compared with tiny models of similar size, our EtinyNet reaches 66.5% top-1 and 86.8% top-5 accuracy, outperforming the most competitive MCUNetV2-M4 by a significant 1.6% in top-1 accuracy. Moreover, EtinyNet-0.75, in which the width of each layer is scaled by 0.75, outperforms MCUNet-320kB by a significant 2.6% in top-1 accuracy with 60K fewer parameters. Obviously, EtinyNet yields much higher accuracy at the same level of storage consumption and is thus more suitable for TinyML systems.

TABLE II
COMPARISON OF STATE-OF-THE-ART TINY MODELS OVER ACCURACY ON IMAGENET. "B" DENOTES BACKBONE. "-" DENOTES NOT REPORTED.

Model              #Params. (K)     Top-1 Acc.   Top-5 Acc.
MobileNeXt-0.35    812(B) / 1836    64.7         85.7
MobileNetV2-0.35   740(B) / 1764    60.3         82.9
ShuffleNetV2-0.5   566(B) / 1590    61.1         82.6
MCUNet             -(B) / 2048      70.7         -
MCUNet-320kB       -(B) / 740       61.8         84.2
MCUNetV2-M4        -(B) / 1034      64.9         86.2
EtinyNet           477(B) / 989     66.5         86.8
EtinyNet-0.75      296(B) / 680     64.4         85.2
EtinyNet-0.5       126(B) / 446     59.3         81.2

B. NCP Evaluation

As shown in Table III, running general CNN models usually requires DRAM to store their enormous weights and features [11], [18], resulting in considerable power consumption and processing latency. Among the methods without DRAM access, YodaNN [17] yields the highest peak performance and energy efficiency, but it is a dedicated accelerator for binarized networks only, with very limited accuracy. Apart from that, Vega [12] has the lowest power but the maximum latency, which leads to the lowest peak performance. To comprehensively assess the throughput, energy consumption, and speed of various neural processors in TinyML applications, we prefer the metric of processing efficiency, i.e., the number of frames processed per unit time and per unit power consumption (frames/s/W). Our proposed NCP reaches an extremely high processing [...]