Accelerating TinyML Inference on Microcontrollers through Approximate Kernels
Abstract—The rapid growth of microcontroller-based IoT devices has opened up numerous applications, from smart manufacturing to personalized healthcare. Despite the widespread adoption of energy-efficient microcontroller units (MCUs) in the Tiny Machine Learning (TinyML) domain, they still face significant limitations in terms of performance and memory (RAM, Flash). In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on MCUs. Our kernel-based approximation framework first unpacks the operands of each convolution layer and then conducts an offline calculation to determine the significance of each operand. Subsequently, through a design space exploration, it employs a computation skipping approximation strategy based on the calculated significance. Our evaluation on an STM32-Nucleo board and two popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our Pareto-optimal solutions can feature on average 21% latency reduction with no degradation in Top-1 classification accuracy, while for lower accuracy requirements, the corresponding reduction becomes even more pronounced.

Index Terms—Approximate Computing, MCUs, TinyML

Work partially supported by the Horizon Europe research and innovation program via the "CONVOLVE" project under grant agreement No. 101070374.

I. INTRODUCTION

In recent years, the proliferation of low-cost IoT microcontroller units (MCUs) has significantly expanded the Tiny Machine Learning (TinyML) domain [1], enabling real-time data processing on tiny devices. Despite the energy efficiency of MCUs, their limited resources and high latency challenge the deployment of deep learning models on small-scale hardware. Consequently, new optimizations and customized architectures are needed to bridge the resource gap, making reconfigurable MCUs an attractive option for ML acceleration.

In this effort, ARM's CMSIS-NN [2] software library offers efficient neural network operations for MCUs running on Arm Cortex-M CPUs, achieving nearly an 11× latency improvement compared to TensorFlow Lite Micro on several ImageNet models deployed on an STM32H743 board. TinyEngine [3], a system-model co-design framework, combines neural architecture search with a memory-optimized inference library, resulting in average latency and SRAM usage reductions of 2.1× and 2.4×, respectively, compared to CMSIS-NN. However, relevant frameworks focus on fitting models within memory constraints rather than reducing the inference latency of large models. For instance, TinyEngine requires about 1.3s to execute an mcunet-in4 ImageNet model on a 160MHz MCU, highlighting existing latency challenges in real-time applications.

In this work we investigate the feasibility of the efficient utilization of MCUs to enhance DNN performance. By integrating Approximate Computing (AC) [4] principles with optimized software kernels, we develop an automated framework that generates specialized approximate code for specific Convolutional Neural Networks (CNNs). Our approach utilizes flash memory to unpack kernel code within convolution layers, eliminating instruction overheads. Subsequently, by leveraging the unpacked operations and the fact that each computation contributes uniquely to the final output, we employ an offline significance-aware computation skipping approach, where certain operations are either skipped or retained. Through design space exploration (DSE), our framework identifies Pareto-optimal solutions, each offering unique accuracy-latency trade-offs tailored to user requirements. Compared to the state-of-the-art CMSIS-NN, our approach achieves a 21% latency reduction with no degradation in Top-1 classification accuracy on CIFAR-10 trained CNN models, while for lower accuracy requirements (< 5%), our method outperforms even commercial frameworks.

Our novel contributions within this work are as follows:

1) This is the first work that evaluates the impact of approximate computing on the optimized inference library of CMSIS-NN, targeting MCUs.
2) We propose an automated cooperative approximation framework for accelerating CNN inference on MCUs.¹
3) Using our framework, we demonstrate that, in many cases, approximate computing is able to realize larger and faster networks than conventional ones on tiny devices.

¹ Available at https://github.com/GeorgeMentzos/ATAMAN-AuTo-driven-Approximation-and-Microcontroller-AcceleratioN-Toolkit

II. COOPERATIVE APPROXIMATION FRAMEWORK FOR INFERENCE OPTIMIZATION

This section describes our cooperative approximation framework for deploying approximate DNNs on microcontrollers. In brief, we first describe our basic kernel customizations and how we eliminate associated overheads from existing inference libraries. Then, we analyze our layer-based code unpacking, showing the latency benefits over typical implementations for the targeted kernels, and we finally describe our significance-aware computation skipping exploration that offers the flexibility to trade classification accuracy for further inference acceleration. An abstract overview of our framework is depicted in Fig. 1.

A. Customized kernels for NN deployment

To meet our deployment scenarios' unique requirements, we use CMSIS-NN as our baseline inference library. Our approx-
any necessary padding or adjustments to ensure the correct alignment of the data for convolution. Instead, our framework

multiplication (mat_mult kernel) is calculated based on a weighted sum and an initialized bias:

    Sum_c = b + \sum_{\forall i} a_i \cdot w_i,    (1)

where b is the initialized bias, w_i are the trained coefficients (weights) and a_i are the inputs from the respective channel.
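To make the effect of code unpacking concrete, the following minimal C sketch (illustrative only, not the framework's actual generated code) contrasts a generic accumulation loop, as an inference library would execute it, with an unpacked variant for one hypothetical channel whose four trained weights are baked directly into the code; all identifiers and weight values are invented for this example.

```c
#include <stdint.h>

/* Generic (packed) per-channel accumulation: the kernel loops over the
 * operands at run time and pays for weight loads and loop bookkeeping. */
static int32_t channel_sum_generic(const int8_t *a, const int8_t *w,
                                   int n, int32_t bias)
{
    int32_t sum = bias;
    for (int i = 0; i < n; i++)
        sum += (int32_t)a[i] * (int32_t)w[i];   /* Sum_c = b + sum of a_i * w_i */
    return sum;
}

/* "Unpacked" variant for one hypothetical channel with four trained weights
 * emitted as immediates at code-generation time: no weight array accesses,
 * no loop overhead, and each product appears explicitly in the code. */
static int32_t channel_sum_unpacked(const int8_t *a, int32_t bias)
{
    return bias
         + (int32_t)a[0] * 42
         + (int32_t)a[1] * (-7)
         + (int32_t)a[2] * 13
         + (int32_t)a[3] * (-1);
}
```

Because every product is now a distinct expression in the generated source, individual computations can later be removed selectively, which is exactly what the significance-aware skipping step described next exploits.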
Intuitively, when inputs are multiplied by large numbers, they tend to produce significantly more impactful products (a_i · w_i) in the final result compared to inputs multiplied by small values. However, it is worth noting that the significance of the product a_i · w_i also depends on the value of a_i. Thus, we define the significance (③) of each product as follows:

    S_i = \left| \frac{E[a_i] \cdot w_i}{\sum_{\forall i} E[a_i] \cdot w_i} \right|,    (2)

where E[a_i] is the average expected value of the a_i input. In other words, (2) calculates the long-term expected outcome of each product a_i · w_i over the total sum Sum_c of the respective channel. If the sum equals zero, which occurs in only a small minority of cases, we consider the corresponding significance S_i to be large, and thus, the product is retained. For each channel and Sum_c, the calculation of S_i, ∀i, is straightforward and involves capturing the input values' distribution (②) from a small portion of the dataset.
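As an illustration of this offline step, a minimal C sketch of how S_i from (2) could be derived from the profiled mean activations is given below; the function name and interface are hypothetical and do not correspond to the released toolkit.

```c
#include <math.h>
#include <stddef.h>

/* Offline sketch: given the mean activation E[a_i] of every input position
 * (estimated from a small calibration subset of the dataset) and the trained
 * weights w_i of one output channel, compute the significance S_i of each
 * product following Eq. (2). */
static void compute_significance(const float *mean_a, const float *w,
                                 float *S, size_t n)
{
    float denom = 0.0f;
    for (size_t i = 0; i < n; i++)
        denom += mean_a[i] * w[i];        /* expected channel sum (without bias) */

    for (size_t i = 0; i < n; i++) {
        if (denom == 0.0f)
            S[i] = INFINITY;              /* degenerate sum: treat as highly
                                             significant so the product is kept */
        else
            S[i] = fabsf((mean_a[i] * w[i]) / denom);
    }
}
```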
By exploiting this high-level information, we minimize the total computations required for each summation (Sum_c) at compile time, and thus, we approximate the summation accordingly. Specifically, each product a_i · w_i whose significance S_i is less than or equal to a given threshold τ is omitted from the generated code, while the remaining products are retained. Thus, our approximate summation per channel is now represented by:

    Sum'_c = b + \sum_{\forall i} (a_i \cdot w_i) - \sum_{\forall i: S_i \le \tau} (a_i \cdot w_i)    (3)
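Continuing the hypothetical four-weight channel from the earlier sketch, the code emitted after this step could simply leave out the products whose significance falls at or below τ, so a skipped term costs neither a memory access nor a MAC at run time; the weights, indices and skipping decisions are again purely illustrative.

```c
#include <stdint.h>

/* Hypothetical generated code for the same four-weight channel after
 * significance-aware skipping: assume S_1 and S_3 were found to be <= tau,
 * so their products are absent from the emitted expression (Eq. (3)). */
static int32_t channel_sum_approx(const int8_t *a, int32_t bias)
{
    return bias
         + (int32_t)a[0] * 42      /* S_0 > tau: retained        */
         /* a[1] * (-7) omitted:      S_1 <= tau, skipped         */
         + (int32_t)a[2] * 13      /* S_2 > tau: retained        */
         /* a[3] * (-1) omitted:      S_3 <= tau, skipped         */
         ;
}
```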
Finally, we perform an exhaustive DSE w.r.t. the targeted layers and the values of τ, ranging in [0, 0.1] with a step of 0.001 and 0.01 for LeNet and AlexNet, respectively. This exploration is performed offline and only once. Every approximate configuration, denoting which layers and computations are approximated, undergoes simulation to calculate the classification accuracy. Subsequently, a Pareto analysis is conducted to determine trade-offs between accuracy and total perforated MAC operations, leading to a model with increased speedup. Note that in this work we exclusively concentrate on the convolution layers, and therefore, the model's behavior, when considering the rest of the functions, becomes rather predictable. Consequently, the clock cycles reported by our counters during our simulations [5] closely align with the cycles of the actual model deployment and export representative gain percentages with respect to the "unpacked" model.

On average, the DSE required less than 2 hours using 6 threads. The aforementioned execution times refer to an Intel i7-8750H with 32GB RAM. Following the DSE analysis, we extract the suitable approximate configuration (⑤) based on the user's specified accuracy loss threshold and desired possible speedup. Subsequently, our framework generates the approximate code (④), which is then compiled and deployed to the MCU.
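For illustration, the following compact C sketch shows the kind of bookkeeping such a DSE could perform: every explored configuration (a threshold τ applied to a subset of convolution layers) yields an (accuracy, MAC-reduction) point, and only non-dominated points are kept as the Pareto front. The data structure and function are hypothetical sketches under these assumptions, not the toolkit's actual implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* One explored DSE configuration together with its simulated outcome. */
typedef struct {
    float tau;            /* significance threshold used for this design */
    float mac_reduction;  /* fraction of convolution MACs skipped        */
    float accuracy;       /* Top-1 accuracy reported by the simulation   */
} dse_point_t;

/* Copy the Pareto-optimal points of `in` into `out` and return their count.
 * A point is dominated if another point is at least as good in both accuracy
 * and MAC reduction and strictly better in at least one of them. */
static size_t pareto_front(const dse_point_t *in, size_t n, dse_point_t *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        bool dominated = false;
        for (size_t j = 0; j < n && !dominated; j++) {
            if (j == i)
                continue;
            if (in[j].accuracy      >= in[i].accuracy &&
                in[j].mac_reduction >= in[i].mac_reduction &&
                (in[j].accuracy      > in[i].accuracy ||
                 in[j].mac_reduction > in[i].mac_reduction))
                dominated = true;
        }
        if (!dominated)
            out[kept++] = in[i];
    }
    return kept;
}
```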
Fig. 2. Pareto space between accuracy and normalized MAC unit reduction for our computation skipping approach within all convolution layers, for AlexNet (a) and LeNet (b).

III. EXPERIMENTAL RESULTS AND ANALYSIS

In this section we evaluate the efficiency of our proposed framework in reducing inference latency at the cost of some classification accuracy, and we investigate the impact of approximate computing within the context of TinyML on MCUs. We evaluate the inference latency, classification accuracy, memory usage and energy of our approximate designs against the state-of-the-art exact models [2], and we also compare our framework against the closed-source X-CUBE-AI [8] framework. All the experiments are evaluated on an STM32U575ZIT6Q SoC, an ARM Cortex-M33 based MCU running at 160 MHz, with 2MB of Flash and 768KB of RAM.

Before deploying the final approximate design and measuring its latency, an initial analysis is required (⑤). This offline analysis assists in extracting approximate designs based on the accuracy loss threshold specified by the user, and it also avoids reconfiguring the MCU multiple times (i.e., once per explored DSE design), which could potentially result in flash memory deterioration. Hence, Fig. 2 presents the Pareto space between accuracy and normalized MAC unit reduction achieved by our skipping approximation for the two examined CNNs. In Fig. 2, the MAC reduction concerns only the convolution layers. The black 'x' is our exact baseline design [2], the blue dots correspond to approximate configurations, and the green triangles form the Pareto front. These configurations particularly represent the percentage of operations that are skipped and the indexes of these operations in the final generated code. Note that the number of explored configurations/designs is model dependent. As aforementioned, the DSE was performed based on various significance thresholds τ, steps and examined layers for both CNNs. In total, we evaluated more than 10,000 approximate designs for LeNet and AlexNet, separately. On average, it is observed that our "only skipping" approximation achieves 44% MAC reduction while delivering identical classification accuracy with the exact baseline, and this number rises further for both models to 57% on average when compromising 5% accuracy loss.
TABLE II. Comparison with state-of-the-art CMSIS [2] and X-CUBE-AI [8] for two CNNs deployed on an STM32U575ZI-Q board fitting 2MB Flash and 768KB RAM. Three accuracy loss thresholds have been considered.

In Table II we report some important metrics of our framework. To generate this table we considered three conservative accuracy loss thresholds (i.e., 0%, 5% and 10%) and we report the latency, Top-1 accuracy, Flash and energy metrics for the latency-optimized approximate designs after deployment on the examined MCU. As aforementioned, due to the nature of the target AI applications, such as real-time processing, fast inference is one of the foremost requirements when targeting DNNs on MCUs, and so prioritizing it over strict accuracy constraints is a typical procedure [3]. As depicted in Table II, our cooperative approximation approach, which includes both code unpacking and significance-aware skipping approximation, achieves an average speedup of 21% while incurring no degradation (zero accuracy loss) compared to the exact baseline [2]. Moreover, the respective speedup increases to 36% when accepting approximately 10% accuracy loss. In Table II, we also provide a comparison of our models with the state-of-the-art homogeneous inference library, X-CUBE-AI. Although X-CUBE-AI attains a 12% lower latency for the precise LeNet (0%) compared to our framework, it is worth noting that for the more complex CNN, AlexNet, our approach outperforms X-CUBE-AI. Specifically, we achieve an increased speedup of 17% with identical classification accuracy, while even for LeNet we manage better latency for 7% accuracy loss. As shown, our framework can surpass even commercial tools like X-CUBE-AI (which also have very limited flexibility), providing an accuracy-latency trade-off that was previously unattainable for optimized libraries like CMSIS.

Lastly, we undertake a qualitative evaluation, comparing our approximation framework with other state-of-the-art methodologies. When compared to CMix-NN [9] using a model with 13.8M MAC operations, our framework achieves a latency of 124ms on a 160MHz MCU. This means that, compared to CMix-NN [9], our framework achieves a remarkable 62% reduction in latency, with a negligible accuracy degradation. Additionally, uTVM [10], an end-to-end ML compiler framework tailored for bare-metal MCUs, reports a 13% latency overhead compared to CMSIS when using a similar LeNet model architecture. For the same model, our approach outperforms uTVM, achieving an additional 32% speedup with an accuracy loss of less than 5%.

IV. CONCLUSION

In this work, to address the notable latency limitations of MCUs, we introduce a cooperative framework that combines approximate computing with software kernel optimizations. Through a systematic kernel-based computation skipping approach, our framework effectively removes operations deemed insignificant for the model's inference, resulting in accelerated inference at the expense of different accuracy trade-offs. These trade-offs have the potential to open avenues for more AI applications and enable the execution of more complex deep neural networks on tiny MCUs.

REFERENCES

[1] V. Rajapakse, I. Karunanayake, and N. Ahmed, "Intelligence at the extreme edge: A survey on reformable TinyML," ACM Comput. Surv., vol. 55, no. 13s, Jul. 2023.
[2] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," Jan. 2018.
[3] J. Lin, W.-M. Chen, Y. Lin, J. Cohn, C. Gan, and S. Han, "MCUNet: Tiny deep learning on IoT devices," in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS'20. Red Hook, NY, USA: Curran Associates Inc., 2020.
[4] G. Armeniakos, G. Zervakis, D. Soudris, and J. Henkel, "Hardware approximate techniques for deep neural network accelerators: A survey," ACM Comput. Surv., vol. 55, no. 4, Nov. 2022.
[5] S. Prakash, T. Callahan, J. Bushagour, C. Banbury, A. V. Green, P. Warden, T. Ansell, and V. J. Reddi, "CFU Playground: Full-stack open-source framework for tiny machine learning (TinyML) acceleration on FPGAs," in 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023, pp. 157–167.
[6] Z. Jia, D. Li, C. Liu, L. Liao, X. Xu, L. Ping, and Y. Shi, "TinyML design contest for life-threatening ventricular arrhythmia detection," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1–1, 2023.
[7] J. Zhang, X. Chen, M. Song, and T. Li, "Eager pruning: Algorithm and architecture support for fast training of deep neural networks," in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 292–303.
[8] STMicroelectronics, "X-CUBE-AI: AI expansion pack for STM32CubeMX," Dec. 2019. [Online]. Available: https://www.st.com/en/embedded-software/x-cube-ai.html
[9] A. Capotondi, M. Rusci, M. Fariselli, and L. Benini, "CMix-NN: Mixed low-precision CNN library for memory-constrained edge devices," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 67, 2020.
[10] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "TVM: An automated end-to-end optimizing compiler for deep learning," in Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI'18, 2018.