Hardware Complexity Analysis of Deep Neural Networks and Decision Tree Ensembles for Real-Time Neural Data Classification
The authors are with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY. Email: (mt795, shoaran)@cornell.edu.

Abstract— A fast and low-power embedded classifier with a small footprint is essential for real-time applications such as brain-machine interfaces (BMIs) and closed-loop neuromodulation for neurological disorders. In most applications with large datasets of unstructured data, such as images, deep neural networks (DNNs) achieve a remarkable classification accuracy. However, DNN models impose a high computational cost during inference, and are not necessarily ideal for problems with limited training sets. The computationally intensive nature of deep models may also degrade the classification latency, which is critical for real-time closed-loop applications. Among other methods, ensembles of decision trees (DTs) have recently been very successful in neural data classification tasks. DTs can be designed to successively process a limited number of features during inference, and thus impose much lower computational and memory overhead. Here, we compare the hardware complexity of DNNs and gradient boosted DTs for classification of real-time electrophysiological data in epilepsy. Our analysis shows that the strict energy-area-latency trade-off can be relaxed using an ensemble of DTs, and that DTs can be significantly more efficient than alternative DNN models, while achieving better classification accuracy in real-time neural data classification tasks.

I. INTRODUCTION

Today, machine learning (ML) techniques can be used to interpret complex and noisy sensor data in a variety of applications such as medical devices, wearables, and the internet of things (IoT). To enable fast and energy-efficient classification of neural data in real-time applications such as motor decoding for BMI [1] or seizure detection for epilepsy [2], the application-specific integrated circuit (ASIC) implementation of ML algorithms is required. Furthermore, embedded learning at the edge and near the sensors is preferred over the cloud, due to latency and privacy concerns as well as limited communication bandwidth.

Different architectures of DNNs such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have recently been used for neural data classification tasks such as epileptic seizure detection and movement intention decoding [1], [3]. For example, a 5-layer CNN followed by a logistic regressor was used to detect interictal epileptiform discharges from intracranial EEG recordings in [4]. An 8-layer CNN classifier was used to detect the ictal, pre-ictal, and interictal periods from scalp EEG following a wavelet transform in [5]. To partially relax the large storage requirements of CNNs, an integer CNN architecture for detection of epileptic seizures from scalp or intracranial EEG was proposed in [6]. However, DNN models can be computationally demanding during inference, and require extensive hardware resources and large amounts of memory to store many parameters on chip. Moreover, the computationally intensive nature of DNNs may also degrade the detection latency, which is critical for real-time and closed-loop applications such as responsive stimulation and prosthetic arm control. The support vector machine (SVM) is an alternative classification model that has been used in biomedical systems-on-chip (SoCs). However, the size of the feature vector in SVM, and consequently the number of multiplications and additions required for classification, increases linearly with the number of input channels. Moreover, for cases with a highly nonlinear separation boundary, the use of nonlinear kernels can further increase the hardware complexity of SVM.

Recently, prediction models based on the gradient boosting technique [7], [8] have achieved an unprecedented accuracy in ML competitions on Kaggle, such as the classification of intracranial EEG data for epilepsy [9]. A combination of gradient boosting (XGBoost) and neural networks was the winning solution to the grasp-and-lift EEG detection contest on Kaggle. This technique employs gradient-based optimization and boosting to form an accurate classification model, by adaptively combining simple weak predictors, typically decision trees. Tree-based classifiers with simple comparators as their processing units are inherently more hardware-friendly compared to DNNs and SVMs [10]. For instance, a 32-channel embedded gradient-boosting classifier for epileptic seizure detection achieved a 27× improvement in energy-area-latency product compared to state-of-the-art SVM models [2], [11]. In contrast to DTs, other classification models extract all required features from every input channel, or directly process raw data with intensive computations, which may increase the hardware and memory overhead.

In our previous work [2], [11], we showed the superiority of gradient boosted trees over linear/non-linear SVM and LLS classifiers with a low-power microchip implementation. In this work, we specifically compare the computational complexity and energy-area requirements of neural networks and DT ensembles for neural data classification in real-time applications such as seizure detection. The complexity analysis and results are, however, applicable to other domains and similar classification tasks.
II. COMPUTATIONAL COMPLEXITY ANALYSIS

The complexity of a classifier during inference is defined by the number of computational resources and computations required to classify the input data. The memory and hardware requirements can be mathematically formulated by the number of parameters and multiply-and-accumulate (MAC) operations per classification, summarized in Table I. In the following, we discuss the computational cost of DNN and DT ensemble models.

TABLE I: Number of MAC operations and parameters required for inference in an FC layer, a CONV layer, and a DT ensemble of k trees with depth l.

                FC layer              CONV layer                                          DT ensemble
MAC count       dim(l−1) × dim(l)     f × u × v × g × o_m × o_n                           k × l × (t+2) × N_s *
Parameters      dim(l−1) × dim(l)     with sharing: (f × u × v + 1) × g                   k × (2^l − 1) × 3 + T
                                      without sharing: (f × u × v + 1) × g × o_m × o_n

* Two extra additions are required for classification and thresholding.

A. Deep Neural Networks
With multiple stacked layers for feature extraction and transformation, DNNs can model complex and non-linear relationships in data. The convolutional neural network (CNN) is a popular class of deep learning models with translation invariance characteristics, which can extract spatiotemporal features from raw input [12]. The hidden layers in a CNN typically consist of convolutional (CONV), fully connected (FC), and pooling layers [12]. The pooling layers are used to downsample the spatial dimensions of the input, and do not require any parameters to be learned. The computational complexity of a pooling layer depends on the type of pooling function (e.g., average or max pooling). A pooling layer of size u × v downsamples the input feature maps by a factor of uv. In an FC layer, all neurons at the current layer are connected to all neurons of the previous layer. For a neuron with label i at the FC layer l which receives outputs o from the neurons in the preceding layer, the propagation function is defined by n_i = f(Σ_{j=1}^{dim(l−1)} w_ij o_j), in which f is the activation function (e.g., Sigmoid or ReLU), o_j denotes the output from neuron j at layer l − 1, and the weight parameters are shown by w_ij. Therefore, a total of dim(l − 1) MAC operations and parameters are required for each neuron at layer l. Thus, the total number of parameters and MAC operations for the FC layer l would be equal to dim(l − 1) × dim(l).
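To make the FC-layer cost concrete, a minimal Python sketch that evaluates the propagation function and the dim(l−1) × dim(l) counts above; the layer sizes in the example are hypothetical and chosen only for illustration:

```python
import numpy as np

def fc_forward(o_prev, W, activation=lambda x: np.maximum(x, 0.0)):
    """Propagation function n_i = f(sum_j w_ij * o_j) for one FC layer (ReLU by default)."""
    return activation(W @ o_prev)

def fc_costs(dim_prev, dim_curr):
    """Parameter and MAC counts of an FC layer; both equal dim(l-1) * dim(l).

    Biases are ignored here, matching the per-neuron count of dim(l-1) used in the text.
    """
    return dim_prev * dim_curr, dim_prev * dim_curr

# Hypothetical sizes: a 256-neuron FC layer fed by 1024 outputs of the previous layer
params, macs = fc_costs(1024, 256)   # (262144, 262144)
```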
The CONV layers can extract spatial features from the input, as well as temporal features from time-series data. In CONV layers, a group of kernels (filters) is applied to the input, while passing the result to the next layer. Such layers may also provide downsampling (depending on the stride size of the layer) [12]. Let us assume a group of g kernels of size u × v is applied to f feature maps of dimension m × n, as shown in Fig. 1(a). Here, p_m and p_n are the amount of zero-padding on the borders of the input feature maps, while the filters are applied with a stride of s. The dimensions of the output feature map in the m and n directions can be written as o_m = (m − u + 2p_m)/s + 1 and o_n = (n − v + 2p_n)/s + 1, respectively, as shown in Fig. 1(b). To reduce the number of parameters required for a CONV layer, a parameter sharing scheme has been proposed, in which the neurons of each output feature map share the same weights and bias for filtering the inputs. Thus, the total number of weights and bias values with and without sharing can be written as:

with sharing: (f × u × v + 1) × g
without sharing: (f × u × v + 1) × g × o_m × o_n      (1)

To calculate the input to the activation function for each element in the output feature maps, f × u × v MAC operations are required. Thus, the total number of MACs required to calculate the output of a CONV layer with a group of g filters is given by:

f × u × v × g × o_m × o_n      (2)

Fig. 1: (a) A group of g kernels with a size of u × v and depth of f are applied to f input feature maps of size m × n; (b) sliding of filters on the input data.
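A minimal Python sketch of Eqs. (1) and (2); the layer shape at the bottom is a hypothetical example:

```python
def conv_output_dims(m, n, u, v, p_m, p_n, s):
    """Output feature-map dimensions o_m and o_n for a u x v kernel with stride s."""
    o_m = (m - u + 2 * p_m) // s + 1
    o_n = (n - v + 2 * p_n) // s + 1
    return o_m, o_n

def conv_costs(f, u, v, g, o_m, o_n, weight_sharing=True):
    """Parameter count per Eq. (1) and MAC count per Eq. (2) for a CONV layer."""
    params = (f * u * v + 1) * g            # with sharing: one kernel + bias per output map
    if not weight_sharing:
        params *= o_m * o_n                 # without sharing: separate weights per output element
    macs = f * u * v * g * o_m * o_n        # Eq. (2)
    return params, macs

# Hypothetical layer: g = 16 kernels of 3x3 over f = 8 input maps of 32x32, stride 1, no padding
o_m, o_n = conv_output_dims(32, 32, 3, 3, 0, 0, 1)
params, macs = conv_costs(8, 3, 3, 16, o_m, o_n)
```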
B. Decision Tree Ensembles

Decision trees are obtained by recursively partitioning the data space based on a sequence of queries, in the form of a comparison to a threshold. For a trained model, successive comparisons are then performed on the input features during inference, starting at the root node and terminating in a leaf node. To improve the classification performance, various ensemble methods such as gradient boosting and random forests have been widely used. In gradient boosting, multiple trees are built in a greedy fashion to minimize a regularized objective on the training loss, while the output of the classifier is defined by a weighted additive combination of the individual trees [2]. Given the sequential process of decision making in a tree, only a subset of nodes (along the path from root to leaf) will be visited during the top-down flow. As a result, a DT only requires the extraction of a limited number of features for classification of the input data [2].
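The top-down flow can be sketched as follows; the node layout, field names, and the final sign-based decision are illustrative assumptions rather than the exact implementation of [2]:

```python
def predict_tree(node, features):
    """Follow threshold comparisons from the root until a leaf value is reached."""
    while not node["is_leaf"]:
        # One comparison per visited node; only features along this path are needed
        if features[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["value"]

def predict_ensemble(trees, features, decision_threshold=0.0):
    """Additive combination of the trees (leaf values are assumed to absorb the
    boosting weights), followed by a final thresholding step."""
    score = sum(predict_tree(tree, features) for tree in trees)
    return score > decision_threshold

# Hypothetical one-split stump for illustration: feature 3 compared against 0.42
stump = {"is_leaf": False, "feature": 3, "threshold": 0.42,
         "left":  {"is_leaf": True, "value": -0.8},
         "right": {"is_leaf": True, "value": 0.6}}
features = [0.0] * 8
features[3] = 0.5
print(predict_ensemble([stump], features))  # True
```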
In the case of epileptic seizure detection, the input feature vector consists of spectral power features from the input channels. Therefore, for the comparison at each node of a tree, the channel number, the feature number, and the corresponding threshold value need to be determined. As a result, in a k-size ensemble of trees with a depth of l (i.e., 2^l − 1 nodes per tree), the total number of parameters required for inference is equal to:

k × (2^l − 1) × 3      (3)

In addition, a total of T coefficient parameters for FIR filters in non-overlapping bands are required, which are stored and shared among the trees, as shown in the hardware architecture of Fig. 2(b).
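A short Python helper for Eq. (3) plus the shared filter coefficients; the ensemble size, depth, and coefficient count in the example are hypothetical:

```python
def dt_ensemble_param_count(k, l, T):
    """Total inference parameters of a k-tree ensemble of depth l, per Eq. (3) plus T.

    Each of the 2**l - 1 nodes of a tree stores three values (channel number,
    feature number, threshold); the T FIR coefficients are shared across trees.
    """
    return k * (2 ** l - 1) * 3 + T

# Hypothetical ensemble: 8 trees of depth 4 sharing 64 filter coefficients
print(dt_ensemble_param_count(k=8, l=4, T=64))  # 8 * 15 * 3 + 64 = 424
```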
The maximum number of MAC operations per classification in a decision tree occurs when all of the active queries are on spectral power features, the most computationally intensive attribute. Therefore, assuming that feature extraction is performed with t-tap FIR filters followed by an energy extractor, the maximum number of MAC operations required for feature extraction in a k-size ensemble of trees with a depth of l can be written as k × l × (t + 2) × N_s, where N_s is the number of samples used for feature extraction at each node. This can be physically implemented with a serial filter architecture employing a single MAC unit per tree. Finally, two extra additions are required for classification and thresholding (Table I).

Fig. 2: Hardware architectures: (a) DNN accelerator with off-chip DRAM, compression/ReLU, a global buffer, and a PE array whose processing elements (PEs) hold filter coefficients, Ifmap and Psum scratchpads (Spad), a MAC unit, and control logic; (b) DT ensemble architecture [2].
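A sketch of the per-node feature extraction and the worst-case MAC count; reading the (t + 2) factor as one squaring and one accumulation step per sample for the energy extractor is our assumption:

```python
def dt_ensemble_max_macs(k, l, t, n_s):
    """Worst-case MACs per classification: k trees, l visited nodes per tree,
    and (t + 2) operations per sample over N_s samples at each node."""
    return k * l * (t + 2) * n_s

def band_power(samples, taps):
    """Serial t-tap FIR filter followed by an energy (sum-of-squares) extractor.

    The inner loop maps onto a single time-multiplexed MAC unit per tree:
    t filter MACs per sample, plus squaring and accumulation for the energy.
    """
    t = len(taps)
    energy = 0.0
    for i in range(t - 1, len(samples)):
        acc = 0.0
        for j in range(t):                 # t filter MACs, computed serially
            acc += taps[j] * samples[i - j]
        energy += acc * acc                # square and accumulate
    return energy
```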
In the DT ensemble architecture of Fig. 2(b) [2], however, the data movement is significantly lower. In this DT architecture, the only parameters to be retrieved from memory are the input channel number to be processed, the threshold value for comparison, the feature number, and the filter coefficients, with no need for intermediate access to on-chip memory during feature extraction or classification. As a result, a state-of-the-art energy efficiency of 41.2 nJ/class was achieved in SoC measurements [2].
B. Hardware Utilization and Storage Requirements

As discussed earlier, deep neural networks require specialized hardware platforms for energy-efficient computation of MAC operations. Such a platform would process CNNs in a layer-by-layer fashion. Based on the size and shape of the layer (i.e., CONV or FC), a portion of the on-chip resources would be reused to compute the MAC operations in a sequence. In [13], the computation mapping of the processing element (PE) array for each layer is found by maximizing the data reuse in the form of convolutional, filter, and input feature map reuse. The total core area of this accelerator with 168 PEs and 108 kB of global buffer is 12.25 mm². Moreover, a large off-chip DRAM and a bulky on-chip buffer are required for energy-efficient computation and storage of the classifier parameters in DNNs, as opposed to the minimal storage requirements of DT ensembles. While the network quantization technique proposed in [6] saves 7.2–7.6× in storage requirements compared to 32-bit floating point, it still needs 26.2× more storage than the DT ensemble in [2]. This 8-tree ensemble requires less than 1 kB of register-type memory (690 b dedicated per tree and 228 B shared), with no need for off-chip memory. Figure 3 further confirms the large number of parameters required for DNN classifiers compared to DT ensembles, based on the equations derived in Section II. An accurate estimation of the hardware resources and chip area would depend on the customized hardware architecture for each model, which is not available for the cited references that use DNN models.

Fig. 3: Comparison of the number of parameters and MAC operations required for the CNN and DT classifiers, *assuming that all visited nodes in a tree extract features from 1-s windows of input.
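As a quick check of the storage figures quoted above (a small Python calculation; only the 690-bit per-tree and 228-byte shared values come from [2], the rest is arithmetic):

```python
# Storage of the 8-tree ensemble: 690 dedicated bits per tree plus 228 shared bytes.
trees = 8
dedicated_bits_per_tree = 690
shared_bytes = 228

total_bytes = trees * dedicated_bits_per_tree / 8 + shared_bytes
print(total_bytes)  # 918.0 bytes, i.e., under 1 kB of register-type memory
```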
Furthermore, the hardware architecture in Fig. 2(b) processes the top-down flow of a tree in a sequential way, by reusing a universal feature extraction engine. This architecture allows a significant reduction in the overall chip area and power consumption, and the energy-area trade-off is therefore relaxed. As discussed in [2], the hardware complexity of this architecture does not scale with the number of input channels, and it can therefore serve as a compact and low-power solution for multichannel neural data classification. During feature extraction at each node of a tree, the small number of MAC operations in the FIR filters allows the implementation of a filter with a fully serial architecture, without penalizing the detection latency. As a result, the footprint can be further reduced by implementing one MAC per feature extraction unit.
To further improve the area and energy efficiency of the DT ensemble, a reduced bit precision for the parameters and an efficient MAC implementation in the filters, such as distributed arithmetic or memristive-based multiplication, can be employed; this remains as future work.

IV. CONCLUSIONS

In this work, we compared the hardware complexity of DNN and DT ensembles for neural data classification. Our study shows that DNNs are computationally demanding, and require a large number of parameters and MAC operations for inference. Therefore, such classifiers do not meet the stringent power and area requirements of applications such as implantables or wearables. Prior works [2], [11] have proposed hardware-friendly architectures for DTs that substantially relax the energy-area trade-off, while outperforming deep learning methods in accuracy. Our analysis further confirms that ensembles of DTs can serve as an attractive solution for embedded neural data classification.

REFERENCES

[1] D. Pei, M. Burns, R. Chandramouli, and R. Vinjamuri, "Decoding asynchronous reaching in electroencephalography using stacked autoencoders," IEEE Access, vol. 6, pp. 52889-52898, 2018.
[2] M. Shoaran, B. A. Haghi, M. Taghavi, M. Farivar, and A. Emami, "Energy-efficient classification for resource-constrained biomedical applications," IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), vol. 8, no. 4, pp. 693-707, 2018.
[3] P. Thodoroff, J. Pineau, and A. Lim, "Learning robust features using deep learning for automatic seizure detection," Machine Learning for Healthcare Conference, pp. 178-190, 2016.
[4] A. Antoniades, L. Spyrou, D. Martin-Lopez, A. Valentin, G. Alarcon, S. Sanaei, and C. C. Took, "Detection of interictal discharges with convolutional neural networks using discrete ordered multichannel intracranial EEG," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 25, no. 12, pp. 2285-2294, 2017.
[5] H. Khan, L. Marcuse, M. Fields, K. Swann, and B. Yener, "Focal onset seizure prediction using convolutional networks," IEEE Transactions on Biomedical Engineering, vol. 65, no. 9, pp. 2109-2118, 2018.
[6] N. D. Truong, A. D. Nguyen, L. Kuhlmann, M. R. Bonyadi, J. Yang, S. Ippolito, and O. Kavehei, "Integer convolutional neural network for seizure detection," IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), 2018.
[7] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," Annals of Statistics, pp. 1189-1232, 2001.
[8] T. Chen and T. He, "xgboost: eXtreme Gradient Boosting," R package version 0.4-2, 2015.
[9] Available online at www.kaggle.com/c/seizure-detection
[10] M. Shoaran, M. Farivar, and A. Emami, "Hardware-friendly seizure detection with a boosted ensemble of shallow decision trees," Int. Conf. of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016.
[11] M. Taghavi, B. A. Haghi, M. Farivar, M. Shoaran, and A. Emami, "A 41.2 nJ/class, 32-channel on-chip classifier for epileptic seizure detection," Int. Conf. of the IEEE Engineering in Medicine and Biology Society (EMBC), 2018.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[13] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[14] B. Hassibi, D. G. Stork, and G. J. Wolff, "Optimal Brain Surgeon and general network pruning," IEEE International Conference on Neural Networks, 1993.