Deep Learning Using Rectified Linear Units (ReLU)
ABSTRACT
We introduce the use of rectified linear units (ReLU) as the classification function in a deep neural network (DNN). Conventionally, ReLU is used as an activation function in DNNs, with the Softmax function as their classification function. However, there have been several studies on using a classification function other than Softmax, and this study is an addition to those. We accomplish this by thresholding the raw output scores of the network at zero, i.e. f(o) = max(0, o), and taking the arg max of the thresholded scores as the class prediction.

2 METHODOLOGY
2.1 Machine Intelligence Library
Keras[4] with the Google TensorFlow[1] backend was used to implement the deep learning algorithms in this study, with the aid of other scientific computing libraries: matplotlib[7], numpy[14], and scikit-learn[11].
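For orientation, the following is a minimal sketch of the corresponding Python import stack; the version check is ours, and no specific library versions are claimed by the study.

import keras                  # high-level deep learning API [4]
import tensorflow as tf       # backend for Keras [1]
import matplotlib             # plotting [7]
import numpy as np            # numerical computing [14]
import sklearn                # preprocessing, metrics, cross validation [11]

# Print the versions of the installed stack.
print(keras.__version__, tf.__version__, matplotlib.__version__,
      np.__version__, sklearn.__version__)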
Consequently, we have

\[ p_k = \frac{\exp(o_k)}{\sum_{k=0}^{n-1} \exp(o_k)} \quad (3) \]

Hence, the predicted class would be ŷ,

\[ \hat{y} = \arg\max_{i \in 1, \ldots, N} p_i \quad (4) \]
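As a concrete illustration of Eqs. 3 and 4, the following is a minimal numpy sketch; the function names and example scores are ours, not taken from the study's code.

import numpy as np

def softmax(o):
    # Eq. 3: p_k = exp(o_k) / sum_k exp(o_k); subtracting max(o) improves numerical stability
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

def predict(o):
    # Eq. 4: the predicted class is the index with the largest probability
    return int(np.argmax(softmax(o)))

raw_scores = np.array([2.0, 1.0, 0.1])
print(softmax(raw_scores))   # class probabilities summing to 1
print(predict(raw_scores))   # -> 0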
2.4.2 Rectified Linear Units (ReLU). ReLU is an activation function introduced by [6], which has strong biological and mathematical underpinnings. In 2011, it was demonstrated to further improve the training of deep neural networks. It works by thresholding values at 0, i.e. f(x) = max(0, x). Simply put, it outputs 0 when x < 0 and, conversely, it outputs a linear function when x ≥ 0 (refer to Figure 1 for a visual representation).

The backpropagation algorithm (see Eq. 8) is the same as in the conventional softmax-based deep neural network,

\[ \frac{\partial \ell(\theta)}{\partial \theta} = \sum_i \left[ \frac{\partial \ell(\theta)}{\partial p_i} \left( \sum_k \frac{\partial p_i}{\partial o_k} \frac{\partial o_k}{\partial \theta} \right) \right] \quad (8) \]

Algorithm 1 shows the rudimentary gradient-descent algorithm for a DL-ReLU model.

Algorithm 1: Mini-batch stochastic gradient descent training of a neural network with the rectified linear unit (ReLU) as its classification function.
  Input: {x^(i) ∈ ℝ^m}_{i=1}^{n}, θ
  Output: W
  for number of training iterations do
    for i = 1, 2, . . . , n do
      ∇θ = ∇θ − (θ · y) / (max(0, θh + b) · ln 10)
      θ = θ − α · ∇θ ℓ(θ; x^(i))

Any standard gradient-based learning algorithm may be used. We used adaptive momentum estimation (Adam) in our experiments.
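The following is a schematic Python sketch of Algorithm 1: the loop structure and the update θ ← θ − α · ∇θ ℓ(θ; x^(i)) follow the pseudocode above, while grad_loss is a placeholder of ours that stands in for the gradient of the chosen loss rather than the exact expression in Algorithm 1.

import numpy as np

def relu(x):
    # f(x) = max(0, x): zero for negative inputs, identity otherwise
    return np.maximum(0.0, x)

def sgd_train(theta, data, grad_loss, alpha=1e-3, num_iterations=10):
    # Per-example stochastic gradient descent, mirroring the loop structure of Algorithm 1.
    for _ in range(num_iterations):          # for number of training iterations do
        for x_i, y_i in data:                # for i = 1, 2, ..., n do
            g = grad_loss(theta, x_i, y_i)   # placeholder for ∇θ ℓ(θ; x^(i))
            theta = theta - alpha * g        # θ = θ − α · ∇θ ℓ(θ; x^(i))
    return theta

In the experiments, this generic update is handled by the Adam optimizer rather than a hand-written loop.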
Table 1: Architecture of VGG-like CNN from Keras[4].

Layer (type)                     Output Shape          Param #
conv2d_1 (Conv2D)                (None, 14, 14, 32)    320
conv2d_2 (Conv2D)                (None, 12, 12, 32)    9248
max_pooling2d_1 (MaxPooling2)    (None, 6, 6, 32)      0
dropout_1 (Dropout)              (None, 6, 6, 32)      0
conv2d_3 (Conv2D)                (None, 4, 4, 64)      18496
conv2d_4 (Conv2D)                (None, 2, 2, 64)      36928
max_pooling2d_2 (MaxPooling2)    (None, 1, 1, 64)      0
dropout_2 (Dropout)              (None, 1, 1, 64)      0
flatten_1 (Flatten)              (None, 64)             0
dense_1 (Dense)                  (None, 256)            16640
dropout_3 (Dropout)              (None, 256)            0
dense_2 (Dense)                  (None, 10)             2570

Table 3: MNIST Classification. Comparison of FFNN-Softmax and FFNN-ReLU models in terms of % accuracy. The training cross validation is the average cross validation accuracy over 10 splits. Test accuracy is on unseen data. Precision, recall, and F1-score are on unseen data.

Metrics / Models             FFNN-Softmax    FFNN-ReLU
Training cross validation    ≈ 99.29%        ≈ 98.22%
Test accuracy                97.98%          97.77%
Precision                    0.98            0.98
Recall                       0.98            0.98
F1-score                     0.98            0.98
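For reference, the following is a minimal tf.keras sketch that reproduces the layer shapes and parameter counts of Table 1; the dropout rates, convolutional activations, and reshaping of the input into a 16 × 16 map are our assumptions, since the table only records shapes and parameter counts.

import tensorflow as tf

def build_vgg_like_cnn(num_classes=10):
    # Reproduces the layer stack of Table 1 on a 16x16x1 input
    # (the PCA-reduced 256-dimensional MNIST features reshaped to an image).
    return tf.keras.Sequential([
        tf.keras.Input(shape=(16, 16, 1)),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # -> (14, 14, 32), 320 params
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # -> (12, 12, 32), 9248 params
        tf.keras.layers.MaxPooling2D((2, 2)),                   # -> (6, 6, 32)
        tf.keras.layers.Dropout(0.25),                          # rate is an assumption
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),  # -> (4, 4, 64), 18496 params
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),  # -> (2, 2, 64), 36928 params
        tf.keras.layers.MaxPooling2D((2, 2)),                   # -> (1, 1, 64)
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Flatten(),                              # -> (64,)
        tf.keras.layers.Dense(256, activation="relu"),          # 16640 params
        tf.keras.layers.Dropout(0.5),                           # rate is an assumption
        tf.keras.layers.Dense(num_classes),                     # 2570 params; Softmax or ReLU head is the variable under study
    ])

build_vgg_like_cnn().summary()   # prints shapes and parameter counts matching Table 1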
Figure 2: Confusion matrix of FFNN-ReLU on MNIST classification.

All models used the Adam[8] optimization algorithm for training, with the default learning rate α = 1 × 10⁻³, β₁ = 0.9, β₂ = 0.999, ϵ = 1 × 10⁻⁸, and no decay.
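That optimizer configuration corresponds to the following tf.keras sketch; the commented compile call is illustrative only, and the loss shown is an assumption rather than a statement of the study's exact setup.

import tensorflow as tf

# Adam with the default hyperparameters stated above: α = 1e-3, β1 = 0.9, β2 = 0.999, ϵ = 1e-8.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
)

# Illustrative usage with a compiled Keras model (loss choice is an assumption):
# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])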
3.1 MNIST
We implemented both the CNN and FFNN defined in Tables 1 and 2 on normalized, PCA-reduced features, i.e. from 28 × 28 (784) dimensions down to 16 × 16 (256) dimensions.
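A minimal scikit-learn sketch of this preprocessing step is given below; the normalization scheme and the reshaping of the 256 PCA components into a 16 × 16 map for the CNN are our assumptions.

import numpy as np
from sklearn.decomposition import PCA

def preprocess_mnist(X):
    # X: (n_samples, 784) array of flattened 28x28 MNIST images.
    X = X.astype("float32") / 255.0     # normalize pixel values (assumed scheme)
    pca = PCA(n_components=256)         # 28x28 (784) -> 16x16 (256) dimensions
    X_flat = pca.fit_transform(X)
    X_maps = X_flat.reshape(-1, 16, 16, 1)  # flat features for the FFNN, maps for the CNN
    return X_flat, X_maps

# Example with stand-in data:
X_demo = np.random.rand(300, 784)
X_flat, X_maps = preprocess_mnist(X_demo)
print(X_flat.shape, X_maps.shape)   # (300, 256) (300, 16, 16, 1)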
In training a FFNN with two hidden layers for MNIST classification, we found the results described in Table 3.

Despite the fact that the Softmax-based FFNN had a slightly higher test accuracy than the ReLU-based FFNN, both models had 0.98 for their F1-score. These results imply that the FFNN-ReLU is on par with the conventional FFNN-Softmax.

Figures 2 and 3 show the predictive performance of both models for MNIST classification on its 10 classes. Values of correct prediction in the matrices seem to be balanced, as in some classes the ReLU-based FFNN outperformed the Softmax-based FFNN, and vice-versa.

In training a VGG-like CNN[4] for MNIST classification, we found the results described in Table 4.

Table 4: MNIST Classification. Comparison of CNN-Softmax and CNN-ReLU models in terms of % accuracy. The training cross validation is the average cross validation accuracy over 10 splits. Test accuracy is on unseen data. Precision, recall, and F1-score are on unseen data.

Metrics / Models             CNN-Softmax    CNN-ReLU
Training cross validation    ≈ 97.23%       ≈ 73.53%
Test accuracy                95.36%         91.74%
Precision                    0.95           0.92
Recall                       0.95           0.92
F1-score                     0.95           0.92

The CNN-ReLU was outperformed by the CNN-Softmax because it converged more slowly, as can be seen from the training accuracies in cross validation (see Table 5). However, despite its slower convergence, it was able to achieve a test accuracy higher than 90%. Granted, it is lower than the test accuracy of the CNN-Softmax by ≈ 4%, but further optimization may be done on the CNN-ReLU to achieve on-par performance with the CNN-Softmax.
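The precision, recall, and F1-score figures reported in Tables 3 and 4 are the standard per-class metrics on the unseen test set; a minimal scikit-learn sketch, with stand-in labels purely for illustration, is:

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in labels for illustration; in the study these are the unseen MNIST test labels and predictions.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

print(classification_report(y_true, y_pred, digits=2))  # precision, recall, F1-score per class
print(confusion_matrix(y_true, y_pred))                 # the counts behind confusion matrices such as Figure 2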
5 ACKNOWLEDGMENT
We express our appreciation for the VGG-like Convnet source code in Keras[4], as it was the CNN model used in this study.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, San-
jay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard,
Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Leven-
berg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike
Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul
Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.
(2015). http://tensorflow.org/ Software available from tensorflow.org.
[2] Abien Fred Agarap. 2017. A Neural Network Architecture Combining Gated
Recurrent Unit (GRU) and Support Vector Machine (SVM) for Intrusion Detection
in Network Traffic Data. arXiv preprint arXiv:1709.03082 (2017).
[3] Abdulrahman Alalshekmubarak and Leslie S Smith. 2013. A novel approach
combining recurrent neural network and support vector machines for time series
classification. In Innovations in Information Technology (IIT), 2013 9th International
Conference on. IEEE, 42–47.
[4] François Chollet et al. 2015. Keras. https://github.com/keras-team/keras. (2015).
[5] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and
Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances
in Neural Information Processing Systems. 577–585.
[6] Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas,
and H Sebastian Seung. 2000. Digital selection and analogue amplification coexist
in a cortex-inspired silicon circuit. Nature 405, 6789 (2000), 947.
[7] J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing In Science
& Engineering 9, 3 (2007), 90–95. https://doi.org/10.1109/MCSE.2007.55
[8] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980 (2014).
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classifica-
tion with deep convolutional neural networks. In Advances in neural information