Activation Functions For Neural Networks: Application and Performance-Based Comparison
Abstract:- The past decade has seen explosive growth of Deep Learning (DL) algorithms based on Artificial Neural Networks (ANNs) and their applications in vast emerging domains to solve real-world complex problems. DL architectures use Activation Functions (AFs) to perform the task of finding the relationship between the input features and the output. AFs are essential building blocks of any ANN, as they bring the required non-linearity to the output of the network. Layers of ANNs are combinations of linear and non-linear AFs. The most extensively used AFs are Sigmoid, Hyperbolic Tangent (Tanh) and Rectified Linear Unit (ReLU), to name a few. Choosing an AF for a particular application depends on various factors such as the nature of the application, the design of the ANN, the optimizers used in the network, and the complexity of the data. This paper presents a survey of the most widely used AFs along with the important considerations while selecting an AF for a specific problem domain. A broad guideline on selecting an AF, based on the literature survey, is presented to help researchers employ a suitable AF in their problem domain.

Keywords:- Artificial Neural Network, Activation Functions, RNN.
I. INTRODUCTION

AFs are processing units based on mathematical equations that determine the output of an ANN model. Each neuron in an ANN receives input data, applies a linear transformation (a weighted summation) and then passes the result through an AF, typically a Sigmoid or ReLU, in order to bring non-linearity to the input data. This process allows the ANN to capture complex, non-linear relationships within the data which otherwise cannot be captured by conventional Machine Learning algorithms. The performance of an ANN depends on the efficiency of its convergence, that is, on how quickly the weights assigned to the various inputs of a neuron, or a set of neurons in each layer, stabilise. An ANN is said to have converged when no further significant change to the assigned weights occurs in subsequent iterations. Depending on the selection of AFs, a network may converge faster or may not converge at all. An AF is chosen so as to limit the output to a range such as -1 to 1, 0 to 1, or -∞ to +∞, depending upon the AF used in the neurons of the different layers of the ANN.
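To illustrate this flow, a minimal NumPy sketch of a single neuron is given below; the weights, bias and inputs are arbitrary values chosen for illustration, not values taken from the paper.

import numpy as np

def sigmoid(z):
    # squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # keeps positive values, maps negative values to 0 (range 0 to +inf)
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # input features (arbitrary example values)
w = np.array([0.4, 0.1, -0.6])   # weights assigned to the inputs (arbitrary)
b = 0.2                          # bias (arbitrary)

z = np.dot(w, x) + b             # linear transformation: weighted summation plus bias
print(sigmoid(z))                # non-linear output bounded in (0, 1)
print(relu(z))                   # non-linear output in [0, +inf)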
II. ACTIVATION FUNCTIONS

Primarily, there are four types of AFs prevalent in ANNs, namely Sigmoid, ReLU, Exponential Unit and Adaptive Unit based AFs. In addition to these primary functions, there are a number of use-case based variations of the primary AFs which are well suited to a specific application area; for example, Leaky-ReLU is suitable for Convolutional Neural Networks (CNNs). Chigozie Enyinna Nwankpa et al. have presented a detailed summary of the various AFs used in ANNs. The authors identified four flavours of the Rectified Linear Unit (ReLU) that are prevalently used in neural networks, namely Leaky-ReLU, Parametric-ReLU, Randomized-ReLU and S-shaped-ReLU. Furthermore, the Sigmoid has two variants, namely the Hyperbolic Tangent (TanH) and the Exponential Linear Squashing Activation Function (ELiSH). The Softmax, Maxout, Softplus, Softsign and Swish functions have no variants [1]. Sigmoid is generally used in Logistic Regression, whereas TanH is predominantly used in the LSTM cell for NLP tasks using Recurrent Neural Networks (RNNs).
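For reference, the primary AFs and the ReLU variants named above have the following standard definitions. The NumPy sketch below is illustrative only; the default slope values for alpha are common choices, not values prescribed by [1].

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small fixed slope for negative inputs
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # same form as Leaky-ReLU, but alpha is a learnable parameter
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth exponential saturation for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    return np.log1p(np.exp(x))

def softsign(x):
    return x / (1.0 + np.abs(x))

def swish(x):
    # the input scaled by its own sigmoid
    return x * sigmoid(x)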
The Exponential Linear Unit (ELU) can produce negative values, which has an effect similar to Batch Normalisation, speeding up convergence but at a lower computational cost.

ANNs enable an algorithm to make faster decisions without human intervention because they are able to infer complex non-linear relationships among the features of the dataset. The primary role of the AF is to transform the weighted summation of the node's inputs and the bias into an output value, which is fed to the next hidden layer or emitted as the output of the output layer [Figure-01]. The output produced by the AF of the output layer is compared to the desired value by means of a Loss function, and a gradient is then calculated by an optimization algorithm (usually Gradient Descent) in order to reach a local or global minimum for the ANN through backpropagation. The resultant weight vector contains the hidden characteristics of the data. These AFs are often referred to as Transfer Functions. A typical ANN is a biologically inspired computer program, modelled on the working of the human brain. The collections of neurons are called networks because they are stacked together in the form of layers, with each set of neurons performing a specific function and inferring knowledge by detecting relationships and patterns in the data using past experiences known as training examples.

In the absence of AFs, every neuron would behave as an identity function and would only perform the weighted summation of the input features using the weights and biases; irrespective of the number of layers in the ANN, all the neurons would simply output the summation of the input features without transforming them. The resultant linear function would not be able to capture non-linear patterns in the input data, even though the ANN would become simpler. Such an ANN would, however, still work well for linear regression problems, where the predicted output is the weighted sum of the input features and the bias.
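A small numerical check (purely illustrative, with arbitrary random weights) makes the last point concrete: two stacked layers with no AF in between collapse to a single linear layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                   # input features
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)     # layer 1 weights and bias
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)     # layer 2 weights and bias

# Two linear layers stacked without an activation function ...
two_layers = W2 @ (W1 @ x + b1) + b2

# ... are equivalent to one linear layer with merged weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True: the stack collapsed to a single linear map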
Fig 1 Building Blocks of Neural Network

IV. LITERATURE SURVEY

A. Wibowo, W. Wiryawan and I. Nuqoyati experimented on cancer classification using microRNA features. In their experiment, the Gradient Descent, Momentum, RMSProp, AdaGrad, AdaDelta and Adam optimizers were used. The results showed that the ReLU AF produced 98.536% and 98.54762% accuracy with the Adam and RMSProp optimizers respectively [2].

Bekir Karlik and A. Vehbi Olgac analyzed the performance of Multi-Layered Perceptron architectures using various AFs such as Sigmoid, Uni-polar Sigmoid, Tanh, Conic Section and Radial Basis Function (RBF), with varying numbers of neurons in the hidden and output layers. The results show that when the Tanh AF was used in both the hidden and output layers, the best accuracies of around 95% and 99% were observed for 100 and 500 iterations respectively [3].

Hari Krishna Vydana and Anil Kumar Vuppala studied the influence of various AFs on speech recognition systems on the TIMIT and WSJ datasets. The results show that on the smaller dataset (TIMIT) ReLU worked better, producing the minimum phone error rate of 18.9, whereas for the larger dataset (WSJ) the Exponential Linear Unit (ELU) produced a better, reduced phone error rate of 19.5. This suggests that whenever a larger dataset is available for speech recognition, ELU should be employed first [4].

Giovanni Alcantara conducted an empirical analysis of the effectiveness of AFs on the MNIST classification of handwritten digits (LeCun). The experiments suggest that the ReLU, Leaky ReLU, ELU and SELU AFs all yield good results in terms of validation error; however, ELU performed better than all the other models [5].

Farzad et al. employed a Long Short-Term Memory based Recurrent ANN (RNN) with one hidden layer for sentiment analysis and classification of records in the IMDB, Movie Review and MNIST data sets. The Elliott AF and its modifications demonstrated the least average error and better results than the Sigmoid AF, which is more popular in LSTM networks [6].

Dubey and Jain used the ReLU and Leaky-ReLU AFs for the deep layers and the Softmax AF for the output layer on MNIST dataset classification. The results obtained show that the CNN with ReLU produced better results than Leaky ReLU in terms of model accuracy and model loss. The same findings were confirmed by Banerjee et al. in their experiment on the MNIST dataset [7].

Castaneda et al. experimented on object detection, face detection, text and sound datasets using ReLU, SELU and Maxout. The results show that ReLU is best suited for object, face and text detection, whereas SELU and Maxout are better for sound/speech detection [8].

Shiv Ram Dubey et al. used the CIFAR10 and CIFAR100 datasets for image classification experiments over different CNN models. It was observed that Softplus, ELU and CELU are better suited to MobileNet. ReLU, Mish and PDELU exhibit good performance with VGG16, GoogleNet and DenseNet. The ReLU, LReLU, CELU, ELU, GELU, ABReLU and PDELU AFs are better for networks having residual connections, such as ResNet50, SENet18 and DenseNet121 [9].

Tomasz Szandała et al. experimented on the CIFAR-10 dataset with a CNN of just two convolution layers to compare the efficiency of ReLU, Sigmoid, TanH, Leaky-ReLU, SWISH, Softsign and Softplus. The training and test data were split in a 5:1 ratio. [Fig-02] presents the accuracy of the various AFs [10]. The performance of ReLU and Leaky ReLU was the most satisfying, and both produced above 70% classification accuracy in spite of the dying ReLU condition. All the other AFs resulted in lower accuracies of less than 70% [Table-01].

Table 1 Relative Accuracy Ratio of AFs

Fig 2 Relative Accuracy Graph of AFs
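As an illustration of the kind of comparison reported in [10], the sketch below is an assumption-laden illustration rather than the authors' code: it presumes PyTorch and torchvision, and the layer widths, learning rate and epoch count are arbitrary choices. It trains the same two-convolution-layer CNN on CIFAR-10 once per AF and reports test accuracy.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

def make_cnn(act):
    # Two convolution layers followed by a small classifier; `act` is the AF class under test.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), act(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), act(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 10),
    )

def test_accuracy(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

train = torchvision.datasets.CIFAR10(".", train=True, download=True, transform=T.ToTensor())
test = torchvision.datasets.CIFAR10(".", train=False, download=True, transform=T.ToTensor())
train_loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=256)

# CIFAR-10 ships with 50,000 training and 10,000 test images, i.e. the 5:1 split noted above.
for act in [nn.ReLU, nn.Sigmoid, nn.Tanh, nn.LeakyReLU, nn.SiLU, nn.Softsign, nn.Softplus]:
    model = make_cnn(act)                                    # nn.SiLU is the Swish function
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(5):                                       # short training run for illustration
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    print(act.__name__, test_accuracy(model, test_loader))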
V. IMPORTANT CONSIDERATIONS FOR SELECTION OF AF

The selection of an AF is a critical decision when designing and training ANNs. Different AFs can have a significant impact on the performance and convergence of an ANN. The following are some critical factors that affect the choice of AF in an ANN:

Differentiability:
Two important properties of AFs are that they should be differentiable and that their gradient should be expressible in terms of the function itself. The first property makes an AF suitable for backpropagation, which is the essence of ANNs: many optimization techniques, such as gradient descent, rely on the derivative of the AF, so it is essential that the AF is differentiable or has well-defined gradients. This allows for efficient training of the network. The second desirable property reduces the computation time of the ANN, which matters because the network is at times trained on millions of complex data points. The Sigmoid and TanH AFs and their derivatives are shown below along with their graph plots [Fig-03]:
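In their standard form, both derivatives can indeed be written in terms of the function itself:

\[
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)
\]
\[
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \tanh'(x) = 1 - \tanh^{2}(x)
\]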
Fig 4 Sigmoid and its Derivative Curve

Leaky ReLU addresses the issue of dead activations by replacing the '0' output for negative inputs with alpha times x (alpha = 0.01), so that the derivative of LReLU stays slightly greater than '0' [Fig-05]. Similarly, Parametric ReLU (PReLU) and ELU are often preferred because they are less prone to vanishing gradients [11].
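In equation form, the Leaky ReLU described above and its derivative, with alpha = 0.01 as stated, are:

\[
\mathrm{LReLU}(x) =
\begin{cases}
x, & x > 0\\
\alpha x, & x \le 0
\end{cases}
\qquad
\mathrm{LReLU}'(x) =
\begin{cases}
1, & x > 0\\
\alpha, & x \le 0
\end{cases}
\qquad \alpha = 0.01
\]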
REFERENCES