Abstract
Neural networks have shown tremendous growth in recent years to solve numerous problems. Various types of neural networks
have been introduced to deal with different types of problems. However, the main goal of any neural network is to transform
the non-linearly separable input data into more linearly separable abstract features using a hierarchy of layers. These layers are
combinations of linear and nonlinear functions. The most popular and common non-linearity layers are activation functions (AFs),
such as Logistic Sigmoid, Tanh, ReLU, ELU, Swish and Mish. In this paper, a comprehensive overview and survey is presented
for AFs in neural networks for deep learning. Different classes of AFs such as Logistic Sigmoid and Tanh based, ReLU based,
ELU based, and Learning based are covered. Several characteristics of AFs such as output range, monotonicity, and smoothness are
also pointed out. A performance comparison is also performed among 18 state-of-the-art AFs with different networks on different
types of data. The insights of the AFs are presented to help researchers conduct further research and practitioners to select among the different choices. The code used for the experimental comparison is released at: https://github.com/shivram1987/ActivationFunctions.
1. Introduction

In recent years, deep learning has shown tremendous growth in solving challenging problems such as object detection [1], semantic segmentation [2], person re-identification [3], image retrieval [4], anomaly detection [5], skin disease diagnosis [6], and many more. Various types of neural networks have been defined in deep learning to learn abstract features from data, such as the Multilayer Perceptron (MLP) [7], Convolutional Neural Networks (CNN) [8], Recurrent Neural Networks (RNN) [9], and Generative Adversarial Networks (GAN) [10]. The important aspects of neural networks include weight initialization [11], loss functions [12], different layers [13], overfitting [14], and optimization [15].

The activation functions (AFs) play a very crucial role in neural networks [16] by learning the abstract features through non-linear transformations. The desirable properties of an AF are as follows: a) it should add non-linear curvature to the optimization landscape to improve the training convergence of the network; b) it should not increase the computational complexity of the model extensively; c) it should not hamper the gradient flow during training; d) it should retain the distribution of data to facilitate the better training of the network. Several AFs have been explored in recent years for deep learning to achieve the above mentioned properties. This survey is dedicated to the developments in the area of AFs in neural networks. The insights of the different AFs are presented along with the reasoning to benefit the deep learning community. The major contributions of this survey are as follows:

1. This survey provides a detailed classification for a wide range of AFs. It also covers the AFs very comprehensively, including Logistic Sigmoid/Tanh, Rectified Unit, Exponential Unit, and Adaptive AFs.

2. This survey enriches the reader with the state-of-the-art AFs with analysis from various perspectives. It specifically covers the progress in AFs for deep learning.

3. This survey also summarizes the AFs with brief highlights and important discussions to depict their suitability for different types of data (refer to Table 6).

4. This survey is compared with the existing surveys and performance analyses to show its importance (refer to Table 7).

5. This paper also presents performance comparisons on 4 benchmark datasets of different modalities using 18 state-of-the-art AFs with different types of networks (refer to Tables 8, 9 and 11).

The evolution of AFs is illustrated in Section 2. The progress in Logistic Sigmoid and Tanh, rectified, exponential, adaptive and miscellaneous AFs is summarized in Sections 3, 4, 5, 6, and 7, respectively. Some aspects of AFs are discussed in Section 8. A comprehensive performance analysis is conducted in Section 9. A summary with conclusions and recommendations is provided in Section 10.
problem is minimized by the Hexpo function [21], which is similar to Tanh with a scaled gradient. It is given as,

Hexpo(x) = { −a × (e^(−x/b) − 1), x ≥ 0 ; c × (e^(x/d) − 1), x < 0 }    (8)

in the output range of [−c, a]. The output of the sigmoid function is multiplied with its input in the sigmoid-weighted linear unit (SiLU) AF [23] as

SiLU(x) = x × Sigmoid(x)    (9)

in the output range of (−0.5, ∞). At the same time, an improved logistic Sigmoid (ISigmoid) AF [22] is proposed to solve the vanishing gradient problem of Sigmoid with the help of a piecewise combination of sigmoidal and linear functions. It is defined as,

ISigmoid(x) = { α × (x − a) + Sigmoid(a), x ≥ a ; Sigmoid(x), −a < x < a ; α × (x + a) + Sigmoid(−a), x ≤ −a }    (10)

in the output range of (−∞, ∞). The Linearly scaled hyperbolic tangent (LiSHT) AF scales the Tanh in a linear fashion to overcome the vanishing gradient issue [24]. The LiSHT can be defined as,

LiSHT(x) = x × Tanh(x)    (11)

in the output range of [0, ∞). The LiSHT function is symmetric, but it has the shortcoming of producing unbounded and non-negative outputs only. The Elliott AF [25] is similar to the Sigmoid function in terms of its characteristic curve and is defined as,

Elliott(x) = (0.5 × x) / (1 + |x|) + 0.5    (12)

in the output range of [0, 1]. The Soft-Root-Sign (SRS) AF [26] is defined as,

SRS(x) = x / (x/α + e^(−x/β))    (13)

in the output range of [(α × β)/(β − α × e), α], where α and β are the learnable parameters. The use of additional parameters increases the complexity of the SRS function. Most of the variants of Sigmoid/Tanh AFs have tried to overcome the vanishing gradient issue. However, this issue is still present in most of these AFs.
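The closed-form AFs above are straightforward to implement as drop-in modules. The following is a minimal PyTorch sketch of the SiLU of Eq. (9), the LiSHT of Eq. (11) and the Elliott function of Eq. (12); it is written against the public PyTorch API used later in Section 9, it is not code from the cited papers, and the small test at the bottom is only illustrative (recent PyTorch versions also ship a built-in nn.SiLU).

import torch
import torch.nn as nn

class SiLU(nn.Module):
    # Sigmoid-weighted Linear Unit, Eq. (9): x * sigmoid(x)
    def forward(self, x):
        return x * torch.sigmoid(x)

class LiSHT(nn.Module):
    # Linearly scaled hyperbolic tangent, Eq. (11): x * tanh(x)
    def forward(self, x):
        return x * torch.tanh(x)

class Elliott(nn.Module):
    # Elliott function, Eq. (12): 0.5 * x / (1 + |x|) + 0.5
    def forward(self, x):
        return 0.5 * x / (1.0 + torch.abs(x)) + 0.5

if __name__ == "__main__":
    x = torch.linspace(-5.0, 5.0, steps=11)
    for act in (SiLU(), LiSHT(), Elliott()):
        print(act.__class__.__name__, act(x))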
4. Rectified Activation Functions

A summary of rectified AFs is illustrated in Table 3. The Rectified Linear Unit (ReLU) is a simple function which is the identity function for positive input and zero for negative input, given as,

ReLU(x) = max(0, x) = { x, if x ≥ 0 ; 0, otherwise }.    (14)

Hence, the range of ReLU is [0, ∞). The gradient for positive and negative inputs is one and zero, respectively. The ReLU function solves the problem of the computational complexity of the Logistic Sigmoid and Tanh functions. The downside of ReLU is the vanishing gradient problem for the negative inputs. In spite of having the vanishing gradient problem, the ReLU AF has been used very extensively with deep learning models. The advancements in ReLU based AFs are discussed in the rest of this section.

4.1. On the Non-utilization of Negative Values of ReLU

Vanishing gradient is the main problem with the ReLU AF, which is caused by the non-utilization of negative values. The Leaky Rectified Linear Unit (LReLU) is an extension of ReLU that utilizes the negative values [34]. The LReLU is defined as,

LReLU(x) = { x, x ≥ 0 ; 0.01 × x, x < 0 }    (15)

in the output range of (−∞, ∞). The LReLU has been used in many applications with promising performance. One major problem associated with LReLU is finding the right slope of the linear function for negative inputs. Different slopes might be suited to different problems and different networks. Thus, it is extended to the Parametric ReLU (PReLU) by considering the slope for negative input as a trainable parameter [35]. The PReLU is given as,

PReLU(x) = { x, x ≥ 0 ; p × x, x < 0 }    (16)

in the output range of (−∞, ∞), where p is the trainable parameter. However, it can lead to overfitting easily, which is the downside of PReLU. The Maxout layer, which computes the maximum of several linear units, is also used as an AF [49]. Both ReLU and Leaky ReLU can be seen as special cases of Maxout. The randomized ReLU (RReLU) considers the slope of LReLU randomly during training, sampled from a uniform distribution U(l, u) [50]. The RReLU is defined as,

RReLU(x) = { x, x ≥ 0 ; R × x, x < 0 }    (17)

in the output range of (−∞, ∞), where R ∼ U(l, u), l < u and l, u ∈ [0, 1). It uses a deterministic value x/((l + u)/2) during test time.

The ReLU is not able to utilize the potentially useful information from the negative values. In most of the networks, the feature map given as the input to the AF is dense near zero. Thus, a small jitter in the rectification point can lead to difficulty in training. Concatenated ReLU (CReLU) [36] concatenates the ReLU's output over the original input and the negated input. The CReLU can be given as,

CReLU(x) = [ReLU(x), ReLU(−x)]    (18)

in the output range of [0, ∞). The CReLU is derived from the fact that the lower layer kernels in CNN models form pairs with opposite phases. The shifting of the feature map with multiple biases is also performed before the ReLU layer [51]. However, it increases the model complexity as more ReLUs are required. A Parametric Tan Hyperbolic Linear Unit (P-TELU) is also used as an AF [38]. The P-TELU is defined as,

PTELU(x) = { x, x ≥ 0 ; α × Tanh(β × x), x < 0 }    (19)

in the output range of [−α, ∞), where {α, β} ≥ 0 are the learnable parameters.

The Flexible ReLU (FReLU) [39] captures the negative values with a rectification point which is considered as trainable in the Shifted ReLU [39]. The FReLU is given as,

FReLU(x) = ReLU(x) + b    (20)

in the output range of [b, ∞). A similar arrangement is also followed by the Random Translation ReLU (RTReLU) [41] by utilizing an offset, sampled from a Gaussian distribution, given as,

RTReLU(x) = { x + a, x + a > 0 ; 0, x + a ≤ 0 }    (21)

in the output range of [0, ∞), where a is a random number. At test time, the offset is set to zero. A data dependent Average Biased ReLU (ABReLU) [44] is also investigated to tackle the negative values by a horizontal shifting based on the average of features. The ABReLU can be written as,

ABReLU(x) = { x − β, x − β ≥ 0 ; 0, x − β < 0 }    (22)

having the output range in [0, ∞), where β is computed as the average of the input activation map to the activation function. A batch dependent threshold for the ReLU is used by the Dynamic ReLU (D-ReLU) [60]. The Dual ReLU (DualReLU) [42] is a two dimensional AF for recurrent neural networks. The DualReLU is given as,

DualReLU(a, b) = max(0, a) − max(0, b)    (23)

in the output range of (−∞, ∞), where a and b are the inputs in different dimensions. Similar to the CReLU, the PairedReLU AF is used for image super-resolution [43]. The PairedReLU is given as,

PairedReLU(x) = [max(s × x − θ, 0), max(s_p × x − θ_p, 0)]    (24)

in the output range of (−∞, ∞). However, the computational complexity of PairedReLU is increased as compared to CReLU. In another attempt, the V-shaped ReLU (vReLU) AF [61] is defined as,

vReLU(x) = { x, x ≥ 0 ; −x, x < 0 }    (25)

having the output range in [0, ∞). The vReLU activation function suffers from a non-symmetric output. The SignReLU AF utilizes the negative values using the Softsign function [62]. The positive part of SignReLU is the same as the ReLU.

A Displaced ReLU (DisReLU) [63] is designed as a generalization of the Shifted ReLU [39]. The DisReLU displaces the rectification point to consider the negative values, given as,

DisReLU(x) = { x, x ≥ −δ ; −δ, x < −δ }    (26)

having the output range in [−δ, ∞). A Bendable Linear Unit (BLU) AF is investigated as,

BLU(x) = β × (√(x² + 1) − 1) + x    (27)

where −1 ≤ β ≤ 1 is a learnable parameter to adapt the shape between the identity function and a rectifier function [64]. A Lipschitz ReLU (L-ReLU) AF uses piecewise linear functions to model the degree of presence and the degree of absence of features [47]. The L-ReLU is defined as,

L-ReLU(x) = { max(φ(x), 0), x ≥ 0 ; min(η(x), 0), x < 0 }    (28)

where φ and η are non-linear functions. Moreover, the range of L-ReLU also depends upon the values of the φ and η functions.
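As an illustration of the ReLU variants in this subsection, the sketch below implements the CReLU of Eq. (18) and the FReLU of Eq. (20) as PyTorch modules. The per-channel parameterization of the FReLU bias is an assumption made for this example (the survey only specifies a trainable b), and the shapes printed at the bottom are for a dummy feature map.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CReLU(nn.Module):
    # Concatenated ReLU, Eq. (18): [ReLU(x), ReLU(-x)] along the channel axis,
    # so the number of output channels is twice the number of input channels.
    def __init__(self, dim=1):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        return torch.cat([F.relu(x), F.relu(-x)], dim=self.dim)

class FReLU(nn.Module):
    # Flexible ReLU, Eq. (20): ReLU(x) + b with a trainable bias b
    # (parameterized per channel here, which is an assumption of this sketch).
    def __init__(self, num_channels):
        super().__init__()
        self.b = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        return F.relu(x) + self.b

if __name__ == "__main__":
    x = torch.randn(2, 8, 16, 16)            # dummy feature map: N x C x H x W
    print(CReLU()(x).shape)                   # torch.Size([2, 16, 16, 16])
    print(FReLU(num_channels=8)(x).shape)     # torch.Size([2, 8, 16, 16])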
4.2. On the Limited Non-linearity of ReLU

The S-shaped ReLU (SReLU) increases the non-linearity in ReLU by combining three linear functions with four learnable parameters [65]. On a similar line, the Multi-bin Trainable Linear Unit (MTLU) [46] considers multiple bins to increase the non-linear capacity. The MTLU can be written as,

MTLU(x) = { a_0 × x + b_0, x ≤ c_0 ; a_k × x + b_k, c_(k−1) < x ≤ c_k ; ... ; a_K × x + b_K, c_(K−1) < x }    (29)

having the output range in (−∞, ∞). The number of bins and the range of the bins are hyperparameters, whereas the linear function of a bin is trainable (i.e., a_0, ..., a_K, b_0, ..., b_K are the learnable parameters). The non-differentiable nature at multiple points is the drawback of the MTLU. An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform distribution during the training for the positive inputs to control the amount of non-linearity [40]. The EReLU is defined as,

EReLU(x) = max(R × x, 0)    (30)

in the output range of [0, ∞), where R is a random number. At test time, the EReLU becomes the identity function for positive inputs. The Linearized Sigmoidal Activation (LiSHA) function considers three linear functions to increase the non-linearity characteristics [66]. It is also extended to an adaptive linear sigmoidal AF by learning the slopes of the upper and lower linear functions. The ReLU is combined with Tanh as Rectified Linear Tanh (ReLTanh) [67] to increase the non-linearity of ReLU and to overcome the vanishing gradient problem of Tanh. However, the ReLTanh is unbounded in both the positive and negative directions. The Natural-Logarithm ReLU (NLReLU) modifies the ReLU's output for positive inputs using the logarithm function to increase the degree of nonlinearity [45]. The NLReLU is defined as,

NLReLU(x) = ln(β × max(0, x) + 1.0)    (31)

having the output range in [0, ∞), where β is a constant. The NLReLU does not affect the negative regime and thus suffers from vanishing gradient. The concept of the Leaky ReLU (LReLU) is further improved to the Dynamic ReLU [68] by considering a mean square error (MSE) based additional hyperparameter. Thus, it can control the slope of the Dynamic ReLU in every epoch based on the convergence. A Piecewise Linear Unit (PLU) [69] is defined as,

PLU(x) = max(α × (x + c) − c, min(α × (x − c) + c, x))    (32)

having the output range in (−∞, +∞), where α and c are constants. Basically, the PLU activation function consists of three linear functions in pieces, but is continuous. Hence, it avoids saturation and leads to a good amount of gradient flow through the activation function during backpropagation in order to resolve the vanishing gradient problems of ReLU and Tanh. However, the PLU activation is unbounded in both positive and negative directions.

4.3. On the Unbounded Output of ReLU

The unbounded outputs of ReLU and many of its variants may lead to training instability. Moreover, a bounded AF is needed for dedicated hardware based embedded system applications. ReLU is extended to the Bounded ReLU (BReLU) [37], defined as,

BReLU(x) = min(max(0, x), A)    (33)

having the output range in [0, A]. The training stability is improved in BReLU due to two rectifications (i.e., at 0 and A).

ReLU is a common choice in practice in deep learning. ReLU based AFs are generally efficient. The major drawbacks of ReLU, such as gradient diminishing for negative inputs, limited non-linearity and unboundedness, are improved in the different AFs. However, the ReLU variants are not able to resolve all the issues of ReLU.
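To make the bounded and piecewise-linear ideas above concrete, the following is a small PyTorch sketch of the BReLU of Eq. (33) and the PLU of Eq. (32). The default constants (A = 6, α = 0.1, c = 1) are illustrative choices for this example, not values prescribed by the cited papers.

import torch
import torch.nn as nn

class BReLU(nn.Module):
    # Bounded ReLU, Eq. (33): min(max(0, x), A)
    def __init__(self, A=6.0):
        super().__init__()
        self.A = A

    def forward(self, x):
        return torch.clamp(x, min=0.0, max=self.A)

class PLU(nn.Module):
    # Piecewise Linear Unit, Eq. (32): max(alpha*(x+c)-c, min(alpha*(x-c)+c, x));
    # for 0 < alpha < 1 this is the identity on [-c, c] and has slope alpha outside.
    def __init__(self, alpha=0.1, c=1.0):
        super().__init__()
        self.alpha, self.c = alpha, c

    def forward(self, x):
        a, c = self.alpha, self.c
        return torch.maximum(a * (x + c) - c, torch.minimum(a * (x - c) + c, x))

if __name__ == "__main__":
    x = torch.linspace(-3.0, 9.0, steps=7)
    print(BReLU(A=6.0)(x))   # outputs clipped to [0, 6]
    print(PLU()(x))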
5. Exponential Activation Functions

The exponential AFs tackle the gradient diminishing problem of ReLU. Table 4 lists the properties of the exponential AFs.

Table 4: Summary of Exponential Linear Unit based activation functions.

The Exponential Linear Unit (ELU) [27] is given as,

ELU(x) = { x, x > 0 ; α × (e^x − 1), x ≤ 0 }    (34)

having the output range in [−1, ∞), where α is a learnable parameter. The ELU function exhibits all the benefits of the ReLU function. The ELU is differentiable, saturates for large negative inputs and reduces the bias shift. The negative saturation regime of ELU adds some robustness to noise as compared to the Leaky ReLU and Parametric ReLU. The ELU is extended to the Scaled ELU (SELU) [52] by using a scaling hyperparameter to make the slope larger than one for positive inputs. The SELU can be defined as,

SELU(x) = λ × { x, x > 0 ; α × (e^x − 1), x ≤ 0 }    (35)

having the output range in [−λ, ∞), where α is a hyperparameter. Basically, the SELU induces self-normalization to automatically converge towards zero mean and unit variance. The Parametric ELU (PELU) [54] changes the saturation point and exponential decay and also regulates the slope of the linear function for the positive inputs for differentiability. The PELU AF can be written as,

PELU(x) = { (a/b) × x, x ≥ 0 ; a × (e^(x/b) − 1), x < 0 }    (36)

having [−a, ∞) output range, where a and b are the trainable parameters. The parametric ELU is also explored in the Continuously differentiable ELU (CELU) [53] for the negative inputs. The CELU is given as,

CELU(x) = { x, x ≥ 0 ; α × (e^(x/α) − 1), x < 0 }    (37)

having the output range in [−α, ∞), where α is a learnable parameter. The PELU is also extended to multiple PELU (MPELU) [55] by using two learnable parameters to represent MPELU as either rectified, exponential or combined. The MPELU can be expressed as,

MPELU(x) = { x, x > 0 ; α_c × (e^(β_c × x) − 1), x ≤ 0 }    (38)

having the output range in [−α_c, ∞), where α_c and β_c are the trainable parameters.

A soft exponential AF interpolates between the exponential, linear and logarithmic functions using a trainable parameter [70]. A Shifted ELU (ShELU) AF is also explored as a locally optimal function [71]. A Parametric Rectified Exponential Unit (PREU) [57] is designed as,

PREU(x) = { α × x, x > 0 ; α × x × e^(β × x), x ≤ 0 }    (39)

having the output range in [−1, ∞), where α and β are the trainable parameters. The PREU utilizes the negative information near to zero effectively. The efficiency of ELU is improved in the Fast ELU (FELU) AF [56] with the help of simple displacement bits and integer algebra operations. The FELU is defined as,

FELU(x) = { x, x > 0 ; α × (e^(x/ln(2)) − 1), x ≤ 0 }    (40)

having the output range in [−α, ∞), with α as a learnable parameter. Recently, the properties of ELU and ReLU have been utilized to design an Elastic ELU (EELU) AF [58]. The EELU is defined as,

EELU(x) = { k × x, x > 0 ; α × (e^(β × x) − 1), x ≤ 0 }    (41)

having the output range in [−α, ∞), where α and β are the trainable parameters. The EELU preserves a small non-zero gradient for the negative input and exhibits an elastic slope for the positive input. A Parametric Deformable ELU (PDELU) AF tries to shift the mean value of the output closer to zero using a flexible map shape [59]. The PDELU is defined as,

PDELU(x) = { x, x > 0 ; α × ([1 + (1 − t) × x]^(1/(1−t)) − 1), x ≤ 0 }    (42)

having the output range in [−1, ∞), where α is a learnable parameter. A ReLU-Memristor-like AF (RMAF) [72] uses two hyperparameters to have a ReLU-like shape for positive input and to give more importance to the negative values near to zero. An Exponential Linear Sigmoid SquasHing (ELiSH) is defined in [73] as,

ELiSH(x) = { x/(1 + e^(−x)), x ≥ 0 ; (e^x − 1)/(1 + e^(−x)), x < 0 }    (43)

Moreover, it is also extended to HardELiSH, which is a multiplication of HardSigmoid and Linear in the positive part and HardSigmoid and ELU in the negative part. Here, HardSigmoid is defined as,

HardSigmoid(x) = max(0, min(1, (x + 1)/2)).    (44)

The ELU based AFs exploit the negative inputs without compromising the non-linearity. Some ELU variants also modify the function for positive inputs to make it bounded.

6. Learning/Adaptive Activation Functions

Most of the aforementioned AFs are not adaptive and might not be able to adjust based on the dataset complexity. This problem is tackled using learning/adaptive AFs, as summarized in Table 5.

Table 5: Summary of adaptive and learning based activation functions.

Some of the earlier mentioned AFs are also adaptive, such as PReLU [35], SReLU [65], PTELU [38], MTLU [46], PELU [54], MPELU [55], PREU [57], EELU [58], PDELU [59], SRS [26], etc.

The Adaptive Piecewise Linear (APL) unit is defined as a sum of hinge-shape functions [28]. It is given as,

APL(x) = max(0, x) + Σ_{s=1}^{S} a_s × max(0, b_s − x),    (45)

where a and b are the trainable parameters and S is a hyperparameter representing the number of hinges. The output range of APL is [0, ∞). Due to the trainable parameters, different neurons can learn different AFs.
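A minimal PyTorch sketch of the APL unit of Eq. (45) is given below. For brevity the S hinge parameters are shared by all units of a layer, which is a simplification of the per-neuron parameterization that the original formulation allows; the backward pass at the end only demonstrates that the hinge parameters receive gradients.

import torch
import torch.nn as nn
import torch.nn.functional as F

class APL(nn.Module):
    # Adaptive Piecewise Linear unit, Eq. (45):
    #   APL(x) = max(0, x) + sum_s a_s * max(0, b_s - x)
    # with trainable a_s, b_s; the S hinges are shared across the whole layer here.
    def __init__(self, S=2):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(S))
        self.b = nn.Parameter(torch.zeros(S))

    def forward(self, x):
        out = F.relu(x)
        for s in range(self.a.shape[0]):
            out = out + self.a[s] * F.relu(self.b[s] - x)
        return out

if __name__ == "__main__":
    apl = APL(S=2)
    x = torch.randn(4, 10)
    apl(x).sum().backward()
    print(apl.a.grad.shape, apl.b.grad.shape)   # the hinge parameters receive gradients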
Ramachandran et al. [29] have performed an automatic search, which resulted in the Swish AF. It is defined as,

Swish(x) = x × Sigmoid(β × x)    (46)

where β is a learnable parameter. The output range of Swish is (−∞, ∞). Based on the learnt value of β, the shape of the Swish AF is adjusted between the linear and ReLU functions. The smaller and higher values of β lead towards the linear and ReLU functions, respectively. Thus, it can control the amount of non-linearity based on the dataset and network complexity. Swish is also extended to E-Swish by multiplying the Swish with a learnable parameter to control the slope in the positive direction [77]. The E-Swish is defined as,

ESwish(x) = β × x × Sigmoid(x)    (47)

having the output range in (−∞, ∞), where β is a trainable parameter. A flatten-T Swish considers the zero function for negative inputs, similar to the ReLU [81]. The Adaptive Richard's Curve weighted Activation (ARiA) is also motivated from Swish and replaces the sigmoidal function with Richard's Curve [82]. The ARiA AF uses five hyper-parameters to control the shape of the non-linearity.

The basic AFs are combined with learnable weights in adaptive AFs [76]. The Adaptive AF (AAF) designed over PReLU [35] and PELU [54] is given as,

AAF(x) = σ(w × x) × PReLU(x) + (1 − σ(w × x)) × PELU(x)    (48)

having the output range in [0, 1], where σ is the sigmoidal function and w is a learnable parameter. In practice, AAF is costly as multiple AFs are involved. In [83], the AF for each neuron is selected from a library of AFs. In [84], different combinations of the identity function, ReLU, and Tanh are learnt automatically. In another attempt, an Adaptive Blending Unit (ABU) is defined to allow the networks to learn their preferred AFs [85]. The ABU combines a set of AFs with trainable weights. A Lookup Table Unit (LuTU) function [86] uses a single period cosine mask based smoothing and linear interpolation using a set of anchor points. Activation ensembles are used at each layer in [87] with the contribution of each AF controlled by trainable weights. Similarly, the Self-Learnable AF (SLAF) computes the sum of the different functions in an ensemble with the learnt coefficients [79]. The SLAF can be expressed as,

SLAF(x) = Σ_{i=0}^{N−1} a_i × x^i    (49)

in the output range of (−∞, ∞), where a_i is the trainable parameter. A Mexican ReLU (MeLU) AF is proposed in [80] by using a "Mexican hat type" function and is given as,

MeLU(x) = PReLU(x) + Σ_{j=1}^{k} c_j × max(λ_j − |x − a_j|, 0)    (50)

in the output range of (−∞, ∞), where c_j is the trainable parameter and λ_j & a_j are real numbers.

A cubic spline interpolation is also used to learn the AF from data [74], which is given as,

SAF(x) = Φ(s; q)    (51)

having the output range in (−∞, ∞), where Φ(.) is parameterized by a vector q, cubic in nature. Fourier series basis expansion is used for nonparametrically learning AFs (NPF) [88]. Hyperactivations utilize a hypernetwork on top of an activation network, which are used to explore the AFs search space [89]. A shallow neural network is used in the activation network to produce the output for each input, whereas a neural network is used in the hypernetwork to produce weights for another network. A bi-modal derivative adaptive activation (BDAA) function uses a twin maxima derivative sigmoidal function [75] by controlling the maxima's position with an adaptive parameter. The BDAA is given as,

BDAA(x) = (1/2) × (1/(1 + e^(−x)) − 1/(1 + e^(−x−a)))    (52)

in the output range of [0, 1], where a is a learnable parameter. The authors have exploited the bi-modal derivatives on four AFs. Linear regression is used in [78] to train the AF for each neuron, which results in different AFs for the different neurons. The TAF is defined as,

TAF(x) = √((x − a)² + b²)    (53)

in the output range of [b, ∞), where a and b are the trainable parameters. Recently, a trainable parameter was used in different non-adaptive AFs such as Sigmoid, Tanh, and ReLU to make them adaptive [90].
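As a concrete example of a learnable AF, the sketch below implements the Swish of Eq. (46) with a trainable β as a PyTorch module; the initial value β = 1 is an illustrative choice, and the short check at the end only verifies that β receives a gradient so that it can adapt during training.

import torch
import torch.nn as nn

class Swish(nn.Module):
    # Swish, Eq. (46): x * sigmoid(beta * x) with a trainable beta.
    # Small beta pushes the shape towards a (scaled) linear function,
    # large beta pushes it towards ReLU.
    def __init__(self, beta_init=1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(float(beta_init)))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

if __name__ == "__main__":
    act = Swish()
    x = torch.randn(8, 16)
    act(x).sum().backward()
    print(act.beta.grad)   # beta gets a gradient, so it adapts during training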
The adaptive and trainable AFs are the recent trend to adjust the non-linearity based on the data and network complexity. However, the minimal burden is increased in terms of the increased number of parameters. Though the complexity of tunable AFs is relatively increased w.r.t. non-tunable AFs, it is negligible w.r.t. all parameters of the entire network in practice. The same is also observed experimentally as reported in Table 10 in terms of the training time.

7. Miscellaneous Activation Functions

This section covers other attempts in AFs such as Softplus, Probabilistic, Polynomial, Subnetwork and Kernel.

7.1. Softplus Activation Functions

The softplus function [91] was proposed in 2001 as log(e^x + 1) and mostly used in statistical applications. After the breakthrough of deep learning the softmax function is used as the AF [92]. Softmax function produces the categorical probability distribution equivalent output. Softplus unit based AF is also used in deep neural networks [93]. The smooth nature of the Softplus facilitates the differentiability. The noisy softplus AF [94] is suitable for the spiking neural networks (SNNs). A Softplus Linear Unit (SLU) is also proposed by considering softplus with rectified unit [95]. The SLU AF is defined as,

SLU(x) = { α × x, x ≥ 0 ; β × log(e^x + 1) − γ, x < 0 }    (54)

where α, β and γ are the trainable parameters with α controlling the slope in the positive direction, β controlling the saturation points in the negative direction and γ controlling the offset in the negative direction w.r.t. the horizontal axis. The Rectified Softplus (ReSP) AF introduces the rectification for positive input in Softplus activation [96]. In order to make the softplus function to follow the zero mean, a shifting and scaling of the outputs is performed in [97]. A Rand Softplus (RSP) AF models the stochasticity-adaptability of biological neurons as,

RSP(x) = (1 − ρ) × max(0, x) + ρ × log(1 + e^x)    (55)

where ρ is a stochastic hyperparameter [98]. It improves the capability of the network towards the noise. The softplus function is also used with Tanh function in Mish activation function [99], which is given as,

Mish(x) = x × Tanh(Softplus(x)).    (56)

The Mish is a non-monotonic and smooth AF. It has recently been used by the YOLOv4 model for object detection [100]. However, the increased complexity in Mish due to the multiple functions can be a limitation for the deep networks.

7.2. Probabilistic Activation Functions

So far, stochastic AFs have not been much explored due to expensive sampling processes. Few AFs exist in this category such as Randomized ReLU (RReLU) [50], Elastic ReLU (EReLU) [40], Randomly Translational ReLU (RTReLU) [41] and Gaussian Error Linear Unit (GELU) [101]. GELU [101] considers nonlinearity as the stochastic regularization driven transformation and defined as,

GELU(x) = x × P(X ≤ x)    (57)

where P is the probability. The complexity of GELU increases due to use of probabilistic nature. The GELU is also extended to the Symmetrical Gaussian Error Linear Unit (SGELU) [102] to enhance its ability of bidirectional convergence. Doubly truncated Gaussian distributions [103] is a family of nonlinearities which can generate different AFs such as Sigmoid, Tanh and ReLU by setting the appropriate truncation points. Probabilistic AF (ProbAct) introduces the adaptable and trainable variance in the ReLU's output [104]. It leads to the generalization of the models. However, all other drawbacks of ReLU exist with ProbAct also.

7.3. Polynomial Activation Functions

Smooth Adaptive AF (SAAF) is defined as the piecewise polynomial function [105]. Two power functions symmetric to the linear part of ReLU are combined in [106] to improve the performance of ReLU. A piecewise polynomial approximation based AF is also learnt from the data [107]. This activation leads to the light-weight models suitable for the FPGAs and microcontrollers. The AF is also treated as the cumulative distribution function [108]. The ReLU is also extended to a Rectified Power Unit (RePU) for positive inputs as,

RePU(x) = { x^s, x ≥ 0 ; 0, x < 0 }    (58)

where s is a hyperparameter [109]. The RePU is suitable for smoother gradients near zero. However, vanishing gradient, unbounded and asymmetric nature are the downsides of RePU. The rational function of polynomials is better suited as compared to the polynomial functions in order to approximate the ReLU [110]. Recently, a Padé approximation is used to develop a non-smooth Padé Activation Unit (PAU) [111] as,

PAU(x) = P(x) / Q(x)    (59)

where P(x) and Q(x) are two polynomials of order m and n, respectively. The PAUs can approximate the commonly used hand-designed AFs. Moreover, it can also learn the new AFs with compact representations. Recently, a Rational AF (RAF) [112] was proposed to tackle the problem of non-smooth nature of the PAU function.

7.4. Activations as a Subnetwork

A Variable AF (VAF) is used as a subnetwork of ReLUs [113]. It uses the ensemble of ReLUs in a subnetwork using learnable parameters. In a very similar approach, the maximum of multiple linear functions is used in the Dynamic ReLU (DY-ReLU) [114]. In Wide Hidden Expansion (WHE) [115], each WHE intermediate channel is followed by one AF before connecting to the output channel to increase the non-linearity of the network. An AF Unit (AFU) [116] uses a small neural network to model the activation. All neurons in the original network share the weights in AFU. The advantage of the AFU is that different AFs can be learnt by different layers.
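The Mish of Eq. (56) and the GELU of Eq. (57) can be written directly from their definitions and checked against the corresponding PyTorch built-ins; the sketch below does exactly that, using Φ(x) = 0.5 × (1 + erf(x/√2)) for the Gaussian CDF. F.mish requires a reasonably recent PyTorch (roughly 1.9 or later).

import math
import torch
import torch.nn.functional as F

def gelu_from_cdf(x):
    # Eq. (57): x * P(X <= x) with X ~ N(0, 1); Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def mish_manual(x):
    # Eq. (56): x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

if __name__ == "__main__":
    x = torch.linspace(-4.0, 4.0, steps=101)
    print(torch.allclose(gelu_from_cdf(x), F.gelu(x), atol=1e-6))  # exact (erf-based) GELU
    print(torch.allclose(mish_manual(x), F.mish(x), atol=1e-6))    # needs torch >= 1.9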
7.5. Kernel Activation Functions

A Kernel-based non-parametric AF (KAF) [117] uses an inexpensive kernel expansion to make the activation flexible. The KAF is further extended to multikernel AFs (multi-KAF) [118]. Several AFs are also introduced for complex valued neural networks [119], [120], [121].

8. Aspects of Activation Functions

This section summarizes the effect of weight initialization, the understanding of AFs, and their suitability with different types of data. The learning of the network speeds up drastically by using an orthogonal weight initialization based on dynamical isometry [122]. A set of conditions in parameter initialization also boosts the performance of networks with sigmoidal activations [123]. A symmetric probability distribution based weights and biases initialization leads the network to suffer from the dying ReLU problem. However, asymmetric initialization resolves the dying ReLU problem [124]. Over-parameterization during initialization also benefits the training [125]. Data-dependent weight initialization using a subset of data minimizes the issues of the ReLU [126], whereas an initial parameter sharing based initialization guarantees the dynamical isometry for the ReLU [127].

Several researchers have tried to understand the working and impact of AFs through different strategies. Lower and upper bounds are established for network complexity to realize that the ReLU in deep networks approximates smooth functions more efficiently as compared to shallow networks [128]. A ReLU network with only one hidden layer is trained to reach the global optimum in polynomial time even with exponentially growing input dimension [129]. ReLU type AF based neural networks produce overconfident predictions far away from the training data [130]. However, this can be resolved by employing adversarial confidence enhanced training. A Gaussian margin driven time and accuracy tradeoff analysis is also done on the ReLU's learning [131]. The singular values for ReLU layers are analyzed to understand the interaction of ReLU with the linear components [132]. The approximation of a Gaussian posterior distribution over the ReLU network weights fixes the overconfidence problem [133].

Although most of the AFs are tested over image data, there are a few research papers dealing with AFs for other types of data. Table 6 summarizes the insights and remarks of state-of-the-art AFs for various networks and datasets.

9. Performance Comparison and Analysis

This survey is compared with the existing surveys/performance analyses, and an experimental performance analysis of selected AFs is performed over image, text and speech data.

9.1. Comparison with Existing Survey/Performance Analysis

A performance analysis of AFs was conducted using a multilayer perceptron network in [134]. Among the compared AFs, the Tanh has shown better performance. A comparative performance analysis of different AFs suggests the Elliott function as better suited for classification using LSTM networks [25]. The ELU outperforms the ReLU, LReLU, and SELU AFs over the MNIST classification task using Deep Neural Networks [135]. The ELU is reported in [136] to outperform the ReLU, LReLU, PReLU and PELU over sufficiently large datasets for speech recognition. However, for smaller datasets, the ReLU is preferred. A similar trend is also reported in [137] with a note that the ELU and SELU AFs exhibit faster convergence as compared to the ReLU and LReLU AFs. In [138], 21 AFs are listed without an experimental results comparison. In contrast to [138], this paper presents a comprehensive survey of AFs. The ReLU based deep networks perform superior or mildly worse than the spline methods [139]. A review of adaptive functions is conducted in [140] by considering 9 functions, including Sigmoid, Tanh, PReLU, and adaptTanh. In [141], the comparison between ReLU and LReLU is performed using a CNN on the MNIST dataset. An empirical study is also done for the variations of the ReLU activation by generalizing it with the help of parameters [142]. The comparison of AFs is also performed for generalized learning vector quantization [143]. The ReLU activation has performed better for object, face, and text datasets [144]. However, the SELU and Maxout have performed better for medical and sound datasets, respectively [144]. A piecewise AF is better suited for facial expression recognition in [145]. A survey of adaptive AFs is conducted in [146] without experimental comparison. The evaluation of seven AFs is conducted in [147] using a simple network over the CIFAR10 dataset, whereas in our survey we cover different AFs and also perform the experimental comparison. A summary of the comparison with existing surveys and performance analyses of AFs is shown in Table 7. Following are the observations:

• This survey presents a detailed classification to cover the wide range of AFs as compared to the existing surveys and performance analyses.

• This survey covers exhaustive state-of-the-art AFs to date, whereas the existing surveys/performance analyses cover either a limited number of AFs or only basic AFs.

• The performance analysis conducted in this paper considers a wide range of neural networks over different types of data for eighteen AFs, whereas the existing analyses are limited to a single type of data and network.

• This survey highlights the trends to help researchers to further explore better AFs and practitioners to choose based on the data and network types.
Table 6: Summary of the existing state-of-the-art activation functions.
9.2. Experimental Performance Analysis

In order to compare the AFs, three experiments are conducted in this paper, including image classification, language translation and speech recognition. Eighteen state-of-the-art AFs are considered for the analysis, including Logistic Sigmoid, Tanh, Elliott [25], ReLU [8], LReLU [34], PReLU [35], ELU [27], SELU [52], GELU [101], CELU [53], Softplus [93], Swish [29], ABReLU [44], LiSHT [24], Soft-Root-Sign (SRS) [26], Mish [99], PAU [111] and PDELU [59]. Note that Swish, ABReLU, LiSHT, SRS, Mish, PAU and PDELU are the most recent functions. Google Colab based computational resources are used in most of the experiments. A few experiments are also performed on a desktop system with an 8 GB GPU. The PyTorch framework is used in all the experiments.

The CIFAR10 and CIFAR100 datasets1 [148] are used for the image classification experiment in this paper. The CIFAR10 dataset contains 50,000 training images and 10,000 test images from 10 object categories. The CIFAR100 dataset contains 50,000 training images and 10,000 test images from 100 object categories. We also utilize language translation and speech recognition datasets for the experiments. For the experiments over the CIFAR-10 and CIFAR-100 datasets, training is performed for 100 Epochs. The batch size is 128 for CIFAR-10 and 64 for CIFAR-100. The learning rate is 0.001 for the first 80 Epochs and 0.0001 for the last 20 Epochs. Random crop and random horizontal flip are the data augmentations used during training. Data normalization is performed both during train and test times. The Adam optimizer is used for the training with cross entropy loss. All existing activation functions except softmax are replaced with the corresponding activation function in the different networks.

1 https://www.cs.toronto.edu/~kriz/cifar.html
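Such a replacement of the existing activations can be carried out generically over PyTorch modules. The helper below is an illustrative sketch of the swap together with the optimizer and learning-rate schedule stated above; it is not the authors' released code (see the repository linked in the abstract for that), and it only swaps activation modules, so models that call activations functionally inside forward would need their definitions edited instead.

import torch
import torch.nn as nn
import torchvision

def replace_activations(model, new_act_factory, kinds=(nn.ReLU,)):
    # Recursively swap every activation module of the listed kinds for a new one.
    for name, child in model.named_children():
        if isinstance(child, kinds):
            setattr(model, name, new_act_factory())
        else:
            replace_activations(child, new_act_factory, kinds)
    return model

if __name__ == "__main__":
    model = torchvision.models.vgg16(num_classes=10)          # CIFAR10 has 10 classes
    model = replace_activations(model, lambda: nn.SiLU())     # e.g. swap ReLU for Swish/SiLU
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # learning-rate schedule described above: 0.001 for the first 80 epochs, then 0.0001
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    print(model.classifier)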
Table 7: Comparison of this survey with the existing surveys and performance evaluations.
The test accuracy is reported in Tables 8 and 9 on CIFAR10 and CIFAR100 datasets, respectively. In these Tables, the mean and standard deviation of image classification accuracy over 5 trials are reported for each AF. Moreover, the better results are highlighted. Different types of CNN models are used in this experiment, such as plain models (i.e., MobileNet [149] and VGG16 [150]), inception model (i.e., GoogLeNet [151]) and skip/residual connection based models (i.e., ResNet50 [152], SENet18 [153], and DenseNet121 [154]). The MobileNet, GoogLeNet and SENet18 are light models, whereas the VGG16, ResNet50 and DenseNet121 are heavy models in terms of the number of trainable parameters.

Table 9: Experimental results comparison over CIFAR100 dataset.

Overall, it is observed that the Softplus, ELU and CELU are better suited with MobileNet. The ReLU, Mish and PDELU exhibit good performance with VGG16, GoogleNet and DenseNet. The ReLU, LReLU, ELU, GELU, CELU, ABReLU, and PDELU activation functions are better for the networks having residual connections, such as ResNet50, SENet18 and DenseNet121. In order to demonstrate the convergence of different AFs, the training loss vs epochs is plotted in Fig. 3 on CIFAR100 dataset using different models. The PAU has emerged as a promising AF with fastest convergence in most of the cases. The PReLU, GELU and PDELU AFs are also consistent with good convergence. Note that the training diverges with SRS for the SENet18 model. Sigmoid and Elliott AFs showed the poorest convergence. The time taken for the training is also computed for different AFs using different CNN models on CIFAR100 dataset and reported in Table 10. These results are computed using a desktop computer system having 32 GB RAM and 8 GB Nvidia GPU Card for 100 epochs of training. The time is represented in hh:mm:ss format. It is clear that PDELU AF is very inefficient. Moreover, SRS and Elliott also take more time for training. The activations such as ReLU, ELU, CELU, and Softplus depict a good tradeoff between the accuracy and training time.

The results for language translation and speech recognition for different AFs are illustrated in Table 11.
Table 10: Training time (hh:mm:ss) comparison over CIFAR100 dataset.

Table 11: Experimental results for German to English language translation and speech recognition tasks.

Activations    | Language Translation | Speech Recognition
               | Bleu Score           | Average CER  | Average WER
---------------|----------------------|--------------|-------------
Sigmoid        | 14.59 ± 0.47         | 0.53 ± 0.18  | 1.19 ± 0.39
Tanh           | 20.93 ± 0.91         | 0.26 ± 0     | 0.68 ± 0
Elliott [25]   | 14.49 ± 0.96         | 0.40 ± 0.01  | 0.93 ± 0.01
ReLU [8]       | 18.88 ± 0.86         | 0.24 ± 0.01  | 0.66 ± 0.01
LReLU [34]     | 18.89 ± 0.82         | 0.24 ± 0     | 0.66 ± 0.01
PReLU [35]     | 20.04 ± 0.98         | 0.24 ± 0     | 0.65 ± 0
ELU [27]       | 19.40 ± 1.33         | 0.25 ± 0     | 0.67 ± 0
SELU [52]      | 20.85 ± 0.64         | 0.26 ± 0     | 0.69 ± 0.01
GELU [101]     | 18.75 ± 1.83         | 0.24 ± 0     | 0.65 ± 0
CELU [53]      | 18.71 ± 0.55         | 0.25 ± 0     | 0.67 ± 0
Softplus [93]  | 16.78 ± 0.84         | 0.30 ± 0.01  | 0.76 ± 0.02
Swish [29]     | 19.51 ± 0.97         | 0.24 ± 0.01  | 0.65 ± 0.01
ABReLU [44]    | 17.55 ± 0.63         | 0.25 ± 0     | 0.68 ± 0
LiSHT [24]     | 20.39 ± 0.93         | 0.29 ± 0.01  | 0.74 ± 0.01
SRS [26]       | 20.66 ± 0.78         | 0.28 ± 0     | 0.72 ± 0
Mish [99]      | 19.56 ± 1.15         | 0.24 ± 0     | 0.65 ± 0
PAU [111]      | 20.11 ± 1.24         | 0.24 ± 0     | 0.65 ± 0.01
PDELU [59]     | 19.07 ± 0.95         | 0.25 ± 0     | 0.67 ± 0.01

The German to English translation is used to test the performance of the AFs over text data. Benchmark Seq2Seq model consisting of a Long Short Term Memory (LSTM) based autoencoder network is used for the experiment. The model and dataset are downloaded from Kaggle2. The AF is applied to the feature embedding before the dropout layer. For the language translation experiments, the number of Epochs is set to 50 with 0.001 learning rate and 256 batch size. The embedding size of encoder and decoder is 300. The dropout factor is 0.5 for both encoder and decoder. Adam optimizer is used for the training with cross entropy loss. The Bleu score [155] with 4-gram is reported in Table 11 in 2nd column for different AFs. The mean and standard deviation of Bleu score over 5 trials are reported for each AF. It is noticed that the Tanh and SELU AFs are better suitable for language translation. The PReLU, LiSHT, SRS and PAU AFs also perform better for language translation.

The speech recognition experiment is also performed to show the performance of the different AFs for time-series signal data. The end-to-end speech recognition based Deep Speech 2 framework available from assemblyai3 is used. The model consists of 2 layers of residual convolution layers to learn the relevant audio features, and 2 layers of bidirectional gated recurrent units (GRUs) to use the learned residual convolutional audio features. The 100 hours of transcribed audio English data from LibriSpeech dataset is used for the experiment. For the speech recognition experiments, torchaudio 0.4.0 and torch 1.4.0 are used. The model consists of 2 CNN layers and 2 RNN layers. The dimension of a RNN layer is 512. Number of classes is 29 in the dataset. Dropout factor is 0.5. The learning rate is 0.0005, batch size is 10 and the number of Epochs is 10. The mean and standard deviation over 5 trials of character error rate (CER) and word error rate (WER) are reported in Table 11 for speech recognition. The recent AFs such as PReLU, GELU, Swish, Mish and PAU AFs are found as the most suitable for speech recognition in this experiment.

2 https://www.kaggle.com/parthplc/pytorch-seq2seq-machine-translation/notebook
3 https://www.assemblyai.com/blog/end-to-end-speech-recognition-pytorch
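For reference, character and word error rates of the kind reported in Table 11 can be computed from a Levenshtein edit distance; the self-contained Python sketch below shows one way to do so. It is not the evaluation code used for the experiments, and the reference/hypothesis strings are made-up examples.

def edit_distance(ref, hyp):
    # dynamic-programming Levenshtein distance between two sequences
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (r != h))    # substitution (0 cost if equal)
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def char_error_rate(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(round(word_error_rate(ref, hyp), 3))   # 2 edits over 6 words, about 0.333
    print(round(char_error_rate(ref, hyp), 3))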
10. Conclusion and Recommendations

An extensive and up to date survey of activation functions is conducted in this paper. Different types of AFs are considered, including Logistic Sigmoid and Tanh based, ReLU based, ELU based, and Learning based. However, the main focus is given to the recent developments in AFs in view of the deep learning applications of neural networks. The overview of AFs presented in this paper focuses on aspects including the detailed coverage of AFs, their classification, and the performance comparison over image, text and speech data.

Following are the concluding remarks of the survey and performance analysis conducted through this paper:

• Most of the improvements in Logistic Sigmoid and Tanh target the non zero-mean and zero-gradient problems. However, these improvements carry forward the drawback of increased complexity.

• The ReLU variants try to tackle the three major problems of ReLU, namely under-utilization of negative values, limited nonlinearity and unbounded output. These activations perform well for some applications, e.g. LReLU and ABReLU work better with residual networks. However, most of these activations fail to perform better than ReLU, e.g. LReLU, PReLU and ABReLU do not improve for MobileNet, VGG and GoogleNet models. Note that the ReLU, Leaky ReLU and PReLU AFs are the most common choices among researchers due to their simplicity. Moreover, many networks consider the ReLU as a default choice for the AF.

• The exponential based AFs also focus on the better utilization of the negative values and on avoiding saturation for important features. However, most of the exponential activations suffer due to their non-smooth functions.

• The learning based adaptive AFs try to find the best parameters to represent the non-linearity needed for the given dataset. This category of AF has gained more popularity in recent years. However, the major problem associated with such AFs is finding a better base function and the number of trainable parameters. Some AFs diverge during the training if not initialized properly.

• In contrast to existing surveys, this survey covers an exhaustive list of different types of AFs. Moreover, a performance analysis on different types of data using several AFs provides new insights for future research.

Following are the recommendations curated from this survey and performance analysis:

• In order to speed up the training, both negative & positive values should be used to ensure the near zero mean.

• The most important aspect in deep learning is to find a network having matching complexity as the dataset complexity. If the complexity of the model is high then it may lead to overfitting, and if the complexity of the model is low then it may lead to under convergence. Thus, the AF should bridge this gap based on the model and dataset complexity during training automatically.

• The Logistic Sigmoid and Tanh AFs should be avoided for Convolutional Neural Networks as they lead to poor convergence. However, this type of AF is commonly used as gates in recurrent neural networks.

• Despite the ReLU being a popular choice, recently proposed AFs such as Swish, Mish, and PAU are also worth trying for different problems.

• The ReLU, Mish and PDELU activation functions have shown a good performance with VGG16 and GoogleNet. The ReLU, LReLU, ELU, GELU, CELU, and PDELU functions are better for the networks having residual connections for image classification.

• In general, the parametric AFs show better convergence as they can adapt to the data faster by learning the parameter from the data. Specially, PAU, PReLU and PDELU have shown better convergence.

• Some AFs lead to increased training time complexity. PDELU and SRS are such examples. However, AFs such as ReLU, SELU, GELU, and Softplus depict a promising tradeoff between the accuracy and training time.

• The exponential AFs generally lead to increased non-linearity due to the utilization of the negative values.

• The Tanh and SELU AFs are found better for language translation along with PReLU, LiSHT, SRS and PAU.

• It is suggested to use the PReLU, GELU, Swish, Mish and PAU AFs for speech recognition.

References

[1] F. Shao, L. Chen, J. Shao, W. Ji, S. Xiao, L. Ye, Y. Zhuang, J. Xiao, Deep learning for weakly-supervised object detection and localization: A survey, Neurocomputing (2022).
[2] Y. Mo, Y. Wu, X. Yang, F. Liu, Y. Liao, Review the state-of-the-art technologies of semantic segmentation based on deep learning, Neurocomputing (2022).
[3] Y. Guo, F. Feng, X. Hao, X. Chen, Jac-net: Joint learning with adaptive exploration and concise attention for unsupervised domain adaptive person re-identification, Neurocomputing (2022).
[4] S. R. Dubey, A decade survey of content based image retrieval using deep learning, IEEE Transactions on Circuits and Systems for Video Technology (2021).
[5] X. Xia, X. Pan, N. Li, X. He, L. Ma, X. Zhang, N. Ding, Gan-based anomaly detection: A review, Neurocomputing (2022).
[6] H. Li, Y. Pan, J. Zhao, L. Zhang, Skin disease diagnosis with deep learning: a review, Neurocomputing 464 (2021) 364–393.
[7] C. H. Dagli, Artificial neural networks for intelligent manufacturing, Springer Science & Business Media, 2012.
[8] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] A. Graves, A.-r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
[10] K. K. Babu, S. R. Dubey, Pcsgan: Perceptual cyclic-synthesized generative adversarial networks for thermal and nir to visible image transformation, Neurocomputing (2020).
[11] J. Liu, Y. Liu, Q. Zhang, A weight initialization method based on neural network with asymmetric activation function, Neurocomputing (2022).
[12] Y. Srivastava, V. Murali, S. R. Dubey, A performance evaluation of loss functions for deep face recognition, in: National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics, Springer, 2019, pp. 322–332.
[13] S. S. Basha, S. R. Dubey, V. Pulabaigari, S. Mukherjee, Impact of fully connected layers on performance of convolutional neural networks for image classification, Neurocomputing 378 (2020) 112–119.
[14] Q. Xu, M. Zhang, Z. Gu, G. Pan, Overfitting remedy by sparsifying regularization on fully-connected layers of cnns, Neurocomputing 328 (2019) 69–74.
[15] S. R. Dubey, S. Chakraborty, S. K. Roy, S. Mukherjee, S. K. Singh, B. B. Chaudhuri, diffgrad: An optimization method for convolutional neural networks, IEEE Transactions on Neural Networks and Learning Systems 31 (11) (2019) 4500–4511.
[16] W. Duch, N. Jankowski, Survey of neural transfer functions, Neural Computing Surveys 2 (1) (1999) 163–212.
[17] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: International Conference on Machine Learning, 2010, pp. 807–814.
[18] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[19] A. N. S. Njikam, H. Zhao, A novel activation function for multilayer feed-forward neural networks, Applied Intelligence 45 (1) (2016) 75–82.
[20] B. Xu, R. Huang, M. Li, Revise saturated activation functions, International Conference on Learning Representations Workshop (2016).
[21] S. Kong, M. Takatsuka, Hexpo: A vanishing-proof activation function, in: International Joint Conference on Neural Networks, 2017, pp. 2562–2567.
[22] Y. Qin, X. Wang, J. Zou, The optimized deep belief networks with improved logistic sigmoid units and their application in fault diagnosis for planetary gearboxes of wind turbines, IEEE Transactions on Industrial Electronics 66 (5) (2018) 3814–3824.
[23] S. Elfwing, E. Uchibe, K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural Networks 107 (2018) 3–11.
[24] S. K. Roy, S. Manna, S. R. Dubey, B. B. Chaudhuri, Lisht: Non-parametric linearly scaled hyperbolic tangent activation function for neural networks, arXiv preprint arXiv:1901.05894 (2019).
[25] A. Farzad, H. Mashayekhi, H. Hassanpour, A comparative performance analysis of different activation functions in lstm networks for classification, Neural Computing and Applications 31 (7) (2019) 2507–2521.
[26] Y. Zhou, D. Li, S. Huo, S.-Y. Kung, Soft-root-sign activation function, arXiv preprint arXiv:2003.00547 (2020).
[27] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus), in: International Conference on Learning Representations, 2016.
[28] F. Agostinelli, M. Hoffman, P. Sadowski, P. Baldi, Learning activation functions to improve deep neural networks, International Conference on Learning Representations Workshops (2015).
[29] P. Ramachandran, B. Zoph, Q. V. Le, Searching for activation functions, International Conference on Learning Representations Workshops (2018).
[30] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[31] P. Chandra, Y. Singh, An activation function adapting training algorithm for sigmoidal feedforward networks, Neurocomputing 61 (2004) 429–437.
[32] S. S. Sodhi, P. Chandra, Bi-modal derivative activation function for sigmoidal feedforward networks, Neurocomputing 143 (2014) 182–196.
[33] S. Eger, P. Youssef, I. Gurevych, Is it time to swish? comparing deep learning activation functions across nlp tasks, arXiv preprint arXiv:1901.02671 (2019).
[34] A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: International Conference on Machine Learning, Vol. 30, 2013, p. 3.
[35] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[36] W. Shang, K. Sohn, D. Almeida, H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, in: International Conference on Machine Learning, 2016, pp. 2217–2225.
[37] S. S. Liew, M. Khalil-Hani, R. Bakhteri, Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems, Neurocomputing 216 (2016) 718–734.
[38] R. Duggal, A. Gupta, P-telu: Parametric tan hyperbolic linear unit activation for deep neural networks, in: IEEE International Conference on Computer Vision Workshops, 2017, pp. 974–978.
[39] S. Qiu, X. Xu, B. Cai, Frelu: Flexible rectified linear units for improving convolutional neural networks, in: International Conference on Pattern Recognition, 2018, pp. 1223–1228.
[40] X. Jiang, Y. Pang, X. Li, J. Pan, Y. Xie, Deep neural networks with elastic rectified linear units for object recognition, Neurocomputing 275 (2018) 1132–1139.
[41] J. Cao, Y. Pang, X. Li, J. Liang, Randomly translational activation inspired by the input distributions of relu, Neurocomputing 275 (2018) 859–868.
[42] F. Godin, J. Degrave, J. Dambre, W. De Neve, Dual rectified linear units (drelus): A replacement for tanh activation functions in quasi-recurrent neural networks, Pattern Recognition Letters 116 (2018) 8–14.
[43] Z. Tang, L. Luo, H. Peng, S. Li, A joint residual network with paired relus activation for image super-resolution, Neurocomputing 273 (2018) 37–46.
[44] S. R. Dubey, S. Chakraborty, Average biased relu based cnn descriptor for improved face retrieval, arXiv preprint arXiv:1804.02051 (2018).
[45] Y. Liu, J. Zhang, C. Gao, J. Qu, L. Ji, Natural-logarithm-rectified activation function in convolutional neural networks, in: International Conference on Computer and Communications, 2019, pp. 2000–2008.
[46] S. Gu, W. Li, L. V. Gool, R. Timofte, Fast image restoration with multi-bin trainable linear units, in: IEEE International Conference on Computer Vision, 2019, pp. 4190–4199.
[47] M. Basirat, P. Roth, L* relu: Piece-wise linear activation functions for deep fine-grained visual categorization, in: IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1218–1227.
[48] C. Gulcehre, M. Moczulski, M. Denil, Y. Bengio, Noisy activation functions, in: International Conference on Machine Learning, 2016, pp. 3059–3068.
[49] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio, Maxout networks, arXiv preprint arXiv:1302.4389 (2013).
[50] B. Xu, N. Wang, T. Chen, M. Li, Empirical evaluation of rectified activations in convolutional network, arXiv preprint arXiv:1505.00853 (2015).
[51] H. Li, W. Ouyang, X. Wang, Multi-bias non-linear activation in deep neural networks, in: International Conference on Machine Learning, 2016, pp. 221–229.
[52] G. Klambauer, T. Unterthiner, A. Mayr, S. Hochreiter, Self-normalizing neural networks, in: Advances in Neural Information Processing Systems, 2017, pp. 971–980.
[53] J. T. Barron, Continuously differentiable exponential linear units, arXiv (2017) arXiv–1704.
[54] L. Trottier, P. Gigu, B. Chaib-draa, et al., Parametric exponential linear unit for deep convolutional neural networks, in: IEEE International Conference on Machine Learning and Applications, 2017, pp. 207–214.
[55] Y. Li, C. Fan, Y. Li, Q. Wu, Y. Ming, Improving deep neural network with multiple parametric exponential linear units, Neurocomputing 301 (2018) 11–24.
[56] Z. Qiumei, T. Dan, W. Fenghua, Improved convolutional neural network based on fast exponentially linear unit activation function, IEEE Access 7 (2019) 151359–151367.
[57] Y. Ying, J. Su, P. Shan, L. Miao, X. Wang, S. Peng, Rectified exponential units for convolutional neural networks, IEEE Access 7 (2019) 101633–101640.
[58] D. Kim, J. Kim, J. Kim, Elastic exponential linear units for convolutional neural networks, Neurocomputing 406 (2020) 253–266.
[59] Q. Cheng, H. Li, Q. Wu, L. Ma, N. N. King, Parametric deformable exponential linear units for deep neural networks, Neural Networks 125 (2020) 281–289.
[60] J. Si, S. L. Harris, E. Yfantis, A dynamic relu on neural network, in: IEEE Dallas Circuits and Systems Conference, 2018, pp. 1–6.
[61] H. Hu, Vrelu activation functions for artificial neural networks, in: International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, 2018, pp. 856–860.
[62] G. Lin, W. Shen, Research on convolutional neural network based on improved relu piecewise activation function, Procedia Computer Science 131 (2018) 977–984.
[63] D. Macêdo, C. Zanchettin, A. L. Oliveira, T. Ludermir, Enhancing batch normalized convolutional networks using displaced rectifier linear units: A systematic comparative study, Expert Systems with Applications 124 (2019) 271–281.
[64] L. B. Godfrey, An evaluation of parametric activation functions for deep learning, in: IEEE International Conference on Systems, Man and Cybernetics, 2019, pp. 3006–3011.
[65] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, S. Yan, Deep learning with s-shaped rectified linear activation units, in: AAAI Conference on Artificial Intelligence, 2016.
[66] V. S. Bawa, V. Kumar, Linearized sigmoidal activation: A novel activation function with tractable non-linear characteristics to boost representation capability, Expert Systems with Applications 120 (2019) 346–356.
[67] X. Wang, Y. Qin, Y. Wang, S. Xiang, H. Chen, Reltanh: An activation function with vanishing gradient resistance for sae-based dnns and its application to rotating machinery fault diagnosis, Neurocomputing 363 (2019) 88–98.
[68] X. Hu, P. Niu, J. Wang, X. Zhang, A dynamic rectified linear activation units, IEEE Access 7 (2019) 180409–180416.
[69] A. Nicolae, Plu: The piecewise linear unit activation function, arXiv preprint arXiv:1809.09534 (2018).
[70] L. B. Godfrey, M. S. Gashler, A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks, in: International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Vol. 1, 2015, pp. 481–486.
[71] B. Grelsson, M. Felsberg, Improved learning in convolutional neural net-
[90] A. D. Jagtap, K. Kawaguchi, G. E. Karniadakis, Adaptive activation functions accelerate convergence in deep and physics-informed neural networks, Journal of Computational Physics 404 (2020) 109136.
[91] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, R. Garcia, Incorporating second-order functional knowledge for better option pricing, in: Advances in Neural Information Processing Systems, 2001, pp. 472–478.
[92] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[93] H. Zheng, Z. Yang, W. Liu, J. Liang, Y. Li, Improving deep neural networks using softplus units, in: International Joint Conference on Neural Networks, 2015, pp. 1–4.
[94] Q. Liu, S. Furber, Noisy softplus: a biology inspired activation function, in: International Conference on Neural Information Processing, 2016, pp. 405–412.
[95] H. Zhao, F. Liu, L. Li, C. Luo, A novel softplus linear unit for deep convolutional neural networks, Applied Intelligence 48 (7) (2018) 1707–1720.
[96] C. Xu, J. Huang, S.-p. Wang, A.-q. Hu, A novel parameterized activation function in visual geometry group, in: International Conference on Data
works with shifted exponential linear units (shelus), in: International Science and Business Analytics, 2018, pp. 386–389.
Conference on Pattern Recognition, 2018, pp. 517–522. [97] K. Sun, J. Yu, L. Zhang, Z. Dong, A convolutional neural network model
[72] Y. Yu, K. Adu, N. Tashi, P. Anokye, X. Wang, M. A. Ayidzoe, Rmaf: based on improved softplus activation function, in: International Confer-
Relu-memristor-like activation function for deep learning, IEEE Access ence on Applications and Techniques in Cyber Security and Intelligence,
8 (2020) 72727–72741. 2019, pp. 1326–1335.
[73] M. Basirat, P. M. Roth, The quest for the golden activation function, [98] Y. Chen, Y. Mai, J. Xiao, L. Zhang, Improving the antinoise ability of
arXiv preprint arXiv:1808.00783 (2018). dnns via a bio-inspired noise adaptive activation function rand softplus,
[74] S. Scardapane, M. Scarpiniti, D. Comminiello, A. Uncini, Learning ac- Neural Computation 31 (6) (2019) 1215–1233.
tivation functions from data using cubic spline interpolation, in: Italian [99] D. Misra, Mish: A self regularized non-monotonic neural activation
Workshop on Neural Nets, 2017, pp. 73–83. function, arXiv preprint arXiv:1908.08681 (2019).
[75] A. Mishra, P. Chandra, U. Ghose, S. S. Sodhi, Bi-modal derivative adap- [100] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, Yolov4: Optimal speed and
tive activation function sigmoidal feedforward artificial neural networks, accuracy of object detection, arXiv preprint arXiv:2004.10934 (2020).
Applied Soft Computing 61 (2017) 983–994. [101] D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), arXiv
[76] S. Qian, H. Liu, C. Liu, S. Wu, H. San Wong, Adaptive activation preprint arXiv:1606.08415 (2016).
functions in convolutional neural networks, Neurocomputing 272 (2018) [102] C. Yu, Z. Su, Symmetrical gaussian error linear units (sgelus), arXiv
204–212. preprint arXiv:1911.03925 (2019).
[77] E. Alcaide, E-swish: Adjusting activations to different network depths, [103] Q. Su, L. Carin, et al., A probabilistic framework for nonlinearities in
arXiv preprint arXiv:1801.07145 (2018). stochastic neural networks, in: Advances in Neural Information Pro-
[78] Ö. F. Ertuğrul, A novel type of activation function in artificial neural cessing Systems, 2017, pp. 4486–4495.
networks: Trained activation function, Neural Networks 99 (2018) 148– [104] J. Lee, K. Shridhar, H. Hayashi, B. K. Iwana, S. Kang, S. Uchida,
157. Probact: A probabilistic activation function for deep neural networks,
[79] M. Goyal, R. Goyal, B. Lall, Learning activation functions: A arXiv preprint arXiv:1905.10761 (2019).
new paradigm of understanding neural networks, arXiv preprint [105] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. H. Saltz, Convnets with
arXiv:1906.09529 (2019). smooth adaptive activation functions for regression, Proceedings of Ma-
[80] G. Maguolo, L. Nanni, S. Ghidoni, Ensemble of convolutional neu- chine Learning Research 54 (2017) 430.
ral networks trained with different activation functions, arXiv preprint [106] Y. Berradi, Symmetric power activation functions for deep neural net-
arXiv:1905.02473 (2019). works, in: International Conference on Learning and Optimization Al-
[81] H. H. Chieng, N. Wahid, P. Ong, S. R. K. Perla, Flatten-t swish: a gorithms: Theory and Applications, 2018, pp. 1–6.
thresholded relu-swish-like activation function for deep learning, arXiv [107] E. López-Rubio, F. Ortega-Zamorano, E. Domı́nguez, J. Muñoz-Pérez,
preprint arXiv:1812.06247 (2018). Piecewise polynomial activation functions for feedforward neural net-
[82] N. Patwardhan, M. Ingalhalikar, R. Walambe, Aria: Utilizing richard’s works, Neural Processing Letters 50 (1) (2019) 121–147.
curve for controlling the non-monotonicity of the activation function in [108] F. Farhadi, V. P. Nia, A. Lodi, Activation adaptation in neural networks,
deep neural nets, arXiv preprint arXiv:1805.08878 (2018). arXiv preprint arXiv:1901.09849 (2019).
[83] M. Dushkoff, R. Ptucha, Adaptive activation functions for deep net- [109] B. Li, S. Tang, H. Yu, Powernet: Efficient representations of polynomi-
works, Electronic Imaging 2016 (19) (2016) 1–5. als and smooth functions by deep neural networks with rectified power
[84] F. Manessi, A. Rozza, Learning combinations of activation functions, in: units, arXiv preprint arXiv:1909.05136 (2019).
IEEE International Conference on Pattern Recognition, 2018, pp. 61–66. [110] M. Telgarsky, Neural networks and rational functions, in: International
[85] L. R. Sütfeld, F. Brieger, H. Finger, S. Füllhase, G. Pipa, Adaptive blend- Conference on Machine Learning, 2017, pp. 3387–3393.
ing units: Trainable activation functions for deep neural networks, arXiv [111] A. Molina, P. Schramowski, K. Kersting, Padé activation units: End-to-
preprint arXiv:1806.10064 (2018). end learning of flexible activation functions in deep networks, Interna-
[86] M. Wang, B. Liu, H. Foroosh, Look-up table unit activation function tional Conference on Learning Representations (2020).
for deep convolutional neural networks, in: IEEE Winter Conference on [112] A. T. Nicolas Boullé, Yuji Nakatsukasa, Rational neural networks, arXiv
Applications of Computer Vision, 2018, pp. 1225–1233. preprint arXiv:2004.01902 (2020).
[87] D. Klabjan, M. Harmon, Activation ensembles for deep neural networks, [113] A. Apicella, F. Isgrò, R. Prevete, A simple and efficient architecture for
in: IEEE International Conference on Big Data, 2019, pp. 206–214. trainable activation functions, Neurocomputing 370 (2019) 1–15.
[88] C. Eisenach, Z. Wang, H. Liu, Nonparametrically learning activation [114] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, Z. Liu, Dynamic relu, arXiv
functions in deep neural nets, in: International Conference on Learning preprint arXiv:2003.10027 (2020).
Representations Workshops, 2017. [115] M. Wang, B. Liu, H. Foroosh, Wide hidden expansion layer for deep
[89] C. J. Vercellino, W. Y. Wang, Hyperactivations for activation function convolutional neural networks, in: IEEE Winter Conference on Appli-
exploration, in: Conference on Neural Information Processing Systems cations of Computer Vision, 2020, pp. 934–942.
[116] A. Asif, et al., Learning neural activations, arXiv preprint [140] M. M. Lau, K. H. Lim, Review of adaptive activation function in deep
arXiv:1912.12187 (2019). neural network, in: IEEE-EMBS Conference on Biomedical Engineer-
[117] S. Scardapane, S. Van Vaerenbergh, S. Totaro, A. Uncini, Kafnets: ing and Sciences, 2018, pp. 686–690.
Kernel-based non-parametric activation functions for neural networks, [141] A. K. Dubey, V. Jain, Comparative study of convolution neural network’s
Neural Networks 110 (2019) 19–32. relu and leaky-relu activation functions, in: Applications of Computing,
[118] S. Scardapane, E. Nieddu, D. Firmani, P. Merialdo, Multikernel activa- Automation and Wireless Systems in Electrical Engineering, Springer,
tion functions: formulation and a case study, in: INNS Big Data and 2019, pp. 873–880.
Deep Learning conference, 2019, pp. 320–329. [142] C. Banerjee, T. Mukherjee, E. Pasiliao Jr, An empirical study on general-
[119] S. Scardapane, S. Van Vaerenbergh, A. Hussain, A. Uncini, Complex- izations of the relu activation function, in: ACM Southeast Conference,
valued neural networks with nonparametric activation functions, IEEE 2019, pp. 164–167.
Transactions on Emerging Topics in Computational Intelligence (2018). [143] T. Villmann, J. Ravichandran, A. Villmann, D. Nebel, M. Kaden, Ac-
[120] S. Scardapane, S. Van Vaerenbergh, D. Comminiello, A. Uncini, Widely tivation functions for generalized learning vector quantization-a perfor-
linear kernels for complex-valued kernel activation functions, in: IEEE mance comparison, arXiv preprint arXiv:1901.05995 (2019).
International Conference on Acoustics, Speech and Signal Processing, [144] G. Castaneda, P. Morris, T. M. Khoshgoftaar, Evaluation of maxout acti-
2019, pp. 8528–8532. vations in deep learning across several big data domains, Journal of Big
[121] M. Kobayashi, Singularities of three-layered complex-valued neural net- Data 6 (1) (2019) 72.
works with split activation function, IEEE Transactions on Neural Net- [145] Y. Wang, Y. Li, Y. Song, X. Rong, The influence of the activation func-
works and Learning Systems 29 (5) (2017) 1900–1907. tion in a convolution neural network model of facial expression recogni-
[122] J. Pennington, S. Schoenholz, S. Ganguli, Resurrecting the sigmoid in tion, Applied Sciences 10 (5) (2020) 1897.
deep learning through dynamical isometry: theory and practice, in: Ad- [146] A. Apicella, F. Donnarumma, F. Isgrò, R. Prevete, A survey on modern
vances in Neural Information Processing Systems, 2017, pp. 4785–4795. trainable activation functions, arXiv preprint arXiv:2005.00817 (2020).
[123] E. Sansone, F. G. De Natale, Training feedforward neural networks with [147] T. Szandała, Review and comparison of commonly used activation func-
standard logistic activations is feasible, arXiv preprint arXiv:1710.01013 tions for deep neural networks, in: Bio-inspired Neurocomputing, 2020,
(2017). pp. 203–224.
[124] L. Lu, Y. Shin, Y. Su, G. E. Karniadakis, Dying relu and initializa- [148] A. Krizhevsky, Learning multiple layers of features from tiny images,
tion: Theory and numerical examples, arXiv preprint arXiv:1903.06733 Tech Report, Univ. of Toronto (2009).
(2019). [149] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
[125] D. Arpit, Y. Bengio, The benefits of over-parameterization at initializa- M. Andreetto, H. Adam, Mobilenets: Efficient convolutional neural net-
tion in deep relu networks, arXiv preprint arXiv:1901.03611 (2019). works for mobile vision applications, arXiv preprint arXiv:1704.04861
[126] D. Aguirre, O. Fuentes, Improving weight initialization of relu and out- (2017).
put layers, in: International Conference on Artificial Neural Networks, [150] K. Simonyan, A. Zisserman, Very deep convolutional networks for
2019, pp. 170–184. large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[127] R. Burkholz, A. Dubatovka, Initialization of relus for dynamical isome- [151] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
try, in: Advances in Neural Information Processing Systems, 2019, pp. V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE
2382–2392. Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[128] D. Yarotsky, Error bounds for approximations with deep relu networks, [152] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recog-
Neural Networks 94 (2017) 103–114. nition, in: IEEE Conference on Computer Vision and Pattern Recogni-
[129] R. Arora, A. Basu, P. Mianjy, A. Mukherjee, Understanding deep neural tion, 2016, pp. 770–778.
networks with rectified linear units, arXiv preprint arXiv:1611.01491 [153] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE Con-
(2016). ference on Computer Vision and Pattern Recognition, 2018, pp. 7132–
[130] M. Hein, M. Andriushchenko, J. Bitterwolf, Why relu networks yield 7141.
high-confidence predictions far away from the training data and how to [154] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely con-
mitigate the problem, in: IEEE Conference on Computer Vision and nected convolutional networks, in: Proceedings of the IEEE conference
Pattern Recognition, 2019, pp. 41–50. on computer vision and pattern recognition, 2017, pp. 4700–4708.
[131] S. Goel, S. Karmalkar, A. Klivans, Time/accuracy tradeoffs for learn- [155] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for au-
ing a relu with respect to gaussian marginals, in: Advances in Neural tomatic evaluation of machine translation, in: Proceedings of the 40th
Information Processing Systems, 2019, pp. 8582–8591. annual meeting of the Association for Computational Linguistics, 2002,
[132] S. Dittmer, J. Emily, P. Maass, Singular values for relu layers, IEEE pp. 311–318.
Transactions on Neural Networks and Learning Systems (2019).
[133] A. Kristiadi, M. Hein, P. Hennig, Being bayesian, even just a bit,
fixes overconfidence in relu networks, arXiv preprint arXiv:2002.10118
(2020).
[134] B. Karlik, A. V. Olgac, Performance analysis of various activation func-
tions in generalized mlp architectures of neural networks, International
Journal of Artificial Intelligence and Expert Systems 1 (4) (2011) 111–
122.
[135] G. Alcantara, Empirical analysis of non-linear activation functions
for deep neural networks in classification tasks, arXiv preprint
arXiv:1710.11272 (2017).
[136] H. K. Vydana, A. K. Vuppala, Investigative study of various activation
functions for speech recognition, in: National Conference on Commu-
nications, 2017, pp. 1–5.
[137] D. Pedamonti, Comparison of non-linear activation functions for
deep neural networks on mnist classification task, arXiv preprint
arXiv:1804.02763 (2018).
[138] C. Nwankpa, W. Ijomah, A. Gachagan, S. Marshall, Activation func-
tions: Comparison of trends in practice and research for deep learning,
arXiv preprint arXiv:1811.03378 (2018).
[139] K. Eckle, J. Schmidt-Hieber, A comparison of deep networks with relu
activation function and linear spline-type methods, Neural Networks 110
(2019) 232–242.