Abstract
Neural networks have shown tremendous growth in recent years to solve numerous problems. Various types of neural networks
have been introduced to deal with different types of problems. However, the main goal of any neural network is to transform
the non-linearly separable input data into more linearly separable abstract features using a hierarchy of layers. These layers are
combinations of linear and nonlinear functions. The most popular and common non-linearity layers are activation functions (AFs),
such as Logistic Sigmoid, Tanh, ReLU, ELU, Swish and Mish. In this paper, a comprehensive overview and survey is presented
for AFs in neural networks for deep learning. Different classes of AFs such as Logistic Sigmoid and Tanh based, ReLU based,
ELU based, and Learning based are covered. Several characteristics of AFs such as output range, monotonicity, and smoothness are
also pointed out. A performance comparison is also performed among 18 state-of-the-art AFs with different networks on different
types of data. The insights of the AFs are presented to help researchers conduct further research and practitioners to select among the different choices. The code used for the experimental comparison is released at: https://github.com/shivram1987/ActivationFunctions.
1. Introduction

In recent years, deep learning has shown tremendous growth in solving challenging problems such as object detection [1], semantic segmentation [2], person re-identification [3], image retrieval [4], anomaly detection [5], skin disease diagnosis [6], and many more. Various types of neural networks have been defined in deep learning to learn abstract features from data, such as the Multilayer Perceptron (MLP) [7], Convolutional Neural Networks (CNN) [8], Recurrent Neural Networks (RNN) [9], and Generative Adversarial Networks (GAN) [10]. The important aspects of neural networks include weight initialization [11], loss functions [12], different layers [13], overfitting [14], and optimization [15].

The activation functions (AFs) play a very crucial role in neural networks [16] by learning the abstract features through non-linear transformations. The desirable properties of an AF are as follows: a) it should add non-linear curvature to the optimization landscape to improve the training convergence of the network; b) it should not increase the computational complexity of the model extensively; c) it should not hamper the gradient flow during training; d) it should retain the distribution of data to facilitate the better training of the network. Several AFs have been explored in recent years for deep learning to achieve the above mentioned properties. This survey is dedicated to the developments in the area of AFs in neural networks. The insights of the different AFs are presented along with the reasoning to benefit the deep learning community. The major contributions of this survey are as follows:

1. This survey provides a detailed classification for a wide range of AFs. It also covers the AFs very comprehensively, including Logistic Sigmoid/Tanh, Rectified Unit, Exponential Unit, and Adaptive AFs.

2. This survey enriches the reader with the state-of-the-art AFs with analysis from various perspectives. It specifically covers the progress in AFs for deep learning.

3. This survey also summarizes the AFs with brief highlights and important discussions to depict their suitability for different types of data (refer to Table 6).

4. This survey is compared with the existing surveys and performance analyses to show its importance (refer to Table 7).

5. This paper also presents performance comparisons on 4 benchmark datasets of different modalities using 18 state-of-the-art AFs with different types of networks (refer to Tables 8, 9 and 11).

The evolution of AFs is illustrated in Section 2. The progress in Logistic Sigmoid and Tanh, rectified, exponential, adaptive and miscellaneous AFs is summarized in Sections 3, 4, 5, 6, and 7, respectively. Some aspects of AFs are discussed in Section 8. A comprehensive performance analysis is conducted in Section 9. A summary with conclusions and recommendations is provided in Section 10.
problem is minimized by the Hexpo function [21], which is similar to Tanh with a scaled gradient. It is given as,

Hexpo(x) = { −a × (e^(−x/b) − 1), x ≥ 0 ; c × (e^(x/d) − 1), x < 0 }    (8)

in the output range of [−c, a]. The output of the sigmoid function is multiplied with its input in the sigmoid-weighted linear unit (SiLU) AF [23] as

SiLU(x) = x × Sigmoid(x)    (9)

in the output range of (−0.5, ∞). At the same time, an improved logistic Sigmoid (ISigmoid) AF [22] is proposed to solve the vanishing gradient problem of Sigmoid with the help of a piecewise combination of sigmoidal and linear functions. It is defined as,

ISigmoid(x) = { α × (x − a) + Sigmoid(a), x ≥ a ; Sigmoid(x), −a < x < a ; α × (x + a) + Sigmoid(−a), x ≤ −a }    (10)

in the output range of (−∞, ∞). The Linearly scaled hyperbolic tangent (LiSHT) AF scales the Tanh in a linear fashion to overcome the vanishing gradient issue [24]. The LiSHT can be defined as,

LiSHT(x) = x × Tanh(x)    (11)

in the output range of [0, ∞). The LiSHT function is symmetric, but it has the shortcoming of producing unbounded and non-negative outputs only. The Elliott AF [25] is similar to the Sigmoid function in terms of its characteristic curve and is defined as,

Elliott(x) = (0.5 × x) / (1 + |x|) + 0.5    (12)

in the output range of [0, 1]. The Soft-Root-Sign (SRS) AF [26] is defined as,

SRS(x) = x / (x/α + e^(−x/β))    (13)

in the output range of [(α × β)/(β − α × e), α], where α and β are the learnable parameters. The use of additional parameters increases the complexity of the SRS function. Most of the variants of Sigmoid/Tanh AFs have tried to overcome the vanishing gradient issue. However, this issue is still present in most of these AFs.
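The closed-form AFs above are straightforward to implement as drop-in modules. The following is a minimal PyTorch sketch of the SiLU of Eq. (9), the LiSHT of Eq. (11) and the Elliott function of Eq. (12); it is written against the public PyTorch API used later in Section 9, it is not code from the cited papers, and the small test at the bottom is only illustrative (recent PyTorch versions also ship a built-in nn.SiLU).

import torch
import torch.nn as nn

class SiLU(nn.Module):
    # Sigmoid-weighted Linear Unit, Eq. (9): x * sigmoid(x)
    def forward(self, x):
        return x * torch.sigmoid(x)

class LiSHT(nn.Module):
    # Linearly scaled hyperbolic tangent, Eq. (11): x * tanh(x)
    def forward(self, x):
        return x * torch.tanh(x)

class Elliott(nn.Module):
    # Elliott function, Eq. (12): 0.5 * x / (1 + |x|) + 0.5
    def forward(self, x):
        return 0.5 * x / (1.0 + torch.abs(x)) + 0.5

if __name__ == "__main__":
    x = torch.linspace(-5.0, 5.0, steps=11)
    for act in (SiLU(), LiSHT(), Elliott()):
        print(act.__class__.__name__, act(x))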
4. Rectified Activation Functions

A summary of rectified AFs is illustrated in Table 3. The Rectified Linear Unit (ReLU) is a simple function which is the identity function for positive input and zero for negative input, given as,

ReLU(x) = max(0, x) = { x, if x ≥ 0 ; 0, otherwise }.    (14)

Hence, the range of ReLU is [0, ∞). The gradient for positive and negative inputs is one and zero, respectively. The ReLU function solves the problem of the computational complexity of the Logistic Sigmoid and Tanh functions. The downside of ReLU is the vanishing gradient problem for the negative inputs. In spite of having the vanishing gradient problem, the ReLU AF has been used very extensively with deep learning models. The advancements in ReLU based AFs are discussed in the rest of this section.

4.1. On the Non-utilization of Negative Values of ReLU

Vanishing gradient is the main problem with the ReLU AF, which is caused by the non-utilization of negative values. The Leaky Rectified Linear Unit (LReLU) is an extension of ReLU that utilizes the negative values [34]. The LReLU is defined as,

LReLU(x) = { x, x ≥ 0 ; 0.01 × x, x < 0 }    (15)

in the output range of (−∞, ∞). The LReLU has been used in many applications with promising performance. One major problem associated with LReLU is finding the right slope of the linear function for negative inputs. Different slopes might be suited to different problems and different networks. Thus, it is extended to the Parametric ReLU (PReLU) by considering the slope for negative input as a trainable parameter [35]. The PReLU is given as,

PReLU(x) = { x, x ≥ 0 ; p × x, x < 0 }    (16)

in the output range of (−∞, ∞), where p is the trainable parameter. However, it can lead to overfitting easily, which is the downside of PReLU. The Maxout layer, which computes the maximum of several linear units, is also used as an AF [49]. Both ReLU and Leaky ReLU can be seen as special cases of Maxout. The randomized ReLU (RReLU) considers the slope of LReLU randomly during training, sampled from a uniform distribution U(l, u) [50]. The RReLU is defined as,

RReLU(x) = { x, x ≥ 0 ; R × x, x < 0 }    (17)

in the output range of (−∞, ∞), where R ∼ U(l, u), l < u and l, u ∈ [0, 1). It uses a deterministic value x/((l + u)/2) during test time.

The ReLU is not able to utilize the potentially useful information from the negative values. In most of the networks, the feature map given as the input to the AF is dense near zero. Thus, a small jitter in the rectification point can lead to difficulty in training. Concatenated ReLU (CReLU) [36] concatenates the ReLU's output over the original input and the negated input. The CReLU can be given as,

CReLU(x) = [ReLU(x), ReLU(−x)]    (18)

in the output range of [0, ∞). The CReLU is derived from the fact that the lower layer kernels in CNN models form pairs with opposite phases. The shifting of the feature map with multiple biases is also performed before the ReLU layer [51]. However, it increases the model complexity as more ReLUs are required. A Parametric Tan Hyperbolic Linear Unit (P-TELU) is also used as an AF [38]. The P-TELU is defined as,

PTELU(x) = { x, x ≥ 0 ; α × Tanh(β × x), x < 0 }    (19)

in the output range of [−α, ∞), where {α, β} ≥ 0 are the learnable parameters.

The Flexible ReLU (FReLU) [39] captures the negative values with a rectification point which is considered as trainable in the Shifted ReLU [39]. The FReLU is given as,

FReLU(x) = ReLU(x) + b    (20)

in the output range of [b, ∞). A similar arrangement is also followed by the Random Translation ReLU (RTReLU) [41] by utilizing an offset, sampled from a Gaussian distribution, given as,

RTReLU(x) = { x + a, x + a > 0 ; 0, x + a ≤ 0 }    (21)

in the output range of [0, ∞), where a is a random number. At test time, the offset is set to zero. A data dependent Average Biased ReLU (ABReLU) [44] is also investigated to tackle the negative values by a horizontal shifting based on the average of features. The ABReLU can be written as,

ABReLU(x) = { x − β, x − β ≥ 0 ; 0, x − β < 0 }    (22)

having the output range in [0, ∞), where β is computed as the average of the input activation map to the activation function. A batch dependent threshold for the ReLU is used by the Dynamic ReLU (D-ReLU) [60]. The Dual ReLU (DualReLU) [42] is a two dimensional AF for recurrent neural networks. The DualReLU is given as,

DualReLU(a, b) = max(0, a) − max(0, b)    (23)

in the output range of (−∞, ∞), where a and b are the inputs in different dimensions. Similar to the CReLU, the PairedReLU AF is used for image super-resolution [43]. The PairedReLU is given as,

PairedReLU(x) = [max(s × x − θ, 0), max(s_p × x − θ_p, 0)]    (24)

in the output range of (−∞, ∞). However, the computational complexity of PairedReLU is increased as compared to CReLU. In another attempt, the V-shaped ReLU (vReLU) AF [61] is defined as,

vReLU(x) = { x, x ≥ 0 ; −x, x < 0 }    (25)

having the output range in [0, ∞). The vReLU activation function suffers from a non-symmetric output. The SignReLU AF utilizes the negative values using the Softsign function [62]. The positive part of SignReLU is the same as the ReLU.

A Displaced ReLU (DisReLU) [63] is designed as a generalization of the Shifted ReLU [39]. The DisReLU displaces the rectification point to consider the negative values, given as,

DisReLU(x) = { x, x ≥ −δ ; −δ, x < −δ }    (26)

having the output range in [−δ, ∞). A Bendable Linear Unit (BLU) AF is investigated as,

BLU(x) = β × (√(x² + 1) − 1) + x    (27)

where −1 ≤ β ≤ 1 is a learnable parameter to adapt the shape between the identity function and a rectifier function [64]. A Lipschitz ReLU (L-ReLU) AF uses piecewise linear functions to model the degree of presence and the degree of absence of features [47]. The L-ReLU is defined as,

L-ReLU(x) = { max(φ(x), 0), x ≥ 0 ; min(η(x), 0), x < 0 }    (28)

where φ and η are non-linear functions. Moreover, the range of L-ReLU also depends upon the values of the φ and η functions.
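As an illustration of the ReLU variants in this subsection, the sketch below implements the CReLU of Eq. (18) and the FReLU of Eq. (20) as PyTorch modules. The per-channel parameterization of the FReLU bias is an assumption made for this example (the survey only specifies a trainable b), and the shapes printed at the bottom are for a dummy feature map.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CReLU(nn.Module):
    # Concatenated ReLU, Eq. (18): [ReLU(x), ReLU(-x)] along the channel axis,
    # so the number of output channels is twice the number of input channels.
    def __init__(self, dim=1):
        super().__init__()
        self.dim = dim

    def forward(self, x):
        return torch.cat([F.relu(x), F.relu(-x)], dim=self.dim)

class FReLU(nn.Module):
    # Flexible ReLU, Eq. (20): ReLU(x) + b with a trainable bias b
    # (parameterized per channel here, which is an assumption of this sketch).
    def __init__(self, num_channels):
        super().__init__()
        self.b = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        return F.relu(x) + self.b

if __name__ == "__main__":
    x = torch.randn(2, 8, 16, 16)            # dummy feature map: N x C x H x W
    print(CReLU()(x).shape)                   # torch.Size([2, 16, 16, 16])
    print(FReLU(num_channels=8)(x).shape)     # torch.Size([2, 8, 16, 16])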
4.2. On the Limited Non-linearity of ReLU

The S-shaped ReLU (SReLU) increases the non-linearity in ReLU by combining three linear functions with four learnable parameters [65]. On a similar line, the Multi-bin Trainable Linear Unit (MTLU) [46] considers multiple bins to increase the non-linear capacity. The MTLU can be written as,

MTLU(x) = { a_0 × x + b_0, x ≤ c_0 ; a_k × x + b_k, c_(k−1) < x ≤ c_k ; ... ; a_K × x + b_K, c_(K−1) < x }    (29)

having the output range in (−∞, ∞). The number of bins and the range of the bins are hyperparameters, whereas the linear function of a bin is trainable (i.e., a_0, ..., a_K, b_0, ..., b_K are the learnable parameters). The non-differentiable nature at multiple points is the drawback of the MTLU. An Elastic ReLU (EReLU) considers a slope randomly drawn from a uniform distribution during the training for the positive inputs to control the amount of non-linearity [40]. The EReLU is defined as,

EReLU(x) = max(R × x, 0)    (30)

in the output range of [0, ∞), where R is a random number. At test time, the EReLU becomes the identity function for positive inputs. The Linearized Sigmoidal Activation (LiSHA) function considers three linear functions to increase the non-linearity characteristics [66]. It is also extended to an adaptive linear sigmoidal AF by learning the slopes of the upper and lower linear functions. The ReLU is combined with Tanh as Rectified Linear Tanh (ReLTanh) [67] to increase the non-linearity of ReLU and to overcome the vanishing gradient problem of Tanh. However, the ReLTanh is unbounded in both the positive and negative directions. The Natural-Logarithm ReLU (NLReLU) modifies the ReLU's output for positive inputs using the logarithm function to increase the degree of nonlinearity [45]. The NLReLU is defined as,

NLReLU(x) = ln(β × max(0, x) + 1.0)    (31)

having the output range in [0, ∞), where β is a constant. The NLReLU does not affect the negative regime and thus suffers from vanishing gradient. The concept of the Leaky ReLU (LReLU) is further improved to the Dynamic ReLU [68] by considering a mean square error (MSE) based additional hyperparameter. Thus, it can control the slope of the Dynamic ReLU in every epoch based on the convergence. A Piecewise Linear Unit (PLU) [69] is defined as,

PLU(x) = max(α × (x + c) − c, min(α × (x − c) + c, x))    (32)

having the output range in (−∞, +∞), where α and c are constants. Basically, the PLU activation function consists of three linear functions in pieces, but is continuous. Hence, it avoids saturation and leads to a good amount of gradient flow through the activation function during backpropagation in order to resolve the vanishing gradient problems of ReLU and Tanh. However, the PLU activation is unbounded in both positive and negative directions.

4.3. On the Unbounded Output of ReLU

The unbounded outputs of ReLU and many of its variants may lead to training instability. Moreover, a bounded AF is needed for dedicated hardware based embedded system applications. ReLU is extended to the Bounded ReLU (BReLU) [37], defined as,

BReLU(x) = min(max(0, x), A)    (33)

having the output range in [0, A]. The training stability is improved in BReLU due to two rectifications (i.e., at 0 and A).

ReLU is a common choice in practice in deep learning. ReLU based AFs are generally efficient. The major drawbacks of ReLU, such as gradient diminishing for negative inputs, limited non-linearity and unboundedness, are improved in the different AFs. However, the ReLU variants are not able to resolve all the issues of ReLU.
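To make the bounded and piecewise-linear ideas above concrete, the following is a small PyTorch sketch of the BReLU of Eq. (33) and the PLU of Eq. (32). The default constants (A = 6, α = 0.1, c = 1) are illustrative choices for this example, not values prescribed by the cited papers.

import torch
import torch.nn as nn

class BReLU(nn.Module):
    # Bounded ReLU, Eq. (33): min(max(0, x), A)
    def __init__(self, A=6.0):
        super().__init__()
        self.A = A

    def forward(self, x):
        return torch.clamp(x, min=0.0, max=self.A)

class PLU(nn.Module):
    # Piecewise Linear Unit, Eq. (32): max(alpha*(x+c)-c, min(alpha*(x-c)+c, x));
    # for 0 < alpha < 1 this is the identity on [-c, c] and has slope alpha outside.
    def __init__(self, alpha=0.1, c=1.0):
        super().__init__()
        self.alpha, self.c = alpha, c

    def forward(self, x):
        a, c = self.alpha, self.c
        return torch.maximum(a * (x + c) - c, torch.minimum(a * (x - c) + c, x))

if __name__ == "__main__":
    x = torch.linspace(-3.0, 9.0, steps=7)
    print(BReLU(A=6.0)(x))   # outputs clipped to [0, 6]
    print(PLU()(x))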
5. Exponential Activation Functions

The exponential AFs tackle the gradient diminishing problem of ReLU. Table 4 lists the properties of the exponential AFs.

Table 4: Summary of Exponential Linear Unit based activation functions.

The Exponential Linear Unit (ELU) [27] is given as,

ELU(x) = { x, x > 0 ; α × (e^x − 1), x ≤ 0 }    (34)

having the output range in [−1, ∞), where α is a learnable parameter. The ELU function exhibits all the benefits of the ReLU function. The ELU is differentiable, saturates for large negative inputs and reduces the bias shift. The negative saturation regime of ELU adds some robustness to noise as compared to the Leaky ReLU and Parametric ReLU. The ELU is extended to the Scaled ELU (SELU) [52] by using a scaling hyperparameter to make the slope larger than one for positive inputs. The SELU can be defined as,

SELU(x) = λ × { x, x > 0 ; α × (e^x − 1), x ≤ 0 }    (35)

having the output range in [−λ, ∞), where α is a hyperparameter. Basically, the SELU induces self-normalization to automatically converge towards zero mean and unit variance. The Parametric ELU (PELU) [54] changes the saturation point and exponential decay and also regulates the slope of the linear function for the positive inputs for differentiability. The PELU AF can be written as,

PELU(x) = { (a/b) × x, x ≥ 0 ; a × (e^(x/b) − 1), x < 0 }    (36)

having [−a, ∞) output range, where a and b are the trainable parameters. The parametric ELU is also explored in the Continuously differentiable ELU (CELU) [53] for the negative inputs. The CELU is given as,

CELU(x) = { x, x ≥ 0 ; α × (e^(x/α) − 1), x < 0 }    (37)

having the output range in [−α, ∞), where α is a learnable parameter. The PELU is also extended to multiple PELU (MPELU) [55] by using two learnable parameters to represent MPELU as either rectified, exponential or combined. The MPELU can be expressed as,

MPELU(x) = { x, x > 0 ; α_c × (e^(β_c × x) − 1), x ≤ 0 }    (38)

having the output range in [−α_c, ∞), where α_c and β_c are the trainable parameters.

A soft exponential AF interpolates between the exponential, linear and logarithmic functions using a trainable parameter [70]. A Shifted ELU (ShELU) AF is also explored as a locally optimal function [71]. A Parametric Rectified Exponential Unit (PREU) [57] is designed as,

PREU(x) = { α × x, x > 0 ; α × x × e^(β × x), x ≤ 0 }    (39)

having the output range in [−1, ∞), where α and β are the trainable parameters. The PREU utilizes the negative information near to zero effectively. The efficiency of ELU is improved in the Fast ELU (FELU) AF [56] with the help of simple displacement bits and integer algebra operations. The FELU is defined as,

FELU(x) = { x, x > 0 ; α × (e^(x/ln(2)) − 1), x ≤ 0 }    (40)

having the output range in [−α, ∞), with α as a learnable parameter. Recently, the properties of ELU and ReLU have been utilized to design an Elastic ELU (EELU) AF [58]. The EELU is defined as,

EELU(x) = { k × x, x > 0 ; α × (e^(β × x) − 1), x ≤ 0 }    (41)

having the output range in [−α, ∞), where α and β are the trainable parameters. The EELU preserves a small non-zero gradient for the negative input and exhibits an elastic slope for the positive input. A Parametric Deformable ELU (PDELU) AF tries to shift the mean value of the output closer to zero using a flexible map shape [59]. The PDELU is defined as,

PDELU(x) = { x, x > 0 ; α × ([1 + (1 − t) × x]^(1/(1−t)) − 1), x ≤ 0 }    (42)

having the output range in [−1, ∞), where α is a learnable parameter. A ReLU-Memristor-like AF (RMAF) [72] uses two hyperparameters to have a ReLU-like shape for positive input and to give more importance to the negative values near to zero. An Exponential Linear Sigmoid SquasHing (ELiSH) is defined in [73] as,

ELiSH(x) = { x/(1 + e^(−x)), x ≥ 0 ; (e^x − 1)/(1 + e^(−x)), x < 0 }    (43)

Moreover, it is also extended to HardELiSH, which is a multiplication of HardSigmoid and Linear in the positive part and HardSigmoid and ELU in the negative part. Here, HardSigmoid is defined as,

HardSigmoid(x) = max(0, min(1, (x + 1)/2)).    (44)

The ELU based AFs exploit the negative inputs without compromising the non-linearity. Some ELU variants also modify the function for positive inputs to make it bounded.

6. Learning/Adaptive Activation Functions

Most of the aforementioned AFs are not adaptive and might not be able to adjust based on the dataset complexity. This problem is tackled using learning/adaptive AFs, as summarized in Table 5.

Table 5: Summary of adaptive and learning based activation functions.

Some of the earlier mentioned AFs are also adaptive, such as PReLU [35], SReLU [65], PTELU [38], MTLU [46], PELU [54], MPELU [55], PREU [57], EELU [58], PDELU [59], SRS [26], etc.

The Adaptive Piecewise Linear (APL) unit is defined as a sum of hinge-shape functions [28]. It is given as,

APL(x) = max(0, x) + Σ_{s=1}^{S} a_s × max(0, b_s − x),    (45)

where a and b are the trainable parameters and S is a hyperparameter representing the number of hinges. The output range of APL is [0, ∞). Due to the trainable parameters, different neurons can learn different AFs.
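A minimal PyTorch sketch of the APL unit of Eq. (45) is given below. For brevity the S hinge parameters are shared by all units of a layer, which is a simplification of the per-neuron parameterization that the original formulation allows; the backward pass at the end only demonstrates that the hinge parameters receive gradients.

import torch
import torch.nn as nn
import torch.nn.functional as F

class APL(nn.Module):
    # Adaptive Piecewise Linear unit, Eq. (45):
    #   APL(x) = max(0, x) + sum_s a_s * max(0, b_s - x)
    # with trainable a_s, b_s; the S hinges are shared across the whole layer here.
    def __init__(self, S=2):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(S))
        self.b = nn.Parameter(torch.zeros(S))

    def forward(self, x):
        out = F.relu(x)
        for s in range(self.a.shape[0]):
            out = out + self.a[s] * F.relu(self.b[s] - x)
        return out

if __name__ == "__main__":
    apl = APL(S=2)
    x = torch.randn(4, 10)
    apl(x).sum().backward()
    print(apl.a.grad.shape, apl.b.grad.shape)   # the hinge parameters receive gradients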
Ramachandran et al. [29] have performed an automatic search, which resulted in the Swish AF. It is defined as,

Swish(x) = x × Sigmoid(β × x)    (46)

where β is a learnable parameter. The output range of Swish is (−∞, ∞). Based on the learnt value of β, the shape of the Swish AF is adjusted between the linear and ReLU functions. The smaller and higher values of β lead towards the linear and ReLU functions, respectively. Thus, it can control the amount of non-linearity based on the dataset and network complexity. Swish is also extended to E-Swish by multiplying the Swish with a learnable parameter to control the slope in the positive direction [77]. The E-Swish is defined as,

ESwish(x) = β × x × Sigmoid(x)    (47)

having the output range in (−∞, ∞), where β is a trainable parameter. A flatten-T Swish considers the zero function for negative inputs, similar to the ReLU [81]. The Adaptive Richard's Curve weighted Activation (ARiA) is also motivated from Swish and replaces the sigmoidal function with Richard's Curve [82]. The ARiA AF uses five hyper-parameters to control the shape of the non-linearity.

The basic AFs are combined with learnable weights in adaptive AFs [76]. The Adaptive AF (AAF) designed over PReLU [35] and PELU [54] is given as,

AAF(x) = σ(w × x) × PReLU(x) + (1 − σ(w × x)) × PELU(x)    (48)

having the output range in [0, 1], where σ is the sigmoidal function and w is a learnable parameter. In practice, AAF is costly as multiple AFs are involved. In [83], the AF for each neuron is selected from a library of AFs. In [84], different combinations of the identity function, ReLU, and Tanh are learnt automatically. In another attempt, an Adaptive Blending Unit (ABU) is defined to allow the networks to learn their preferred AFs [85]. The ABU combines a set of AFs with trainable weights. A Lookup Table Unit (LuTU) function [86] uses a single period cosine mask based smoothing and linear interpolation using a set of anchor points. Activation ensembles are used at each layer in [87] with the contribution of each AF controlled by trainable weights. Similarly, the Self-Learnable AF (SLAF) computes the sum of the different functions in an ensemble with the learnt coefficients [79]. The SLAF can be expressed as,

SLAF(x) = Σ_{i=0}^{N−1} a_i × x^i    (49)

in the output range of (−∞, ∞), where a_i is the trainable parameter. A Mexican ReLU (MeLU) AF is proposed in [80] by using a "Mexican hat type" function and is given as,

MeLU(x) = PReLU(x) + Σ_{j=1}^{k} c_j × max(λ_j − |x − a_j|, 0)    (50)

in the output range of (−∞, ∞), where c_j is the trainable parameter and λ_j & a_j are real numbers.

A cubic spline interpolation is also used to learn the AF from data [74], which is given as,

SAF(x) = Φ(s; q)    (51)

having the output range in (−∞, ∞), where Φ(.) is parameterized by a vector q, cubic in nature. Fourier series basis expansion is used for nonparametrically learning AFs (NPF) [88]. Hyperactivations utilize a hypernetwork on top of an activation network, which are used to explore the AFs search space [89]. A shallow neural network is used in the activation network to produce the output for each input, whereas a neural network is used in the hypernetwork to produce weights for another network. A bi-modal derivative adaptive activation (BDAA) function uses a twin maxima derivative sigmoidal function [75] by controlling the maxima's position with an adaptive parameter. The BDAA is given as,

BDAA(x) = (1/2) × (1/(1 + e^(−x)) − 1/(1 + e^(−x−a)))    (52)

in the output range of [0, 1], where a is a learnable parameter. The authors have exploited the bi-modal derivatives on four AFs. Linear regression is used in [78] to train the AF for each neuron, which results in different AFs for the different neurons. The TAF is defined as,

TAF(x) = √((x − a)² + b²)    (53)

in the output range of [b, ∞), where a and b are the trainable parameters. Recently, a trainable parameter was used in different non-adaptive AFs such as Sigmoid, Tanh, and ReLU to make them adaptive [90].
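As a concrete example of a learnable AF, the sketch below implements the Swish of Eq. (46) with a trainable β as a PyTorch module; the initial value β = 1 is an illustrative choice, and the short check at the end only verifies that β receives a gradient so that it can adapt during training.

import torch
import torch.nn as nn

class Swish(nn.Module):
    # Swish, Eq. (46): x * sigmoid(beta * x) with a trainable beta.
    # Small beta pushes the shape towards a (scaled) linear function,
    # large beta pushes it towards ReLU.
    def __init__(self, beta_init=1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(float(beta_init)))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

if __name__ == "__main__":
    act = Swish()
    x = torch.randn(8, 16)
    act(x).sum().backward()
    print(act.beta.grad)   # beta gets a gradient, so it adapts during training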
The adaptive and trainable AFs are the recent trend to adjust the non-linearity based on the data and network complexity. However, the minimal burden is increased in terms of the increased number of parameters. Though the complexity of tunable AFs is relatively increased w.r.t. non-tunable AFs, it is negligible w.r.t. all parameters of the entire network in practice. The same is also observed experimentally as reported in Table 10 in terms of the training time.

7. Miscellaneous Activation Functions

This section covers other attempts in AFs such as Softplus, Probabilistic, Polynomial, Subnetwork and Kernel.

7.1. Softplus Activation Functions

The softplus function [91] was proposed in 2001 as log(e^x + 1) and mostly used in statistical applications. After the breakthrough of deep learning the softmax function is used as the AF [92]. Softmax function produces the categorical probability distribution equivalent output. Softplus unit based AF is also used in deep neural networks [93]. The smooth nature of the Softplus facilitates the differentiability. The noisy softplus AF [94] is suitable for the spiking neural networks (SNNs). A Softplus Linear Unit (SLU) is also proposed by considering softplus with rectified unit [95]. The SLU AF is defined as,

SLU(x) = { α × x, x ≥ 0 ; β × log(e^x + 1) − γ, x < 0 }    (54)

where α, β and γ are the trainable parameters with α controlling the slope in the positive direction, β controlling the saturation points in the negative direction and γ controlling the offset in the negative direction w.r.t. the horizontal axis. The Rectified Softplus (ReSP) AF introduces the rectification for positive input in Softplus activation [96]. In order to make the softplus function to follow the zero mean, a shifting and scaling of the outputs is performed in [97]. A Rand Softplus (RSP) AF models the stochasticity-adaptability of biological neurons as,

RSP(x) = (1 − ρ) × max(0, x) + ρ × log(1 + e^x)    (55)

where ρ is a stochastic hyperparameter [98]. It improves the capability of the network towards the noise. The softplus function is also used with Tanh function in Mish activation function [99], which is given as,

Mish(x) = x × Tanh(Softplus(x)).    (56)

The Mish is a non-monotonic and smooth AF. It has recently been used by the YOLOv4 model for object detection [100]. However, the increased complexity in Mish due to the multiple functions can be a limitation for the deep networks.

7.2. Probabilistic Activation Functions

So far, stochastic AFs have not been much explored due to expensive sampling processes. Few AFs exist in this category such as Randomized ReLU (RReLU) [50], Elastic ReLU (EReLU) [40], Randomly Translational ReLU (RTReLU) [41] and Gaussian Error Linear Unit (GELU) [101]. GELU [101] considers nonlinearity as the stochastic regularization driven transformation and defined as,

GELU(x) = x × P(X ≤ x)    (57)

where P is the probability. The complexity of GELU increases due to use of probabilistic nature. The GELU is also extended to the Symmetrical Gaussian Error Linear Unit (SGELU) [102] to enhance its ability of bidirectional convergence. Doubly truncated Gaussian distributions [103] is a family of nonlinearities which can generate different AFs such as Sigmoid, Tanh and ReLU by setting the appropriate truncation points. Probabilistic AF (ProbAct) introduces the adaptable and trainable variance in the ReLU's output [104]. It leads to the generalization of the models. However, all other drawbacks of ReLU exist with ProbAct also.

7.3. Polynomial Activation Functions

Smooth Adaptive AF (SAAF) is defined as the piecewise polynomial function [105]. Two power functions symmetric to the linear part of ReLU are combined in [106] to improve the performance of ReLU. A piecewise polynomial approximation based AF is also learnt from the data [107]. This activation leads to the light-weight models suitable for the FPGAs and microcontrollers. The AF is also treated as the cumulative distribution function [108]. The ReLU is also extended to a Rectified Power Unit (RePU) for positive inputs as,

RePU(x) = { x^s, x ≥ 0 ; 0, x < 0 }    (58)

where s is a hyperparameter [109]. The RePU is suitable for smoother gradients near zero. However, vanishing gradient, unbounded and asymmetric nature are the downsides of RePU. The rational function of polynomials is better suited as compared to the polynomial functions in order to approximate the ReLU [110]. Recently, a Padé approximation is used to develop a non-smooth Padé Activation Unit (PAU) [111] as,

PAU(x) = P(x) / Q(x)    (59)

where P(x) and Q(x) are two polynomials of order m and n, respectively. The PAUs can approximate the commonly used hand-designed AFs. Moreover, it can also learn the new AFs with compact representations. Recently, a Rational AF (RAF) [112] was proposed to tackle the problem of non-smooth nature of the PAU function.

7.4. Activations as a Subnetwork

A Variable AF (VAF) is used as a subnetwork of ReLUs [113]. It uses the ensemble of ReLUs in a subnetwork using learnable parameters. In a very similar approach, the maximum of multiple linear functions is used in the Dynamic ReLU (DY-ReLU) [114]. In Wide Hidden Expansion (WHE) [115], each WHE intermediate channel is followed by one AF before connecting to the output channel to increase the non-linearity of the network. An AF Unit (AFU) [116] uses a small neural network to model the activation. All neurons in the original network share the weights in AFU. The advantage of the AFU is that different AFs can be learnt by different layers.
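The Mish of Eq. (56) and the GELU of Eq. (57) can be written directly from their definitions and checked against the corresponding PyTorch built-ins; the sketch below does exactly that, using Φ(x) = 0.5 × (1 + erf(x/√2)) for the Gaussian CDF. F.mish requires a reasonably recent PyTorch (roughly 1.9 or later).

import math
import torch
import torch.nn.functional as F

def gelu_from_cdf(x):
    # Eq. (57): x * P(X <= x) with X ~ N(0, 1); Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def mish_manual(x):
    # Eq. (56): x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

if __name__ == "__main__":
    x = torch.linspace(-4.0, 4.0, steps=101)
    print(torch.allclose(gelu_from_cdf(x), F.gelu(x), atol=1e-6))  # exact (erf-based) GELU
    print(torch.allclose(mish_manual(x), F.mish(x), atol=1e-6))    # needs torch >= 1.9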
7.5. Kernel Activation Functions

A Kernel-based non-parametric AF (KAF) [117] uses an inexpensive kernel expansion to make the activation flexible. The KAF is further extended to multikernel AFs (multi-KAF) [118]. Several AFs are also introduced for complex valued neural networks [119], [120], [121].

8. Aspects of Activation Functions

This section summarizes the effect of weight initialization, the understanding of AFs, and their suitability with different types of data. The learning of the network speeds up drastically by using an orthogonal weight initialization based on dynamical isometry [122]. A set of conditions in parameter initialization also boosts the performance of networks with sigmoidal activations [123]. A symmetric probability distribution based weights and biases initialization leads the network to suffer from the dying ReLU problem. However, asymmetric initialization resolves the dying ReLU problem [124]. Over-parameterization during initialization also benefits the training [125]. Data-dependent weight initialization using a subset of data minimizes the issues of the ReLU [126], whereas an initial parameter sharing based initialization guarantees the dynamical isometry for the ReLU [127].

Several researchers have tried to understand the working and impact of AFs through different strategies. Lower and upper bounds are established for network complexity to realize that the ReLU in deep networks approximates smooth functions more efficiently as compared to shallow networks [128]. A ReLU network with only one hidden layer is trained to reach the global optimum in polynomial time even with exponentially growing input dimension [129]. ReLU type AF based neural networks produce overconfident predictions far away from the training data [130]. However, this can be resolved by employing adversarial confidence enhanced training. A Gaussian margin driven time and accuracy tradeoff analysis is also done on the ReLU's learning [131]. The singular values for ReLU layers are analyzed to understand the interaction of ReLU with the linear components [132]. The approximation of a Gaussian posterior distribution over the ReLU network weights fixes the overconfidence problem [133].

Although most of the AFs are tested over image data, there are a few research papers dealing with AFs for other types of data. Table 6 summarizes the insights and remarks of state-of-the-art AFs for various networks and datasets.

9. Performance Comparison and Analysis

This survey is compared with the existing surveys/performance analyses, and an experimental performance analysis of selected AFs is performed over image, text and speech data.

9.1. Comparison with Existing Survey/Performance Analysis

A performance analysis of AFs was conducted using a multilayer perceptron network in [134]. Among the compared AFs, the Tanh has shown better performance. A comparative performance analysis of different AFs suggests the Elliott function as better suited for classification using LSTM networks [25]. The ELU outperforms the ReLU, LReLU, and SELU AFs over the MNIST classification task using Deep Neural Networks [135]. The ELU is reported in [136] to outperform the ReLU, LReLU, PReLU and PELU over sufficiently large datasets for speech recognition. However, for smaller datasets, the ReLU is preferred. A similar trend is also reported in [137] with a note that the ELU and SELU AFs exhibit faster convergence as compared to the ReLU and LReLU AFs. In [138], 21 AFs are listed without an experimental results comparison. In contrast to [138], this paper presents a comprehensive survey of AFs. The ReLU based deep networks perform superior or mildly worse than the spline methods [139]. A review of adaptive functions is conducted in [140] by considering 9 functions, including Sigmoid, Tanh, PReLU, and adaptTanh. In [141], the comparison between ReLU and LReLU is performed using a CNN on the MNIST dataset. An empirical study is also done for the variations of the ReLU activation by generalizing it with the help of parameters [142]. The comparison of AFs is also performed for generalized learning vector quantization [143]. The ReLU activation has performed better for object, face, and text datasets [144]. However, the SELU and Maxout have performed better for medical and sound datasets, respectively [144]. A piecewise AF is better suited for facial expression recognition in [145]. A survey of adaptive AFs is conducted in [146] without experimental comparison. The evaluation of seven AFs is conducted in [147] using a simple network over the CIFAR10 dataset, whereas in our survey we cover different AFs and also perform the experimental comparison. A summary of the comparison with existing surveys and performance analyses of AFs is shown in Table 7. Following are the observations:

• This survey presents a detailed classification to cover the wide range of AFs as compared to the existing surveys and performance analyses.

• This survey covers exhaustive state-of-the-art AFs to date, whereas the existing surveys/performance analyses cover either a limited number of AFs or only basic AFs.

• The performance analysis conducted in this paper considers a wide range of neural networks over different types of data for eighteen AFs, whereas the existing analyses are limited to a single type of data and network.

• This survey highlights the trends to help researchers to further explore better AFs and practitioners to choose based on the data and network types.
Table 6: Summary of the existing state-of-the-art activation functions.
9.2. Experimental Performance Analysis

In order to compare the AFs, three experiments are conducted in this paper, including image classification, language translation and speech recognition. Eighteen state-of-the-art AFs are considered for the analysis, including Logistic Sigmoid, Tanh, Elliott [25], ReLU [8], LReLU [34], PReLU [35], ELU [27], SELU [52], GELU [101], CELU [53], Softplus [93], Swish [29], ABReLU [44], LiSHT [24], Soft-Root-Sign (SRS) [26], Mish [99], PAU [111] and PDELU [59]. Note that Swish, ABReLU, LiSHT, SRS, Mish, PAU and PDELU are the most recent functions. Google Colab based computational resources are used in most of the experiments. A few experiments are also performed on a desktop system with an 8 GB GPU. The PyTorch framework is used in all the experiments.

The CIFAR10 and CIFAR100 datasets1 [148] are used for the image classification experiment in this paper. The CIFAR10 dataset contains 50,000 training images and 10,000 test images from 10 object categories. The CIFAR100 dataset contains 50,000 training images and 10,000 test images from 100 object categories. We also utilize language translation and speech recognition datasets for the experiments. For the experiments over the CIFAR-10 and CIFAR-100 datasets, training is performed for 100 Epochs. The batch size is 128 for CIFAR-10 and 64 for CIFAR-100. The learning rate is 0.001 for the first 80 Epochs and 0.0001 for the last 20 Epochs. Random crop and random horizontal flip are the data augmentations used during training. Data normalization is performed both during train and test times. The Adam optimizer is used for the training with cross entropy loss. All existing activation functions except softmax are replaced with the corresponding activation function in the different networks.

1 https://www.cs.toronto.edu/~kriz/cifar.html
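Such a replacement of the existing activations can be carried out generically over PyTorch modules. The helper below is an illustrative sketch of the swap together with the optimizer and learning-rate schedule stated above; it is not the authors' released code (see the repository linked in the abstract for that), and it only swaps activation modules, so models that call activations functionally inside forward would need their definitions edited instead.

import torch
import torch.nn as nn
import torchvision

def replace_activations(model, new_act_factory, kinds=(nn.ReLU,)):
    # Recursively swap every activation module of the listed kinds for a new one.
    for name, child in model.named_children():
        if isinstance(child, kinds):
            setattr(model, name, new_act_factory())
        else:
            replace_activations(child, new_act_factory, kinds)
    return model

if __name__ == "__main__":
    model = torchvision.models.vgg16(num_classes=10)          # CIFAR10 has 10 classes
    model = replace_activations(model, lambda: nn.SiLU())     # e.g. swap ReLU for Swish/SiLU
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # learning-rate schedule described above: 0.001 for the first 80 epochs, then 0.0001
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    print(model.classifier)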
Table 7: Comparison of this survey with the existing surveys and performance evaluations.
The test accuracy is reported in Tables 8 and 9 on CIFAR10 and CIFAR100 datasets, respectively. In these Tables, the mean and standard deviation of image classification accuracy over 5 trials are reported for each AF. Moreover, the better results are highlighted. Different types of CNN models are used in this experiment, such as plain models (i.e., MobileNet [149] and VGG16 [150]), inception model (i.e., GoogLeNet [151]) and skip/residual connection based models (i.e., ResNet50 [152], SENet18 [153], and DenseNet121 [154]). The MobileNet, GoogLeNet and SENet18 are light models, whereas the VGG16, ResNet50 and DenseNet121 are heavy models in terms of the number of trainable parameters.

Table 9: Experimental results comparison over CIFAR100 dataset.

Overall, it is observed that the Softplus, ELU and CELU are better suited with MobileNet. The ReLU, Mish and PDELU exhibit good performance with VGG16, GoogleNet and DenseNet. The ReLU, LReLU, ELU, GELU, CELU, ABReLU, and PDELU activation functions are better for the networks having residual connections, such as ResNet50, SENet18 and DenseNet121. In order to demonstrate the convergence of different AFs, the training loss vs epochs is plotted in Fig. 3 on CIFAR100 dataset using different models. The PAU has emerged as a promising AF with fastest convergence in most of the cases. The PReLU, GELU and PDELU AFs are also consistent with good convergence. Note that the training diverges with SRS for the SENet18 model. Sigmoid and Elliott AFs showed the poorest convergence. The time taken for the training is also computed for different AFs using different CNN models on CIFAR100 dataset and reported in Table 10. These results are computed using a desktop computer system having 32 GB RAM and 8 GB Nvidia GPU Card for 100 epochs of training. The time is represented in hh:mm:ss format. It is clear that PDELU AF is very inefficient. Moreover, SRS and Elliott also take more time for training. The activations such as ReLU, ELU, CELU, and Softplus depict a good tradeoff between the accuracy and training time.

The results for language translation and speech recognition for different AFs are illustrated in Table 11.
Table 10: Training time (hh:mm:ss) comparison over CIFAR100 dataset.

Table 11: Experimental results for German to English language translation and speech recognition tasks.

Activations    | Language Translation | Speech Recognition
               | Bleu Score           | Average CER  | Average WER
---------------|----------------------|--------------|-------------
Sigmoid        | 14.59 ± 0.47         | 0.53 ± 0.18  | 1.19 ± 0.39
Tanh           | 20.93 ± 0.91         | 0.26 ± 0     | 0.68 ± 0
Elliott [25]   | 14.49 ± 0.96         | 0.40 ± 0.01  | 0.93 ± 0.01
ReLU [8]       | 18.88 ± 0.86         | 0.24 ± 0.01  | 0.66 ± 0.01
LReLU [34]     | 18.89 ± 0.82         | 0.24 ± 0     | 0.66 ± 0.01
PReLU [35]     | 20.04 ± 0.98         | 0.24 ± 0     | 0.65 ± 0
ELU [27]       | 19.40 ± 1.33         | 0.25 ± 0     | 0.67 ± 0
SELU [52]      | 20.85 ± 0.64         | 0.26 ± 0     | 0.69 ± 0.01
GELU [101]     | 18.75 ± 1.83         | 0.24 ± 0     | 0.65 ± 0
CELU [53]      | 18.71 ± 0.55         | 0.25 ± 0     | 0.67 ± 0
Softplus [93]  | 16.78 ± 0.84         | 0.30 ± 0.01  | 0.76 ± 0.02
Swish [29]     | 19.51 ± 0.97         | 0.24 ± 0.01  | 0.65 ± 0.01
ABReLU [44]    | 17.55 ± 0.63         | 0.25 ± 0     | 0.68 ± 0
LiSHT [24]     | 20.39 ± 0.93         | 0.29 ± 0.01  | 0.74 ± 0.01
SRS [26]       | 20.66 ± 0.78         | 0.28 ± 0     | 0.72 ± 0
Mish [99]      | 19.56 ± 1.15         | 0.24 ± 0     | 0.65 ± 0
PAU [111]      | 20.11 ± 1.24         | 0.24 ± 0     | 0.65 ± 0.01
PDELU [59]     | 19.07 ± 0.95         | 0.25 ± 0     | 0.67 ± 0.01

The German to English translation is used to test the performance of the AFs over text data. Benchmark Seq2Seq model consisting of a Long Short Term Memory (LSTM) based autoencoder network is used for the experiment. The model and dataset are downloaded from Kaggle2. The AF is applied to the feature embedding before the dropout layer. For the language translation experiments, the number of Epochs is set to 50 with 0.001 learning rate and 256 batch size. The embedding size of encoder and decoder is 300. The dropout factor is 0.5 for both encoder and decoder. Adam optimizer is used for the training with cross entropy loss. The Bleu score [155] with 4-gram is reported in Table 11 in 2nd column for different AFs. The mean and standard deviation of Bleu score over 5 trials are reported for each AF. It is noticed that the Tanh and SELU AFs are better suitable for language translation. The PReLU, LiSHT, SRS and PAU AFs also perform better for language translation.

The speech recognition experiment is also performed to show the performance of the different AFs for time-series signal data. The end-to-end speech recognition based Deep Speech 2 framework available from assemblyai3 is used. The model consists of 2 layers of residual convolution layers to learn the relevant audio features, and 2 layers of bidirectional gated recurrent units (GRUs) to use the learned residual convolutional audio features. The 100 hours of transcribed audio English data from LibriSpeech dataset is used for the experiment. For the speech recognition experiments, torchaudio 0.4.0 and torch 1.4.0 are used. The model consists of 2 CNN layers and 2 RNN layers. The dimension of a RNN layer is 512. Number of classes is 29 in the dataset. Dropout factor is 0.5. The learning rate is 0.0005, batch size is 10 and the number of Epochs is 10. The mean and standard deviation over 5 trials of character error rate (CER) and word error rate (WER) are reported in Table 11 for speech recognition. The recent AFs such as PReLU, GELU, Swish, Mish and PAU AFs are found as the most suitable for speech recognition in this experiment.

2 https://www.kaggle.com/parthplc/pytorch-seq2seq-machine-translation/notebook
3 https://www.assemblyai.com/blog/end-to-end-speech-recognition-pytorch
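For reference, character and word error rates of the kind reported in Table 11 can be computed from a Levenshtein edit distance; the self-contained Python sketch below shows one way to do so. It is not the evaluation code used for the experiments, and the reference/hypothesis strings are made-up examples.

def edit_distance(ref, hyp):
    # dynamic-programming Levenshtein distance between two sequences
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (r != h))    # substitution (0 cost if equal)
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def char_error_rate(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(round(word_error_rate(ref, hyp), 3))   # 2 edits over 6 words, about 0.333
    print(round(char_error_rate(ref, hyp), 3))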
10. Conclusion and Recommendations

An extensive and up to date survey of activation functions is conducted in this paper. Different types of AFs are considered, including Logistic Sigmoid and Tanh based, ReLU based, ELU based, and Learning based. However, the main focus is given to the recent developments in AFs in view of the deep learning applications of neural networks. The overview of AFs presented in this paper focuses on aspects including the detailed coverage of AFs, their classification, and the performance comparison over image, text and speech data.

Following are the concluding remarks of the survey and performance analysis conducted through this paper:

• Most of the improvements in Logistic Sigmoid and Tanh target the non zero-mean and zero-gradient problems. However, these improvements carry forward the drawback of increased complexity.

• The ReLU variants try to tackle the three major problems of ReLU, namely under-utilization of negative values, limited nonlinearity and unbounded output. These activations perform well for some applications, e.g. LReLU and ABReLU work better with residual networks. However, most of these activations fail to perform better than ReLU, e.g. LReLU, PReLU and ABReLU do not improve for MobileNet, VGG and GoogleNet models. Note that the ReLU, Leaky ReLU and PReLU AFs are the most common choices among researchers due to their simplicity. Moreover, many networks consider the ReLU as a default choice for the AF.

• The exponential based AFs also focus on the better utilization of the negative values and on avoiding saturation for important features. However, most of the exponential activations suffer due to their non-smooth functions.

• The learning based adaptive AFs try to find the best parameters to represent the non-linearity needed for the given dataset. This category of AF has gained more popularity in recent years. However, the major problem associated with such AFs is finding a better base function and the number of trainable parameters. Some AFs diverge during the training if not initialized properly.

• In contrast to existing surveys, this survey covers an exhaustive list of different types of AFs. Moreover, a performance analysis on different types of data using several AFs provides new insights for future research.

Following are the recommendations curated from this survey and performance analysis:

• In order to speed up the training, both negative & positive values should be used to ensure the near zero mean.

• The most important aspect in deep learning is to find a network having matching complexity as the dataset complexity. If the complexity of the model is high then it may lead to overfitting, and if the complexity of the model is low then it may lead to under convergence. Thus, the AF should bridge this gap based on the model and dataset complexity during training automatically.

• The Logistic Sigmoid and Tanh AFs should be avoided for Convolutional Neural Networks as they lead to poor convergence. However, this type of AF is commonly used as gates in recurrent neural networks.

• Despite the ReLU being a popular choice, recently proposed AFs such as Swish, Mish, and PAU are also worth trying for different problems.

• The ReLU, Mish and PDELU activation functions have shown a good performance with VGG16 and GoogleNet. The ReLU, LReLU, ELU, GELU, CELU, and PDELU functions are better for the networks having residual connections for image classification.

• In general, the parametric AFs show better convergence as they can adapt to the data faster by learning the parameter from the data. Specially, PAU, PReLU and PDELU have shown better convergence.

• Some AFs lead to increased training time complexity. PDELU and SRS are such examples. However, AFs such as ReLU, SELU, GELU, and Softplus depict a promising tradeoff between the accuracy and training time.

• The exponential AFs generally lead to increased non-linearity due to the utilization of the negative values.

• The Tanh and SELU AFs are found better for language translation along with PReLU, LiSHT, SRS and PAU.

• It is suggested to use the PReLU, GELU, Swish, Mish and PAU AFs for speech recognition.

References

[1] F. Shao, L. Chen, J. Shao, W. Ji, S. Xiao, L. Ye, Y. Zhuang, J. Xiao, Deep learning for weakly-supervised object detection and localization: A survey, Neurocomputing (2022).
[2] Y. Mo, Y. Wu, X. Yang, F. Liu, Y. Liao, Review the state-of-the-art technologies of semantic segmentation based on deep learning, Neurocomputing (2022).
[3] Y. Guo, F. Feng, X. Hao, X. Chen, Jac-net: Joint learning with adaptive exploration and concise attention for unsupervised domain adaptive person re-identification, Neurocomputing (2022).
[4] S. R. Dubey, A decade survey of content based image retrieval using deep learning, IEEE Transactions on Circuits and Systems for Video Technology (2021).
[5] X. Xia, X. Pan, N. Li, X. He, L. Ma, X. Zhang, N. Ding, Gan-based anomaly detection: A review, Neurocomputing (2022).
[6] H. Li, Y. Pan, J. Zhao, L. Zhang, Skin disease diagnosis with deep learning: a review, Neurocomputing 464 (2021) 364–393.
[7] C. H. Dagli, Artificial neural networks for intelligent manufacturing, Springer Science & Business Media, 2012.
[8] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] A. Graves, A.-r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
[10] K. K. Babu, S. R. Dubey, Pcsgan: Perceptual cyclic-synthesized generative adversarial networks for thermal and nir to visible image transformation, Neurocomputing (2020).
[11] J. Liu, Y. Liu, Q. Zhang, A weight initialization method based on neural network with asymmetric activation function, Neurocomputing (2022).
[12] Y. Srivastava, V. Murali, S. R. Dubey, A performance evaluation of loss functions for deep face recognition, in: National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics, Springer, 2019, pp. 322–332.
[13] S. S. Basha, S. R. Dubey, V. Pulabaigari, S. Mukherjee, Impact of fully connected layers on performance of convolutional neural networks for image classification, Neurocomputing 378 (2020) 112–119.
[14] Q. Xu, M. Zhang, Z. Gu, G. Pan, Overfitting remedy by sparsifying regularization on fully-connected layers of cnns, Neurocomputing 328 (2019) 69–74.
[15] S. R. Dubey, S. Chakraborty, S. K. Roy, S. Mukherjee, S. K. Singh, B. B. Chaudhuri, diffgrad: An optimization method for convolutional neural networks, IEEE Transactions on Neural Networks and Learning Systems 31 (11) (2019) 4500–4511.
[16] W. Duch, N. Jankowski, Survey of neural transfer functions, Neural Computing Surveys 2 (1) (1999) 163–212.
[17] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: International Conference on Machine Learning, 2010, pp. 807–814.
[18] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
[19] A. N. S. Njikam, H. Zhao, A novel activation function for multilayer feed-forward neural networks, Applied Intelligence 45 (1) (2016) 75–82.
[20] B. Xu, R. Huang, M. Li, Revise saturated activation functions, International Conference on Learning Representations Workshop (2016).
[21] S. Kong, M. Takatsuka, Hexpo: A vanishing-proof activation function, in: International Joint Conference on Neural Networks, 2017, pp. 2562–2567.
[22] Y. Qin, X. Wang, J. Zou, The optimized deep belief networks with improved logistic sigmoid units and their application in fault diagnosis for planetary gearboxes of wind turbines, IEEE Transactions on Industrial Electronics 66 (5) (2018) 3814–3824.
[23] S. Elfwing, E. Uchibe, K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural Networks 107 (2018) 3–11.
[24] S. K. Roy, S. Manna, S. R. Dubey, B. B. Chaudhuri, Lisht: Non-parametric linearly scaled hyperbolic tangent activation function for neural networks, arXiv preprint arXiv:1901.05894 (2019).
[25] A. Farzad, H. Mashayekhi, H. Hassanpour, A comparative performance analysis of different activation functions in lstm networks for classification, Neural Computing and Applications 31 (7) (2019) 2507–2521.
[26] Y. Zhou, D. Li, S. Huo, S.-Y. Kung, Soft-root-sign activation function, arXiv preprint arXiv:2003.00547 (2020).
[27] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus), in: International Conference on Learning Representations, 2016.
[28] F. Agostinelli, M. Hoffman, P. Sadowski, P. Baldi, Learning activation functions to improve deep neural networks, International Conference on Learning Representations Workshops (2015).
[29] P. Ramachandran, B. Zoph, Q. V. Le, Searching for activation functions, International Conference on Learning Representations Workshops (2018).
[30] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[31] P. Chandra, Y. Singh, An activation function adapting training algorithm for sigmoidal feedforward networks, Neurocomputing 61 (2004) 429–437.
[32] S. S. Sodhi, P. Chandra, Bi-modal derivative activation function for sigmoidal feedforward networks, Neurocomputing 143 (2014) 182–196.
[33] S. Eger, P. Youssef, I. Gurevych, Is it time to swish? comparing deep learning activation functions across nlp tasks, arXiv preprint arXiv:1901.02671 (2019).
[34] A. L. Maas, A. Y. Hannun, A. Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: International Conference on Machine Learning, Vol. 30, 2013, p. 3.
[35] K. He, X. Zhang, S. Ren, J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[36] W. Shang, K. Sohn, D. Almeida, H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, in: International Conference on Machine Learning, 2016, pp. 2217–2225.
[37] S. S. Liew, M. Khalil-Hani, R. Bakhteri, Bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems, Neurocomputing 216 (2016) 718–734.
[38] R. Duggal, A. Gupta, P-telu: Parametric tan hyperbolic linear unit activation for deep neural networks, in: IEEE International Conference on Computer Vision Workshops, 2017, pp. 974–978.
[39] S. Qiu, X. Xu, B. Cai, Frelu: Flexible rectified linear units for improving convolutional neural networks, in: International Conference on Pattern Recognition, 2018, pp. 1223–1228.
[40] X. Jiang, Y. Pang, X. Li, J. Pan, Y. Xie, Deep neural networks with elastic rectified linear units for object recognition, Neurocomputing 275 (2018) 1132–1139.
[41] J. Cao, Y. Pang, X. Li, J. Liang, Randomly translational activation inspired by the input distributions of relu, Neurocomputing 275 (2018) 859–868.
[42] F. Godin, J. Degrave, J. Dambre, W. De Neve, Dual rectified linear units (drelus): A replacement for tanh activation functions in quasi-recurrent neural networks, Pattern Recognition Letters 116 (2018) 8–14.
[43] Z. Tang, L. Luo, H. Peng, S. Li, A joint residual network with paired relus activation for image super-resolution, Neurocomputing 273 (2018) 37–46.
[44] S. R. Dubey, S. Chakraborty, Average biased relu based cnn descriptor for improved face retrieval, arXiv preprint arXiv:1804.02051 (2018).
[45] Y. Liu, J. Zhang, C. Gao, J. Qu, L. Ji, Natural-logarithm-rectified activation function in convolutional neural networks, in: International Conference on Computer and Communications, 2019, pp. 2000–2008.
[46] S. Gu, W. Li, L. V. Gool, R. Timofte, Fast image restoration with multi-bin trainable linear units, in: IEEE International Conference on Computer Vision, 2019, pp. 4190–4199.
[47] M. Basirat, P. Roth, L* relu: Piece-wise linear activation functions for deep fine-grained visual categorization, in: IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1218–1227.
[48] C. Gulcehre, M. Moczulski, M. Denil, Y. Bengio, Noisy activation functions, in: International Conference on Machine Learning, 2016, pp. 3059–3068.
[49] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio, Maxout networks, arXiv preprint arXiv:1302.4389 (2013).
[50] B. Xu, N. Wang, T. Chen, M. Li, Empirical evaluation of rectified activations in convolutional network, arXiv preprint arXiv:1505.00853 (2015).
[51] H. Li, W. Ouyang, X. Wang, Multi-bias non-linear activation in deep neural networks, in: International Conference on Machine Learning, 2016, pp. 221–229.
[52] G. Klambauer, T. Unterthiner, A. Mayr, S. Hochreiter, Self-normalizing neural networks, in: Advances in Neural Information Processing Systems, 2017, pp. 971–980.
[53] J. T. Barron, Continuously differentiable exponential linear units, arXiv (2017) arXiv–1704.
[54] L. Trottier, P. Gigu, B. Chaib-draa, et al., Parametric exponential linear unit for deep convolutional neural networks, in: IEEE International Conference on Machine Learning and Applications, 2017, pp. 207–214.
[55] Y. Li, C. Fan, Y. Li, Q. Wu, Y. Ming, Improving deep neural network with multiple parametric exponential linear units, Neurocomputing 301 (2018) 11–24.
[56] Z. Qiumei, T. Dan, W. Fenghua, Improved convolutional neural network based on fast exponentially linear unit activation function, IEEE Access 7 (2019) 151359–151367.
[57] Y. Ying, J. Su, P. Shan, L. Miao, X. Wang, S. Peng, Rectified exponential units for convolutional neural networks, IEEE Access 7 (2019) 101633–101640.
[58] D. Kim, J. Kim, J. Kim, Elastic exponential linear units for convolutional neural networks, Neurocomputing 406 (2020) 253–266.
[59] Q. Cheng, H. Li, Q. Wu, L. Ma, N. N. King, Parametric deformable exponential linear units for deep neural networks, Neural Networks 125 (2020) 281–289.
[60] J. Si, S. L. Harris, E. Yfantis, A dynamic relu on neural network, in: IEEE Dallas Circuits and Systems Conference, 2018, pp. 1–6.
[61] H. Hu, Vrelu activation functions for artificial neural networks, in: International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, 2018, pp. 856–860.
[62] G. Lin, W. Shen, Research on convolutional neural network based on improved relu piecewise activation function, Procedia Computer Science 131 (2018) 977–984.
[63] D. Macêdo, C. Zanchettin, A. L. Oliveira, T. Ludermir, Enhancing batch normalized convolutional networks using displaced rectifier linear units: A systematic comparative study, Expert Systems with Applications 124 (2019) 271–281.
[64] L. B. Godfrey, An evaluation of parametric activation functions for deep learning, in: IEEE International Conference on Systems, Man and Cybernetics, 2019, pp. 3006–3011.
[65] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, S. Yan, Deep learning with s-shaped rectified linear activation units, in: AAAI Conference on Artificial Intelligence, 2016.
[66] V. S. Bawa, V. Kumar, Linearized sigmoidal activation: A novel activation function with tractable non-linear characteristics to boost representation capability, Expert Systems with Applications 120 (2019) 346–356.
[67] X. Wang, Y. Qin, Y. Wang, S. Xiang, H. Chen, Reltanh: An activation function with vanishing gradient resistance for sae-based dnns and its application to rotating machinery fault diagnosis, Neurocomputing 363 (2019) 88–98.
[68] X. Hu, P. Niu, J. Wang, X. Zhang, A dynamic rectified linear activation units, IEEE Access 7 (2019) 180409–180416.
[69] A. Nicolae, Plu: The piecewise linear unit activation function, arXiv preprint arXiv:1809.09534 (2018).
[70] L. B. Godfrey, M. S. Gashler, A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks, in: International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Vol. 1, 2015, pp. 481–486.
[71] B. Grelsson, M. Felsberg, Improved learning in convolutional neural net-
[90] A. D. Jagtap, K. Kawaguchi, G. E. Karniadakis, Adaptive activation functions accelerate convergence in deep and physics-informed neural networks, Journal of Computational Physics 404 (2020) 109136.
[91] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, R. Garcia, Incorporating second-order functional knowledge for better option pricing, in: Advances in Neural Information Processing Systems, 2001, pp. 472–478.
[92] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[93] H. Zheng, Z. Yang, W. Liu, J. Liang, Y. Li, Improving deep neural networks using softplus units, in: International Joint Conference on Neural Networks, 2015, pp. 1–4.
[94] Q. Liu, S. Furber, Noisy softplus: a biology inspired activation function, in: International Conference on Neural Information Processing, 2016, pp. 405–412.
[95] H. Zhao, F. Liu, L. Li, C. Luo, A novel softplus linear unit for deep convolutional neural networks, Applied Intelligence 48 (7) (2018) 1707–1720.
[96] C. Xu, J. Huang, S.-p. Wang, A.-q. Hu, A novel parameterized activation function in visual geometry group, in: International Conference on Data
works with shifted exponential linear units (shelus), in: International Science and Business Analytics, 2018, pp. 386–389.
Conference on Pattern Recognition, 2018, pp. 517–522. [97] K. Sun, J. Yu, L. Zhang, Z. Dong, A convolutional neural network model
[72] Y. Yu, K. Adu, N. Tashi, P. Anokye, X. Wang, M. A. Ayidzoe, Rmaf: based on improved softplus activation function, in: International Confer-
Relu-memristor-like activation function for deep learning, IEEE Access ence on Applications and Techniques in Cyber Security and Intelligence,
8 (2020) 72727–72741. 2019, pp. 1326–1335.
[73] M. Basirat, P. M. Roth, The quest for the golden activation function, [98] Y. Chen, Y. Mai, J. Xiao, L. Zhang, Improving the antinoise ability of
arXiv preprint arXiv:1808.00783 (2018). dnns via a bio-inspired noise adaptive activation function rand softplus,
[74] S. Scardapane, M. Scarpiniti, D. Comminiello, A. Uncini, Learning ac- Neural Computation 31 (6) (2019) 1215–1233.
tivation functions from data using cubic spline interpolation, in: Italian [99] D. Misra, Mish: A self regularized non-monotonic neural activation
Workshop on Neural Nets, 2017, pp. 73–83. function, arXiv preprint arXiv:1908.08681 (2019).
[75] A. Mishra, P. Chandra, U. Ghose, S. S. Sodhi, Bi-modal derivative adap- [100] A. Bochkovskiy, C.-Y. Wang, H.-Y. M. Liao, Yolov4: Optimal speed and
tive activation function sigmoidal feedforward artificial neural networks, accuracy of object detection, arXiv preprint arXiv:2004.10934 (2020).
Applied Soft Computing 61 (2017) 983–994. [101] D. Hendrycks, K. Gimpel, Gaussian error linear units (gelus), arXiv
[76] S. Qian, H. Liu, C. Liu, S. Wu, H. San Wong, Adaptive activation preprint arXiv:1606.08415 (2016).
functions in convolutional neural networks, Neurocomputing 272 (2018) [102] C. Yu, Z. Su, Symmetrical gaussian error linear units (sgelus), arXiv
204–212. preprint arXiv:1911.03925 (2019).
[77] E. Alcaide, E-swish: Adjusting activations to different network depths, [103] Q. Su, L. Carin, et al., A probabilistic framework for nonlinearities in
arXiv preprint arXiv:1801.07145 (2018). stochastic neural networks, in: Advances in Neural Information Pro-
[78] Ö. F. Ertuğrul, A novel type of activation function in artificial neural cessing Systems, 2017, pp. 4486–4495.
networks: Trained activation function, Neural Networks 99 (2018) 148– [104] J. Lee, K. Shridhar, H. Hayashi, B. K. Iwana, S. Kang, S. Uchida,
157. Probact: A probabilistic activation function for deep neural networks,
[79] M. Goyal, R. Goyal, B. Lall, Learning activation functions: A arXiv preprint arXiv:1905.10761 (2019).
new paradigm of understanding neural networks, arXiv preprint [105] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. H. Saltz, Convnets with
arXiv:1906.09529 (2019). smooth adaptive activation functions for regression, Proceedings of Ma-
[80] G. Maguolo, L. Nanni, S. Ghidoni, Ensemble of convolutional neu- chine Learning Research 54 (2017) 430.
ral networks trained with different activation functions, arXiv preprint [106] Y. Berradi, Symmetric power activation functions for deep neural net-
arXiv:1905.02473 (2019). works, in: International Conference on Learning and Optimization Al-
[81] H. H. Chieng, N. Wahid, P. Ong, S. R. K. Perla, Flatten-t swish: a gorithms: Theory and Applications, 2018, pp. 1–6.
thresholded relu-swish-like activation function for deep learning, arXiv [107] E. López-Rubio, F. Ortega-Zamorano, E. Domı́nguez, J. Muñoz-Pérez,
preprint arXiv:1812.06247 (2018). Piecewise polynomial activation functions for feedforward neural net-
[82] N. Patwardhan, M. Ingalhalikar, R. Walambe, Aria: Utilizing richard’s works, Neural Processing Letters 50 (1) (2019) 121–147.
curve for controlling the non-monotonicity of the activation function in [108] F. Farhadi, V. P. Nia, A. Lodi, Activation adaptation in neural networks,
deep neural nets, arXiv preprint arXiv:1805.08878 (2018). arXiv preprint arXiv:1901.09849 (2019).
[83] M. Dushkoff, R. Ptucha, Adaptive activation functions for deep net- [109] B. Li, S. Tang, H. Yu, Powernet: Efficient representations of polynomi-
works, Electronic Imaging 2016 (19) (2016) 1–5. als and smooth functions by deep neural networks with rectified power
[84] F. Manessi, A. Rozza, Learning combinations of activation functions, in: units, arXiv preprint arXiv:1909.05136 (2019).
IEEE International Conference on Pattern Recognition, 2018, pp. 61–66. [110] M. Telgarsky, Neural networks and rational functions, in: International
[85] L. R. Sütfeld, F. Brieger, H. Finger, S. Füllhase, G. Pipa, Adaptive blend- Conference on Machine Learning, 2017, pp. 3387–3393.
ing units: Trainable activation functions for deep neural networks, arXiv [111] A. Molina, P. Schramowski, K. Kersting, Padé activation units: End-to-
preprint arXiv:1806.10064 (2018). end learning of flexible activation functions in deep networks, Interna-
[86] M. Wang, B. Liu, H. Foroosh, Look-up table unit activation function tional Conference on Learning Representations (2020).
for deep convolutional neural networks, in: IEEE Winter Conference on [112] A. T. Nicolas Boullé, Yuji Nakatsukasa, Rational neural networks, arXiv
Applications of Computer Vision, 2018, pp. 1225–1233. preprint arXiv:2004.01902 (2020).
[87] D. Klabjan, M. Harmon, Activation ensembles for deep neural networks, [113] A. Apicella, F. Isgrò, R. Prevete, A simple and efficient architecture for
in: IEEE International Conference on Big Data, 2019, pp. 206–214. trainable activation functions, Neurocomputing 370 (2019) 1–15.
[88] C. Eisenach, Z. Wang, H. Liu, Nonparametrically learning activation [114] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, Z. Liu, Dynamic relu, arXiv
functions in deep neural nets, in: International Conference on Learning preprint arXiv:2003.10027 (2020).
Representations Workshops, 2017. [115] M. Wang, B. Liu, H. Foroosh, Wide hidden expansion layer for deep
[89] C. J. Vercellino, W. Y. Wang, Hyperactivations for activation function convolutional neural networks, in: IEEE Winter Conference on Appli-
exploration, in: Conference on Neural Information Processing Systems cations of Computer Vision, 2020, pp. 934–942.
[116] A. Asif, et al., Learning neural activations, arXiv preprint [140] M. M. Lau, K. H. Lim, Review of adaptive activation function in deep
arXiv:1912.12187 (2019). neural network, in: IEEE-EMBS Conference on Biomedical Engineer-
[117] S. Scardapane, S. Van Vaerenbergh, S. Totaro, A. Uncini, Kafnets: ing and Sciences, 2018, pp. 686–690.
Kernel-based non-parametric activation functions for neural networks, [141] A. K. Dubey, V. Jain, Comparative study of convolution neural network’s
Neural Networks 110 (2019) 19–32. relu and leaky-relu activation functions, in: Applications of Computing,
[118] S. Scardapane, E. Nieddu, D. Firmani, P. Merialdo, Multikernel activa- Automation and Wireless Systems in Electrical Engineering, Springer,
tion functions: formulation and a case study, in: INNS Big Data and 2019, pp. 873–880.
Deep Learning conference, 2019, pp. 320–329. [142] C. Banerjee, T. Mukherjee, E. Pasiliao Jr, An empirical study on general-
[119] S. Scardapane, S. Van Vaerenbergh, A. Hussain, A. Uncini, Complex- izations of the relu activation function, in: ACM Southeast Conference,
valued neural networks with nonparametric activation functions, IEEE 2019, pp. 164–167.
Transactions on Emerging Topics in Computational Intelligence (2018). [143] T. Villmann, J. Ravichandran, A. Villmann, D. Nebel, M. Kaden, Ac-
[120] S. Scardapane, S. Van Vaerenbergh, D. Comminiello, A. Uncini, Widely tivation functions for generalized learning vector quantization-a perfor-
linear kernels for complex-valued kernel activation functions, in: IEEE mance comparison, arXiv preprint arXiv:1901.05995 (2019).
International Conference on Acoustics, Speech and Signal Processing, [144] G. Castaneda, P. Morris, T. M. Khoshgoftaar, Evaluation of maxout acti-
2019, pp. 8528–8532. vations in deep learning across several big data domains, Journal of Big
[121] M. Kobayashi, Singularities of three-layered complex-valued neural net- Data 6 (1) (2019) 72.
works with split activation function, IEEE Transactions on Neural Net- [145] Y. Wang, Y. Li, Y. Song, X. Rong, The influence of the activation func-
works and Learning Systems 29 (5) (2017) 1900–1907. tion in a convolution neural network model of facial expression recogni-
[122] J. Pennington, S. Schoenholz, S. Ganguli, Resurrecting the sigmoid in tion, Applied Sciences 10 (5) (2020) 1897.
deep learning through dynamical isometry: theory and practice, in: Ad- [146] A. Apicella, F. Donnarumma, F. Isgrò, R. Prevete, A survey on modern
vances in Neural Information Processing Systems, 2017, pp. 4785–4795. trainable activation functions, arXiv preprint arXiv:2005.00817 (2020).
[123] E. Sansone, F. G. De Natale, Training feedforward neural networks with [147] T. Szandała, Review and comparison of commonly used activation func-
standard logistic activations is feasible, arXiv preprint arXiv:1710.01013 tions for deep neural networks, in: Bio-inspired Neurocomputing, 2020,
(2017). pp. 203–224.
[124] L. Lu, Y. Shin, Y. Su, G. E. Karniadakis, Dying relu and initializa- [148] A. Krizhevsky, Learning multiple layers of features from tiny images,
tion: Theory and numerical examples, arXiv preprint arXiv:1903.06733 Tech Report, Univ. of Toronto (2009).
(2019). [149] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
[125] D. Arpit, Y. Bengio, The benefits of over-parameterization at initializa- M. Andreetto, H. Adam, Mobilenets: Efficient convolutional neural net-
tion in deep relu networks, arXiv preprint arXiv:1901.03611 (2019). works for mobile vision applications, arXiv preprint arXiv:1704.04861
[126] D. Aguirre, O. Fuentes, Improving weight initialization of relu and out- (2017).
put layers, in: International Conference on Artificial Neural Networks, [150] K. Simonyan, A. Zisserman, Very deep convolutional networks for
2019, pp. 170–184. large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[127] R. Burkholz, A. Dubatovka, Initialization of relus for dynamical isome- [151] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
try, in: Advances in Neural Information Processing Systems, 2019, pp. V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE
2382–2392. Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[128] D. Yarotsky, Error bounds for approximations with deep relu networks, [152] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recog-
Neural Networks 94 (2017) 103–114. nition, in: IEEE Conference on Computer Vision and Pattern Recogni-
[129] R. Arora, A. Basu, P. Mianjy, A. Mukherjee, Understanding deep neural tion, 2016, pp. 770–778.
networks with rectified linear units, arXiv preprint arXiv:1611.01491 [153] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE Con-
(2016). ference on Computer Vision and Pattern Recognition, 2018, pp. 7132–
[130] M. Hein, M. Andriushchenko, J. Bitterwolf, Why relu networks yield 7141.
high-confidence predictions far away from the training data and how to [154] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely con-
mitigate the problem, in: IEEE Conference on Computer Vision and nected convolutional networks, in: Proceedings of the IEEE conference
Pattern Recognition, 2019, pp. 41–50. on computer vision and pattern recognition, 2017, pp. 4700–4708.
[131] S. Goel, S. Karmalkar, A. Klivans, Time/accuracy tradeoffs for learn- [155] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for au-
ing a relu with respect to gaussian marginals, in: Advances in Neural tomatic evaluation of machine translation, in: Proceedings of the 40th
Information Processing Systems, 2019, pp. 8582–8591. annual meeting of the Association for Computational Linguistics, 2002,
[132] S. Dittmer, J. Emily, P. Maass, Singular values for relu layers, IEEE pp. 311–318.
Transactions on Neural Networks and Learning Systems (2019).
[133] A. Kristiadi, M. Hein, P. Hennig, Being bayesian, even just a bit,
fixes overconfidence in relu networks, arXiv preprint arXiv:2002.10118
(2020).
[134] B. Karlik, A. V. Olgac, Performance analysis of various activation func-
tions in generalized mlp architectures of neural networks, International
Journal of Artificial Intelligence and Expert Systems 1 (4) (2011) 111–
122.
[135] G. Alcantara, Empirical analysis of non-linear activation functions
for deep neural networks in classification tasks, arXiv preprint
arXiv:1710.11272 (2017).
[136] H. K. Vydana, A. K. Vuppala, Investigative study of various activation
functions for speech recognition, in: National Conference on Commu-
nications, 2017, pp. 1–5.
[137] D. Pedamonti, Comparison of non-linear activation functions for
deep neural networks on mnist classification task, arXiv preprint
arXiv:1804.02763 (2018).
[138] C. Nwankpa, W. Ijomah, A. Gachagan, S. Marshall, Activation func-
tions: Comparison of trends in practice and research for deep learning,
arXiv preprint arXiv:1811.03378 (2018).
[139] K. Eckle, J. Schmidt-Hieber, A comparison of deep networks with relu
activation function and linear spline-type methods, Neural Networks 110
(2019) 232–242.