Need and Use of Activation Functions in ANN / Deep Learning
Artificial Neural Networks are a powerful tool that mimics the human brain.
Just as our brain has neurons that send and receive signals through axons and synapses,
machines are being developed to act like a human being. Although they cannot fully replicate the brain, they come pretty close.
In an ANN, we compute the sum of products of the inputs (X) and their corresponding weights (W), apply an
activation function f(x) to it to get the output of that layer, and feed that output as an input to the next layer.
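As a minimal sketch of this computation (the layer size, random weights, and the choice of sigmoid here are purely illustrative assumptions), a single layer's output can be computed like this in Python:

    import numpy as np

    def sigmoid(z):
        # squashes any real-valued input into the range (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def layer_output(X, W, b, activation=sigmoid):
        # sum of products of inputs and weights, plus a bias,
        # passed through the activation function f
        z = np.dot(X, W) + b
        return activation(z)

    X = np.array([0.5, -1.0, 2.0])   # 3 inputs
    W = np.random.randn(3, 2)        # weights for 2 neurons
    b = np.zeros(2)
    print(layer_output(X, W, b))     # output fed to the next layer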
If we do not apply an activation function, then the output signal will simply be a linear function of the input. Although a linear
function is easy to solve, it has limited complexity and limited power to learn complex mappings from data.
We do not want our Neural Network just for linear functions; we want it for more complex functionalities. Also, without an
activation function, our Neural Network would not be able to learn and model complicated data such as
images, videos, audio, speech, etc. That is why we use Artificial Neural Network techniques (Deep Learning) to
make sense of complicated, high-dimensional, non-linear big datasets, where the model has many hidden layers in between
and a very complicated architecture which helps us make sense of and extract knowledge from such complicated big datasets.
You will be glad to know that Neural Networks are considered Universal Function Approximators, i.e. a neural
network can approximate virtually any function.
Hence, we need the Activation function to make the Neural networks more powerful and to learn something
complex.
Step Function
Now, if we talk of a range, the first thought that comes to mind is setting a threshold value: if the value is above
the threshold, the neuron is activated, and if it is below the threshold, it is not activated. Sounds easy? Yes! It
is. This function is called the Step Function.
See the figure below to understand more:
The output is 1, i.e. activated, when the value > 0, and 0, i.e. not activated, when the value < 0.
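A minimal sketch of the step function with a threshold of 0, as described above:

    import numpy as np

    def step(x):
        # fires (1) when the value is above the threshold of 0, stays off (0) otherwise
        return np.where(x > 0, 1, 0)

    print(step(np.array([-2.0, 0.5, 3.0])))  # -> [0 1 1]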
Things that seem easy often carry many drawbacks with them, and so is the case here. This function is applicable
only where we have just two classes. Hence, for multi-class problems, we cannot use it. Moreover, the gradient of the
step function is zero, i.e. during backpropagation, when we calculate the error and send it back to improve the model,
the zero gradient reduces the update to zero, which stops the improvement of the model. Hence, the best possible model
will not come out.
Linear Function
We saw the problem of the gradient being zero in the step function. This problem can be overcome by using a linear
function.
f(x) = ax
For example, if we take a = 4, then f(x) = 4x.
Here, the activation is proportional to the input. The problem of having only two classes is overcome here: for any
value of a, the output can take any value proportional to x, and many neurons can be activated at the same time.
Still, we have an issue here. If we take the derivative of this function, it comes out to be:
f'(x) = a
This is a constant: the gradient does not depend on x at all, so during backpropagation every update looks the same
regardless of the input. Moreover, a stack of linear layers is itself just another linear function, so adding layers
gains us nothing.
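A quick sketch of this collapse (the layer sizes and random values are arbitrary assumptions): two stacked layers with a linear activation are identical to a single layer whose weights are the product of the two weight matrices.

    import numpy as np

    rng = np.random.default_rng(0)
    x  = rng.standard_normal(3)
    W1 = rng.standard_normal((3, 4))
    W2 = rng.standard_normal((4, 2))

    # two "layers" with a linear activation f(x) = x ...
    two_layers = (x @ W1) @ W2
    # ... are the same mapping as one layer with merged weights W1 @ W2
    one_layer = x @ (W1 @ W2)

    print(np.allclose(two_layers, one_layer))  # -> True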
Sigmoid Function
It is one of the most widely used activation functions. It is a non-linear, ‘S’ shaped curve. It is of the form:
f(x) = 1/(1 + e^(-x))
For x values roughly between -2 and 2, notice that small changes in x produce large changes in y, i.e. the curve is
steep in this region. The range of the function is 0 to 1.
It is a smooth function and is continuously differentiable. The main advantage it possesses over the step function and
the linear function is that it is non-linear.
If we look at its derivative:
Looking at the graph above, we find that the derivative is smooth and depends on x, i.e. if we calculate the
error, we will get some non-zero value to use for improving the model. However, beyond about -3 to +3, the
function almost flattens out, i.e. the derivative approaches zero, which means the gradient becomes negligible and
the model effectively stops learning. This problem is termed the 'Vanishing Gradient'.
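A small numeric sketch of this behaviour (the sample points are arbitrary): the derivative of the sigmoid is f'(x) = f(x)(1 - f(x)), which peaks at 0.25 at x = 0 and shrinks rapidly as |x| grows.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        # derivative of the sigmoid: f'(x) = f(x) * (1 - f(x))
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in [0.0, 3.0, 6.0, 10.0]:
        print(x, sigmoid_grad(x))
    # the gradient falls from 0.25 at x = 0 towards ~0 as |x| grows,
    # which is the vanishing gradient described above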
One more problem with the sigmoid function is that its range is between 0 and 1, i.e. it is not zero-centred: it
is not symmetric about the origin and all the output values are positive.
It is usually used in the output layer of binary classification, where results correspond to either 0 or 1.
Tanh Function
The tanh function is quite similar to the sigmoid function; it is actually a scaled version of the sigmoid function.
tanh(x)=2sigmoid(2x)-1
or it can be written as:
tanh(x)=2/(1+e^(-2x)) -1
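A tiny sanity check of this identity (purely illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-3, 3, 7)
    # tanh written as a scaled, shifted sigmoid matches NumPy's own tanh
    print(np.allclose(2 * sigmoid(2 * x) - 1, np.tanh(x)))  # -> True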
Its range is between -1 and 1. Hence, it solves the problem of all the values being of the same sign. Its zero-centred
output makes optimization easier, so it is generally preferred over the sigmoid function. But it also suffers from the
Vanishing Gradient Problem.
It is usually used in hidden layers of a neural network.
If we look at the derivative in the graph above, we will find that the tanh function is steeper than the sigmoid
function, so its gradient is larger. Which of the two we need therefore depends entirely on the gradient required.
ReLU Function
ReLU stands for Rectified Linear Unit. It is the most widely used activation function.
f(x)=max(0,x)
The graphical representation of ReLU is:
It may look like the same linear function on the positive axis, but no, ReLU is non-linear, i.e. errors can easily be
backpropagated through it and it can be used across multiple layers of activation. The range of ReLU is [0, inf).
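A minimal sketch of ReLU and its gradient (taking the gradient at exactly 0 to be 0 is a convention assumed here):

    import numpy as np

    def relu(x):
        # keeps positive values unchanged and maps negative values to 0
        return np.maximum(0, x)

    def relu_grad(x):
        # gradient is 1 for positive inputs, 0 for negative inputs
        return np.where(x > 0, 1.0, 0.0)

    x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
    print(relu(x))       # -> [0.  0.  0.  1.5 3. ]
    print(relu_grad(x))  # -> [0. 0. 0. 1. 1.]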
ReLU is less computationally expensive than the tanh and sigmoid functions, as it involves only simple mathematical
operations. It also mitigates the Vanishing Gradient Problem for positive inputs. Nowadays, almost all Deep Learning
models use ReLU.
It does not activate all the neurons at the same time, which is the biggest advantage of ReLU over other activation
functions. Not clear? Look at it this way: in the ReLU function, all negative values are mapped to zero, so those
neurons do not get activated. This means that at any particular time only some of the neurons are activated, making
the network sparse and more efficient.
However, the gradient for values below zero is zero, i.e. in this region the network is not learning and the error is
not backpropagated, which results in Dead Neurons that never get activated.
Leaky ReLU
Leaky ReLU is a modified version of ReLU. It was introduced to solve the problem of neurons dying during training with
the ReLU function: a small slope is given to negative inputs to keep the neuron updates alive.
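A small sketch of the idea (the slope of 0.01 is a common default, assumed here for illustration):

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # like ReLU, but negative inputs keep a small slope alpha,
        # so their gradient is never exactly zero
        return np.where(x > 0, x, alpha * x)

    x = np.array([-3.0, -1.0, 0.0, 2.0])
    print(leaky_relu(x))  # -> [-0.03 -0.01  0.    2.  ]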
A further variant, which generalizes both ReLU and Leaky ReLU, is the Maxout function.
Softmax
Softmax is a non-linear function. It is a generalization of the sigmoid function that is handy when handling
multi-class classification problems.
As we know, for the output layer of binary classification we use the sigmoid function, but for more than two classes
we should use the softmax function.
It assigns a probability to each class. The softmax function squeezes the output for each class between 0 and 1 and
divides each by the sum over all classes, so the results add up to 1. It can be written as:
softmax(x_i) = e^(x_i) / sum_j e^(x_j)
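A minimal sketch of this (subtracting the maximum before exponentiating is a standard numerical-stability trick, added here as an assumption; it does not change the result):

    import numpy as np

    def softmax(z):
        # subtract the max for numerical stability
        e = np.exp(z - np.max(z))
        # each exponentiated score is squeezed into (0, 1) and divided
        # by the sum, so the outputs form a probability distribution
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])
    probs = softmax(scores)
    print(probs, probs.sum())  # class probabilities that sum to 1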
Conclusion:
Which activation function should be used?
From the above, we can conclude that the ReLU activation function is the best choice among all of them for the hidden
layers; it also works well as a general-purpose activation function. If we run into dead neurons, Leaky ReLU should be
used instead.
We should avoid using the tanh function, as it suffers from the vanishing gradient problem, which degrades model
performance.
For a binary classification problem, the sigmoid function is a very natural choice for the output layer; if the
classification is not binary, the softmax function is suggested.