Activation Functions


What is an Activation Function?

An activation function is the function used to compute the output of a node. It is also known as a Transfer Function.

Why do we use activation functions in neural networks?

An activation function determines the output of a neural network node, for example a yes or no decision. It maps the resulting values into a range such as 0 to 1 or -1 to 1, depending on the function.

Activation functions can be broadly divided into two types:

1. Linear Activation Function

2. Non-linear Activation Functions

Linear or Identity Activation Function

The function is a straight line, so its output is not confined to any range.

Equation: f(x) = x

Range: (-infinity, infinity)

A linear activation does not help the network capture the complexity of typical input data: stacking layers that use only linear activations still produces a single linear mapping.
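To make this point concrete, here is a small illustrative sketch of my own in Python with NumPy (the weight matrices are arbitrary example values, not from the notes): it defines the identity activation and shows that two stacked linear layers are equivalent to a single linear mapping.

import numpy as np

def linear(x):
    # Identity activation: f(x) = x, range (-infinity, infinity)
    return x

# Two layers that use only the linear activation collapse into one linear map.
W1 = np.array([[2.0, 0.0], [0.0, 3.0]])   # arbitrary example weights
W2 = np.array([[1.0, 1.0], [0.0, 1.0]])
x = np.array([1.0, -2.0])

out_stacked = linear(W2 @ linear(W1 @ x))
out_single = (W2 @ W1) @ x                 # same result, a single linear layer
print(out_stacked, out_single)             # [-4. -6.] [-4. -6.]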

Non-linear Activation Function

Non-linear activation functions are the most widely used activation functions. Non-linearity gives the function a curved graph, as shown below.

Fig: Non-linear Activation Function

Non-linearity makes it easier for the model to generalize and adapt to a variety of data and to distinguish between outputs.

The main terms needed to understand non-linear functions are:

Derivative or differential: the change along the y-axis with respect to the change along the x-axis, also known as the slope.

Monotonic function: a function that is either entirely non-increasing or entirely non-decreasing.

Non-linear activation functions are mainly divided on the basis of their range or the shape of their curves:

1. Sigmoid or Logistic Activation Function

The sigmoid function curve has an S-shape.

Fig: Sigmoid Function

The main reason we use the sigmoid function is that its output lies between 0 and 1. It is therefore especially useful for models that must predict a probability as the output: since any probability lies between 0 and 1, sigmoid is a natural choice.

The function is differentiable, which means we can find the slope of the sigmoid curve at any point.

The function is monotonic, but its derivative is not.

The logistic sigmoid function can cause a neural network to get stuck during training, because its gradient becomes very small for large positive or negative inputs.

The softmax function is a generalization of the logistic activation function that is used for multiclass classification.
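As an illustration (my own addition, not part of the original notes), a short NumPy sketch of the sigmoid, its derivative, and the softmax generalization described above:

import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1), which suits probability outputs.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # Slope of the sigmoid curve; it peaks at x = 0, so it is not monotonic.
    s = sigmoid(x)
    return s * (1.0 - s)

def softmax(z):
    # Generalized logistic function for multiclass classification.
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # approx [0.12, 0.5, 0.88]
print(softmax(np.array([1.0, 2.0, 3.0])))    # three probabilities that sum to 1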

2. Tanh or hyperbolic tangent Activation Function

tanh is similar to the logistic sigmoid but often works better. The range of the tanh function is (-1, 1), and it is also sigmoidal (S-shaped).

Fig: tanh v/s Logistic Sigmoid

The advantage is that negative inputs are mapped to strongly negative outputs, while inputs near zero are mapped near zero.

The function is differentiable.

The function is monotonic, while its derivative is not.

The tanh function is mainly used for classification between two classes.

Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
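A small illustrative comparison (my own sketch, not from the notes) showing how tanh maps negative inputs strongly negative while the logistic sigmoid keeps everything in (0, 1):

import numpy as np

def tanh(x):
    # Zero-centred S-shaped curve with range (-1, 1).
    return np.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(tanh(x))      # negative inputs map close to -1, zero maps to 0
print(sigmoid(x))   # the same inputs stay between 0 and 1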

3. ReLU (Rectified Linear Unit) Activation Function

ReLU is currently the most widely used activation function; it appears in almost all convolutional neural networks and deep learning models.

Fig: ReLU v/s Logistic Sigmoid

As the figure shows, ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) equals z when z is greater than or equal to zero.

Range: [0, infinity)

Both the function and its derivative are monotonic.

The issue is that all negative values become zero immediately, which reduces the model's ability to fit the data properly: any negative input is mapped to zero, so negative values are not represented appropriately and the corresponding neurons receive no gradient. This is known as the dying ReLU problem.
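The behaviour described above can be sketched in a few lines (an illustration of my own, not from the notes); the derivative shows why neurons that only receive negative inputs stop updating:

import numpy as np

def relu(z):
    # f(z) = 0 for z < 0 and f(z) = z for z >= 0; range [0, infinity).
    return np.maximum(0.0, z)

def relu_derivative(z):
    # The gradient is 0 for negative inputs, so those units receive no update.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))              # [0.  0.  0.  0.5 2. ]
print(relu_derivative(z))   # [0. 0. 0. 1. 1.]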

4. Leaky ReLU

It is an attempt to solve the dying ReLU problem


Fig : ReLU v/s Leaky ReLU

Can you see the Leak?

The leak helps to increase the range of the ReLU function. Usually, the
value of a is 0.01 or so.

When a is not 0.01 then it is called Randomized ReLU.

Therefore the range of the Leaky ReLU is (-infinity to infinity).

Both Leaky and Randomized ReLU functions are monotonic in nature.


Also, their derivatives also monotonic in nature.
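A minimal sketch of Leaky ReLU (my own illustration), using the common default slope a = 0.01 mentioned above for negative inputs:

import numpy as np

def leaky_relu(z, a=0.01):
    # Negative inputs are scaled by a small slope a instead of being zeroed,
    # so the range becomes (-infinity, infinity) and the gradient never dies.
    return np.where(z > 0, z, a * z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(leaky_relu(z))   # [-0.1  -0.01   0.     1.    10.  ]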

Perceptrons

 - A Perceptron is an Artificial Neuron.
 - It is the simplest possible Neural Network.
 - Neural Networks are the building blocks of Machine Learning.

The original Perceptron was designed to take a number of binary inputs and produce one binary output (0 or 1).

The idea was to use different weights to represent the importance of each input, and to require that the weighted sum exceed a threshold value before making a decision such as true or false (0 or 1).
Perceptron Example

Imagine a perceptron (in your brain).

The perceptron tries to decide if you should go to a concert.

Is the artist good? Is the weather good?

What weights should these facts have?

The Perceptron Algorithm

Frank Rosenblatt suggested this algorithm:

1. Set a threshold value
2. Multiply all inputs with their weights
3. Sum all the results
4. Activate the output

Criteria             Input          Weight
Artist is Good       x1 = 0 or 1    w1 = 0.7
Weather is Good      x2 = 0 or 1    w2 = 0.6
Friend will Come     x3 = 0 or 1    w3 = 0.5
Food is Served       x4 = 0 or 1    w4 = 0.3
Alcohol is Served    x5 = 0 or 1    w5 = 0.4

Applying the algorithm to the criteria above:

1. Set a threshold value:

 - Threshold = 1.5

2. Multiply all inputs with their weights:

 - x1 * w1 = 1 * 0.7 = 0.7
 - x2 * w2 = 0 * 0.6 = 0
 - x3 * w3 = 1 * 0.5 = 0.5
 - x4 * w4 = 0 * 0.3 = 0
 - x5 * w5 = 1 * 0.4 = 0.4

3. Sum all the results:

 - 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (the weighted sum)

4. Activate the output:

 - Return true if the sum > 1.5 ("Yes, I will go to the concert")
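The worked example can be written directly in code; this is a small sketch of my own using the weights, inputs, and threshold from the example above:

# Perceptron decision for the concert example.
weights = [0.7, 0.6, 0.5, 0.3, 0.4]   # artist, weather, friend, food, alcohol
inputs = [1, 0, 1, 0, 1]              # binary yes/no answers
threshold = 1.5

weighted_sum = sum(x * w for x, w in zip(inputs, weights))   # 0.7 + 0.5 + 0.4 = 1.6
output = 1 if weighted_sum > threshold else 0                # 1 means "Yes, I will go"
print(weighted_sum, output)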

Perceptron Terminology

 - Perceptron Inputs
 - Node Values
 - Node Weights
 - Activation Function

Perceptron Inputs

Perceptron inputs are called nodes.

The nodes have both a value and a weight.

Node Values

In the example above, the node values are: 1, 0, 1, 0, 1

The binary input values (0 or 1) can be interpreted as (no or yes) or (false or true).

Node Weights

Weights show the strength of each node.

In the example above, the node weights are: 0.7, 0.6, 0.5, 0.3, 0.4

The Activation Function

The activation function maps the result (the weighted sum) to a required value such as 0 or 1.

In the example above, the activation function is simple: (sum > 1.5).

The binary output (1 or 0) can be interpreted as (yes or no) or (true or false).

Neural Networks

The Perceptron defines the first step into Neural Networks.

Multi-Layer Perceptrons can be used for very sophisticated decision making.

In the Neural Network Model, input data (yellow) are processed against
a hidden layer (blue) and modified against more hidden layers (green) to
produce the final output (red).

The First Layer:


The 3 yellow perceptrons are making 3 simple decisions based on the
input evidence. Each single decision is sent to the 4 perceptrons in the
next layer.

The Second Layer:

The blue perceptrons make decisions by weighing the results from the first layer. This layer can make more complex decisions at a more abstract level than the first layer.

The Third Layer:

Even more complex decisions are made by the green perceptrons.
Perceptron

Although today the Perceptron is widely recognized as an algorithm, it was initially intended as an image recognition machine. It gets its name from performing the human-like function of perception, seeing and recognizing images.

In particular, interest has been centered on the idea of a machine which would be capable of conceptualizing inputs impinging directly from the physical environment of light, sound, temperature, etc. — the "phenomenal world" with which we are all familiar — rather than requiring the intervention of a human agent to digest and code the necessary information.[4]

Rosenblatt's perceptron machine relied on a basic unit of computation, the neuron. Just like in previous models, each neuron has a cell that receives a series of pairs of inputs and weights.

The major difference in Rosenblatt's model is that inputs are combined in a weighted sum and, if the weighted sum exceeds a predefined threshold, the neuron fires and produces an output.

Perceptron's neuron model (left) and threshold logic (right). (Image by author)

Threshold T represents the activation function: if the weighted sum of the inputs exceeds the threshold, the neuron outputs the value 1; otherwise the output value is zero.

Multilayer Perceptron

The Multilayer Perceptron was developed to tackle this limitation. It is a


neural network where the mapping between inputs and output is non-
linear.
A Multilayer Perceptron has input and output layers, and one or
more hidden layers with many neurons stacked together. And while in
the Perceptron the neuron must have an activation function that imposes a
threshold, like ReLU or sigmoid, neurons in a Multilayer Perceptron can
use any arbitrary activation function.

Multilayer Perceptron. (Image by author)

The Multilayer Perceptron falls under the category of feedforward algorithms, because inputs are combined with the initial weights in a weighted sum and subjected to the activation function, just like in the Perceptron. The difference is that each linear combination is propagated to the next layer.

Each layer feeds the next one with the result of its computation, its internal representation of the data. This goes all the way through the hidden layers to the output layer. But there is more to it.
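To illustrate the forward pass described above, here is a compact sketch of my own (the layer sizes, random weights, and the choice of sigmoid are illustrative assumptions, not values from the notes):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, layers):
    # Each layer computes a weighted sum of the previous layer's output,
    # applies the activation, and feeds the result to the next layer.
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 3, 1]   # input layer, two hidden layers, output layer (illustrative)
layers = [(rng.normal(size=(m, n)), rng.normal(size=m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = np.array([1.0, 0.0, 1.0])
print(forward(x, layers))   # final output of the network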

If the algorithm only computed the weighted sums in each neuron, propagated results to the output layer, and stopped there, it wouldn't be able to learn the weights that minimize the cost function. If the algorithm only computed one iteration, there would be no actual learning.
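To make the role of repeated iterations concrete, here is a minimal sketch of the classic perceptron learning rule (not the MLP's backpropagation); the AND dataset, learning rate, and epoch count are illustrative assumptions of mine:

import numpy as np

# Toy dataset: learn the logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(10):                  # repeated passes over the data drive the learning
    for xi, target in zip(X, y):
        pred = 1.0 if w @ xi + b > 0 else 0.0
        error = target - pred
        w += lr * error * xi             # adjust weights in proportion to the error
        b += lr * error

print(w, b)   # weights and bias that separate AND after a few epochs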
