Activation functions
Activation functions introduce non-linearity into the network, allowing it to fit complex patterns in the data. They are applied after the convolution layer, which is a linear operation, so that the network can learn non-linear features. This is crucial because, without activation functions, a CNN would be a linear model, unable to capture intricate relationships between features.
There are various activation functions used in CNNs, with ReLU (Rectified Linear Unit) being one of the most common. ReLU sets every input below zero to exactly zero and leaves positive inputs unchanged, so all negative responses are treated the same. This allows the convolutional layer to create features based on important local combinations of pixels with values greater than zero, without those values getting "squashed" by the activation function[1][3].
The activation function is applied to every single output "pixel" individually, and it is the only source of non-linearity in the network. Without activation functions, the whole neural network would be equivalent to a linear regression model[3].
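As a minimal sketch of this element-wise behavior (using NumPy, with a made-up 4x4 feature map standing in for the output of a convolution layer):

```python
import numpy as np

def relu(x):
    # Element-wise ReLU: negative values become 0, positive values pass through unchanged.
    return np.maximum(0.0, x)

# Hypothetical 4x4 feature map, standing in for a convolution layer's output.
feature_map = np.array([[ 1.2, -0.7,  0.3, -2.1],
                        [-0.4,  0.9, -1.5,  2.4],
                        [ 0.0,  3.1, -0.2,  0.6],
                        [-1.8,  0.5,  1.1, -0.9]])

activated = relu(feature_map)  # same shape; every negative entry is now 0
print(activated)
```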
In summary, activation functions in CNNs are crucial for introducing non-linearity, enabling the network
to learn complex patterns and features in the data. They are applied after the convolution layer and are
essential for the proper functioning of the CNN.
Citations:
[1] https://stats.stackexchange.com/questions/363190/why-we-use-activation-function-after-convolution-layer-in-convolution-neural-net
[2] https://conferences.computer.org/ictapub/pdfs/ITCA2020-6EIiKprXTS23UiQ2usLpR0/114100a429/114100a429.pdf
[3] https://datascience.stackexchange.com/questions/42187/activation-in-convolution-layer
[4] https://www.geeksforgeeks.org/activation-functions-neural-networks/
[5] https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
1. **ReLU (Rectified Linear Unit)**: As described above, ReLU sets negative inputs to exactly zero and passes positive inputs through unchanged. It is one of the most common choices in CNNs.
2. **Sigmoid**: Sigmoid is another commonly used activation function, especially in logistic regression problems. It maps inputs to a range between 0 and 1, making it suitable for tasks where probabilities need to be predicted[1][5].
3. **Tanh (Hyperbolic Tangent)**: Tanh is similar to the sigmoid function but maps inputs to a range between -1 and 1. It is often used for two-class classification tasks[5].
4. **Leaky ReLU**: Leaky ReLU is an improvement over ReLU, addressing the "dying ReLU" problem by
allowing a small gradient for negative inputs. This helps prevent neurons from becoming inactive[5].
These activation functions play a crucial role in introducing non-linearity to CNNs, enabling them to learn
complex patterns and improve performance in various tasks.
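A minimal NumPy sketch of sigmoid, tanh, and leaky ReLU (the sample inputs and the leaky-ReLU slope of 0.01 are arbitrary, illustrative choices):

```python
import numpy as np

def sigmoid(x):
    # Maps any real input to the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Maps any real input to the range (-1, 1).
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but keeps a small slope (alpha) for negative inputs
    # so their gradient is not exactly zero.
    return np.where(x > 0, x, alpha * x)

x = np.linspace(-5, 5, 11)
print(sigmoid(x))
print(tanh(x))
print(leaky_relu(x))
```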
Citations:
[1] https://conferences.computer.org/ictapub/pdfs/ITCA2020-6EIiKprXTS23UiQ2usLpR0/114100a429/114100a429.pdf
[2] https://stats.stackexchange.com/questions/423856/common-activation-function-in-fully-connected-layer
[3] https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/
[4] https://datascience.stackexchange.com/questions/42187/activation-in-convolution-layer
[5] https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
The sigmoid function has the mathematical form f(x) = 1 / (1 + e^(-x)). It takes a real-valued input and squashes it to a value between 0 and 1. The outputs can easily be interpreted as probabilities, making it a natural choice for binary classification problems[3].
However, sigmoid units suffer from the "vanishing gradient" problem, which hampers learning in deep
neural networks. As the input values become significantly positive or negative, the function saturates at
0 or 1, with an extremely flat slope. In these regions, the gradient is very close to zero, resulting in very
small changes in the weights during backpropagation, particularly for neurons in the earlier layers of
deep networks, which makes learning painfully slow or even halts it[3].
In summary, the sigmoid activation function is a smooth, continuously differentiable function that maps real-valued inputs to values between 0 and 1. It is often used in the output layer of binary classification models, where its output can be interpreted as the probability that the input belongs to a particular class. However, sigmoid units suffer from the "vanishing gradient" problem, which hampers learning in deep neural networks[3].
Citations:
[1] https://conferences.computer.org/ictapub/pdfs/ITCA2020-6EIiKprXTS23UiQ2usLpR0/114100a429/114100a429.pdf
[2] https://machinelearningmastery.com/a-gentle-introduction-to-sigmoid-function/
[3] https://www.datacamp.com/tutorial/introduction-to-activation-functions-in-neural-networks
[4] https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
[5] https://www.v7labs.com/blog/neural-networks-activation-functions
When applied to the final layer of a neural network, the softmax function converts the network's raw
output scores or logits into a probability distribution[2]. It exponentiates each logit and normalizes the
results, ensuring that the output values fall between 0 and 1 and sum up to 1[2]. This makes the output
interpretable as class probabilities, simplifying decision-making and enabling efficient optimization
during backpropagation[2].
In summary, the softmax activation function is used in the final layer of Convolutional Neural Networks
(CNNs) for multiclass classification tasks, converting the network's raw output scores or logits into a
probability distribution. This allows for clear and normalized probability distributions across all possible
classes, facilitating decision-making and efficient optimization during backpropagation[2].
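A minimal NumPy sketch of softmax over hypothetical logits for three classes; subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(logits):
    # Shift by the max logit for numerical stability (softmax is shift-invariant),
    # then exponentiate and normalize so the outputs sum to 1.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```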
Citations:
[1] https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
[2] https://www.geeksforgeeks.org/why-should-softmax-be-used-in-cnn/
[3] https://www.youtube.com/watch?v=2yyA5bIGeJM
[4] https://machinelearningmastery.com/softmax-activation-function-with-python/
[5] https://stats.stackexchange.com/questions/423856/common-activation-function-in-fully-connected-layer
The cost function is a measure of the error between the predicted output of a neural network and the desired output for a given set of training examples[1][2][5]. It is a scalar value that quantifies the performance of the neural network, and the goal of training is to minimize this value to improve the accuracy of the network's predictions.
In practice, when dealing with large datasets, the cost function is typically calculated as the average of
the cost functions for all the training examples[3]. This is because the neural network is trained on a
large number of input-output pairs, and the cost function needs to take into account the error for each
of these pairs.
There are different types of cost functions used in neural networks, depending on the specific task and
the nature of the output. For example, the mean squared error (MSE) is commonly used for regression
tasks, where the goal is to predict continuous values[4]. The categorical cross-entropy loss is used for
multi-class classification tasks, where the output belongs to one of several classes[4]. The binary cross-
entropy loss is used for binary classification tasks, where the output is either 0 or 1[4].
The choice of cost function is crucial, as it directly impacts the training process and the model's ability to
learn from data. It's essential to select a cost function that aligns with the task at hand, considering
factors such as data type, output format, and class distribution[4]. Experimentation and validation are
often necessary to determine the most suitable cost function for a given problem.
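As a rough sketch, the three cost functions mentioned above can be written in NumPy as follows (the labels and predictions are made up for illustration, and the eps clipping is added only to avoid taking log(0)):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error, typically used for regression targets.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Binary labels (0/1) against predicted probabilities in (0, 1).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # One-hot labels against per-class softmax probabilities, averaged over examples.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# The dataset-level cost is the average of the per-example costs, as described above.
print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
print(categorical_cross_entropy(np.array([[1, 0, 0], [0, 1, 0]]),
                                np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])))
```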
Citations:
[1] https://viso.ai/deep-learning/artificial-neural-network/
[2] https://towardsdatascience.com/step-by-step-the-math-behind-neural-networks-490dc1f3cfd9
[3] https://datascience.stackexchange.com/questions/108669/in-practice-what-is-the-cost-function-of-a-neural-network
[4] https://www.geeksforgeeks.org/neural-networks-which-cost-function-to-use/
[5] https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications