DL MATERIAL UNIT 2
Deep learning models are able to identify the relevant features themselves,
requiring only a little guidance from the programmer, and are very helpful in
dealing with the problem of dimensionality. Deep learning algorithms are used
especially when we have a huge number of inputs and outputs.
Since deep learning evolved from machine learning, which itself is a subset of
artificial intelligence, and since the idea behind artificial intelligence is to mimic
human behavior, the idea of deep learning is likewise "to build algorithms that
can mimic the brain".
Deep learning is implemented with the help of neural networks, and the
motivation behind neural networks is the biological neuron, which is nothing
but a brain cell.
In the example given above, we provide the raw image data to the input layer.
This input layer then determines patterns of local contrast, that is, it
differentiates on the basis of colors, luminosity, and so on. The first hidden
layer then determines facial features, i.e., it fixates on the eyes, nose, and lips,
and matches those features against the correct face template. The second hidden
layer then determines the correct face, as can be seen in the image above, after
which the result is sent to the output layer. Likewise, more hidden layers can be
added to solve more complex problems, for example, if you want to identify a
particular kind of face, such as one with a darker or lighter complexion. So, as
the number of hidden layers increases, we are able to solve more complex
problems.
Biological vision and deep learning are two different approaches to achieving
visual perception, but deep learning has been inspired by certain aspects of
biological vision.
2.2.1 BIOLOGICAL VISION:
Projecting slides onto a screen, Hubel and Wiesel began by presenting simple
shapes like the dot shown in Figure to the cats.
Their initial results were disheartening: Their efforts were met with no response
from the neurons of the primary visual cortex. They grappled with the
frustration of how these cells, which anatomically appear to be the gateway for
visual information to the rest of the cerebral cortex, would not respond to visual
stimuli. Distraught, Hubel and Wiesel tried in vain to stimulate the neurons by
jumping and waving their arms in front of the cat. Nothing. And then, as with
many of the great discoveries, from X-rays to penicillin to the microwave oven,
Hubel and Wiesel made a serendipitous observation: As they removed one of
their slides from the projector, its straight edge elicited the distinctive crackle of
their recording equipment to alert them that a primary visual cortex neuron was
firing. Overjoyed, they celebrated up and down the Johns Hopkins laboratory
corridors. The serendipitously crackling neuron was not an anomaly. Through
further experimentation, Hubel and Wiesel discovered that the neurons that
receive visual input from the eye are in general most responsive to simple,
straight edges. Fittingly then, they named these cells simple neurons.
As depicted in Figure 1.5 below, Hubel and Wiesel's research revealed
that individual simple neurons exhibit optimal responsiveness to edges oriented
at specific angles. When a multitude of these specialized simple neurons, each
attuned to a distinct edge orientation, work together, they collectively cover
all 360 degrees of possible orientations. The task of these simple neurons, which
are adept at detecting edge orientations, is to transmit their findings to a
significant number of complex neurons. Complex neurons, in turn, receive
visual input that has already undergone processing by numerous simple cells.
Consequently, complex neurons are strategically situated to amalgamate
multiple line orientations, creating more intricate shapes such as corners or
curves.
Figure 1.5: A "simple" cell in the primary visual cortex of a cat fires at different
rates, depending on the orientation of a line shown to the cat. The orientation of
the line is provided in the left-hand column of the figure, while the right-hand
column shows the firing (electrical activity) in the cell over time (one second).
A vertical line (in the fifth row) causes the most electrical activity for this
particular simple cell. Lines slightly off vertical (in the intermediate rows) cause
less activity for the cell, while lines approaching horizontal (in the topmost and
bottommost rows) cause little to no activity.
In Figure 1.5, we observe the behavior of a "simple" cell situated in the primary
visual cortex of a cat. This cell exhibits varying rates of firing, depending on the
orientation of a line presented to the cat. The left-hand column of the figure
provides information about the line's orientation, while the right-hand column
illustrates the cell's firing activity over a one-second interval.
Specifically, when a vertical line is presented (as seen in the fifth row of
the figure), it triggers the highest level of electrical activity in this particular
simple cell. Lines that deviate slightly from the vertical orientation (found in the
intermediate rows) elicit reduced activity in the cell. Conversely, lines
approaching a horizontal alignment (situated in the top-most and bottom-most
rows) result in minimal to no activity in the cell.
Today, through countless subsequent recordings from the cortical
neurons of brain surgery patients as well as noninvasive techniques like
magnetic resonance imaging, neuroscientists have pieced together a fairly high-
resolution map of regions that are specialized to process particular visual
stimuli, e.g., color, motion, faces (see Figure 1.7).
2.2.2 MACHINE VISION:
2.2.2.2 LeNet-5
While the neocognitron was capable of, for example, identifying handwritten
characters, the accuracy and efficiency of Yann LeCun's (Figure 1.10) and
Yoshua Bengio's (Figure 1.11) LeNet-5 model made it a significant
development. LeNet-5's hierarchical architecture (Figure 1.12) built on
Fukushima's lead and the biological inspiration uncovered by Hubel and
Wiesel. In addition, LeCun and his colleagues benefited from superior data for
training their model, faster processing power and, critically, the
backpropagation algorithm.
What distinguished LeNet-5, developed by Yann LeCun and his colleagues, was
its ability to accurately predict handwritten digits without requiring its creators
to encode explicit knowledge about handwriting in their code. This aspect
highlights a fundamental difference between deep learning and traditional
machine learning approaches.
ImageNet:
ImageNet is a large-scale labeled image dataset, and the annual ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) run on it became the benchmark
for machine vision. AlexNet, which won the 2012 challenge by a wide margin,
had several distinguishing characteristics:
Deep Architecture: AlexNet was one of the first deep neural networks with
multiple convolutional and fully connected layers. It consisted of eight learned
layers, including five convolutional layers and three fully connected layers.
Local Response Normalization (LRN): AlexNet included LRN layers after the
first and second convolutional layers. LRN helped improve generalization by
providing a form of lateral inhibition, enhancing the response of neurons that
were most activated relative to their neighbors.
Dropout: Dropout was applied to the fully connected layers during training to
prevent overfitting. It randomly deactivated a fraction of neurons during each
training iteration.
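To make these architectural ideas concrete, here is a much-simplified AlexNet-style sketch in Keras. This is not the original AlexNet code: the number of convolutional layers and the dense-layer sizes are reduced for illustration, and LRN is applied via TensorFlow's tf.nn.local_response_normalization op wrapped in a Lambda layer.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(96, 11, strides=4, activation='relu', input_shape=(227, 227, 3)),
    layers.Lambda(tf.nn.local_response_normalization),   # LRN after the 1st conv layer
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(256, 5, padding='same', activation='relu'),
    layers.Lambda(tf.nn.local_response_normalization),   # LRN after the 2nd conv layer
    layers.MaxPooling2D(3, strides=2),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),                # fully connected layers
    layers.Dropout(0.5),                                 # dropout to reduce overfitting
    layers.Dense(10, activation='softmax')])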
Human language and machine language in the context of deep learning refer to
how natural human language is processed and understood by computers using
artificial intelligence techniques.
In the Venn diagram in Figure 2.1, we show how deep learning resides within the
machine learning family of representation learning approaches. The
representation learning family, which contemporary deep learning dominates,
includes any techniques that learn features from data automatically. Indeed, we
can use the terms “feature” and “representation” interchangeably.
Figure 2.6 provides a cartoon of a vector space. The space can have any number
of dimensions, so we can call it an n-dimensional vector space.
Here's an example:
Suppose we have word vectors for "apple," "banana," and "cherry" as follows:
Now, the sentence "I like banana" can be represented as the sum or average of
the word vectors of its constituent words:
"I like banana" → ([0, 0, 0] + [0, 0, 0] + [0.4, 0.6, -0.2]) / 3 = [0.133, 0.2, -
0.067]
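For illustration, the averaging above can be reproduced with a few lines of NumPy. The zero vectors for "I" and "like" mirror the values used in the sum above; real word vectors would be non-zero and much higher-dimensional.
import numpy as np

word_vectors = {
    "I":      np.array([0.0, 0.0, 0.0]),
    "like":   np.array([0.0, 0.0, 0.0]),
    "banana": np.array([0.4, 0.6, -0.2]),
}

sentence = "I like banana".split()
sentence_vector = np.mean([word_vectors[w] for w in sentence], axis=0)
print(sentence_vector)   # approximately [0.133, 0.2, -0.067]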
Suppose we have pre-trained word vectors for the words "king," "man,"
"woman," and "queen" in a semantic vector space. Each word is represented as
a vector of real numbers. For simplicity, let's assume these vectors are three-
dimensional (the values below are implied by the arithmetic that follows):
king = [0.4, 0.3, 0.9], man = [0.2, 0.1, 0.5], woman = [0.1, 0.4, 0.6]
king - man: Subtract the vector for "man" from the vector for "king."
Result: [0.4 - 0.2, 0.3 - 0.1, 0.9 - 0.5] = [0.2, 0.2, 0.4]
king - man + woman: Add the vector for "woman" to the result from step 1.
Result: [0.2 + 0.1, 0.2 + 0.4, 0.4 + 0.6] = [0.3, 0.6, 1.0]
The resulting vector [0.3, 0.6, 1.0] represents a point in the vector space. We
can interpret this vector as a word, and it is often the closest word vector to this
point in the vector space. In this case, the closest word vector is "queen."
So, the word vector arithmetic "king - man + woman" yields a vector that is
closest in meaning to "queen." This demonstrates that word vectors capture
semantic relationships, as the operation effectively computes the word "queen"
by reasoning about the relationships between "king," "man," and "woman."
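The same arithmetic can be checked with NumPy, using the three-dimensional toy vectors implied by the calculation above:
import numpy as np

king  = np.array([0.4, 0.3, 0.9])
man   = np.array([0.2, 0.1, 0.5])
woman = np.array([0.1, 0.4, 0.6])

result = king - man + woman
print(result)   # [0.3, 0.6, 1.0] -- the point closest to the vector for "queen"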
1. an input layer consisting of 784 neurons, one for each of the 784 pixels in an
MNIST image
3. an output layer consisting of 10 softmax neurons, one for each of the ten
classes of digits
Of these three, the input layer is the most straightforward to detail. Its size in
the network architecture corresponds directly to the shape of the input data.
There are many kinds of hidden layers. The most general type is the
dense layer, which can also be called a fully connected layer. Dense layers are
found in many deep learning architectures.
1. An input layer with two neurons: one for storing the vertical position of
a given dot within the grid on the far right, and the other for storing the dot’s
horizontal position.
2. A hidden layer composed of eight ReLU neurons. Visually, we can see
that this is a dense layer because each of the eight neurons in it is connected to
(i.e., receives information from) both of the input-layer neurons, for a total of
16 (= 8 × 2) incoming connections.
3. Another hidden layer composed of eight ReLU neurons. We can again
discern that this is a dense layer because its eight neurons each receive input
from each of the eight neurons in the previous layer, for a total of 64 (= 8 × 8)
inbound connections. Note how the neurons in this layer are nonlinearly
recombining the straight-edge features provided by the neurons in the first
hidden layer to produce more elaborate features like curves and circles.
4. A third dense hidden layer, this one consisting of four ReLU neurons
for a total of 32 (= 4 × 8) connecting inputs. This layer nonlinearly recombines
the features from the previous hidden layer to learn more complex features that
begin to look directly relevant to the binary (orange versus blue) classification
problem shown in the grid on the right.
5. A fourth and final dense hidden layer. With its two ReLU neurons, it
receives a total of eight (= 2 × 4) inputs from the previous layer. The neurons in
this layer devise such elaborate features via nonlinear recombination that they
visually approximate the overall boundary dividing blue from orange on the
grid.
6. An output layer made up of a single sigmoid neuron. Sigmoid is the
typical choice of neuron for a binary classification problem like this one. The
sigmoid function outputs activations that range from zero up to one, allowing us
to obtain the network's estimated probability that a given input x is a
positive case (a blue dot in the current example) or, inversely, the probability
that it is a negative case. Like the hidden layers, the output layer is dense too: Its
neuron receives information from both neurons of the final hidden layer for a
total of two (= 1 × 2) connections. (A Keras sketch of this stack appears below.)
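Assuming the layer sizes described above, the six-layer network could be sketched in Keras as follows; this is meant only to show how the layers stack, not to reproduce any particular notebook:
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(8, activation='relu', input_shape=(2,)),   # 1st hidden layer; 2 inputs
    layers.Dense(8, activation='relu'),                     # 2nd hidden layer
    layers.Dense(4, activation='relu'),                     # 3rd hidden layer
    layers.Dense(2, activation='relu'),                     # 4th hidden layer
    layers.Dense(1, activation='sigmoid')])                 # output neuron
model.summary()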
Ex: A HOT DOG-DETECTING DENSE NETWORK:
Let's further strengthen our comprehension of dense networks by returning to
two old flames of ours: a frivolous hot dog-detecting binary classifier and the
mathematical notation we used to define artificial neurons. As shown in Figure
7.1, our hot dog classifier is no longer a single neuron; in this chapter, it is a
dense network of artificial neurons.
For any given instance i, we calculate the difference (the error) between the true
label y and the network's estimated ŷ. We then square this difference, for two
reasons:
1. Squaring ensures that whether y is greater than ŷ or vice versa, the difference
between the two is stated as a positive value.
2. Squaring penalizes large differences between y and ŷ much more severely
than small differences.
Having obtained a squared error for each instance i with (yᵢ − ŷᵢ)², we can at last
calculate the mean cost C across all n of our instances by Equation 8.1:
C = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
By taking a peek inside the Quadratic Cost Jupyter notebook from the book’s
GitHub repo, you can play around with Equation 8.1 yourself. At the top of the
notebook, we define a function to calculate squared error for an instance i:
import numpy as np

def squared_error(y, y_hat):
    return (y - y_hat) ** 2   # squared error for a single instance i

# Example usage with illustrative values for two instances:
y = np.array([1.0, 0.0])
y_hat = np.array([0.9, 0.2])
squared_errors = squared_error(y, y_hat)
mse = np.mean(squared_errors)
print(mse)   # output: 0.025 (the mean of 0.01 and 0.04)
2.5.2 Cross-Entropy:
The Cross-Entropy Cost, also known as the Cross-Entropy Loss or Log Loss, is
a widely used loss function in neural networks, especially for classification
tasks. It's particularly effective when dealing with problems involving multiple
classes. The Cross-Entropy Cost measures the dissimilarity between the
predicted class probabilities and the true class probabilities.
Equation 8.3 could be expressed with the activation a substituted in for ŷ, so that the equation
generalizes neatly beyond the output layer to neurons in any layer of a network:
C = −[y ln(a) + (1 − y) ln(1 − a)]
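As a quick illustration (not the book's notebook code), the following snippet computes the binary cross-entropy cost for a single instance with made-up values, showing that a confident wrong prediction is penalized far more heavily than a confident correct one:
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy(1, 0.9))   # ~0.105: confident and correct, small cost
print(cross_entropy(1, 0.1))   # ~2.303: confident but wrong, large cost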
In gradient descent, when updating the parameters, the learning rate η scales the
magnitude of the gradient vector. Here's how the update equation looks for a
single parameter w:
w ← w − η · ∂C/∂w
Choosing a learning rate that is too small might lead to slow convergence,
requiring many iterations for the algorithm to reach an optimal solution. On the
other hand, using a learning rate that is too large might result in overshooting
the optimal point and even divergence, where the algorithm fails to converge.
Small Learning Rate: A small learning rate leads to slow but stable
convergence. It allows for fine-grained adjustments, which can be useful when
the cost function landscape is complex or has many local optima.
Moderate Learning Rate: This range often provides a good balance between
convergence speed and stability. It's a reasonable starting point when trying to
find an appropriate learning rate.
Selecting the right learning rate often involves experimentation and trial and
error. Techniques such as grid search or random search over a range of learning
rates can help find a suitable value. Additionally, modern optimization
algorithms (like Adam) often incorporate adaptive learning rate mechanisms
that mitigate the need for manual tuning.
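The effect of the learning rate can be seen in a toy sketch: minimizing C(w) = w² with the update w ← w − η · dC/dw. The specific η values below are illustrative, not recommendations.
def descend(eta, steps=20, w=5.0):
    # repeatedly apply w <- w - eta * dC/dw, with C(w) = w**2 so dC/dw = 2*w
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(descend(eta=0.01))   # small learning rate: still far from the minimum at 0
print(descend(eta=0.4))    # moderate learning rate: converges quickly
print(descend(eta=1.1))    # too large: each step overshoots further, so it diverges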
Batch size:
The batch size is the number of training instances used to estimate the gradient in each
round of stochastic gradient descent: larger batches give smoother, more accurate gradient
estimates at a higher computational cost per update, while smaller batches are noisier but
cheaper and allow the parameters to be updated more often.
BACKPROPAGATION:
While stochastic gradient descent operates well on its own to adjust parameters
and minimize cost in many types of machine learning models, for deep learning
models in particular there is an extra hurdle: We need to be able to efficiently
adjust parameters through multiple layers of artificial neurons. To do this,
stochastic gradient descent is partnered up with a method called
backpropagation. Backpropagation—or backprop for short—is an elegant
application of the “chain rule” from calculus. As shown along the bottom of
Figure and as suggested by its very name, backpropagation courses through a
neural network in the opposite direction of forward propagation.
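As a small illustration of the chain rule at work (a sketch, not the full backpropagation algorithm), consider a single sigmoid neuron with a quadratic cost; the input, label, and parameter values are made up:
import numpy as np

x, y = 1.5, 1.0                 # input and true label (illustrative values)
w, b = 0.3, 0.1                 # current parameter values

z = w * x + b                   # forward propagation
a = 1 / (1 + np.exp(-z))        # sigmoid activation, i.e., the prediction y_hat
C = (y - a) ** 2                # quadratic cost

dC_da = -2 * (y - a)            # derivative of the cost with respect to a
da_dz = a * (1 - a)             # derivative of the sigmoid with respect to z
dz_dw = x                       # derivative of z with respect to w
dC_dw = dC_da * da_dz * dz_dw   # chain rule: the gradient SGD uses to update w
print(dC_dw)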
In this notebook, we’re going to simulate 784 pixel values as inputs to a single dense
layer of artificial neurons. The inspiration behind our simulation of these 784 inputs comes of
course from our beloved MNIST digits. For the number of neurons in the dense layer, we
picked a number large enough so that, when we make some plots later on, they look pretty:
n_input = 784
n_dense = 256
Now, for the primary point of this section: the initialization of the network parameters
w and b. Before we begin passing training data into our network, we’d like to start with
reasonably scaled parameters. This is because:
1. Large w and b values tend to correspond to larger z values, and therefore saturated
neurons.
2. Large parameter values would imply that the network has a strong opinion about how x is
related to y—before any training on data has occurred, any such strong opinions are wholly
unmerited.
Parameter values of zero, on the other hand, imply the weakest opinion on how x is related to
y. To bring back the fairy tale yet again, we're aiming for a Goldilocks-style, middle-of-the-road
approach that starts training off from a balanced and learnable beginning. With that in mind,
let’s use the TensorFlow zeros() method to initialize the 256 neurons in our dense layer with
b = 0:
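(A sketch of that call, continuing with n_dense = 256 from above; the notebook's exact code may differ slightly.)
import tensorflow as tf

b = tf.Variable(tf.zeros([n_dense]))   # all 256 biases start at zero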
The vast majority of the parameters in a typical network are weights; relatively few
are biases. As such, it’s acceptable (indeed, it’s the most common practice) to initialize biases
with zeros and the weights with a range of values near zero. One straightforward way to
generate random values near zero is to use TensorFlow’s random_normal() method to sample
values from a normal distribution like so:
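(For example; the standard deviation of 0.1 is an illustrative choice, and in TensorFlow 2 the method is named tf.random.normal.)
W = tf.Variable(tf.random.normal([n_input, n_dense], stddev=0.1))   # small random weights near zero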
To observe the impact of the weight initialization we’ve chosen, we write some
code to represent our dense layer of neurons:
If you decompose the first line, you can see that it is our "most important equation",
z = w · x + b:
tf.matmul(x, W) uses the TensorFlow matrix multiplication operation to calculate the dot
product w · x.
In the following lines of code, we use the NumPy random() method to feed 784 random
numbers as inputs into our dense layer of 256 neurons, returning the 256 activations a to a
variable named layer_output:
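(A sketch of those lines, continuing with the W and b defined above; a sigmoid activation is assumed here.)
import numpy as np

x = np.random.random([1, n_input]).astype('float32')   # 784 simulated pixel values
z = tf.matmul(x, W) + b                                 # z = w . x + b for all 256 neurons
layer_output = tf.sigmoid(z)                            # the 256 activations a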
Xavier Glorot Distributions:
In deep-learning circles, popular distributions for sampling weight-initialization values were devised by
Xavier Glorot and Yoshua Bengio. These Glorot distributions, as they are typically called, are tailored
such that sampling from them will lead to neurons initially outputting small z values. Let's examine
them in action. By replacing the normal-distribution-sampling code shown earlier in our First
TensorFlow Neurons notebook with the following line, we sample from a Glorot distribution instead:
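(In TensorFlow 2 this can be done with the built-in Glorot initializer; the notebook's original call may differ.)
W = tf.Variable(tf.keras.initializers.GlorotNormal()([n_input, n_dense]))   # Glorot-sampled weights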
Vanishing Gradients:
Recall that using the cost C between the network’s predicted ŷ and the true y,
backpropagation works its way from the output layer toward the input layer, adjusting
network parameters with the aim of minimizing cost. As exemplified by the mountaineering
trilobite, the parameters are each adjusted in proportion to their gradient with respect to cost:
If, for example, the gradient of a parameter (with respect to the cost) was large and positive,
this implies that the parameter contributes a large amount to the cost and so decreasing it
proportionally would correspond to a decrease in cost. In the hidden layer that is closest to
the output layer, the relationship between its parameters and cost is the most direct. The
further away a hidden layer is from the output layer, the more muddled the relationship
between its parameters and cost becomes. The impact of this is that, as we move from the
final hidden layer toward the first hidden layer, the gradient of a given parameter relative to
cost tends to flatten—it gradually vanishes. Because of the vanishing gradient problem, if we
were to naïvely add more and more hidden layers to our neural network, eventually the
hidden layers furthest from the output would not be able to learn to any extent, crippling the
capacity of the network as a whole to learn to approximate y given x.
Exploding Gradients:
While they occur much less frequently than vanishing gradients, certain network architectures
can induce exploding gradients. In this case, the gradient of a given parameter with respect
to cost becomes increasingly steep as we move from the final hidden layer toward the first
hidden layer. As with vanishing gradients, exploding gradients can inhibit an entire neural
network's capacity to learn by saturating the neurons with extreme values (recall that this was
a problem in our discussion of weight initialization).
Batch Normalization:
Batch norm takes the activations a output by the previous layer, subtracts
the batch mean, and divides by the batch standard deviation. This acts to recenter
the distribution of the a values with a mean of 0 and a standard deviation of 1
(Figure 9.4). Thus, if there are any extreme values in the previous layer, they
won't cause exploding or vanishing gradients in the next layer. Batch norm has
a few advantages, chief among them that it keeps activations in a well-behaved
range and so makes training more stable.
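A quick NumPy illustration of the standardization step follows; the values are made up, and a real batch-norm layer would additionally learn scale and shift parameters:
import numpy as np

a = np.array([1.0, 2.0, 3.0, 100.0])   # a batch of activations with one extreme value
a_norm = (a - a.mean()) / a.std()      # subtract the batch mean, divide by the batch std
print(a_norm.mean(), a_norm.std())     # approximately 0.0 and 1.0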
2.6.3 MODEL GENERALIZATION (AVOIDING OVERFITTING)
The situation where training cost continues to go down while validation cost goes up is
formally known as overfitting. Overfitting is nicely illustrated in Figure 9.5 below. Notice we have the same
data points scattered along x and y axes in each panel. We can imagine that there is some
distribution that describes these points, and here we have a sampling from that distribution.
Our goal is to generate a model that explains the relationship between x and y, but perhaps
most importantly that also approximates the original distribution— in this way, the model
will be able to generalize to new data points drawn from the distribution and not just model
the sampling of points we already have. In the first panel (top left), we use a single-parameter
model, which is limited to fitting a straight line to the data. This straight line underfits the
data: The cost (represented by the vertical gaps between the line and the data points) is high
and the model would not generalize well to new data points. Put simply, the line misses most
of the points because this kind of model is not complex enough. In the next panel (top right),
we use a model with two parameters, which fits a parabola-shaped curve to the data. With this
parabolic model, the cost is much lower relative to the linear model and it appears the model
would also generalize well to new data—great!
In the third panel (bottom left) of Figure 9.5, we use a model with too many parameters —
more parameters than we have data points. With this approach we reduce the cost associated
with our training data points to nil: There is no perceptible gap between the curve and the
data. In the last panel (bottom right), however, we show new data points from the original
distribution in green, which were unseen by the model during training and so can be used to
validate the model. Despite eliminating training cost entirely, the model fits these validation
data poorly and so it gets a correspondingly sizeable validation cost. The many-parameter
model, dear friends, is overfit: It is a perfect model for the training data, but it doesn't
actually capture the true relationship between x and y—rather, it has learned the exact
features of the training data too closely and subsequently it performs badly on unseen data.
To reduce overfitting, we'll cover three of the best-known techniques below:
L1 regularization adds a penalty term to the loss function that encourages the
neural network to have small weights by adding the absolute values of the
weights to the loss:
Loss_total = Loss + λ Σᵢ |wᵢ|
where λ controls the strength of the penalty and the wᵢ are the network's weights.
L2 regularization adds a penalty term to the loss function that encourages the
neural network to have small weights by adding the squared values of the
weights to the loss:
Loss_total = Loss + λ Σᵢ wᵢ²
where λ again controls the strength of the penalty and the wᵢ are the network's weights.
L2 regularization tends to distribute the weight values more evenly across all
the features and does not encourage exact zero weights.
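In Keras, these penalties can be attached to a layer's weights via kernel_regularizer; the penalty strength of 0.01 below is an illustrative value, not a recommendation:
from tensorflow.keras import layers, regularizers

l1_layer = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l1(0.01))   # L1 weight penalty
l2_layer = layers.Dense(64, activation='relu',
                        kernel_regularizer=regularizers.l2(0.01))   # L2 weight penalty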
2.6.3.2 Dropout:
Dropout is a regularization technique commonly used in deep learning neural networks to
prevent overfitting and improve the generalization of the model. The idea behind dropout is
relatively simple but highly effective. During training, dropout randomly deactivates (sets to
zero) a fraction of neurons in a neural network layer during each forward and backward pass.
This means that the network does not rely too heavily on any individual neuron and becomes
less sensitive to the specific training data. Dropout helps create a more robust and generalized
model.
Let’s cover each of the three training rounds in turn:
1. In the top panel, the second neuron of the first hidden layer and the first
neuron of the second hidden layer are randomly dropped out.
2. In the middle panel, it is the first neuron of the first hidden layer and the
second one of the second hidden layer that are selected for dropout. There is no
“memory” of which neurons have been dropped out on previous training
rounds, and so it is by chance alone that the neurons dropped out in the second
round are distinct from those dropped out in the first.
3. In the bottom panel, the third neuron of the first hidden layer is dropped out
for the first time. For the second consecutive round of training, the second
neuron of the second hidden layer is also randomly selected.
Beyond the choice of dropout rate itself, dropout does not require any additional
hyperparameters or manual feature engineering.
You can apply dropout to multiple layers in a neural network, and you can
experiment with different dropout rates for different layers.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),    # dropout rate of 0.5 (an illustrative choice)
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),    # a different, illustrative rate for this layer
    tf.keras.layers.Dense(10, activation='softmax')])
2.6.3.3 Data augmentation:
Increase Dataset Size: One of the primary reasons for data augmentation is to
artificially expand the training dataset. In many deep learning tasks, having a
larger dataset often leads to improved model generalization and performance.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')
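Augmented batches can then be drawn from the generator; in this sketch, a small random array stands in for real training images and labels:
import numpy as np

x_train = np.random.random((8, 64, 64, 3))             # 8 dummy RGB images as stand-ins
y_train = np.arange(8)                                 # dummy labels
batches = datagen.flow(x_train, y_train, batch_size=4)
x_batch, y_batch = next(batches)                       # one batch of randomly augmented images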
Momentum Optimizer:
Momentum is an extension of the gradient descent algorithm that adds a velocity term to the
update step. This velocity term helps the optimizer overcome small gradients and converge
faster in the relevant direction.
It effectively reduces oscillations and helps the optimizer escape local minima.
Adagrad:
Adagrad adapts the learning rate for each parameter individually based on the historical
gradient information. Parameters that receive large gradients get a smaller learning rate, while
parameters with small gradients get a larger learning rate.
It is well-suited for sparse data problems but may suffer from a decreasing learning rate that
makes convergence slow in the long run.
RMSprop:
RMSprop is an extension of Adagrad that mitigates the decreasing learning rate problem. It
uses a moving average of squared gradients to scale the learning rates adaptively.
RMSprop helps maintain a more consistent learning rate and is widely used in training deep
neural networks.
Adadelta:
Adadelta is another adaptive learning rate method that improves upon Adagrad and RMSprop
by keeping a running average of both squared gradients and squared parameter updates.
It avoids the need for manually specifying a learning rate or a learning rate schedule.
Adam:
Adam (Adaptive Moment Estimation) combines the ideas of momentum and RMSprop,
maintaining running averages of both the gradients and the squared gradients.
It has become one of the most popular optimizers for deep learning due to its good
convergence properties.
AdamW:
AdamW is a variation of the Adam optimizer that adds weight decay (L2 regularization) to
the parameter updates, which can help prevent overfitting.
L-BFGS:
L-BFGS is a quasi-Newton optimization algorithm that can be used for optimizing both
convex and non-convex objective functions. It uses limited memory and approximates the
Hessian matrix.
While not as commonly used in deep learning as some other optimizers, it can be effective
for small to medium-sized networks with a limited number of parameters.
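For reference, the optimizers discussed above can be instantiated in Keras as follows; the learning rates and momentum values shown are illustrative defaults, not recommendations:
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adadelta = tf.keras.optimizers.Adadelta()
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
# AdamW is available as tf.keras.optimizers.AdamW in recent TensorFlow releases
The complete example below ties several of these ideas together on MNIST.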
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load MNIST and scale pixel values to [0, 1]
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.astype('float32') / 255.0
test_images = test_images.astype('float32') / 255.0

# One-hot encode the labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
model = models.Sequential()
# Input layer
model.add(layers.Flatten(input_shape=(28, 28)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.5))
# Output layer
model.add(layers.Dense(10, activation='softmax'))
# Compile and train (the optimizer, epochs, and batch size are illustrative choices)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=10, batch_size=128,
          validation_split=0.2, verbose=2)