DL MATERIAL UNIT 2

UNIT II

Introducing Deep Learning: Biological and Machine Vision, Human and Machine Language, Artificial Neural Networks, Training Deep Networks, Improving Deep Networks.

2.1 Introducing Deep Learning

Deep learning is a branch of machine learning, which is itself a subset of
artificial intelligence. Just as neural networks imitate the human brain, so does
deep learning. In deep learning, nothing is programmed explicitly.
Basically, it is a class of machine learning that makes use of numerous nonlinear
processing units to perform feature extraction as well as transformation.
The output from each preceding layer is taken as input by each of the
successive layers.

Deep learning models are capable of focusing on the right features by
themselves, requiring only a little guidance from the programmer, and they are very
helpful in addressing the problem of dimensionality. Deep learning
algorithms are used especially when we have a huge number of inputs and outputs.

Since deep learning evolved from machine learning, which is itself a
subset of artificial intelligence, and since the idea behind artificial intelligence is
to mimic human behavior, the idea of deep learning is likewise "to build
algorithms that can mimic the brain".

Deep learning is implemented with the help of neural networks, and the
motivation behind neural networks is the biological neuron, which is
nothing but a brain cell.

“Deep learning is a collection of statistical techniques of machine learning for


learning feature hierarchies that are actually based on artificial neural
networks.”

So basically, deep learning is implemented with the help of deep networks, which
are nothing but neural networks with multiple hidden layers.
Example of Deep Learning

In the example given above, we provide the raw image data to the input layer.
The input layer then determines patterns of local contrast, that is, it
differentiates on the basis of colors, luminosity, and so on. The first hidden
layer then determines face features, i.e., it fixates on eyes, nose, lips, and so on.
The second hidden layer then matches those face features to the correct face
template and so determines the correct face, as can be seen in the image, after
which the result is sent to the output layer. Likewise, more hidden layers can be
added to solve more complex problems, for example, finding a particular kind
of face having a dark or light complexion. As the number of hidden layers
increases, we are able to solve more complex problems.

2.2 Biological and Machine Vision

Biological vision and deep learning are two different approaches to achieving
visual perception, but deep learning has been inspired by certain aspects of
biological vision.
2.2.1 BIOLOGICAL VISION:

Biological vision refers to the visual perception and processing of information
by living organisms, primarily animals and humans. It is a complex sensory
process that allows organisms to interpret their surroundings, identify objects,
navigate their environment, and make sense of the visual information they
receive.
Key components of biological vision include:
Photoreceptors: The initial stage of vision involves photoreceptor cells in the
eyes, such as rod cells and cone cells in humans. These cells capture light and
convert it into electrical signals that can be processed by the brain.
Retina: The retina is a critical structure in the eye that contains photoreceptor
cells. It processes visual information and sends it as electrical signals through
the optic nerve to the brain for further interpretation.
Visual Processing: In the brain, visual information undergoes extensive
processing in different areas, including the primary visual cortex. This
processing involves the recognition of shapes, colors, motion, and the
integration of visual cues.
Depth Perception: Biological vision also enables depth perception, allowing
organisms to perceive objects in three dimensions. This is achieved through the
combination of visual cues like binocular disparity and motion parallax.
Color Vision: Many animals, including humans, possess the ability to perceive
colors. This is made possible by the presence of cone cells in the retina, each
sensitive to different wavelengths of light.
Visual Pathways: Visual information is processed through various neural
pathways in the brain, leading to the recognition of objects, faces, and scenes.
The dorsal stream is associated with spatial processing and action, while the
ventral stream is responsible for object recognition and perception.
Adaptation: Biological vision also involves the ability to adapt to different
lighting conditions. This includes both short-term adaptation to changes in
lighting and long-term adaptation through processes like dark adaptation.
Visual Perception and Illusions: Biological vision is not always a faithful
representation of the external world. Visual perception can be influenced by
illusions, where the brain interprets visual information in ways that may not
accurately reflect reality.

Projecting slides onto a screen, Hubel and Wiesel began by presenting simple
shapes like the dot shown in Figure to the cats.

Their initial results were disheartening: Their efforts were met with no response
from the neurons of the primary visual cortex. They grappled with the
frustration of how these cells, which anatomically appear to be the gateway for
visual information to the rest of the cerebral cortex, would not respond to visual
stimuli. Distraught, Hubel and Wiesel tried in vain to stimulate the neurons by
jumping and waving their arms in front of the cat. Nothing. And then, as with
many of the great discoveries, from X-rays to penicillin to the microwave oven,
Hubel and Wiesel made a serendipitous observation: As they removed one of
their slides from the projector, its straight edge elicited the distinctive crackle of
their recording equipment to alert them that a primary visual cortex neuron was
firing. Overjoyed, they celebrated up and down the Johns Hopkins laboratory
corridors. The serendipitously crackling neuron was not an anomaly. Through
further experimentation, Hubel and Wiesel discovered that the neurons that
receive visual input from the eye are in general most responsive to simple,
straight edges. Fittingly then, they named these cells simple neurons.
As depicted in below Figure 1.5, Hubel and Wiesel's research revealed
that individual simple neurons exhibit optimal responsiveness to edges oriented
at specific angles. When a multitude of these specialized simple neurons, each
attuned to a distinct edge orientation, work collectively, they collectively cover
all 360 degrees of possible orientations. The task of these simple neurons, which
are adept at detecting edge orientations, is to transmit their findings to a
significant number of complex neurons. Complex neurons, in turn, receive
visual input that has already undergone processing by numerous simple cells.
Consequently, complex neurons are strategically situated to amalgamate
multiple line orientations, creating more intricate shapes such as corners or
curves.
Figure 1.5 A “simple” cell in the primary visual cortex of a cat fires at different rates, depending on the orientation of a line shown to the cat. The orientation of the line is provided in the left-hand column of the figure, while the right-hand column shows the firing (electrical activity) in the cell over time (one second). A vertical line (in the fifth row) causes the most electrical activity for this particular simple cell. Lines slightly off vertical (in the intermediate rows) cause less activity for the cell, while lines approaching horizontal (in the topmost and bottommost rows) cause little to no activity.

In Figure 1.5, we observe the behavior of a "simple" cell situated in the primary
visual cortex of a cat. This cell exhibits varying rates of firing, depending on the
orientation of a line presented to the cat. The left-hand column of the figure
provides information about the line's orientation, while the right-hand column
illustrates the cell's firing activity over a one-second interval.
Specifically, when a vertical line is presented (as seen in the fifth row of
the figure), it triggers the highest level of electrical activity in this particular
simple cell. Lines that deviate slightly from the vertical orientation (found in the
intermediate rows) elicit reduced activity in the cell. Conversely, lines
approaching a horizontal alignment (situated in the top-most and bottom-most
rows) result in minimal to no activity in the cell.
Today, through countless subsequent recordings from the cortical
neurons of brain surgery patients as well as noninvasive techniques like
magnetic resonance imaging, neuroscientists have pieced together a fairly high-
resolution map of regions that are specialized to process particular visual
stimuli, e.g., color, motion, faces (see Figure 1.7).
2.2.2 MACHINE VISION:

Machine vision, also known as computer vision, is a field within artificial
intelligence and computer science that focuses on enabling computers to
interpret and understand visual information from the world, much like human
vision. Deep learning techniques have had a transformative impact on machine
vision, enabling computers to perform tasks like image recognition, object
detection, image generation, and more with remarkable accuracy. The timeline
shows the development of vision in biological organisms and machines.

Figure 1.8 Abridged timeline of biological and machine vision, highlighting the
key historical moments in the deep learning and traditional machine learning
approaches to vision that are covered in this section.

2.2.2.1 The Neocognitron:

Taking inspiration from Hubel and Wiesel's groundbreaking discovery of


the primary visual cortex's hierarchical organization, as depicted in Figure 1.6,
the Japanese electrical engineer Kunihiko Fukushima introduced a conceptually
similar architecture for machine vision in the late 1970s. This model, which he
named the "neocognitron," is outlined in Figure 1.9. While many of the finer
details of his diagrams are not crucial at this stage, there are three key aspects
worth highlighting.

Firstly, Fukushima explicitly references Hubel and Wiesel's work, even


citing three of their seminal articles on the structure of the primary visual cortex
as foundational to his ideas. This underscores the direct influence of biological
research on his machine vision model.

Secondly, Fukushima adopts the terminology of "simple" and "complex"


cells from Hubel and Wiesel, designating deeper layers (located further to the
right in his diagrams) as "hypercomplex." This terminology alignment helps
establish a conceptual bridge between biological vision and artificial neural
networks.
The most pivotal point to note is that Fukushima's arrangement of
artificial neurons in a hierarchical fashion mirrors the way biological vision
generally represents line orientations. In his model, the layers closest to the raw
visual input (found on the far left and receiving input from the image "U" in
Figure 1.9) primarily encode line orientations. In contrast, as you move to
deeper layers (further to the right), they progressively represent more complex
and abstract objects. This essential property of the neocognitron and its
subsequent deep learning descendants will become clearer through an
interactive example discussed later in this chapter.

2.2.2.2 LeNet-5

While the neocognitron was capable of, for example, identifying handwritten
characters, the accuracy and efficiency of Yann LeCun's (Figure 1.10) and
Yoshua Bengio's (Figure 1.11) LeNet-5 model made it a significant
development. LeNet-5's hierarchical architecture (Figure 1.12) built on
Fukushima's lead and the biological inspiration uncovered by Hubel and
Wiesel. In addition, LeCun and his colleagues benefited from superior data for
training their model, faster processing power and, critically, the
backpropagation algorithm.

Backpropagation, commonly referred to as backprop, plays a pivotal role


in enabling efficient learning across the layers of artificial neurons within deep
learning models. Coupled with the availability of data and increased processing
power, backpropagation made LeNet-5 a highly reliable tool and marked one of
the early commercial applications of deep learning. Notably, it was employed
by the United States Postal Service for automating the reading of ZIP codes on
mail envelopes.

What distinguished LeNet-5, developed by Yann LeCun and his colleagues, was
its ability to accurately predict handwritten digits without requiring them to
encode explicit knowledge about handwriting in their code. This aspect
highlights a fundamental difference between deep learning and traditional
machine learning approaches.

Feature engineering involves the application of clever and often complex


algorithms to raw data, transforming them into input variables that can be
readily processed by traditional statistical techniques such as regression, random
forests, or support vector machines. These techniques tend to be less effective
when applied directly to unprocessed data. Consequently, the refinement of
input data has historically been a primary focus for machine learning
professionals. However, deep learning, exemplified by LeNet-5, has the
capacity to automatically extract relevant features from raw data, reducing the
need for extensive manual feature engineering. This marks a significant
departure from the traditional machine learning paradigm.
2.2.2.3 The Traditional Machine Learning Approach
Following the era of LeNet-5, interest in artificial neural networks,
including deep learning, waned. The prevailing consensus was that the
automated feature generation approach, while effective for tasks like
handwritten character recognition, was not widely practical. It was perceived
that the feature-free philosophy had limitations in terms of its applicability. As a
result, traditional machine learning, with its emphasis on feature engineering,
seemed to offer greater promise, leading to a redirection of funding away from
deep learning research.

To illustrate the concept of feature engineering, Figure 1.14 presents a


renowned example from the early 2000s by Paul Viola and Michael Jones.
Viola and Jones utilized rectangular filters, as demonstrated in the figure,
including vertical and horizontal black and white bars. Features generated by
applying these filters to an image could then be input into machine learning
algorithms for reliable face detection. Their work stands out because their
algorithm was efficient enough to become the first real-time face detector
outside the realm of biology.

Creating these ingenious face-detecting filters, which converted raw pixel


data into meaningful features for machine learning, was the culmination of
years of research and collaboration focused on understanding facial
characteristics. However, it's important to note that this approach is limited to
detecting faces in general and cannot recognize specific individuals like Angela
Merkel or Oprah Winfrey. To develop features for the specific detection of, for
example, Oprah, or for identifying non-face objects like houses, cars, or
Yorkshire Terriers, would necessitate expertise in those particular categories.
Achieving this could, once again, demand years of collaboration within the
academic and research community.
ImageNet and the ILSVRC:

ImageNet and the ImageNet Large Scale Visual Recognition Challenge


(ILSVRC) are two closely related initiatives that have had a profound impact on
the field of computer vision and deep learning.

ImageNet:

ImageNet is a large-scale database of labeled images. It was created to provide


a vast and diverse dataset for training and evaluating computer vision
algorithms, particularly for object recognition tasks.

Dataset Size: ImageNet originally contained over 14 million labeled images


from various categories, making it one of the largest publicly available image
datasets.

Image Categories: The images in ImageNet are categorized into thousands of


object classes, ranging from animals and everyday objects to abstract concepts.

ImageNet Challenge: ImageNet hosted the ImageNet Large Scale Visual


Recognition Challenge (ILSVRC), which aimed to promote the development of
state-of-the-art object recognition algorithms. The challenge involved
classifying objects within the ImageNet dataset.
ILSVRC (ImageNet Large Scale Visual Recognition Challenge):

Competition: ILSVRC was an annual competition held from 2010 to 2017,


which tasked participants with developing image classification models that
could accurately categorize objects in the ImageNet dataset.

Benchmarking: ILSVRC served as a benchmark for evaluating the progress in


the field of computer vision. It encouraged the development of deep learning
models and algorithms capable of achieving high accuracy on a challenging
dataset.

Historical Significance: The ILSVRC competitions played a pivotal role in the


rise of deep learning, particularly convolutional neural networks (CNNs), which
demonstrated remarkable performance improvements on object recognition
tasks. Models like AlexNet, VGG, and later, ResNet, were developed and
refined for ILSVRC competitions and became influential in the deep learning
community.

Deep Learning Breakthrough: The success of deep learning models on the


ILSVRC challenges significantly contributed to the popularization of deep
neural networks in various computer vision applications, including image
classification, object detection, and segmentation.
AlexNet:

AlexNet is a deep convolutional neural network (CNN) architecture that played


a pivotal role in advancing the field of computer vision, particularly in image
classification tasks. It was developed by Alex Krizhevsky, Ilya Sutskever, and
Geoffrey Hinton and won the 2012 ImageNet Large Scale Visual Recognition
Challenge (ILSVRC), demonstrating significant improvements in accuracy
compared to previous approaches. Here are some key features and contributions
of AlexNet:

Deep Architecture: AlexNet was one of the first deep neural networks with
multiple convolutional and fully connected layers. It consisted of eight learned
layers, including five convolutional layers and three fully connected layers.

Convolutional Layers: The convolutional layers in AlexNet used filters ranging
from 11x11 (with a stride of 4) in the first layer down to 3x3 in the deeper
layers. These layers learned hierarchical features, from low-level edges and
textures to high-level object parts and concepts.
ReLU Activation: AlexNet used rectified linear unit (ReLU) activation
functions, which helped mitigate the vanishing gradient problem and
accelerated training.

Local Response Normalization (LRN): AlexNet included LRN layers after the
first and second convolutional layers. LRN helped improve generalization by
providing a form of lateral inhibition, enhancing the response of neurons that
were most activated relative to their neighbors.

Dropout: Dropout was applied to the fully connected layers during training to
prevent overfitting. It randomly deactivated a fraction of neurons during each
training iteration.

Large-Scale Training Data: AlexNet was trained on a massive dataset,


including over a million labeled images from the ImageNet database. This
large-scale dataset contributed to the model's ability to recognize a wide range
of objects and concepts.

GPU Acceleration: The training of AlexNet was significantly accelerated by


using Graphics Processing Units (GPUs), making it one of the early examples
of the importance of GPU computing in deep learning.

Top Performer in ILSVRC 2012: In the ImageNet competition, AlexNet


achieved a top-5 error rate of around 16%, which was a substantial
improvement over previous approaches and marked the beginning of the deep
learning revolution in computer vision.

Influence on Deep Learning: AlexNet's success had a profound impact on the


adoption of deep learning techniques in various computer vision applications. It
demonstrated the potential of deep neural networks for image classification
tasks and inspired subsequent architectures like VGG, GoogLeNet, and ResNet.
2.3 HUMAN AND MACHINE LANGUAGE:

Human language and machine language in the context of deep learning refer to
how natural human language is processed and understood by computers using
artificial intelligence techniques.

2.3.1 Deep Learning Networks Learn Representations Automatically:

The Venn diagram in Figure 2.1 shows how deep learning resides within the
machine learning family of representation learning approaches. The
representation learning family, which contemporary deep learning dominates,
includes any techniques that learn features from data automatically. Indeed, we
can use the terms "feature" and "representation" interchangeably.

Figure 1.13 succinctly highlighted the advantages of representation learning in


comparison to traditional machine learning methodologies. Traditional machine
learning often excels due to the ingenuity of human-designed code that
transforms raw data—whether it be images, audio, or text—into input features
suitable for machine learning algorithms like regression, random forest, or
support vector machines. These algorithms are proficient at weighing features
but less skilled at autonomously deriving features directly from raw data. The
manual creation of features is often a specialized and labor-intensive task,
sometimes requiring extensive training in a specific domain. For instance,
working with language data might necessitate graduate-level training in
linguistics.
One of the primary merits of deep learning lies in its capacity to alleviate
the need for domain-specific expertise. Instead of laboriously crafting input
features from raw data, deep learning models can ingest the data directly.
Across numerous instances provided to the deep learning model, the initial layer
of artificial neurons, which receives the raw input, learns to represent basic
abstractions of the data. Subsequent layers then build upon this foundation,
acquiring the ability to represent progressively complex nonlinear abstractions
compared to the preceding layer.
2.3.2 Natural Language Processing:
Natural language processing (NLP) is a multidisciplinary field that resides at the
confluence of computer science, linguistics, and artificial intelligence, as
depicted in Figure 2.2. In essence, NLP entails the manipulation of human
language, both spoken and written, to facilitate automated task completion or
enhance human-computer interactions. This encompasses a wide array of
applications, from automating language-based tasks to simplifying human-
computer interactions. It's important to note that NLP focuses on naturally-
spoken or naturally-written language, such as the sentence you are currently
reading.
Examples of Natural Language Processing (NLP) applications in various
industries include:
Classifying Documents: NLP is used to analyze the language within
documents, such as emails, tweets, or film reviews, to categorize them into
specific classes. For instance, it can classify emails as high-urgency or
categorize product reviews based on sentiment. In the finance industry, NLP
can predict the direction of a company's stock price based on news articles.
Machine Translation: NLP assists in language translation, offering machine-
generated suggestions for translating content from one language (e.g., English)
to another (e.g., German or Mandarin). This includes fully-automatic translation
systems, although they may not always be perfect, they provide a valuable tool
for overcoming language barriers.
Search Engines: NLP enhances search engines by autocompleting user queries
and predicting what information or websites users are likely seeking. It helps
improve the relevance of search results and provides a smoother user
experience.
Speech Recognition: NLP is integral to speech recognition systems that
interpret voice commands to provide information or carry out actions. Virtual
assistants like Amazon's Alexa, Apple's Siri, or Microsoft's Cortana rely on
NLP to understand and respond to spoken language.
Chatbots: While modern chatbots may not convincingly engage in extended,
natural conversations, they are valuable for handling relatively linear
discussions on specific topics. Businesses often employ chatbots for routine
customer service interactions, addressing common inquiries and providing
support.
2.3.3 COMPUTATIONAL REPRESENTATIONS OF LANGUAGE:
Two popular methods for converting text into numbers are one-hot encoding and word
vectors.
2.3.3.1 One-Hot Representations of Words:
The traditional approach to encoding natural language numerically for
processing it with a machine is one-hot encoding (Figure 2.4). In this approach,
the words of natural language in a sentence (e.g., "the", "cat", "sat", "on", "the",
and "mat") are represented by the columns of a matrix. Each row in the matrix,
meanwhile, represents a unique word. If there are a hundred unique words
across the corpus of documents you're feeding into your natural language
algorithm, then your matrix of one-hot encoded words will have one hundred
rows. If there are a thousand unique words across your corpus, then there will
be a thousand rows in your one-hot matrix, and so on.

One-hot matrices consist of binary values, specifically, either a zero or a one.


Each column in these matrices contains at most a single one, with the rest of the
entries being zeroes. Consequently, one-hot matrices are characterized by their
sparsity. A value of one in a cell indicates the presence of a particular word
(corresponding to the row) at a specific position (corresponding to the column)
within the corpus. In Figure 2.4, our corpus comprises only six words, five of
which are unique. Therefore, representing these words in a one-hot matrix
involves having six columns (to account for all possible positions) and five rows
(one for each unique word). For instance, the first unique word, "the," appears
in both the first and fifth positions, as denoted by the cells filled with ones in the
first row of the matrix. The second unique word, "cat," only occurs in the
second position, resulting in a one in the second row of the second column.
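As a concrete illustration, here is a minimal NumPy sketch of the one-hot matrix for this six-word corpus; the variable names and the row ordering (words in order of first appearance) are purely illustrative.

import numpy as np

corpus = ["the", "cat", "sat", "on", "the", "mat"]    # six word positions
vocab = sorted(set(corpus), key=corpus.index)         # five unique words, in order of appearance

# rows correspond to unique words, columns to positions in the corpus
one_hot = np.zeros((len(vocab), len(corpus)), dtype=int)
for position, word in enumerate(corpus):
    one_hot[vocab.index(word), position] = 1

print(vocab)      # ['the', 'cat', 'sat', 'on', 'mat']
print(one_hot)
# [[1 0 0 0 1 0]
#  [0 1 0 0 0 0]
#  [0 0 1 0 0 0]
#  [0 0 0 1 0 0]
#  [0 0 0 0 0 1]]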

2.3.3.2 Word Vectors (Word Embeddings):

Word vectors, also known as word embeddings, represent words as continuous-


valued vectors in a lower-dimensional space. These vectors are learned from
large text corpora using techniques like Word2Vec, GloVe, or FastText. Unlike
one-hot encoding, word vectors capture semantic relationships between words.
Words with similar meanings or contexts have similar vector representations.

Figure 2.6 provides a cartoon of vector space. The space can have any number
of dimensions, so we can call it an n-dimensional vector space.
Here's an example:

Suppose we have word vectors for "apple," "banana," and "cherry" as follows:

"apple" → [0.5, 0.7, -0.3]

"banana" → [0.4, 0.6, -0.2]

"cherry" → [0.3, 0.5, -0.1]

Now, the sentence "I like banana" can be represented as the sum or average of
the word vectors of its constituent words (here, assuming "I" and "like" fall
outside our tiny vocabulary and are mapped to zero vectors):

"I like banana" → ([0, 0, 0] + [0, 0, 0] + [0.4, 0.6, -0.2]) / 3 = [0.133, 0.2, -0.067]

Word vectors offer more compact and semantically meaningful representations


of words. They can capture complex relationships, such as word similarity and
analogies (e.g., "king" - "man" + "woman" ≈ "queen"). These representations
are widely used in natural language processing tasks like sentiment analysis,
machine translation, and text classification due to their effectiveness in
capturing semantic information.

2.3.3.3 Word Vector Arithmetic:

Word vector arithmetic is a fascinating aspect of natural language processing


(NLP) that allows us to perform operations on word vectors to capture semantic
relationships between words. One of the most famous examples of word vector
arithmetic is the "king - man + woman = queen" analogy. Here's how it works:

Word Vector Arithmetic Example:

Suppose we have pre-trained word vectors for the words "king," "man,"
"woman," and "queen" in a semantic vector space. Each word is represented as
a vector of real numbers. For simplicity, let's assume these vectors are three-
dimensional:

"king" vector: [0.4, 0.3, 0.9]

"man" vector: [0.2, 0.1, 0.5]

"woman" vector: [0.1, 0.4, 0.6]

"queen" vector: [0.3, 0.2, 0.7]


Now, we can perform word vector arithmetic to find relationships between these
words:

king - man: Subtract the vector for "man" from "king."

Result: [0.4 - 0.2, 0.3 - 0.1, 0.9 - 0.5] = [0.2, 0.2, 0.4]

king - man + woman: Add the vector for "woman" to the result from step 1.

Result: [0.2 + 0.1, 0.2 + 0.4, 0.4 + 0.6] = [0.3, 0.6, 1.0]

The resulting vector [0.3, 0.6, 1.0] represents a point in the vector space. We
can interpret this vector as a word, and it is often the closest word vector to this
point in the vector space. In this case, the closest word vector is "queen."

So, the word vector arithmetic "king - man + woman" yields a vector that is
closest in meaning to "queen." This demonstrates that word vectors capture
semantic relationships, as the operation effectively computes the word "queen"
by reasoning about the relationships between "king," "man," and "woman."

Word vector arithmetic can be used to perform various analogical reasoning


tasks and capture semantic relationships between words, making it a powerful
tool in NLP and machine learning applications.
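A minimal NumPy sketch of this arithmetic using the illustrative three-dimensional vectors above (real embeddings such as word2vec typically have hundreds of dimensions). The cosine-similarity lookup is the standard way the "closest" word vector is found; the variable names are illustrative.

import numpy as np

vectors = {
    "king":  np.array([0.4, 0.3, 0.9]),
    "man":   np.array([0.2, 0.1, 0.5]),
    "woman": np.array([0.1, 0.4, 0.6]),
    "queen": np.array([0.3, 0.2, 0.7]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

result = vectors["king"] - vectors["man"] + vectors["woman"]   # [0.3, 0.6, 1.0]

# find the vocabulary word closest to the resulting point,
# excluding the words used in the arithmetic itself
candidates = {w: v for w, v in vectors.items() if w not in ("king", "man", "woman")}
closest = max(candidates, key=lambda w: cosine_similarity(result, candidates[w]))
print(closest)   # queen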

2.3.4 ELEMENTS OF NATURAL HUMAN LANGUAGE:


Natural human language is a complex and rich form of communication that
encompasses various elements and features. These elements are fundamental to
understanding and analyzing language in the context of linguistics and natural
language processing (NLP).
Here are some key elements of natural human language:
Phonetics and Phonology:
Phonetics: Phonetics deals with the physical properties of speech sounds,
including their production, transmission, and perception. It focuses on the
articulation (how sounds are produced), acoustics (properties of sound waves),
and auditory perception of speech sounds.
Phonology: Phonology is concerned with the abstract, mental representations of
speech sounds and the rules governing their combination in a particular
language. It examines phonemes, allophones, and phonological rules.
Morphology:
Morphology studies the structure of words and how words are formed from
smaller units called morphemes. Morphemes are the smallest meaningful units
in a language. Morphology deals with issues like word formation, inflection,
and derivation.
Syntax:
Syntax focuses on the structure and arrangement of words to form meaningful
sentences. It involves understanding the rules and principles that govern
sentence formation, including word order, sentence constituents, and
grammatical relations.
Semantics:
Semantics is the study of meaning in language. It explores how words, phrases,
and sentences convey meaning and how meaning is interpreted by speakers. It
addresses issues like word meaning, ambiguity, and semantic roles.
Pragmatics:
Pragmatics deals with how language is used in context to convey meaning
beyond the literal interpretation of words. It includes the study of implicature,
presupposition, speech acts, and the interpretation of indirect speech.
Discourse Analysis:
Discourse analysis examines how language is used in longer stretches of
communication, such as conversations, narratives, and written texts. It considers
issues like coherence, cohesion, and the organization of discourse.
Semiotics:
Semiotics is the study of signs and symbols and their use in communication. It
includes the analysis of linguistic signs (words), non-linguistic signs (gestures,
symbols), and how meaning is constructed through sign systems.
Prosody:
Prosody refers to the rhythm, intonation, and stress patterns in speech. It
conveys emotional tone, emphasis, and additional meaning beyond the words
themselves.
Lexicon:
The lexicon is the mental dictionary of a language user, containing the
vocabulary and word knowledge. It includes information about word meanings,
usage, and relationships between words.
Dialect and Variation:
Languages often have dialects and regional variations that affect pronunciation,
vocabulary, and grammar. Sociolinguistics examines the social and cultural
aspects of language variation.
Writing Systems:
Many languages have written forms, and writing systems vary across cultures.
Understanding how speech is represented in written form is a crucial aspect of
language analysis.
These elements collectively contribute to the richness and complexity of natural
human language. Linguists and researchers in NLP and computational
linguistics analyze and model these elements to build language understanding
systems and develop algorithms for various language-related tasks.
2.4 ARTIFICIAL NEURAL NETWORKS:
We'll cover how individual neural units are linked together to form artificial
neural networks, including deep learning networks. We create an artificial neural
network to predict what digit a given handwritten MNIST image represents. As
shown in the rough schematic diagram in Figure 5.4, this artificial neural
network features one hidden layer of artificial neurons, for a total of three layers.
2.4.1 THE INPUT LAYER:

1. an input layer consisting of 784 neurons, one for each of the 784 pixels in an
MNIST image

2. a hidden layer composed of 64 sigmoid neurons

3. an output layer consisting of 10 softmax neurons, one for each of the ten
classes of digits.

Of these three, the input layer is the most straightforward to detail: the number
of neurons in the input layer corresponds directly to the shape of the input
data.
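For concreteness, a minimal Keras sketch of this three-layer architecture; everything beyond the 784-64-10 layer sizes and the sigmoid/softmax activations stated above is an illustrative assumption.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='sigmoid', input_shape=(784,)))   # hidden layer of 64 sigmoid neurons
model.add(Dense(10, activation='softmax'))                       # output layer of 10 softmax neurons
model.summary()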

2.4.2 DENSE LAYERS:

There are many kinds of hidden layers. The most general type is the
dense layer, which can also be called a fully connected layer. Dense layers are
found in many deep learning architectures.

1. An input layer with two neurons: one for storing the vertical position of
a given dot within the grid on the far right, and the other for storing the dot’s
horizontal position.
2. A hidden layer composed of eight ReLU neurons. Visually, we can see
that this is a dense layer because each of the eight neurons in it is connected to
(i.e., receives information from) both of the input-layer neurons, for a total of
16 (= 8 × 2) incoming connections.
3. Another hidden layer composed of eight ReLU neurons. We can again
discern that this is a dense layer because its eight neurons each receive input
from each of the eight neurons in the previous layer, for a total of 64 (= 8 × 8)
inbound connections. Note how the neurons in this layer are nonlinearly
recombining the straightedge features provided by the neurons in the first
hidden layer to produce more elaborate features like curves and circles.
4. A third dense hidden layer, this one consisting of four ReLU neurons
for a total of 32 (= 4 × 8) connecting inputs. This layer nonlinearly recombines
the features from the previous hidden layer to learn more complex features that
begin to look directly relevant to the binary (orange versus blue) classification
problem shown in the grid on the right.
5. A fourth and final dense hidden layer. With its two ReLU neurons, it
receives a total of eight (= 2 × 4) inputs from the previous layer. The neurons in
this layer devise such elaborate features via nonlinear recombination that they
visually approximate the overall boundary dividing blue from orange on the
grid.
6. An output layer made up of a single sigmoid neuron. Sigmoid is the
typical choice of neuron for a binary classification problem like this one. The
sigmoid function outputs activations that range from zero up to one, allowing us
to obtain the network's estimated probability that a given input x is a
positive case (a blue dot in the current example) or, inversely, the probability
that it is a negative case. Like the hidden layers, the output layer is dense too: its
neuron receives information from both neurons of the final hidden layer, for a
total of two (= 1 × 2) connections.
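A hedged Keras sketch of the network just described (two inputs, ReLU hidden layers of 8, 8, 4, and 2 neurons, and one sigmoid output); the layer sizes follow the list above, and everything else is illustrative.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(8, activation='relu', input_shape=(2,)))   # first hidden layer
model.add(Dense(8, activation='relu'))                     # second hidden layer
model.add(Dense(4, activation='relu'))                     # third hidden layer
model.add(Dense(2, activation='relu'))                     # fourth hidden layer
model.add(Dense(1, activation='sigmoid'))                  # output neuron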
Ex: A HOT DOG-DETECTING DENSE NETWORK:
Let's further strengthen our comprehension of dense networks by returning to
two old flames of ours: the frivolous hot dog-detecting binary classifier and the
mathematical notation we used to define artificial neurons. As shown in Figure
7.1, our hot dog classifier is no longer a single neuron; in this chapter, it is a
dense network of artificial neurons.

More specifically, with this network architecture:

 We have reduced the number of input neurons down to two for
simplicity.
 The first input neuron, x1, represents the volume of ketchup (in,
say, milliliters, which abbreviates to mL) on the object being
considered by the network. (We are no longer working with
perceptrons, so we are no longer restricted to binary inputs only.)
 The second input neuron, x2, represents mL of mustard.
 We have two dense hidden layers:
 The first hidden layer has three ReLU neurons.
 The second hidden layer has two ReLU neurons.
 The output neuron is denoted by ŷ in the network. This is a binary
classification problem, so, as outlined in the previous section,
this neuron should be sigmoid.

As in our perceptron examples, y = 1 corresponds to the presence of a hot
dog and y = 0 corresponds to the presence of some other object.
2.4.3 Forward Propagation through the First Hidden Layer:
As suggested by the right-facing arrow along the bottom of Figure 7.1, executing
the calculations through an artificial neural network from the input layer (the x
values) through to the output layer (ŷ) is called forward propagation.
Immediately above, we detailed the process for forward propagating through a
single neuron in the first hidden layer of our hot dog-detecting network. To
forward propagate through the remaining neurons of the first hidden layer, that
is, to calculate the a values for the neurons labelled a2 and a3, we would
follow the same process as we did for the neuron labelled a1. The inputs x1 and
x2 are identical for all three neurons, but despite being fed the same
measurements of ketchup and mustard, each neuron in the first hidden layer will
output a different activation a because the parameters w1, w2, and b vary for
each of the neurons in the layer.
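A small NumPy sketch of this step; the weights, biases, and input measurements below are made up purely for illustration (three ReLU neurons, each with its own w1, w2, and b).

import numpy as np

x = np.array([3.0, 1.0])                 # x1 = mL of ketchup, x2 = mL of mustard

W = np.array([[ 0.5, -0.2],              # one row of weights (w1, w2) per hidden neuron
              [-0.3,  0.8],
              [ 0.1,  0.4]])
b = np.array([0.1, -0.1, 0.05])          # one bias per hidden neuron

z = W @ x + b                            # weighted sums, z = w . x + b
a = np.maximum(0, z)                     # ReLU activations a1, a2, a3
print(a)                                 # three different activations from the same inputs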
2.4.4 Forward Propagation through Subsequent Layers:
2.4.5 THE SOFTMAX LAYER OF A FAST FOOD-CLASSIFYING
NETWORK
The softmax function converts the three z values into probabilities:

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)

Given this arithmetic, the etymology of the "softmax" name should
now be discernible: the function favors the class with the highest z value (the max), but
it does so softly. That is, instead of indicating that there's a 100 percent chance the
object is pizza and a zero percent chance it's either of the other two fast food
classes (that would be... hard), the network hedges its bets, to an extent, and
provides a likelihood that the object is each of the three classes.
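A quick NumPy illustration of that behavior, with invented z values for the three fast food classes:

import numpy as np

z = np.array([2.0, 1.0, 0.1])                     # e.g., pizza, burger, hot dog
probabilities = np.exp(z) / np.sum(np.exp(z))     # softmax
print(probabilities)                              # ~[0.66, 0.24, 0.10]; sums to 1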

2.5 Training Deep Networks:

When input values are propagated through an artificial neural network,


the network produces an estimated output denoted as ŷ. In an ideal scenario
where the network is perfectly calibrated, the ŷ values it outputs would exactly
match the true labels y. For instance, in the context of a binary classifier
designed to detect hot dogs, a y value of 1 signifies that the presented object is a hot
dog, while y = 0 indicates that it's something else. If the network were presented
with an actual hot dog, it should ideally yield ŷ = 1.
However, achieving a perfect match between ŷ and y, although desirable,
is not always feasible in practice. Striving for ŷ = y might be overly strict and
demanding as a definition of the "correct" ŷ. Instead, we might find substantial
satisfaction in observing a ŷ value like 0.9997, signifying the network's high
confidence that the object is a hot dog. An ŷ of 0.9 could still be deemed
acceptable, while ŷ = 0.6 might be seen as disappointing and a ŷ near 0 could be
considered extremely unsatisfactory.
To quantify the range of sentiments regarding the evaluation of outputs,
spanning from being "quite pleased" to "downright awful," machine learning
algorithms often incorporate cost functions, also referred to as loss functions. In
this book, we will discuss two such functions: quadratic cost and cross-entropy
cost. Let's delve into each of these functions in sequence.

2.5.1 Quadratic Cost :

Quadratic cost is one of the simplest cost functions to calculate. It is


alternatively called mean squared error, which handily describes all that there is
to its calculation:

For any given instance i, we calculate the difference (the error) between the true
label y and the network’s estimated ŷ . We then square this difference, for two
reasons:

1. Squaring ensures that whether y is greater than ŷ or vice versa, the difference
between the two is stated as a positive value.

2. Squaring penalizes large differences between y and ŷ much more severely


than small differences.

Having obtained a squared error for each instance i with (yᵢ − ŷᵢ)², we can at last
calculate the mean cost C across all n of our instances by Equation 8.1:

C = (1/n) · Σ (yᵢ − ŷᵢ)²   (summing over all n instances i)

By taking a peek inside the Quadratic Cost Jupyter notebook from the book's
GitHub repo, you can play around with Equation 8.1 yourself. At the top of the
notebook, we define a function to calculate the mean squared error across instances:

import numpy as np

def mean_squared_error(y, y_hat):
    # squared error for each instance i, then the mean across all n instances
    squared_errors = (y - y_hat)**2
    mse = np.mean(squared_errors)
    return mse

# Example usage
actual_labels = np.array([1, 0, 1, 0, 1])
predicted_labels = np.array([0.9, 0.1, 0.8, 0.2, 0.7])

mse = mean_squared_error(actual_labels, predicted_labels)
print(f"Mean Squared Error: {mse:.3f}")

output:

Mean Squared Error: 0.038


Saturated Neurons:
Neuron Saturation: Neuron saturation refers to a scenario in which the
combination of inputs and parameters (weights and biases) of a neuron results
in extreme values of the weighted sum z (calculated as z = w · x + b). In the plot
referred to above, the areas encircled in red represent regions where z takes on
very high or very low values. In these regions, changes to the parameters w and
b have minimal impact on the neuron's activation a.
Tanh Activation Function: The discussion focuses on the tanh activation
function as an example of how neuron saturation occurs. Tanh, being an S-
shaped function, outputs values between -1 and 1. In this context, saturation
happens when z becomes extremely large or extremely negative.
Mathematically, the tanh activation function can be represented as:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Where:

 x represents the input value.

 e is a mathematical constant approximately equal to 2.71828.
Neural networks learn by adjusting the parameters (W and b) of their neurons
through methods like gradient descent and back propagation. However, when a
neuron is saturated, changes to W and b lead to very small changes in the
neuron's activation a. Consequently, the learning process slows down
significantly. This is because if adjustments to W and b have small impact on a,
they won't result in noticeable changes downstream (in terms of approximating
the target output y through forward propagation).
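A quick NumPy illustration of saturation: at an extreme z, the tanh activation is pinned near 1 and its gradient (1 − tanh(z)²) is nearly zero, so adjustments to w and b barely change a. The z values below are arbitrary.

import numpy as np

for z in [0.5, 10.0]:
    a = np.tanh(z)
    gradient = 1 - a**2                  # derivative of tanh with respect to z
    print(f"z = {z:4.1f}  activation = {a:.6f}  gradient = {gradient:.6f}")
# at z = 10 the activation is ~1.0 and the gradient is ~0, i.e., the neuron is saturated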

2.5.2 Cross-Entropy:

The Cross-Entropy Cost, also known as the Cross-Entropy Loss or Log Loss, is
a widely used loss function in neural networks, especially for classification
tasks. It's particularly effective when dealing with problems involving multiple
classes. The Cross-Entropy Cost measures the dissimilarity between the
predicted class probabilities and the true class probabilities. For a binary
classifier, the cross-entropy cost across n instances is given by Equation 8.3:

C = −(1/n) · Σ [yᵢ ln ŷᵢ + (1 − yᵢ) ln(1 − ŷᵢ)]

Equation 8.3 could be expressed with a substituted in for ŷ so that the equation
generalizes neatly beyond the output layer to neurons in any layer of a network:

C = −(1/n) · Σ [yᵢ ln aᵢ + (1 − yᵢ) ln(1 − aᵢ)]

We load this dependency with from numpy import log.
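A minimal sketch of the per-instance binary cross-entropy using numpy's log, evaluated for a few of the illustrative predictions discussed earlier (the true label is taken to be y = 1):

from numpy import log

def cross_entropy(y, y_hat):
    # per-instance binary cross-entropy cost
    return -1 * (y * log(y_hat) + (1 - y) * log(1 - y_hat))

for y_hat in [0.9997, 0.9, 0.6, 0.3]:
    print(f"y_hat = {y_hat}: cost = {cross_entropy(1, y_hat):.4f}")
# the cost grows rapidly as y_hat moves away from the true label of 1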


2.5.3 OPTIMIZATION: LEARNING TO MINIMIZE COST:
Reducing the cost in the context of deep learning involves employing a
combination of two techniques: gradient descent and backpropagation. These
methodologies, acting in concert, function as optimizers that drive the learning
process within the network.
2.5.3.1 Gradient Descent:
Gradient descent is a fundamental optimization algorithm used in machine
learning and neural networks to minimize a cost function. It's an iterative
process that adjusts the parameters of a model to find the values that result in
the lowest possible value of the cost function. In the context of neural networks,
gradient descent helps in training the network to improve its predictions.
Here's a step-by-step explanation of how gradient descent works:
Initialization: Start by initializing the model's parameters (weights and biases)
with some initial values. These parameters define how the model transforms
input data into predictions.
Forward Propagation: Perform a forward pass through the network using the
current parameter values. This involves computing the predicted output (ŷ) for
a given input (x).
Calculate Loss: Calculate the value of the cost function (also known as the
loss) using the predicted output (ŷ) and the actual target (y).
Backpropagation: This is where the gradients of the cost function with respect
to the model's parameters are computed. The gradients indicate the direction and
magnitude of change required in each parameter to reduce the cost.
Update Parameters: Adjust the parameters in the opposite direction of the
gradients by a certain factor (learning rate). This step aims to move the
parameter values closer to the values that minimize the cost function.
Iteration: Repeat steps 2-5 for a defined number of iterations or until the cost
reaches a satisfactory level. Each iteration brings the parameter values closer to
the optimal values that minimize the cost.
Gradient descent relies on the concept of the gradient, which is a vector
containing the partial derivatives of the cost function with respect to each
parameter. The gradient points in the direction of the steepest increase of the
cost function. By moving in the opposite direction (subtracting the gradient), the
algorithm seeks to minimize the cost function.
There are different variations of gradient descent, including batch gradient
descent (using the entire dataset for each iteration), stochastic gradient descent
(using one random sample at a time), and mini-batch gradient descent (using a
small subset of the data for each iteration). Each variant has its trade-offs in
terms of computational efficiency and convergence speed.
Gradient descent is a key component of training neural networks. By iteratively
adjusting the model's parameters, it enables the network to learn from the data
and improve its predictive accuracy over time.
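A bare-bones sketch of these steps for a one-parameter model ŷ = w · x with quadratic cost; the data, starting weight, learning rate, and iteration count are arbitrary illustrative choices.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])                # underlying relationship: y = 2x

w = 0.0                                           # step 1: initialize the parameter
learning_rate = 0.01

for iteration in range(200):
    y_hat = w * x                                 # step 2: forward propagation
    cost = np.mean((y - y_hat) ** 2)              # step 3: quadratic cost
    gradient = np.mean(-2 * x * (y - y_hat))      # step 4: dC/dw
    w = w - learning_rate * gradient              # step 5: update opposite to the gradient

print(w)                                          # approaches 2.0, the cost-minimizing weight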

2.5.3.2 Learning Rate:


The learning rate is a hyperparameter that plays a crucial role in the gradient
descent optimization algorithm and its variants. It determines the step size at
which the model's parameters are updated during each iteration of training. The
choice of an appropriate learning rate is essential, as it can significantly impact
the training process and the convergence of the optimization algorithm.

In gradient descent, when updating the parameters, the learning rate scales the
magnitude of the gradient vector. Here's how the update equation looks for a
single parameter w, with learning rate η:

w := w − η · ∂C/∂w

Choosing a learning rate that is too small might lead to slow convergence,
requiring many iterations for the algorithm to reach an optimal solution. On the
other hand, using a learning rate that is too large might result in overshooting
the optimal point and even divergence, where the algorithm fails to converge.

Here are some common scenarios regarding the learning rate:

Small Learning Rate: A small learning rate leads to slow but stable
convergence. It allows for fine-grained adjustments, which can be useful when
the cost function landscape is complex or has many local optima.

Moderate Learning Rate: This range often provides a good balance between
convergence speed and stability. It's a reasonable starting point when trying to
find an appropriate learning rate.

Large Learning Rate: A large learning rate can speed up convergence,


especially at the beginning of training. However, if the learning rate is too large,
it can lead to oscillations around the optimal point or even divergence.

Learning Rate Scheduling: Some techniques involve changing the learning


rate during training. For instance, learning rate schedules might start with a
larger rate and decrease it over time to aid convergence.

Adaptive Learning Rate: Adaptive methods, like Adagrad, RMSProp, and


Adam, automatically adjust the learning rate based on historical gradient
information for each parameter. These methods help balance convergence speed
and stability.

Selecting the right learning rate often involves experimentation and trial and
error. Techniques such as grid search or random search over a range of learning
rates can help find a suitable value. Additionally, modern optimization
algorithms (like Adam) often incorporate adaptive learning rate mechanisms
that mitigate the need for manual tuning.

2.5.3.3 Batch Size and Stochastic Gradient Descent:

Batch size:

Batch size is a hyperparameter in machine learning and deep learning that


defines the number of data samples (or examples) to be used in each iteration or
forward and backward pass during training. In other words, it specifies how
many training examples are processed together in one batch before the model's
parameters are updated.
The following figure captures how rounds of training are repeated until we run out of
training images to sample. The sampling in step one is done without replacement, meaning
that at the end of an epoch each image has been seen by the algorithm only once, yet between
different epochs the mini-batches are sampled randomly. With the 60,000 MNIST training
images and a batch size of 128, an epoch consists of 469 rounds in total: 468 full
mini-batches followed by a final batch containing only 96 samples.
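A small NumPy sketch of sampling mini-batches without replacement for one epoch, assuming the 60,000 training examples and batch size of 128 described above:

import numpy as np

n_examples, batch_size = 60_000, 128
indices = np.random.permutation(n_examples)        # shuffle once per epoch (no replacement)

batches = [indices[start:start + batch_size]
           for start in range(0, n_examples, batch_size)]

print(len(batches), len(batches[-1]))              # 469 96: the final batch holds only 96 samples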

Stochastic Gradient Descent (SGD):


Stochastic Gradient Descent (SGD) is an optimization algorithm commonly
used in machine learning and deep learning for training models, particularly in
large-scale and online learning scenarios. It is a variant of gradient descent that
updates model parameters iteratively by considering a random subset of training
data in each iteration.
Here are the key characteristics and steps of Stochastic Gradient Descent:
Random Sampling: In each iteration of SGD, a random mini-batch (subset) of
training examples is selected from the entire training dataset. This mini-batch
typically contains a small number of examples, such as 32, 64, or 128, but the
specific batch size is a hyperparameter that can be adjusted.
Gradient Computation: For the selected mini-batch, the gradient of the loss
function with respect to the model's parameters (weights and biases) is
computed. This gradient represents the direction and magnitude of the steepest
ascent in the loss function.
Parameter Update: The model's parameters are updated using the computed
gradient. The update is performed in the opposite direction of the gradient to
minimize the loss function. The learning rate hyperparameter controls the step
size of this update.
Iterative Process: Steps 1 to 3 are repeated for a fixed number of iterations
(epochs) or until convergence criteria are met. In each iteration, a different
random mini-batch is used.
Key points to understand about Stochastic Gradient Descent:
Stochastic Nature: The term "stochastic" in SGD refers to the random
sampling of mini-batches. This stochasticity introduces noise into the parameter
updates, which can help the algorithm escape local minima and explore
different regions of the loss surface.
Efficiency: SGD is computationally efficient because it processes only a small
subset of the data in each iteration, making it suitable for large datasets.
Variants: Several variants of SGD exist, including mini-batch SGD (which is
the most common), full-batch (batch) SGD (using the entire dataset in each
iteration), and online SGD (processing one training example at a time). These
variants offer trade-offs in terms of computational efficiency and convergence
behavior.
Learning Rate: The learning rate is a critical hyperparameter in SGD. It
determines the step size in parameter updates. A carefully chosen learning rate
can ensure stable convergence, while a poorly chosen one can lead to slow
convergence or divergence.
Convergence: SGD might not converge to the global minimum of the loss
function but rather oscillate around it. However, it often finds a good local
minimum in practice.
Regularization: The inherent noise introduced by mini-batch sampling in SGD
acts as a form of regularization, which can improve generalization and prevent
overfitting.
SGD is a fundamental optimization technique in machine learning and deep
learning, and it forms the basis for many advanced optimization algorithms used
to train neural networks, such as Adam, RMSprop, and Adagrad. It is
particularly well-suited for large-scale and online learning tasks.

BACKPROPAGATION:

While stochastic gradient descent operates well on its own to adjust parameters
and minimize cost in many types of machine learning models, for deep learning
models in particular there is an extra hurdle: We need to be able to efficiently
adjust parameters through multiple layers of artificial neurons. To do this,
stochastic gradient descent is partnered up with a method called
backpropagation. Backpropagation—or backprop for short—is an elegant
application of the “chain rule” from calculus. As shown along the bottom of
Figure and as suggested by its very name, backpropagation courses through a
neural network in the opposite direction of forward propagation.

While forward propagation carries information about the input x through


successive layers of neurons to approximate y with ŷ, backpropagation carries
information about the cost C backwards through the layers in reverse and, with
the overarching aim of reducing cost, adjusts neuron parameters throughout the
network. While the nitty-gritty of backpropagation has been relegated to an
appendix, it’s worth understanding (in broad strokes) what the backpropagation
algorithm does: As we’ve seen thus far, any given model is randomly initialized
with network parameters (w and b values). Thus, at the very beginning of
training when the first x value is fed in, the network essentially outputs a
completely random guess at ŷ. Of course, this won’t be a very good guess and
the associated cost of the random guess will be high. At this point, we need to
update the weights in order to minimize the cost—the very essence of learning
in neural networks. Backpropagation calculates the gradient of the cost function
with respect to each weight in the network. Recall from our mountaineering
analogies earlier that the cost function represents a hiking trail and our trilobite
is trying to reach basecamp. At each step along the way, the trilobite finds the
gradient (or the slope) of the cost function and moves down that gradient. That
movement that the trilobite just made is a weight update: By adjusting the
weight in proportion to the cost function’s gradient with respect to that weight,
we essentially adjust that weight in a way that reduces the cost! We know that
last sentence might be hard to digest at first, so hang with us. Using
backpropagation, we move layer by layer backwards through the network,
starting at the cost in the output layer, and we find the gradients of every single
parameter. We then use the product of the gradient of that parameter (i.e., the
relative amount which that parameter contributed to the total cost) and the
learning rate η to update the parameter. Also, you wouldn't be the first deep
learning practitioner who isn’t able to sketch out the specifics of
backpropagation on a whiteboard. So if there’s only one thing you take away
from this whole section, let it be this: Backpropagation uses the cost to calculate
the relative contribution by every single parameter to the total cost, and then
updates each parameter accordingly.
The first few stages of our Intermediate Net in Keras Jupyter notebook are
identical to its Shallow Net predecessor. We load the same Keras dependencies,
load the MNIST data set in the same way, and preprocess the data the same
way. The situation begins to get interesting at the cell where we design our
neural network architecture:
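Although that cell isn't reproduced here, a representative sketch of an intermediate-depth architecture looks like the following (the choice of two hidden layers of 64 relu neurons is an assumption for illustration; the notebook's exact sizes may differ):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(784,)))   # first hidden layer
model.add(Dense(64, activation='relu'))                        # second hidden layer
model.add(Dense(10, activation='softmax'))                     # output layer for the ten MNIST classes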
2.6 IMPROVING DEEP NETWORKS
2.6.1 WEIGHT INITIALIZATION
The parameters w and b are initialized with random values such that a network’s starting
approximation of y will be far off the mark, thereby leading to a high initial cost C. We
haven’t needed to dwell on this much because, in the background, Keras by default constructs
TensorFlow models that are initialized with sensible values for w and b.
While Keras does a sensible job of choosing default values—and that’s a key benefit of using
Keras in the first place—it’s certainly possible, and sometimes even necessary, to change
these defaults to suit your problem.

In this notebook, we're going to simulate 784 pixel values as inputs to a single dense
layer of artificial neurons. The inspiration behind our simulation of these 784 inputs comes,
of course, from our beloved MNIST digits. For the number of neurons in the dense layer, we
picked a number large enough so that, when we make some plots later on, they look pretty:
n_input = 784
n_dense = 256

We begin by creating a tensor for holding our 784 input values:
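The cell itself is not reproduced in this material; a minimal TensorFlow 2.x-style stand-in (the exact calls in the notebook may differ) is:

import numpy as np
import tensorflow as tf

x = tf.zeros([1, n_input])   # a 1 x 784 tensor standing in for one simulated MNIST image

The later sketches in this section continue from these imports and definitions.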

Now, for the primary point of this section: the initialization of the network parameters
w and b. Before we begin passing training data into our network, we’d like to start with
reasonably scaled parameters. This is because:
1. Large w and b values tend to correspond to larger z values, and therefore saturated
neurons.
2. Large parameter values would imply that the network has a strong opinion about how x is
related to y—before any training on data has occurred, any such strong opinions are wholly
unmerited.
Parameter values of zero, on the other hand, imply the weakest opinion on how x is related to
y. To bring back the fairy tale yet again, we're aiming for a Goldilocks-style, middle-of-the-road
approach that starts training off from a balanced and learnable beginning. With that in mind,
let's use the TensorFlow zeros() method to initialize the 256 neurons in our dense layer with
b = 0:
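Continuing the sketch above, that initialization plausibly looks like this:

b = tf.Variable(tf.zeros([n_dense]))   # one bias of 0 for each of the 256 neurons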
The vast majority of the parameters in a typical network are weights; relatively few
are biases. As such, it’s acceptable (indeed, it’s the most common practice) to initialize biases
with zeros and the weights with a range of values near zero. One straightforward way to
generate random values near zero is to use TensorFlow’s random_normal() method to sample
values from a normal distribution like so:
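In TensorFlow 2.x the equivalent call is tf.random.normal(); a hedged sketch, assuming the default standard deviation of 1.0:

W = tf.Variable(tf.random.normal([n_input, n_dense]))   # 784 x 256 weight matrix drawn from a normal distribution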

To observe the impact of the weight initialization we’ve chosen, we write some
code to represent our dense layer of neurons:
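A sketch of that dense-layer code, with a sigmoid activation assumed for the neurons:

z = tf.add(tf.matmul(x, W), b)   # z = w · x + b for all 256 neurons at once
a = tf.sigmoid(z)                # the 256 neurons' activations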

If you decompose the first line, you can see that it is our “most important equation”, z = w · x + b:

tf.matmul(x, W) uses the TensorFlow matrix multiplication operation to calculate the dot product w · x.

tf.add() adds b to that product, returning z.

In the following lines of code, we use the NumPy random() method to feed 784 random
numbers as inputs into our dense layer of 256 neurons, returning 256 activations a to a
variable named layer_output:
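In eager TensorFlow 2.x, that step could look like the following (the variable name layer_output matches the text; the notebook's own code may differ):

x = np.random.random([1, n_input]).astype('float32')     # 784 random input values
layer_output = tf.sigmoid(tf.add(tf.matmul(x, W), b))    # 256 activations a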
Xavier Glorot Distributions:
In deep-learning circles, popular distributions for sampling weight-initialization values were devised by
Xavier Glorot and Yoshua Bengio. These Glorot distributions, as they are typically called, are tailored
such that sampling from them will lead to neurons initially outputting small z values. Let's examine
them in action. By replacing the normal-distribution-sampling code shown above in our First
TensorFlow Neurons notebook with the following line, we sample from a Glorot distribution instead:
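A hedged equivalent using the Keras initializer API (the notebook's own one-liner may be written differently):

initializer = tf.keras.initializers.GlorotNormal()
W = tf.Variable(initializer([n_input, n_dense]))   # Glorot-scaled 784 x 256 weight matrix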

2.6.2 UNSTABLE GRADIENTS:


Another issue associated with artificial neural networks, and one that becomes especially
perturbing as we add more hidden layers, is unstable gradients. Unstable gradients can either
be vanishing or explosive in nature. We’ll cover both varieties in turn here.

Vanishing Gradients:
Recall that using the cost C between the network’s predicted ŷ and the true y,
backpropagation works its way from the output layer toward the input layer, adjusting
network parameters with the aim of minimizing cost. As exemplified by the mountaineering
trilobite ,the parameters are each adjusted in proportion to their gradient with respect to cost:
If, for example, the gradient of a parameter (with respect to the cost) was large and positive,
this implies that the parameter contributes a large amount to the cost and so decreasing it
proportionally would correspond to a decrease in cost. In the hidden layer that is closest to
the output layer, the relationship between its parameters and cost is the most direct. The
further away a hidden layer is from the output layer, the more muddled the relationship
between its parameters and cost becomes. The impact of this is that, as we move from the
final hidden layer toward the first hidden layer, the gradient of a given parameter relative to
cost tends to flatten—it gradually vanishes. Because of the vanishing gradient problem, if we
were to naïvely add more and more hidden layers to our neural network, eventually the
hidden layers furthest from the output would not be able to learn to any extent, crippling the
capacity of the network as a whole to learn to approximate y given x.

Exploding Gradients:

While they occur much less frequently than vanishing gradients, certain network architectures
can induce exploding gradients. In this case, the gradient of a given parameter relative
to cost becomes increasingly steep as we move from the final hidden layer toward the first
hidden layer. As with vanishing gradients, exploding gradients can inhibit an entire neural
network's capacity to learn by saturating the neurons with extreme values (recall that this was
a problem in our discussion of weight initialization).

Batch Normalization:

Batch norm takes the activations a output by the previous layer, subtracts
the batch mean, and divides by the batch standard deviation. This acts to recenter
the distribution of the a values to a mean of 0 and a standard deviation of 1
(Figure 9.4). Thus, if there are any extreme values in the previous layer, they
won't cause exploding or vanishing gradients in the next layer. Batch norm has
a few advantages: it makes training faster and more stable, it permits higher learning rates,
it reduces the network's sensitivity to weight initialization, and it adds a mild regularizing effect.
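As a rough numerical illustration of just the normalization step (the full batch-norm layer also learns a scale γ and shift β, omitted here), consider:

import numpy as np

a = np.array([[1.0, 50.0],
              [3.0, 70.0],
              [5.0, 90.0]])                      # activations for a batch of 3 examples, 2 neurons
batch_mean = a.mean(axis=0)                      # per-neuron mean over the batch
batch_std = a.std(axis=0)                        # per-neuron standard deviation over the batch
a_norm = (a - batch_mean) / (batch_std + 1e-5)   # recentred to mean 0, standard deviation 1
print(a_norm.mean(axis=0), a_norm.std(axis=0))   # approximately [0, 0] and [1, 1]

In Keras, this entire computation is wrapped up in the BatchNormalization layer used in the full example at the end of this unit.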
2.6.3 MODEL GENERALIZATION (AVOIDING OVERFITTING)

The situation where training cost continues to go down while validation cost goes up is formally known
as overfitting. Overfitting is nicely illustrated in the figure below. Notice we have the same
data points scattered along the x and y axes in each panel. We can imagine that there is some
distribution that describes these points, and here we have a sampling from that distribution.
Our goal is to generate a model that explains the relationship between x and y, but perhaps
most importantly one that also approximates the original distribution—in this way, the model
will be able to generalize to new data points drawn from the distribution and not just model
the sampling of points we already have. In the first panel (top left), we use a single-parameter
model, which is limited to fitting a straight line to the data. This straight line underfits the
data: The cost (represented by the vertical gaps between the line and the data points) is high,
and the model would not generalize well to new data points. Put simply, the line misses most
of the points because this kind of model is not complex enough. In the next panel (top right),
we use a model with two parameters, which fits a parabola-shaped curve to the data. With this
parabolic model, the cost is much lower relative to the linear model, and it appears the model
would also generalize well to new data—great!
In the third panel (bottom left) of Figure 9.5, we use a model with too many parameters—
more parameters than we have data points. With this approach we reduce the cost associated
with our training data points to nil: There is no perceptible gap between the curve and the
data. In the last panel (bottom right), however, we show new data points from the original
distribution in green, which were unseen by the model during training and so can be used to
validate the model. Despite eliminating training cost entirely, the model fits these validation
data poorly and so incurs a correspondingly sizeable validation cost. The many-parameter
model, dear friends, is overfit: It is a perfect model for the training data, but it doesn't
actually capture the true relationship between x and y. Rather, it has learned the exact
features of the training data too closely, and consequently it performs badly on unseen data.
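The same effect can be reproduced in a few lines with polynomial fitting (the data here are hypothetical; NumPy's polyfit stands in for the straight-line, parabolic, and many-parameter models):

import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 10)
y = x ** 2 + rng.normal(0, 0.05, size=10)       # noisy sample from a quadratic relationship

line = np.polyfit(x, y, 1)        # straight line: underfits
parabola = np.polyfit(x, y, 2)    # parabola: fits and generalizes well
wiggly = np.polyfit(x, y, 9)      # ten coefficients for ten points: training cost near zero
                                  # (NumPy may warn that this fit is poorly conditioned)

x_val = rng.uniform(-1, 1, 5)     # unseen points from the same distribution
y_val = x_val ** 2
print(np.mean((np.polyval(parabola, x_val) - y_val) ** 2))   # small validation cost
print(np.mean((np.polyval(wiggly, x_val) - y_val) ** 2))     # typically much larger: overfitting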

To reduce overfitting, we'll cover three of the best-known techniques below:

2.6.3.1. L1 and L2 Regularization:


L1 and L2 regularization are two common techniques used in deep learning and
machine learning to prevent overfitting and improve the generalization of neural
networks. They are also known as weight regularization or weight decay
techniques. Here's an overview of L1 and L2 regularization:

L1 Regularization (Lasso Regularization):

L1 regularization adds a penalty term to the loss function that encourages the
neural network to have small weights by adding the absolute values of the
weights to the loss.

The L1 regularization term can be represented as:

Loss_total = Loss_original + λ Σ|wᵢ|

where λ is the regularization parameter that controls the strength of the regularization, and
Σ|wᵢ| is the sum of the absolute values of all the model's weights.


L1 regularization tends to produce sparse weight vectors, meaning it encourages
some of the weights to become exactly zero. This can help with feature
selection and simplifying the model.

L2 Regularization (Ridge Regularization):

L2 regularization adds a penalty term to the loss function that encourages the
neural network to have small weights by adding the squared values of the
weights to the loss.

The L2 regularization term can be represented as:

Loss_total = Loss_original + λ Σwᵢ²

where λ is the regularization parameter that controls the strength of the regularization, and
Σwᵢ² is the sum of the squared values of all the model's weights.

L2 regularization tends to distribute the weight values more evenly across all
the features and does not encourage exact zero weights.
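In Keras, either penalty can be attached to a layer's weights through the kernel_regularizer argument; a minimal sketch (the penalty strength of 0.01 is an arbitrary assumption):

from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(784,),
                 kernel_regularizer=regularizers.l1(0.01)),   # L1 (lasso) penalty on this layer's weights
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(0.01)),   # L2 (ridge) penalty, i.e. weight decay
    layers.Dense(10, activation='softmax')
])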

2.6.3.2 Dropout:
Dropout is a regularization technique commonly used in deep learning neural networks to
prevent overfitting and improve the generalization of the model. The idea behind dropout is
relatively simple but highly effective. During training, dropout randomly deactivates (sets to
zero) a fraction of neurons in a neural network layer during each forward and backward pass.
This means that the network does not rely too heavily on any individual neuron and becomes
less sensitive to the specific training data. Dropout helps create a more robust and generalized
model.
Let’s cover each of the three training rounds in turn:
1. In the top panel, the second neuron of the first hidden layer and the first
neuron of the second hidden layer are randomly dropped out.
2. In the middle panel, it is the first neuron of the first hidden layer and the
second one of the second hidden layer that are selected for dropout. There is no
“memory” of which neurons have been dropped out on previous training
rounds, and so it is by chance alone that the neurons dropped out in the second
round are distinct from those dropped out in the first.
3. In the bottom panel, the third neuron of the first hidden layer is dropped out
for the first time. For the second consecutive round of training, the second
neuron of the second hidden layer is also randomly selected.
Beyond choosing the dropout rate, dropout does not require any additional hyperparameters
or manual feature engineering.

The dropout rate is a hyperparameter that controls the probability of a neuron
being dropped out. Common values range from 0.2 to 0.5, but the optimal rate
can vary depending on the specific problem and architecture.

You can apply dropout to multiple layers in a neural network, and you can
experiment with different dropout rates for different layers.

In popular deep learning frameworks like TensorFlow and PyTorch,
implementing dropout is straightforward. You can add dropout layers using
dedicated layers or functions, or you can specify dropout as a parameter when
defining a layer. Here's an example in Python using TensorFlow:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),   # dropout layer with a 50% dropout rate
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),   # dropout layer with a 30% dropout rate
    tf.keras.layers.Dense(10, activation='softmax')
])
2.6.3.3 Data augmentation:

Data augmentation is a technique widely used in deep learning and computer vision to
increase the diversity of training data without collecting additional labeled examples. It
involves applying various transformations and modifications to the existing dataset to create
new, slightly altered versions of the original data. Data augmentation serves several purposes:

Increase Dataset Size: One of the primary reasons for data augmentation is to artificially
expand the training dataset. In many deep learning tasks, having a larger dataset often leads
to improved model generalization and performance.

Regularization: Data augmentation acts as a form of regularization by introducing
randomness and variability into the training data. This helps prevent overfitting by making
the model more robust to different variations in the input data.

Invariance Learning: Data augmentation can help the model learn to be invariant to certain
transformations or changes in the input data, such as rotation, scaling, or cropping. This can
make the model more robust to real-world variations in the data.

Improved Generalization: By exposing the model to a wider range of data variations during
training, data augmentation can help the model generalize better to unseen data.

Common Data Augmentation Techniques:

Image Augmentation: In computer vision tasks, common data augmentation techniques include:

Rotation: Randomly rotate images by a certain degree.

Horizontal and Vertical Flipping: Flip images horizontally or vertically.

Translation: Shift the image horizontally and vertically.

Scaling: Resize the image by a random factor.

Shearing: Apply shearing transformations.

Color Jittering: Alter the brightness, contrast, and saturation of images.


Gaussian Noise: Add random noise to the image.

Text Augmentation: In natural language processing tasks, data augmentation can involve the
following (see the sketch after this list):

Synonym Replacement: Replace certain words with their synonyms.

Random Insertion: Insert random words into sentences.

Random Deletion: Delete random words from sentences.

Random Swap: Swap the positions of two words in a sentence.
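A toy sketch of random swap and random deletion in plain Python (illustrative only; production pipelines typically rely on dedicated augmentation libraries):

import random

def random_swap(words):
    # Swap the positions of two randomly chosen words
    i, j = random.sample(range(len(words)), 2)
    words = words.copy()
    words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.2):
    # Delete each word with probability p, keeping at least one word
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "deep learning models need lots of training data".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))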

Audio Augmentation: In speech recognition and audio processing tasks, data augmentation can include:

Time Stretching: Alter the speed or duration of audio samples.

Pitch Shifting: Change the pitch of audio samples.

Add Noise: Add random noise to audio samples.

Data Augmentation Libraries: Many deep learning frameworks and libraries provide built-in
support for data augmentation. For example, in computer vision tasks, libraries like
TensorFlow, Keras, and PyTorch offer image data augmentation functions and classes.

Here's a simplified example of image data augmentation using Keras:


from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create an ImageDataGenerator with the desired augmentations
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

# Generate augmented images from a directory of labeled images
augmented_images = datagen.flow_from_directory('data_directory', batch_size=32)


2.6.3.4 FANCY OPTIMIZERS
Fancy optimizers, often referred to as advanced or adaptive optimizers, are optimization
algorithms used in training machine learning and deep learning models. These optimizers go
beyond traditional gradient descent methods (such as vanilla gradient descent, stochastic
gradient descent, and mini-batch gradient descent) by incorporating additional techniques to
improve convergence speed, stability, and generalization. Some popular fancy optimizers
include:

Momentum Optimizer:

Momentum is an extension of the gradient descent algorithm that adds a velocity term to the
update step. This velocity term helps the optimizer overcome small gradients and converge
faster in the relevant direction.

It effectively reduces oscillations and helps the optimizer escape local minima.

Nesterov Accelerated Gradient (NAG):

Nesterov Momentum, also known as Nesterov Accelerated Gradient, is a modification of
momentum optimization that computes the gradient at a position slightly ahead in the
direction of the momentum. This can lead to faster convergence.

Adagrad (Adaptive Gradient Algorithm):

Adagrad adapts the learning rate for each parameter individually based on the historical
gradient information. Parameters that receive large gradients get a smaller learning rate, while
parameters with small gradients get a larger learning rate.

It is well-suited for sparse data problems but may suffer from a decreasing learning rate that
makes convergence slow in the long run.

RMSprop (Root Mean Square Propagation):

RMSprop is an extension of Adagrad that mitigates the decreasing learning rate problem. It
uses a moving average of squared gradients to scale the learning rates adaptively.

RMSprop helps maintain a more consistent learning rate and is widely used in training deep
neural networks.

Adadelta:

Adadelta is another adaptive learning rate method that improves upon Adagrad and RMSprop
by keeping a running average of both squared gradients and squared parameter updates.

It avoids the need for manually specifying a learning rate or a learning rate schedule.

Adam (Adaptive Moment Estimation):


Adam combines the benefits of momentum and RMSprop by maintaining both a moving
average of gradients and a moving average of squared gradients.

It has become one of the most popular optimizers for deep learning due to its good
convergence properties.

AdamW:

AdamW is a variation of the Adam optimizer that adds weight decay (L2 regularization) to
the parameter updates, which can help prevent overfitting.

L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno):

L-BFGS is a quasi-Newton optimization algorithm that can be used for optimizing both
convex and non-convex objective functions. It uses limited memory and approximates the
Hessian matrix.

While not as commonly used in deep learning as some other optimizers, it can be effective
for small to medium-sized networks with a limited number of parameters.
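In Keras, switching between these optimizers is simply a matter of passing a different optimizer object to model.compile(); a brief sketch (the learning rates and momentum value are illustrative, not prescriptive):

from tensorflow.keras import optimizers

sgd_momentum = optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)   # momentum / NAG
rmsprop = optimizers.RMSprop(learning_rate=0.001)
adam = optimizers.Adam(learning_rate=0.001)

# e.g. model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])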

A DEEP NEURAL NETWORK IN KERAS:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize pixel values to be between 0 and 1
train_images = train_images.astype('float32') / 255
test_images = test_images.astype('float32') / 255

# One-hot encode the labels
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

# Define the deep neural network model
model = models.Sequential()

# Input layer
model.add(layers.Flatten(input_shape=(28, 28)))

# Hidden layers with batch normalization and dropout
model.add(layers.Dense(512, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.5))  # 50% dropout
model.add(layers.Dense(256, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.5))

# Output layer
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(train_images, train_labels, epochs=20, batch_size=64,
                    validation_split=0.2, verbose=2)

# Evaluate the model on the test set
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=0)
print(f'Test accuracy: {test_acc * 100:.2f}%')
