DL_4
Unit: 4
Dr. Kumod Kumar Gupta, Ph.D. from Banasthali University, Jaipur. Presently,
he is working as an Assistant Professor at NIET, Greater Noida (INDIA). He has
more than 16 years of teaching & research experience. His research interests
include VLSI Design, Machine Learning and Artificial Neural Network. He
has published more than 17 papers in refereed international journals. He has
attended various training programs in the area of electronics & communication
engineering.
CO–PO Mapping:
CO.1 3 3 3 2 1 2 3
CO.2 3 3 3 2 1 2 3
CO.3 3 3 3 2 1 2 3
CO.4 3 3 3 2 1 2 3
CO.5 3 3 3 2 1 2 3

CO–PSO Mapping:
     PSO1 PSO2
CO.1  2    3
CO.2  2    3
CO.3  2    3
CO.4  2    3
CO.5  2    3
Unit 2 https://www.youtube.com/watch?v=C5R5SdYzQBI
Unit 3 https://hevodata.com/learn/data-engineering-and-data-engineers/
Unit 4 https://www.youtube.com/watch?v=IjEZmH7byZQ
Unit 5 https://www.youtube.com/watch?v=pWp3PhYI-OU
Learn what data analytics is and its various applications. Study the different
types of data analytics and the steps of the analytics process, and perform data
analytics using Python’s NumPy, Pandas, and Matplotlib libraries.
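As a minimal, illustrative sketch of this workflow (the file name sales.csv and its columns are hypothetical placeholders, not part of the course material):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load a dataset into a DataFrame and inspect it
df = pd.read_csv("sales.csv")        # hypothetical file
print(df.describe())                 # summary statistics (count, mean, std, ...)

# Simple NumPy computation on one column
revenue = df["revenue"].to_numpy()   # hypothetical column
print("Mean revenue:", np.mean(revenue))

# Visualise the column with Matplotlib
plt.hist(revenue, bins=20)
plt.xlabel("Revenue")
plt.ylabel("Frequency")
plt.title("Revenue distribution")
plt.show()
```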
Several neural networks can help solve different business problems. Let’s
look at a few of them.
• Feed-Forward Neural Network: used for general regression and classification problems.
• Convolutional Neural Network: used for object detection and image classification.
• Deep Belief Network: used in healthcare sectors, e.g. for cancer detection.
• RNN (Recurrent Neural Network): used for speech recognition, voice recognition, time-series prediction, and natural language processing.
In a recurrent neural network, the nodes in the different layers of the network
are compressed to form a single recurrent layer. In the usual unrolled diagram,
A, B, and C denote the shared parameters of the network.
• The input layer ‘x’ takes the input to the neural network, processes it, and
passes it on to the middle layer.
• The middle layer ‘h’ can consist of multiple hidden layers, each with its own
activation functions, weights, and biases. In a plain feed-forward network these
hidden layers are independent of one another, so the network has no memory;
when the task requires memory of previous inputs, a recurrent neural network
can be used instead.
• The recurrent neural network standardizes the activation functions, weights,
and biases so that each hidden layer has the same parameters. Then, instead of
creating multiple hidden layers, it creates one and loops over it as many times
as required (see the sketch below).
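To make the shared-parameter loop concrete, here is a minimal NumPy sketch of a vanilla RNN step applied repeatedly over a sequence; the dimensions and random initialisation are illustrative assumptions, not values from the notes.

```python
import numpy as np

# Minimal vanilla RNN: one hidden layer with shared weights, looped over time.
input_size, hidden_size, seq_len = 8, 16, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the "loop")
b_h  = np.zeros(hidden_size)

x_seq = rng.normal(size=(seq_len, input_size))  # dummy input sequence
h = np.zeros(hidden_size)                       # initial hidden state

for x_t in x_seq:
    # The same weights are reused at every time step; h carries the "memory".
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)

print("final hidden state:", h[:4], "...")
```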
Machine Translation
Given an input in one language, RNNs can be used to translate the input into
different languages as output.
Now, let’s discuss the most popular and efficient way to deal with gradient
problems, i.e., Long Short-Term Memory Network (LSTMs).
First, let’s understand Long-Term Dependencies.
Suppose you want to predict the last word in the text: “The clouds are in the
______.”
The most obvious answer to this is the “sky.” We do not need any further
context to predict the last word in the above sentence.
Consider this sentence: “I have been staying in Spain for the last 10 years…I
can speak fluent ______.”
The word you predict will depend on the previous few words in context. Here,
you need the context of Spain to predict the last word in the text, and the most
suitable answer to this sentence is “Spanish.” The gap between the relevant
information and the point where it's needed may have become very large.
LSTMs help you solve this problem.
• LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) that is
specifically designed to address the vanishing gradient problem.
• The vanishing gradient problem occurs in RNNs when the gradients of the loss
function with respect to the model parameters become very small or even zero, as the
network iterates over long sequences of data. This can make it difficult for the network
to learn long-term dependencies between inputs and outputs.
• LSTMs solve the vanishing gradient problem by using three gates to control the flow
of information through the network:
• The forget gate determines how much of the previous cell state is forgotten at each
time step.
•The input gate determines how much of the current input is added to the cell state.
•The output gate determines how much of the cell state is output to the next time step.
The forget gate is particularly important for preventing the vanishing gradient problem.
It allows the network to selectively discard information from the previous cell state
while retaining the parts that matter for long-term dependencies. This lets the LSTM
learn long-term dependencies without the gradients becoming too small.
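For reference, the standard LSTM update equations are sketched below. The notation is an assumption on our part (the notes do not fix one); σ denotes the sigmoid function and ⊙ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c\,[h_{t-1}, x_t] + b_c) && \text{candidate cell state} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{new cell state} \\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state / output}
\end{aligned}
```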
• The forget, input, and output gates all use a sigmoid activation function.
• Sigmoid activation functions have a range of [0, 1], which means that they can be
used to control the flow of information in a more gradual way than other activation
functions, such as the rectified linear unit (ReLU) activation function.
• This helps to prevent the gradients from becoming too large or too small, which can
also contribute to the vanishing gradient problem.
• As a result of these design features, LSTMs are able to learn long-term dependencies
much more effectively than RNNs without these features.
• This makes them well-suited for tasks that require the network to remember
information over long periods of time, such as machine translation, speech
recognition, and natural language processing.
• However, it is important to note that LSTMs do not completely solve the
vanishing gradient problem. In some cases the gradients in an LSTM can still
become very small, which makes learning difficult. Even so, LSTMs are much more
resistant to the vanishing gradient problem than plain RNNs, and the remaining
difficulty can be further addressed with GRUs (discussed later).
https://towardsdatascience.com/understanding-rnns-lstms-and-grus-ed62eb584d90
[Figure: LSTM cell showing the forget gate, input gate, and output gate]
LSTMs also have a chain-like structure, but the repeating module has a
different structure: instead of a single neural network layer, there are four
layers interacting in a special way.
Forget Gate
• The information that is no longer useful in the cell state is removed with the
forget gate.
• Two inputs x_t (input at the particular time) and h_t-1 (previous cell output) are
fed to the gate and multiplied with weight matrices followed by the addition of
bias.
• The result is passed through a sigmoid activation function, which gives an
output between 0 and 1. If the output for a particular element of the cell state
is (close to) 0, that piece of information is forgotten; if it is (close to) 1,
the information is retained for future use.
Step 2: Decide How Much This Unit Adds to the Current State
The second layer has two parts. The sigmoid part decides which values to let
through (values between 0 and 1), and the tanh part assigns a weight to the
values that are passed, deciding their level of importance (between -1 and 1).
With the current input at x(t), the input gate analyzes the important
information — John plays football, and the fact that he was the captain of his
college team is important.
“He told me yesterday over the phone” is less important; hence it's forgotten.
This process of adding some new information can be done via the input gate.
Step 3: Decide What Part of the Current Cell State Makes It to the
Output
The third step is to decide what the output will be. First, we run a sigmoid
layer, which decides what parts of the cell state make it to the output. Then,
we put the cell state through tanh to push the values to be between -1 and 1
and multiply it by the output of the sigmoid gate.
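A minimal NumPy sketch of a single LSTM step, following the three steps above; all dimensions and the random initialisation are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 16, 8
rng = np.random.default_rng(0)

W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
                      for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(hidden_size) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)            # step 1: forget gate
    i = sigmoid(W_i @ z + b_i)            # step 2: input gate (sigmoid part)
    c_tilde = np.tanh(W_c @ z + b_c)      # step 2: candidate values (tanh part)
    c = f * c_prev + i * c_tilde          # updated cell state
    o = sigmoid(W_o @ z + b_o)            # step 3: output gate
    h = o * np.tanh(c)                    # hidden state / output for this step
    return h, c

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
x_t = rng.normal(size=input_size)
h, c = lstm_step(x_t, h, c)
print(h.shape, c.shape)                   # (16,) (16,)
```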
Let’s consider this example to predict the next word in the sentence: “John
played tremendously well against the opponent and won for his team. For
his contributions, brave ____ was awarded player of the match.”
There could be many choices for the empty space. The current input brave
is an adjective, and adjectives describe a noun. So, “John” could be the best
output after brave.
• Recurrent neural networks are a powerful and widely used class of neural
network architectures for modeling sequence data.
• The basic idea behind RNN models is that each new element in the
sequence contributes some new information, which updates the
current state of the model.
• RNN models are also based on this notion of chain structure, and vary in
how exactly they maintain and update information. As their name
implies, recurrent neural nets apply some form of “loop.”
• The new state is computed as h_t = tanh(W_x·x_t + W_h·h_{t-1} + b), where tanh(·)
is the hyperbolic tangent function, which has its range in [–1, 1] and is strongly
connected to the sigmoid function, and x_t and h_t are the input and state vectors
as defined previously.
• Finally, the hidden state vector is multiplied by another set of weights,
yielding the outputs of the network.
Vanishing Gradient
In a sequence such as “What time is it?”, you can see that by the fifth output
the information from “What” and “time” has all but disappeared. How well do you
think the network can predict what comes after “is” and “it” without it?
In this kind of problem, suppose we want to determine the gender of the speaker
in a new sentence. We would have to selectively forget certain things about the
previous states (for example, who Bob is and whether he likes apples) and
remember other things, such as that Alice is a woman and that she likes oranges.
• GRU (Gated Recurrent Unit) is a type of RNN that is specifically designed to address the
vanishing gradient problem.
• GRUs are similar to LSTMs in that they use gates to control the flow of information
through the network. However, GRUs do not have an output gate, which makes them
simpler and more efficient than LSTMs.
• GRUs solve the vanishing gradient problem by using a reset gate and an update gate.
The reset gate determines how much of the previous hidden state is reset at each time
step. The update gate determines how much of the current input is added to the hidden state.
• The reset gate is particularly important for preventing the vanishing gradient problem.
It can selectively reset information from the previous hidden state while retaining the
parts that matter for long-term dependencies. This allows the GRU to learn long-term
dependencies without the gradients becoming too small.
• In addition to the reset and update gates, which use sigmoid activations, GRUs use
a tanh activation function for the candidate hidden state. Tanh activation functions
have a range of [-1, 1], so the candidate values stay in a bounded range and the flow
of information can be controlled in a gradual way. This helps to prevent the gradients
from becoming too large or too small, which can also contribute to the vanishing
gradient problem. A sketch of a single GRU step is given below.
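A minimal NumPy sketch of one GRU step in the standard formulation (reset and update gates use sigmoid; the candidate hidden state uses tanh); dimensions and initialisation are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 16, 8
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
                 for _ in range(3))

def gru_step(x_t, h_prev):
    z_in = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ z_in)                                      # update gate
    r = sigmoid(W_r @ z_in)                                      # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                        # new hidden state

h = np.zeros(hidden_size)
h = gru_step(rng.normal(size=input_size), h)
print(h[:4])
```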
https://towardsdatascience.com/transformers-141e32e69591
https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
• In this example, the word “the band” in the second sentence refers to the band
“The Transformers” introduced in the first sentence.
• When you read about the band in the second sentence, you know that it is
referring to the “The Transformers” band. That may be important for
translation.
• There are many examples where words in one sentence refer to words in
previous sentences.
• To translate sentences like that, a model needs to figure out this sort of
dependency and connection.
• Recurrent Neural Networks (RNNs) and Convolutional Neural Networks
(CNNs) have been used to deal with this problem because of their properties.
• Let’s go over these two architectures and their drawbacks.
• In cases where the gap between the relevant information and the place where
it is needed is small, RNNs can learn to use past information and figure out
the next word for the sentence.
• But there are cases where we need more context. For example, let’s
say that you are trying to predict the last word of the text: “I grew up
in France… I speak fluent …”.
• The recent information suggests that the next word is probably a
language, but if we want to narrow down which language, we need the
context of France, which appears further back in the text.
• RNNs become very ineffective when the gap between the relevant information
and the point where it is needed becomes very large. This is because the
information is passed along at each step, and the longer the chain is, the more
likely it is that the information is lost along the way.
In theory, RNNs could learn these long-term dependencies. In practice, they do
not seem to learn them. LSTM, a special type of RNN, tries to solve this kind
of problem.
Transformers
• To address the lack of parallelization in recurrent models, Transformers use
encoders and decoders together with attention mechanisms.
• Attention boosts the speed at which the model can translate from one
sequence to another.
• Let’s take a look at how the Transformer works. The Transformer is a model
that uses attention to boost this speed; more specifically, it uses self-attention.
The Transformer.
• The encoders are all very similar to each other; all encoders have the same
architecture. The decoders share the same property, i.e. they are also very
similar to each other.
• Each encoder consists of two layers: a self-attention layer and a feed-forward
neural network.
Self-Attention
Let’s start to look at the various vectors/tensors and how they flow between
these components to turn the input of a trained model into an output. As is the
case in NLP applications in general, we begin by turning each input word into a
vector using an embedding algorithm.
Each word is embedded into a vector of size 512. We’ll represent those vectors
with these simple boxes.
The embedding only happens in the bottom-most encoder. The abstraction that
is common to all the encoders is that they receive a list of vectors each of the
size 512.
Next, we’ll switch up the example to a shorter sentence and we’ll look at
what happens in each sub-layer of the encoder.
Self-Attention
Let’s first look at how to calculate self-attention using vectors, then
proceed to look at how it’s actually implemented — using matrices.
• The first step in calculating self-attention is to create three vectors from each of
the encoder’s input vectors (in this case, the embedding of each word). So for
each word, we create a Query vector, a Key vector, and a Value vector. These
vectors are created by multiplying the embedding by three matrices that we
trained during the training process.
• Notice that these new vectors are smaller in dimension than the embedding vector.
Their dimensionality is 64, while the embedding and encoder input/output vectors
have dimensionality of 512. They don’t HAVE to be smaller, this is an
architecture choice to make the computation of multiheaded attention (mostly)
constant.
The second step is to calculate a score for each word by taking the dot product of
its query vector with the key vector of every word in the sentence. The third and
fourth steps are to divide the scores by 8 (the square root of the dimension of the
key vectors used in the paper, 64, which leads to more stable gradients; other values
are possible, but this is the default) and then pass the result through a softmax
operation. Softmax normalizes the scores so they are all positive and add up to 1.
These steps are sketched in code below.
https://towardsdatascience.com/transformers-141e32e69591
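A minimal NumPy sketch of single-head self-attention for a toy three-word sentence. The dimensions d_model = 512 and d_k = 64 come from the text; the input embeddings and projection matrices are randomly initialised here purely for illustration, and the final weighted sum of value vectors follows the remaining step described in the referenced article.

```python
import numpy as np

d_model, d_k, n_words = 512, 64, 3
rng = np.random.default_rng(0)

X   = rng.normal(size=(n_words, d_model))    # one embedding (row) per word
W_Q = rng.normal(scale=0.05, size=(d_model, d_k))
W_K = rng.normal(scale=0.05, size=(d_model, d_k))
W_V = rng.normal(scale=0.05, size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # step 1: query, key, value vectors

scores = Q @ K.T                             # step 2: dot-product scores
scores = scores / np.sqrt(d_k)               # step 3: divide by sqrt(64) = 8

# step 4: softmax over each row, so the weights are positive and sum to 1
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

Z = weights @ V                              # weighted sum of the value vectors
print(Z.shape)                               # (3, 64): one output vector per word
```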
Testing of hypothesis
https://nptel.ac.in/content/storage2/courses/103106120/LectureNotes/Lec3_1.pdf
Links:
http://www.numpy.org/
https://www.scipy.org/scipylib/
http://pandas.pydata.org/
5. The expression a{5} will match _____________ characters with the previous
regular expression.
A. 5 or less
B. exactly 5
C. 5 or more
D. exactly 4
7. Which of the following statements about the P value do you believe to be true?
a) The P value is the probability that the null hypothesis is true.
b) The P value is the probability that the alternative hypothesis is true.
c) The P value is the probability of obtaining the observed or more extreme results if the alternative hypothesis is true.
d) The P value is the probability of obtaining the observed results, or results which are more extreme, if the null hypothesis is true.
e) The P value is always less than 0.05.
Text Books:
(1) Glenn J. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining, John Wiley Publishers, 2007.
(2) Tom Hope, Yehezkel S. Resheff, Itay Lieder, Learning TensorFlow, O'Reilly Media, Inc.
(3) Advanced Deep Learning with TensorFlow 2 and Keras: Apply DL, GANs, VAEs, Deep RL, Unsupervised Learning, Object Detection and Segmentation, and More, 2nd Edition.
Reference Books:
(4) Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, “Professional Hadoop Solutions”, 1st Edition, Wrox, 2013.
(5) Chris Eaton, Dirk Deroos et al., “Understanding Big Data”, Indian Edition, McGraw Hill, 2015.
(6) Tom White, “Hadoop: The Definitive Guide”, 3rd Edition, O'Reilly, 2012.